Product Updates 2 min read
New · Argument mining benchmark

Find the reasons behind every position.

Outcome reports need more than support and opposition counts. They need the reasons people gave, the evidence behind each reason, and defensible prevalence numbers leaders can stand behind.

Why topic tags are not enough

Keyword shortcut

Counts words, not arguments

If a resident quotes a parking objection only to reject it, a search workflow still counts parking as the reason.

The wrong reason lands in the report because the word appeared, not because the resident advanced it.

Communiti argument mining

Extracts reasons residents advance

Reasons are recorded only when the resident is actually making that argument, and every reason is grounded in the original words.

The report count follows meaning, not just matching terms.

What shipped

Reasons now become report-ready evidence, not loose theme labels.

Reason extraction

96.0%

Strict reason F1 on audited feedback where the argument label and source evidence both had to match.

Grounding

100%

Every extracted reason is tied back to source text so a reviewer can check the claim.

Report counts

0.32 pts

Mean prevalence error across canonical arguments, which is the number that matters in outcome reports.

Tested before release

Benchmarked on the cases that break simple topic tagging.

The benchmark covers audited reason units, refuted arguments, implied reasons, bare stances, rambling multi-reason comments, emotional language, and multilingual feedback.

96.0%
strict reason F1 on the audited benchmark
100%
of extracted reasons grounded in source text
0.32 pts
mean prevalence error across canonical arguments
95.9%
F1 when the argument taxonomy was induced from the consultation text

Watch the failure mode

A keyword is not a reason. A quoted objection is not always an objection.

Outcome reports do not just need support and opposition counts. They need the reasons behind those positions: safety, access, cost, disruption, amenity, housing need. If the reasons are wrong, the response can be wrong even when the headline count is right. This benchmark tests the cases that break simple topic tagging: arguments quoted only to be rejected, implied reasons, stance-only comments with no reason, rambling multi-reason responses, emotional language, and multilingual feedback.

Synthetic benchmark examples

Two comments that look easy until you count them.

Search-style tagging can see words, but it cannot tell whether the resident is advancing the argument, rejecting it, or implying it without naming the topic.

Example 1

Quoted to reject

Keyword present, wrong argument
Resident comment: The traders keep saying the parking removal will kill the strip. It won't - the Summer Hill data showed turnover flat to up - and meanwhile riders keep ending up in emergency. Build it for the safety alone.

Search shortcut

Counts parking loss

The report overstates parking opposition and misses the reason the resident actually gave.

Communiti records

Rider safety

Evidence: “riders keep ending up in emergency”

Example 2

Implied, not named

No obvious topic word
Resident comment: My husband has two crook knees and a walking frame. Tell me how he gets from a side street to the physio at number 214. That's my whole submission.

Search shortcut

No parking reason found

The accessibility reason disappears because the resident never uses the word “parking”.

Communiti records

Parking and access loss

Evidence: “My husband has two crook knees and a walking frame”
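To make the failure mode concrete, here is a minimal sketch of a search-style shortcut run over the two example comments above. The keyword list and the `parking_loss` key are hypothetical stand-ins for whatever search terms a team might use; this is not Communiti's method, only an illustration of why word matching miscounts both comments.

```python
import re

# Hypothetical keyword list standing in for a search-style topic shortcut.
TOPIC_KEYWORDS = {
    "parking_loss": ["parking", "car park"],
}

def keyword_topics(comment: str) -> set[str]:
    """Return every topic whose keyword appears anywhere in the comment."""
    text = comment.lower()
    return {
        topic
        for topic, words in TOPIC_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(w) + r"\b", text) for w in words)
    }

EXAMPLE_1 = (
    "The traders keep saying the parking removal will kill the strip. "
    "It won't - the Summer Hill data showed turnover flat to up - and "
    "meanwhile riders keep ending up in emergency. Build it for the safety alone."
)
EXAMPLE_2 = (
    "My husband has two crook knees and a walking frame. Tell me how he gets "
    "from a side street to the physio at number 214. That's my whole submission."
)

# Example 1 quotes the parking objection only to reject it, yet the word matches.
print(keyword_topics(EXAMPLE_1))  # {'parking_loss'}
# Example 2 implies an access concern but contains no listed keyword.
print(keyword_topics(EXAMPLE_2))  # set()
```

The shortcut has no way to tell whether the resident advances, rejects, or implies the argument, so it is wrong in opposite directions on the two comments.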

The benchmark includes refuted arguments, implied reasons, bare stances, rambling multi-reason comments, emotional language, and untranslated multilingual responses. Communiti cleared every publication gate in the page-grade suite.

Results

The gap is whether the report can say why and prove it.

96.0% strict reason F1

73 responses, 96 audited reason units, 36 canonical arguments

The gap is not just accuracy. It is whether a report can say why people objected or supported without a reviewer re-reading every row.

100% on implied reasons

The reason is present, but the resident never uses the official topic words

Residents rarely write in report headings. Communiti still records the actual argument with the words that prove it.

95.9% full-workflow F1

Argument taxonomy induced from the consultation text, then used to classify responses

For a new consultation, the system can discover the argument structure first, then measure it with grounded evidence.

Accuracy held on the cases that make reason coding hard

Strict F1 by page-grade proof point

The important result is not that clean comments work. It is that the system keeps the reason attached to the resident's actual meaning when wording, language, and context make the task difficult.

Why it matters

If the reasons are wrong, the response can be wrong even when the headline count is right.

For community members

The reason they gave is counted

Implied access, safety, cost, and disruption concerns do not disappear because a resident used different words.

For ELT and stakeholders

Reports can explain why, not only how many

Leaders can see which reasons drove support or opposition, with the evidence needed to defend a response.

For analysts

Keyword noise is separated from meaning

Quoted objections, refuted claims, and bare stances are handled without inflating reason counts.

At a glance

What changes when reasons are measured directly

Comparison of current analysis shortcuts and Communiti reason mining on the failure modes that matter for consultation reporting.
| Capability | Manual review (Analyst + spreadsheet) | Search/topic tagging (Words counted as reasons) | Communiti Intelligence (Reason mining) |
| --- | --- | --- | --- |
| Quoted objections | Caught. A careful reader can see the resident is rejecting the quoted argument. | Miscounted. The keyword is present, so the wrong reason is counted. | Separated. Only reasons the resident advances are recorded. |
| Implied reasons | Possible. Accurate when reviewers have enough time and apply the same rules. | Missed. No topic word means no reason to count. | Captured. The reason is inferred from the resident's wording and grounded in a quote. |
| Bare stance | Clean. A reviewer can leave the reason field empty. | Noisy. May invent a reason from nearby words or spreadsheet context. | Empty. Zero false reasons on stance-only responses. |
| Prevalence counts | Slow. Counts are defensible only after every row is read and reconciled. | Skewed. Refuted and implied reasons distort the totals. | Report-ready. Mean prevalence error was 0.32 percentage points across 36 arguments. |
| New consultation | Workshop. Analysts first need to discover and settle the reason taxonomy. | Ad hoc. Topic lists drift as reviewers add search terms. | Induced. The full workflow induced the taxonomy and still scored 95.9% F1. |
| Audit trail | By hand. Quotes and coding notes have to be maintained separately. | Thin. A count with little evidence for why each row was included. | Built in. Every reason is tied to the source words reviewers can check. |

For technical reviewers

The scores behind the release.

The full scorecard

Page-grade suite on a synthetic argument-mining corpus: 73 entries, 96 audited reason units, six proposals, and 36 canonical arguments.

| Metric | Communiti | Pass line or search shortcut |
| --- | --- | --- |
| Strict reason F1 | 96.0% | 56.7% (search shortcut) |
| Precision / recall | 92.3% / 100% | 54.3% / 59.4% (search shortcut) |
| Evidence grounding | 100% | 95.0% (pass line) |
| Report prevalence error (measured across all 36 canonical arguments) | 0.32 pts mean / 1.84 pts max | 2.0 pts mean / 5.0 pts max (pass line) |
| Implied-reason F1 | 100% | 37.5% (search shortcut) |
| Refuted-argument F1 | 92.3% | 42.1% (search shortcut) |
| Multilingual reason F1 | 96.3% | 70.0% (pass line) |
| Open-taxonomy workflow (taxonomy induced from consultation text, then scored against audited gold arguments) | 95.9% F1 / 91.7% taxonomy coverage | 80.0% F1 / 85.0% coverage (pass line) |
| Stance-composed prevalence | 0.27 pts mean error | 5.0 pts (pass line) |

Headline percentages are rounded for readability. The benchmark pack includes the synthetic corpus, gold decisions, raw outputs, prompt packs, scoring code, charts, and cached verification path.
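As a quick consistency check, the headline F1 figures can be recomputed from the precision and recall values in the scorecard, assuming F1 here is the standard harmonic mean of the two. The inputs are the rounded percentages published on this page, so the results only agree to the displayed precision.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision / recall values taken from the scorecard.
communiti = f1(0.923, 1.000)
shortcut = f1(0.543, 0.594)

print(f"{communiti:.1%}")  # 96.0%
print(f"{shortcut:.1%}")   # 56.7%
```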

Methodology

How we measured

Test corpus

Synthetic consultation feedback only, with no resident data, spanning 73 entries, 96 audited reason units, six realistic proposals, 36 canonical arguments, and untranslated multilingual responses.

Audited responses
73
Held-out and development entries covering refuted, implied, bare-stance, rambling, emotional, and multilingual cases
Gold reason units
96
Verbatim spans annotated as reasons the resident advances, not merely mentions
Canonical arguments
36
Six reporting arguments per proposal, used to score prevalence and open-taxonomy coverage
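One plausible reading of the prevalence error metric is sketched below. It assumes that an argument's prevalence is the percentage of responses advancing it, and that the error is the absolute gap between predicted and gold prevalence in percentage points, summarised as a mean and a max across canonical arguments. The function names and the toy data are hypothetical; the scoring code in the benchmark pack is the authoritative definition.

```python
from collections import Counter

def prevalence(labels_per_response: list[set[str]], arguments: list[str]) -> dict[str, float]:
    """Percentage of responses that advance each argument."""
    n = len(labels_per_response)
    if n == 0:
        return {arg: 0.0 for arg in arguments}
    # Each response holds a set, so an argument counts once per response.
    counts = Counter(arg for labels in labels_per_response for arg in labels)
    return {arg: 100.0 * counts[arg] / n for arg in arguments}

def prevalence_error(gold, predicted, arguments):
    """Mean and max absolute gap, in percentage points, across arguments."""
    g = prevalence(gold, arguments)
    p = prevalence(predicted, arguments)
    gaps = [abs(g[a] - p[a]) for a in arguments]
    return sum(gaps) / len(gaps), max(gaps)

# Toy data invented for illustration only (not taken from the benchmark).
ARGS = ["safety", "parking", "cost"]
gold = [{"safety"}, {"parking"}, {"safety", "cost"}, set()]
pred = [{"safety"}, {"parking"}, {"safety"}, set()]

mean_err, max_err = prevalence_error(gold, pred, ARGS)
print(f"{mean_err:.2f} pts mean, {max_err:.2f} pts max")  # 8.33 pts mean, 25.00 pts max
```

In the toy data a single missed "cost" reason moves that argument's prevalence from 25% to 0%, which is why the max gap is reported alongside the mean.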

Processed in Australia

Analysis runs on AWS in Sydney and Melbourne using Australia-geographic infrastructure. Feedback is not processed offshore.

Never used to train AI

Your community's feedback is not used to train any AI model, and the model provider has no access to it. AWS guarantees this contractually.

Evidence on request

The benchmark pack includes synthetic data, gold decisions, scoring code, raw outputs, charts, and methodology notes for technical review.

Every reason traceable

Reason labels are not loose summaries. Each reason links back to source words in the original response so reviewers can check the evidence.

The fine print we think you should read

  1. Test data. The benchmark uses synthetic consultation feedback written for testing. No resident data was used. The corpus contains 73 entries and 96 audited reason units across six proposal scenarios.
  2. Strict reason F1. A prediction only receives strict credit when the argument key is correct and the quoted evidence grounds to the original response.
  3. Current-workflow baseline. The search-style topic shortcut is included because keyword searches, spreadsheet filters, and lightweight topic tagging are common ways teams approximate reason coding today. It is not a product comparison.
  4. Open taxonomy. The open-taxonomy workflow induces the argument structure from the consultation text before scoring responses against audited gold arguments. This measures the full workflow for a new consultation, not only classification against a supplied list.
  5. Reproducibility. Every number on this page traces to a dated evidence snapshot, raw outputs, scoring code, prompt packs, and a cached verification path. Ask us for the benchmark pack.
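The strict-credit rule in item 2 can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's scoring code: grounding is approximated as "the quote appears verbatim in the response after case and whitespace normalisation", and the `Reason` type, the argument keys, and the function names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reason:
    response_id: str
    argument_key: str
    evidence: str  # quoted span offered as proof

def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())

def is_grounded(evidence: str, response_text: str) -> bool:
    """Assumed grounding rule: the quote appears verbatim in the response."""
    quote = _norm(evidence)
    return bool(quote) and quote in _norm(response_text)

def strict_scores(predicted, gold_keys, responses):
    """Credit a prediction only if its key is in gold AND its quote is grounded.

    gold_keys is a set of (response_id, argument_key) pairs the resident advances.
    Assumes predicted contains no duplicate (response_id, argument_key) pairs.
    """
    credited = {
        (p.response_id, p.argument_key)
        for p in predicted
        if (p.response_id, p.argument_key) in gold_keys
        and is_grounded(p.evidence, responses[p.response_id])
    }
    tp = len(credited)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold_keys) if gold_keys else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy case built from Example 1 on this page; the keys are hypothetical.
responses = {
    "r1": "The traders keep saying the parking removal will kill the strip. "
          "It won't - the Summer Hill data showed turnover flat to up - and "
          "meanwhile riders keep ending up in emergency. Build it for the safety alone."
}
gold = {("r1", "rider_safety")}
predicted = [
    Reason("r1", "rider_safety", "riders keep ending up in emergency"),
    Reason("r1", "parking_loss", "parking removal will kill the strip"),
]

precision, recall, f1 = strict_scores(predicted, gold, responses)
print(f"precision {precision:.2f}, recall {recall:.2f}, F1 {f1:.2f}")
# precision 0.50, recall 1.00, F1 0.67
```

The second prediction is grounded in the text but names an argument the resident rejects, so it earns no credit and lowers precision. A correct key with an ungrounded quote would fail the same way.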

Available now in Communiti

See the reasons in your own consultation feedback.

Bring one real, de-identified feedback export to a 30-minute walkthrough, or ask for the benchmark pack and have your technical team check the scoring path.
