Product Updates · 14 June 2026 · 2 min read

New · Argument mining benchmark

Find the reasons behind every position.

Outcome reports need more than support and opposition counts. They need the reasons people gave, the evidence behind each reason, and prevalence numbers leaders can defend.

Why topic tags are not enough

Keyword shortcut

Counts words, not arguments

If a resident quotes a parking objection only to reject it, a search workflow still counts parking as the reason.

The wrong reason lands in the report because the word appeared, not because the resident advanced it.

Communiti argument mining

Extracts reasons residents advance

Reasons are recorded only when the resident is actually making that argument, and every reason is grounded in the original words.

The report count follows meaning, not just matching terms.

What shipped

Reasons now become report-ready evidence, not loose theme labels.

Reason extraction

96.0%

Strict reason F1 on audited feedback where the argument label and source evidence both had to match.

Grounding

100%

Every extracted reason is tied back to source text so a reviewer can check the claim.

Report counts

0.32 pts

Mean prevalence error across canonical arguments, which is the number that matters in outcome reports.

Tested before release

Benchmarked on the cases that break simple topic tagging.

The benchmark covers audited reason units, refuted arguments, implied reasons, bare stances, rambling multi-reason comments, emotional language, and multilingual feedback. Last run: 14 June 2026.

96.0%: strict reason F1 on the audited benchmark
100%: of extracted reasons grounded in source text
0.32 pts: mean prevalence error across canonical arguments
95.9%: F1 when the argument taxonomy was induced from the consultation text

Watch the failure mode

A keyword is not a reason. A quoted objection is not always an objection.

Outcome reports do not just need support and opposition counts. They need the reasons behind those positions: safety, access, cost, disruption, amenity, housing need. If the reasons are wrong, the response can be wrong even when the headline count is right. This benchmark tests the cases that break simple topic tagging: arguments quoted only to be rejected, implied reasons, stance-only comments with no reason, rambling multi-reason responses, emotional language, and multilingual feedback.

Synthetic benchmark examples

Two comments that look easy until you count them.

Search-style tagging can see words, but it cannot tell whether the resident is advancing the argument, rejecting it, or implying it without naming the topic.

Example 1

Quoted to reject

Keyword present, wrong argument

Resident comment: The traders keep saying the parking removal will kill the strip. It won't - the Summer Hill data showed turnover flat to up - and meanwhile riders keep ending up in emergency. Build it for the safety alone.

Search shortcut

Counts parking loss

The report overstates parking opposition and misses the reason the resident actually gave.

Communiti records

Rider safety

Evidence: “riders keep ending up in emergency”

Example 2

Implied, not named

No obvious topic word

Resident comment: My husband has two crook knees and a walking frame. Tell me how he gets from a side street to the physio at number 214. That's my whole submission.

Search shortcut

No parking reason found

The accessibility reason disappears because the resident never says parking.

Communiti records

Parking and access loss

Evidence: “My husband has two crook knees and a walking frame”
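
For readers who want to see the shortcut fail, the two comments above can be run through a minimal keyword tagger. This is only a sketch: the keyword lists and topic names are hypothetical, since the page does not publish the search terms behind the search-style baseline.

```python
import re

# Hypothetical keyword lists; the search terms behind the baseline are not published here.
TOPIC_KEYWORDS = {
    "parking_loss": ["parking", "car park", "car spaces"],
    "rider_safety": ["safety", "crash", "emergency"],
}


def keyword_tags(comment: str) -> set[str]:
    """Return every topic whose keyword appears anywhere in the comment."""
    text = comment.lower()
    return {
        topic
        for topic, words in TOPIC_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(word) + r"\b", text) for word in words)
    }


example_1 = (
    "The traders keep saying the parking removal will kill the strip. "
    "It won't - the Summer Hill data showed turnover flat to up - and meanwhile "
    "riders keep ending up in emergency. Build it for the safety alone."
)
example_2 = (
    "My husband has two crook knees and a walking frame. Tell me how he gets "
    "from a side street to the physio at number 214. That's my whole submission."
)

# Example 1: "parking" matches even though the resident rejects that argument.
print(keyword_tags(example_1))
# Example 2: no keyword matches, so the access reason is never counted.
print(keyword_tags(example_2))
```

The tagger has no way to tell a quoted objection from an advanced one, and it cannot count a reason that is never named. Adding more keywords shifts the errors around without removing them.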

The benchmark includes refuted arguments, implied reasons, bare stances, rambling multi-reason comments, emotional language, and untranslated multilingual responses. Communiti cleared every publication gate in the page-grade suite.

Results

The gap is whether the report can say why and prove it.

96.0% strict reason F1

73 responses, 96 audited reason units, 36 canonical arguments

Communiti reason mining 96.0%

Search-style topic tagging 56.7%

The gap is not just accuracy. It is whether a report can say why people objected or supported without a reviewer re-reading every row.

100% on implied reasons

The reason is present, but the resident never uses the official topic words

Communiti reason mining 100%

Search-style topic tagging 37.5%

Residents rarely write in report headings. Communiti still records the actual argument with the words that prove it.

95.9% full-workflow F1

Argument taxonomy induced from the consultation text, then used to classify responses

Communiti full workflow 95.9%

Benchmark pass line 80%

For a new consultation, the system can discover the argument structure first, then measure it with grounded evidence.

Accuracy held on the cases that make reason coding hard

Strict F1 by page-grade proof point

Proof point	Communiti strict F1	Pass line
Implied reasons (resident never names the topic)	100%	90%
Refuted arguments (argument quoted only to reject it)	92.3%	85%
Multilingual (untranslated community-language feedback)	96.3%	70%
Full workflow (taxonomy induced, then applied)	95.9%	80%

The important result is not that clean comments work. It is that the system keeps the reason attached to the resident's actual meaning when wording, language, and context make the task difficult.

Why it matters

If the reasons are wrong, the response can be wrong even when the headline count is right.

For community members

The reason they gave is counted

Implied access, safety, cost, and disruption concerns do not disappear because a resident used different words.

For executive leadership teams (ELT) and stakeholders

Reports can explain why, not only how many

Leaders can see which reasons drove support or opposition, with the evidence needed to defend a response.

For analysts

Keyword noise is separated from meaning

Quoted objections, refuted claims, and bare stances are handled without inflating reason counts.

At a glance

What changes when reasons are measured directly

Comparison of current analysis shortcuts and Communiti reason mining on the failure modes that matter for consultation reporting.
Capability	Manual review (analyst + spreadsheet)	Search/topic tagging (words counted as reasons)	Communiti Intelligence (reason mining)
Quoted objections	Caught: A careful reader can see the resident is rejecting the quoted argument.	Miscounted: The keyword is present, so the wrong reason is counted.	Separated: Only reasons the resident advances are recorded.
Implied reasons	Possible: Accurate when reviewers have enough time and apply the same rules.	Missed: No topic word means no reason to count.	Captured: The reason is inferred from the resident's wording and grounded in a quote.
Bare stance	Clean: A reviewer can leave the reason field empty.	Noisy: May invent a reason from nearby words or spreadsheet context.	Empty: Zero false reasons on stance-only responses.
Prevalence counts	Slow: Counts are defensible only after every row is read and reconciled.	Skewed: Refuted and implied reasons distort the totals.	Report-ready: Mean prevalence error was 0.32 percentage points across 36 arguments.
New consultation	Workshop: Analysts first need to discover and settle the reason taxonomy.	Ad hoc: Topic lists drift as reviewers add search terms.	Induced: The full workflow induced the taxonomy and still scored 95.9% F1.
Audit trail	By hand: Quotes and coding notes have to be maintained separately.	Thin: A count with little evidence for why each row was included.	Built in: Every reason is tied to the source words reviewers can check.

For technical reviewers

The scores behind the release.

The full scorecard

Page-grade suite on a synthetic argument-mining corpus: 73 entries, 96 audited reason units, six proposals, and 36 canonical arguments.

Metric	Communiti	Pass line or search shortcut
Strict reason F1	96.0%	56.7% search shortcut
Precision / recall	92.3% / 100%	54.3% / 59.4% search shortcut
Evidence grounding	100%	95.0% pass line
Report prevalence error (measured across all 36 canonical arguments)	0.32 pts mean / 1.84 pts max	2.0 pts mean / 5.0 pts max pass line
Implied-reason F1	100%	37.5% search shortcut
Refuted-argument F1	92.3%	42.1% search shortcut
Multilingual reason F1	96.3%	70.0% pass line
Open-taxonomy workflow (taxonomy induced from consultation text, then scored against audited gold arguments)	95.9% F1 / 91.7% taxonomy coverage	80.0% F1 / 85.0% coverage pass line
Stance-composed prevalence	0.27 pts mean error	5.0 pts pass line

Headline percentages are rounded for readability. The benchmark pack includes the synthetic corpus, gold decisions, raw outputs, prompt packs, scoring code, charts, and cached verification path.
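
As a quick consistency check on the scorecard, the strict F1 figures can be recomputed from the published precision and recall using the standard harmonic-mean formula. This sketch works from the rounded percentages on this page, so it can only confirm agreement to one decimal place; it is not the benchmark's scoring code.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the scorecard above.
communiti_f1 = f1(0.923, 1.0)
shortcut_f1 = f1(0.543, 0.594)

print(f"{communiti_f1:.1%}")  # 96.0%
print(f"{shortcut_f1:.1%}")   # 56.7%
```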

Methodology

How we measured

Test corpus

Synthetic consultation feedback only - no resident data - spanning 73 entries, 96 audited reason units, six realistic proposals, 36 canonical arguments, and untranslated multilingual responses.

Audited responses: 73; Held-out and development entries covering refuted, implied, bare-stance, rambling, emotional, and multilingual cases
Gold reason units: 96; Verbatim spans annotated as reasons the resident advances, not merely mentions
Canonical arguments: 36; Six reporting arguments per proposal, used to score prevalence and open-taxonomy coverage
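
One plausible reading of the prevalence metric is sketched below. It assumes that an argument's prevalence is the share of responses advancing it, and that the error is the absolute gap between predicted and gold prevalence in percentage points, summarised as a mean and a maximum across arguments. The function names and toy data are illustrative only; the benchmark pack's scoring code is the authority on the exact definition.

```python
from collections import Counter


def prevalence(labels_per_response: list[set[str]], arguments: list[str]) -> dict[str, float]:
    """Percentage of responses that advance each argument."""
    total = len(labels_per_response)
    counts = Counter(arg for labels in labels_per_response for arg in labels)
    return {arg: 100.0 * counts[arg] / total for arg in arguments}


def prevalence_error(
    gold: list[set[str]], predicted: list[set[str]], arguments: list[str]
) -> tuple[float, float]:
    """Mean and max absolute prevalence gap, in percentage points."""
    gold_prev = prevalence(gold, arguments)
    pred_prev = prevalence(predicted, arguments)
    errors = [abs(gold_prev[arg] - pred_prev[arg]) for arg in arguments]
    return sum(errors) / len(errors), max(errors)


# Toy data: four responses, three arguments. The last response is a bare stance.
arguments = ["safety", "parking", "cost"]
gold = [{"safety"}, {"parking"}, {"safety", "cost"}, set()]
predicted = [{"safety"}, {"parking"}, {"safety"}, set()]  # misses "cost" once

mean_error, max_error = prevalence_error(gold, predicted, arguments)
print(round(mean_error, 2), max_error)  # 8.33 25.0
```

With only four toy responses a single missed reason moves prevalence by 25 points, which is why the metric is only meaningful on a corpus of realistic size.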

Processed in Australia

Analysis runs on AWS in Sydney and Melbourne using Australia-geographic infrastructure. Feedback is not processed offshore.

Never used to train AI

Your community's feedback is not used to train any AI model, and the model provider has no access to it - contractually guaranteed by AWS.

Evidence on request

The benchmark pack includes synthetic data, gold decisions, scoring code, raw outputs, charts, and methodology notes for technical review.

Every reason traceable

Reason labels are not loose summaries. Each reason links back to source words in the original response so reviewers can check the evidence.
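
A reviewer could spot-check traceability with something as simple as the sketch below, which treats a reason as grounded when its evidence quote appears verbatim in the original response after normalising case, whitespace, and curly quotes. The normalisation rules are an assumption made for illustration, not a description of how Communiti matches evidence.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, straighten curly quotes, and collapse whitespace."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def is_grounded(evidence: str, response: str) -> bool:
    """True when a non-empty evidence quote is found verbatim in the response."""
    return bool(evidence.strip()) and normalize(evidence) in normalize(response)


response = (
    "My husband has two crook knees and a walking frame. Tell me how he gets "
    "from a side street to the physio at number 214. That's my whole submission."
)

print(is_grounded("My husband has two crook knees and a walking frame", response))  # True
print(is_grounded("Her husband cannot walk far", response))  # False: a paraphrase, not a quote
```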

The fine print we think you should read

Test data. The benchmark uses synthetic consultation feedback written for testing. No resident data was used. The corpus contains 73 entries and 96 audited reason units across six proposal scenarios.
Strict reason F1. A prediction only receives strict credit when the argument key is correct and the quoted evidence grounds to the original response.
Current-workflow baseline. The search-style topic shortcut is included because keyword searches, spreadsheet filters, and lightweight topic tagging are common ways teams approximate reason coding today. It is not a product comparison.
Open taxonomy. The open-taxonomy workflow induces the argument structure from the consultation text before scoring responses against audited gold arguments. This measures the full workflow for a new consultation, not only classification against a supplied list.
Reproducibility. Every number on this page traces to a dated evidence snapshot, raw outputs, scoring code, prompt packs, and a cached verification path. Ask us for the benchmark pack.
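
The strict-credit rule described above can be sketched as follows. Everything here is an assumption made for illustration: the `Prediction` record, the exact-substring grounding test, and the rule that each gold reason unit can be credited once. The scoring code in the benchmark pack defines the real behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    response_id: str
    argument_key: str
    evidence: str  # quote the system offers as proof


def strict_scores(
    predictions: list[Prediction],
    gold: set[tuple[str, str]],   # (response_id, argument_key) pairs
    responses: dict[str, str],    # response_id -> original text
) -> tuple[float, float, float]:
    """Precision, recall, and F1 where credit needs the right key AND grounded evidence."""
    matched: set[tuple[str, str]] = set()
    for pred in predictions:
        unit = (pred.response_id, pred.argument_key)
        source = responses.get(pred.response_id, "")
        grounded = bool(pred.evidence) and pred.evidence in source
        if unit in gold and grounded:
            matched.add(unit)  # each gold unit is credited at most once
    true_positives = len(matched)
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


responses = {
    "r1": "Build it for the safety alone.",
    "r2": "Too expensive for what we get.",
}
gold = {("r1", "rider_safety"), ("r2", "cost")}
predictions = [
    Prediction("r1", "rider_safety", "safety alone"),  # right key, grounded quote
    Prediction("r2", "cost", "it costs too much"),      # right key, paraphrased evidence
]

print(strict_scores(predictions, gold, responses))  # (0.5, 0.5, 0.5)
```

The second prediction names the correct argument but earns no strict credit because its evidence is a paraphrase, not the resident's words.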

Available now in Communiti

See the reasons in your own consultation feedback.

Bring one real, de-identified feedback export to a 30-minute walkthrough, or ask for the benchmark pack and have your technical team check the scoring path.
