Find the reasons behind every position.
Outcome reports need more than support and opposition counts. They need the reasons people gave, the evidence behind each reason, and defensible prevalence numbers leaders can stand behind.
Keyword shortcut
Counts words, not arguments
If a resident quotes a parking objection only to reject it, a search workflow still counts parking as the reason.
The wrong reason lands in the report because the word appeared, not because the resident advanced it.
Communiti argument mining
Extracts reasons residents advance
Reasons are recorded only when the resident is actually making that argument, and every reason is grounded in the original words.
The report count follows meaning, not just matching terms.
What shipped
Reasons now become report-ready evidence, not loose theme labels.
Reason extraction
96.0%
Strict reason F1 on audited feedback where the argument label and source evidence both had to match.
Grounding
100%
Every extracted reason is tied back to source text so a reviewer can check the claim.
Report counts
0.32 pts
Mean prevalence error across canonical arguments, which is the number that matters in outcome reports.
Tested before release
Benchmarked on the cases that break simple topic tagging.
The benchmark covers audited reason units, refuted arguments, implied reasons, bare stances, rambling multi-reason comments, emotional language, and multilingual feedback.
- 96.0%
- strict reason F1 on the audited benchmark
- 100%
- of extracted reasons grounded in source text
- 0.32 pts
- mean prevalence error across canonical arguments
- 95.9%
- F1 when the argument taxonomy was induced from the consultation text
Watch the failure mode
A keyword is not a reason. A quoted objection is not always an objection.
Outcome reports do not just need support and opposition counts. They need the reasons behind those positions: safety, access, cost, disruption, amenity, housing need. If the reasons are wrong, the response can be wrong even when the headline count is right. This benchmark tests the cases that break simple topic tagging: arguments quoted only to be rejected, implied reasons, stance-only comments with no reason, rambling multi-reason responses, emotional language, and multilingual feedback.
Synthetic benchmark examples
Two comments that look easy until you count them.
Search-style tagging can see words, but it cannot tell whether the resident is advancing the argument, rejecting it, or implying it without naming the topic.
Example 1
Quoted to reject
Resident comment: The traders keep saying the parking removal will kill the strip. It won't - the Summer Hill data showed turnover flat to up - and meanwhile riders keep ending up in emergency. Build it for the safety alone.
Search shortcut
Counts parking loss
The report overstates parking opposition and misses the reason the resident actually gave.
Communiti records
Rider safety
Evidence: “riders keep ending up in emergency”
Example 2
Implied, not named
Resident comment: My husband has two crook knees and a walking frame. Tell me how he gets from a side street to the physio at number 214. That's my whole submission.
Search shortcut
No parking reason found
The accessibility reason disappears because the resident never says parking.
Communiti records
Parking and access loss
Evidence: “My husband has two crook knees and a walking frame”
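For readers who want to see the failure mode run, here is a minimal sketch of a search-style shortcut applied to the two comments above. The term list and function name are illustrative assumptions, not Communiti's method or any team's actual search configuration.

```python
# Hypothetical keyword list for a "parking" report heading (an assumption
# made for this illustration only).
PARKING_TERMS = ("parking", "car park", "carpark")


def keyword_tags_parking(comment: str) -> bool:
    """Return True when any parking term appears, whatever the resident's stance."""
    text = comment.lower()
    return any(term in text for term in PARKING_TERMS)


example_1 = (
    "The traders keep saying the parking removal will kill the strip. "
    "It won't - the Summer Hill data showed turnover flat to up - and "
    "meanwhile riders keep ending up in emergency. "
    "Build it for the safety alone."
)
example_2 = (
    "My husband has two crook knees and a walking frame. "
    "Tell me how he gets from a side street to the physio at number 214. "
    "That's my whole submission."
)

# Example 1: the word is present, but the resident is rejecting the argument.
print(keyword_tags_parking(example_1))  # True
# Example 2: the access reason is implied, so no term matches.
print(keyword_tags_parking(example_2))  # False
```

The shortcut counts parking in the comment that rejects it and finds nothing in the comment that implies it, which is the pattern both examples describe.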
The benchmark includes refuted arguments, implied reasons, bare stances, rambling multi-reason comments, emotional language, and untranslated multilingual responses. Communiti cleared every publication gate in the page-grade suite.
Results
The gap is whether the report can say why and prove it.
96.0% strict reason F1
73 responses, 96 audited reason units, 36 canonical arguments
The gap is not just accuracy. It is whether a report can say why people objected to or supported a proposal without a reviewer re-reading every row.
100% on implied reasons
The reason is present, but the resident never uses the official topic words
Residents rarely write in report headings. Communiti still records the actual argument with the words that prove it.
95.9% full-workflow F1
Argument taxonomy induced from the consultation text, then used to classify responses
For a new consultation, the system can discover the argument structure first, then measure it with grounded evidence.
Accuracy held on the cases that make reason coding hard
Strict F1 by page-grade proof point
The important result is not that clean comments work. It is that the system keeps the reason attached to the resident's actual meaning when wording, language, and context make the task difficult.
Why it matters
If the reasons are wrong, the response can be wrong even when the headline count is right.
For community members
The reason they gave is counted
Implied access, safety, cost, and disruption concerns do not disappear because a resident used different words.
For ELT and stakeholders
Reports can explain why, not only how many
Leaders can see which reasons drove support or opposition, with the evidence needed to defend a response.
For analysts
Keyword noise is separated from meaning
Quoted objections, refuted claims, and bare stances are handled without inflating reason counts.
At a glance
What changes when reasons are measured directly
| Capability | Manual review (analyst + spreadsheet) | Search/topic tagging (words counted as reasons) | Communiti Intelligence (reason mining) |
|---|---|---|---|
| Quoted objections | Caught: A careful reader can see the resident is rejecting the quoted argument. | Miscounted: The keyword is present, so the wrong reason is counted. | Separated: Only reasons the resident advances are recorded. |
| Implied reasons | Possible: Accurate when reviewers have enough time and apply the same rules. | Missed: No topic word means no reason to count. | Captured: The reason is inferred from the resident's wording and grounded in a quote. |
| Bare stance | Clean: A reviewer can leave the reason field empty. | Noisy: May invent a reason from nearby words or spreadsheet context. | Empty: Zero false reasons on stance-only responses. |
| Prevalence counts | Slow: Counts are defensible only after every row is read and reconciled. | Skewed: Refuted and implied reasons distort the totals. | Report-ready: Mean prevalence error was 0.32 percentage points across 36 arguments. |
| New consultation | Workshop: Analysts first need to discover and settle the reason taxonomy. | Ad hoc: Topic lists drift as reviewers add search terms. | Induced: The full workflow induced the taxonomy and still scored 95.9% F1. |
| Audit trail | By hand: Quotes and coding notes have to be maintained separately. | Thin: A count with little evidence for why each row was included. | Built in: Every reason is tied to the source words reviewers can check. |
For technical reviewers
The scores behind the release.
The full scorecard
Page-grade suite on a synthetic argument-mining corpus: 73 entries, 96 audited reason units, six proposals, and 36 canonical arguments.
| Metric | Communiti | Pass line or search shortcut |
|---|---|---|
| Strict reason F1 | 96.0% | 56.7% search shortcut |
| Precision / recall | 92.3% / 100% | 54.3% / 59.4% search shortcut |
| Evidence grounding | 100% | 95.0% pass line |
| Report prevalence error (measured across all 36 canonical arguments) | 0.32 pts mean / 1.84 pts max | 2.0 pts mean / 5.0 pts max pass line |
| Implied-reason F1 | 100% | 37.5% search shortcut |
| Refuted-argument F1 | 92.3% | 42.1% search shortcut |
| Multilingual reason F1 | 96.3% | 70.0% pass line |
| Open-taxonomy workflow (taxonomy induced from consultation text, then scored against audited gold arguments) | 95.9% F1 / 91.7% taxonomy coverage | 80.0% F1 / 85.0% coverage pass line |
| Stance-composed prevalence | 0.27 pts mean error | 5.0 pts pass line |
Headline percentages are rounded for readability. The benchmark pack includes the synthetic corpus, gold decisions, raw outputs, prompt packs, scoring code, charts, and cached verification path.
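As a quick consistency check on the scorecard, F1 is the harmonic mean of precision and recall. The sketch below recomputes the strict reason F1 from the rounded precision and recall shown in the table; the function name is ours, and the inputs are the published rounded figures rather than the raw counts in the benchmark pack.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision / recall values copied from the scorecard rows.
communiti = f1_score(0.923, 1.000)
search_shortcut = f1_score(0.543, 0.594)

print(f"{communiti:.1%}")        # 96.0%
print(f"{search_shortcut:.1%}")  # 56.7%
```

Both results agree with the strict reason F1 row, so the precision, recall, and F1 figures are mutually consistent at the published rounding.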
Methodology
How we measured
Test corpus
Synthetic consultation feedback only - no resident data - spanning 73 entries, 96 audited reason units, six realistic proposals, 36 canonical arguments, and untranslated multilingual responses.
- Audited responses: 73. Held-out and development entries covering refuted, implied, bare-stance, rambling, emotional, and multilingual cases.
- Gold reason units: 96. Verbatim spans annotated as reasons the resident advances, not merely mentions.
- Canonical arguments: 36. Six reporting arguments per proposal, used to score prevalence and open-taxonomy coverage.
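One plausible reading of the prevalence metric can be sketched as follows. We assume here that an argument's prevalence is the percentage of responses advancing it, and that the error is the absolute gap in percentage points between predicted and gold prevalence, averaged over arguments. The exact definition lives in the benchmark pack's scoring code; the argument keys and the four-response toy data are made up for illustration.

```python
from collections import Counter


def prevalence(labels_per_response, arguments):
    """Percentage of responses that advance each argument."""
    n = len(labels_per_response)
    if n == 0:
        raise ValueError("need at least one response")
    # set(labels) ensures a response counts once per argument.
    counts = Counter(arg for labels in labels_per_response for arg in set(labels))
    return {arg: 100.0 * counts[arg] / n for arg in arguments}


def prevalence_errors(gold, predicted, arguments):
    """Return (mean, max) absolute prevalence error in percentage points."""
    gold_prev = prevalence(gold, arguments)
    pred_prev = prevalence(predicted, arguments)
    errors = [abs(gold_prev[a] - pred_prev[a]) for a in arguments]
    return sum(errors) / len(errors), max(errors)


# Toy data: the third response is a bare stance with no reason in gold,
# but the prediction wrongly assigns it a parking reason.
arguments = ["rider_safety", "parking_loss"]
gold = [{"rider_safety"}, {"parking_loss"}, set(), {"rider_safety", "parking_loss"}]
predicted = [{"rider_safety"}, {"parking_loss"}, {"parking_loss"}, {"rider_safety", "parking_loss"}]

mean_err, max_err = prevalence_errors(gold, predicted, arguments)
print(mean_err, max_err)  # 12.5 25.0
```

In this toy case a single invented reason moves one argument's prevalence by 25 points, which is why false reasons on stance-only responses matter for report counts.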
Processed in Australia
Analysis runs on AWS in Sydney and Melbourne using Australia-geographic infrastructure. Feedback is not processed offshore.
Never used to train AI
Your community's feedback is not used to train any AI model, and the model provider has no access to it - contractually guaranteed by AWS.
Evidence on request
The benchmark pack includes synthetic data, gold decisions, scoring code, raw outputs, charts, and methodology notes for technical review.
Every reason traceable
Reason labels are not loose summaries. Each reason links back to source words in the original response so reviewers can check the evidence.
The fine print we think you should read
- Test data. The benchmark uses synthetic consultation feedback written for testing. No resident data was used. The corpus contains 73 entries and 96 audited reason units across six proposal scenarios.
- Strict reason F1. A prediction only receives strict credit when the argument key is correct and the quoted evidence grounds to the original response.
- Current-workflow baseline. The search-style topic shortcut is included because keyword searches, spreadsheet filters, and lightweight topic tagging are common ways teams approximate reason coding today. It is not a product comparison.
- Open taxonomy. The open-taxonomy workflow induces the argument structure from the consultation text before scoring responses against audited gold arguments. This measures the full workflow for a new consultation, not only classification against a supplied list.
- Reproducibility. Every number on this page traces to a dated evidence snapshot, raw outputs, scoring code, prompt packs, and a cached verification path. Ask us for the benchmark pack.
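The strict-credit rule in the fine print can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's scoring code: the `Prediction` type, the argument keys, the whitespace and case normalisation, and the one-prediction-per-key simplification are all ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    argument_key: str  # hypothetical canonical argument identifier
    evidence: str      # quote the system offers as proof


def is_grounded(evidence: str, response_text: str) -> bool:
    """True when the quote appears in the response (whitespace/case normalised)."""
    def norm(s: str) -> str:
        return " ".join(s.split()).lower()
    return bool(evidence.strip()) and norm(evidence) in norm(response_text)


def strict_scores(response_text, gold_keys, predictions):
    """Precision, recall, F1 where credit needs a correct key AND a grounded quote.

    Assumes at most one prediction per argument key for a response.
    """
    tp = sum(
        1 for p in predictions
        if p.argument_key in gold_keys and is_grounded(p.evidence, response_text)
    )
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(gold_keys) if gold_keys else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


response = (
    "The traders keep saying the parking removal will kill the strip. "
    "It won't - the Summer Hill data showed turnover flat to up - and "
    "meanwhile riders keep ending up in emergency. "
    "Build it for the safety alone."
)
gold = {"rider_safety"}
predictions = [
    Prediction("rider_safety", "riders keep ending up in emergency"),
    Prediction("parking_loss", "parking removal will kill the strip"),  # refuted, not advanced
]

precision, recall, f1 = strict_scores(response, gold, predictions)
print(precision, recall, round(f1, 3))  # 0.5 1.0 0.667
```

A prediction with the right key but a paraphrased quote would also score zero under this rule, because the evidence would not ground to the original response.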
Available now in Communiti
See the reasons in your own consultation feedback.
Bring one real, de-identified feedback export to a 30-minute walkthrough, or ask for the benchmark pack and have your technical team check the scoring path.
