Measure what people want, not just how they sound.
Communiti separates position from tone. It records support, opposition, acceptance, rejection, mixed feedback, and conditional support, even when the wording sounds like something else.
Headline results
- 98.7% stance accuracy on the frozen test split
- 97.2% macro-F1 across stance classes
- 100% accuracy on tone-divergent feedback where mood and position split
- 100% stable across 115 meaning-preserving wording changes
Proof point № 1
Tone tells you how people feel. Stance tells you what they want done.
That distinction is a reporting risk, not a technical nicety. A resident can support a plan angrily. Another can reject it politely. This benchmark tests the cases where a tone-based read would distort the findings: angry support, polite opposition, conditional acceptance, mixed responses, multilingual feedback, and wording changes that should not change the result.
Synthetic benchmark examples
Two comments. Two tone traps. Opposite report outcomes.
These are the kinds of responses that make sentiment dashboards unsafe for consultation reporting. Tone points one way; the actual position points the other.
Example 1
Angry support
Resident comment: I am fed up with how dangerous this road has become, so yes, build the protected cycleway. Just get on with it before someone is hurt.
Tone shortcut
Negative tone -> Opposition
A tone shortcut moves this resident into the opposition count.
Communiti records
Actual position -> Support
Evidence: “yes, build the protected cycleway”
Example 2
Polite opposition
Resident comment: Thanks for the clear proposal and the work behind it. I still do not support removing the parking bays outside the shops.
Tone shortcut
Positive tone -> Support
A tone shortcut moves this resident into the support count.
Communiti records
Actual position -> Opposition
Evidence: “I still do not support removing the parking bays”
The benchmark contains 28 tone-divergent responses like these. Communiti scored 100% on that subset; the tone shortcut scored 25%.
100% when tone and position split
28 responses built to catch angry support and polite opposition
If tone becomes position, reports can flip support into opposition or opposition into support.
98.7% on held-out cases
79 responses scored after development, with audited gold labels
The best production-style run cleared both pass lines (85.0% accuracy, 80.0% macro-F1) with 98.7% accuracy and 97.2% macro-F1 across support, opposition, acceptance, rejection, mixed, conditional, and neutral cases.
100% condition grounding
27 audited condition quotes, scored for recall and evidence grounding
Conditional support is stored with the resident's actual condition so reviewers can check the decision.
Accuracy held as the cases got harder
Frozen benchmark tiers include clean feedback, harder wording, adversarial cases, and multilingual responses
The important finding is not that the easy cases worked. It is that the system stayed accurate on the hard, multilingual, and tone-divergent cases that usually distort consultation reporting.
Proof point № 2
Stance held across community languages
The benchmark includes ten community languages and mixed-language cases so non-English feedback is not treated as an afterthought. The best run scored 100% accuracy on the multilingual subset.
- Mandarin 中文
- Arabic العربية
- Vietnamese Tiếng Việt
- Cantonese 廣東話
- Punjabi ਪੰਜਾਬੀ
- Greek Ελληνικά
- Italian Italiano
- Hindi हिन्दी
- Te Reo Māori
- Samoan Gagana Sāmoa
These are the ten languages benchmarked in this run. Communiti supports more than 50 languages in production, with the same evidence-first review workflow.
At a glance
What changes when stance is measured directly
| Capability | Manual review (analyst + spreadsheet) | Sentiment shortcut (tone treated as position) | Communiti Intelligence (stance detection) |
|---|---|---|---|
| Angry support | Caught. A careful reader can separate frustration from the actual position. | Flipped. Negative tone is treated as opposition. | Support. Support is recorded, with the frustrated wording still available for review. |
| Polite opposition | Caught. A reviewer can see the rejection if they read closely. | Flipped. Positive tone is treated as support. | Opposition. The position is separated from the politeness of the wording. |
| Conditional support | Slow. The condition has to be copied into a report or tracking sheet by hand. | Flattened. The response becomes positive or mixed, but the condition is not preserved. | Grounded. The stance and condition are both captured, with the resident's words attached. |
| Mixed feedback | Possible. Accurate when reviewers have enough time and apply the same rules. | Collapsed. Multiple positions get reduced to a single mood label. | Separated. Mixed stance is preserved instead of being forced into support or opposition. |
| Small wording changes | Variable. Different reviewers may read borderline wording differently. | Brittle. Tone words can change the label even when the position stays the same. | Stable. 100% invariant across 115 meaning-preserving perturbation pairs. |
| Audit trail | By hand. Review notes and quotes have to be maintained separately. | Thin. A label with little evidence for why the position was assigned. | Built in. Labels, conditions, confidence, and evidence can be traced back to the response. |
For your technical reviewers
The scores behind the headlines
Headline percentages are rounded for readability. These are the underlying precision, recall, and F1 figures - the same ones in the published results files.
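For reviewers who want to see why accuracy and macro-F1 can differ, here is a minimal sketch of both metrics. It is not the scoring notebook from the benchmark pack. The function names, the toy labels, and the choice to average F1 over the classes present in the gold labels are assumptions made for illustration.

```python
from collections import Counter

def accuracy(gold, pred):
    """Share of responses whose predicted stance equals the gold stance."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold (assumed)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in sorted(set(gold)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Illustrative labels only, not benchmark data.
gold = ["support", "support", "support", "opposition", "mixed"]
pred = ["support", "support", "support", "opposition", "opposition"]

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

Because macro-F1 weights every stance class equally, a single missed response in a rare class such as mixed pulls it down much further than it pulls accuracy. That is why the two figures are reported side by side.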
The full scorecard
Production-style run on a 292-entry synthetic consultation corpus with a 79-entry frozen test split.
| Metric | Communiti | Pass line or baseline |
|---|---|---|
| Frozen test stance accuracy | 98.7% | 85.0% pass line |
| Macro-F1 across stance classes | 97.2% | 80.0% pass line |
| Tone-divergent stance accuracy (28 entries where tone and stance deliberately diverge) | 100% | 25.0% sentiment shortcut |
| Condition recall (27 audited condition quotes in the corpus) | 100% | 75.0% pass line |
| Condition grounding | 100% | 90.0% pass line |
| Perturbation invariance (115 wording changes that preserve the underlying stance) | 100% | 95.0% pass line |
| Temperature stability (identical labels across repeated t=0.0 and t=0.1 runs) | 100% | 98.0% pass line |
| Agreement auto-accept (agreement between two independent production-style arms) | 99.6% accuracy at 97.3% coverage | 99.0% accuracy at 85.0% coverage pass line |
| Selective prediction | 98.0% accuracy at 96.2% coverage | Confidence threshold 0.85 |
The benchmark pack includes the synthetic corpus, gold decisions, raw outputs, scoring notebook, and cached verification path.
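The last two scorecard rows report accuracy at a given coverage. The sketch below shows one way such figures could be computed. It is an illustration under assumptions, not the published scoring code: the record fields, the function names, and the use of `>=` at the 0.85 threshold are all hypothetical.

```python
def selective_accuracy(records, threshold=0.85):
    """Return (accuracy on kept predictions, coverage) for a confidence threshold."""
    kept = [r for r in records if r["confidence"] >= threshold]
    coverage = len(kept) / len(records)
    if not kept:
        return None, coverage
    correct = sum(r["gold"] == r["pred"] for r in kept)
    return correct / len(kept), coverage

def agreement_auto_accept(gold, arm_a, arm_b):
    """Accept a label only when two independent arms agree; score the accepted set."""
    accepted = [(g, a) for g, a, b in zip(gold, arm_a, arm_b) if a == b]
    coverage = len(accepted) / len(gold)
    if not accepted:
        return None, coverage
    return sum(g == a for g, a in accepted) / len(accepted), coverage

# Illustrative records only, not benchmark data.
records = [
    {"gold": "support", "pred": "support", "confidence": 0.97},
    {"gold": "opposition", "pred": "opposition", "confidence": 0.91},
    {"gold": "mixed", "pred": "support", "confidence": 0.62},
    {"gold": "conditional", "pred": "conditional", "confidence": 0.88},
]
sel_acc, sel_cov = selective_accuracy(records)

gold = [r["gold"] for r in records]
arm_a = [r["pred"] for r in records]
arm_b = ["support", "opposition", "mixed", "conditional"]
agr_acc, agr_cov = agreement_auto_accept(gold, arm_a, arm_b)
print(sel_acc, sel_cov, agr_acc, agr_cov)
```

In both cases, responses outside the accepted set are not discarded. They are the ones that would be routed to a human reviewer.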
Methodology
How we measured
Test corpus
Synthetic consultation feedback only - no resident data - spanning 292 entries, a 79-entry frozen test split, 27 audited condition quotes, 28 tone-divergent responses, 115 perturbation pairs, and ten community languages.
- Frozen test split: 79. Held-out responses scored after development, including easy, medium, hard, tone-divergent, and multilingual cases.
- Development split: 213. Synthetic consultation responses used to develop and stress the stance taxonomy.
- Perturbation pairs: 115. Meaning-preserving wording changes used to check label stability.
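A perturbation-invariance check of this kind can be sketched as follows. It assumes each pair is an original response plus a reworded version that should keep the same stance. `toy_classify` is a deliberately brittle keyword stand-in written for this example. It is not Communiti's classifier, and the pairs are not from the benchmark corpus.

```python
def invariance_rate(pairs, classify):
    """Share of (original, perturbed) pairs that receive the same label."""
    stable = sum(classify(original) == classify(perturbed) for original, perturbed in pairs)
    return stable / len(pairs)

def toy_classify(text):
    """Hypothetical keyword classifier, used only to make the sketch runnable."""
    lowered = text.lower()
    if "do not support" in lowered or "oppose" in lowered:
        return "opposition"
    if "support" in lowered or "yes" in lowered:
        return "support"
    return "neutral"

# Illustrative pairs only.
pairs = [
    ("Yes, build the cycleway.", "Yes, please build the cycleway."),
    ("I do not support removing the parking bays.",
     "I do not support taking away the parking bays."),
    ("I oppose the plan.", "I am against the plan."),
]
rate = invariance_rate(pairs, toy_classify)
print(f"invariance={rate:.3f}")
```

The third pair changes label under the keyword stand-in even though the position is identical. The invariance metric is designed to expose exactly that kind of brittleness.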
Processed in Australia
Analysis runs on AWS in Sydney and Melbourne using Australia-geographic infrastructure. Feedback is not processed offshore.
Never used to train AI
Your community's feedback is not used to train any AI model, and the model provider has no access to it - contractually guaranteed by AWS.
Evidence on request
The benchmark pack includes synthetic data, gold decisions, scoring code, raw outputs, charts, and methodology notes for technical review.
Every condition traceable
Conditional feedback is not only labelled. The condition is grounded in the original response so reviewers can check the evidence behind the result.
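One plausible reading of condition grounding is that the stored condition must be a verbatim quote from the response. The sketch below scores recall and grounding on that assumption. The matching rule (case-insensitive, whitespace-normalised substring), the field names, and the sample items are hypothetical. The benchmark pack's own scoring rules take precedence.

```python
import re

def normalise(text):
    """Lowercase and collapse whitespace so formatting differences do not count."""
    return re.sub(r"\s+", " ", text).strip().lower()

def is_grounded(quote, response):
    """A quote is grounded when it appears verbatim in the original response."""
    return normalise(quote) in normalise(response)

def condition_scores(items):
    """Recall against gold condition quotes, and grounding of extracted quotes."""
    extracted = [i for i in items if i["extracted_condition"]]
    grounded = sum(is_grounded(i["extracted_condition"], i["response"]) for i in extracted)
    recalled = sum(
        normalise(i["gold_condition"]) in normalise(i["extracted_condition"])
        for i in extracted
    )
    return {
        "recall": recalled / len(items),
        "grounding": grounded / len(extracted) if extracted else None,
    }

# Illustrative items only, not benchmark data.
items = [
    {"response": "I support the cycleway if the bus stop stays where it is.",
     "gold_condition": "if the bus stop stays where it is",
     "extracted_condition": "if the bus stop stays where it is"},
    {"response": "Fine by me, provided deliveries can still reach the shops.",
     "gold_condition": "provided deliveries can still reach the shops",
     "extracted_condition": "as long as deliveries are possible"},
]
scores = condition_scores(items)
print(scores)
```

The second item is a paraphrase rather than a quote, so it counts against both recall and grounding. A reviewer could not find those exact words in the resident's response.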
The fine print we think you should read
- Test data. The benchmark uses synthetic consultation feedback written for testing. No resident data was used. The corpus contains 292 entries, including a 79-entry frozen test split and a 213-entry development split.
- Gold decisions. Gold labels and condition decisions were audited before scoring. The benchmark pack includes the gold-decision notes used to resolve ambiguous cases.
- Sentiment shortcut. The sentiment-as-stance baseline is included because it is the most common analytical mistake in this task: mapping positive tone to support and negative tone to opposition. It is not a product comparison.
- Reproducibility. Every number on this page traces to the run summary, raw outputs, and scoring notebook. The cached verification path re-scores existing outputs without making live model calls.
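The sentiment shortcut described above can be written in a few lines. The sketch below applies it to the page's two worked examples. The tone labels are hand-assigned for illustration, and the `neutral` fallback is an assumption. No sentiment model is called.

```python
# Tone-as-stance mapping: positive tone -> support, negative tone -> opposition.
TONE_TO_STANCE = {"positive": "support", "negative": "opposition"}

def shortcut_stance(tone):
    """Map a tone label straight to a stance, as the shortcut baseline does."""
    return TONE_TO_STANCE.get(tone, "neutral")  # fallback is an assumption

# The two worked examples from this page, with hand-assigned tone labels.
examples = [
    {"name": "angry support", "tone": "negative", "gold": "support"},
    {"name": "polite opposition", "tone": "positive", "gold": "opposition"},
]

flipped = [e["name"] for e in examples if shortcut_stance(e["tone"]) != e["gold"]]
print(flipped)
```

Both examples land in the wrong column under this mapping. The tone-divergent subset is built to measure exactly this failure.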
Check our work
See your own consultation benchmarked this way
Bring one real, de-identified feedback export to a 30-minute walkthrough - or request the benchmark pack and have your technical team verify every number on this page.
