Stance detection Communiti Intelligence · v1.0 · Last run

Measure what people want, not just how they sound.

Communiti separates position from tone: support, opposition, acceptance, rejection, mixed feedback, and conditional support, even when the wording sounds like something else.

Headline results

98.7%
stance accuracy on the frozen test split
97.2%
macro-F1 across stance classes
100%
accuracy on tone-divergent feedback where mood and position split
100%
stable across 115 meaning-preserving wording changes

Proof point № 1

Tone tells you how people feel. Stance tells you what they want done.

That distinction is a reporting risk, not a technical nicety. A resident can support a plan angrily. Another can reject it politely. This benchmark tests the cases where a tone-based read would distort the findings: angry support, polite opposition, conditional acceptance, mixed responses, multilingual feedback, and wording changes that should not change the result.

Synthetic benchmark examples

Two comments. Two tone traps. Opposite report outcomes.

These are the kinds of responses that make sentiment dashboards unsafe for consultation reporting. Tone points one way; the actual position points the other.


Example 1

Angry support

Negative mood, supportive position
Resident comment: I am fed up with how dangerous this road has become, so yes, build the protected cycleway. Just get on with it before someone is hurt.

Tone shortcut: Negative tone -> Opposition

A tone shortcut moves this resident into the opposition count.

Communiti records: Actual position -> Support

Evidence: “yes, build the protected cycleway”

Example 2

Polite opposition

Positive mood, opposing position
Resident comment: Thanks for the clear proposal and the work behind it. I still do not support removing the parking bays outside the shops.

Tone shortcut: Positive tone -> Support

A tone shortcut moves this resident into the support count.

Communiti records: Actual position -> Opposition

Evidence: “I still do not support removing the parking bays”

The benchmark contains 28 tone-divergent responses like these. Communiti scored 100% on that subset; the tone shortcut scored 25%.
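To make the tone shortcut concrete, here is a minimal sketch of the mistake it represents, applied to the two example comments above. The label names and the `tone_shortcut` function are assumptions for illustration only; they are not taken from the benchmark pack.

```python
# Hypothetical label names; the benchmark's real schema is not shown on this page.
TONE_TO_STANCE = {"negative": "opposition", "positive": "support"}


def tone_shortcut(tone: str) -> str:
    """Map a tone label straight to a stance label (the mistake being tested)."""
    return TONE_TO_STANCE.get(tone, "neutral")


def shortcut_accuracy(rows) -> float:
    """Share of rows where the tone shortcut matches the gold stance."""
    hits = sum(tone_shortcut(r["tone"]) == r["gold"] for r in rows)
    return hits / len(rows)


# The two example comments above, reduced to tone and gold stance.
examples = [
    {"name": "angry support", "tone": "negative", "gold": "support"},
    {"name": "polite opposition", "tone": "positive", "gold": "opposition"},
]

for r in examples:
    print(f'{r["name"]}: shortcut says {tone_shortcut(r["tone"])}, gold is {r["gold"]}')
print("shortcut accuracy on these two:", shortcut_accuracy(examples))
```

Both toy rows are tone traps, so the shortcut gets neither right here. On the full 28-response subset the page reports 25% for the shortcut.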

100% when tone and position split

28 responses built to catch angry support and polite opposition

If tone becomes position, reports can flip support into opposition or opposition into support.

98.7% on held-out cases

79 responses scored after development, with audited gold labels

The best production-style run cleared the target with 98.7% accuracy and 97.2% macro-F1 across support, opposition, acceptance, rejection, mixed, conditional, and neutral cases.

100% condition grounding

27 audited condition quotes, scored for recall and evidence grounding

Conditional support is stored with the resident's actual condition so reviewers can check the decision.

Accuracy held as the cases got harder

Frozen benchmark tiers include clean feedback, harder wording, adversarial cases, and multilingual responses

The important finding is not that the easy cases worked. It is that the system stayed accurate on the hard, multilingual, and tone-divergent cases that usually distort consultation reporting.

Proof point № 2

Stance held across community languages

The benchmark includes ten community languages and mixed-language cases so non-English feedback is not treated as an afterthought. The best run scored 100% accuracy on the multilingual subset.

  • Mandarin 中文
  • Arabic العربية
  • Vietnamese Tiếng Việt
  • Cantonese 廣東話
  • Punjabi ਪੰਜਾਬੀ
  • Greek Ελληνικά
  • Italian Italiano
  • Hindi हिन्दी
  • Te Reo Māori
  • Samoan Gagana Sāmoa

These are the ten languages benchmarked in this run. Communiti supports more than 50 languages in production, with the same evidence-first review workflow.

At a glance

What changes when stance is measured directly

Comparison of common review shortcuts and Communiti stance detection on the cases that matter in consultation analysis.
| Capability | Manual review (analyst + spreadsheet) | Sentiment shortcut (tone treated as position) | Communiti Intelligence (stance detection) |
| --- | --- | --- | --- |
| Angry support | Caught. A careful reader can separate frustration from the actual position. | Flipped. Negative tone is treated as opposition. | Support. Support is recorded, with the frustrated wording still available for review. |
| Polite opposition | Caught. A reviewer can see the rejection if they read closely. | Flipped. Positive tone is treated as support. | Opposition. The position is separated from the politeness of the wording. |
| Conditional support | Slow. The condition has to be copied into a report or tracking sheet by hand. | Flattened. The response becomes positive or mixed, but the condition is not preserved. | Grounded. The stance and condition are both captured, with the resident's words attached. |
| Mixed feedback | Possible. Accurate when reviewers have enough time and apply the same rules. | Collapsed. Multiple positions get reduced to a single mood label. | Separated. Mixed stance is preserved instead of being forced into support or opposition. |
| Small wording changes | Variable. Different reviewers may read borderline wording differently. | Brittle. Tone words can change the label even when the position stays the same. | Stable. 100% invariant across 115 meaning-preserving perturbation pairs. |
| Audit trail | By hand. Review notes and quotes have to be maintained separately. | Thin. A label with little evidence for why the position was assigned. | Built in. Labels, conditions, confidence, and evidence can be traced back to the response. |

For your technical reviewers

The scores behind the headlines

Headline percentages are rounded for readability. These are the underlying precision, recall, and F1 figures - the same ones in the published results files.
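For reviewers who want to see how accuracy and macro-F1 relate, the sketch below computes both from a list of gold and predicted labels. It is an illustration under assumptions, not the scoring notebook itself: the function names are hypothetical, and macro-F1 is taken as the unweighted mean of per-class F1 over every label that appears in either list.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of responses where the predicted stance equals the gold stance."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_prf(gold, pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was not p
            fn[g] += 1  # gold was g, but g was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so small classes count as much as large ones."""
    scores = per_class_prf(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy labels for illustration only; not benchmark data.
gold = ["support", "support", "opposition", "conditional"]
pred = ["support", "opposition", "opposition", "conditional"]
print("accuracy:", accuracy(gold, pred))
print("macro-F1:", round(macro_f1(gold, pred), 4))
```

Macro averaging is why the two headline figures can differ: one error in a small class such as conditional or mixed moves macro-F1 more than it moves overall accuracy.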

The full scorecard

Production-style run on a 292-entry synthetic consultation corpus with a 79-entry frozen test split.

| Metric | Communiti | Pass line or baseline |
| --- | --- | --- |
| Frozen test stance accuracy | 98.7% | 85.0% pass line |
| Macro-F1 across stance classes | 97.2% | 80.0% pass line |
| Tone-divergent stance accuracy (28 entries where tone and stance deliberately diverge) | 100% | 25.0% sentiment shortcut |
| Condition recall (27 audited condition quotes in the corpus) | 100% | 75.0% pass line |
| Condition grounding | 100% | 90.0% pass line |
| Perturbation invariance (115 wording changes that preserve the underlying stance) | 100% | 95.0% pass line |
| Temperature stability (identical labels across repeated t=0.0 and t=0.1 runs) | 100% | 98.0% pass line |
| Agreement auto-accept (agreement between two independent production-style arms) | 99.6% accuracy at 97.3% coverage | 99.0% accuracy at 85.0% coverage pass line |
| Selective prediction | 98.0% accuracy at 96.2% coverage | Confidence threshold 0.85 |
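The last two scorecard rows report accuracy at a given coverage. A minimal sketch of how such figures could be computed follows, assuming that coverage is the share of responses that clear the filter and accuracy is measured only on those. The field names (`confidence`, `arm_a`, `arm_b`) and the `>=` comparison are assumptions, not the published scoring code.

```python
def selective_prediction(rows, threshold=0.85):
    """Return (accuracy, coverage) for rows at or above a confidence threshold."""
    kept = [r for r in rows if r["confidence"] >= threshold]
    coverage = len(kept) / len(rows)
    acc = sum(r["pred"] == r["gold"] for r in kept) / len(kept) if kept else 0.0
    return acc, coverage


def agreement_auto_accept(rows):
    """Return (accuracy, coverage) for rows where two independent arms agree."""
    kept = [r for r in rows if r["arm_a"] == r["arm_b"]]
    coverage = len(kept) / len(rows)
    acc = sum(r["arm_a"] == r["gold"] for r in kept) / len(kept) if kept else 0.0
    return acc, coverage


# Toy rows for illustration only; not benchmark data.
rows = [
    {"gold": "support", "pred": "support", "confidence": 0.97,
     "arm_a": "support", "arm_b": "support"},
    {"gold": "opposition", "pred": "opposition", "confidence": 0.91,
     "arm_a": "opposition", "arm_b": "opposition"},
    {"gold": "mixed", "pred": "support", "confidence": 0.62,
     "arm_a": "support", "arm_b": "mixed"},
    {"gold": "conditional", "pred": "conditional", "confidence": 0.88,
     "arm_a": "conditional", "arm_b": "conditional"},
]

print("selective prediction:", selective_prediction(rows))
print("agreement auto-accept:", agreement_auto_accept(rows))
```

In both cases the responses that fall outside the filter are the ones a human reviewer would look at first, which is why accuracy and coverage are reported together.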

The benchmark pack includes the synthetic corpus, gold decisions, raw outputs, scoring notebook, and cached verification path.

Methodology

How we measured

Test corpus

Synthetic consultation feedback only - no resident data - spanning 292 entries, a 79-entry frozen test split, 27 audited condition quotes, 28 tone-divergent responses, 115 perturbation pairs, and ten community languages.

  • Frozen test split: 79. Held-out responses scored after development, including easy, medium, hard, tone-divergent, and multilingual cases.
  • Development split: 213. Synthetic consultation responses used to develop and stress the stance taxonomy.
  • Perturbation pairs: 115. Meaning-preserving wording changes used to check label stability.
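Label stability over perturbation pairs can be sketched as the share of pairs where both wordings receive the same label. The example below is an illustration only: `toy_classify` is a deliberately naive keyword stand-in (not Communiti's model), and the pairs are made up for the sketch rather than drawn from the 115 benchmark pairs.

```python
def invariance_rate(pairs, classify):
    """Share of (original, perturbed) pairs that receive the same label."""
    stable = sum(classify(a) == classify(b) for a, b in pairs)
    return stable / len(pairs)


def toy_classify(text):
    """Naive keyword classifier, used only to show how brittleness appears."""
    t = text.lower()
    if "do not support" in t or "oppose" in t:
        return "opposition"
    if "support" in t or "yes" in t:
        return "support"
    return "neutral"


# Hypothetical meaning-preserving rewrites, written for this sketch.
pairs = [
    ("Yes, build the protected cycleway.", "Yes, please build the protected cycleway."),
    ("I do not support removing the parking bays.", "I oppose removing the parking bays."),
    ("I support the plan.", "I back the plan."),
]

print("invariance:", round(invariance_rate(pairs, toy_classify), 3))
```

The third pair breaks the keyword stand-in because "back" is not in its vocabulary, even though the position is unchanged. That is the kind of wording sensitivity the perturbation pairs are designed to expose.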

Processed in Australia

Analysis runs on AWS in Sydney and Melbourne using Australia-geographic infrastructure. Feedback is not processed offshore.

Never used to train AI

Your community's feedback is not used to train any AI model, and the model provider has no access to it - contractually guaranteed by AWS.

Evidence on request

The benchmark pack includes synthetic data, gold decisions, scoring code, raw outputs, charts, and methodology notes for technical review.

Every condition traceable

Conditional feedback is not only labelled. The condition is grounded in the original response so reviewers can check the evidence behind the result.
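One simple way to check grounding is to require that the stored condition quote appears in the original response. The sketch below assumes a verbatim match after whitespace and case normalisation; the actual grounding rule, the field names, and the example response are all assumptions made for illustration.

```python
import re


def normalise(text):
    """Collapse whitespace and ignore case so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def is_grounded(quote, response):
    """True when a non-empty quote appears verbatim in the response."""
    q = normalise(quote)
    return bool(q) and q in normalise(response)


def grounding_rate(records):
    """Share of records whose condition quote is found in the response."""
    hits = sum(is_grounded(r["condition_quote"], r["response"]) for r in records)
    return hits / len(records)


# Hypothetical conditional response, written for this sketch.
response = "I would support the cycleway if the loading zone outside the shops is kept."
records = [
    {"response": response,
     "condition_quote": "if the loading zone outside the shops is kept"},
    {"response": response,
     "condition_quote": "as long as deliveries can still happen"},  # paraphrase, not a quote
]

for r in records:
    print(repr(r["condition_quote"]), "->", is_grounded(r["condition_quote"], r["response"]))
print("grounding rate:", grounding_rate(records))
```

A paraphrased condition fails this check by design: the reviewer should be able to find the resident's own words in the response.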

The fine print we think you should read

  1. Test data. The benchmark uses synthetic consultation feedback written for testing. No resident data was used. The corpus contains 292 entries, including a 79-entry frozen test split and a 213-entry development split.
  2. Gold decisions. Gold labels and condition decisions were audited before scoring. The benchmark pack includes the gold-decision notes used to resolve ambiguous cases.
  3. Sentiment shortcut. The sentiment-as-stance baseline is included because it is the most common analytical mistake in this task: mapping positive tone to support and negative tone to opposition. It is not a product comparison.
  4. Reproducibility. Every number on this page traces to the run summary, raw outputs, and scoring notebook. The cached verification path re-scores existing outputs without making live model calls.
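As a rough picture of what re-scoring cached outputs involves, the sketch below reads gold decisions and stored model outputs, then recomputes accuracy without calling any model. The JSON Lines layout, the `id` and `stance` field names, and the `rescore` function are all hypothetical; the benchmark pack's real file formats are not described on this page.

```python
import json


def load_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def rescore(gold_rows, output_rows):
    """Recompute accuracy from cached outputs; no live model calls are made."""
    gold = {r["id"]: r["stance"] for r in gold_rows}
    preds = {r["id"]: r["stance"] for r in output_rows}
    missing = sorted(set(gold) - set(preds))
    if missing:
        raise ValueError(f"no cached output for ids: {missing}")
    correct = sum(preds[i] == label for i, label in gold.items())
    return correct / len(gold)


# Inline stand-ins for files on disk; contents are made up for this sketch.
GOLD = """
{"id": "t001", "stance": "support"}
{"id": "t002", "stance": "opposition"}
"""
OUTPUTS = """
{"id": "t001", "stance": "support"}
{"id": "t002", "stance": "opposition"}
"""

print("re-scored accuracy:", rescore(load_jsonl(GOLD), load_jsonl(OUTPUTS)))
```

Failing loudly on a missing output, rather than silently skipping it, keeps the denominator fixed at the size of the gold set.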

Check our work

See your own consultation benchmarked this way

Bring one real, de-identified feedback export to a 30-minute walkthrough - or request the benchmark pack and have your technical team verify every number on this page.
