Aspect-based sentiment (ABSA) Communiti Intelligence · v1.0 · Last run

Issue-level sentiment analysis benchmark

How Communiti's issue-level sentiment analysis scores against Amazon Comprehend Targeted Sentiment - the engine behind most engagement platforms' sentiment - and against AI assistants, on the messiest consultation feedback we could write.

Read the launch story behind this benchmark: Hear every issue in every voice

Headline results

92%
of issues found, even in deliberately messy feedback
99%
sentiment accuracy on the issues it finds
0%
of mixed responses split into their real issues
10
community languages benchmarked natively, including Te Reo Māori and Samoan

Head to head № 1

The industry-standard tool misses what residents do not name

Most platforms and many in-house projects run on Amazon Comprehend. It tags things it can recognise. But residents rarely name things. They write, "Twenty minutes on hold just to ask why my rates notice changed." There is no customer service in that sentence, only the experience of it.

Submission_041 - Draft Transport and Open Space Plan.pdf: one of the 60 long submissions from the benchmark, read by both tools below

What Communiti reads

Issue-level analysis

I want to start by saying the new playground at Riverside Park has been wonderful for our family - my kids ask to go every weekend. But getting there is another story. We waited forty minutes for a bus the timetable said ran every fifteen, and when one finally arrived it drove straight past because it was already full.

My mother is in her late seventies and can't manage the walk from the closest stop - the footpath on Hartley Street has been lifted by tree roots for over a year. She has stopped coming with us. And the water pools right across the path near the underpass every time it rains - you need gumboots just to get through.

On the plan itself, I support the new cycleway and would use it to ride to work, but nothing in the document mentions lighting - walking back from the 6pm session it was pitch black by the carpark. Twenty minutes on hold when I rang to ask about session times didn't help either.

8 of 8 issues found

What Amazon Comprehend sees

Targeted sentiment, best setting

2 of 8 found, 6 missed

Comprehend's export for this submission contains 2 issues. The other 6 - the ones the resident implied but never named - never arrive in any report. The charts below score the same gap across all 60 submissions.

Issue detection score - long written submissions

60 multi-issue submissions, 700 issues to find, same scoring for both

Comprehend left 233 significant issues unreported across those 60 submissions. Communiti left none.

Issue detection score - issues implied, never named

96 responses like "water comes over the kerb every time it rains"

This is most of what consultation feedback looks like, and it is structurally invisible to entity-tagging tools.

Issue detection as feedback gets messier

Aspect F1 by difficulty tier on the 217-response stress set - typos, sarcasm, voice transcripts, low-literacy writing, ten languages

Communiti's accuracy holds as feedback gets messier. Comprehend's best configuration drops below half on the feedback consultations actually receive - and reads none of the multilingual set.

Head to head № 2

Feedback in 50+ languages - read, not skipped

Residents write the way they speak - sometimes switching language mid-sentence: "公交车总是晚点, but the new library hours are very helpful." Communiti reads the complaint about buses and the praise for the library. It supports more than 50 languages, and the ten below are the ones we benchmarked natively, end to end. Comprehend's issue-level analysis supports English only.

  • Mandarin 中文
  • Arabic العربية
  • Vietnamese Tiếng Việt
  • Cantonese 廣東話
  • Punjabi ਪੰਜਾਬੀ
  • Greek Ελληνικά
  • Italian Italiano
  • Hindi हिन्दी
  • Te Reo Māori
  • Samoan Gagana Sāmoa

These ten are the natively benchmarked set - 40+ more are supported via automatic translation. Industry-standard issue-level analysis: 0 of 10 supported.

Head to head № 3

Could we just paste it into an AI assistant?

Fair question. Tools like Microsoft Copilot and ChatGPT read well, and we tested that workflow properly - including with the same AI model we use ourselves. The difference is what arrives at the other end: a consultation dashboard needs every issue filed under your topics, with the resident's words attached, for every row in the spreadsheet, every time.

Share of issues that arrive in your dashboard, correctly filed

60 long submissions, found and filed under the right reporting topic

The assistant finds issues, then names its themes differently on every row - so almost nothing lands in your reporting structure without someone re-filing it by hand. That re-filing is the job Communiti does.

The honest finding: the model reads well - the product is the gap

One response at a time with a careful prompt, on the same AI model Communiti uses, raw issue extraction is comparable. What never arrives is everything a consultation dashboard needs around the reading.

| Metric | Communiti | AI assistant, one response at a time, same model |
| --- | --- | --- |
| Raw issue extraction (aspect F1). Modern models are good at reading - we report this gap honestly | 0.90 - 1.00 | 0.84 - 1.00 |
| Issues landing in your dashboard taxonomy. The assistant invents its own theme names on every row | 93 - 95% | 0 - 4% |
| Topic grouping consistency (ARI) | 0.81 - 0.92 | 0.54 - 0.64 |
| Sentiment accuracy, long submissions | 100% | 96% |
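Topic grouping consistency is reported as ARI, the adjusted Rand index. It compares two groupings of the same items and corrects for chance agreement: 1.0 means the groupings are identical, and values near 0 mean agreement no better than random. The sketch below implements the standard formula. The two runs and their topic names are invented for illustration and are not benchmark data, and the benchmark's own scoring code may differ.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two groupings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both groupings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Count item pairs that share a group in both groupings, in A, and in B.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs  # agreement expected by chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one group holds everything in both runs
    return (pairs_both - expected) / (max_index - expected)


# Hypothetical example: the same six issues grouped on three separate runs.
run_1 = ["buses", "buses", "footpaths", "footpaths", "lighting", "lighting"]
run_2 = ["transit", "transit", "paths", "paths", "paths", "safety"]
renamed = ["A", "A", "B", "B", "C", "C"]

same = adjusted_rand_index(run_1, run_1)
relabelled = adjusted_rand_index(run_1, renamed)
drifted = adjusted_rand_index(run_1, run_2)
```

ARI ignores what the groups are called, so `relabelled` scores the same as `same`. It measures only whether the same issues end up together, which is why it is reported separately from the share of issues that land under the right topic name.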

Schema enforcement, retries, span grounding, language routing, and versioned taxonomies are the product. Run the per-entry workflow with all of that and you haven't avoided building Communiti - you've built it.
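To make "schema enforcement" and "retries" concrete, here is a minimal sketch of the idea: reject any extracted issue whose topic is not in the project's taxonomy, and ask again. It is not Communiti's implementation. `Taxonomy`, `validate_issue`, `file_with_retries`, the sentiment labels, and the stand-in extractor are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Assumed label set for illustration only.
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral", "mixed"}


@dataclass(frozen=True)
class Taxonomy:
    version: str          # taxonomies are versioned so reports stay comparable
    topics: frozenset     # the reporting topics an issue may be filed under


def validate_issue(issue, taxonomy):
    """Return a list of problems; an empty list means the issue can be filed."""
    problems = []
    if issue.get("topic") not in taxonomy.topics:
        problems.append(f"unknown topic: {issue.get('topic')!r}")
    if issue.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append(f"unknown sentiment: {issue.get('sentiment')!r}")
    return problems


def file_with_retries(extract, response_text, taxonomy, max_attempts=3):
    """Call `extract` until every issue validates, or fail after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        issues = extract(response_text, taxonomy, attempt)
        if all(not validate_issue(issue, taxonomy) for issue in issues):
            return issues
    raise ValueError("no schema-valid output after retries")


def fake_extract(text, taxonomy, attempt):
    # Stand-in for a model call: invents its own theme name on the first attempt.
    topic = "Bus problems" if attempt == 1 else "Public transport"
    return [{"topic": topic, "sentiment": "negative"}]


taxonomy = Taxonomy(
    version="v1",
    topics=frozenset({"Public transport", "Footpaths", "Lighting"}),
)
filed = file_with_retries(fake_extract, "We waited forty minutes for a bus.", taxonomy)
```

In this sketch the first attempt is discarded because "Bus problems" is not a reporting topic, and the second attempt is filed. Without the validation step, the invented theme name would have gone straight into the spreadsheet.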

At a glance

Three ways to analyse the same feedback

Comparison of manual review, standard sentiment platforms, and Communiti Intelligence for issue-level feedback analysis.
| Capability | Reading it yourself (analyst + spreadsheet) | Existing platforms (standard sentiment tooling) | Communiti Intelligence (issue-level analysis) |
| --- | --- | --- | --- |
| What a mixed response becomes | Complete. Both points noted - when there's time to read every row closely | Collapsed. One label for the whole response: "MIXED". Trains and buses both vanish | Complete. Trains · positive and Bus delays · negative, with the resident's words highlighted |
| Issues residents imply but never name | Caught. A good reader sees "twenty minutes on hold" for what it is | Missed. Found about 1 in 100 in our testing - there's no named "thing" to tag | Caught. Found 9 in 10, scored and filed like any other issue |
| Feedback in community languages | Sometimes. Only the languages your team happens to read - unless you pay for a translation service | English only. Issue-level analysis supports no other languages | 50+ languages. 10 benchmarked natively, including mid-sentence switching - nothing skipped |
| Filing issues under your report topics | By hand. Every issue copied and categorised manually - and each reader files differently | Not available. No concept of your project's topics | Automatic. Filed into your topics as it reads - rename or merge topics any time |
| Evidence behind every finding | By hand. Quotes pasted into the report, hours of copy-and-paste | Not linked. A score, and at best keyword tags - the words aren't tied to the sentiment or your topics | Built in. Every issue links to the exact words in the original response |
| 5,000 responses | Weeks. And consistency drifts as readers tire | Hours. Fast - but you get one shallow score per response | Under 3 hours. Your best analyst's read, at machine speed, the same rules on row 1 and row 5,000 |

For your technical reviewers

The scores behind the headlines

Headline percentages are rounded for readability. These are the underlying precision, recall, and F1 figures - the same ones in the published results files.

The full scorecard

Production configuration on the 217-response stress set, scored with identical matching rules. Every figure traces to a published results file in the benchmark pack.

| Metric | Communiti | Amazon Comprehend Targeted Sentiment, best setting |
| --- | --- | --- |
| Issue detection (aspect F1), full stress set. Comprehend Targeted Sentiment does not support the multilingual portion of the corpus | 92.4% | English only |
| Issue detection (aspect F1), English-only subset. Same 167 responses scored for both tools | 93.3% | 49.5% |
| Precision / recall (English subset) | 89.4% / 97.5% | 55.3% / 44.9% |
| Sentiment accuracy on detected issues | 99.0% | 86.2% |
| Evidence span grounding. Every Communiti issue links to highlightable words in the original response | 100% | Not linked |
| Hallucinated evidence | 0 | n/a |
| Cost per 1,000 issues analysed. Comprehend is cheaper per call - the comparison is capability per dollar | ~US$3.08 | ~US$0.32 |
| Mean latency per response | ~2.0s | <1s |
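For reviewers who want to sanity-check the scorecard, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded English-subset figures above. It assumes F1 was derived from corpus-level precision and recall; if the benchmark averages per response, the results files are the authority.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision / recall from the English-subset row of the scorecard.
communiti_f1 = f1_score(0.894, 0.975)
comprehend_f1 = f1_score(0.553, 0.449)

print(f"Communiti:  {communiti_f1:.1%}")
print(f"Comprehend: {comprehend_f1:.1%}")
```

This gives about 93.3% for Communiti, matching the published figure, and about 49.6% for Comprehend against the published 49.5%. A last-digit difference like that is consistent with the precision and recall inputs themselves being rounded.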

Run on Amazon Bedrock in ap-southeast-2 (Australia-geographic inference). Zero API errors across the full evaluation.

Methodology

How we measured

Test corpus

All benchmarks use synthetic feedback written for testing - no resident data - spanning 473 responses across two test sets. The full evaluation suite re-runs end to end.

Stress set: 217 responses. Typos, sarcasm, rambling voice transcripts, low-literacy writing, and ten languages
Long submissions: 256 responses. Long submissions, implied issues, and mixed-language responses

Processed in Australia

Analysis runs on AWS in Sydney and Melbourne using Australia-geographic AI infrastructure. Feedback is not processed offshore.

Never used to train AI

Your community's feedback is not used to train any AI model, and the model provider has no access to it - contractually guaranteed by AWS.

Evidence on request

The complete benchmark pack - test data, scoring code, raw outputs, and methodology notes - is available for technical review.

Every claim traceable

Each issue links to the exact words in the original response. Nothing is summarised beyond what a reviewer can verify in one click.
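One simple way to check a claim like this is to require that every evidence quote appears verbatim in the original response, and to record where. The sketch below assumes exact substring matching, which is an assumption for illustration; a production system might also normalise whitespace or Unicode. `ground_evidence` is a hypothetical helper, not part of any published API.

```python
def ground_evidence(response_text: str, quote: str):
    """Return (start, end) offsets of `quote` in the response, or None if absent."""
    if not quote:
        return None  # an empty quote is not evidence
    start = response_text.find(quote)
    if start == -1:
        return None  # the quote is not in the response: treat as ungrounded
    return start, start + len(quote)


response = (
    "On the plan itself, I support the new cycleway and would use it to ride "
    "to work, but nothing in the document mentions lighting."
)

grounded = ground_evidence(response, "nothing in the document mentions lighting")
paraphrased = ground_evidence(response, "the plan ignores lighting")
```

Here `grounded` holds offsets a reviewer can highlight in the original text, while `paraphrased` is `None` because those words never appear in the response. Counting the `None` results across a run is one way to arrive at a "hallucinated evidence" figure.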

The fine print we think you should read

  1. Test data. All benchmarks use synthetic feedback written for testing, with no resident data, spanning 473 responses across two test sets: a 217-response stress set and a 256-response set of long submissions, implied issues, and mixed-language responses. Headline accuracy figures of 92% and 99% come from the stress set.
  2. Comprehend comparison. Amazon Comprehend Targeted Sentiment was scored on its best-performing configuration, only on English content, with identical matching rules to ours. The comparison is about what each tool can see.
  3. AI-assistant comparison. The per-response test used a well-crafted prompt on the same AI model Communiti uses, plus a stronger model, so the gap shown is workflow and product, not model quality. Correctly filed means the issue was found and assigned to the project's reporting topic.
  4. Cost fairness. Comprehend is cheaper per call (roughly US$0.30-1.45 per 1,000 parts against our ~US$3). The comparison on this page is capability per dollar: what arrives in your report, not what each API call costs.
  5. Reproducibility. Every number on this page traces to a published results file, and the full evaluation suite re-runs end to end - including a non-live verification path that re-scores cached outputs without calling any model. Ask us for the benchmark pack.

Check our work

See your own consultation benchmarked this way

Bring one real, de-identified feedback export to a 30-minute walkthrough - or request the benchmark pack and have your technical team verify every number on this page.
