Issue-level sentiment analysis benchmark
How Communiti's issue-level sentiment analysis scores against Amazon Comprehend Targeted Sentiment - the engine behind most engagement platforms' sentiment - and against AI assistants, on the messiest consultation feedback we could write.
Read the launch story behind this benchmark: Hear every issue in every voice
Headline results
- 92%
- of issues found, even in deliberately messy feedback
- 99%
- sentiment accuracy on the issues it finds
- 0%
- of mixed responses split into their real issues
- 10
- community languages benchmarked natively, including Te Reo Māori and Samoan
Head to head № 1
The industry-standard tool misses what residents do not name
Most platforms and many in-house projects run on Amazon Comprehend. It tags things it can recognise. But residents rarely name things. They write, "Twenty minutes on hold just to ask why my rates notice changed." There is no customer service in that sentence, only the experience of it.
What Communiti reads
Issue-level analysis
I want to start by saying the new playground at Riverside Park has been wonderful for our family - my kids ask to go every weekend. But getting there is another story. We waited forty minutes for a bus the timetable said ran every fifteen, and when one finally arrived it drove straight past because it was already full.
My mother is in her late seventies and can't manage the walk from the closest stop - the footpath on Hartley Street has been lifted by tree roots for over a year. She has stopped coming with us. And the water pools right across the path near the underpass every time it rains - you need gumboots just to get through.
On the plan itself, I support the new cycleway and would use it to ride to work, but nothing in the document mentions lighting - walking back from the 6pm session it was pitch black by the carpark. Twenty minutes on hold when I rang to ask about session times didn't help either.
8 of 8 issues found
What Amazon Comprehend sees
Targeted sentiment, best setting
(The same submission as above.)
2 of 8 found, 6 missed
Comprehend's export for this submission contains 2 issues. The other 6 - the ones the resident implied but never named - never arrive in any report. The charts below score the same gap across all 60 submissions.
Issue detection score - long written submissions
60 multi-issue submissions, 700 issues to find, same scoring for both
Comprehend left 233 significant issues unreported across those 60 submissions. Communiti left none.
Issue detection score - issues implied, never named
96 responses like "water comes over the kerb every time it rains"
This is most of what consultation feedback looks like, and it is structurally invisible to entity-tagging tools.
Issue detection as feedback gets messier
Aspect F1 by difficulty tier on the 217-response stress set - typos, sarcasm, voice transcripts, low-literacy writing, ten languages
Communiti's accuracy holds as feedback gets messier. Comprehend's best configuration drops below half on the feedback consultations actually receive - and reads none of the multilingual set.
Head to head № 2
Feedback in 50+ languages - read, not skipped
Residents write the way they speak - sometimes switching language mid-sentence: "公交车总是晚点, but the new library hours are very helpful." Communiti reads the complaint about buses and the praise for the library. It supports more than 50 languages, and the ten below are the ones we benchmarked natively, end to end. Comprehend's issue-level analysis supports English only.
- Mandarin 中文
- Arabic العربية
- Vietnamese Tiếng Việt
- Cantonese 廣東話
- Punjabi ਪੰਜਾਬੀ
- Greek Ελληνικά
- Italian Italiano
- Hindi हिन्दी
- Te Reo Māori
- Samoan Gagana Sāmoa
These ten are the natively benchmarked set - 40+ more are supported via automatic translation. Industry-standard issue-level analysis: 0 of 10 supported.
Head to head № 3
Could we just paste it into an AI assistant?
Fair question. Tools like Microsoft Copilot and ChatGPT read well, and we tested that workflow properly - including with the same AI model we use ourselves. The difference is what arrives at the other end: a consultation dashboard needs every issue filed under your topics, with the resident's words attached, for every row in the spreadsheet, every time.
Share of issues that arrive in your dashboard, correctly filed
60 long submissions, found and filed under the right reporting topic
The assistant finds issues, then names its themes differently on every row - so almost nothing lands in your reporting structure without someone re-filing it by hand. That re-filing is the job Communiti does.
The honest finding: the model reads well - the product is the gap
One response at a time with a careful prompt, on the same AI model Communiti uses, raw issue extraction is comparable. What never arrives is everything a consultation dashboard needs around the reading.
| Metric | Communiti | AI assistant, one response at a time, same model |
|---|---|---|
| Raw issue extraction (aspect F1). Modern models are good at reading - we report this gap honestly | 0.90 - 1.00 | 0.84 - 1.00 |
| Issues landing in your dashboard taxonomy. The assistant invents its own theme names on every row | 93 - 95% | 0 - 4% |
| Topic grouping consistency (ARI) | 0.81 - 0.92 | 0.54 - 0.64 |
| Sentiment accuracy, long submissions | 100% | 96% |
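For reviewers unfamiliar with ARI (the adjusted Rand index), the sketch below shows what the topic grouping row measures: whether two groupings put the same issues together, regardless of what the groups are called. The topic names and the two example groupings are hypothetical, and this is a plain-Python illustration rather than the scoring code in the benchmark pack.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two groupings of the same items, corrected for chance."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both groupings must label the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items that share a group in both groupings, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both groupings are trivially identical
    return (together_both - expected) / (maximum - expected)


# Hypothetical reference grouping of six issues under three report topics.
reference = ["transport", "transport", "footpaths", "footpaths", "lighting", "lighting"]
# Same grouping under different names: ARI ignores the names themselves.
renamed = ["A", "A", "B", "B", "C", "C"]
# Theme names that drift from row to row split groups that belong together.
drifting = ["buses", "late buses", "paths", "paths", "lights", "dark carpark"]

print(adjusted_rand_index(reference, renamed))
print(round(adjusted_rand_index(reference, drifting), 3))
```

A grouping that only renames the topics scores 1.0, while one that invents a new theme name on most rows scores well below it, which is the behaviour the table row summarises.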
Schema enforcement, retries, span grounding, language routing, and versioned taxonomies are the product. Run the per-entry workflow with all of that and you haven't avoided building Communiti - you've built it.
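To make those terms concrete, here is a minimal sketch of schema enforcement, span grounding, and retries around a model call. It is an illustration under stated assumptions, not Communiti's implementation: the `TAXONOMY` topics, the `Issue` record, and the `extract` callable (a stand-in for whatever model call you use) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical report topics; a real project would load its own taxonomy.
TAXONOMY = {"Public transport", "Footpaths", "Lighting", "Customer service"}
SENTIMENTS = {"positive", "negative", "neutral"}


@dataclass(frozen=True)
class Issue:
    topic: str       # must be one of the project's report topics
    sentiment: str   # must be one of SENTIMENTS
    evidence: str    # must be quoted verbatim from the response


def validate(issue: Issue, response_text: str) -> List[str]:
    """Return a list of problems; an empty list means the issue passes."""
    problems = []
    if issue.topic not in TAXONOMY:
        problems.append(f"unknown topic: {issue.topic!r}")
    if issue.sentiment not in SENTIMENTS:
        problems.append(f"unknown sentiment: {issue.sentiment!r}")
    # Span grounding: the evidence must appear word for word in the response.
    if not issue.evidence or issue.evidence not in response_text:
        problems.append("evidence is not a verbatim span of the response")
    return problems


def extract_with_retries(
    response_text: str,
    extract: Callable[[str, int], List[Issue]],
    max_attempts: int = 3,
) -> List[Issue]:
    """Call the (hypothetical) extractor until every issue validates."""
    for attempt in range(1, max_attempts + 1):
        issues = extract(response_text, attempt)
        if all(not validate(issue, response_text) for issue in issues):
            return issues
    raise ValueError(f"no valid extraction after {max_attempts} attempts")


# Stand-in extractor: invents its own theme name first, then uses the taxonomy.
def fake_extract(text: str, attempt: int) -> List[Issue]:
    topic = "Phone queues" if attempt == 1 else "Customer service"
    return [Issue(topic, "negative", "Twenty minutes on hold")]


text = "Twenty minutes on hold just to ask why my rates notice changed."
print(extract_with_retries(text, fake_extract))
```

The first attempt is rejected because "Phone queues" is not a report topic; the second passes, so the issue arrives already filed and with its evidence tied to the resident's words.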
At a glance
Three ways to analyse the same feedback
| Capability | Reading it yourself (analyst + spreadsheet) | Existing platforms (standard sentiment tooling) | Communiti Intelligence (issue-level analysis) |
|---|---|---|---|
| What a mixed response becomes | Complete. Both points noted - when there's time to read every row closely | Collapsed. One label for the whole response: "MIXED". Trains and buses both vanish | Complete. Trains · positive and Bus delays · negative, with the resident's words highlighted |
| Issues residents imply but never name | Caught. A good reader sees "twenty minutes on hold" for what it is | Missed. Found about 1 in 100 in our testing - there's no named "thing" to tag | Caught. Found 9 in 10, scored and filed like any other issue |
| Feedback in community languages | Sometimes. Only the languages your team happens to read - unless you pay for a translation service | English only. Issue-level analysis supports no other languages | 50+ languages. 10 benchmarked natively, including mid-sentence switching - nothing skipped |
| Filing issues under your report topics | By hand. Every issue copied and categorised manually - and each reader files differently | Not available. No concept of your project's topics | Automatic. Filed into your topics as it reads - rename or merge topics any time |
| Evidence behind every finding | By hand. Quotes pasted into the report, hours of copy-and-paste | Not linked. A score, and at best keyword tags - the words aren't tied to the sentiment or your topics | Built in. Every issue links to the exact words in the original response |
| 5,000 responses | Weeks. And consistency drifts as readers tire | Hours. Fast - but you get one shallow score per response | Under 3 hours. Your best analyst's read, at machine speed, the same rules on row 1 and row 5,000 |
For your technical reviewers
The scores behind the headlines
Headline percentages are rounded for readability. These are the underlying precision, recall, and F1 figures - the same ones in the published results files.
The full scorecard
Production configuration on the 217-response stress set, scored with identical matching rules. Every figure traces to a published results file in the benchmark pack.
| Metric | Communiti | Amazon Comprehend Targeted Sentiment, best setting |
|---|---|---|
| Issue detection (aspect F1), full stress set. Comprehend Targeted Sentiment does not support the multilingual portion of the corpus | 92.4% | English only |
| Issue detection (aspect F1), English-only subset. Same 167 responses scored for both tools | 93.3% | 49.5% |
| Precision / recall (English subset) | 89.4% / 97.5% | 55.3% / 44.9% |
| Sentiment accuracy on detected issues | 99.0% | 86.2% |
| Evidence span grounding. Every Communiti issue links to highlightable words in the original response | 100% | Not linked |
| Hallucinated evidence | 0 | n/a |
| Cost per 1,000 issues analysed. Comprehend is cheaper per call - the comparison is capability per dollar | ~US$3.08 | ~US$0.32 |
| Mean latency per response | ~2.0s | <1s |
Run on AWS Bedrock in ap-southeast-2 (Australia-geographic inference). Zero API errors across the full evaluation.
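Reviewers can sanity-check the English-subset rows against each other, since F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded precision and recall figures in the scorecard; the function name is ours, and the inputs are the published one-decimal values, so the last digit can differ slightly from a figure computed on unrounded counts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision / recall on the English-only subset, as published in the scorecard.
communiti_f1 = f1_score(0.894, 0.975)
comprehend_f1 = f1_score(0.553, 0.449)

print(f"Communiti:  {communiti_f1:.1%}")   # 93.3%
print(f"Comprehend: {comprehend_f1:.1%}")  # 49.6%
```

The Communiti figure reproduces the published 93.3%. The Comprehend figure comes out at 49.6% against the published 49.5%, a difference that is within what rounding precision and recall to one decimal place allows.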
Methodology
How we measured
Test corpus
All benchmarks use synthetic feedback written for testing - no resident data - spanning 473 responses across two test sets. The full evaluation suite re-runs end to end.
- Stress set (217 responses): typos, sarcasm, rambling voice transcripts, low-literacy writing, and ten languages
- Long submissions (256 responses): long submissions, implied issues, and mixed-language responses
Processed in Australia
Analysis runs on AWS in Sydney and Melbourne using Australia-geographic AI infrastructure. Feedback is not processed offshore.
Never used to train AI
Your community's feedback is not used to train any AI model, and the model provider has no access to it - contractually guaranteed by AWS.
Evidence on request
The complete benchmark pack - test data, scoring code, raw outputs, and methodology notes - is available for technical review.
Every claim traceable
Each issue links to the exact words in the original response. Nothing is summarised beyond what a reviewer can verify in one click.
The fine print we think you should read
- Test data. All benchmarks use synthetic feedback written for testing, with no resident data, spanning 473 responses across two test sets: a 217-response stress set and a 256-response set of long submissions, implied issues, and mixed-language responses. Headline accuracy figures of 92% and 99% come from the stress set.
- Comprehend comparison. Amazon Comprehend Targeted Sentiment was scored on its best-performing configuration, only on English content, with identical matching rules to ours. The comparison is about what each tool can see.
- AI-assistant comparison. The per-response test used a well-crafted prompt on the same AI model Communiti uses, plus a stronger model, so the gap shown is workflow and product, not model quality. Correctly filed means the issue was found and assigned to the project's reporting topic.
- Cost fairness. Comprehend is cheaper per call (roughly US$0.30-1.45 per 1,000 parts against our ~US$3). The comparison on this page is capability per dollar: what arrives in your report, not what each API call costs.
- Reproducibility. Every number on this page traces to a published results file, and the full evaluation suite re-runs end to end - including a non-live verification path that re-scores cached outputs without calling any model. Ask us for the benchmark pack.
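As a rough picture of what a non-live verification path can look like, the sketch below re-scores cached outputs against gold labels without calling any model. Everything specific here is an assumption for illustration: the JSON Lines layout, the `response_id` and `topic` field names, and the matching rule (an issue counts as found when the same response and topic appear in both files). The benchmark pack defines the actual file formats and matching rules.

```python
import json


def load_pairs(jsonl_text: str) -> set:
    """Read (response_id, topic) pairs from JSON Lines text."""
    pairs = set()
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            pairs.add((record["response_id"], record["topic"]))
    return pairs


def score(gold: set, predicted: set) -> dict:
    """Precision, recall, and F1 over exact (response_id, topic) matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold labels and cached model outputs, inlined for the example.
GOLD = """
{"response_id": "r1", "topic": "Public transport"}
{"response_id": "r1", "topic": "Footpaths"}
{"response_id": "r1", "topic": "Lighting"}
{"response_id": "r2", "topic": "Customer service"}
"""
CACHED = """
{"response_id": "r1", "topic": "Public transport"}
{"response_id": "r1", "topic": "Lighting"}
{"response_id": "r2", "topic": "Customer service"}
{"response_id": "r2", "topic": "Parks"}
"""

results = score(load_pairs(GOLD), load_pairs(CACHED))
print(results)
```

Because the scorer reads only stored files, anyone holding the cached outputs can recompute the figures offline and compare them with the published results.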
Check our work
See your own consultation benchmarked this way
Bring one real, de-identified feedback export to a 30-minute walkthrough - or request the benchmark pack and have your technical team verify every number on this page.
