Issue-Level Sentiment Benchmark vs Amazon Comprehend

92%: of issues found, even in deliberately messy feedback
99%: sentiment accuracy on the issues it finds
0%: of mixed responses split into their real issues
10: community languages benchmarked natively, including Te Reo Māori and Samoan

Head to head № 1

The industry-standard tool misses what residents do not name

Most platforms and many in-house projects run on Amazon Comprehend. It tags things it can recognise. But residents rarely name things. They write, "Twenty minutes on hold just to ask why my rates notice changed." There is no customer service in that sentence, only the experience of it.

Submission_041 - Draft Transport and Open Space Plan.pdf: one of the 60 long submissions from the benchmark, read by both tools below

What Communiti reads

Issue-level analysis

I want to start by saying the new playground at Riverside Park has been wonderful for our family - my kids ask to go every weekend. But getting there is another story. We waited forty minutes for a bus the timetable said ran every fifteen, and when one finally arrived it drove straight past because it was already full.

My mother is in her late seventies and can't manage the walk from the closest stop - the footpath on Hartley Street has been lifted by tree roots for over a year. She has stopped coming with us. And the water pools right across the path near the underpass every time it rains - you need gumboots just to get through.

On the plan itself, I support the new cycleway and would use it to ride to work, but nothing in the document mentions lighting - walking back from the 6pm session it was pitch black by the carpark. Twenty minutes on hold when I rang to ask about session times didn't help either.

8 of 8 issues found

What Amazon Comprehend sees

Targeted sentiment, best setting

2 of 8 found, 6 missed

Comprehend's export for this submission contains 2 issues. The other 6 - the ones the resident implied but never named - never arrive in any report. The charts below score the same gap across all 60 submissions.

Issue detection score - long written submissions

60 multi-issue submissions, 700 issues to find, same scoring for both

Communiti 100%

Amazon Comprehend, best setting 33%

Comprehend left 233 significant issues unreported across those 60 submissions. Communiti left none.

Issue detection score - issues implied, never named

96 responses like "water comes over the kerb every time it rains"

Communiti 90%

Amazon Comprehend, best setting 1%

This is most of what consultation feedback looks like, and it is structurally invisible to entity-tagging tools.

Issue detection as feedback gets messier

Aspect F1 by difficulty tier on the 217-response stress set - typos, sarcasm, voice transcripts, low-literacy writing, ten languages

Difficulty tier	Communiti	Amazon Comprehend, best setting
Easy: clean, single-issue responses	96%	62%
Medium: sarcasm, typos, rambling, emoji	93%	42%
Hard: voice transcripts, low-literacy, buried issues	91%	51%
Multilingual: ten community languages, code-switching	89%	Not supported

Communiti's accuracy holds as feedback gets messier. Comprehend's best configuration drops below half on the feedback consultations actually receive - and reads none of the multilingual set.

Head to head № 2

Feedback in 50+ languages - read, not skipped

Residents write the way they speak - sometimes switching language mid-sentence: "公交车总是晚点, but the new library hours are very helpful." Communiti reads the complaint about buses and the praise for the library. It supports more than 50 languages, and the ten below are the ones we benchmarked natively, end to end. Comprehend's issue-level analysis supports English only.

Mandarin 中文
Arabic العربية
Vietnamese Tiếng Việt
Cantonese 廣東話
Punjabi ਪੰਜਾਬੀ
Greek Ελληνικά
Italian Italiano
Hindi हिन्दी
Te Reo Māori
Samoan Gagana Sāmoa

These ten are the natively benchmarked set - 40+ more are supported via automatic translation. Industry-standard issue-level analysis: 0 of 10 supported.

Head to head № 3

Could we just paste it into an AI assistant?

Fair question. Tools like Microsoft Copilot and ChatGPT read well, and we tested that workflow properly - including with the same AI model we use ourselves. The difference is what arrives at the other end: a consultation dashboard needs every issue filed under your topics, with the resident's words attached, for every row in the spreadsheet, every time.

Share of issues that arrive in your dashboard, correctly filed

60 long submissions, found and filed under the right reporting topic

Communiti 93%

AI assistant, one response at a time 5%

AI assistant, whole spreadsheet pasted in at once 0%

The assistant finds issues, then names its themes differently on every row - so almost nothing lands in your reporting structure without someone re-filing it by hand. That re-filing is the job Communiti does.
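A minimal sketch of how a "found and filed" share could be computed, assuming issues have already been matched to an answer key. The issue ids, topic names, and the `filed_share` helper are hypothetical illustrations, not the benchmark's scoring code.

```python
def filed_share(gold_topics, filed_topics):
    """Share of gold issues that were found and filed under the right topic.

    gold_topics:  {issue_id: reporting topic} from the answer key.
    filed_topics: {issue_id: topic the tool filed it under}; an issue the
                  tool missed is simply absent from this mapping.
    """
    if not gold_topics:
        raise ValueError("need at least one gold issue")
    correct = sum(
        1
        for issue_id, topic in gold_topics.items()
        if filed_topics.get(issue_id) == topic
    )
    return correct / len(gold_topics)


# Hypothetical answer key for one submission.
gold = {
    "bus-wait": "Public transport",
    "footpath": "Footpaths",
    "lighting": "Safety and lighting",
    "hold-time": "Customer service",
}

# A tool that files into the project's topics but misses one issue.
taxonomy_tool = {
    "bus-wait": "Public transport",
    "footpath": "Footpaths",
    "lighting": "Safety and lighting",
}

# An assistant that finds every issue but mostly invents its own theme names.
free_text_assistant = {
    "bus-wait": "Unreliable buses",
    "footpath": "Trip hazards",
    "lighting": "Dark carpark",
    "hold-time": "Customer service",
}

print(filed_share(gold, taxonomy_tool))        # 0.75
print(filed_share(gold, free_text_assistant))  # 0.25
```

In this toy case the assistant finds more issues than the first tool yet scores lower, because a found issue only counts once it sits under the project's own topic name.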

The honest finding: the model reads well - the product is the gap

One response at a time with a careful prompt, on the same AI model Communiti uses, raw issue extraction is comparable. What never arrives is everything a consultation dashboard needs around the reading.

Metric	Communiti	AI assistant, one response at a time, same model
Raw issue extraction (aspect F1). Modern models are good at reading - we report this gap honestly	0.90 - 1.00	0.84 - 1.00
Issues landing in your dashboard taxonomy. The assistant invents its own theme names on every row	93 - 95%	0 - 4%
Topic grouping consistency (ARI)	0.81 - 0.92	0.54 - 0.64
Sentiment accuracy, long submissions	100%	96%
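For readers unfamiliar with ARI (the adjusted Rand index), here is a small self-contained sketch of the standard formula. The page does not say which two groupings are compared, so the labels below are hypothetical. The point is that ARI rewards grouping the same issues together regardless of what the groups are called, and scores chance-level agreement at about 0.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two groupings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both groupings must label the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # zero or one item: the groupings trivially agree

    # Count item pairs that share a group in both, in A only, and in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = in_a * in_b / comb(n, 2)  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. everything in one group
    return (both - expected) / (maximum - expected)


# Same grouping under different names: perfect consistency.
run_1 = ["Transport", "Transport", "Parks", "Parks"]
run_2 = ["Buses", "Buses", "Green space", "Green space"]
print(adjusted_rand_index(run_1, run_2))  # 1.0

# Groupings that cut across each other score below zero.
run_3 = ["A", "B", "A", "B"]
print(round(adjusted_rand_index(run_1, run_3), 3))  # -0.5
```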

Schema enforcement, retries, span grounding, language routing, and versioned taxonomies are the product. Run the per-entry workflow with all of that and you haven't avoided building Communiti - you've built it.
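To make three of those pieces concrete, here is a rough sketch of schema enforcement, span grounding, and retries around a model call. It is an illustration under stated assumptions, not Communiti's implementation: the `Issue` shape, the allowed sentiment labels, and the `extract` callable standing in for the model are all hypothetical, and language routing and taxonomy versioning are left out.

```python
from dataclasses import dataclass

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral", "mixed"}  # assumed labels


@dataclass(frozen=True)
class Issue:
    topic: str      # must be one of the project's reporting topics
    sentiment: str  # must be one of ALLOWED_SENTIMENTS
    evidence: str   # must be copied verbatim from the response


def validation_errors(issue, response_text, taxonomy):
    """Return the reasons an extracted issue cannot go on the dashboard."""
    errors = []
    if issue.topic not in taxonomy:
        errors.append(f"topic {issue.topic!r} is not in the taxonomy")
    if issue.sentiment not in ALLOWED_SENTIMENTS:
        errors.append(f"sentiment {issue.sentiment!r} is not allowed")
    if not issue.evidence or issue.evidence not in response_text:
        errors.append(f"evidence {issue.evidence!r} is not in the response")
    return errors


def extract_with_retries(extract, response_text, taxonomy, max_attempts=3):
    """Call `extract` until every issue validates or attempts run out.

    `extract(response_text, feedback)` stands in for the model call; it
    receives the previous attempt's errors so it can correct itself.
    """
    feedback = []
    for _ in range(max_attempts):
        issues = extract(response_text, feedback)
        feedback = [
            error
            for issue in issues
            for error in validation_errors(issue, response_text, taxonomy)
        ]
        if not feedback:
            return issues
    raise ValueError(f"no valid extraction after {max_attempts} attempts: {feedback}")


# Usage with a stand-in extractor that invents a theme name on its first try.
text = "Twenty minutes on hold just to ask why my rates notice changed."
taxonomy = {"Customer service", "Rates"}


def fake_extract(response_text, feedback):
    topic = "Customer service" if feedback else "Phone wait times"
    return [Issue(topic, "negative", "Twenty minutes on hold")]


issues = extract_with_retries(fake_extract, text, taxonomy)
print(issues[0].topic)  # Customer service
```

The verbatim-substring check is the simplest form of span grounding: evidence that does not appear word for word in the response is rejected rather than shown.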

At a glance

Three ways to analyse the same feedback

Comparison of manual review, standard sentiment platforms, and Communiti Intelligence for issue-level feedback analysis.
Capability	Reading it yourself (analyst + spreadsheet)	Existing platforms (standard sentiment tooling)	Communiti Intelligence (issue-level analysis)
What a mixed response becomes	Complete. Both points noted - when there's time to read every row closely	Collapsed. One label for the whole response: "MIXED". Trains and buses both vanish	Complete. Trains · positive and Bus delays · negative, with the resident's words highlighted
Issues residents imply but never name	Caught. A good reader sees "twenty minutes on hold" for what it is	Missed. Found about 1 in 100 in our testing - there's no named "thing" to tag	Caught. Found 9 in 10, scored and filed like any other issue
Feedback in community languages	Sometimes. Only the languages your team happens to read - unless you pay for a translation service	English only. Issue-level analysis supports no other languages	50+ languages. 10 benchmarked natively, including mid-sentence switching - nothing skipped
Filing issues under your report topics	By hand. Every issue copied and categorised manually - and each reader files differently	Not available. No concept of your project's topics	Automatic. Filed into your topics as it reads - rename or merge topics any time
Evidence behind every finding	By hand. Quotes pasted into the report, hours of copy-and-paste	Not linked. A score, and at best keyword tags - the words aren't tied to the sentiment or your topics	Built in. Every issue links to the exact words in the original response
5,000 responses	Weeks. And consistency drifts as readers tire	Hours. Fast - but you get one shallow score per response	Under 3 hours. Your best analyst's read, at machine speed, the same rules on row 1 and row 5,000

For your technical reviewers

The scores behind the headlines

Headline figures are rounded for readability. These are the underlying benchmark results and technical context behind the public claims on this page.

The full scorecard

Production configuration on the 217-response stress set, scored with identical matching rules. Every figure traces to a published results file in the benchmark pack.

Metric	Communiti	Amazon Comprehend Targeted Sentiment, best setting
Issue detection (aspect F1), full stress set. Comprehend Targeted Sentiment does not support the multilingual portion of the corpus	92.4%	English only
Issue detection (aspect F1), English-only subset. Same 167 responses scored for both tools	93.3%	49.5%
Precision / recall (English subset)	89.4% / 97.5%	55.3% / 44.9%
Sentiment accuracy on detected issues	99.0%	86.2%
Evidence span grounding. Every Communiti issue links to highlightable words in the original response	100%	Not linked
Hallucinated evidence	0	n/a
Mean latency per response	~2.0s	<1s

Run on AWS Bedrock in ap-southeast-2 (Australia-geographic inference). Zero API errors across the full evaluation.
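As a quick sanity check, the English-subset F1 figures can be reproduced from the precision and recall rows, assuming aspect F1 here is the standard harmonic mean of the two. The snippet uses only numbers from the scorecard.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision / recall from the English-subset row of the scorecard.
communiti = f1_score(0.894, 0.975)
comprehend = f1_score(0.553, 0.449)

print(f"{communiti:.1%}")   # 93.3%
print(f"{comprehend:.1%}")  # 49.6%
```

The Communiti figure matches the published 93.3%. The Comprehend figure lands within 0.1 of a point of the published 49.5%, which is presumably down to the precision and recall inputs being rounded.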

Methodology

How we measured

Test corpus

All benchmarks use synthetic feedback written for testing - no resident data - spanning 473 responses across two test sets. The full evaluation suite re-runs end to end.

Stress set: 217 responses. Typos, sarcasm, rambling voice transcripts, low-literacy writing, and ten languages
Long submissions: 256 responses. Long submissions, implied issues, and mixed-language responses

Processed in Australia

Analysis runs on AWS in Sydney and Melbourne using Australia-geographic infrastructure. Feedback is not processed offshore.

Never used to train AI

Your community's feedback is not used to train any AI model, and the model provider has no access to it - contractually guaranteed by AWS.

Evidence on request

The complete benchmark pack - test data, scoring code, raw outputs, and methodology notes - is available for technical review.

Every claim traceable

Each issue links to the exact words in the original response. Nothing is summarised beyond what a reviewer can verify in one click.

The fine print we think you should read

Test data. All benchmarks use synthetic feedback written for testing, with no resident data, spanning 473 responses across two test sets: a 217-response stress set and a 256-response set of long submissions, implied issues, and mixed-language responses. Headline accuracy figures of 92% and 99% come from the stress set.
Comprehend comparison. Amazon Comprehend Targeted Sentiment was scored on its best-performing configuration, only on English content, with identical matching rules to ours. The comparison is about what each tool can see.
AI-assistant comparison. The per-response test used a well-crafted prompt on the same AI model Communiti uses, plus a stronger model, so the gap shown is workflow and product, not model quality. Correctly filed means the issue was found and assigned to the project's reporting topic.
Reproducibility. Every number on this page traces to a published results file, and the full evaluation suite re-runs end to end - including a non-live verification path that re-scores cached outputs without calling any model. Ask us for the benchmark pack.
