Skip to content

Benchmark Reports

Acervo runs benchmarks on every release to measure context quality, token efficiency, and entity extraction. Two benchmark types:

  • Conversation benchmarks (v0.2-v0.3): 360 turns across 6 scenarios measuring token savings and context hit rates.
  • Indexed project benchmarks (v0.4+): 55 turns across 3 projects with 5-category scoring (RESOLVE/GROUND/RECALL/FOCUS/ADAPT) and agent efficiency comparison.

Reports by Version

Version Date Type Turns Key Result Report
v0.4.0 2026-04-01 Indexed project 55 100% RESOLVE, 12.1x efficiency Full Report
v0.2.2-3 2026-03-27 Conversation 360 76.1% savings Full Report
v0.2.2-2 2026-03-27 Conversation 360 76.1% savings Full Report
v0.2.2-1 2026-03-27 Conversation 360 76.1% savings Full Report

v0.4.0 — Indexed Project Benchmarks

Category Scores

Category What it proves Score
RESOLVE Answers questions requiring project context 100%
GROUND Prevents hallucination with verified data 92%
RECALL Remembers user-stated facts across turns 67%
FOCUS Sends only relevant context, respects budget 100%
ADAPT Handles topic changes cleanly 100%

Approach Comparison (RESOLVE, 13 turns)

Approach Can Answer Avg Input Tokens Avg Steps
Stateless LLM 8% -- --
Agent + Tools 100% 7,462 2.8
Acervo 100% 616 0

12.1x fewer tokens than an agent approach for the same questions.

Component Health

Component Score
S1 Intent 78%
S2 Activation 56%
S3 Budget 32%
S3 Quality 81%

Test Projects

Project Domain Content
P1 — TODO App Source code 31 TypeScript/React files
P2 — Literature Prose Sherlock Holmes epub (public domain)
P3 — PM Docs Project management 11 markdown files

For detailed methodology, see the Benchmark Guide.

v0.2.x — Conversation Benchmarks

# Scenario Turns Description
1 Developer Workflow 60 Programming questions, debugging, code review
2 Literature & Comics 60 Character tracking, plot analysis, cross-references
3 Academic Research 60 Citations, methodology, multi-domain synthesis
4 Mixed Domains 60 Rapid topic switching across unrelated subjects
5 SaaS Founder (100t) 60 Long-form business context, metrics, strategy
6 Product Manager 60 Real-world PM workflow, stakeholder tracking

Generating Reports

# Indexed project benchmarks (v0.4+, requires LM Studio + Ollama)
pytest tests/integration/test_benchmarks.py -v -s

# Conversation benchmarks (v0.2-v0.3)
python -m tests.integration.run_benchmarks --format html
python -m tests.integration.export_report --tier full --open

After generating, copy reports to docs/benchmarks/vX.Y.Z/ and update this page.