Benchmark Reports
Acervo runs benchmarks on every release to measure context quality, token efficiency,
and entity extraction. There are two benchmark types:
- Conversation benchmarks (v0.2-v0.3): 360 turns across 6 scenarios measuring token savings and context hit rates.
- Indexed project benchmarks (v0.4+): 55 turns across 3 projects with 5-category scoring (RESOLVE/GROUND/RECALL/FOCUS/ADAPT) and agent efficiency comparison.
Reports by Version
| Version | Date | Type | Turns | Key Result | Report |
| --- | --- | --- | --- | --- | --- |
| v0.4.0 | 2026-04-01 | Indexed project | 55 | 100% RESOLVE, 12.1x efficiency | Full Report |
| v0.2.2-3 | 2026-03-27 | Conversation | 360 | 76.1% savings | Full Report |
| v0.2.2-2 | 2026-03-27 | Conversation | 360 | 76.1% savings | Full Report |
| v0.2.2-1 | 2026-03-27 | Conversation | 360 | 76.1% savings | Full Report |
v0.4.0 — Indexed Project Benchmarks
Category Scores
| Category | What it proves | Score |
| --- | --- | --- |
| RESOLVE | Answers questions requiring project context | 100% |
| GROUND | Prevents hallucination with verified data | 92% |
| RECALL | Remembers user-stated facts across turns | 67% |
| FOCUS | Sends only relevant context, respects budget | 100% |
| ADAPT | Handles topic changes cleanly | 100% |
Approach Comparison (RESOLVE, 13 turns)
| Approach | Can Answer | Avg Input Tokens | Avg Steps |
| --- | --- | --- | --- |
| Stateless LLM | 8% | -- | -- |
| Agent + Tools | 100% | 7,462 | 2.8 |
| Acervo | 100% | 616 | 0 |
Acervo uses 12.1x fewer input tokens than the agent approach for the same questions (616 vs. 7,462 on average).
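The 12.1x figure can be reproduced from the two averages in the comparison table. The snippet below is a minimal sketch of that arithmetic; the function name `efficiency_ratio` is illustrative and is not part of Acervo's benchmark code.

```python
# Average input tokens per RESOLVE turn, taken from the comparison table above.
AGENT_AVG_INPUT_TOKENS = 7462
ACERVO_AVG_INPUT_TOKENS = 616


def efficiency_ratio(baseline_tokens: float, candidate_tokens: float) -> float:
    """Return how many times more tokens the baseline uses than the candidate.

    Hypothetical helper for illustration only.
    """
    if candidate_tokens <= 0:
        raise ValueError("candidate_tokens must be positive")
    return baseline_tokens / candidate_tokens


ratio = efficiency_ratio(AGENT_AVG_INPUT_TOKENS, ACERVO_AVG_INPUT_TOKENS)
print(f"{ratio:.1f}x")  # prints "12.1x"
```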
Component Health
| Component | Score |
| --- | --- |
| S1 Intent | 78% |
| S2 Activation | 56% |
| S3 Budget | 32% |
| S3 Quality | 81% |
Test Projects
| Project | Domain | Content |
| --- | --- | --- |
| P1 - TODO App | Source code | 31 TypeScript/React files |
| P2 - Literature | Prose | Sherlock Holmes epub (public domain) |
| P3 - PM Docs | Project management | 11 markdown files |
For detailed methodology, see the Benchmark Guide.
v0.2.x — Conversation Benchmarks
| # | Scenario | Turns | Description |
| --- | --- | --- | --- |
| 1 | Developer Workflow | 60 | Programming questions, debugging, code review |
| 2 | Literature & Comics | 60 | Character tracking, plot analysis, cross-references |
| 3 | Academic Research | 60 | Citations, methodology, multi-domain synthesis |
| 4 | Mixed Domains | 60 | Rapid topic switching across unrelated subjects |
| 5 | SaaS Founder (100t) | 60 | Long-form business context, metrics, strategy |
| 6 | Product Manager | 60 | Real-world PM workflow, stakeholder tracking |
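As a sanity check, the six scenarios above add up to the 360 turns reported for the conversation benchmarks. The sketch below also shows one plausible way a "savings" percentage could be computed. This page does not define the formula, so `token_savings` and its definition (tokens avoided relative to a full-history baseline) are assumptions, and the numbers passed to it are made up for illustration.

```python
# Turns per scenario, copied from the scenario table above.
SCENARIO_TURNS = {
    "Developer Workflow": 60,
    "Literature & Comics": 60,
    "Academic Research": 60,
    "Mixed Domains": 60,
    "SaaS Founder (100t)": 60,
    "Product Manager": 60,
}


def token_savings(baseline_tokens: int, context_tokens: int) -> float:
    """Percentage of baseline tokens avoided.

    Assumed definition, not confirmed by the benchmark reports:
    savings = (baseline - context) / baseline * 100
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - context_tokens) / baseline_tokens


total_turns = sum(SCENARIO_TURNS.values())
print(total_turns)  # prints 360

# Illustrative inputs only; these are not measured values.
print(f"{token_savings(1000, 250):.1f}%")  # prints "75.0%"
```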
Generating Reports
# Indexed project benchmarks (v0.4+, requires LM Studio + Ollama)
pytest tests/integration/test_benchmarks.py -v -s
# Conversation benchmarks (v0.2-v0.3)
python -m tests.integration.run_benchmarks --format html
python -m tests.integration.export_report --tier full --open
After generating the reports, copy them to docs/benchmarks/vX.Y.Z/ and update the Reports by Version table on this page.
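The copy step could be scripted along these lines. This is a minimal sketch, not an existing Acervo utility: `publish_reports` is a hypothetical helper, and the location where the benchmark commands write their output is not stated here, so the source directory must be supplied by the caller.

```python
import shutil
from pathlib import Path


def publish_reports(source_dir: Path, docs_root: Path, version: str) -> Path:
    """Copy generated report files into docs_root/<version>/ and return that path.

    Hypothetical helper: source_dir is wherever the benchmark run wrote its
    reports, which this page does not specify.
    """
    if not source_dir.is_dir():
        raise FileNotFoundError(f"report directory not found: {source_dir}")
    target = docs_root / version
    # dirs_exist_ok (Python 3.8+) lets a re-run refresh an existing version folder.
    shutil.copytree(source_dir, target, dirs_exist_ok=True)
    return target


# Example call (paths are placeholders, adjust to your checkout):
# publish_reports(Path("path/to/generated/reports"), Path("docs/benchmarks"), "vX.Y.Z")
```

Updating the table on this page remains a manual step.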