Benchmark Reports

Acervo runs benchmarks on every release to measure context quality, token efficiency, and entity extraction. There are three benchmark types:

Conversation benchmarks (v0.2-v0.3): 360 turns across 6 scenarios measuring token savings and context hit rates.
Indexed project benchmarks (v0.4+): 55 turns across 3 projects with 5-category scoring (RESOLVE/GROUND/RECALL/FOCUS/ADAPT) and agent efficiency comparison.
Conversation scenario tests (v0.5+): 24 turns across 3 scenarios testing real-time graph construction from conversation.

Reports by Version

Version	Date	Type	Turns	Key Result	Report
v0.5.0	2026-04-06	Indexed + Conversation	79	100% GROUND, 21.3x efficiency, BFS layers	Full Report
v0.4.0	2026-04-01	Indexed project	55	100% RESOLVE, 12.1x efficiency	Full Report
v0.2.2-3	2026-03-27	Conversation	360	76.1% savings	Full Report
v0.2.2-2	2026-03-27	Conversation	360	76.1% savings	Full Report
v0.2.2-1	2026-03-27	Conversation	360	76.1% savings	Full Report

v0.5.0 — Hexagonal Architecture + BFS Semantic Layers

Category Scores

Category	What it proves	v0.4	v0.5
RESOLVE	Answers questions requiring project context	100%	85%
GROUND	Prevents hallucination with verified data	92%	100%
RECALL	Remembers user-stated facts across turns	67%	67%
FOCUS	Sends only relevant context, respects budget	100%	100%
ADAPT	Handles topic changes cleanly	100%	89%

Efficiency vs Agent (21.3x improvement)

Approach	Can Answer	Avg Input Tokens	Avg Steps
Stateless LLM	8%	--	--
Agent + Tools	100%	7,462	2.8
Acervo	100%	~350	0

Acervo uses 21.3x fewer input tokens than the agent approach (~350 vs 7,462 on average), up from 12.1x in v0.4.
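The multiplier is the agent's average input tokens divided by Acervo's. The sketch below reproduces both figures from the numbers on this page (7,462 for the agent, ~350 for v0.5, 616 for v0.4). The function name `token_efficiency` is made up for illustration and is not part of the benchmark suite.

```python
def token_efficiency(baseline_tokens: float, candidate_tokens: float) -> float:
    """Return how many times fewer input tokens the candidate uses."""
    if candidate_tokens <= 0:
        raise ValueError("candidate_tokens must be positive")
    return baseline_tokens / candidate_tokens


AGENT_AVG_TOKENS = 7462  # "Agent + Tools" average input tokens from the table

# Acervo averages as reported on this page: ~350 (v0.5) and 616 (v0.4).
ratio_v05 = round(token_efficiency(AGENT_AVG_TOKENS, 350), 1)
ratio_v04 = round(token_efficiency(AGENT_AVG_TOKENS, 616), 1)

print(ratio_v05, ratio_v04)  # prints: 21.3 12.1
```

Because the v0.5 average is approximate (~350), the 21.3x figure should be read as approximate too.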

Graph Quality (85/85 checks)

Project	Checks	Entities	Nodes	Edges
P1 Code (Todo App)	28/28 ✓	7	231	1,109
P2 Literature (Sherlock Holmes)	21/21 ✓	5	40	307
P3 PM Docs	32/32 ✓	6	108	331
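The checks counted above are described on this page as automated checks for required and forbidden entities. As a rough illustration only, such a check could be tallied as follows. `GraphSpec`, `run_checks`, and the entity names are hypothetical and do not come from the Acervo codebase; the real specs may check more than entity presence.

```python
from dataclasses import dataclass, field


@dataclass
class GraphSpec:
    """Hypothetical spec: entities that must, or must not, appear in a graph."""
    required: set[str] = field(default_factory=set)
    forbidden: set[str] = field(default_factory=set)


def run_checks(spec: GraphSpec, entities: set[str]) -> tuple[int, int, list[str]]:
    """Return (passed, total, failures), counting one check per spec entry."""
    failures = []
    for name in sorted(spec.required):
        if name not in entities:
            failures.append(f"missing required entity: {name}")
    for name in sorted(spec.forbidden):
        if name in entities:
            failures.append(f"found forbidden entity: {name}")
    total = len(spec.required) + len(spec.forbidden)
    return total - len(failures), total, failures


# Illustrative data only, not taken from the real benchmark specs.
spec = GraphSpec(required={"Sherlock Holmes", "Dr. Watson"}, forbidden={"Chapter"})
extracted = {"Sherlock Holmes", "Dr. Watson", "Baker Street"}

passed, total, failures = run_checks(spec, extracted)
print(f"{passed}/{total}", failures)  # prints: 3/3 []
```

Counting one check per spec entry gives a passed/total figure in the same shape as the Checks column.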

Conversation Scenarios (NEW in v0.5)

Scenario	Turns	Passed	Graph (nodes / edges)	Entity Accuracy
C1: Multi-project portfolio	10	7/10	13n / 27e	72%
C2: Personal knowledge	6	3/6	5n / 4e	60%
C3: Progressive building	8	7/8	6n / 5e	83%

What's new in v0.5

BFS semantic layers — S2 does breadth-first traversal: HOT (depth 0), WARM (depth 1), COLD (depth 2)
Compressed context format — XML-delimited (<hot>, <warm>), ~50% fewer tokens
Conversation pipeline — Graph grows in real time from chat. warm_tokens > 0 on 80%+ of retrieval turns (was 0% in v0.4)
Hexagonal architecture — facade.py (1,848 LOC) → domain/pipeline.py (~200 LOC)
Graph quality specs — Automated checks for required/forbidden entities
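The BFS layering and the tag-delimited context format in the list above can be sketched together. This is a minimal illustration, not Acervo's implementation. It assumes the graph is a plain adjacency dict, that traversal starts from a list of seed nodes, and that each layer is rendered as a semicolon-separated list inside its tag. The names `bfs_layers` and `render_context` and the sample graph are hypothetical.

```python
from collections import deque

LAYER_BY_DEPTH = {0: "hot", 1: "warm", 2: "cold"}
MAX_DEPTH = 2  # nodes deeper than COLD are never visited


def bfs_layers(graph: dict[str, list[str]], seeds: list[str]) -> dict[str, list[str]]:
    """Group nodes by BFS depth from the seeds: hot (0), warm (1), cold (2)."""
    layers = {name: [] for name in LAYER_BY_DEPTH.values()}
    depth_of = dict.fromkeys(seeds, 0)  # also removes duplicate seeds
    queue = deque(depth_of)
    while queue:
        node = queue.popleft()
        depth = depth_of[node]
        layers[LAYER_BY_DEPTH[depth]].append(node)
        if depth == MAX_DEPTH:
            continue  # do not expand past the COLD layer
        for neighbor in graph.get(node, []):
            if neighbor not in depth_of:  # first visit fixes the shortest depth
                depth_of[neighbor] = depth + 1
                queue.append(neighbor)
    return layers


def render_context(layers: dict[str, list[str]]) -> str:
    """Emit the hot and warm layers as tag-delimited text (assumed payload format)."""
    parts = []
    for name in ("hot", "warm"):
        if layers[name]:
            body = "; ".join(layers[name])
            parts.append(f"<{name}>{body}</{name}>")
    return "\n".join(parts)


# Illustrative graph: "auth" sits at depth 3 and is therefore excluded.
graph = {
    "todo_app": ["api.py", "models.py"],
    "api.py": ["routes"],
    "models.py": ["Task"],
    "routes": ["auth"],
}
layers = bfs_layers(graph, ["todo_app"])
print(render_context(layers))
```

Recording each node's depth on first visit is what makes the layers correspond to shortest-path distance from the seeds. Whether COLD nodes are ever emitted into the context is not stated here, so the sketch renders only `<hot>` and `<warm>`.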

For the full interactive report with charts, see the v0.5.0 Benchmark Report.

v0.4.0 — Indexed Project Benchmarks

Category	Score
RESOLVE	100%
GROUND	92%
RECALL	67%
FOCUS	100%
ADAPT	100%

12.1x fewer tokens than an agent approach (avg 616 tokens vs 7,462).

For details, see the v0.4.0 Report.

v0.2.x — Conversation Benchmarks

#	Scenario	Turns	Description
1	Developer Workflow	60	Programming questions, debugging, code review
2	Literature & Comics	60	Character tracking, plot analysis, cross-references
3	Academic Research	60	Citations, methodology, multi-domain synthesis
4	Mixed Domains	60	Rapid topic switching across unrelated subjects
5	SaaS Founder (100t)	60	Long-form business context, metrics, strategy
6	Product Manager	60	Real-world PM workflow, stakeholder tracking

Generating Reports

# Run all integration tests (requires Ollama with acervo-extractor-v3)
pytest tests/integration/ -v -s

# Generate unified report (JSON + MD + HTML)
python tests/integration/generate_report.py v0.5.0

# Copy HTML report to docs for publishing (create the target directory first if it does not exist)
mkdir -p docs/benchmarks/v0.5.0
cp tests/integration/reports/v0.5.0/benchmark_report.html docs/benchmarks/v0.5.0/index.html

For detailed methodology, see the Benchmark Guide.