Skip to content

Benchmark Reports

Acervo runs benchmarks on every release to measure context quality, token efficiency, and entity extraction. Three benchmark types:

  • Conversation benchmarks (v0.2-v0.3): 360 turns across 6 scenarios measuring token savings and context hit rates.
  • Indexed project benchmarks (v0.4+): 55 turns across 3 projects with 5-category scoring (RESOLVE/GROUND/RECALL/FOCUS/ADAPT) and agent efficiency comparison.
  • Conversation scenario tests (v0.5+): 24 turns across 3 scenarios testing real-time graph construction from conversation.

Reports by Version

Version Date Type Turns Key Result Report
v0.5.0 2026-04-06 Indexed + Conversation 79 100% GROUND, 21.3x efficiency, BFS layers Full Report
v0.4.0 2026-04-01 Indexed project 55 100% RESOLVE, 12.1x efficiency Full Report
v0.2.2-3 2026-03-27 Conversation 360 76.1% savings Full Report
v0.2.2-2 2026-03-27 Conversation 360 76.1% savings Full Report
v0.2.2-1 2026-03-27 Conversation 360 76.1% savings Full Report

v0.5.0 — Hexagonal Architecture + BFS Semantic Layers

Category Scores

Category What it proves v0.4 v0.5
RESOLVE Answers questions requiring project context 100% 85%
GROUND Prevents hallucination with verified data 92% 100%
RECALL Remembers user-stated facts across turns 67% 67%
FOCUS Sends only relevant context, respects budget 100% 100%
ADAPT Handles topic changes cleanly 100% 89%

Efficiency vs Agent (21.3x improvement)

Approach Can Answer Avg Input Tokens Avg Steps
Stateless LLM 8% -- --
Agent + Tools 100% 7,462 2.8
Acervo 100% ~350 0

21.3x fewer tokens than an agent approach. Up from 12.1x in v0.4.

Graph Quality (85/85 checks)

Project Checks Entities Nodes Edges
P1 Code (Todo App) 28/28 ✓ 7 231 1,109
P2 Literature (Sherlock Holmes) 21/21 ✓ 5 40 307
P3 PM Docs 32/32 ✓ 6 108 331

Conversation Scenarios (NEW in v0.5)

Scenario Turns Passed Graph Entity Accuracy
C1: Multi-project portfolio 10 7/10 13n / 27e 72%
C2: Personal knowledge 6 3/6 5n / 4e 60%
C3: Progressive building 8 7/8 6n / 5e 83%

What's new in v0.5

  • BFS semantic layers — S2 does breadth-first traversal: HOT (depth 0), WARM (depth 1), COLD (depth 2)
  • Compressed context format — XML-delimited (<hot>, <warm>), ~50% fewer tokens
  • Conversation pipeline — Graph grows in real time from chat. warm_tokens > 0 on 80%+ of retrieval turns (was 0% in v0.4)
  • Hexagonal architecture — facade.py (1,848 LOC) → domain/pipeline.py (~200 LOC)
  • Graph quality specs — Automated checks for required/forbidden entities

For the full interactive report with charts, see the v0.5.0 Benchmark Report.

v0.4.0 — Indexed Project Benchmarks

Category Score
RESOLVE 100%
GROUND 92%
RECALL 67%
FOCUS 100%
ADAPT 100%

12.1x fewer tokens than an agent approach (avg 616 tokens vs 7,462).

For details, see the v0.4.0 Report.

v0.2.x — Conversation Benchmarks

# Scenario Turns Description
1 Developer Workflow 60 Programming questions, debugging, code review
2 Literature & Comics 60 Character tracking, plot analysis, cross-references
3 Academic Research 60 Citations, methodology, multi-domain synthesis
4 Mixed Domains 60 Rapid topic switching across unrelated subjects
5 SaaS Founder (100t) 60 Long-form business context, metrics, strategy
6 Product Manager 60 Real-world PM workflow, stakeholder tracking

Generating Reports

# Run all integration tests (requires Ollama with acervo-extractor-v3)
pytest tests/integration/ -v -s

# Generate unified report (JSON + MD + HTML)
python tests/integration/generate_report.py v0.5.0

# Copy HTML report to docs for publishing
cp tests/integration/reports/v0.5.0/benchmark_report.html docs/benchmarks/v0.5.0/index.html

For detailed methodology, see the Benchmark Guide.