Acervo Benchmark Report

Indexed project benchmarks: Acervo answers questions about codebases, books, and docs using a knowledge graph — zero tool calls, constant token cost.
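
A rough sketch of what "zero tool calls, constant token cost" means in practice: the knowledge graph is queried once for a bounded context block, and that block plus the question goes to the model in a single call. All names below (GraphContext, graph.context, llm) are illustrative assumptions, not Acervo's actual API.

    // Illustrative only: these names are assumptions, not Acervo's real API.
    interface GraphContext {
      text: string;       // pre-synthesized context pulled from the knowledge graph
      tokenCount: number; // capped by a fixed budget, hence "constant token cost"
    }

    declare function llm(prompt: string): Promise<string>;

    // One graph lookup, one model call: no grep/read tool loop.
    async function answer(
      question: string,
      graph: { context(q: string, budget: number): GraphContext },
    ): Promise<string> {
      const ctx = graph.context(question, 650); // ~616 tokens avg on RESOLVE turns
      return llm(`${ctx.text}\n\nQ: ${question}`);
    }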

• Avg Score: 92%
• Turns Tested: 55
• Efficiency vs Agent: 12.1x
• Projects: 3
• Entities Extracted: 33

2026-04-01 — v0.4.0 — 3 fixture projects — 5 categories

Category Scores

Five categories test different aspects of graph-based context retrieval.

• Resolve: 100% (can it answer project-specific questions?)
• Ground: 92% (are answers grounded in the graph, not hallucinated?)
• Recall: 67% (does it remember facts from earlier turns?)
• Focus: 100% (does it ignore irrelevant context?)
• Adapt: 100% (does it handle graph updates mid-conversation?)
Token efficiency: 12.1x fewer tokens than an agent with tools. Acervo answers RESOLVE questions with ~616 tokens of warm context; an agent needs ~7,462 tokens across an average of 2.8 tool-call steps to reach the same answer (7,462 / 616 ≈ 12.1).

Approach Comparison

How Acervo compares to a stateless LLM and an agent with tools on the same questions.

RESOLVE — 13 turns

Questions that require project-specific knowledge to answer.

Approach        Can Answer   Avg Input Tokens   Avg Steps   Notes
Stateless LLM   8%           n/a                n/a         baseline
Agent + Tools   100%         7,462              2.8         multi-step
Acervo          100%         616                0           12.1x fewer tokens
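
For contrast, the 2.8-step average reflects an agent loop that re-sends a growing transcript on every tool call, which is how input tokens compound to ~7,462. A minimal sketch of such a loop follows; runTool, the ANSWER: convention, and the step cap are all assumptions, not the actual benchmark harness.

    // Sketch of a generic tool-calling agent, not the benchmark harness itself.
    declare function llm(prompt: string): Promise<string>;
    declare function runTool(request: string): Promise<string>; // e.g. grep or file read

    async function agentAnswer(question: string, maxSteps = 5): Promise<string> {
      let transcript = question;
      for (let step = 0; step < maxSteps; step++) { // RESOLVE turns averaged 2.8 steps
        const move = await llm(transcript);         // tool request or final answer
        if (move.startsWith("ANSWER:")) return move.slice("ANSWER:".length).trim();
        // Each iteration appends the tool output and re-sends everything,
        // so input tokens grow with every step.
        transcript += `\n${move}\n${await runTool(move)}`;
      }
      return "no answer within step budget";
    }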

GROUND — 11 turns

Questions where the answer must be grounded in actual project data, not general knowledge.

Approach        Can Answer   Avg Input Tokens   Avg Steps   Notes
Stateless LLM   27%          n/a                n/a         baseline
Agent + Tools   100%         5,500              2.3         multi-step
Acervo          91%          600                0           9.2x fewer tokens

Per-Turn Efficiency

Token cost comparison for every RESOLVE and GROUND turn across all three projects.

[Chart: Agent Tokens vs Acervo Tokens. Each bar pair shows one question; red = agent with tools, green = Acervo graph context.]

Component Health

Internal diagnostic scores for each pipeline stage.

• S1 Intent: 78%
• S2 Activation: 56%
• S3 Budget: 32%
• S3 Quality: 81%
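
These percentages read like per-turn diagnostics recorded at each stage and averaged across the run. The record shape below is a guess at how that scoring might look; every field name is assumed.

    // Assumed shape of a per-turn diagnostic record; all field names are guesses.
    interface TurnDiagnostics {
      s1IntentCorrect: boolean; // stage 1: was the user's intent classified right?
      s2Activation: number;     // stage 2: 0..1, did the right graph nodes activate?
      s3Budget: number;         // stage 3: 0..1, how well did context fit the budget?
      s3Quality: number;        // stage 3: 0..1, how relevant was the final context?
    }

    // A component's health score is just its mean over all scored turns.
    const mean = (xs: number[]): number =>
      xs.reduce((a, b) => a + b, 0) / xs.length;

    const s2Health = (turns: TurnDiagnostics[]): number =>
      mean(turns.map((t) => t.s2Activation)); // e.g. 0.56 for S2 Activation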

Category × Component Matrix

Where each category is strong or weak across pipeline components.

Category   S1 Intent   S2 Activation   S3 Budget   S3 Quality   Final Score
RESOLVE    73%         50%             0%          100%         100%
GROUND     80%         50%             0%          77%          92%
RECALL     n/a         n/a             n/a         33%          67%
FOCUS      73%         38%             40%         89%          100%
ADAPT      89%         100%            100%        78%          100%

S1 Intent Misclassifications

9 turns where the model classified user intent incorrectly.

  • Turn 2: "What technologies does this project use?" (expected: overview, got: specific)
  • Turn 6: "Interesting, this is a well-structured project" (expected: chat, got: overview)
  • Turn 19: "Ok, I think I understand the project now" (expected: chat, got: specific)
  • Turn 1: "What is this book about?" (expected: overview, got: specific)
  • Turn 8: "These are great detective stories" (expected: chat, got: specific)
  • Turn 14: "Overall, what themes run through these stories?" (expected: overview, got: specific)
  • Turn 1: "What project is documented here?" (expected: overview, got: specific)
  • Turn 2: "What documents are available?" (expected: overview, got: specific)
  • Turn 9: "This project seems well-documented" (expected: chat, got: overview)
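
All nine misses lean the same way: conversational or broad turns get routed to a more content-focused label (chat read as overview or specific, overview read as specific). A three-way classifier over these labels might be prompted like the sketch below; the prompt wording and the fallback are assumptions, and only the three labels come from the data above.

    // The labels overview/specific/chat come from the list above;
    // the classifier itself is a hypothetical sketch.
    type Intent = "overview" | "specific" | "chat";

    declare function llm(prompt: string): Promise<string>;

    async function classifyIntent(utterance: string): Promise<Intent> {
      const raw = await llm(
        "Classify the user's intent as exactly one of:\n" +
        "overview - broad questions about the whole project or document\n" +
        "specific - questions about a particular fact, file, or detail\n" +
        "chat - small talk, reactions, acknowledgements\n" +
        `User: ${utterance}\nIntent:`,
      );
      const label = raw.trim().toLowerCase();
      return label === "overview" || label === "chat" ? label : "specific";
    }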

Pipeline Validation

Index → Curate → Synthesize results for each fixture project.
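
The per-project counts below are what the three stages emit. A compact sketch of those stage outputs follows, with interface and function names assumed to mirror the metrics reported for each project.

    // Assumed stage outputs; names mirror the per-project metrics listed below.
    interface IndexResult {
      files: number; sections: number; chunks: number;
      symbols?: number; folders?: number; // present for code and docs projects
    }
    interface CurateResult { graphNodes: number; graphEdges: number; entities: string[]; }
    interface SynthesizeResult { synthesisNodes: number; }

    declare function index(projectPath: string): Promise<IndexResult>;
    declare function curate(idx: IndexResult): Promise<CurateResult>;
    declare function synthesize(graph: CurateResult): Promise<SynthesizeResult>;

    // Index -> Curate -> Synthesize, run over each fixture project.
    async function validate(projectPath: string) {
      const idx = await index(projectPath);  // parse files into sections/symbols/chunks
      const graph = await curate(idx);       // build graph nodes, edges, entities
      const synth = await synthesize(graph); // pre-compute summary nodes for warm context
      return { idx, graph, synth };
    }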

P1: Todo App (TypeScript/React)

• Files indexed: 31
• Sections: 38
• Symbols: 134
• Folders: 17
• Chunks: 350
• Graph nodes: 235
• Graph edges: 1,163
• Entities: 11
• Synthesis nodes: 4
• Sample entities: Express.js, SQLite, JWT, Nuxt, Tailwind CSS, React, MongoDB, Docker

P2: Literature (Sherlock Holmes EPUB)

• Files indexed: 1
• Sections: 33
• Chunks: 2,056
• Graph nodes: 43
• Graph edges: 126
• Entities: 8
• Synthesis nodes: 1
• Sample entities: Sherlock Holmes, John Watson, Baker Street, Irene Adler, Professor Moriarty, A Scandal in Bohemia

P3: Project Docs (ADRs, Issues, Sprints)

• Files indexed: 11
• Sections: 84
• Folders: 3
• Chunks: 160
• Graph nodes: 116
• Graph edges: 383
• Entities: 14
• Synthesis nodes: 4
• Sample entities: Todo App, Svelte, Express.js, SQLite, PostgreSQL, JWT, Docker Compose, GitHub Actions