Acervo Benchmark Guide

This document describes every test in Acervo's integration benchmark suite: what it measures, how it works, and how to interpret the results.

Overview

Acervo replaces traditional conversation history with a persistent knowledge graph. Instead of accumulating every message in the context window (which grows linearly and eventually overflows), Acervo builds a graph of entities, facts, and relationships from each conversation turn, then injects only the relevant subset as context for the next turn.

The benchmark suite answers two questions:

Does it work? Can Acervo answer questions that require project knowledge, recall user-stated facts, and avoid hallucinating information that doesn't exist?
Is it efficient? Does Acervo use fewer tokens than alternative approaches (raw history accumulation, or an agent with file-reading tools)?

Test Infrastructure

Fixture Projects

Three real-world projects are used as test data, each representing a different domain:

Project	Domain	Content	Files
P1 — TODO App	Source code	Full-stack TypeScript app: Express backend, React frontend, SQLite, JWT auth	31 files (.ts, .tsx, .md, .json, .html, .css)
P2 — Literature	Prose / epub	"The Adventures of Sherlock Holmes" by Arthur Conan Doyle (1892, public domain, Project Gutenberg)	1 epub file, 12 stories
P3 — PM Docs	Project management	Roadmap, sprint plans, issue trackers, architecture decision records	11 markdown files

These projects live in tests/fixtures/ and contain only source files (no generated data).

Auto-Indexing Pipeline

Before tests run, conftest.py automatically processes each fixture through Acervo's full pipeline:

init -> index -> curate -> synthesize

Index: Parses files, creates graph nodes (file, section, symbol, folder), generates embeddings, stores chunks in vector DB.
Curate: LLM analyzes files in batches, extracts entities (people, technologies, concepts) and relationships between them.
Synthesize: LLM generates a project overview node that summarizes the entire project.

Indexing runs once per fixture; subsequent runs reuse the existing .acervo/ state. To force re-indexing, delete the .acervo/ directory inside that fixture.
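The reuse rule can be sketched as a small helper. This is a hypothetical illustration, not the contents of the real conftest.py: the function name `needs_indexing` is an assumption, and only the `.acervo/` directory name comes from the text above.

```python
from pathlib import Path


def needs_indexing(fixture_dir: Path) -> bool:
    """Return True when the fixture has no saved .acervo/ state yet.

    Hypothetical helper: the real conftest.py may use a different check.
    """
    return not (fixture_dir / ".acervo").is_dir()
```

A session-level setup step could call this once per fixture and run the init -> index -> curate -> synthesize pipeline only when it returns True.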

Requirements: LM Studio running acervo-extractor-qwen3.5-9b and Ollama running qwen3-embedding.

Layer 1: Pipeline Validation (`test_pipeline_validation.py`)

26 tests that inspect the graph state produced by index/curate/synthesize. These tests do NOT call the LLM at runtime -- they read the existing graph and check structural properties. They run in seconds.

What it tests per project

Index validation

File nodes created: Did the indexer find all source files? (P1: >=20 files, P2: >=1 epub, P3: >=5 markdown files)
Section nodes created: Did the parser extract headings/chapters? (P2: >=10 story sections, P3: >=30 markdown heading sections)
Symbol nodes created: Did code parsing extract functions and classes? (P1: >=50 symbols from TypeScript)
Folder nodes created: Is the directory structure captured? (P1: >=5 folders)
Chunk IDs linked: Are text chunks properly associated with nodes? (P1: >=100 chunk links, P2: >=50 chunk links)
Edges exist: Are parent-child, import, and contains relationships stored? (P1: >=100 edges)
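These threshold checks amount to comparing node counts against per-project minimums. A minimal sketch, assuming the counts are available as a plain dictionary keyed by node kind (the dictionary shape and the name `below_minimum` are illustrative, not taken from the test suite):

```python
# Minimum node counts for P1, taken from the list above.
P1_MINIMUMS = {"file": 20, "symbol": 50, "folder": 5}


def below_minimum(
    counts: dict[str, int], minimums: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return {kind: (actual, required)} for every kind that falls short."""
    return {
        kind: (counts.get(kind, 0), required)
        for kind, required in minimums.items()
        if counts.get(kind, 0) < required
    }


# Example with made-up counts: only "symbol" is short.
shortfalls = below_minimum({"file": 31, "symbol": 42, "folder": 6}, P1_MINIMUMS)
```

Reporting the actual and required values together makes a failing run easier to diagnose than a bare pass/fail flag.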

Curate validation

Entities extracted: Did the LLM identify real entities? (P1: >=3 tech entities like React/Express/SQLite, P2: >=2 character entities like Holmes/Watson, P3: >=1 entity from PM docs)
Entity types valid: Are types from the ontology? (person, organization, project, technology, place, event, document, concept, symbol)
No phantom entities: Every entity label must appear in at least one source file. Phantoms = hallucinated entities the LLM invented.
Entities have relations: Entities should be connected to file nodes, not orphaned.
Character entities (P2): Does curation extract Sherlock, Watson, Adler, etc.?
Location entities (P2): Does curation find Baker Street, London?
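The phantom-entity rule above can be illustrated with a short function. This is a sketch under one stated assumption: a label counts as "appearing" when it occurs as a case-insensitive substring of a source file's text. The actual test may match differently, and `find_phantoms` is a hypothetical name.

```python
def find_phantoms(entity_labels: list[str], source_texts: list[str]) -> list[str]:
    """Return labels that appear in none of the source texts.

    Assumption: matching is a case-insensitive substring test.
    """
    lowered = [text.lower() for text in source_texts]
    return [
        label
        for label in entity_labels
        if not any(label.lower() in text for text in lowered)
    ]


# Made-up sample data: "MongoDB" is not mentioned in any source text.
sources = ["Express backend with SQLite storage", "React frontend"]
phantoms = find_phantoms(["React", "SQLite", "MongoDB"], sources)
```

An empty result means every extracted entity is grounded in at least one file.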

Synthesize validation

Synthesis node exists: Is there a synthesis:*overview* node?
Synthesis mentions stack (P1): Does the overview mention React, Express, etc.?
Synthesis mentions Sherlock (P2): Does the overview mention Holmes/Conan Doyle?
Module summaries (P3): Are there module-level synthesis nodes?

Diagnostic report

The final test generates a cross-project summary printed to console and saved as reports/pipeline_diagnostic.json and reports/pipeline_diagnostic.md.

Layer 2: 5-Category Benchmark (`test_benchmarks.py`)

55 conversation turns across 3 projects, organized into 5 capability categories. Each turn calls prepare() (the context injection pipeline) and process() (the memory extraction pipeline), then evaluates component-level checks.

How a turn works

1. User message sent to prepare(user_msg, history)
2. S1 (Intent Detection): Classify as overview/specific/chat
3. S2 (Node Activation): Select relevant graph nodes
4. S3 (Context Assembly): Build warm context within token budget
5. Evaluate per-turn checks against S1/S2/S3 outputs
6. Call process(user_msg, assistant_sim) to update the memory graph (S1.5 memory extraction)
7. Record pass/fail for each check

The assistant_sim is a simulated assistant response (not from an LLM). This isolates the test to Acervo's context pipeline, not the chat model's quality.
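The turn loop described above can be sketched as follows. Only the call shapes `prepare(user_msg, history)` and `process(user_msg, assistant_sim)` come from this guide. The return value of `prepare` (a dictionary with `intent`, `nodes`, and `warm_context` keys), the `run_turn` helper, and the stub functions are all assumptions made for illustration.

```python
from typing import Callable


def run_turn(
    prepare: Callable[[str, list], dict],
    process: Callable[[str, str], None],
    user_msg: str,
    history: list,
    assistant_sim: str,
    checks: dict[str, Callable[[dict], bool]],
) -> dict[str, bool]:
    """Run one benchmark turn and return pass/fail per check name."""
    prepared = prepare(user_msg, history)  # S1 -> S2 -> S3
    results = {name: check(prepared) for name, check in checks.items()}
    process(user_msg, assistant_sim)  # update the memory graph
    return results


# Stubs standing in for the real pipeline (hypothetical return shape).
def fake_prepare(user_msg: str, history: list) -> dict:
    return {"intent": "chat", "nodes": [], "warm_context": ""}


processed: list[tuple[str, str]] = []


def fake_process(user_msg: str, assistant_sim: str) -> None:
    processed.append((user_msg, assistant_sim))


results = run_turn(
    fake_prepare,
    fake_process,
    "thanks",
    [],
    "You're welcome.",
    {"s1_intent": lambda p: p["intent"] == "chat"},
)
```

Passing the pipeline functions in as arguments keeps the sketch runnable without Acervo installed; a real harness would call the library directly.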

The 5 Categories

RESOLVE (14 turns)

"Can Acervo enable answers that are impossible without project knowledge?"

These questions have NO correct answer without access to the project data. A stateless LLM would have to refuse or hallucinate.

Turn	Project	Question	Why it's impossible without context
P1-1	TODO App	"How many files does this project have?"	Requires file listing
P1-2	TODO App	"What technologies does this project use?"	Must read package.json/imports
P1-5	TODO App	"What routes does the API define?"	Must read route files
P1-13	TODO App	"How does the frontend communicate with the backend?"	Must trace API client code
P1-16	TODO App	"What data models are defined?"	Must read model files
P1-18	TODO App	"What error handling does the API have?"	Must read middleware code
P2-1	Literature	"What is this book about?"	Must read the epub
P2-2	Literature	"How many stories does it contain?"	Must parse table of contents
P2-12	Literature	"Where does Holmes live?"	Must find address in text
P3-1	PM Docs	"What project is documented here?"	Must read project files
P3-2	PM Docs	"What documents are available?"	Must list document files
P3-7	PM Docs	"What milestones are defined?"	Must read roadmap
P3-13	PM Docs	"Who are the developers working on this?"	Must find names in docs
P3-16	PM Docs	"Which issues are related to the current sprint?"	Must cross-reference sprint + issues

Pass criteria: S3 warm context contains the information needed to answer (verified by context_contains / context_contains_any checks).

Effectiveness check: stateless_can_answer: false marks these as turns where Acervo is the difference between answering and not answering.

GROUND (11 turns)

"Does Acervo prevent hallucination by grounding answers in verified data?"

These questions ask about things that may or may not exist in the project. Acervo should provide context that confirms or denies -- and critically, should NOT inject context about things that aren't there.

Turn	Project	Question	What we check
P1-14	TODO App	"Does this project use GraphQL?"	Context must NOT contain "graphql"
P1-15	TODO App	"Is there a Python backend?"	Context must NOT contain "python", "django", "flask"
P1-20	TODO App	"This is React and Express with SQLite, right?"	Context MUST contain "react", "express", "sqlite"; must NOT contain "mongodb", "nuxt"
P2-3	Literature	"Who is Sherlock Holmes?"	Context must contain "holmes", "detective", or "baker street"
P2-6	Literature	"Who is Irene Adler?"	Context must contain "adler" or "irene"
P2-11	Literature	"Does Professor Moriarty appear in these stories?"	Moriarty does not appear in this collection (he is introduced in the later story "The Final Problem") -- context should not hallucinate his presence
P2-15	Literature	"This is by Arthur Conan Doyle, right?"	Context must contain "conan doyle"
P3-4	PM Docs	"What's the tech stack?"	Context must contain "express", "sqlite", "jwt"; must NOT contain "django"
P3-12	PM Docs	"Why was SQLite chosen over PostgreSQL?"	Context must contain "sqlite" and "decision"
P3-15	PM Docs	"Is there a CI/CD pipeline set up?"	Correct answer is "no" -- context should not hallucinate one
P3-19	PM Docs	"Is Ron Weasley part of this project?"	Context must NOT contain "weasley" -- tests negative grounding

Pass criteria: context_contains for positive grounding, context_not_contains for negative grounding (absence of noise).

RECALL (6 turns)

"Can Acervo remember user-stated facts across turns?"

Two phases: the user states a fact (S1.5 extracts it into the graph), then later asks about it (S3 should retrieve it).

Turn	Project	Phase	Message	What we check
P1-9	TODO App	Store	"The lead developer is Alice and she started in January"	S1.5 extracts "alice" entity into graph
P1-12	TODO App	Recall	"Who is the lead developer?"	S3 context contains "alice"
P2-10	Literature	Store	"Holmes's deduction method is similar to scientific hypothesis testing"	S1.5 extracts fact about "deduction"
P2-13	Literature	Recall	"What did I say about Holmes's method earlier?"	S3 context contains "deduction", "scientific"
P3-8	PM Docs	Store	"The client wants the MVP ready by end of Q2"	S1.5 extracts "MVP" entity
P3-14	PM Docs	Recall	"What was the deadline I mentioned?"	S3 context contains "q2", "mvp"

Pass criteria: Entity extraction creates the node (store phase), and later context retrieval includes the fact (recall phase).

FOCUS (15 turns)

"Does Acervo keep context small and relevant?"

These turns test that Acervo doesn't over-activate nodes or waste the token budget. Chat messages like "thanks" should produce almost no context. Specific questions should activate only the relevant files.

Turn	Project	Question	Budget check
P1-3	TODO App	"How does authentication work?"	100-600 tokens, must contain "auth", must NOT contain "todo.controller"
P1-4	TODO App	"What database does it use?"	<=600 tokens, must contain "sqlite"
P1-6	TODO App	"Interesting, well-structured project"	<=150 tokens (chat -- minimal context)
P1-8	TODO App	"What React hooks does the app use?"	<=600 tokens, <=8 nodes
P1-17	TODO App	"High-level summary of the whole project"	<=300 tokens (synthesis only, no symbols)
P1-19	TODO App	"Ok, I think I understand the project now"	<=150 tokens (chat)
P2-4	Literature	"Tell me about Dr. Watson"	<=600 tokens, <=10 nodes
P2-5	Literature	"What happens in A Scandal in Bohemia?"	100-600 tokens
P2-8	Literature	"These are great detective stories"	<=150 tokens (chat)
P3-3	PM Docs	"What issues are currently open?"	100-600 tokens, <=8 nodes, only issue files
P3-5	PM Docs	"What's in the current sprint?"	<=600 tokens, <=6 nodes, sprint files
P3-9	PM Docs	"Thanks for the overview"	<=150 tokens (chat)
P3-11	PM Docs	"What's the auth issue specifically?"	<=600 tokens, <=6 nodes
P3-17	PM Docs	"What progress has been made?"	<=600 tokens, progress files
P3-20	PM Docs	"Give me a final summary"	<=300 tokens, synthesis only

Pass criteria: Token budget within range, node count within limits, and correct S2 file activation (activated + not-activated checks).

ADAPT (9 turns)

"Can Acervo switch context cleanly when the user changes topic?"

The user shifts focus mid-conversation. Acervo should activate nodes for the NEW topic and stop injecting context from the OLD topic.

Turn	Project	Shift	What we check
P1-7	TODO App	Backend -> Frontend	Must activate "component"/"frontend"; must NOT activate "controller", "middleware"
P1-10	TODO App	Frontend -> Config	Must activate "config" files
P1-11	TODO App	Config -> Auth (return)	Must activate "auth", "middleware"
P2-7	Literature	Bohemia -> Red-Headed League	Must contain "red-headed"/"league"; must NOT contain "adler"/"bohemia"
P2-9	Literature	Red-Headed -> Irene Adler (return)	Must contain "adler"/"irene"
P2-14	Literature	Specific story -> Overall themes	<=400 tokens (overview mode)
P3-6	PM Docs	Sprint -> Roadmap	Must activate "roadmap"; must NOT activate "sprint"/"issue"
P3-10	PM Docs	Roadmap -> Issues (return)	Must activate "issue" files
P3-18	PM Docs	Progress -> Architecture decisions	Must activate "decision"; must NOT activate "sprint"/"progress"

Pass criteria: Correct S2 activation for the new topic, AND absence of noise from the previous topic in S3 context.

Component-Level Checks (Internal Diagnostics)

Each turn can validate specific pipeline components independently:

S1 — Intent Detection

Checks that the pipeline correctly classifies the user's intent:

- overview: broad questions ("what is this project?") -> use synthesis nodes
- specific: targeted questions ("how does auth work?") -> use vector search
- chat: social messages ("thanks", "ok") -> minimal or no context

S2 — Node Activation

Checks which graph nodes the pipeline selected:

- activate_kinds: which node types should appear (file, section, synthesis, entity)
- not_activate_kinds: which node types should NOT appear
- activate_files_containing: node labels must include these terms
- not_activate_files_containing: node labels must NOT include these terms
- min_nodes / max_nodes: bounds on how many nodes activate
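A minimal sketch of how these S2 checks could be evaluated. The check names come from the list above; everything else is an assumption for illustration: nodes are represented as dictionaries with `kind` and `label` keys, term matching is a case-insensitive substring test, every term in `activate_files_containing` must match at least one label, and the file names in the example are invented.

```python
def check_activation(nodes: list[dict], spec: dict) -> dict[str, bool]:
    """Evaluate S2-style checks; only keys present in `spec` are evaluated."""
    kinds = {node["kind"] for node in nodes}
    labels = [node["label"].lower() for node in nodes]

    def seen(term: str) -> bool:
        # True when any activated node label contains the term.
        return any(term.lower() in label for label in labels)

    results: dict[str, bool] = {}
    if "activate_kinds" in spec:
        results["activate_kinds"] = set(spec["activate_kinds"]) <= kinds
    if "not_activate_kinds" in spec:
        results["not_activate_kinds"] = kinds.isdisjoint(spec["not_activate_kinds"])
    if "activate_files_containing" in spec:
        results["activate_files_containing"] = all(
            seen(t) for t in spec["activate_files_containing"]
        )
    if "not_activate_files_containing" in spec:
        results["not_activate_files_containing"] = not any(
            seen(t) for t in spec["not_activate_files_containing"]
        )
    if "min_nodes" in spec:
        results["min_nodes"] = len(nodes) >= spec["min_nodes"]
    if "max_nodes" in spec:
        results["max_nodes"] = len(nodes) <= spec["max_nodes"]
    return results


# Example shaped like an ADAPT turn (backend -> frontend); labels are invented.
nodes = [
    {"kind": "file", "label": "frontend/components/TodoList.tsx"},
    {"kind": "file", "label": "frontend/App.tsx"},
]
spec = {
    "activate_files_containing": ["component"],
    "not_activate_files_containing": ["controller", "middleware"],
    "max_nodes": 8,
}
activation = check_activation(nodes, spec)
```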

S3 — Context Assembly

Checks the warm context text injected into the LLM prompt:

- warm_tokens_min / warm_tokens_max: token budget bounds
- context_contains: ALL these terms must appear in context (case-insensitive)
- context_contains_any: at least ONE of these terms must appear
- context_not_contains: NONE of these terms should appear (noise check)
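The S3 checks can be sketched as one function over the warm context string. The check names and the case-insensitive matching come from the list above. The function name, the `spec` dictionary shape, and the decision to pass the token count in as an argument (this guide does not say which tokenizer is used) are assumptions.

```python
def check_context(context: str, token_count: int, spec: dict) -> dict[str, bool]:
    """Evaluate S3-style checks; only keys present in `spec` are evaluated."""
    text = context.lower()
    results: dict[str, bool] = {}
    if "warm_tokens_min" in spec:
        results["warm_tokens_min"] = token_count >= spec["warm_tokens_min"]
    if "warm_tokens_max" in spec:
        results["warm_tokens_max"] = token_count <= spec["warm_tokens_max"]
    if "context_contains" in spec:
        results["context_contains"] = all(
            t.lower() in text for t in spec["context_contains"]
        )
    if "context_contains_any" in spec:
        results["context_contains_any"] = any(
            t.lower() in text for t in spec["context_contains_any"]
        )
    if "context_not_contains" in spec:
        results["context_not_contains"] = not any(
            t.lower() in text for t in spec["context_not_contains"]
        )
    return results


# Example shaped like turn P1-20; the context text and token count are made up.
context = "Stack: React frontend, Express API, SQLite database."
spec = {
    "warm_tokens_max": 600,
    "context_contains": ["react", "express", "sqlite"],
    "context_not_contains": ["mongodb", "nuxt"],
}
quality = check_context(context, token_count=12, spec=spec)
```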

S1.5 — Memory Extraction

Only checked on RECALL-store turns:

- should_extract_entity: verify that process() created a graph node for this entity

Agent Comparison (Approach Scorecard)

24 turns include agent_comparison blocks that estimate what each alternative approach would cost to answer the same question.

Three approaches compared

Approach	Description
Stateless LLM	Plain model with no project access. Can only use training data.
Agent + Tools	Model with `list_directory`, `file_search`, `read_file` tools. Must discover and read files on every turn.
Acervo	Pre-indexed knowledge graph. Context injected from graph in a single step, no tool calls.

What gets measured

For each compared turn:

Stateless: can_answer (true/false) -- can a plain LLM answer from training alone?
Agent + Tools: steps (number of tool calls), tools (which tools used), estimated_input_tokens (total tokens consumed including tool results)
Acervo: warm_tokens (actual tokens of warm context injected)

Reports generated

RESOLVE Scorecard: For RESOLVE turns, shows what percentage each approach can answer and at what token cost. Stateless typically scores 0% (can't answer without data). Agent scores 100% (can always read files) but at high token cost. Acervo's score is the share of RESOLVE turns that pass their effectiveness checks.

Efficiency Chart ("killer chart"): Per-turn comparison of agent tokens vs Acervo tokens. Shows the ratio -- e.g., an agent uses 9,000 tokens where Acervo uses 500 (18x reduction).

Example agent estimates for P1 Turn 16 ("What data models are defined?"):

Agent: file_search("model") -> read_file(todo.model.ts) -> read_file(user.model.ts)
       3 steps, ~7,000 input tokens
Acervo: warm context with model summaries
       0 steps, ~400 tokens
       Ratio: 17.5x fewer tokens
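The ratio is the agent's estimated input tokens divided by Acervo's warm tokens. A small sketch reproducing the two figures quoted in this section (the function name `token_ratio` is illustrative, not part of the benchmark code):

```python
def token_ratio(agent_tokens: int, acervo_tokens: int) -> float:
    """How many times more input tokens the agent uses than Acervo."""
    if acervo_tokens <= 0:
        raise ValueError("acervo_tokens must be positive")
    return agent_tokens / acervo_tokens


ratio_turn_16 = token_ratio(7_000, 400)  # 17.5
ratio_chart = token_ratio(9_000, 500)    # 18.0
```

The guard matters for chat turns, where the warm context can be empty and the ratio is undefined.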

The key insight: the agent's cost grows with project size (more files to search) and conversation length (more history to carry). Acervo's cost stays bounded by the token budget (~2000 tokens max regardless of project size or turn count).

Version History

Each benchmark run appends results to reports/version_history.json. This tracks category scores and component scores across releases, enabling regression detection and progress tracking.
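Regression detection over the history can be sketched as below. The layout of version_history.json is not described in this guide, so the sketch assumes the file holds a JSON list with one entry per run; the helper names and the `tolerance` parameter are hypothetical.

```python
import json
from pathlib import Path


def append_run(history_path: Path, entry: dict) -> list[dict]:
    """Append one run's entry to a JSON list on disk and return the full history.

    Assumption: the file contains a JSON list (or does not exist yet).
    """
    history = json.loads(history_path.read_text()) if history_path.exists() else []
    history.append(entry)
    history_path.write_text(json.dumps(history, indent=2))
    return history


def regressions(
    previous: dict[str, float], current: dict[str, float], tolerance: float = 0.0
) -> dict[str, float]:
    """Return {name: drop} for every score that fell by more than `tolerance`."""
    return {
        name: previous[name] - current[name]
        for name in previous
        if name in current and previous[name] - current[name] > tolerance
    }


# Made-up scores: FOCUS dropped by 10 points between two runs.
drops = regressions(
    {"RESOLVE": 100.0, "FOCUS": 90.0},
    {"RESOLVE": 100.0, "FOCUS": 80.0},
)
```

A non-zero tolerance can absorb run-to-run noise if the pipeline's LLM steps are not fully deterministic.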

Running the Tests

# Layer 1 only (no LLM needed, reads existing graph, ~5 seconds)
pytest tests/integration/test_pipeline_validation.py -v

# Full benchmark (needs LM Studio + Ollama, ~5 minutes)
pytest tests/integration/test_benchmarks.py -v -s

# Single project
pytest tests/integration/test_benchmarks.py -k "p1" -v -s

# Both layers
pytest tests/integration/ -v -s

Output files

After running, find reports in tests/integration/reports/:

File	Content
`benchmark_public.json`	Category scores + scorecard + efficiency chart (machine-readable)
`benchmark_public.md`	Category scores + approach comparison tables (human-readable)
`benchmark_diagnostic.json`	Full per-turn detail with all component checks
`benchmark_diagnostic.md`	Component health + cross-matrix + S1 failures
`pipeline_diagnostic.json`	Graph state per project (Layer 1)
`pipeline_diagnostic.md`	Pipeline health summary (Layer 1)
`version_history.json`	Scores over time for regression tracking

Interpreting results

Category scores (public view): Percentage of turns where the context contained the information needed to answer correctly.

100% = every turn had correct context
< 80% = some questions would fail -- investigate which turns and why
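As a worked example of the category score, the sketch below computes the percentage from per-turn pass flags and applies the 80% threshold mentioned above. The function names are illustrative, and the pass flags are made-up data.

```python
def category_score(turn_passed: list[bool]) -> float:
    """Percentage of turns in a category whose context checks passed."""
    if not turn_passed:
        raise ValueError("category has no turns")
    return 100.0 * sum(turn_passed) / len(turn_passed)


def needs_investigation(score: float, threshold: float = 80.0) -> bool:
    """True when the score falls below the investigation threshold."""
    return score < threshold


# Made-up example: 3 of 6 RECALL turns pass.
recall_score = category_score([True, True, True, False, False, False])  # 50.0
flagged = needs_investigation(recall_score)  # True
```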

Component scores (internal view): Where the pipeline breaks down.

S1 Intent low = intent detection misfires (classifying overview as specific or vice versa)
S2 Activation low = wrong nodes selected (wrong files, too many nodes)
S3 Budget low = token budget exceeded (over-stuffing context)
S3 Quality low = context missing key terms or contains noise

Cross-matrix shows which categories are affected by which component failures. For example, if ADAPT has low S2 Activation, the pipeline isn't switching topics cleanly.
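One way to build such a cross-matrix is to group per-turn component results by category and compute a pass rate for each cell. The per-turn record shape used here (`category` plus a `checks` dictionary keyed by component name) is an assumption for illustration; the layout of benchmark_diagnostic.json may differ.

```python
from collections import defaultdict


def cross_matrix(turns: list[dict]) -> dict[str, dict[str, float]]:
    """Pass rate (0-100) per category and component.

    Each turn is assumed to look like:
    {"category": "ADAPT", "checks": {"S2 Activation": True, ...}}
    """
    passed = defaultdict(lambda: defaultdict(int))
    total = defaultdict(lambda: defaultdict(int))
    for turn in turns:
        for component, ok in turn["checks"].items():
            total[turn["category"]][component] += 1
            passed[turn["category"]][component] += int(ok)
    return {
        category: {
            component: 100.0 * passed[category][component] / count
            for component, count in components.items()
        }
        for category, components in total.items()
    }


# Made-up data: one of two ADAPT turns fails S2 Activation.
matrix = cross_matrix(
    [
        {"category": "ADAPT", "checks": {"S2 Activation": True, "S3 Quality": True}},
        {"category": "ADAPT", "checks": {"S2 Activation": False, "S3 Quality": True}},
    ]
)
```

Reading across a row shows which component is dragging a category down; in this example ADAPT loses points only on S2 Activation.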

Efficiency ratio shows how many fewer tokens Acervo uses compared to a tool-using agent. Higher is better. Typical: 10-40x fewer tokens per turn.