Tutorial: Build a chat with persistent memory in 5 minutes

This guide walks you through setting up Acervo with a local LLM and running a chat that remembers what you tell it across sessions.

Prerequisites

Python 3.11 or higher
LM Studio (free, runs LLMs locally)

Step 1: Download a model

Open LM Studio
Search for qwen2.5-3b-instruct in the model browser
Download it (about 2 GB)
Go to the Local Server tab
Load the model and click Start Server
The server starts at http://localhost:1234/v1
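Before moving on, it can help to confirm that the server is answering. The sketch below assumes the LM Studio server exposes the OpenAI-compatible `GET /models` route under the base URL above. The helper names `parse_model_ids` and `list_models` are illustrative and not part of LM Studio or Acervo.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # server address from Step 1


def parse_model_ids(payload: dict) -> list[str]:
    """Pull model ids out of an OpenAI-style {"data": [{"id": ...}]} payload."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """Ask the server which models it serves (assumes GET <base_url>/models)."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

With the server running, `list_models()` should return a list that includes the model you loaded. A `urllib.error.URLError` usually means the server has not been started.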

Why a 3B model?

Acervo uses a small utility model for entity extraction, topic detection, and query planning. A 3B-parameter model is fast and cheap enough to run on most laptops. Your main chat model can be larger (7B, 13B, etc.) and run separately.
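One way to picture this split is to keep two sets of client settings, one per role. This is a sketch under assumptions: that the `llm` argument of `Acervo` is the utility model, and that your larger model is reachable through an OpenAI-compatible endpoint. The chat model name is a hypothetical placeholder, and its `base_url` simply reuses the address from Step 1; point it at wherever your larger model is actually served.

```python
# Settings for the small utility model used by Acervo (values from this tutorial).
UTILITY_MODEL = {
    "base_url": "http://localhost:1234/v1",
    "model": "qwen2.5-3b-instruct",
    "api_key": "lm-studio",
}

# Settings for the main chat model. "your-7b-chat-model" is a placeholder name.
CHAT_MODEL = {**UTILITY_MODEL, "model": "your-7b-chat-model"}


def build_clients():
    """Create the memory layer and the chat client (requires `pip install acervo`)."""
    # Imported lazily so the settings above can be inspected without Acervo installed.
    from acervo import Acervo, OpenAIClient

    utility_llm = OpenAIClient(**UTILITY_MODEL)  # extraction, topics, planning
    chat_llm = OpenAIClient(**CHAT_MODEL)        # answers the user
    memory = Acervo(llm=utility_llm, owner="User")
    return memory, chat_llm
```

In the chat loop from Step 3, `memory` would then handle `prepare()` and `process()`, while `chat_llm.chat(...)` produces the reply.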

Step 2: Install Acervo

pip install acervo

Or install from source:

git clone https://github.com/sandyeveliz/acervo.git
cd acervo
pip install -e .

Step 3: Run the example

Save this as chat.py (or use examples/chat.py from the repo):

import asyncio
from acervo import Acervo, OpenAIClient

async def main():
    llm = OpenAIClient(
        base_url="http://localhost:1234/v1",
        model="qwen2.5-3b-instruct",
        api_key="lm-studio",
    )

    memory = Acervo(llm=llm, owner="User")
    history = [{"role": "system", "content": "You are a helpful assistant."}]

    print("Acervo Chat (type 'quit' to exit)\n")
    while True:
        user_input = input("You: ").strip()
        if not user_input or user_input.lower() in ("quit", "exit", "q"):
            break

        history.append({"role": "user", "content": user_input})

        # Acervo enriches context from the knowledge graph
        prep = await memory.prepare(user_input, history)

        # Call the LLM with enriched context
        response = await llm.chat(prep.context_stack, temperature=0.7, max_tokens=500)
        print(f"AI: {response}\n")

        history.append({"role": "assistant", "content": response})

        # Acervo extracts knowledge from this exchange (user message and response)
        await memory.process(user_input, response)
        print(f"  [{memory.graph.node_count} nodes, {memory.graph.edge_count} edges]\n")

if __name__ == "__main__":
    asyncio.run(main())

Run it:

python chat.py

Step 4: Try it out

Tell the agent some facts:

You: My name is Sandy and I live in Cipolletti
AI: Nice to meet you, Sandy! What would you like to talk about?
  [2 nodes, 1 edges]

You: I work at Altovallestudio, we build software
AI: Interesting! What kind of projects does Altovallestudio work on?
  [3 nodes, 3 edges]

Now quit (Ctrl+C or type quit) and restart the script. The graph persists:

You: What do you know about me?
AI: You're Sandy, you live in Cipolletti and work at Altovallestudio.
  [3 nodes, 3 edges]

The knowledge survived the restart because Acervo stores it in data/graph/nodes.json.

Step 5: Look at the graph

Open data/graph/nodes.json to see what Acervo extracted:

[
  {
    "id": "sandy",
    "label": "Sandy",
    "type": "Persona",
    "layer": "PERSONAL",
    "owner": "User",
    "facts": [
      {"fact": "Sandy lives in Cipolletti", "source": "user"},
      {"fact": "Sandy works at Altovallestudio", "source": "user"}
    ]
  },
  {
    "id": "cipolletti",
    "label": "Cipolletti",
    "type": "Lugar",
    "layer": "UNIVERSAL",
    "facts": []
  }
]

And data/graph/edges.json for the relationships:

[
  {"source": "sandy", "target": "cipolletti", "relation": "ubicado_en"},
  {"source": "sandy", "target": "altovallestudio", "relation": "TRABAJA_EN"}
]
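If you would rather inspect the graph from Python than open the files by hand, a small reader is enough. The sketch below assumes only the file layout and field names shown above (`data/graph/nodes.json` and `data/graph/edges.json`). The function names `load_graph` and `summarize` are illustrative and not part of Acervo.

```python
import json
from pathlib import Path


def load_graph(graph_dir="data/graph"):
    """Read nodes.json and edges.json from the graph directory."""
    root = Path(graph_dir)
    nodes = json.loads((root / "nodes.json").read_text(encoding="utf-8"))
    edges = json.loads((root / "edges.json").read_text(encoding="utf-8"))
    return nodes, edges


def summarize(nodes, edges):
    """Return one readable line per node, then one per edge."""
    # Map node ids to labels so edges can be printed with readable names.
    labels = {node["id"]: node.get("label", node["id"]) for node in nodes}
    lines = []
    for node in nodes:
        label = labels[node["id"]]
        kind = node.get("type")
        layer = node.get("layer")
        fact_count = len(node.get("facts", []))
        lines.append(f"{label} ({kind}, {layer}): {fact_count} fact(s)")
    for edge in edges:
        # Fall back to the raw id when an edge points at a node not in the list.
        source = labels.get(edge["source"], edge["source"])
        target = labels.get(edge["target"], edge["target"])
        lines.append(f"{source} -[{edge['relation']}]-> {target}")
    return lines
```

Run from the directory that contains `data/`, `print("\n".join(summarize(*load_graph())))` prints a compact view of what has been stored so far.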

What's happening under the hood

Each turn goes through a 3-step pipeline, with Acervo handling the first and last steps:

flowchart LR
    subgraph prepare["acervo.prepare()"]
        TD["Topic\nDetector"] --> QP["Query\nPlanner"]
        QP --> CI["Context\nIndex"]
    end
    subgraph process["acervo.process()"]
        EX["Extractor"] --> GP["Graph\nPersist"]
    end

    UM["User\nmessage"] --> prepare
    prepare --> LLM["Your app\ncalls LLM"]
    LLM --> process
    process --> KG["Knowledge\ngraph grows"]

    style prepare fill:#2d5016,stroke:#4a8c1c
    style process fill:#2d5016,stroke:#4a8c1c
    style LLM fill:#1a3a5c,stroke:#3a7abd

prepare(): detects the topic, plans what data to retrieve, and builds a context stack from the knowledge graph
Your app calls the LLM: with the enriched context
process(): extracts entities, relations, and facts from the exchange (user message and response) and persists them to the graph
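The three steps above can be folded into one helper that runs a single turn. This is a minimal sketch that reuses only the calls shown in Step 3 (`memory.prepare`, `llm.chat`, `memory.process`). The function name `run_turn` is illustrative, and `memory` and `llm` are passed in so the helper makes no assumptions about how they were built.

```python
import asyncio


async def run_turn(memory, llm, history, user_input):
    """Run one chat turn: prepare, call the LLM, then process."""
    history.append({"role": "user", "content": user_input})

    # Step 1: prepare() builds the enriched context from the knowledge graph.
    prep = await memory.prepare(user_input, history)

    # Step 2: your app calls the LLM with that context.
    response = await llm.chat(prep.context_stack, temperature=0.7, max_tokens=500)
    history.append({"role": "assistant", "content": response})

    # Step 3: process() extracts knowledge from the exchange and persists it.
    await memory.process(user_input, response)
    return response
```

Inside the loop from Step 3, the body reduces to `response = await run_turn(memory, llm, history, user_input)`.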

The context stack

The context that prepare() builds has a specific structure:

flowchart TD
    S["[system] Your app's system prompt\n(immutable, KV cached)"]
    W["[user] Warm context from graph\n(verified knowledge about the topic)"]
    A["[assistant] 'Entendido.'\n(acknowledgment)"]
    H["[user/assistant] Hot layer\n(last 2 turn pairs from history)"]
    U["[user] Current user message"]

    S --> W --> A --> H --> U

    style S fill:#1a3a5c,stroke:#3a7abd
    style W fill:#2d5016,stroke:#4a8c1c
    style A fill:#2d5016,stroke:#4a8c1c
    style H fill:#5c3a1a,stroke:#bd7a3a
    style U fill:#5c1a1a,stroke:#bd3a3a

System prompt stays immutable (KV cache friendly)
Warm context from the graph enters as a separate user message — always included when relevant
Hot layer keeps only the last 2 turn pairs from conversation history
Token usage per turn stays roughly flat: the graph grows, but the context sent to the LLM doesn't
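To make the ordering concrete, here is a minimal sketch of how such a stack could be assembled from plain message dictionaries. It illustrates the structure described above and is not Acervo's actual implementation. The function name `build_context_stack` and its parameters are hypothetical, and `history` is assumed to hold only earlier user/assistant messages (no system prompt, no current message).

```python
def build_context_stack(system_prompt, warm_context, history, user_message, hot_pairs=2):
    """Assemble messages in the order: system, warm context, ack, hot layer, user."""
    stack = [{"role": "system", "content": system_prompt}]

    if warm_context:
        # Warm context enters as its own user message, followed by the acknowledgment.
        stack.append({"role": "user", "content": warm_context})
        stack.append({"role": "assistant", "content": "Entendido."})

    # Hot layer: only the most recent turn pairs (one pair = user + assistant).
    if hot_pairs > 0:
        stack.extend(history[-2 * hot_pairs:])

    stack.append({"role": "user", "content": user_message})
    return stack
```

With the default of two turn pairs, the stack never exceeds eight messages no matter how long `history` gets, which is why the amount of context sent per turn stays bounded.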

Next steps

Configuration — customize context settings, token budgets, embedding thresholds
Knowledge Layers — understand UNIVERSAL vs PERSONAL layers
Web Search example — add Brave Search for real-time data