Vector RAG Isn’t Enough: I Built a Context Graph Layer for Multi-Agent Memory

I wasn’t trying to build a new memory architecture. I was trying to understand why one agent kept forgetting decisions made by another. The benchmark came later.

Multi-agent systems lose cross-agent decisions because flat transcripts and vector search each have a structural blind spot, not just a noise problem.

A context graph stores facts as entities and relationships instead of text chunks, so it can answer questions that need two facts combined.

This isn’t just a concept. I benchmarked it: three memory architectures, five scripted scenarios, 18 graded queries, fully deterministic, zero LLM calls.

Context graph: 88.9% accuracy at 26.9 tokens/query. Raw history dump: 61.1% accuracy at 490.9 tokens/query. Vector-only RAG: 50.0% accuracy at 75.9 tokens/query.

I found two real bugs while building this: stale-fact retrieval and an entity-matching gap. Both are covered in the article.

The Problem That Made Me Build This

I built a three-agent pipeline that worked fine for short tasks. But the moment the conversation dragged on and an agent needed to recall a prior decision, the whole thing fell apart.

Here is exactly how it broke: Agent_Planner would decide the project should use PostgreSQL. Then twenty turns of “sounds good” and “I’ll get to it” would pass. Eventually, Agent_Reviewer would pipe up and ask what storage technology we were using. Even with the entire raw transcript sitting right there in the context window, the agent couldn’t answer reliably.

I was running this pipeline locally as a side project for EmiTechLogic just to see how far I could push multi-agent coordination before it hit a wall. Turns out, it didn’t take very long.

Initially, I assumed this was just a model limitation. It isn’t. It’s a memory architecture problem that usually triggers one of two big headaches depending on how you try to fix it.

The Alternative Fix: Vector Search and the Relational Trap

If you switch to vector search, you fix the noise problem but immediately create a different one. A vector store retrieves chunks that look similar to your query; it doesn’t retrieve relationships between facts.

If a key decision lives in one chunk and a critical dependency note about that decision lives in another, a similarity search has no way to combine them, no matter how good your embedding model is.

Both approaches hit different structural ceilings. Instead of guessing which compromise was “good enough,” I decided to measure them both.

What This Problem Actually Is

To be clear about what this article is not: this isn’t a token-compression problem, and it’s not a staleness problem. It’s a structural retrieval problem. Some questions can only be answered by combining two separately stated facts, and neither a growing context window nor a vector index has a mechanism to do that. That is a completely different failure mode from the ones I’ve written about before, and it needed a different benchmark.

The Test Setup

To test this, I built five deterministic scenarios containing 18 graded queries and ran all three memory architectures against the exact same conversations.

All the results below come from real runs of that benchmark on a local setup:

Environment: Python 3.12, CPU-only (no GPU needed)
API Calls: Zero
Consistency: Reproduced identically across two separate machines

Code Repo: You can find the complete implementation and run the tests yourself here: https://github.com/Emmimal/context-graph-benchmark/

What “Context Graph” Means Here

A flat memory store (whether it’s a raw chat transcript or a vector index) treats every single turn as an independent unit of text. To retrieve something, you just find the unit that best matches your query.

A context graph changes the underlying structure entirely. It treats memory as distinct entities with typed relationships connecting them:

AuthModule --DEPENDS_ON--> RateLimiter
Agent_Implementer --ASSIGNED_TO--> AuthModule

Retrieval in this model means traversing these relationships instead of just matching keywords or semantic vectors.

That structural difference only matters for one specific class of questions: anything that requires you to combine two separately stated facts.

Consider a question like: “Which team owns the component that depends on the service that X chose?”

There is no single answer chunk sitting anywhere in the raw conversation history. The answer doesn’t exist as a block of text. It only exists as a path through multiple facts. A flat store cannot assemble that path on the fly. A graph walks right through it.
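
To make the "path through multiple facts" idea concrete, here is a minimal sketch using the two example edges from this section. It uses a plain list of triples instead of NetworkX so it runs with no dependencies; the helper names objects_of and two_hop are hypothetical and are not taken from the repository.

```python
# Hypothetical triples mirroring the two example edges in this section.
TRIPLES = [
    ("AuthModule", "DEPENDS_ON", "RateLimiter"),
    ("Agent_Implementer", "ASSIGNED_TO", "AuthModule"),
]


def objects_of(subject, predicate, triples=TRIPLES):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def two_hop(start, first_predicate, second_predicate, triples=TRIPLES):
    """Follow first_predicate from start, then second_predicate from each result."""
    answers = []
    for intermediate in objects_of(start, first_predicate, triples):
        answers.extend(objects_of(intermediate, second_predicate, triples))
    return answers


# "Which component does the module assigned to Agent_Implementer depend on?"
print(two_hop("Agent_Implementer", "ASSIGNED_TO", "DEPENDS_ON"))
```

Neither triple contains the answer on its own; it only appears once the second lookup starts from the result of the first.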

Who This Is For

This approach is worth building if you run multi-agent pipelines where one agent’s decision must be correctly retrieved by a different agent many turns later. It’s built for systems where questions routinely require combining two or more separately stated facts, or any long-running agent conversation where the token cost of re-sending history is becoming a real line item.

You should skip it for single-agent, single-turn tasks because there is no cross-agent state to lose. Skip it if your queries are always single-fact lookups with no joins. Vector RAG gets you most of the accuracy there at a fraction of the engineering cost. Finally, skip it if your team has no tolerance for another moving part. A graph needs an extraction step (which is rule-based in this benchmark, but requires an LLM call in production) that a flat store avoids.

If your multi-agent system finishes its work in a single exchange, plain context passing works fine. This problem shows up specifically when conversations run long and decisions need to survive past the turn they were made in.

The Three Architectures

Architecture	What it stores	What it costs	What it’s good at
Raw History Dump	Every turn, verbatim	Grows with conversation length, resent every query	Nothing it doesn’t get for free from having everything
Vector-Only RAG	Every turn, embedded (TF-IDF)	Flat per query, loses relational structure	Finding semantically similar single facts
Context Graph	Structured triples in a NetworkX graph	Flat and small per query	Questions that need two facts combined

Why There Are No LLM Calls in the Benchmark

I deliberately left LLM calls out of every stage of this benchmark: no LLMs for extraction, none for query answering, and none for grading.

If a real LLM handled the extraction, the benchmark would measure LLM variance as much as actual architectural differences. Using deterministic, rule-based stand-ins ensures that every single run produces the exact same numbers.

I ran this test independently on two different machines while writing this piece. The output matched byte-for-byte: accuracy identical to four decimal places and token counts identical down to the exact integer.

Building a Benchmark That Doesn’t Secretly Favor the Graph

The easiest way to make a graph win a benchmark is to only ask it clean, single-fact questions. That proves nothing. To keep the testing fair, every scenario follows four strict rules:

Distractors outnumber facts: Every scenario contains far more “sounds good,” “I’ll check that,” and “no blockers on my end” turns than actual concrete decisions.
Queries span distance: Some queries are asked right after a fact is stated (direct), some are asked many turns later (distant), and some require stitching two separate facts together (join). An example of a join query is: “Which component does the module owned by Agent_Implementer depend on?”
Some queries are easy on purpose: Direct, single-fact lookups are included specifically to give the flat architectures a fair shot.
Grading is fully deterministic: The benchmark uses substring matching against a hand-written ground truth rather than relying on an LLM judge.

from dataclasses import dataclass

# TurnType is defined elsewhere in the repo (FACT, DISTRACTOR, or QUERY).

@dataclass
class Turn:
    turn_id: int
    turn_type: TurnType          # FACT, DISTRACTOR, or QUERY
    speaker: str
    text: str
    subject: str | None = None    # structured triple, FACT turns only
    predicate: str | None = None
    object: str | None = None
    fact_id: str | None = None
    query_type: str | None = None # "direct", "distant", "join"
    required_fact_ids: tuple = ()
    ground_truth: str | None = None

The benchmark covers five distinct scenarios across different domains: software planning, a research pipeline, incident response, customer support escalation, and a data pipeline.

Across these five setups, there are 18 total queries split into three specific categories:

6 Direct queries: Lookups asked immediately after the fact is stated.
7 Distant queries: Lookups asked many turns after the fact is stated.
5 Join queries: Questions that require combining two separately stated facts to get the answer.
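
The deterministic grading rule described in this section can be sketched in a few lines. This is a minimal illustration, not the repository's grader: the function names grade and accuracy are hypothetical, and the case-insensitive comparison is my assumption about how the substring match is normalized.

```python
def grade(answer: str, ground_truth: str) -> bool:
    """Mark an answer correct if the ground truth appears inside it.

    Assumption: comparison is case-insensitive and ignores surrounding whitespace.
    """
    return ground_truth.strip().lower() in answer.strip().lower()


def accuracy(results: list[bool]) -> float:
    """Percentage of graded queries answered correctly."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


graded = [
    grade("The project uses PostgreSQL.", "PostgreSQL"),
    grade("MySQL", "PostgreSQL"),
]
print(graded, accuracy(graded))
```

Because the grader is a pure string comparison, rerunning it on the same answers always produces the same score, which is the property an LLM judge cannot guarantee.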

Architecture 1: Raw History Dump

Every single turn gets appended to a flat transcript, and the entire transcript gets resent on every query. This is exactly what you get by default when you don’t design a memory system on purpose.

I built this to serve as a genuinely fair baseline. It gets the full, perfect transcript with nothing hidden from it. The answer extraction uses keyword overlap with light stemming, searched from the most recent turn backward. This setup closely mirrors how a context-stuffed prompt tends to weight recency anyway.

class RawHistoryDump:
    # Excerpt: __init__ and the helper methods live in the repository.
    def ingest(self, turn: Turn) -> None:
        self.transcript.append(f"{turn.speaker}: {turn.text}")

    def answer_query(self, query_turn: Turn) -> tuple[str, int]:
        prompt = self._build_prompt(query_turn)   # the ENTIRE transcript
        tokens = count_tokens(prompt)
        answer = self._extract_answer(query_turn)
        return answer, tokens

The cost model matches exactly what you see in production: every query resends the entire growing conversation history.
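
The backward keyword scan used for answer extraction can be sketched as follows. This is a simplified stand-in under my own assumptions, not the repository's _extract_answer: the suffix-stripping stemmer, the min_overlap threshold of 2, and the function names are all hypothetical.

```python
import re


def stem(word: str) -> str:
    """Very light stemming: strip a common suffix if at least 3 characters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(text: str) -> set[str]:
    """Lowercase, split into word tokens, and stem each one."""
    return {stem(w) for w in re.findall(r"[a-z0-9_]+", text.lower())}


def most_recent_match(transcript: list[str], query: str, min_overlap: int = 2):
    """Scan from the newest turn backward; return the first turn sharing enough keywords."""
    query_words = keywords(query)
    for line in reversed(transcript):
        if len(query_words & keywords(line)) >= min_overlap:
            return line
    return None


transcript = [
    "Agent_Planner: The project will use PostgreSQL for storage.",
    "Agent_Reviewer: sounds good",
    "Agent_Implementer: I'll get to it",
]
print(most_recent_match(transcript, "What storage does the project use?"))
```

Scanning backward is what gives this baseline its recency bias: if the same fact is restated later, the later turn is found first.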

Architecture 2: Vector-Only RAG

Every turn, fact and distractor alike, gets embedded and stored as a chunk. A real vector store doesn’t know in advance which turns will matter later. On a query, the top-K most similar chunks are retrieved.

I used TF-IDF instead of a neural embedding API for the same reason I avoided LLM calls elsewhere. TfidfVectorizer has no random state, making it deterministic by construction. It is also not a toy stand-in. TF-IDF is a real sparse-retrieval method used in production RAG, often paired with dense embeddings in a hybrid setup.

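from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class VectorOnlyRAG:
    def _retrieve(self, query_text: str) -> list[str]:
        if not self.chunks:
            return []
        # Refit over every stored chunk plus the query on each call.
        corpus = self.chunks + [query_text]
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(corpus)
        # Last row is the query; compare it against all chunk rows.
        sims = cosine_similarity(matrix[-1], matrix[:-1]).flatten()
        top_idx = sims.argsort()[::-1][:self.top_k]
        return [self.chunks[i] for i in top_idx if sims[i] > 0]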

(The actual implementation wraps fit_transform in a try/except block to handle the rare edge case of a query containing only stop words. I skipped that here for space, but it’s in the repository.)

The structural ceiling remains clear: a join query requires combining two distinct facts. When those facts are stated across two different turns, no single chunk contains both pieces of information. No embedding model can fix that limitation on its own.

Architecture 3: The Context Graph

Facts get written as (subject, predicate, object) triples into a NetworkX directed multigraph. Distractor turns never get written at all. This is the one place this architecture gets an advantage the other two don’t: filtering data before it ever hits storage.

In production, that filtering step is an LLM call performing entity extraction. In this benchmark, it’s deterministic because the scenario setup already tags which turns are facts. I’m isolating exactly what the storage and retrieval architecture does on its own, with extraction held constant as a stated assumption. I’m not claiming to have solved extraction for free.

class ContextGraph:
    # self.graph is the NetworkX directed multigraph described above.
    def ingest(self, turn: Turn) -> None:
        if turn.subject is None:
            return  # distractors carry no structured triple; not stored
        self.graph.add_node(turn.subject)
        self.graph.add_node(turn.object)
        self.graph.add_edge(turn.subject, turn.object,
                             predicate=turn.predicate, fact_id=turn.fact_id)

The join-query traversal is the part doing the real work. It performs a two-hop walk across the graph nodes instead of searching for a single text chunk that happens to contain both facts.

def _answer_join(self, query_turn, mentioned):
    for entity in mentioned:
        # Hop 1: every node directly connected to a mentioned entity.
        out_edges, in_edges = self._edges_touching(entity)
        intermediates = [v for _, v, _ in out_edges] + [u for u, _, _ in in_edges]
        for intermediate in intermediates:
            # Hop 2: follow outgoing edges from each intermediate node.
            further_out, _ = self._edges_touching(intermediate)
            for _, target, data in further_out:
                if target != entity:
                    # score candidates by predicate relevance
                    ...

Here’s the difference in search space across all three: raw history and vector search retrieve text, while a context graph retrieves relationships. By traversing connected entities, the system can answer multi-hop questions that similarity search alone may miss.

What Actually Happened When I First Ran It

The first full run, with all three architectures built, scored the context graph at 0% accuracy.

I’m including this because it’s the part most “I built X” posts skip. I could have rewritten the scenarios to be friendlier instead of debugging the code. That would have given me a fake result. I traced it instead.

Bug 1: Entity Vocabulary Mismatch

Graph nodes were named things like Project_Alpha or AuthModule. The queries, written the way an agent would actually phrase them, said “this project” or “the authentication module.” A literal substring match between the query text and the node name found absolutely nothing.

This is the exact same vocabulary-mismatch problem people criticize vector search for. It just hits the graph at write time instead of query time.

The fix was a small alias table standing in for a real entity-linking step, which would usually be handled by an LLM call in production. Using a graph doesn’t get you out of this problem. It merely moves the problem from query-time retrieval to write-time resolution. That’s an ongoing engineering cost, not a one-time fix.
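
A minimal sketch of what such an alias table might look like is shown here. The two alias phrases and node names come from the examples in this section; the function name mentioned_entities and the exact matching logic are hypothetical and may differ from the repository.

```python
# Hypothetical alias table: lowercase phrase -> graph node name.
ALIASES = {
    "this project": "Project_Alpha",
    "the authentication module": "AuthModule",
}


def mentioned_entities(query_text, node_names, aliases=ALIASES):
    """Return graph node names referenced in the query, by literal name or alias."""
    lowered = query_text.lower()
    found = []
    # Pass 1: literal node names appearing in the query.
    for name in node_names:
        if name.lower() in lowered and name not in found:
            found.append(name)
    # Pass 2: known alias phrases that map to an existing node.
    for phrase, name in aliases.items():
        if phrase in lowered and name in node_names and name not in found:
            found.append(name)
    return found


nodes = ["Project_Alpha", "AuthModule", "RateLimiter"]
print(mentioned_entities("What does the authentication module depend on?", nodes))
```

A table like this only resolves the phrasings someone thought to list in advance, which is why it counts as a recurring cost rather than a solved problem.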

Bug 2: Returning Stale Facts With Full Confidence

This is the exact issue I’d flag first to anyone shipping this pattern in a production environment.

One scenario features a support ticket that starts at a priority level of “high” and gets reclassified to “critical” mid-conversation. When querying “what is the current priority?”, the graph returned “high”: the stale value, with the exact same confidence it would have given the current one.

The cause was simple: my first ingest() implementation just added every new edge and never removed the old one. The graph held two HAS_PRIORITY edges originating from the same node. Whichever edge happened to be visited first in the iteration order won the lookup, completely ignoring which fact was actually current.

# the bug
Ticket_4471 --HAS_PRIORITY--> "high"      # stated first
Ticket_4471 --HAS_PRIORITY--> "critical"  # stated later, supersedes the first
# both edges exist at once; nothing tells the graph which one is "now"

A flat chat dump searched with recency bias tends to surface the newer mention just by scanning backward. In contrast, a graph with no time model hands back either fact with equal structural confidence, because graphs don’t natively know a relationship has been replaced unless you explicitly tell them.

That failure mode is worse than a fuzzy search returning a stale chunk. The graph looks completely authoritative even when it’s completely wrong.

The fix: when a new fact restates an existing (subject, predicate) pair, the old edge gets dropped before the new one is written.

def ingest(self, turn: Turn) -> None:
    if turn.subject is None:
        return
    self.graph.add_node(turn.subject)
    self.graph.add_node(turn.object)

    # Collect existing edges with the same (subject, predicate) before mutating.
    # Note: this treats every (subject, predicate) pair as single-valued.
    stale_edges = [
        (u, v, k) for u, v, k, data in self.graph.edges(keys=True, data=True)
        if u == turn.subject and data.get("predicate") == turn.predicate
    ]
    for u, v, k in stale_edges:
        self.graph.remove_edge(u, v, key=k)

    self.graph.add_edge(turn.subject, turn.object,
                         predicate=turn.predicate, fact_id=turn.fact_id)

If you are shipping anything like this, handling fact supersession is not optional. It’s the exact line between building a reliable memory layer and building a major liability.

Final Benchmark Results

Five scenarios, 18 queries, fully deterministic, reproduced identically on two separate machines.

Architecture	Accuracy	Avg tokens/query	Direct	Distant	Join
Raw History Dump	61.1%	490.9	66.7%	71.4%	40.0%
Vector-Only RAG	50.0%	75.9	66.7%	57.1%	20.0%
Context Graph	88.9%	26.9	100%	85.7%	80.0%

The context graph wins on accuracy and uses about 18x fewer tokens per query than the raw dump. That isn’t a tradeoff; it’s a win on both axes.
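
As a quick arithmetic check on the table, the overall accuracies and the token ratio can be recomputed from the per-category figures. The correct-answer counts below are back-derived by me from the reported percentages and the 6/7/5 query split; they are an inference for illustration, not output from the repository.

```python
CATEGORY_SIZES = {"direct": 6, "distant": 7, "join": 5}

# Assumption: counts inferred from the reported per-category percentages.
CORRECT = {
    "Raw History Dump": {"direct": 4, "distant": 5, "join": 2},
    "Vector-Only RAG": {"direct": 4, "distant": 4, "join": 1},
    "Context Graph": {"direct": 6, "distant": 6, "join": 4},
}


def overall_accuracy(correct_by_category):
    """Overall accuracy in percent across all 18 queries."""
    total = sum(CATEGORY_SIZES.values())
    return 100.0 * sum(correct_by_category.values()) / total


for name, counts in CORRECT.items():
    print(f"{name}: {overall_accuracy(counts):.1f}%")

# Token ratio between the raw dump and the context graph (values from the table).
token_ratio = 490.9 / 26.9
print(f"raw dump / context graph tokens: {token_ratio:.2f}x")
```

Under those inferred counts the per-category rows sum to the same overall figures the table reports, and the token ratio comes out a little above 18.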

Vector RAG’s token cost is also low and isn’t the graph’s main differentiator. Both architectures retrieve a bounded number of items, so both stay cheap regardless of conversation length. What separates the graph from vector RAG is the join column: 80% versus 20%. That gap is the structural argument for a graph: vector similarity has no native way to combine two separately stated facts.

The raw dump’s accuracy came in higher than I expected at 61.1%, and it earns that. A perfect, lossless transcript with decent keyword matching does fine on single-fact lookups. It falls apart specifically on joins (40%) for the same structural reason as vector RAG, just with a much bigger token bill.

One limitation was left in on purpose: two queries in the data-pipeline scenario fail because they refer to an entity by description rather than by name (“the dataset that currently has an anomaly” instead of naming Upstream_Orders directly). Fixing that requires real semantic understanding of a descriptive clause, not simple alias matching. Extending the alias table to cover my own test queries would mean overfitting the benchmark rather than representing a real limitation, so it stays broken. If your production queries lean toward descriptive references, budget for an LLM-based resolution step instead of an ever-growing static alias table.

How Token Cost Scales With Conversation Length

My working assumption going in was that raw-dump token cost scales O(N^2) as conversations grow. I measured it instead of assuming it, because shipping an imprecise complexity claim to an audience that checks is a fast way to lose credibility.

The setup: one fact stated once, followed by a growing number of filler turns (ranging from 10 up to 800), followed by a single query asking for that fact. This isolates per-query token cost as a pure function of conversation length, with information content held completely fixed.

Filler turns	Raw Dump tokens	Vector RAG tokens	Context Graph tokens
10	157	54	23
50	659	54	23
100	1,287	54	23
200	2,542	54	23
400	5,052	54	23
800	10,072	54	23

When the conversation length grew 80x (from 10 to 800 turns), the raw dump’s token count grew 64.15x. Meanwhile, vector RAG and the context graph both grew 1.00x: completely flat.

The raw dump’s tokens-per-query is O(N), linear in conversation length, converging to about 12.6 tokens per filler turn. It isn’t quadratic. The O(N^2) story only becomes accurate if you sum the cost across an entire multi-query conversation: Q queries, each run against a transcript that has grown linearly, land around O(N·Q) total cost, which is O(N^2) only when the number of queries grows in proportion to N. That is the real figure, and a more precise one than “each query costs O(N^2).”

Vector RAG and the context graph both hold flat at O(1) per query because both architectures only ever pull a bounded number of items regardless of how long the conversation gets.
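
The linear-versus-quadratic claim can be checked directly from the table above. This short sketch uses only the measured raw-dump numbers and computes the overall growth factor and the marginal tokens per added filler turn between consecutive rows; the variable names are mine.

```python
FILLER_TURNS = [10, 50, 100, 200, 400, 800]
RAW_DUMP_TOKENS = [157, 659, 1287, 2542, 5052, 10072]

# Overall growth from the first row to the last.
length_growth = FILLER_TURNS[-1] / FILLER_TURNS[0]
token_growth = RAW_DUMP_TOKENS[-1] / RAW_DUMP_TOKENS[0]
print(f"length grew {length_growth:.0f}x, tokens grew {token_growth:.2f}x")

# Marginal cost: extra tokens per extra filler turn between consecutive rows.
slopes = [
    (t1 - t0) / (n1 - n0)
    for (n0, t0), (n1, t1) in zip(
        zip(FILLER_TURNS, RAW_DUMP_TOKENS),
        zip(FILLER_TURNS[1:], RAW_DUMP_TOKENS[1:]),
    )
]
print([round(s, 2) for s in slopes])
```

A quadratic cost would show the marginal cost rising as the conversation grows. Here it stays essentially constant between every pair of rows, which is the signature of linear growth.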

Figure: Line chart comparing tokens per query against conversation length. The "raw dump" line rises steeply to about 10k tokens at 800 turns, while the "vector RAG / context graph" line stays flat near zero. Caption: Token efficiency in LLMs, comparing the rapid growth of raw chat dumps against the flat token usage of the Vector RAG and Context Graph architectures.

What I’d Flag Before Taking This to Production

A few things are worth being direct about before anyone copies this pattern into a real application.

On latency: Vector RAG is actually the slowest architecture here, not the graph. It refits TF-IDF over the entire corpus on every query call rather than maintaining an incremental index. Averaged across all five scenarios, context graph query answering came in at 0.050ms versus Vector RAG’s 1.764ms.

That gap closes in a real deployment where you’d cache the vectorizer instead of refitting from scratch; the benchmark measured default behavior, not best-case engineered versions. The graph’s occasional spike to 1.9ms comes entirely from join queries walking multiple candidate paths before scoring.

On what the alias table is actually doing: The entity alias table that lets “the authentication module” resolve to AuthModule is a hardcoded stand-in for real entity linking. In production, that step is an LLM call. The benchmark is deterministic because I hardcoded the aliases I expected; that does not mean the vocabulary-mismatch problem is solved for arbitrary query phrasing. It’s a real ongoing cost that I’m flagging, not hiding.

On token estimation: I used a ~4-characters-per-token heuristic instead of tiktoken, because tiktoken downloads its BPE rank file from a remote URL on first use, which is a hidden network dependency in a benchmark built to have none. The heuristic is applied identically across all three architectures, so it cannot bias the comparison between them, but the absolute token numbers are approximations.
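
For reference, a ~4-characters-per-token estimator fits in a couple of lines. This is a sketch of the general idea only: the rounding-up behavior is my assumption, and the repository's count_tokens may round differently.

```python
def count_tokens(text: str) -> int:
    """Rough token estimate at ~4 characters per token.

    Assumption: partial tokens round up, and empty text costs zero tokens.
    """
    return (len(text) + 3) // 4


print(count_tokens("Agent_Planner: The project will use PostgreSQL."))
```

Since the same function is applied to every architecture's prompt, any error in the estimate scales all three token columns together and leaves their ratios roughly intact.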

On what this benchmark didn’t test: Distractor turns here are generic chatter (“no blockers on my end,” “sounds good”). Real production noise is topically close to actual facts. I’d expect all three architectures to drop in accuracy under adversarial noise, and I have not measured that, so I won’t claim the lead holds.

On what’s missing for production use: real entity extraction (the ingest() interface already accepts a structured triple, so swapping in an LLM-based extractor is a contained change), incremental vector indexing, graph pruning for long-running conversations that accumulate entities indefinitely, and persistent storage. The repo includes a NetworkX-to-Neo4j export path for anyone who needs durability and concurrent multi-agent writes, but that’s an optional step, not a performance upgrade. The reasons to make that jump are transactional guarantees and concurrency, not raw query speed.

What the Numbers Actually Say

None of this needed a bigger model or a longer context window. Every single result came from changing how information is represented, not how much data gets crammed into a prompt.

If you take only one number from this article, take the join-query gap: 80% versus 20–40%. That’s the real argument for structured memory, not the token savings.

While the token savings are real and measurable, they’re secondary. In this benchmark, questions requiring two facts from completely different parts of the conversation were where the graph architecture showed its largest advantage. That gap held consistently across all five scenarios, not just the ones that happened to be easy for a graph.

The full project (five scenarios, three architectures, the test suite that locks these numbers in as regression tests, and the Neo4j export path) is available at the repository below.

Full supply code: https://github.com/Emmimal/context-graph-benchmark/

References

[1] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638

[2] Zhang, W., Zhou, Y., Qu, H., & Li, H. (2026). Loosely-Structured Software: Engineering Context, Structure, and Evolution Entropy in Runtime-Rewired Multi-Agent Systems (arXiv:2603.15690). arXiv. https://arxiv.org/abs/2603.15690

[3] A. Kollegger, “Context Graphs & Agentic Decisions,” Neo4j Developer Blog, Jan. 31, 2026. [Online]. Available: https://medium.com/neo4j/context-graphs-agentic-decisions-9a125f22f411

[4] W. Lyon, “When Your Agents Share a Brain: Building Multi-Agent Memory with Neo4j,” Neo4j Developer Blog, Apr. 13, 2026. [Online]. Available: https://medium.com/neo4j/when-your-agents-share-a-brain-building-multi-agent-memory-with-neo4j-bac609f17b23

[5] Macklin, N., Zaim, Z., & Erdl, A. (2026). Context Graphs and AI Memory Across the Globe. Neo4j Developer Blog. https://medium.com/neo4j/context-graphs-and-ai-memory-across-the-globe-bb17e293df32

[6] NetworkX documentation. https://networkx.org/

[7] Scikit-learn Developers, “TfidfVectorizer,” Scikit-learn Documentation. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

[8] OpenAI. Counting tokens with tiktoken. https://github.com/openai/tiktoken

[9] Neo4j Python Driver documentation. https://neo4j.com/docs/api/python-driver/current/

Disclosure

All code in this article was written by me and is original work, developed and tested on Python 3.12 (Windows, PyCharm). Benchmark numbers are from actual runs of the code in the linked repository and are reproducible by cloning it and running benchmark.py and measure_scaling.py, except where the article explicitly notes a number is a heuristic or estimate rather than a measured result. I have no financial relationship with any tool, library, or company mentioned in this article.
