TL;DR
- I constructed a dataset Q&A system and trusted a RAG reply that was lower than half-correct.
- I measured this throughout 7 question sorts and 5 context sizes on 100,000 rows.
- queries away from RAG totally.
I Trusted the Improper Quantity
Final month I used to be heads-down constructing a brand new function for EmiTechLogic. Learners can now add their very own messy CSV information and ask questions in plain English about their knowledge. Sounded excellent for RAG, so I went all in — embeddings, retrieval, nice-looking responses.
The primary few demos appeared superb. Clear tables, assured numbers, skilled formatting. I truly began trusting the system in our inner testing.
Then I picked one quantity to double-check.
Actual grocery spend within the dataset: $1,140,033.24.
The mannequin gave me a good looking breakdown by class. It appeared legit. I added up the numbers it returned.
It was lower than half.
I sat there staring on the display screen pondering “this will’t be proper.” So I did what any engineer would do. I elevated the context window. 4k… 16k… 32k… 128k tokens. Every time the reply obtained longer, extra detailed, and extra confidently improper.
That’s when it lastly clicked. This wasn’t a retrieval situation. I used to be asking a retrieval system to carry out heavy computation on knowledge it had solely partially seen. And as an alternative of claiming it was uncertain or lacking info, the mannequin was producing polished, structured solutions that appeared appropriate.
Why RAG Can not Combination
The RAG pipeline doesn’t actually perceive structured knowledge. All it does is take every CSV row and flatten it into plain textual content. That’s it. To the mannequin, a row seems to be one thing like this:
"2019-01-01 grocery_pos 107.23 F NC Jennifer Banks ..."
For a question like “What’s the whole spend by class?”, the RAG pipeline does this:
1. Tokenise: ["total", "spend", "category"]
2. Rating all 100,000 rows by key phrase overlap
3. Return the top-N rows as serialised plain textual content
4. Ask the LLM to sum and group from that textual content
Step 4 is the place the system fails. The LLM will not be operating a SUM. It’s pattern-matching numbers from a textual content blob and producing a response that mimics an aggregation.
Fashions wrestle with numerical precision at scale [1], however the true situation is the presentation. The mannequin offers you an in depth breakdown throughout all classes. It is a traditional lure. The output seems to be skilled. It mimics the construction of an actual report so effectively that your mind assumes the content material is legitimate. You don’t have any technique to confirm that 92% of your knowledge is lacking.
RAG is a retrieval device. It isn’t a calculation engine. Retrieval finds related fragments. Computation requires a full dataset scan. Whenever you use RAG for math, you get a improper reply that appears authoritative. That distinction is crucial. A partial reply indicators that knowledge is lacking. An entire-looking improper reply simply indicators false confidence.
Full code: https://github.com/Emmimal/context-window-engine/
The Benchmark: Two Pipelines, Identical Question
To measure this exactly, I constructed a benchmark that runs two pipelines facet by facet for each question.
The primary pipeline is a RAG simulation. It fashions what a naive vector pipeline passes to an LLM at 5 context sizes. I examined 5 context sizes, starting from 5 rows as much as 8,000. That scales from 325 tokens to 500,000. For every dimension, I tracked three metrics: how a lot knowledge the LLM sees, what sum it computes from that particular slice, and whether or not a reader might truly spot the error.
The second pipeline is a semantic engine that executes the identical question as a deterministic full-scan over all 100,000 rows and returns the precise appropriate reply.
The simulation doesn’t reproduce actual LLM outputs. What it preserves is the important thing structural property: a partial slice of knowledge fed right into a system that returns a full-form reply. That’s the property that causes the issue, and that’s what the benchmark measures.
I selected seven question sorts to cowl each aggregation sample a structured knowledge system is prone to encounter:
| Question | Operation | Why it breaks RAG |
|---|---|---|
| Complete spend by class | SUM + GROUP BY | Requires summing all rows throughout 14 teams |
| Highest common transaction by class | AVG + GROUP BY | Common adjustments with each lacking row |
| Complete spent on grocery_pos | SUM + categorical filter | Filter requires seeing all matching rows |
| What number of feminine clients transacted | COUNT + filter | Depend is meaningless on a partial scan |
| Complete spend the place quantity > $500 | SUM + numeric comparability | Threshold logic requires full knowledge |
| State with lowest whole spending | MIN + GROUP BY throughout 50 teams | Minimal can solely be discovered with all teams current |
| Share of transactions which can be fraudulent | COUNT + ratio | Ratio is undefined on a partial denominator |
These queries are usually not distinctive or complicated. They’re the usual questions any analyst asks when taking a look at a brand new dataset. That’s precisely why this failure is so crucial.
Error Observability Collapse
Right here is the total benchmark output for the question that began all of this. I’m displaying it in full as a result of the numbers make the issue not possible to dismiss.
GROUND TRUTH (Semantic Engine)
SUM(amt) GROUP BY class → 14 teams
#1 grocery_pos 1,140,033.24
#2 shopping_net 773,527.93
#3 shopping_pos 725,766.14
#4 gas_transport 648,804.24
#5 house 556,526.53
Latency: 100.47ms | Rows scanned: 100,000
RAG SIMULATION — what the LLM receives at every context dimension
Context Rows Protection Partial sum Error detectable?
tiny (~325 tokens) 5 0.0050% 197.73 EASY
small (~3K tokens) 50 0.0500% 2,003.56 MODERATE
medium (~32K tokens) 500 0.5000% 31,023.21 HARD
giant (~130K tokens) 2,000 2.0000% 140,093.16 VERY HARD
xlarge (~520K tokens) 8,000 8.0000% 569,368.22 NEAR IMPOSSIBLE
I stared at these outcomes for some time. Essentially the most troubling half wasn’t simply that the solutions had been improper, it was how a lot more durable the errors grew to become to identify because the context window grew.
At 8,000 rows the error was nonetheless over 50%, but the response appeared like knowledgeable report. You’d have to manually confirm the numbers to note one thing was off. That’s what I began calling Error Observability Collapse. The extra context I gave the mannequin, the extra convincing — however no more correct — the output grew to become.
The “Partial sum” column exhibits the entire if the LLM added each quantity worth within the rows it truly retrieved. The “Error detectable?” column scores how seemingly a human reader is to identify a mistake.
With 5 rows, the partial sum is 197.73. The proper whole is 1,140,033.24. It’s apparent. The output is brief, the numbers are improper, and the lacking knowledge is obvious. The error is instantaneous.
At 8,000 rows, the partial sum hits 569,368.22. The LLM has now seen all 14 classes. It generates a 1,500-word report with particular figures and assured language. The error is 50%, however it’s hidden inside authoritative, well-structured prose. With out an exterior reference, a reader has no technique to catch it.
That is the sample that held throughout all seven queries:
| Context Window | Rows | Dataset Protection | Response Size | Error Detectable? |
|---|---|---|---|---|
| ~325 tokens | 5 | 0.005% | ~50 phrases | YES — clearly a guess |
| ~3K tokens | 50 | 0.050% | ~150 phrases | MAYBE |
| ~32K tokens | 500 | 0.500% | ~400 phrases | HARD |
| ~130K tokens | 2,000 | 2.000% | ~800 phrases | VERY HARD |
| ~520K tokens | 8,000 | 8.000% | ~1,500 phrases | NEAR IMPOSSIBLE |
| Semantic Engine | 100,000 | 100% | <200ms | N/A — actual |
I known as this Error Observability Collapse. As context grows, confidence scales with it. Correctness doesn’t.

The failure modes are uneven, which makes them harmful:
A improper RAG reply seems to be appropriate. It’s formatted, particular, and assured. A failed computation throws an express error. It’s seen.
One failure is silent. The opposite is loud. As context home windows attain thousands and thousands of tokens, the silent failure turns into more durable to detect [4]. The system doesn’t get safer because it scales. It simply will get extra convincing.
The Semantic Engine: Proof That the Appropriate Reply Is Quick
Earlier than I totally understood the issue, I had already thrown collectively a easy semantic engine out of frustration. I simply needed the proper reply a minimum of as soon as.
The strategy turned out to be easy: parse the question into correct operations and run a single move over your complete dataset. No embeddings, no retrieval, no guessing.
Right here’s what that appears like in observe:
The logic is straightforward. Take a question like “What’s the whole spend by class?”. The engine maps this to a direct operation: SUM(amt) GROUP BY class. It processes the total 100,000-row set in a single move. It accumulates grouped totals. There is no such thing as a retrieval. No inference. No partial scanning. It visits each row as soon as and returns the precise end result.
This proves that the proper reply will not be costly. Benchmark queries completed below 200ms. Pattern dimension: 100,000 rows. Aggregation is trivial. The failure occurs while you route these queries to a system constructed to misconceive them.
from context_window_engine import compute_ground_truth, load_csv
rows = load_csv("knowledge/credit_card_transactions.csv", max_rows=100_000)
gt = compute_ground_truth(
query_label = "whole by class",
rows = rows,
agg_func = "sum",
agg_col = "amt",
group_col = "class",
)
# gt.reply → [(grocery_pos, 1140033.24), (shopping_net, 773527.93), ...]
# gt.latency_ms → 100.47
Engine helps SUM, AVG, COUNT, MIN, MAX. Handles categorical and numeric filters. Contains GROUP BY and ratios. Zero exterior dependencies. Each operation runs as a deterministic perform over the total record.
The engine itself isn’t the product. It’s the proof: the proper reply is reachable below a second. No inference required. The actual problem is routing queries there reliably.
The Repair Is Not Higher Retrieval
Cease attempting to enhance retrieval. If a question wants 100% of the info, an 8% pattern fails. The repair is eradicating retrieval from the loop.
We want a classification layer. It sits earlier than the pipeline and makes one binary name: computation or lookup?
The distinction is obvious. “Complete spend by class” calls for a full scan. “Discover transactions from Jennifer Banks” is a straightforward lookup. Commonplace RAG forces each down the identical path. That’s the design flaw.
A QueryRouter fixes this. It inspects each incoming question and routes it to the proper path earlier than a single retrieval begins.

The classifier makes use of three sign tiers, prioritized. Tier 1: aggregation verbs—whole, what number of, common, lowest, share. These demand full-dataset computation. Tier 2: numeric comparability—better than 500, above $1,000, a minimum of. These indicate filter-then-aggregate, not possible for RAG. Tier 3: retrieval indicators—discover, present me, record, fetch. These point out lookups the place semantic similarity works.
| Tier | Sign | Examples | Route |
|---|---|---|---|
| 1 | Aggregation verb | whole, what number of, common, lowest, share |
COMPUTATION |
| 2 | Numeric comparability | better than 500, above $1,000, a minimum of |
COMPUTATION |
| 3 | Retrieval sign | discover, present me, record, fetch |
RETRIEVAL |
| 0 | No match | ambiguous | COMPUTATION — safer default |
Default to COMPUTATION if no tier matches. That is deliberate. Failure modes are uneven: a improper RAG reply on an aggregation is silently improper. A computation engine that may’t parse a question throws an error. When doubtful, fail loudly.
from query_router import QueryRouter
router = QueryRouter(rows)
end result = router.route("What's the whole spend by class?")
# end result.routed_to → "COMPUTATION"
# end result.reply.reply → [(grocery_pos, 1140033.24), ...]
# end result.total_latency → ~250ms — classify + execute mixed
end result = router.route("Discover transactions from Jennifer Banks")
# end result.routed_to → "RETRIEVAL"
# end result.reply.protected → True — RAG is suitable
Routing the Full Benchmark
I ran 9 queries via the router to confirm efficiency throughout each sorts: seven aggregation queries destined for the semantic engine, and two lookup queries for RAG.
Each route was appropriate. The seven aggregation queries hit the full-scan engine and returned actual outcomes. The 2 lookup queries appropriately triggered the RAG path. Take a look at the output: excessive confidence scores, appropriate sample matching, and latency below 130ms—even with the 100,000-row scan.
[1] ✓ COMPUTATION "What's the whole spend by class?"
Tier 1 | matched='whole' | confidence=0.97
#1 grocery_pos 1,140,033.24 (102.57ms | 100,000 rows | actual)
[2] ✓ COMPUTATION "Which class has the best common transaction quantity?"
Tier 1 | matched='highest' | confidence=0.97
71.91 (119.47ms | 100,000 rows | actual)
[3] ✓ COMPUTATION "What's the whole quantity spent on grocery_pos?"
Tier 1 | matched='whole' | confidence=0.97
1,140,033.24 (49.96ms | 100,000 rows | actual)
[4] ✓ COMPUTATION "What number of transactions had been made by feminine clients?"
Tier 1 | matched='What number of' | confidence=0.97
54,641.00 (90.45ms | 100,000 rows | actual)
[5] ✓ COMPUTATION "What's the whole spend the place quantity is larger than 500?"
Tier 1 | matched='whole' | confidence=0.97
1,274,269.60 (91.65ms | 100,000 rows | actual)
[6] ✓ COMPUTATION "Which state has the bottom whole spending?"
Tier 1 | matched='lowest' | confidence=0.97
lowest RI 2,125.60 (109.05ms | 100,000 rows | actual)
[7] ✓ COMPUTATION "What share of transactions are fraudulent?"
Tier 1 | matched='share' | confidence=0.97
0.9900% (87.35ms | 100,000 rows | actual)
[8] ✓ RETRIEVAL "Discover transactions from Jennifer Banks"
Tier 3 | matched='Discover' | confidence=0.85
RAG is suitable — no aggregation required
[9] ✓ RETRIEVAL "Present me a pattern transaction from Texas"
Tier 3 | matched='Present me' | confidence=0.85
RAG is suitable — no aggregation required
Routing accuracy: 9/9
9/9 appropriate. Error Observability Collapse is not possible if aggregation queries by no means attain RAG.
The Take a look at Suite
The benchmark verifies 9 particular queries. The check suite ensures reliability throughout a broader vary: edge circumstances, malformed inputs, lacking knowledge, and customary manufacturing failure factors.
The engine suite has 87 exams throughout 10 lessons. It covers float parsing with greenback indicators, commas, and scientific notation; all 5 aggregation features below regular situations and with empty inputs; all 5 numeric filter operators; full GROUP BY aggregation with categorical and numeric filters mixed; RAG simulation protection metrics at every context dimension; and edge circumstances together with empty datasets, rows with lacking column values, and single-row inputs.
The router suite has 72 exams throughout 5 lessons. It covers all three tier patterns, together with edge circumstances like all-caps queries and really lengthy queries; pure language to typed operation parsing for each supported question kind; routing and execution correctness in opposition to all seven benchmark queries; and a distinction suite that verifies router solutions match impartial ground-truth computation — guaranteeing the router doesn’t introduce any deviation from the engine’s personal output.
Run the engine exams by typing python house -m house unittest house test_engine house -v. This executes the 87 exams within the suite.
Run the router exams by typing python house -m house unittest house test_router house -v. This executes the 72 exams within the suite.
All 159 move on Python 3.9+ with zero exterior dependencies.
Trustworthy Limitations
This answer isn’t excellent. It solely works on single CSV information proper now. Actual manufacturing datasets are normally messy with a number of tables that want becoming a member of — I intentionally saved the scope small as a result of I needed one thing that truly labored end-to-end first.
The router can be nonetheless fairly primary (regex-based). I attempted a small LLM-based classifier early on however it was inconsistent and added latency, so I went again to the easy strategy. Generally the boring answer wins.
I additionally simulated the RAG responses as an alternative of hitting actual APIs for the benchmark. The patterns maintain up, however your mileage with GPT-4o or Claude 3.5 will fluctuate barely.
CSV format required. The engine hundreds structured knowledge straight from CSV information. Database connections, Parquet information, and different tabular codecs are usually not supported right now.
What This Adjustments
Including a routing layer prices virtually nothing. Classifying a question in opposition to 65 regex patterns takes simply microseconds. The semantic engine provides lower than 200ms to scan a 100,000 row dataset. The full overhead is smaller than a single embedding name.
What you get in return is a deterministic reply for each aggregation question. Each whole, each depend, and each share now comes from a full scan as an alternative of a assured approximation based mostly on 8 % of the info. RAG retains dealing with what it’s truly good at: retrieving particular data, surfacing related passages, and answering lookup questions the place semantic similarity is the fitting device for the job.
RAG will not be damaged. It’s simply being requested to compute, and it can’t try this.
The damaging half will not be that it fails. It’s that it fails convincingly. And no quantity of context adjustments that.
You possibly can attempt typing it out like this:
To begin, clone the repository utilizing git clone adopted by the URL https://github.com/Emmimal/context-window-engine/. As soon as that finishes, transfer into the listing by typing cd context-window-engine. Lastly, launch the challenge by operating python demo.py in your terminal.
References
[1] Levy, M., Jacoby, A., & Goldberg, Y. (2024). Identical process, extra tokens: The impression of enter size on the reasoning efficiency of huge language fashions. In Proceedings of the 62nd Annual Assembly of the Affiliation for Computational Linguistics (Quantity 1: Lengthy Papers), pages 15339–15353, Bangkok, Thailand. Affiliation for Computational Linguistics.
https://doi.org/10.18653/v1/2024.acl-long.818
[2] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented era for knowledge-intensive NLP duties. Advances in Neural Info Processing Programs, 33, 9459–9474. https://doi.org/10.48550/arXiv.2005.11401
[3] Gao, Y., Xiong, Y., Gao, X., Jia, Ok., Pan, J., Bi, Y., Dai, Y., Solar, J.,
Guo, Q., Wang, M., & Wang, H. (2023). Retrieval-augmented era for giant language fashions: A survey. arXiv preprint arXiv:2312.10997.
https://doi.org/10.48550/arXiv.2312.10997
[4] Liu, N. F., Lin, Ok., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Misplaced within the center: How language fashions use lengthy contexts. Transactions of the Affiliation for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638
[5] Koshorek, O., Granot, N., Alloni, A., Admati, S., Hendel, R., Weiss, I., Arazi, A., Cohen, S.-N., & Belinkov, Y. (2025). Structured RAG for answering aggregative questions. arXiv preprint arXiv:2511.08505.
https://doi.org/10.48550/arXiv.2511.08505
Disclosure
All benchmark numbers are from precise runs on Python 3.12.6, Home windows 11, CPU solely, no GPU. The benchmark makes use of the Credit score Card Transactions Fraud Detection dataset (Kartik Gajjar, Kaggle, 2020), an artificial dataset generated utilizing the Sparkov transaction simulator created by Brandon Harris, licensed CC0 (Public Area), obtainable at kaggle.com/datasets/kartik2112/fraud-detection. The RAG baseline simulates retrieval and fashions confidence indicators — no actual LLM API calls are made. No exterior API keys are required to breed any end result on this article. All code described right here was written and examined by me.

