In the fast-moving field of context and memory management for LLMs, Retrieval-Augmented Generation (RAG) has been the go-to method for augmenting a model's context with relevant information. RAG works by embedding documents into a vector space, then using the semantic similarity between a query's embedding and the document embeddings to retrieve the most relevant chunks into the context window.
This approach works well enough for simple lookup, but it has well-documented failure modes. Vanilla RAG struggles with multi-hop questions because it can only see flat chunks with no relational structure between them. It suffers from lost-in-the-middle behavior: information buried in the center of a long context is systematically underused [6]. It exhibits context rot: every frontier model tested by Chroma Research (2025) degraded as input length grew, even at lengths well below the stated window size. And it has no way to answer global questions (questions about themes across an entire corpus), because no single chunk contains the answer.
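As a rough illustration of that retrieval step, the sketch below ranks chunks by cosine similarity to a query vector. The three-dimensional vectors, the chunk ids, and the `retrieve` helper are made up for the example; a real system would obtain embeddings from an embedding model and usually query a vector index instead of sorting every chunk.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(chunk_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy embeddings (illustrative only, not produced by any real model).
chunks = {
    "c1": [1.0, 0.0, 0.0],
    "c2": [0.9, 0.1, 0.0],
    "c3": [0.0, 0.0, 1.0],
}
top = retrieve([1.0, 0.0, 0.0], chunks, k=2)
print(top)
```

Each chunk is scored independently of every other chunk, which is exactly the "flat" property that the graph-based methods later in this article try to move beyond.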
In this article, we survey recent advances in RAG and context-curation methods, and compare them on a subset of the NoCha benchmark [5].
The NoCha benchmark uses full-length recent novels as long, noisy contexts and turns them into a binary claim verification task: for each book, it generates pairs of nearly identical statements where one is true and the other is false, differing only in small but critical details like a character, location, or object. The goal is to test whether LLMs can reason globally over complex noisy narratives rather than relying on shallow retrieval tricks.
The full NoCha set covers 1001 pairs across modern copyrighted novels and is evaluated only through the official leaderboard. We instead use the public-domain subset from the marzenakrp/nocha repository, which ships four classic novels with 126 claims in total, balanced 50/50 between true and false:
For each book, we chunk the text at roughly 700 words with 100-word overlap (~1024 tokens per chunk) to feed the retrieval-based methods. The recursive language model setup ignores chunks entirely and inlines the whole novel as a variable, consistent with its navigation-not-retrieval premise.
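A minimal sketch of this kind of sliding-window chunking is shown below. The `chunk_words` helper is hypothetical and splits on whitespace only; the actual preprocessing may treat sentence boundaries or tokenization differently.

```python
def chunk_words(text, size=700, overlap=100):
    """Split text into word windows of `size`, each sharing `overlap` words
    with the previous window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Synthetic 1,500-word "book" used only to show the window arithmetic.
demo = chunk_words(" ".join(f"w{i}" for i in range(1500)))
print([len(c.split()) for c in demo])
```

With a 700-word window and 100-word overlap the window advances 600 words at a time, so the final chunk is usually shorter than the rest.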
To understand this family of methods, we first need to define what a knowledge graph actually is. A knowledge graph has three core components:
GraphRAG, developed by Microsoft [1], addresses these challenges by structuring the context search space as a knowledge graph.
The graph is constructed by chunking ingested documents and using an LLM-guided extraction prompt to identify the entities in each chunk, the relationships between them, and the weight (strength or frequency) of each relationship. Node2Vec is then used to embed entities into a latent space that reflects the distribution of nodes within the graph: co-occurring entities end up close together, and biased random walks during training capture both community membership and structural role.
The Leiden algorithm is then run over the entity-relationship graph to detect communities: tightly connected subgraphs that share a topic. Once communities are identified, each one is also embedded into the latent space, and the LLM generates a community report and a community summary describing what that community is about.
GraphRAG exposes three query modes, each suited to a different question type:
For evaluation on NoCha, we use Microsoft's GraphRAG library (v2.7.2) with the fast indexing pipeline and local-search retrieval. Local search is preferred over global search because each NoCha claim is a single-fact assertion whose truth value hinges on specific passages, while global search's map-reduce over community summaries would dilute that signal. The method extracts 3,789 entities and 67,625 entity-entity relationships, organized into 961 Leiden communities across 6 hierarchical levels (16 / 115 / 323 / 415 / 90 / 2 from level 0 to level 5); LLM-generated reports are produced for 934 of those communities (27 mid-level communities fall below the content-threshold filter). GraphRAG is the most resource-intensive method in our comparison: indexing took 2h22m of wall-clock time, of which the community-summarization step alone consumed 8,381s (98%). On NoCha, GraphRAG achieves F1 = 0.603.
HippoRAG [3] [4] takes a different bet from GraphRAG. Instead of pre-summarizing communities for global queries, it introduces a concept from the human cognitive system, the hippocampal memory indexing theory, in which the brain accomplishes two complementary tasks during memory processing:
These functions are split across three regions working together:
HippoRAG maps this architecture directly onto retrieval components:
The system runs in two phases.
Passages are passed through the LLM-as-neocortex to pull the entities and relationships without a predefined ontology. The triples are aggregated into a schemaless knowledge graph: nodes are entities, edges are relations. The PHR's job is then to link conceptually similar nodes. A retrieval encoder embeds each node, and when two embeddings exceed a cosine-similarity threshold, an additional synonymy edge is added between them.
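The synonymy-linking step can be sketched as a pairwise comparison of node embeddings. Everything concrete here is an assumption for illustration: the node names, the two-dimensional vectors, the `synonymy_edges` helper, and the 0.8 threshold. Comparing all pairs is quadratic, so a real implementation would likely use nearest-neighbour search instead.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def synonymy_edges(node_vecs, threshold=0.8):
    """Return (node_a, node_b, similarity) for every pair whose cosine
    similarity exceeds the threshold (threshold value is assumed)."""
    edges = []
    for (u, vec_u), (v, vec_v) in combinations(sorted(node_vecs.items()), 2):
        sim = cosine(vec_u, vec_v)
        if sim > threshold:
            edges.append((u, v, round(sim, 3)))
    return edges

# Toy node embeddings (made up; not from a real encoder).
nodes = {
    "Dr. Watson": [0.90, 0.10],
    "Watson": [0.85, 0.20],
    "London": [0.10, 0.95],
}
syn = synonymy_edges(nodes)
print(syn)
```

In this toy example only the two name variants are linked, which is the behaviour the synonymy edges are meant to provide: aliases of one entity become reachable from each other in the graph.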
At query time, the LLM-as-neocortex extracts named entities from the query. The retrieval encoder (PHR) maps each to its closest nodes in the graph; these become the query nodes, analogous to the partial memory cues the hippocampus would receive.
Personalized PageRank (PPR) is then run starting only from those query nodes. PPR is a random walker that diffuses probability mass through the graph, biased to stay close to where it started. This concentrates the search in the neighborhood of relevant nodes rather than scoring the entire graph. The resulting node scores are multiplied by the node-passage matrix to produce passage relevance scores, and the top passages become the context.
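The two steps just described, PPR from the query nodes and projection onto passages, can be sketched with plain power iteration. This is a simplified illustration, not the reference implementation: the toy graph, the passage mapping, the damping value of 0.5, and the binary node-passage membership are all assumptions.

```python
def personalized_pagerank(adj, seeds, damping=0.5, iters=50):
    """Power-iteration PPR. `adj` maps node -> list of neighbours; the walker
    restarts at a uniformly chosen seed with probability 1 - damping."""
    nodes = list(adj)
    reset = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(reset)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * reset[n] for n in nodes}
        for n in nodes:
            nbrs = adj[n]
            if not nbrs:
                # Dangling node: send its mass back to the seeds.
                for s in seeds:
                    nxt[s] += damping * rank[n] / len(seeds)
                continue
            share = damping * rank[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        rank = nxt
    return rank

def passage_scores(node_rank, node_passages):
    """Sum node scores into each passage that mentions the node
    (a binary node-passage matrix, assumed for simplicity)."""
    scores = {}
    for node, passages in node_passages.items():
        for p in passages:
            scores[p] = scores.get(p, 0.0) + node_rank.get(node, 0.0)
    return scores

# Toy undirected path graph A - B - C - D, with the query matching node A.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
rank = personalized_pagerank(adj, seeds=["A"])
scores = passage_scores(rank, {"A": ["p1"], "B": ["p1"], "C": ["p2"], "D": ["p3"]})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the walker keeps restarting at the seed, probability mass decays with graph distance from the query node, so passages mentioning nearby entities outrank passages mentioning distant ones.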
We use the reference HippoRAG 2 implementation (v2.0.0-alpha.4) with gpt-4o-mini for OpenIE-style entity and triple extraction and text-embedding-3-small for entity embeddings. HippoRAG 2 constructs a "hippocampal" knowledge graph from extracted triples augmented with chunk-co-occurrence edges; retrieval at query time runs Personalized PageRank seeded at entities matching the query, diffusing probability across both the typed-triple layer and the dense co-mention layer. The OpenIE pass extracts 2,501 entity mentions and 470 typed triples over the 737 corpus chunks, yielding an internal entity-plus-chunk graph of 1,174 vertices and 1,400 edges. On NoCha, HippoRAG 2 achieves F1 = 0.683.
OG-RAG [8] makes a third bet, orthogonal to both GraphRAG and HippoRAG. Where GraphRAG and HippoRAG let the LLM decide what entities and relations exist, OG-RAG forces them to fit a declared domain ontology.
This places OG-RAG squarely in the neurosymbolic AI tradition [2], though at its lightest weight. Classical neurosymbolic systems try to combine neural pattern recognition with symbolic reasoning. OG-RAG combines neural extraction with symbolic typing: the LLM still generates the facts, but the symbolic layer constrains what kinds of facts it can generate.
OG-RAG represents knowledge as a hypergraph rather than a standard knowledge graph. In a hypergraph, an edge can connect more than two nodes, where each hyperedge is a declared instance of a declared relation type.
The construction pipeline is schema-constrained extraction: the LLM is prompted with the ontology and asked to extract only facts that fit the declared types. Malformed extractions (wrong type, missing required slot, unknown relation) are rejected. The result is a graph that is fully auditable: every node has a known type, every edge has a known schema, and every fact can be checked against the ontology that produced it.
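The rejection rules can be illustrated with a small validator. The ontology fragment below is hypothetical: the slot names and the allowed types for `located_in` are assumptions chosen for the example, and the fact format (a relation name plus typed slots) is one possible encoding of a hyperedge, not OG-RAG's actual data model.

```python
# Hypothetical ontology fragment: relation -> slot -> allowed entity types.
ONTOLOGY = {
    "member_of": {"member": {"Person"}, "group": {"Organization"}},
    "located_in": {"thing": {"Place", "Organization", "Event"}, "place": {"Place"}},
}

def validate(fact):
    """Return a list of rejection reasons; an empty list means the fact fits."""
    relation = fact.get("relation")
    schema = ONTOLOGY.get(relation)
    if schema is None:
        return [f"unknown relation: {relation!r}"]
    errors = []
    slots = fact.get("slots", {})
    for slot, allowed in schema.items():
        if slot not in slots:
            errors.append(f"missing slot: {slot}")
        elif slots[slot].get("type") not in allowed:
            errors.append(f"wrong type for {slot}: {slots[slot].get('type')}")
    return errors

def person(name):
    return {"name": name, "type": "Person"}

good = {"relation": "member_of",
        "slots": {"member": person("A"), "group": {"name": "Club", "type": "Organization"}}}
wrong_type = {"relation": "member_of",
              "slots": {"member": person("A"), "group": person("B")}}
unknown = {"relation": "married", "slots": {"a": person("A"), "b": person("B")}}
missing = {"relation": "located_in", "slots": {"thing": {"name": "Inn", "type": "Place"}}}

results = [validate(f) for f in (good, wrong_type, unknown, missing)]
print(results)
```

A check like this only helps if it runs on every extraction before the fact enters the graph; a prompt that merely asks the LLM to respect the ontology does not guarantee compliance.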
At query time, OG-RAG identifies the relevant types and relations implied by the query, retrieves the matching hyperedges, and unrolls them into context for the LLM without depending on the embedding distance between the query and each fact.
We implement OG-RAG with a generic fiction ontology comprising 6 entity classes (Person, Place, Event, Organization, Work, Date) and 7 typed relations (member_of, part_of, located_in, occurred_in, born_in, created_by, founded_by). Each chunk is processed by gpt-4o-mini with a schema-constrained extraction prompt that requests structured JSON output bounded by the ontology. From 737 chunks, OG-RAG produces 910 typed entities and 1,861 hyperedges, with 41 chunks failing schema-compliant extraction entirely. The schema constraint is imperfectly enforced: the extractor occasionally invents off-ontology predicates (married, 2 instances) and frequently violates implicit type signatures, e.g. member_of(Person, Person) is the single most common signature at 386 instances, despite "membership" being a Person→Organization relation in the intended schema. The ontology is generic and reused across all four novels rather than hand-tuned per book, providing what is closer to a lower bound for the method. On NoCha, OG-RAG achieves F1 = 0.310, the weakest of all methods compared, consistent with schema-constrained extraction discarding the narrative subtleties (emotional reactions, plot-arc continuity) on which NoCha discriminators turn.
Zep [7] introduces a temporally aware, dynamic knowledge graph. Instead of building the graph once over a fixed corpus, Zep updates it continuously and non-destructively as new information arrives.
Zep's knowledge graph has three layers: the episodic subgraph, the semantic entity subgraph, and the community subgraph.
Zep's retrieval is a hybrid search across all three subgraphs. The first stage runs three search functions in parallel:
This gives three candidate sets capturing semantic, lexical, and structural notions of relevance.
A reranker then fuses the signals, typically using reciprocal rank fusion (RRF), maximal marginal relevance (MMR), graph-distance reranking, or a cross-encoder for the highest-precision settings. Crucially, no LLM runs at retrieval time, which is why Zep achieves around 300ms P95 latency.
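Of the fusion options listed, reciprocal rank fusion is the simplest to show: each item scores the sum of 1 / (k + rank) over the lists it appears in. The sketch below is a generic RRF, not Zep's code; the candidate ids are made up, and k = 60 is the conventional default rather than a value taken from Zep.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse several ranked lists with reciprocal rank fusion.
    Ties are broken alphabetically so the output is deterministic."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))

# Toy candidate lists standing in for the three parallel searches.
semantic = ["e1", "e2", "e3"]
lexical = ["e2", "e1", "e4"]
structural = ["e2", "e5", "e1"]

fused = rrf([semantic, lexical, structural])
print(fused)
```

RRF uses only rank positions, so the three searches do not need comparable score scales, and the whole fusion is a few dictionary updates with no model call.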
Zep is excluded from NoCha because the benchmark evaluates retrieval over closed, internally consistent corpora. There are no facts that become invalid, no relationships that change validity over time, no out-of-order episodes for the bi-temporal model to reconcile.
All of the previous methods rely on precomputed search indexes: different ways of organizing knowledge and processing the provided context before any query runs. In a recursive language model (RLM), the context is instead passed in as a variable, and the model decides at query time how to navigate it.
Introduced by Zhang, Kraska & Khattab (2025) [9], an RLM "treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt." In practice this means the prompt is loaded as a Python variable inside a REPL (read-evaluate-print loop) that the model can write code against. The model can slice the variable, grep it, regex it, chunk it, and spawn recursive sub-calls to fresh language model instances over individual chunks, each with its own clean context window. The final answer is returned once the root model is satisfied.
Every other method in this piece treats context management as a search and retrieval problem: how do we get the right context into the window? RLMs reframe it as a navigation problem: how do we let the model move through context itself? That shift moves the responsibility from the developer (designing chunking, embedding, graph extraction, retrieval) onto the model (writing the code that explores).
The mechanism borrows from a classical idea: out-of-core algorithms, the techniques computer scientists developed for datasets that don't fit in RAM. You don't load the data; you write a program that walks through it. RLMs apply the same logic to context. The model never sees the full input at once. It sees a handle to the input and a set of tools for paging through it.
We use the RLM reference implementation (Zhang, Kraska, Khattab; rlms package) in live mode. Unlike retrieval-based methods, RLM performs no indexing: the full novel is inlined into the prompt and a root model writes Python in a REPL environment, optionally launching recursive sub-LM calls on slices. Because gpt-4o-mini defaulted into the system prompt's narrative-buffering pattern (producing multi-paragraph summaries rather than TRUE/FALSE judgments), we prescribe a per-chunk-vote strategy in the wrapper prompt: split context into 8 chunks, fire llm_query_batched with a SUPPORT/REFUTE/NONE prompt per chunk, then aggregate and enforce FINAL("TRUE"|"FALSE") as the only valid terminator (max_iterations=3). We evaluate two reader variants: RLM-mini (gpt-4o-mini as root) achieves F1 = 0.501, and RLM-strong (gpt-4o as root) achieves F1 = 0.627. The 12.6-point gap is structural rather than incidental: because RLM offloads the decomposition, code-writing, and recursive routing onto the root model, it is only as strong as that model's reasoning baseline. gpt-4o-mini needs heavy prompt scaffolding to operate at all, while gpt-4o has the reasoning capacity to drive the loop itself.
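The per-chunk-vote strategy can be sketched without any model by swapping the sub-LM call for a keyword stub. Everything below is illustrative: `stub_vote` is a hypothetical stand-in for the SUPPORT/REFUTE/NONE sub-call, splitting happens on word boundaries, and the aggregation rule (TRUE only when SUPPORT votes outnumber REFUTE votes) is an assumption, since the exact rule is not spelled out above.

```python
from collections import Counter

def split_words(context, n_chunks=8):
    """Split the context into at most n_chunks pieces on word boundaries."""
    words = context.split()
    size = max(1, -(-len(words) // n_chunks))  # ceiling division
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def stub_vote(claim, chunk):
    """Hypothetical stand-in for a sub-LM call. SUPPORT if the claim appears
    verbatim, REFUTE if the chunk matches the claim except for its last word."""
    claim, chunk = claim.lower(), chunk.lower()
    if claim in chunk:
        return "SUPPORT"
    if claim.rsplit(" ", 1)[0] in chunk:
        return "REFUTE"
    return "NONE"

def verify(claim, context, vote_fn=stub_vote, n_chunks=8):
    """Collect one vote per chunk and force a TRUE/FALSE verdict."""
    votes = Counter(vote_fn(claim, chunk) for chunk in split_words(context, n_chunks))
    return "TRUE" if votes["SUPPORT"] > votes["REFUTE"] else "FALSE"

# Toy 80-word context with one relevant sentence inside the third chunk.
context = " ".join(["filler"] * 20
                   + "the key was hidden in the garden".split()
                   + ["filler"] * 53)
verdict_true = verify("the key was hidden in the garden", context)
verdict_false = verify("the key was hidden in the cellar", context)
print(verdict_true, verdict_false)
```

The toy pair mirrors the NoCha setup, where the true and false claims differ in a single detail. With a real model the per-chunk votes are noisy, which is where the strength of the reader starts to matter.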
HippoRAG 2 dominates the Pareto frontier at F1 = 0.68 in ~3.5s. This suggests that the given global-reasoning claims have important entities that can be used as starting points in a graph. The system starts from those entities and uses Personalized PageRank (PPR) to move through connected nodes and find related information. Because the graph keeps many flexible connections instead of using a strict schema, it can preserve useful relationships that help answer complex questions quickly and accurately.
GraphRAG and RLM-strong cluster at 0.60–0.63 at roughly 10× the latency. OG-RAG at F1 = 0.31 is Pareto-dominated. RLM-mini at 0.50 reinforces the reasoning-baseline point; RLM-strong recovers to 0.63 but pays the heavy latency without separating from HippoRAG on accuracy.
Two caveats: our 126-claim subset is from classic novels rather than the modern speculative fiction the paper flags as hardest because of extensive world-building, so absolute numbers sit on the easier end; and we report aggregate F1 without splitting retrieval-style vs global-reasoning pairs, since the public subset lacks that annotation.
The takeaway: even on a task designed to defeat retrieval, the right precomputed structure beats query-time strategy at a fraction of the cost, and the winning structure here matches how the corpus is actually organized (entities, aliases, dense connections) without overcommitting to a schema.
The arc this piece has traced runs from retrieval to navigation: RAG's flat chunks, GraphRAG and HippoRAG's precomputed structures, OG-RAG's typed schemas, Zep's temporal layer, RLM's query-time agency. The methods are stacked roughly in order of how much responsibility they give the model versus how much they bake in offline. NoCha is one data point on that arc, not a definitive benchmark.
Different benchmarks should reorder the leaderboard in ways that match what each method is for. Zep needs LongMemEval or LoCoMo. GraphRAG should pay off on tasks built around explicit global synthesis. OG-RAG belongs on domains where the ontology is real, like legal, medical, or compliance. And RLM should be evaluated where the navigation strategy is the answer, not just a means to it.
The more interesting direction is the intersection, and it has moved fast since RLM landed in December 2025: a wave of 2026 work, including RLM-on-KG, RLM-over-GraphRAG hybrids, and GraphWalk, has begun replacing the raw-text REPL with a graph as the navigable substrate. Graphs supply the structure, agents supply the questions, and the interesting work happens at the interface. We will return to that thread in a follow-up piece.