
The journey from context retrieval to context navigation

Fatimah Alani · May 17, 2026

In the fast-evolving field of context and memory management for LLMs, Retrieval-Augmented Generation (RAG) has been the go-to method for augmenting a model's context with relevant information. RAG works by embedding documents into a vector space, then using semantic similarity between the query's embedding and the document embeddings to retrieve the most relevant chunks into the context window.

It works well enough for simple lookup, but it has well-documented failure modes. Vanilla RAG struggles with multi-hop questions because it can only see flat chunks with no relational structure between them. It suffers from lost-in-the-middle behavior: information buried in the center of a long context is systematically underused ^[6]. It exhibits context rot: every frontier model tested by Chroma Research (2025) degraded as input length grew, well before reaching the stated window size. And it has no way to answer global questions (questions about themes across an entire corpus) because no single chunk contains the answer.

In this article, we survey recent advances in RAG and context-curation methods, and compare them on a subset of the NoCha benchmark ^[5].

NoCha benchmark

The NoCha benchmark uses full-length recent novels as long, noisy contexts and turns them into a binary claim verification task: for each book, it generates pairs of nearly identical statements where one is true and the other is false, differing only in small but critical details like a character, location, or object. The goal is to test whether LLMs can reason globally over complex noisy narratives rather than relying on shallow retrieval tricks.

The full NoCha set covers 1001 pairs across modern copyrighted novels and is evaluated only through the official leaderboard. Instead, we use the public-domain subset from the marzenakrp/nocha repository, which ships four classic novels with 126 claims total, balanced 50/50 across true and false:

The Adventures of Sherlock Holmes (Conan Doyle): 36 claims
Little Women (Alcott): 30 claims
The Great Gatsby (Fitzgerald): 30 claims
Anne of Green Gables (Montgomery): 30 claims

For each book, we chunk the text at roughly 700 words with 100-word overlap (~1024 tokens per chunk) to feed the retrieval-based methods. The recursive language model setup ignores chunks entirely and inlines the whole novel as a variable, consistent with its navigation-not-retrieval premise.
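
The chunking step can be sketched as a sliding window over words. This is a minimal illustration, assuming whitespace tokenization; the function name chunk_words and the handling of the final short chunk are our own choices for the example, not a description of the exact preprocessing script.

```python
def chunk_words(text: str, size: int = 700, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()          # assumption: plain whitespace tokenization
    step = size - overlap         # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                 # this window already reaches the end of the text
    return chunks


# Toy input: 1,500 numbered "words" instead of a real novel.
demo_text = " ".join(f"w{i}" for i in range(1500))
demo_chunks = chunk_words(demo_text)
print([len(c.split()) for c in demo_chunks])
```

With the defaults, each chunk repeats the last 100 words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.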

Knowledge graphs

To understand this family of methods, we first need to define what a knowledge graph actually is. A knowledge graph has three core components:

Entities: the nodes, typically people, places, organizations, events, or concepts extracted from text.
Relationships: the edges between entities, describing how they connect.
Communities: clusters of densely connected entities that tend to belong to the same topic or theme.

GraphRAG: structure as a precomputed index

GraphRAG, developed by Microsoft ^[1], addresses the multi-hop and global-question failures described above by structuring the context search space as a knowledge graph.

The graph is constructed by chunking ingested documents and using an LLM-guided extraction prompt to identify the entities in each chunk, the relationships between them, and the weight (strength or frequency) of each relationship. Node2Vec is then used to embed entities into a latent space that reflects the distribution of nodes within the graph: co-occurring entities end up close together, and biased random walks during training capture both community membership and structural role.
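
To make the "biased random walk" concrete, here is a minimal sketch of a single node2vec-style second-order walk over a weighted graph. It only generates the walk; training embeddings from many such walks (skip-gram in the original Node2Vec) is omitted. The toy graph, the function name biased_walk, and the parameter values are illustrative assumptions, not GraphRAG's internals.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One node2vec-style walk. adj maps node -> {neighbor: edge weight}.

    p is the return parameter (small p makes stepping back more likely);
    q is the in-out parameter (large q keeps the walk near the previous node).
    """
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break  # dead end: stop the walk early
        weights = []
        for nxt in neighbors:
            w = adj[cur][nxt]
            if len(walk) > 1:
                prev = walk[-2]
                if nxt == prev:
                    w /= p          # step straight back to the previous node
                elif nxt not in adj[prev]:
                    w /= q          # move away from the previous node's neighborhood
            weights.append(w)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph: a triangle (a, b, c) with a tail (c - d - e).
toy_graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 1.0},
    "d": {"c": 1.0, "e": 1.0},
    "e": {"d": 1.0},
}
walk = biased_walk(toy_graph, "a", length=10, p=1.0, q=2.0)
print(walk)
```

Setting q above 1 keeps walks inside tight clusters, which favors community membership; setting it below 1 pushes walks outward, which favors structural role.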

The Leiden algorithm is then run over the entity-relationship graph to detect communities: tightly connected subgraphs that share a topic. Once communities are identified, each one is also embedded into the latent space, and the LLM generates a community report and a community summary describing what that community is about.

GraphRAG exposes three query modes, each suited to a different question type:

Local search is for entity-centric questions ("Who is X?", "What did Y do?"). The query is embedded into the latent space, the most semantically similar entities are extracted, and from there the relevant text chunks are pulled. Community reports are filtered and ranked, and the resulting context is passed to the LLM.
Global search is where vanilla RAG fails entirely and GraphRAG shines. It answers questions about themes across communities, e.g. "What are the dominant risk patterns in these 10,000 reports?" It works by indexing across community reports, finding those most relevant to the query, extracting numbered points from each with relevance scores, and using map-reduce to assemble a highly relevant context for the final answer.
DRIFT search is the most interesting of the three. The query is first translated into a hypothetical document using HyDE (Hypothetical Document Embeddings): the LLM generates an example of what an ideal answer document might look like, given the query. This synthetic document matches the format of the community reports, so it can be embedded into the same latent space and used to find the closest related communities. Initial answers are generated using those community reports along with follow-up questions. Local search is then run on each follow-up question, generating further follow-ups in an iterative loop. Relevance scores are accumulated across iterations, and map-reduce is used to extract the most relevant context. The aggregated context is finally passed to the LLM for the answer.
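
The map-reduce shape of global search can be sketched in a few lines. This is a simplified illustration rather than GraphRAG's implementation: in the real system an LLM extracts and scores the points, whereas here a hypothetical keyword_rater stands in for it so the example runs on its own. The word budget and the toy reports are also assumptions.

```python
def keyword_rater(report, query):
    """Stand-in for the LLM: score each sentence by how many query terms it contains."""
    terms = set(query.lower().split())
    points = []
    for sentence in report.split(". "):
        sentence = sentence.rstrip(".")
        score = len(terms & set(sentence.lower().split()))
        points.append((sentence, score))
    return points


def global_search_context(reports, query, rate_points, budget_words=50):
    """Map: rate points per report. Reduce: keep the best points within a word budget."""
    scored = []
    for report in reports:                      # map step, independent per report
        scored.extend(rate_points(report, query))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    context, used = [], 0
    for point, score in scored:                 # reduce step
        n_words = len(point.split())
        if score <= 0 or used + n_words > budget_words:
            continue
        context.append(point)
        used += n_words
    return context


# Hypothetical community reports.
reports = [
    "Fraud risk rises in retail. Weather was mild.",
    "Credit risk dominates lending reports.",
]
context = global_search_context(reports, "risk patterns", keyword_rater)
print(context)
```

The point of the structure is that each report is rated independently, so the map step parallelizes, and only the highest-scoring points compete for space in the final context.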

For evaluation on NoCha, we use Microsoft's GraphRAG library (v2.7.2) with the fast indexing pipeline and local-search retrieval. Local search is preferred over global search because each NoCha claim is a single-fact assertion whose truth value hinges on specific passages, while global search's map-reduce over community summaries would dilute that signal. The method extracts 3,789 entities and 67,625 entity-entity relationships, organized into 961 Leiden communities across 6 hierarchical levels (16 / 115 / 323 / 415 / 90 / 2 from level 0 to level 5); LLM-generated reports are produced for 934 of those communities (27 mid-level communities fall below the content-threshold filter). GraphRAG is the most resource-intensive method in our comparison: indexing took 2h22m of wall-clock time, of which the community-summarization step alone consumed 8,381s (98%). On NoCha, GraphRAG achieves F1 = 0.603.

HippoRAG: borrowing from the brain

HippoRAG ^[3] ^[4] takes a different bet from GraphRAG. Instead of pre-summarizing communities for global queries, it introduces a concept from the human cognitive system, the hippocampal memory indexing theory, in which the brain accomplishes two complementary tasks during memory processing:

Pattern separation: storing each experience as a unique trace so memories don't blur together.
Pattern completion: reconstructing a full memory from partial, often noisy cues.

These functions are split across three regions working together:

Neocortex: abstracts perceptual experiences into high-level features.
Parahippocampal regions (PHR): channel processed information into the hippocampus.
Hippocampus: maintains a sparse, context-rich index used to reconstruct memories at recall time.

HippoRAG maps this architecture directly onto retrieval components:

Neocortex becomes an instruction-tuned LLM that extracts entities and relations from text.
PHR becomes a retrieval encoder that detects synonymy between concepts.
Hippocampus becomes a knowledge graph plus Personalized PageRank for indexing and recall.

The system runs in two phases.

Offline indexing: building the artificial hippocampus

Passages are passed through the LLM-as-neocortex to pull the entities and relationships without a predefined ontology. The triples are aggregated into a schemaless knowledge graph: nodes are entities, edges are relations. The PHR's job is then to link conceptually similar nodes. A retrieval encoder embeds each node, and when two embeddings exceed a cosine-similarity threshold, an additional synonymy edge is added between them.
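
The synonymy-edge step can be illustrated with a short sketch. The vectors below are hand-made two-dimensional stand-ins for real encoder embeddings, and the 0.8 threshold is an arbitrary value chosen for the example; neither is taken from the HippoRAG implementation.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synonymy_edges(node_vecs, threshold=0.8):
    """Return (node_a, node_b, similarity) for every pair whose similarity exceeds the threshold."""
    edges = []
    for (a, va), (b, vb) in combinations(node_vecs.items(), 2):
        sim = cosine(va, vb)
        if sim > threshold:
            edges.append((a, b, round(sim, 3)))
    return edges


# Toy embeddings (assumed values, not produced by a real encoder).
node_vecs = {
    "Sherlock Holmes": [1.0, 0.1],
    "Holmes": [0.9, 0.2],
    "Baker Street": [0.0, 1.0],
}
syn_edges = synonymy_edges(node_vecs)
print(syn_edges)
```

Comparing every pair is quadratic in the number of nodes; a real system would more likely use an approximate nearest-neighbor index, but the thresholding logic is the same.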

Online retrieval: reconstructing memories from partial cues

At query time, the LLM-as-neocortex extracts named entities from the query. The retrieval encoder (PHR) maps each to its closest nodes in the graph; these become the query nodes, analogous to the partial memory cues the hippocampus would receive.

Personalized PageRank (PPR) is then run starting only from those query nodes. PPR is a random walker that diffuses probability mass through the graph, biased to stay close to where it started. This concentrates the search in the neighborhood of relevant nodes rather than scoring the entire graph. The resulting node scores are multiplied by the node-passage matrix to produce passage relevance scores, and the top passages become the context.
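
A minimal sketch of this retrieval step, using power iteration for PPR and a sparse dictionary in place of the node-passage matrix. The toy graph, the damping value, and the function names are assumptions for illustration; the reference implementation relies on a graph library rather than hand-written iteration.

```python
def personalized_pagerank(adj, seeds, damping=0.5, iters=50):
    """Power-iteration PPR. adj maps node -> list of neighbors; restarts go only to the seeds."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in adj}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in adj}
        for n, neighbors in adj.items():
            if neighbors:
                share = damping * rank[n] / len(neighbors)
                for m in neighbors:
                    nxt[m] += share
            else:
                # Dangling node: return its mass to the seeds so the total stays 1.
                for m in seeds:
                    nxt[m] += damping * rank[n] / len(seeds)
        rank = nxt
    return rank


def passage_scores(rank, node_passages):
    """Multiply node scores by node-passage counts (a sparse matrix product)."""
    scores = {}
    for node, passages in node_passages.items():
        for pid, count in passages.items():
            scores[pid] = scores.get(pid, 0.0) + rank.get(node, 0.0) * count
    return scores


# Toy graph with two disconnected clusters.
adj = {
    "holmes": ["watson", "baker street"],
    "watson": ["holmes"],
    "baker street": ["holmes"],
    "gatsby": ["daisy"],
    "daisy": ["gatsby"],
}
node_passages = {"holmes": {"p1": 1}, "watson": {"p1": 1, "p2": 1}, "gatsby": {"p3": 1}}
ranks = personalized_pagerank(adj, seeds={"holmes"})
scores = passage_scores(ranks, node_passages)
print(sorted(scores, key=scores.get, reverse=True))
```

Because the walker restarts only at the seed, the disconnected cluster never receives probability mass, which is the localization effect the paragraph above describes.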

We use the reference HippoRAG 2 implementation (v2.0.0-alpha.4) with gpt-4o-mini for OpenIE-style entity and triple extraction and text-embedding-3-small for entity embeddings. HippoRAG 2 constructs a "hippocampal" knowledge graph from extracted triples augmented with chunk-co-occurrence edges; retrieval at query time runs Personalized PageRank seeded at entities matching the query, diffusing probability across both the typed-triple layer and the dense co-mention layer. The OpenIE pass extracts 2,501 entity mentions and 470 typed triples over the 737 corpus chunks, yielding an internal entity-plus-chunk graph of 1,174 vertices and 1,400 edges. On NoCha, HippoRAG 2 achieves F1 = 0.683.

OG-RAG: structure from ontologies

OG-RAG ^[8] makes a third bet, orthogonal to both GraphRAG and HippoRAG. Where GraphRAG and HippoRAG let the LLM decide what entities and relations exist, OG-RAG forces them to fit a declared domain ontology.

This places OG-RAG squarely in the neurosymbolic AI tradition ^[2], albeit at its most lightweight end. Classical neurosymbolic systems try to combine neural pattern recognition with symbolic reasoning. OG-RAG combines neural extraction with symbolic typing: the LLM still generates the facts, but the symbolic layer constrains what kinds of facts it can generate.

Hypergraphs over typed entities

OG-RAG represents knowledge as a hypergraph rather than a standard knowledge graph. In a hypergraph, a single edge can connect more than two nodes, and each hyperedge here is an instance of a relation type declared in the ontology.

The construction pipeline is schema-constrained extraction: the LLM is prompted with the ontology and asked to extract only facts that fit the declared types. Malformed extractions (wrong type, missing required slot, unknown relation) are rejected. The result is a graph that is fully auditable: every node has a known type, every edge has a known schema, and every fact can be checked against the ontology that produced it.
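
The rejection rules can be sketched as a small validator. The ontology below is a hypothetical three-relation example written for this illustration (including one three-slot relation to show a hyperedge), and the Fact structure is our own; neither is the OG-RAG paper's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    relation: str
    args: tuple[tuple[str, str], ...]  # one (entity name, entity type) pair per slot


# Hypothetical ontology: relation name -> expected entity type for each slot.
ONTOLOGY = {
    "member_of": ("Person", "Organization"),
    "located_in": ("Place", "Place"),
    "occurred_in": ("Event", "Place", "Date"),  # a hyperedge with three slots
}


def validate(fact, ontology=ONTOLOGY):
    """Return None if the fact fits the ontology, otherwise a rejection reason."""
    signature = ontology.get(fact.relation)
    if signature is None:
        return "unknown relation"
    if len(fact.args) != len(signature):
        return "wrong number of slots"
    for (name, etype), expected in zip(fact.args, signature):
        if etype != expected:
            return f"wrong type for {name!r}: expected {expected}, got {etype}"
    return None


good = Fact("member_of", (("Alice", "Person"), ("Chess Club", "Organization")))
bad_type = Fact("member_of", (("Alice", "Person"), ("Bob", "Person")))
bad_relation = Fact("married", (("Alice", "Person"), ("Bob", "Person")))
for fact in (good, bad_type, bad_relation):
    print(fact.relation, "->", validate(fact) or "accepted")
```

Every accepted fact carries a relation and slot types that can be traced back to the ontology, which is what makes the resulting graph auditable.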

Retrieval over typed hyperedges

At query time, OG-RAG identifies the relevant types and relations implied by the query, retrieves the matching hyperedges, and unrolls them into context for the LLM without depending on the embedding distance between the query and each fact.

We implement OG-RAG with a generic fiction ontology comprising 6 entity classes (Person, Place, Event, Organization, Work, Date) and 7 typed relations (member_of, part_of, located_in, occurred_in, born_in, created_by, founded_by). Each chunk is processed by gpt-4o-mini with a schema-constrained extraction prompt that requests structured JSON output bounded by the ontology. From 737 chunks, OG-RAG produces 910 typed entities and 1,861 hyperedges, with 41 chunks failing schema-compliant extraction entirely. The schema constraint is imperfectly enforced: the extractor occasionally invents off-ontology predicates (married, 2 instances) and frequently violates implicit type signatures, e.g. member_of(Person, Person) is the single most common signature at 386 instances, despite "membership" being a Person→Organization relation in the intended schema. The ontology is generic and reused across all four novels rather than hand-tuned per book, providing what is closer to a lower bound for the method. On NoCha, OG-RAG achieves F1 = 0.310, the weakest of all methods compared, consistent with schema-constrained extraction discarding the narrative subtleties (emotional reactions, plot-arc continuity) on which NoCha discriminators turn.

Zep: when time becomes a first-class citizen

Zep ^[7] introduces a temporally aware, dynamic knowledge graph. Instead of building the graph once over a fixed corpus, Zep updates it continuously and non-destructively as new information arrives.

Zep's knowledge graph has three layers: the episodic subgraph, the semantic entity subgraph, and the community subgraph.

Episodes are the raw data ingested. Each carries a timestamp for when the event occurred. Zep uses a bi-temporal model: one timeline orders events by when they happened in the world, another by when Zep learned about them. The second supports auditing; the first is what makes Zep different: it lets the system reason about when something was true, not just whether it was recorded. Episodes link to the entities derived from their text via episodic edges, preserving the temporal occurrence of every fact.
Semantic entities sit on top of the episodic layer. When a new episode is ingested, an LLM extracts entities and resolves them against existing nodes. Relationships between entities are stored as entity edges, each carrying four timestamps: when the edge was created and expired in the database, and when the relationship became valid and invalid in the world. When new information contradicts an existing edge, Zep does not delete it: it marks it invalid at the right moment and adds the new edge. The graph becomes a dynamically updating factual layer rather than a static snapshot.
Communities are clusters of densely connected entities, similar to what is done in GraphRAG.
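
The non-destructive update on entity edges can be sketched with a small data structure. This is an illustration under simplifying assumptions: the field names are ours, and a contradiction is detected by a naive rule (same subject and relation, different object), whereas Zep relies on an LLM to decide whether a new fact contradicts an old one.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EntityEdge:
    subject: str
    relation: str
    obj: str
    valid_at: datetime                     # when the relationship became true in the world
    created_at: datetime                   # when the system recorded the edge
    invalid_at: Optional[datetime] = None  # when it stopped being true in the world
    expired_at: Optional[datetime] = None  # when the system marked the edge as superseded


def add_fact(edges, new):
    """Add a new edge and close any contradicted edge instead of deleting it."""
    for edge in edges:
        contradicts = (
            (edge.subject, edge.relation) == (new.subject, new.relation)
            and edge.obj != new.obj
            and edge.invalid_at is None
        )
        if contradicts:  # assumption: naive rule, see the note above
            edge.invalid_at = new.valid_at
            edge.expired_at = new.created_at
    edges.append(new)


def valid_on(edges, when):
    """Edges that were true in the world at time `when`."""
    return [
        e for e in edges
        if e.valid_at <= when and (e.invalid_at is None or when < e.invalid_at)
    ]


edges = []
add_fact(edges, EntityEdge("alice", "works_at", "Acme", datetime(2020, 1, 1), datetime(2020, 1, 5)))
add_fact(edges, EntityEdge("alice", "works_at", "Globex", datetime(2023, 6, 1), datetime(2023, 6, 3)))
print([e.obj for e in valid_on(edges, datetime(2021, 1, 1))])
print([e.obj for e in valid_on(edges, datetime(2024, 1, 1))])
```

Both edges remain in the list after the update, so a query about 2021 and a query about 2024 each get the answer that was true at that time.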

Search and reranking

Zep's retrieval is a hybrid search across all three subgraphs. The first stage runs three search functions in parallel:

Cosine similarity over embeddings.
BM25 over node and edge summaries.
Breadth-first search outward from anchor nodes already known to be relevant.

This gives three candidate sets capturing semantic, lexical, and structural notions of relevance.

A reranker then fuses the signals, typically using reciprocal rank fusion (RRF), maximal marginal relevance (MMR), graph-distance reranking, or a cross-encoder for the highest-precision settings. Crucially, no LLM runs at retrieval time, which is why Zep achieves around 300ms P95 latency.
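
Of these, reciprocal rank fusion is the simplest to show. The sketch below fuses three ranked candidate lists using the standard RRF score, the sum of 1 / (k + rank) over lists, with the commonly used constant k = 60. The candidate IDs are made up, and the alphabetical tie-breaking is our own choice.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical candidate lists from the three search functions.
cosine_hits = ["e1", "e2", "e3"]
bm25_hits = ["e2", "e4", "e1"]
bfs_hits = ["e2", "e1", "e5"]
fused = reciprocal_rank_fusion([cosine_hits, bm25_hits, bfs_hits])
print(fused)  # ['e2', 'e1', 'e4', 'e3', 'e5']
```

RRF only looks at rank positions, so it can combine cosine scores, BM25 scores, and graph distances without having to put them on a common scale.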

Zep is excluded from NoCha because the benchmark evaluates retrieval over closed, internally consistent corpora. There are no facts that become invalid, no relationships that change validity over time, no out-of-order episodes for the bi-temporal model to reconcile.

Recursive language models: when the model navigates its own context

All the previous methods rely on precomputed search indexes: different ways of organizing knowledge and processing the provided context before any query is run. In a recursive language model (RLM), the context is passed in as a variable, and the model decides at query time how to navigate it.

Introduced by Zhang, Kraska & Khattab (2025) ^[9], an RLM "treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt." In practice this means the prompt is loaded as a Python variable inside a REPL (read-evaluate-print loop) that the model can write code against. The model can slice the variable, grep it, regex it, chunk it, and spawn recursive sub-calls to fresh language model instances over individual chunks, each with its own clean context window. The final answer is returned once the root model is satisfied.

Why this is different

Every other method in this piece treats context management as a search and retrieval problem: how do we get the right context into the window? RLMs reframe it as a navigation problem: how do we let the model move through context itself? That shift moves the responsibility from the developer (designing chunking, embedding, graph extraction, retrieval) onto the model (writing the code that explores).

The mechanism borrows from a classical idea: out-of-core algorithms, the techniques computer scientists developed for datasets that don't fit in RAM. You don't load the data; you write a program that walks through it. RLMs apply the same logic to context. The model never sees the full input at once. It sees a handle to the input and a set of tools for paging through it.

We use the RLM reference implementation (Zhang, Kraska, Khattab; rlms package) in live mode. Unlike retrieval-based methods, RLM performs no indexing: the full novel is inlined as a context variable and a root model writes Python in a REPL environment, optionally launching recursive sub-LM calls on slices. Because gpt-4o-mini fell into the system prompt's narrative-buffering pattern (producing multi-paragraph summaries rather than TRUE/FALSE judgments), we prescribe a per-chunk-vote strategy in the wrapper prompt: split the context into 8 chunks, fire llm_query_batched with a SUPPORT/REFUTE/NONE prompt per chunk, then aggregate and enforce FINAL("TRUE"|"FALSE") as the only valid terminator (max_iterations=3). We evaluate two reader variants: RLM-mini (gpt-4o-mini as root) achieves F1 = 0.501, and RLM-strong (gpt-4o as root) achieves F1 = 0.627. The 12.6-point gap is structural rather than incidental: by offloading the decomposition, code-writing, and recursive routing onto the root model, RLM is only as strong as its reasoning baseline. gpt-4o-mini needs heavy prompt scaffolding just to produce a verdict, while gpt-4o brings more reasoning capacity to the same scaffold.
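
The per-chunk-vote strategy can be sketched outside the REPL as plain Python. This is an illustration of the control flow only: query_batch is a stand-in for the batched sub-call and makes no claim about the real llm_query_batched signature, the character-based split and the prompt wording are assumptions, and the rule "TRUE only if SUPPORT votes outnumber REFUTE votes" is one possible aggregation, not necessarily the one the root model writes.

```python
from collections import Counter


def split_context(context, n_chunks=8):
    """Split a string into at most n_chunks pieces of roughly equal length."""
    size = max(1, -(-len(context) // n_chunks))  # ceiling division, at least 1
    return [context[i:i + size] for i in range(0, len(context), size)]


def verify_claim(context, claim, query_batch, n_chunks=8):
    """Ask for one SUPPORT/REFUTE/NONE vote per chunk, then aggregate to TRUE or FALSE."""
    prompts = [
        f"Claim: {claim}\n\nPassage:\n{chunk}\n\nAnswer SUPPORT, REFUTE, or NONE."
        for chunk in split_context(context, n_chunks)
    ]
    votes = Counter(answer.strip().upper() for answer in query_batch(prompts))
    # Assumed aggregation rule: ties and no-evidence cases default to FALSE.
    return "TRUE" if votes["SUPPORT"] > votes["REFUTE"] else "FALSE"


def fake_batch(prompts):
    """Toy stand-in for the sub-model: keys on a single word in the passage."""
    answers = []
    for prompt in prompts:
        passage = prompt.split("Passage:\n", 1)[1]
        if "blue" in passage:
            answers.append("SUPPORT")
        elif "red" in passage:
            answers.append("REFUTE")
        else:
            answers.append("NONE")
    return answers


demo_context = "filler " * 20 + "the door was blue " + "filler " * 20
print(verify_claim(demo_context, "The door was blue.", fake_batch))
```

A fixed split like this can cut the decisive sentence across two chunks, which is one reason the result depends on how well the root model chooses its decomposition.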

Results summary

HippoRAG 2 dominates the Pareto frontier at F1 = 0.68 in ~3.5s. This suggests that even the global-reasoning claims mention salient entities that can serve as starting points in the graph. The system seeds Personalized PageRank (PPR) at those entities and follows connected nodes to find related information. Because the graph keeps many flexible connections instead of enforcing a strict schema, it preserves relationships that help answer complex questions quickly and accurately.

GraphRAG and RLM-strong cluster at 0.60–0.63 at roughly 10× the latency. OG-RAG at F1 = 0.31 is Pareto-dominated. RLM-mini at 0.50 reinforces the reasoning-baseline point; RLM-strong recovers to 0.63 but pays the heavy latency without matching HippoRAG on accuracy.

Two caveats: our 126-claim subset is from classic novels rather than the modern speculative fiction the paper flags as hardest because of extensive world-building, so absolute numbers sit on the easier end; and we report aggregate F1 without splitting retrieval-style vs global-reasoning pairs, since the public subset lacks that annotation.

The takeaway: even on a task designed to defeat retrieval, the right precomputed structure beats query-time strategy at a fraction of the cost, and the winning structure here matches how the corpus is actually organized (entities, aliases, dense connection) without overcommitting to a schema.

Conclusion

The arc this piece has traced runs from retrieval to navigation: RAG's flat chunks, GraphRAG and HippoRAG's precomputed structures, OG-RAG's typed schemas, Zep's temporal layer, RLM's query-time agency. The methods are stacked in roughly the order of how much responsibility they give the model versus how much they bake in offline. NoCha is one data point on that arc, not a definitive benchmark.

Different benchmarks should reorder the leaderboard in ways that match what each method is for. Zep needs LongMemEval or LoCoMo. GraphRAG should pay off on tasks built around explicit global synthesis. OG-RAG belongs on domains where the ontology is real, like legal, medical, or compliance. And RLM should be evaluated where the navigation strategy is the answer, not just a means to it.

The more interesting direction is the intersection, and it has moved fast since RLM landed in December 2025: a wave of 2026 work covers RLM-on-KG and RLM-over-GraphRAG hybrids, and GraphWalk has begun replacing the raw-text REPL with a graph as the navigable substrate. Graphs supply the structure, agents supply the questions, and the interesting work happens at the interface. We will return to that thread in a follow-up piece.

References

[1] Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). From local to global: A graph RAG approach to query-focused summarization. arXiv:2404.16130.
[2] Garcez, A. d'A., & Lamb, L. C. (2020). Neurosymbolic AI: The 3rd wave. arXiv:2012.05876.
[3] Jiménez Gutiérrez, B., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). HippoRAG: Neurobiologically inspired long-term memory for large language models. NeurIPS 2024. arXiv:2405.14831.
[4] Jiménez Gutiérrez, B., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2025). From RAG to memory: Non-parametric continual learning for large language models (HippoRAG 2). ICML 2025. arXiv:2502.14802.
[5] Karpinska, M., Thai, K., Lo, K., Goyal, T., & Iyyer, M. (2024). One thousand and one pairs: A novel challenge for long-context language models. EMNLP 2024. arXiv:2406.16264.
[6] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. TACL. arXiv:2307.03172.
[7] Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A temporal knowledge graph architecture for agent memory. arXiv:2501.13956.
[8] Sharma, K., Kumar, P., & Li, Y. (2025). OG-RAG: Ontology-grounded retrieval-augmented generation for large language models. EMNLP 2025. arXiv:2412.15235.
[9] Zhang, A., Kraska, T., & Khattab, O. (2025). Recursive language models. Preprint.