
Long Context: The End of RAG?

Recent advances in Large Language Models (LLMs) have pushed context windows into the millions of tokens. Until now, Retrieval-Augmented Generation (RAG) has been the primary way to supply a small amount of relevant information to an LLM so it can answer a user's question, and it was essential when context windows were limited to 2,048 or 4,096 tokens. This progress has sparked a question: do we still need RAG with long context models? Could we simply load our entire corpus into the context window instead? In this post, we'll explore the role of RAG and its continued relevance alongside long context models.

The Relevance of RAG with Long Context Models

Despite the advancements in long context models, RAG still holds its ground for several reasons:

  1. Computational Tricks and Shortcuts: Long context windows are built with techniques such as Sliding Window Attention, RoPE scaling, and Ring Attention (see the sketch after this list). These methods keep the computational cost manageable, but they are not without limitations.
  2. Natural Context Window: Every model has a natural context window: the exact size of the dense vector passed to the attention heads and the multi-layer perceptrons. Models reason best when the input fits within this natural window; anything beyond it relies on summarization techniques and loses information. This is why long context models do well on needle-in-a-haystack tests (finding a planted city name in a large corpus) but poorly on densely informative inputs.
  3. Performance Trade-offs: A long context query can cost $1 to $2 per query, which is not cost-effective for most RAG-style question-answering use cases at scale. Latency is also far higher than in a RAG system: a well-tuned RAG pipeline can retrieve in as little as 250 ms, while a long context model can take upwards of 30 seconds just to ingest the corpus before it starts answering.
  4. Relevance and Reasoning: A well-tuned RAG system places only the most relevant information in the context window, giving the lowest-latency retrieval and the strongest reasoning from the LLM.
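
To make the first point concrete, here is a minimal sketch, in NumPy and not any particular model's actual implementation, of the mask behind Sliding Window Attention: each token attends only to a fixed-size window of recent positions, which is what keeps the cost manageable at long lengths but also means distant tokens are never attended to directly. The sequence length and window size below are purely illustrative.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window attention mask: position i may attend only to
    positions j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Illustrative sizes only; real models use far larger values.
mask = sliding_window_mask(seq_len=8, window=3)
print(mask.astype(int))
# Each row has at most `window` ones, so attention cost grows as
# O(seq_len * window) rather than O(seq_len ** 2) for full attention.
```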

The Future of RAG and Long Context Models

In conclusion, while long context models have shown remarkable progress, RAG still plays a crucial role in building efficient and effective question-answering systems. RAG, combined with semantic search or a hybrid semantic/lexical approach, offers the best of both worlds: low-latency retrieval and strong reasoning. As AI and NLP continue to advance, long context models and RAG will likely coexist, each serving its own purpose in different use cases.
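
As an illustration of the hybrid semantic/lexical retrieval mentioned above, here is a minimal sketch in Python. It is not a production pipeline: `embed_fn` stands in for whatever embedding model you use, the lexical score is a crude term-overlap stand-in for something like BM25, and the blending weight `alpha` is purely illustrative. The point is that only the top-k chunks ever reach the LLM's context window.

```python
import numpy as np

def lexical_score(query: str, doc: str) -> float:
    """Crude lexical signal: fraction of query terms that appear in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def hybrid_retrieve(query, docs, doc_vecs, embed_fn, k=3, alpha=0.7):
    """Blend semantic (cosine) and lexical scores and return the top-k chunks.

    `embed_fn` and the pre-computed `doc_vecs` are assumed to come from
    whatever embedding model backs your vector store.
    """
    q_vec = embed_fn(query)
    # Cosine similarity between the query vector and each document vector.
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    lex = np.array([lexical_score(query, d) for d in docs])
    scores = alpha * sims + (1 - alpha) * lex
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

# Only these top-k chunks are placed in the prompt, keeping the input well
# inside the model's natural context window.
```

In a real system you would swap the lexical stand-in for BM25 and the embedding call for your vector store's query API, but the shape of the pipeline stays the same: retrieve a handful of relevant chunks, then let the LLM reason over them.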



  • Human Intervention: None

Facts Used:

    • Recent advances in models have extended context of Large Language Models (LLMs) into the millions of tokens
    • Retrieval Augmented Generation (RAG) has been the primary method of providing a small amount of relevant information to the LLM to answer users' questions. This was critical when context windows of LLMs were limited to 2048 or 4096 tokens.
    • A question has come up recently: Do we even need RAG with long context models? Why not just load in our entire corpus into the context window?
    • RAG is still relevant for multiple reasons:
    • Long context windows are created using multiple computational tricks and shortcuts like Sliding Window Attention, RoPE scaling and Ring Attention.
    • There is a natural context window to every model which is the exact size of the dense vector that passes to the attention heads and the multi-layer perceptrons
    • Models show the strongest reasoning capability when the input is within the natural context window. Anything beyond that relies on summarization techniques and loses information. Long context performs well in needle-in-a-haystack tests (finding things like city names planted in a large data set) but very poorly with dense information
    • The cost to perform a long context query can vary from $1 to $2 per query. This is not cost effective for most RAG style Question/Answer use cases at scale.
    • The latency associated with long context queries is beyond what most users would find acceptable. RAG with semantic search or even hybrid semantic/lexical retrieval can be as low as 250ms for retrieval. Long context models can take upwards of 30 seconds to just ingest the corpus before answering.
    • A well-tuned RAG system will have only the most relevant information in the context window, providing the lowest latency retrieval and strongest reasoning capability from the LLM.