Big Model Small Model RAG: Optimizing Cost and Capability in Chatbots
In chatbot development, there is a constant push to achieve the best possible performance at the lowest possible cost. This pursuit has led many developers to explore strategies for building Retrieval Augmented Generation (RAG) solutions with large language models (LLMs). One such strategy is LLM pre-summarization, which plays a crucial role in the text chunking step of a RAG chatbot.
Large Language Models and Pre-Summarization
Large language models like OpenAI’s GPT-4 or Mistral.ai’s mistral-large are known for their excellent summarization capability. These models are invaluable for producing the high-quality, concise text chunks that RAG chatbot responses are built on, and pre-summarizing text up front helps ensure the chatbot can quickly and accurately surface relevant information for users.
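To make this concrete, here is a minimal pre-summarization sketch using the OpenAI Python SDK. The model name, prompt wording, and temperature are illustrative assumptions, not a reference implementation; any capable large model behind a chat-completion endpoint fills the same role.

```python
# Minimal pre-summarization sketch (assumes the OpenAI Python SDK).
# The model name and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SUMMARIZE_PROMPT = (
    "Summarize the following passage into a concise, self-contained chunk "
    "that preserves every key fact:\n\n{passage}"
)

def presummarize(passage: str, model: str = "gpt-4") -> str:
    """Condense one raw document chunk with a large model before indexing."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": SUMMARIZE_PROMPT.format(passage=passage)}],
        temperature=0.2,  # a low temperature keeps summaries factual
    )
    return response.choices[0].message.content
```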
While larger LLMs excel at pre-summarization, they are not always needed for generating responses to user questions. In many cases, smaller models like Mistral 7B Instruct, GPT-3.5 Turbo, and Mixtral 8x7B deliver excellent question-answering performance when given high-quality text chunks, and because they are much cheaper to call than larger models, they are the more cost-effective option for the response step.
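The answering side can then run on one of those cheaper models. A hedged sketch using the same SDK; the system prompt and model name are again assumptions:

```python
# Sketch: answer a user question with a smaller, cheaper model, grounded
# in the pre-summarized chunks returned by retrieval.
from openai import OpenAI

client = OpenAI()

def answer(question: str, chunks: list[str],
           model: str = "gpt-3.5-turbo") -> str:
    """Generate a grounded answer from high-quality summarized context."""
    context = "\n\n".join(chunks)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "Say so if the context is insufficient."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```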
The Dilemma of Model Selection
When designing RAG chatbots, developers often face a dilemma: pick a single model and either accept the high costs of a more powerful LLM or the limited summarization capability of a less expensive one.
To overcome this challenge and achieve optimal performance at lower costs, there is a growing trend towards mixing big and small models in RAG chatbots. By leveraging the strengths of both types of models, developers can create a more robust solution that effectively balances cost efficiency and summarization capability.
The Big Model Small Model RAG Approach
The Big Model Small Model RAG approach uses large language models for their superior pre-summarization abilities while relying on smaller models to generate responses from the well-summarized text chunks (a minimal end-to-end sketch follows the list below). This strategy allows developers to:
- Leverage the summarization prowess of big LLMs: Large models like GPT-4 and mistral-large are adept at condensing complex information into concise, relevant snippets. Using their pre-summarization capabilities helps ensure the chatbot provides users with accurate, focused responses.
- Reduce costs with smaller LLMs: Smaller models such as Mistral 7B Instruct, GPT-3.5 Turbo, and Mixtral 8x7B are much cheaper to call and still deliver excellent question-answering performance when provided with high-quality text chunks.
- Optimize performance and cost: By combining the strengths of big and small LLMs, developers can create a RAG chatbot that achieves superior performance at lower costs. This approach allows organizations to invest in the most suitable models for their specific use cases without sacrificing quality or affordability.
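Putting the pieces together, here is a minimal sketch of the two-stage pipeline, reusing the presummarize and answer helpers from the earlier snippets. The retrieve callable is a hypothetical stand-in for whatever vector-store lookup you already run, and the model names remain illustrative:

```python
# Sketch: the full Big Model Small Model pipeline. The big model is paid
# for once per chunk at indexing time; the small model serves every query.
BIG_MODEL = "gpt-4"            # illustrative large-model name
SMALL_MODEL = "gpt-3.5-turbo"  # illustrative small-model name

def index_documents(raw_chunks: list[str]) -> list[str]:
    """Offline step: summarize each raw chunk once with the big model."""
    return [presummarize(chunk, model=BIG_MODEL) for chunk in raw_chunks]

def chat(question: str, retrieve) -> str:
    """Online step: retrieve summarized chunks, answer with the small model."""
    chunks = retrieve(question)  # hypothetical vector-store lookup
    return answer(question, chunks, model=SMALL_MODEL)
```

The cost property worth noticing: the expensive model runs once per document chunk, offline, while each user query pays only for a small-model call.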
Conclusion
The Big Model Small Model RAG approach lets developers strike a practical balance between cost efficiency and capability. By leveraging the distinct strengths of large and small LLMs, organizations can build RAG solutions that deliver high-quality, accurate responses while keeping costs under control. As this trend continues to grow, we can expect a new wave of chatbots that push what is possible in cost-effective natural language processing.
- Human Intervention: None
Facts Used:
- Many RAG Chatbot (retrieval augmented generation) solutions use LLM (large language model) pre-summarization as part of the text chunking strategy.
- The advantages of LLM pre-summarization were covered in this article: https://ai.dungeons.ca/posts/chunking-techniques---llm-presummarization/
- A larger LLM like OpenAI’s GPT-4 or Mistral.ai’s mistral-large has excellent summarization capability. However, it might not be needed for generating responses to your users’ questions from the chatbot.
- Smaller LLMs like Mistral 7B Instruct, GPT-3.5 Turbo, and Mixtral 8x7B are much cheaper to call and can give excellent question-answering performance, given high-quality, well-summarized text chunks.
- Many use cases today will pick a single model and either suffer from high costs with larger models or limited summarization capability with smaller models.
- Mixing big and small models, and using each for the appropriate task, lets you optimize cost and capability, giving you a much better RAG chatbot operated at lower costs.