Designing for LLM Driven Applications

Designing for LLM-Driven Applications: A Comprehensive Guide

Large Language Models (LLMs) have revolutionized the way we interact with technology, pushing natural language understanding and generation to new heights. To harness the full potential of these powerful models in application development, it’s crucial to design prompts that bring out their best performance. This blog post will delve into the four essential elements of prompt design for LLM-driven applications: the system message, the augmentation, the pre-question prompt, and the user question.

System Message

The system message plays a pivotal role in shaping the tone and personality of the LLM’s responses. If you want professional, technical replies, incorporate that into the system message. Similarly, if you desire more conversational or irreverent outputs, reflect those qualities in your system message as well. This will help guide the model towards delivering the desired output style and format.

Augmentation

In RAG (Retrieval Augmented Generation) use cases, augmentation is critical for providing authoritative responses to user questions. By adding text chunks or distilled facts with context, you empower the LLM to answer queries accurately and efficiently.

When working with smaller parameter-count models (e.g., Mistral-7B), it becomes even more important to include the right data or facts in the prompt and to keep irrelevant information to a minimum. Optimize your chunking and recall strategy to get the best possible performance from the LLM.

Pre-Question Prompt

The pre-question prompt serves as a crucial guiding element for LLMs. A typical prompt starts with “Answer the following question using the facts above:”, but you will need to customize this message to steer the model away from irrelevant topics, such as discussing the facts or revealing its system message. Additionally, use this section of the prompt to specify the desired output format (e.g., JSON or XML), ensuring that the LLM adheres to your requested format.

Development cycles may involve iterating on the pre-question prompt to refine and improve the model’s responses. This “steerability” is a critical aspect of designing prompts for LLM-driven applications.

User Question

The user question, provided at the end of the prompt, directly impacts the LLM’s response generation process. While there are differing opinions on where to place the question within the prompt, our observations have shown stronger performance when placing the data/facts at the top and the pre-question and question at the bottom.
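
To make the layout concrete, here is a minimal sketch of how the four elements might be assembled into a chat-style prompt, with the facts at the top and the pre-question prompt and question at the bottom. The helper name, the example facts, and the message structure are illustrative assumptions rather than a prescribed implementation.

```python
# Sketch: assembling the four prompt elements for a chat-style LLM API.
# The facts and model wiring are illustrative assumptions.

def build_prompt(system_message: str, facts: list[str], user_question: str) -> list[dict]:
    """Facts at the top, pre-question prompt and user question at the bottom."""
    augmentation = "\n".join(f"- {fact}" for fact in facts)
    pre_question = "Answer the following question using the facts above:"
    user_content = f"{augmentation}\n\n{pre_question}\n{user_question}"
    return [
        {"role": "system", "content": system_message},  # tone and personality
        {"role": "user", "content": user_content},      # augmentation + pre-question + question
    ]

messages = build_prompt(
    system_message="You are a professional, technically precise assistant.",
    facts=[
        "MongoDB Atlas provides hosted vector search.",
        "Vectors are stored as arrays of floating point numbers.",
    ],
    user_question="Does Atlas support vector search?",
)
```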

By meticulously crafting each element of the prompt, developers can maximize the potential of LLMs in their applications, ultimately delivering more accurate, engaging, and user-friendly experiences for end-users.

  • Human Intervention: None

Facts Used:

  • Designing for LLM (large language model) driven applications should focus on the various elements of the prompt to the LLM
  • The prompt should be broken down into the following 4 elements: system message (bot identity and personality), the augmentation (data and facts) which is used to answer the question, the pre-question prompt used for steering the response output style and format, and finally the user provided question itself.
  • The system message should be designed to give the LLM responses a personality or tone for responses. If you want a professional tone with technical expertise, build that into the system message. Same if you want a more conversational, irreverent tone to the outputs.
  • The augmentation piece is critical for RAG (retrieval augmented generation) use cases. This is where you add your text chunks or distilled facts with context, so that the LLM can answer the question authoritatively. The smaller the parameter count of the model (ie mistral-7b) the more important it is to have the correct data or facts in the prompt and as little irrelevant data as possible. Optimize your chunking and recall strategy to get the biggest wins here.
  • The pre-question prompt is usually something like “Answer the following question using the facts above:”, which is a great starting point. You will need to modify this prompt to steer the model away from things like talking about the facts, or revealing it’s system message or sticking to JUST the facts provided and nothing else. This is also the place you specify the output format, if you want JSON or XML you ask for it here. You will spend some development cycles iterating on this part of the prompt to get the best possible outputs. This is your steerability.
  • The question itself is provided by the user and gets quoted into the prompt as the final step. There’s some arguments in the industry to put the question up at the top of the prompt, but I’ve observed stronger responses by laying out the prompt with the data/facts at the top of the prompt and the pre-question and question at the bottom.

Fact Synthesis - A Powerful Technique for Augmentation

Fact Synthesis: A Powerful Technique for Augmentation

The use of large language models (LLMs) has become increasingly prevalent in the industry, with retrieval augmented generation (RAG) emerging as a reliable method to achieve strong results from these models. Current techniques involve “chunking” documents into individual paragraphs or fixed numbers of words and running them through text embedding models for vector search later. While effective, this approach relies heavily on the ability of the embedding model to accurately represent semantic concepts within the chunks of text and requires constant experimentation and benchmarking for optimal results.

To address these limitations and improve the process of augmenting LLMs, we propose a new method: fact synthesis. This technique involves extracting individual facts from raw source texts and summarizing them with the assistance of LLMs. The first step in this process is to reduce the source data into individual facts, which can then be stored in a transactional data store like MongoDB.

Facts can be grouped together into “chunks” for vectorization using text embedders such as text-ada-002. When a user poses a question, it too can be vectorized and a simple similarity search performed to retrieve the relevant facts. The LLM can then generate an answer based on these retrieved facts.
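
Below is a rough sketch of this flow in Python, assuming pymongo and the OpenAI client; the connection string, database, collection, field, and index names are placeholders of my own choosing, not part of the original design.

```python
# Sketch: store individual facts, embed a grouped chunk, retrieve chunks by similarity.
# Requires `pip install pymongo openai`; names below are assumptions.
from pymongo import MongoClient
from openai import OpenAI

mongo = MongoClient("mongodb+srv://<cluster-uri>")
facts_col = mongo["brain"]["facts"]
chunks_col = mongo["brain"]["chunks"]
oai = OpenAI()

def embed(text: str) -> list[float]:
    return oai.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding

# 1. Store the individual facts extracted from the source text.
facts_col.insert_many([{"fact": f} for f in ["Fact one.", "Fact two."]])

# 2. Group facts into a chunk and store the chunk with its embedding.
group = [doc["fact"] for doc in facts_col.find().limit(10)]
chunk_text = "\n".join(group)
chunks_col.insert_one({"chunk": chunk_text, "embedding": embed(chunk_text)})

# 3. Vectorize the user question and retrieve relevant chunks with Atlas Vector Search.
question_vector = embed("What does fact one say?")
results = chunks_col.aggregate([{
    "$vectorSearch": {
        "index": "chunk_vector_index",   # assumed index name
        "path": "embedding",
        "queryVector": question_vector,
        "numCandidates": 100,
        "limit": 5,
    }
}])
```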

Fact grouping into chunks can be achieved through several methods:

  1. Fixed number of facts per chunk: Generate chunks containing a fixed number of facts, with no semantic boundaries between them. This approach is straightforward but may result in less coherent information.
  2. Topic-grouped facts: If metadata such as document names or chapters is available, it can be used to group related facts together by the context they were found in. This allows more facts per chunk, because the embeddings then represent facts from a single topic or knowledge domain (see the sketch after this list).
  3. Semantically similar facts: Grouping facts by how similar they are to one another enables very large text chunks with very few embedding vectors. This method is the most computationally expensive to generate, but it yields a large reduction in vector storage.
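
As a concrete example of the topic-grouped approach, here is a minimal sketch that buckets facts by a topic metadata field before embedding; the field names and example facts are assumptions.

```python
# Sketch: group facts into chunks by topic metadata (document name / chapter).
from collections import defaultdict

facts = [
    {"fact": "Atlas supports vector search.", "topic": "atlas"},
    {"fact": "Vectors are stored as arrays of floats.", "topic": "atlas"},
    {"fact": "Python is the de facto language for generative AI.", "topic": "python"},
]

chunks_by_topic = defaultdict(list)
for f in facts:
    chunks_by_topic[f["topic"]].append(f["fact"])

# One text chunk per topic, ready to be embedded.
chunks = [{"topic": topic, "chunk": "\n".join(group)} for topic, group in chunks_by_topic.items()]
```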

By adopting fact synthesis in your own application, you can leverage the power of LLMs while minimizing their reliance on irrelevant information within text chunks. The blog system FactWeave, for example, employs a different technique called “fact expansion” to generate its content. However, by exploring and implementing fact synthesis, developers can create more efficient and effective augmentation systems tailored to their specific needs.

To get started with fact synthesis, refer to the following GitHub examples:

  • Fact Synthesis: https://github.com/patw/ExternalBrain
  • Fact Expansion: https://github.com/patw/FactWeave

Facts Used:

  • RAG (retrieval augmented generation) has emerged as the most reliable technique for getting strong results from LLMs (large language models)
  • The current state of the industry is “chunking” documents. Word, PDF, HTML and other unstructured data sources are chunked into individual paragraphs or fixed numbers of words. These chunks are then run through a text embedding model which produces vectors and those vectors are stored in a vector search engine for retrieval later. The chunks we retrieve will be sent to the LLM prompt as part of the augmentation process. This lets the LLM provide grounded answers to users’ questions with retrieved data (chunks).
  • This technique relies heavily on the embedding model’s ability to represent the semantic concepts in the chunk of text, which can be hit or miss. You need to experiment and benchmark to get good results.
  • It also relies on the LLMs ability to reason through the irrelevant text in the chunk, because you will not always have a perfect chunk of text with only the concepts required to answer the question using RAG
  • We propose a new method of chunking with only point form, individual facts extracted from the raw source text and summarized by the LLM itself. This is fact synthesis
  • The first step in the process is to reduce the source data (usually blobs of text) into individual facts. These facts can then be stored individually in a transactional data store like MongoDB
  • Facts can be grouped together into a “chunk” and the chunk can be run through a text embedder model like text-ada-002 to generate vectors. When a user provides a question, this question can be vectorized as well and then a simple vector similarity search can be done to retrieve the facts relevant to the question, for the LLM to provide an answer.
  • Grouping facts into chunks can be done though a few methods: 1) Fixed number of facts per chunk. Generate the chunks with a fixed number of facts, with no real semantic boundary between the facts. 2) Topic grouped facts. If you have topic metadata (document name, document chapter) you can group the facts together by the context they were found in. This should allow you to group more facts together per chunk because the embeddings are representing facts that should be covering a single topic or knowledge domain 3) Semantically similar facts. You can group facts by how similar they are to other facts allowing you to generate very large text chunks with very few embeddings. This method will be the most compute expensive to generate but provides a large reduction in vector storage.
  • Call to action: Try Fact Synthesis in your own use case! This blog system (FactWeave) actually uses the opposite technique (Fact Expansion) to generate these blog posts.
  • See the following github example for Fact Synthesis: https://github.com/patw/ExternalBrain
  • See the following github example for Fact Expansion: https://github.com/patw/FactWeave

VAMPL Stack - Everything you need to build RAG solutions

VAMPL Stack: Everything You Need to Build RAG Solutions

In the ever-evolving landscape of generative AI, companies are diving headfirst into leveraging this powerful technology. As Retrieval Augmented Generation (RAG) chatbots become the go-to solution, developers need a robust yet agile stack that caters to maximum velocity in development. Enter VAMPL: Vectorizer, Atlas Mongo, Python, and LLM (Large Language Model).

The Power of Text Embeddings

The current state of the art for text chunk retrieval involves using text embedding models that produce dense vectors. Pair this up with semantic search, and you can augment the LLM prompt with your own knowledge chunks. This allows developers to create more sophisticated chatbots that can understand context and provide accurate answers based on a vast array of data sources.

Atlas Mongo: The Developer’s Dream

Atlas Mongo, a full-featured developer data platform, is a game-changer for teams building generative AI solutions. It offers a transactional database (document store), lexical search, and vector search, all fully hosted with robust security and backup features, which significantly reduces the cognitive load on developers. With a single MongoDB driver and MQL (the MongoDB Query Language) at your fingertips, integrating this powerful tool into your stack has never been easier.

The Unbeatable Versatility of Python

Python remains the de facto language for working with generative AI due to its numerous easy-to-use integrations with various LLM and embedding providers. Plus, the ability to run these models locally is a game-changer. Python’s dominance in data science ensures that it will continue to be at the heart of AI development for years to come.

The Key Technology: Large Language Models (LLMs)

The LLM and its summarization and reasoning abilities are the cornerstone technologies for building modern chatbots. RAG techniques have proven reliable and easy to implement, enabling developers to build generative chatbots that can answer questions with your company’s data sources. This level of reliability simply isn’t possible with raw model prompting alone.
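
Putting the pieces together, a minimal VAMPL loop might look like the sketch below, assuming the OpenAI client and pymongo; the connection string, index name, and field names are placeholders rather than a definitive implementation.

```python
# Sketch of the VAMPL loop: vectorize the question, retrieve chunks from Atlas, ask the LLM.
from pymongo import MongoClient
from openai import OpenAI

oai = OpenAI()
chunks = MongoClient("mongodb+srv://<cluster-uri>")["kb"]["chunks"]

def answer(question: str) -> str:
    # Vectorizer: embed the user question.
    qvec = oai.embeddings.create(model="text-embedding-ada-002", input=question).data[0].embedding
    # Atlas Mongo: retrieve the most relevant knowledge chunks.
    docs = chunks.aggregate([{"$vectorSearch": {"index": "vector_index", "path": "embedding",
                                                "queryVector": qvec, "numCandidates": 200, "limit": 5}}])
    facts = "\n".join(d["chunk"] for d in docs)
    # LLM: answer grounded in the retrieved chunks.
    prompt = f"{facts}\n\nAnswer the following question using the facts above:\n{question}"
    reply = oai.chat.completions.create(model="gpt-3.5-turbo",
                                        messages=[{"role": "user", "content": prompt}])
    return reply.choices[0].message.content
```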

Introducing the VAMPL Stack

By combining these powerful tools into one cohesive stack, you have everything you need to build RAG solutions for your organization. The VAMPL Stack offers:

  • Efficient and accurate text embedding models
  • Atlas Mongo’s robust data platform features
  • Python’s versatility and ease of use with LLMs
  • The power and reliability of Large Language Models (LLMs)

With this comprehensive stack, developers can build cutting-edge RAG chatbots that leverage your company’s knowledge base while providing users with accurate and contextual responses. Say goodbye to the limitations of traditional chatbot solutions—the VAMPL Stack is here to revolutionize the way you approach generative AI development.

  • Human Intervention: None

Facts Used:

  • With RAG (retrieval augmented generation) Chatbots becoming the go-to solution for companies jumping into the generative AI industry, I propose a new development stack for maximum velocity: VAMPL - Vectorizer, Atlas Mongo, Python and LLM (Large Language Model).
  • The current state of the art for text chunk retrieval is using text embedding models that produce dense vectors. Pair this up with semantic search and you can augment the LLM prompt with your own knowledge chunks
  • Mongo Atlas is a full Developer Data Platform with transactional database (document store), lexical search, vector search fully hosted with security and backups all accessible with a single mongo driver and using MQL for query. It’s a massive reduction in cognitive load for teams building genai solutions.
  • Python remains the de facto language for working with generative AI due to its many easy to use integrations with different LLM and embedding providers and even the ability to run these models locally. The world of data science is powered by Python
  • The LLM and its summarization and reasoning ability is the key technology for building modern chatbots. The RAG technique has proven to be reliable and easy to implement allowing you to build modern generative chatbots that can answer questions with your own company’s data sources. This isn’t reliable or even possible with raw model prompting.

Atlas Vector Search Collection Modelling

Atlas Vector Search Collection Modeling: Designing for Optimal RAG Chatbot Performance

In the world of retrieval augmented generation (RAG) chatbots, designing MongoDB collections to support vector search plays a crucial role in achieving optimal performance. This blog post will guide you through the essential factors to consider when modeling your MongoDB collections for efficient vector search, including your text chunking strategy and query filters. We’ll also discuss the importance of distinct vector field names to facilitate benchmarking and the storage of multiple vectors within a document.

Text/Knowledge Chunking Strategy

To support vector search in your RAG chatbots, you must first determine an effective text or knowledge chunking strategy. This involves breaking down the source documents into smaller, manageable chunks of text that can be indexed and queried efficiently. The choice of chunk size will depend on factors such as the complexity of the information and the specific use case for your chatbot.

Query Filters

A well-designed query filter is essential to ensure that only relevant results are returned by the vector search algorithm. Each field you need to filter on should be included in the vector search index, along with the chunk text and the resulting vector output of the text embedding process. This will enable your chatbot to retrieve accurate and precise information from the MongoDB collection during RAG interactions.

Storing Different Text for LLMs

In some cases, you may want to store different text for sending to the large language model (LLM) in the RAG process than the text used for vector embedding. For instance, pre-summarization techniques can be employed to improve recall and precision of semantic search while still providing the original source document text to the LLM. This dual approach ensures that your chatbot can provide accurate responses based on both summarized and full-text information.

Identifying Vector Field Names

To facilitate benchmarking different chunking and embedding strategies, it is crucial to assign unique vector field names that identify which embedding model was used to generate the vector. For example, “content_embedding_text_ada_002” represents the OpenAI text-ada-002 model. This naming convention will help you track the performance of different models and make informed decisions about optimizing your RAG chatbot’s search capabilities.
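
A hypothetical chunk document and matching Atlas Vector Search index definition, tying these points together, might look like the following. The filter field and text fields extend the post’s naming example, and creating the index via pymongo’s create_search_index assumes a recent driver version; the same definition can be entered in the Atlas UI instead.

```python
# Sketch: a chunk document plus an Atlas Vector Search index with a filter field.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

col = MongoClient("mongodb+srv://<cluster-uri>")["kb"]["chunks"]

col.insert_one({
    "chunk_text": "Text that was embedded (possibly a pre-summarized version).",
    "llm_text": "Original source text to send to the LLM.",
    "source_url": "https://example.com/doc.html",
    "product": "atlas",                               # example filter field
    "content_embedding_text_ada_002": [0.01] * 1536,  # vector from text-ada-002
})

index = SearchIndexModel(
    name="chunk_vector_index",
    type="vectorSearch",
    definition={"fields": [
        {"type": "vector", "path": "content_embedding_text_ada_002",
         "numDimensions": 1536, "similarity": "cosine"},
        {"type": "filter", "path": "product"},        # every filterable field goes here
    ]},
)
col.create_search_index(index)
```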

Scaling MongoDB Atlas Cluster for Vector Storage

As vector search collections can quickly grow in size, particularly with high-dimensional models like text-ada-002 (1536 dimensions, or nearly 12KB of floating point numbers per vector), it is essential to plan ahead and configure your MongoDB Atlas cluster accordingly. Size the cluster for the number of vectors you intend to store per document multiplied by the total number of documents, so that your RAG chatbot maintains optimal performance and responsiveness even as the data volume grows over time.

In conclusion, designing an efficient MongoDB collection for vector search in RAG chatbots requires careful consideration of text chunking strategy, query filters, differentiation between LLM input text, and unique vector field names to track model performance. By following these best practices and planning for scalability, you can create a powerful, reliable, and responsive RAG chatbot that delivers accurate information to users in real-time.

  • Human Intervention: None

Facts Used:

  • Designing MongoDB collections to support vector search, for your RAG (retrieval augmented generation) chatbots requires thinking about your text/knowledge chunking strategy and your query filters.
  • Every field you need to filter on will need to be included in the vector search index, along with your chunk text and the vector output of the text embedding.
  • It’s also possible you’ll want to store different text for sending to the LLM (large language model) in the RAG process than the text you are embedding on. You might be using pre-summarization techniques to get better recall and precision on your semantic search, but still want to send the original text from the source document to the LLM
  • Come up with a vector field name that identifies which embedding model you used to generate the vector, for example: content_embedding_text_ada_002 to represent the OpenAI text-ada-002 model. This will become important later when you are benchmarking different chunking and embedding strategies and are storing multiple vectors in your mongo document.
  • Warning: These collections can get quite large! With models like text-ada-002 being 1536 dimensions, this is nearly 12kb of floating point numbers! Plan ahead and size up your Mongo Atlas cluster large enough to handle the total number of vectors you want to store multiplied by the total number of documents.

Chunking Strategy

Chunking Strategy: A Comprehensive Approach to Text Vectorization

Text vectorization is an essential step in various Natural Language Processing (NLP) tasks, such as information retrieval, question answering, and sentiment analysis. One critical aspect of text vectorization is chunking, or breaking down the source documents into smaller, manageable pieces for processing. This blog post will delve into the intricacies of chunking strategies, highlighting various approaches and their implications on model performance.

The Importance of Chunking in Text Vectorization

Chunking is an iterative process involving experimentation to find the optimal way to break down source documents like Word, PDF, Text, or HTML files. These documents can be chunked into single sentences, fixed-length bytes, multiple sentences, paragraphs, pages, chapters, or even entire documents. A suitable starting point for this endeavor is utilizing the chunking functionality provided by libraries such as LlamaIndex or LangChain. However, it may be necessary to evolve and adopt more sophisticated methods based on specific project requirements.
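
As a starting point before reaching for the splitters in LlamaIndex or LangChain, a paragraph-level chunker can be as simple as the sketch below; the character budget is an arbitrary assumption.

```python
# Sketch: a simple paragraph-level chunker. LlamaIndex and LangChain provide
# equivalent (and more sophisticated) splitters; this just illustrates the idea.
def chunk_by_paragraph(text: str, max_chars: int = 2000) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```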

Token Limits in Text Embedding Models

Most text embedding models have a maximum token limit of 512 tokens, with exceptions like OpenAI’s text-ada-002 model offering an extended limit of 8192 tokens. These constraints significantly impact the chunking strategy as they directly affect the semantic representation and recall accuracy. While it might be tempting to fill up the entire token limit for models with higher thresholds, it is essential to consider that smaller amounts of text tend to capture more precise semantic details.

Improving Recall Accuracy through Recursive Chunking

Recursive chunking is an emerging technique aimed at enhancing recall accuracy by progressively breaking larger chunks of text into smaller pieces. This method involves vectorizing a larger segment of text, splitting it in half and vectorizing the halves, then splitting and vectorizing once more. Real-world results show this approach can improve recall accuracy by 10-20%, at the (higher) cost of seven vectors per original chunk: the whole chunk, its two halves, and the four quarters. Each Mongo document duplicates the original larger chunk of text so that it can be sent to the Large Language Model (LLM), and an MQL $group stage removes the duplicates at query time.
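
A rough sketch of recursive chunking is shown below: each original chunk yields seven pieces (the whole, two halves, four quarters), and every resulting document carries a copy of the original chunk. It reuses the chunk_by_paragraph helper sketched above and assumes an embed() function and a source_text variable that are not defined here.

```python
# Sketch: recursive chunking producing 7 pieces per original chunk (1 + 2 + 4).
def recursive_chunks(text: str, depth: int = 2) -> list[str]:
    pieces = [text]
    if depth > 0 and len(text) > 1:
        mid = len(text) // 2
        pieces += recursive_chunks(text[:mid], depth - 1)
        pieces += recursive_chunks(text[mid:], depth - 1)
    return pieces

docs = [
    {"original_chunk": chunk, "piece": piece, "embedding": embed(piece)}
    for chunk in chunk_by_paragraph(source_text)
    for piece in recursive_chunks(chunk)
]
# At query time, an MQL $group on "original_chunk" removes duplicate chunks
# before the text is sent to the LLM.
```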

Chunking and Sending Text to LLMs

It is crucial to consider that the chunked data might not be identical to the text submitted to the LLM for evaluation. In many cases, it may be beneficial to send a larger amount of text, such as an entire paragraph surrounding a specific sentence, particularly when employing sentence-level chunking. This approach ensures that the LLM has sufficient context to evaluate the question accurately and effectively. Be mindful of the token limits imposed by different LLMs (generally between 4,000 and 8,000 tokens) and use as much text as necessary without exceeding these constraints.

Pre-Summarizing Non-Textual Data

Not all text yields useful vectors for embedding models, particularly when dealing with tabular or point form data structures. A viable solution to this issue is pre-summarization. By sending such non-textual content to the LLM for summarization into a paragraph of semantically rich text, you can improve the likelihood of generating meaningful vectors. Text embedding models perform best when processing well-structured English texts.
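
A hedged sketch of pre-summarization follows, using the OpenAI chat API; the model choice and prompt wording are assumptions, and embed() stands in for whatever embedding call you use.

```python
# Sketch: pre-summarize a table or point-form list into prose before embedding.
from openai import OpenAI

oai = OpenAI()

def pre_summarize(table_or_points: str) -> str:
    prompt = ("Summarize the following table or point-form list into a single paragraph "
              f"of plain English, keeping every fact:\n\n{table_or_points}")
    reply = oai.chat.completions.create(model="gpt-3.5-turbo",
                                        messages=[{"role": "user", "content": prompt}])
    return reply.choices[0].message.content

summary = pre_summarize("| region | revenue |\n| EMEA | 10M |\n| APAC | 7M |")
vector = embed(summary)   # embed the summary, not the raw table
```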

Tailoring Chunking Strategies for Different Document Types

The chunking strategy required for vectorizing source documents will vary depending on their structure. For instance, spreadsheets, CSV files, or structured data (JSON or XML) will necessitate different approaches compared to unstructured Word and PDF documents or HTML files. In the former case, it may be challenging to employ the same workflow as for the latter, requiring adjustments and customizations according to each document type’s specific characteristics.

In conclusion, the chunking strategy plays a critical role in determining the effectiveness of text vectorization and subsequent NLP tasks. By carefully considering token limits, recursive chunking, sending the appropriate amount of text to LLMs, pre-summarizing non-textual data, and tailoring strategies for different document types, one can optimize their text processing workflows and achieve better model performance and accuracy.

  • Human Intervention: None

Facts Used:

  • There is no right answer for chunking your source documents (Word, PDF, Text, HTML). This is an iterative process of experimentation where the end result could be chunked by a single sentence, fixed byte length, multiple fixed sentences, whole paragraph, multiple paragraph, whole page, whole chapter, or even whole document. A good place to start is the chunking functionality in llamaindex or langchain, but you may need to evolve to more sophisticated methods.
  • Most text embedding models are limited to 512 tokens (words or word parts). The text-ada-002 model from OpenAI is the exception at 8192 tokens. These limits will greatly influence the chunking strategy.
  • Recall accuracy does not seem reliable at the edge of the token limit of the embedding model. The naive strategy is to just stuff the text up to 8192 tokens (e.g. text-ada-002) and hope for the best, but smaller amounts of text seem to capture the semantic details better.
  • However, you can also use a large token limit to get a vague semantic representation of a large chunk of text for multi-level vector search, so it depends on the requirement. See Recall Benchmarking post for more details.
  • Another emerging technique to improve recall accuracy is recursive chunking: Vectorize a larger chunk of text, split that text in half then vectorize those pieces, then split in half again and vectorize again. Real world results show up to 10-20% better accuracy, at the (higher) cost of 7 vectors. Each mongo document would duplicate the larger chunk of text, to send to the LLM (large language model) and you would perform an MQL $group stage to remove duplicates.
  • Your chunked data might not be what you stuff into the prompt for the LLM to evaluate. In many cases you might want to send a larger amount of text. This is especially true if you decide to chunk on a sentence level. It’s much better to send the entire paragraph surrounding the sentence, as it could contain more details for the LLM to evaluate against the original question. If you chunk on a paragraph, it might be useful to send the surrounding paragraphs. The LLM needs enough text to answer the question, what you vectorize might not be enough. Be aware of the token limit on the LLM model you are using (usually 4-8k) and use as much as you need for accuracy.
  • Not all text will produce useful vectors. Tables and point forms can produce unexpected results in the text embedding model. One solution to this is pre-summarization: You can send these tables or point form lists to the LLM to pre-summarize into a paragraph of semantically rich text, and then vectorize on that text. Embedding models tend to perform best with semantically rich english text.
  • Depending on the structure of the documents, different chunking strategies may be needed. If you are attempting to vectorize spreadsheets, csv files or structured data (json or xml), you will probably not be able to use the same workflow as your unstructured Word and PDF and HTML docs.

Embedding Model Selection

Embedding Model Selection: A Comprehensive Guide

Embedding models play a crucial role in natural language processing (NLP) applications, enabling accurate analysis and understanding of human language. Selecting the right embedding model can significantly impact the performance and effectiveness of your NLP solution. This blog post aims to provide you with an extensive overview of various embedding model options, along with their strengths and weaknesses, and offer guidance on how to choose the best model for your specific use case.

Getting Started: Text-Embedding Models in OpenAI

The easiest way to begin is by calling OpenAI’s text-embedding-ada-002 model directly, a 1536-dimensional model with high recall accuracy on non-industry-specific language. Most RAG use cases will already be using the gpt-3.5-turbo or gpt-4-turbo models as the large language model (LLM), so you will likely already have an API key and client library and can start using the embedding model without any additional effort.

Alternatives: Azure OpenAI, AWS Bedrock, and Google Cloud Platform (GCP) Vertex

In some cases, organizations may lack access or authorization to utilize OpenAI services. In such situations, alternative platforms like Azure OpenAI, AWS Bedrock, and GCP Vertex can be recommended as fallback options.

Azure OpenAI

For Azure OpenAI, the advice remains the same: use text-embedding-ada-002 as the embedding model. You will still find high recall accuracy with this model on non-industry-specific language.

Google Cloud Platform (GCP) Vertex

When working with GCP Vertex, the recommended embedding model is gecko-001, and the recommended LLM is Palm2. The model selection process should be based on your specific requirements and use case scenarios.

AWS Bedrock

On AWS Bedrock, try the Cohere embedding models for vectorization and Anthropic’s Claude as the LLM.

Open Source Models: HuggingFace’s MTEB Leaderboard

If you don’t have access to the aforementioned platforms or their models, there is a wide selection of open-source models on HuggingFace’s MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard). The Instructor family of models, for example, has demonstrated excellent accuracy on text-embedding tasks. New model families claiming better accuracy continue to emerge, so it’s essential to stay updated with the latest advancements.

HuggingFace models can be accessed directly through their API, as a drop-in replacement for the OpenAI text-ada-002 API, or downloaded and run locally using the sentence-transformers Python library. They can even be self-hosted as a service (see https://github.com/patw/InstructorVec for an example), making it easier for organizations to integrate NLP solutions into their systems.
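
For example, running a model locally with sentence-transformers can be as short as the snippet below; the model name is a common default chosen for illustration, not a recommendation from this post.

```python
# Sketch: running an open-source embedding model locally with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # any MTEB leaderboard model you've benchmarked
vectors = model.encode(["How do I enable vector search in Atlas?"])
print(vectors.shape)   # (number_of_inputs, embedding_dimensions)
```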

Understanding Model Cards: A Crucial Step in Selection

Before selecting any embedding model, thoroughly review its model card available on HuggingFace. This document contains valuable information about the model’s capabilities and limitations, such as context window length (ranging from 512 tokens to 8192 tokens), multi-lingual text embedding support, and more. By understanding these aspects, you can ensure that you choose a model best suited for your specific use case.

In conclusion, selecting the right embedding model is critical to achieving optimal performance in NLP applications. While this blog post has provided an overview of various options available, it’s essential to conduct thorough research and testing before finalizing a model. Remember, each model comes with unique strengths and weaknesses that may or may not align with your requirements.

  • Human Intervention: Minor Corrections, the LLM tried to correct GCP Vertex (a product name) to GCP Vertices. It also renamed the MTEB leaderboard to M5EB for some reason. This is probably one of the worst generated blog posts from the system so far.

Facts Used:

  • The easiest way to start is calling OpenAI directly with text-ada-002 as the text embedding model. It’s 1536 dimensions and has very high recall accuracy on non-industry specific language. Most use cases will be using gpt-3.5-turbo or gpt4-turbo models as the LLM (large language model), so they will already have an API key and a client library to use this.
  • Some larger customers don’t have access or authorization to use OpenAI, so the fallback here is to recommend Azure OpenAI, AWS Bedrock or GCP Vertex. With Azure OpenAI, the advice above stays the same. With Google GCP and Vertex the embedding model is gecko-001 and the LLM recommended is Palm2. With Bedrock try Cohere embedding models and Anthropic Claude as the LLM.
  • If none of these are available, there’s a wide selection of open source models on Huggingface on the MTEB Leaderboard. I’ve seen good accuracy with the Instructor family of models but new model families are showing up all the time, with claims of better accuracy: https://huggingface.co/spaces/mteb/leaderboard
  • Huggingface models can be accessed through the Huggingface API directly, as a simple replacement for the OpenAI text-ada-002 API. However, these models can also be downloaded and run locally with the sentence-transformer python library and even self hosted as a service. See this code for an example: https://github.com/patw/InstructorVec
  • Embedding models all have different strengths and weaknesses, they can vary in the length of the context window (512 tokens to 8192 tokens) and some can handle multilingual text embedding. Make sure to check out the model card on Huggingface before selecting a model to know what it’s capable of.

Enhancing Recall

Enhancing Recall: A Comprehensive Guide to Improving Vector Search Performance in RAG Chatbots

In the world of retrieval-augmented generation (RAG) chatbots, ensuring high recall and accuracy is crucial for providing users with relevant and accurate information. This blog post delves into various strategies and techniques that can help enhance recall in vector search applications, focusing specifically on improving the quality and efficiency of text chunk retrieval.

numCandidates: Striking a Balance Between Accuracy and Time

The numCandidates parameter in the Atlas Vector Search operator determines the number of nearest neighbors to evaluate for similarity. While low values can result in poor quality chunks, higher values increase the chances that the approximate nearest neighbor (ANN) algorithm will find something useful. However, very high values (>800) may lead to slow query performance.

To improve recall accuracy, it is recommended to set numCandidates to a minimum of 100-200 and apply a limit of 5-10 for the chunks sent to the large language model (LLM). Re-ranking the results in memory on the app tier, using a manual cosine re-rank with a cross-encoder, can further boost the text chunk scores; re-ranking 100-500 vectors is relatively fast for a small gain in accuracy.
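
The sketch below shows a $vectorSearch stage with these recommended values followed by an in-memory cross-encoder re-rank. The index and field names, the chunks_col handle, the question and question_vector variables, and the cross-encoder model are all assumptions for illustration.

```python
# Sketch: $vectorSearch with recommended numCandidates/limit, then a cross-encoder re-rank.
from sentence_transformers import CrossEncoder

pipeline = [{
    "$vectorSearch": {
        "index": "chunk_vector_index",
        "path": "content_embedding_text_ada_002",
        "queryVector": question_vector,   # produced by your embedding model
        "numCandidates": 200,             # 100-200 minimum recommended
        "limit": 10,                      # 5-10 chunks for the LLM
    }
}, {
    "$project": {"chunk_text": 1, "score": {"$meta": "vectorSearchScore"}}
}]
candidates = list(chunks_col.aggregate(pipeline))

# Re-rank the small candidate set in memory on the app tier.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(question, c["chunk_text"]) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
```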

Instructor Embeddings: Leveraging LLM-Style Prompting for Better Accuracy

The Instructor family of text embedding models requires an LLM-style prompt to generate vectors. By utilizing this additional prompting strategy, you can achieve better accuracy with fewer dimensions compared to OpenAI’s text-ada-002 model. This technique allows the embedding model to capture more context and nuance in your data, resulting in improved recall performance.

Multi-Level Vector Search: Achieving Better Precision Through Contextualization

Another effective approach for enhancing recall is through multi-level vector search. By first vectorizing a larger chunk of text, such as an entire chapter, and then also vectorizing individual paragraphs, you can narrow down the context to specific sections within your documents. This method helps mitigate false positives when multiple chapters contain semantically similar paragraphs.
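
A possible two-stage implementation is sketched below, assuming separate chapter- and paragraph-level collections, a chapter_id filter field in the paragraph index, and a qvec query vector; all of these names are placeholders.

```python
# Sketch: multi-level vector search - find the chapter first, then the paragraph within it.
chapter_hit = next(chapters_col.aggregate([{
    "$vectorSearch": {"index": "chapter_index", "path": "embedding",
                      "queryVector": qvec, "numCandidates": 100, "limit": 1}
}]))

paragraphs = paragraphs_col.aggregate([{
    "$vectorSearch": {"index": "paragraph_index", "path": "embedding",
                      "queryVector": qvec, "numCandidates": 200, "limit": 5,
                      "filter": {"chapter_id": {"$eq": chapter_hit["chapter_id"]}}}
}])
```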

Hybrid Search: Combining Text and Vector Searches for Improved Confidence

Hybrid search involves sending both text and vector searches simultaneously, allowing you to re-rank vector results where they intersect with text search results or include high-scoring text results that are missing from the vector search result set. By combining these two powerful search methods, you can increase your confidence in the relevance of the recalled chunks.
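
One simple app-side way to combine the two result sets is sketched below. The boost value is an arbitrary assumption, and because text and vector scores live on different scales you would likely normalize them or use reciprocal rank fusion in practice.

```python
# Sketch: app-side hybrid merge of $search (text) and $vectorSearch (vector) results.
def hybrid_merge(vector_hits: list[dict], text_hits: list[dict], boost: float = 0.1) -> list[dict]:
    text_ids = {h["_id"] for h in text_hits}
    vector_ids = {h["_id"] for h in vector_hits}
    # Boost vector hits that also rank in the text results.
    boosted = sorted(
        ({**h, "score": h["score"] + (boost if h["_id"] in text_ids else 0.0)} for h in vector_hits),
        key=lambda h: h["score"], reverse=True,
    )
    # Append high-scoring text hits that are missing from the vector result set.
    return boosted + [h for h in text_hits if h["_id"] not in vector_ids]
```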

Pre-Summarizing and Chunking: Optimizing Token Limits for Improved Recall

Pre-summarizing entire sections of your documentation using an LLM can help you work within the token limits of embedding models, making it easier to represent complex content effectively. In addition, applying this technique to any section that isn’t being represented well by the embedding model, such as point form, tables, or structured data (e.g., JSON and XML), may result in improved recall accuracy.

Adjusting Chunk Sizes: The Impact on Recall Accuracy

Changing your chunk sizes (number of tokens) can significantly impact recall accuracy and should be explored before accepting poor results. Different strategies, such as breaking down content into smaller or larger chunks, may yield better outcomes depending on the nature of your data and the specific requirements of your chatbot application.

As new embedding models continue to emerge, it’s essential to continuously benchmark and assess their performance against your specific data. By building reproducible tests and staying up-to-date with the latest developments in this field, you can ensure that your chatbot remains optimized for accuracy and recall over time.

In conclusion, enhancing recall in vector search applications requires a combination of strategic techniques and careful consideration of various factors. By implementing these strategies effectively, you can improve the quality and efficiency of text chunk retrieval, ultimately resulting in a better user experience for your RAG chatbot.

  • Human Intervention: Minor. It kept changing Instructor (name of an embedding model) to Instructors.

Facts Used:

  • The numCandidates in the Atlas Vector Search $vectorSearch operator determines the number of nearest neighbors to evaluate for similarity. Our recommendation is a minimum of 100-200 and applying a limit of 5-10 after that for chunks to send to the LLM (large language model) for your RAG (retrieval augmented generation) chatbot. Low numCandidates values (sometimes called K in other vector search engines) can result in poor quality chunks being retrieved. Higher numCandidates values will result in a better chance that the ANN (approximate nearest neighbor) algorithm will find something useful but it’s a trade off between accuracy and time. Very high numCandidates values (> 800) can result in slow query performance.
  • Re-ranking the results in-memory on the app tier, using a manual cosine re-rank with a cross encoder can result in better text chunk scores. Re-ranking 100-500 vectors is relatively fast for a small boost in accuracy.
  • The Instructor family of text embedding models requires an LLM style prompt to generate vectors. You can use this additional prompting strategy to get better accuracy for less dimensions than OpenAI’s text-ada-002 text embedding model
  • Multi-level vector search has worked well for some customers. The idea here is to use the large token limit (8192 tokens) of text-ada-002’s embedding model to summarize a large chunk of text, like an entire chapter of a book, then also vectorize the individual paragraphs. You run the first vector search against the “wider” context to narrow down what chapter the text is relevant to, then query vector search again to get the specific paragraph. This has been used to guard against false positives when multiple chapters can contain semantically similar paragraphs.
  • Hybrid Search can be used to increase the confidence you have in the recalled chunks. If you send a text search, along with a vector search. You can re-rank vector results where they intersect with text search results, or include high scoring text results that are missing from the vector search result set. The idea here is, if the vectors and tokens are both ranking highly, it’s probably a more relevant chunk.
  • Using the LLM to pre-summarize entire sections of your documentation allows you to easier work within the token limits of the embedding models. You can vectorize the smaller summarized text, and it may even have better recall than the original. This same technique should be applied with any section of your documents that isn’t being represented well by the embedding model, like point form, tables or even JSON and XML structured data. Yes, you can even summarize data in mongo collections!
  • Changing your chunk sizes (numbers of tokens) can have a dramatic effect on recall accuracy. Try different chunking strategies before accepting poor results.
  • Always be benchmarking! New embedding models are appearing all the time, and your specific data might be better represented by another model. Build reproducible tests.

LLM Prompting Strategy

LLM Prompting Strategy: Enhancing Accuracy and Overcoming Guard Rails

In the ever-evolving landscape of natural language processing, Large Language Models (LLMs) have emerged as powerful tools for generating human-like responses to a wide range of questions. To harness their full potential, it is crucial to implement an effective LLM prompting strategy that maximizes accuracy and overcomes guard rail limitations. This blog post will delve into the best practices for prompt engineering and provide insights on how to bypass guard rails in sensitive domains like healthcare or law.

The Basics of LLM Prompting

Most prompts to an LLM follow a pattern such as: “Can you answer the following question [question] based on the text below: [chunks]”. While this is a solid starting point, it may not always yield optimal results. To achieve higher accuracy, apply prompt engineering: take a known question/chunk pair and experiment with changes to the prompt wording (for example, “Can you answer the following healthcare question”) to see which variant produces the most accurate responses.

The Role of Chunks in Prompting

To improve the likelihood of obtaining an answer, it is recommended to send multiple chunks of data as part of the input. This approach allows for a more comprehensive understanding of the context and increases the chances of finding relevant information within the text. Current best practices suggest using 3-10 chunks per prompt.
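
A small sketch of chunk selection under a token budget follows; the four-characters-per-token estimate is a crude assumption, and a real tokenizer should be used in practice.

```python
# Sketch: stuff as many retrieved chunks as fit the LLM context, within the 3-10 chunk guideline.
def select_chunks(chunks: list[str], max_chunks: int = 10, token_budget: int = 3000) -> list[str]:
    selected, used = [], 0
    for chunk in chunks[:max_chunks]:
        tokens = len(chunk) // 4          # crude token estimate; use a real tokenizer in practice
        if used + tokens > token_budget:
            break
        selected.append(chunk)
        used += tokens
    return selected
```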

Guard Rails and Overcoming Them

One challenge faced by LLM users is guard rails that prevent certain types of questions from being answered. In sensitive domains such as healthcare or law, these restrictions can be particularly problematic, and creative prompting strategies may be required to work around them. Alternatively, the open-source ecosystem offers a selection of unguarded models capable of generating more sensitive responses when traditional approaches fail.

The Role of Documentation and URLs

When crafting an LLM prompt, there is no need to include the URL for the associated documentation (usually HTML or PDF links) as part of the input; even if you ask for it to be included in the response, there is a high chance the LLM will ignore it. Instead, store the URLs in a MongoDB collection and append them to the final response generated by the LLM. This method mirrors the approach taken by platforms such as Bing Chat.
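
A minimal sketch of appending stored source URLs to the final answer might look like this; the source_url field name is an assumption.

```python
# Sketch: append source URLs stored in MongoDB to the LLM's answer (Bing Chat style).
def answer_with_sources(llm_answer: str, retrieved_chunks: list[dict]) -> str:
    urls = {c["source_url"] for c in retrieved_chunks if c.get("source_url")}
    return llm_answer + "\n\nSources:\n" + "\n".join(f"- {u}" for u in sorted(urls))
```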

In conclusion, an effective LLM prompting strategy hinges on understanding the intricacies of prompt engineering, the benefits of using multiple chunks of data, overcoming guard rail limitations, and managing documentation URLs. By leveraging these techniques, users can unlock the full potential of LLMs and generate more accurate and useful responses for their specific needs.

  • Human Intervention: None

Facts Used:

  • Most prompts to the LLM (large language model) will follow a pattern like this: “Can you answer the following question ‘ ’ based on the text below: ”. This is a good starting place but you will most likely need to do prompt engineering to get the best possible result. Use the API or UI for your tool set to take a known question/chunk(s) pair and see what changes to the prompt (ie. Can you answer the following healthcare question) can result in higher accuracy for the response.
  • You can send more than one chunk of data, as long as it fits in the token limit of the LLM. Sending more chunks means it’s more likely to have an answer to the question. The current best practice is 3-10 chunks.
  • Be aware that the guard rails on the LLM could prevent some questions from being answered. This can be a serious problem for healthcare or legal use cases, as the LLM will try to prevent producing responses for these style of questions. This may require some creative prompting strategies to bypass these blockers. Alternatively, the open source LLMs have a selection of unguarded models that can be used to generate more sensitive responses.
  • You don’t need to provide the URL for the documentation (usually HTML or PDF links) as part of the prompt, there’s a high chance that even if you ask for it to be provided as part of the response, the LLM will ignore it. The URL for the documentation can be stored in the mongo collection and can be appended to the LLM response. This is similar to how Bing Chat works.

LLM Selection

LLM Selection: Choosing the Right Language Model for Your RAG Chatbot

In today’s world of advanced artificial intelligence (AI) and natural language processing (NLP), large language models (LLMs) play a crucial role in developing powerful chatbots. Retrieval-augmented generation (RAG) chatbots in particular have become indispensable across industries, from customer service to content creation. Selecting an LLM that optimizes cost and performance is critical to the success of your use case. This comprehensive blog post will delve into the world of LLMs, comparing popular options such as OpenAI’s GPT-4 and GPT-3.5 Turbo, Google’s Palm2, Amazon Bedrock, Cohere, Meta’s LLaMA2, and Mistral. We will also discuss the importance of 3rd party open source LLM providers and how they can drastically reduce costs for your chatbot development projects.

OpenAI’s GPT-4 and GPT-3.5 Turbo

OpenAI’s GPT-4 is by far the most advanced and sophisticated LLM to date, offering unparalleled accuracy and functionality. However, concerns about cost and rate limits might make it less attractive for some use cases. Thankfully, OpenAI offers much cheaper alternatives in gpt-3.5-turbo and gpt-4-turbo, which perform exceptionally well at a fraction of the price of GPT-4 while still handling zero-shot augmented summarization tasks robustly. Our recommendation is to start here, as these models provide an excellent balance between performance and cost efficiency.

Azure OpenAI Version

If your organization is not allowed to use OpenAI directly or operates within a more restrictive/high security environment, consider the Azure OpenAI version. This service integrates seamlessly with other Microsoft technologies, offering a secure and reliable platform for developing RAG chatbots.

Google’s Palm2

Google currently offers Palm2, which is mostly comparable to OpenAI’s offerings in terms of performance and functionality. If your organization already operates within the Google Cloud Platform (GCP) ecosystem, it is worth investigating Palm2 first as an alternative for your RAG chatbot development projects.

Amazon Bedrock Family of Products and Cohere

Amazon has made significant strides in the world of LLMs with its Bedrock family of products, which offers the Cohere models for embedding tasks. With Amazon’s recent investment in Anthropic (the Claude model), Bedrock is poised to become a formidable competitor in the LLM market, providing robust capabilities for developing RAG chatbots along with competitive pricing options for various use cases.

Open Source Models: Meta LLaMA2, Mistral, and 3rd Party Providers

Open source models such as Meta’s LLaMA2 family and its derivative fine-tunes (Alpaca, Wizard, Orca, and Vicuna) offer strong performance at a fraction of the cost of proprietary LLMs. These models can be hosted and executed locally, and quantized (reduced-precision) versions can run on regular CPUs and even laptops. If cloud or API costs are a concern, or if the sensitivity of your data means it cannot leave a local data center, these models are well worth considering.

Mistral, an up-and-coming open source LLM model, has demonstrated impressive results in various benchmarks, surpassing many of its competitors. We highly recommend considering Mistral for your next RAG chatbot development project.

Aside from open source models, numerous 3rd party providers have emerged offering their own customized LLMs tailored to specific use cases or industries. Investigating these options can lead to significant cost savings when compared to hosting your own LLM internally. By leveraging the expertise and infrastructure of these providers, you can focus on developing high-quality RAG chatbots while minimizing overhead costs.

In conclusion, selecting the right LLM model for your RAG chatbot project requires careful consideration of factors such as performance, cost efficiency, security requirements, and compatibility with existing technologies. By evaluating each option based on these criteria, you can ensure that your chosen LLM model will provide the necessary functionality to meet your specific use case needs while maximizing resource utilization and minimizing expenses.

  • Human Intervention: Minor. It keeps changing Cohere (company name) to Coherent. I’m not sure the guys at Cohere, who make really awesome embedding models would appreciate that!

Facts Used:

  • Large Language Models (LLM) are the most important part of your RAG (retrieval augmented generation) chatbot. Selecting the right one to optimize cost and performance is critical to the success of your use case.
  • In a perfect world, you’d start and end with OpenAI’s GPT4. It’s by far the most accurate and sophisticated LLM model to date. However, cost and rate limits can be a concern. As well, most models perform very well on zero-shot augmented summarization tasks so you may not need this level of functionality.
  • OpenAI offers a much cheaper alternative to GPT4 with gpt-3.5-turbo or gpt-4-turbo. These models perform very well at a fraction of the price of GPT4. Our advice is to start here.
  • If you are not allowed to use OpenAI directly, use the Azure OpenAI version in more restrictive/high security environments.
  • Google currently offers Palm2 which is mostly comparable to the OpenAI offerings. Customers who are in the GCP ecosystem should look into this first.
  • Amazon has the Bedrock family of products and Cohere for embeddings. With the recent Anthropic (Claude model) investment, they will have a very compelling offering.
  • There is a large selection of open source models like Meta’s LLaMA2, and derivative fine tunes (alpaca, wizard, orca, vicuna) that similarly perform well on these tasks and can be hosted and executed locally. Quantized (reduced precision) versions of these models can operate on regular CPUs and even on laptops. If cloud/api costs are a concern these are worth considering. If the sensitivity of the data doesn’t allow it to leave a local data center, this may be the only option.
  • Mistral, a newcomer to the open source LLM field has the strongest performing small open source LLM model I’ve seen to date, I highly recommend this one over the LLama2 family of models.
  • Many 3rd party open source LLM providers have appeared recently, and should be investigated for cost, before trying to host your own LLM internally. This could be drastically cheaper than standing something up yourself.

Recall Benchmarking

Recall Benchmarking: Optimizing Vector Search for RAG Chatbots

In the realm of Retrieval Augmented Generation (RAG) chatbots, benchmarking the recall of your knowledge or text chunks is critical to building an efficient and accurate AI system. Vector search technology powers this retrieval process, making it essential to optimize for both recall and accuracy in order to create a reliable and useful chatbot. This blog post will delve into the importance of vector search recall accuracy, the role of text embedding models and chunking strategies, and how to benchmark your model effectively.

The Impact of Text Embedding Models and Chunking Strategies

The effectiveness of a chatbot’s vector search functionality hinges on the selection of an appropriate text embedding model and the chunking strategy for your documents (e.g., word, PDF, HTML, or TXT). If chunks are too large, you risk losing semantic context; if they are too small, you may fail to capture the entire concept you intend to represent.

Additionally, low-dimensional embedding models may struggle to adequately represent multiple concepts in a single vector. While high-dimensional models can offer more complexity, it’s essential to benchmark their accuracy before implementing them. Some large dimension models might even include dimensions representing languages you never plan on embedding.

Benchmarking the Embedding Model with Cosine Similarity Function

As part of your Proof of Concept (PoC), you should develop a series of questions and answers and identify where they can be found within your documentation. By using a simple cosine similarity function, you can evaluate how well each question correlates with its corresponding chunk. It’s also important to “red team” some irrelevant prompts to rule out false positives and ensure the accuracy of your search results.

This benchmarking process should be conducted without involving the Large Language Model (LLM) at first, as it will help you understand the performance of your vector search before integrating the LLM into the equation.
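
A bare-bones version of this benchmark, with no LLM in the loop, might look like the sketch below; embed(), the all_chunks list, and the question/chunk pairs are assumptions standing in for your own data.

```python
# Sketch: benchmark recall with a plain cosine similarity function, no LLM involved.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

benchmark = [  # (question, id of the chunk that should answer it)
    ("How do I create a vector index?", "chunk_042"),
    ("What is the token limit of text-ada-002?", "chunk_107"),
]
red_team = ["What's a good lasagna recipe?"]   # should not match anything well

for question, expected_id in benchmark:
    qvec = embed(question)
    scored = sorted(((cosine(qvec, c["embedding"]), c["_id"]) for c in all_chunks), reverse=True)
    top_score, top_id = scored[0]
    print(question, "hit" if top_id == expected_id else "miss", round(top_score, 3))

for prompt in red_team:
    qvec = embed(prompt)
    best = max(cosine(qvec, c["embedding"]) for c in all_chunks)
    print("red team:", prompt, "max score", round(best, 3))   # should fall below your cut-off
```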

Managing False Positives and Setting Confidence Levels

False positives—high scoring chunks that are irrelevant to a user’s query—are one of the most significant challenges in achieving accurate recall accuracy. To mitigate this issue, set a confidence level (a high cosine similarity score) as a threshold for returning relevant chunks. If no chunk meets this threshold, the chatbot should respond with a message like “Sorry, I can’t answer this question.” This serves as the first critical guardrail against leaking underlying training data or providing hallucinated answers.

Experimenting with Different Embedding Models

When selecting an embedding model for your RAG chatbot, start with the text-ada-002 model from OpenAI and then explore models that rank highly on the semantic text similarity metrics of HuggingFace’s MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard). The goal is to find the combination of embedding model and chunking strategy that delivers the best results.

In summary, the success of your RAG chatbot relies heavily on optimizing vector search recall accuracy through careful selection of text embedding models and effective chunking strategies. By benchmarking these components thoroughly and setting appropriate confidence levels, you can ensure that your chatbot provides accurate and helpful responses to users’ inquiries while preventing false positives and hallucinated answers.

  • Human Intervention: None

Facts Used:

  • Benchmarking the recall of your knowledge/text chunks is critical to building your RAG (retrieval augmented generation) chatbots. Vector search is the technology that powers the retrieval, so we must optimize for search recall and accuracy.
  • Vector search recall accuracy comes down to the text embedding model selection and your chunking strategy of your documents (word, pdf, html, txt). If the text chunks are too large, you can lose semantic context, if it’s too small you might not capture the entire concept you are trying to represent. If the embedding model is low dimensional it might not have the ability to represent multiple concepts in a single vector. But also, the number of dimensions is only a rough estimate of how sophisticated the embedding model is, so you need to benchmark for accuracy. Some really large dimension models are using dimensions to represent languages you might never need to embed!
  • As part of your PoC you need to come up with a series of questions and answers, and where they can be found within your documentation. Using this you can benchmark the embedding model with a simple cosine similarity function to see how well the question and the chunk correlate. You should also “red team” some prompts that should not match your data to rule out false positives. This whole process can be done without the LLM (large language model) involved at all.
  • False positives (high scoring chunks that are not relevant to the search) are the #1 observed problem with recall accuracy in vector search use cases. You need to make sure your questions are not returning irrelevant data at high cosine similarity scores
  • Not every question can be answered. There must be a confidence level (high cosine similarity score) in the returned chunks before you send them to the LLM. Use the scores returned from the search engine and a defined score cut-off point to prevent irrelevant chunks from being sent for question answering. If no chunk scores well, return a message like “Sorry, I can’t answer this question”. This is your first major guardrail against your chatbot from leaking the underlying training data or providing hallucinated answers.
  • Experimenting with different embedding models is encouraged at this stage. Start with text-ada-002 model from OpenAI, and then try models that are high on the semantic text similarity metric on the HuggingFace MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard). What you want for an outcome is an embedding model and chunking combination that produce the best results. Use some data science here!