Transactional Knowledge: Build Better Chatbots

Transactional Knowledge: Build Better Chatbots

Chatbots have become an essential tool for businesses to provide quick, accurate answers to their customers’ (or internal) questions. One of the keys to building effective chatbots is having high-quality knowledge sources that can provide grounded and relevant responses. In this blog post, we will explore how using transactional knowledge can lead to better chatbots by enabling real-time updates, avoiding downtime, and empowering business users.

LLM-driven RAG Chatbots and Knowledge Sources

LLM (large language model) driven RAG (retrieval augmented generation) chatbots rely on high-quality knowledge sources to provide accurate, grounded answers. These sources can include Word documents, PDFs, text files, and website scrapes. Chatbots retrieve knowledge using semantic search, or a combination of vector search and lexical search called hybrid search.

To enable efficient retrieval, the unstructured text is “chunked” into smaller pieces. Each chunk of knowledge is then represented by a text embedder, which takes text as input and produces dense vectors used for similarity search. In a hybrid setup, the raw text of each chunk is also indexed by a lexical search engine like Lucene.
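
To make this concrete, here is a minimal sketch of token-limit-with-overlap chunking and embedding. It is an illustration only: the OpenAI Python library, the model name, and the chunk sizes are assumptions, and any embedding endpoint would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # any embedding provider works; the OpenAI library is just an example

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Naive token-limit-with-overlap chunking using whitespace tokens."""
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Represent each chunk as a dense vector for similarity search."""
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in response.data]
```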

Ingesting these sources allows a RAG workflow to ground the LLM’s answers with retrieved chunks, resulting in more accurate and relevant responses.

Knowledge Changes Over Time

However, knowledge sources are not static; they change over time. For example, if an HR document is used as a knowledge source, HR policies may change. To keep chatbot responses up-to-date, it is crucial to update the knowledge sources.

Current Chatbot Update Method: Nuke and Pave

In most current chatbot implementations, updates involve a “nuke and pave” operation. This method involves deleting the existing chunks in the vector index and reingesting, chunking, and vectorizing the documents. While this approach ensures that the chatbot’s knowledge is up-to-date, it has some pitfalls:

  1. Impact on Recall and Precision: Changing the source text and, consequently, the chunk boundaries can affect the recall and precision of vector and lexical search results.
  2. Downtime: The chatbot cannot answer questions during the index rebuild process. Workarounds like Blue-Green environment setups or multiple indexes can mitigate this issue but still require replacing the entire knowledge base.

Alternative: Transactional Knowledge

Transactional knowledge offers a better solution for updating chatbot knowledge sources. In this approach, chunks are stored in a database like MongoDB, and a simple CRUD (create, read, update, delete) application is built on top of the chunks. This allows for real-time editing and updating of the knowledge without needing to nuke the entire index.
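
As a rough sketch of what the “update” in CRUD looks like here, a single chunk can be edited and re-embedded in place while the rest of the index keeps serving queries. The collection name, field names, and the embed() helper below are hypothetical.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")        # assumed connection string
chunks = client["chatbot"]["knowledge_chunks"]           # hypothetical database/collection

def update_chunk(chunk_id, new_text: str, embed) -> None:
    """Replace one chunk's text, re-embed only that chunk, and stamp the update time."""
    chunks.update_one(
        {"_id": chunk_id},
        {"$set": {
            "text": new_text,
            "embedding": embed(new_text),                 # re-vectorize just this chunk
            "updated_at": datetime.now(timezone.utc),
        }},
    )
```

A thin CRUD front end over a function like this is all business users need to keep the knowledge current.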

The benefits of transactional knowledge include:

  1. Real-time Updates: Business users can update the chatbot knowledge anytime it needs updating, without waiting for scheduled downtime.
  2. No Downtime: Chatbots do not experience downtime during updates, enabling continuous service for users.
  3. Empowering Business Users: Business users can directly contribute to the chatbot’s knowledge base, streamlining the update process and eliminating the need for coordination among stakeholders.

Conclusion

Transactional knowledge transforms the chatbot’s knowledge management into a content management system, a familiar paradigm for business users who create content regularly. By adopting this approach, businesses can build better chatbots with real-time updates, no downtime, and empowered business users. It’s time to ditch the nuke and pave method and embrace the future of chatbot knowledge management.

  • Human Intervention: Minor. This applies to both internal and external chatbots; the LLM assumed it was external only.

Facts Used:

    • LLM (large language model) driven RAG (retrieval augmented generation) Chatbots rely on high quality knowledge sources to give accurate, grounded answers.
    • Chatbots retrieve their knowledge using semantic search or a combination of vector search and lexical search, called hybrid search
    • Chatbot knowledge sources (word docs, pdf docs, text files, site scrapes) are typically unstructured blobs of text that get broken up into pieces called “chunking” to allow each chunk of knowledge to be meaningfully represented by a text embedder, which inputs text and produces dense vectors which are used for similarity search later. In a hybrid scenario we also search the raw text with a search engine like Lucene.
    • In a pilot or PoC environment, we will typically use an integration library like LangChain to ingest and chunk our document sources, which are usually just sitting in a folder on a server or on our laptop.
    • Ingesting all these sources results in our ability to run a RAG workflow and ground our LLM's answers with retrieved chunks, so the answers are closer to the source material and not hallucinated from the LLM's pretraining.
    • However, knowledge changes over time. If we ingested an HR doc, maybe the HR policies have changed over time. How do we update knowledge?
    • In most current chatbot implementations users will do a nuke/pave operation: The current chunks in the vector index will be deleted, and the documents will be reingested, chunked and vectorized.
    • This style of update has some pitfalls: The default chunking method counts tokens in the source text and stores a fixed number of them, usually with some overlap on each side. This is called token limit with overlap method. If you change your source text, you change your chunk boundaries, and can potentially impact your recall and precision of your vector and lexical search results.
    • Nuke and pave also results in some downtime in your chatbot, meaning it can’t answer questions while the index is being rebuilt. You can get around this with a Blue-Green environment setup or multiple indexes, but again you’re replacing the entire knowledgebase!
    • An alternative to this is Transactional Knowledge. Imagine you store your chunks in a regular database, like MongoDB, and you build a simple CRUD application on top of your chunks. You can edit, and update the knowledge in real time, without having to nuke the entire index.
    • This also means you no longer need to schedule downtime for your chatbot, or coordinate all the business users and stakeholders to make all their updates in the same nightly window. Business users can be empowered to update the chatbot knowledge anytime it needs updating.
    • Transactional Knowledge is treating your chatbot’s knowledge like a content management system, which is a familiar paradigm for any business user who creates content right now.
    • Ditch the nuke and pave! Make your chatbot update in real-time!

Tool Series - AudioSumma

Tool Series - AudioSumma

Introduction

AudioSumma is a powerful tool that records the global audio (input and output) on a laptop or desktop, transcribes the audio into a transcript, and uses a Large Language Model (LLM) to summarize the transcript. This tool is particularly useful for professionals who need to analyze, summarize, and extract key information from long audio conversations. In this blog post, we’ll dive into the details of how AudioSumma works, its features, and its limitations.

How AudioSumma Works

AudioSumma works entirely locally, using whisper.cpp for audio transcription and llama.cpp for calling the LLM for summarization. The summarization step is broken into parts (about 12k characters of text, or roughly 15 minutes of audio), and each part is summarized independently; this maximizes the reasoning capability of the LLM and keeps each request within the LLM context window.

To use AudioSumma, you’ll need Whisper and llama.cpp running in server mode on their default ports, either on your laptop or on a separate machine. You’ll also need the en_base model for whisper.cpp; for the summarization task, smaller models like Microsoft Phi-3 can produce decent summaries, but a Llama 3 8B model is recommended.
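
As a rough sketch of the summarization loop (not the actual AudioSumma source), the transcript can be split into roughly 12k-character parts and each part summarized against llama.cpp's OpenAI-compatible endpoint. The port, prompts, and part size below are assumptions.

```python
import requests

LLAMA_URL = "http://localhost:8080/v1/chat/completions"  # assumed llama.cpp server endpoint
PART_SIZE = 12_000                                        # roughly 15 minutes of transcribed audio

def split_transcript(transcript: str) -> list[str]:
    return [transcript[i:i + PART_SIZE] for i in range(0, len(transcript), PART_SIZE)]

def summarize(part: str, instruction: str) -> str:
    """Send one transcript part and one instruction to the local LLM."""
    resp = requests.post(LLAMA_URL, json={
        "messages": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": part},
        ],
    })
    return resp.json()["choices"][0]["message"]["content"]

for i, part in enumerate(split_transcript(open("transcript.txt").read()), start=1):
    print(f"--- Part {i} ---")
    print(summarize(part, "Summarize what was discussed, in what order, and why it mattered."))
    print(summarize(part, "List only the exact facts stated, such as project names or timelines."))
    print(summarize(part, "Describe the tone of this part of the call and whether it was positive or negative."))
```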

Features

AudioSumma offers three distinct summaries for each part of the call:

  1. Overall Summary: This summary provides an understanding of what was discussed, in what order, and why it mattered.
  2. Fact Summary: This summary includes exact facts stated in the call, such as project names or timelines.
  3. Sentiment Summary: This summary helps you understand the tone of the call and whether it was positive or negative.

These summaries are designed to help users quickly grasp the main points, facts, and emotional undertones of the conversation, making it easier to review and analyze the content.

Limitations

While AudioSumma is a fantastic tool, it has a few limitations to consider:

  1. Most laptops’ default recording device will capture global audio, but macOS and some lower-end laptops only record from the microphone, which limits the tool’s usefulness in those environments.
  2. The summarization process is divided into parts (roughly every 15 minutes of audio). This approach can result in some information being overlooked or not included in the summaries.

Conclusion

Overall, we love AudioSumma and use it daily. It’s a valuable addition to our growing set of AI tools that help with work. If you’re looking for a way to quickly analyze and summarize audio conversations, AudioSumma is definitely worth exploring.

You can find the project on GitHub at: https://github.com/patw/AudioSumma.

Note: AudioSumma is a tool that I have created and maintain. If you have any questions or suggestions, feel free to reach out to me on GitHub or via email.

  • Human Intervention: None

Facts Used:

    • AudioSumma is a tool to record the global audio (input and output) on a laptop or desktop, transcribe the audio into a transcript and use an LLM to summarize the transcript
    • It works entirely locally using whisper.cpp for audio transcription and llama.cpp for calling the LLM for summarization.
    • whisper.cpp only needs the en_base model to work mostly accurately, and you could get decent summaries from Microsoft Phi3 but I still suggest a Llama 3 8B model for the task.
    • Whisper and Llama.cpp must be running in server mode, on the default ports. This can be on your laptop or on a separate machine.
    • The summarization step will be broken down into different parts (12k of text or roughly every 15 min of audio) and each part is summarized independently. This was done to maximize the reasoning capability of the LLM and to ensure we didn’t exceed the LLM context window.
    • Each part of the summary includes 3 different sub sections: An overall summary of that part of the call, a fact only summary and a sentiment summary.
    • The overall summary is useful for understanding what was discussed, in what order and why it mattered.
    • The fact summary is useful for seeing exact facts stated in the call like project names or timelines
    • The sentiment summary is useful to understand the tone of the call and if it was positive/negative. Some calls are difficult, and you need to capture that.
    • The entire thing is written in Python and uses PyAudio. It always uses recording device -1, which is the default in windows for the default recording device.
    • Most laptops default recording device will record global audio but MacOS and some lower end laptops seem to only record the mic. This limits the usefulness of the tool in those environments
    • Overall, I love this tool and use it daily. It’s a nice addition to my growing set of AI tools that help with work.

Tool Series - Natralang

Tool Series - Natralang: Natural Language Query for MongoDB

Introduction

Natralang is a Natural Language Query (NLQ) project that aims to change the way we interact with databases, specifically MongoDB. Natralang allows users to query structured data from databases using natural language, eliminating the need to learn SQL or MQL. In this blog post, we will explore the key features of Natralang, how it works, and how it can enhance your work with MongoDB.

What is Natralang?

Natralang is a tool designed to query Mongo collections and mimic the functionality of MongoDB Compass. It is a showcase of what is possible using Natural Language Query, but it would require some work to make it production-ready. The project can be found on GitHub at https://github.com/patw/Natralang.

How does Natralang Work?

Natralang uses semantic search over vectorized descriptions of each data source to find the collection most likely to answer a question, and only that collection’s schema and example data are placed in the LLM (large language model) context window. This lets Natralang work effectively with databases that have a large number of collections, without polluting the context window with excessive schema or example data.

Each data source in Natralang can have a different connection string, allowing it to work across multiple Mongo instances. The reliability of the query generation is approximately 80% for simple to moderately complex queries, but it can decrease significantly for more complex queries. Improved LLM models could potentially enhance this aspect.
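
The overall NLQ loop can be sketched roughly as follows. This illustrates the pattern rather than Natralang's actual code; the prompt, the source descriptor, and the llm() callable are assumptions, and the data source is assumed to have already been chosen by the semantic search step described above.

```python
import json
from pymongo import MongoClient

def answer_question(question: str, source: dict, llm) -> str:
    """Natural Language Query: generate a query, run it, then answer from the results."""
    # 1. Ask the LLM for a query, giving it the schema and a few example documents.
    pipeline = json.loads(llm(
        f"Schema: {source['schema']}\n"
        f"Example documents: {json.dumps(source['samples'])}\n"
        f"Question: {question}\n"
        "Return only a MongoDB aggregation pipeline, as JSON, that answers the question."
    ))

    # 2. Execute the generated query against the selected collection.
    coll = MongoClient(source["connection_string"])[source["db"]][source["collection"]]
    results = list(coll.aggregate(pipeline))

    # 3. Feed the results back, with the original question, for a natural-language answer.
    return llm(
        f"Question: {question}\nQuery results: {json.dumps(results, default=str)}\n"
        "Answer the question in plain language using only these results."
    )
```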

Benefits of Using Natralang

  1. User-friendly: Natralang eliminates the need to learn SQL or MQL, making it easier for users to interact with databases using natural language.
  2. Efficient: The semantic search and vector-based approach allows Natralang to work effectively with databases containing a large number of collections.
  3. Flexible: Natralang can connect to multiple MongoDB instances using different connection strings.
  4. Accessible: The project is openly available on GitHub, encouraging collaboration and innovation in the community.

Limitations

While Natralang offers several benefits, it currently serves as a showcase for what is possible rather than a fully production-capable tool. Users should consider this when incorporating Natralang into their projects.

Conclusion

Natralang is a promising tool that demonstrates the potential of Natural Language Query in the realm of MongoDB. By utilizing semantic search and vectors, Natralang enables users to interact with databases using natural language, simplifying the process and reducing the need for specialized knowledge. While currently a work in progress, the project holds great potential for future development and innovation. We encourage the community to explore Natralang and contribute to its growth on GitHub.



  • Human Intervention: It tried to put a fake wikipedia article link at the bottom!

Facts Used:

    • Natralang is short for Natural Language Query (NQL)
    • NQL is a popular style of chatbot that is used to query structured data from databases.
    • In this case, Natralang is designed to query Mongo collections to approximate the functionality we have in the MongoDB Compass product.
    • Natural Language Query is using an LLM to parse a query like “What were my sales like last quarter?”, along with the schema for your database and some example data (few shot learning) to produce a query in your data engine that will answer the question. You execute the query, and then feed the results back to the LLM, along with the original question to get a natural language answer to a natural language question. No need to learn SQL or MQL.
    • Natralang uses semantic search and vectors to allow the initial query to find the most appropriate data set that could answer the question. This allows for databases with a large number of collections to be used effectively, without polluting the LLM context window with too much collection schema/examples.
    • Each data source can have a different connection string, allowing it to work across many mongo instances.
    • The reliability of the query generation is close to 80% for simple to moderately complex queries but can drop drastically for complex queries. Better LLM models could help here.
    • It’s a showcase for what is possible, but would need some work to be production capable.

Do you need an Integration Library?

Do You Need an Integration Library? A Look at RAG and LLM Chatbot Development

In recent months, the Retrieval Augmented Generation (RAG) and Large Language Model (LLM) space has gained significant attention. Many developers believe that they require integration libraries such as LangChain or LlamaIndex to build chatbots in this domain. However, this is not entirely accurate. In this blog post, we will explore the need for integration libraries and alternative approaches for building RAG/LLM chatbots.

Integration Libraries: The Pros and Cons

Integration libraries like LangChain and LlamaIndex can significantly reduce the time to market for chatbot development by coordinating all the necessary components. They handle tasks such as ingesting documents, calling text embedding models, storing embeddings in a vector store, performing semantic search, augmenting the LLM with search results, and calling the LLM itself. While these libraries can be highly productive when used as intended, they also come with a complexity tax.

If you plan to deviate from the standard workflow, such as using a text ingestion, summarization, or chunking method the library does not support, or injecting custom guardrails into the LLM prompt or response, you may need to override parts of these libraries. Doing so can be challenging because of their complex internal structures.

Alternative Approach: Building RAG Chatbots without Integration Libraries

As a developer, you can build RAG chatbots using the same programming languages, API calling, and basic string manipulation you already know. This alternative approach offers greater control over your application and eliminates the complexity tax associated with integration libraries.

To achieve this, follow these steps:

  1. Learn how to call text embedding models and LLMs through REST endpoints or use higher-level libraries like the Python OpenAI library.
  2. Parse the vectors returned by text embedding models and store them directly in your vector store or database.
  3. Augment LLM inputs with text chunks from your semantic search results using string manipulation.
  4. Handle LLM outputs, which are also just text. The completion API returns the full response at once, while streaming adds a little more complexity; for streaming, a JavaScript front end with websockets works well.
  5. Manage the conversational history using string manipulation.
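
To make the list above concrete, here is a minimal sketch of one RAG turn without any integration library. It assumes an OpenAI-compatible endpoint and a vector_search() helper over your own store; both are placeholders you would supply.

```python
from openai import OpenAI

client = OpenAI()  # or any OpenAI-compatible embedding/chat endpoint

def rag_turn(question: str, vector_search, history: list[dict]) -> str:
    """Embed the question, retrieve chunks, augment the prompt, call the LLM."""
    query_vector = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    chunks = vector_search(query_vector, limit=4)            # your store, your query
    context = "\n\n".join(chunks)                            # augmentation is string manipulation

    messages = history + [{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    }]
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    answer = reply.choices[0].message.content

    history.append({"role": "user", "content": question})    # conversational history is
    history.append({"role": "assistant", "content": answer})  # also just string handling
    return answer
```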

By following this approach, you can leverage the power of RAG and LLM technologies without relying on integration libraries, giving you full control over your application.

Call to Action

Don’t be intimidated by the advanced technology in the RAG and LLM space. At their core, these systems are composed of REST APIs, strings, and arrays that you already know how to use. Embrace these building blocks and take control of your chatbot development journey.

In conclusion, while integration libraries can be useful for reducing development time and simplifying the process, they may not be necessary for building RAG/LLM chatbots. By understanding the underlying components and leveraging your existing skills, you can create a custom, powerful chatbot solution without the complexity tax associated with integration libraries.


  • Human Intervention: Minor. It decided to add “This blog post is in Markdown format” at the bottom, which is super weird.

Facts Used:

    • Most users researching the RAG (retrieval augmented generation)/LLM (large language model) space think they need an integration library like LangChain or LlamaIndex to build a chatbot. This is not entirely true.
    • Integration libraries can coordinate all the necessary pieces to build a chatbot like ingesting documents, calling the text embedding model, storing embeddings in a vector store, performing semantic search, augmenting the LLM with the results and calling the LLM itself and getting results back.
    • While integration libraries can drastically reduce your time to market, they also come at the price of a complexity tax.
    • If you build with them as they are intended they can be very productive
    • However if you need to go off-the-rails and do something custom, like doing a text chunking method that is not supported or injecting custom guardrails into the LLM prompt or the response, you will need to override some of what these libraries do
    • These libraries tend to be complex internally, and modifying them to fit your purpose can be a daunting task.
    • As an alternative, you can build RAG chatbots with the exact same programming languages, API calling and basic string manipulation you already know how to do, as a developer!
    • Most text embedding models and LLMs are called through REST endpoints. Learn how to call them directly with REST or use a higher level library like the Python OpenAI library.
    • Text embedding models return vectors which can be parsed easily and stored directly in your vector store or database
    • LLM inputs are just text. Augmenting them with text chunks from your semantic search results is just string manipulation.
    • LLM outputs are also just text. There is slightly more complexity when you want to use the streaming output vs the completion API which produces the full response without streaming. For streaming purposes a Javascript front end and websockets is ideal.
    • The conversational history can be managed, again, with string manipulation.
    • Call to action: Don’t be scared of the fancy new technology, it’s all just REST APIs, strings and arrays that you already know how to use.
    • You can be in 100% control of your application, no integration libraries required.

Semantic Routers

Semantic Routers: The Future of Chatbot Coordination and Intent Routing

In recent years, organizations have seen tremendous success with implementing chatbots as a means to improve customer service and streamline internal processes. This has led to the proliferation of chatbots within organizations, with dozens of bots serving various purposes, such as knowledge domain-specific bots and natural language query bots. However, as the number of chatbots grows, users are finding it increasingly difficult to remember where each bot is and what it does. To address this issue, organizations have started developing coordination layers or chatbot routing layers to make the user experience more seamless.

The Emergence of Chatbot Routing Layers

A chatbot routing layer is itself a chatbot that aims to understand the user’s intent and route the request to the appropriate downstream chatbot. This is achieved either by stuffing the LLM context window with a list of all available bots and what they do (often called prompt stuffing), or by employing semantic routing techniques.

While the stuffing method is functional, it has its limitations. As the number of choices for downstream chatbots increases, the reasoning tends to degrade, resulting in a higher error rate. This is where semantic routing comes into play.

Semantic Routing: A Superior Approach

Semantic routing leverages semantic search technology to narrow down the precise set of downstream chatbots that could potentially handle a user’s request. This approach has several advantages over the augmented LLM context window method:

  1. Improved accuracy: By using semantic search, the routing layer can more accurately identify the most relevant chatbot for the user’s request, reducing the error rate and enhancing the overall user experience.
  2. Scalability: Since semantic search can handle an almost infinite number of options, the routing layer can scale effortlessly as more chatbots are added to the organization’s ecosystem.
  3. Increased reliability: By providing only the necessary tools to accomplish a task, semantic routing can increase the reliability of multi-step agentic planning bots, allowing them to perform more efficiently and effectively.
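
A semantic router itself can be sketched in a few lines: embed a short description of each downstream bot once, embed the incoming request, and route to the closest match. The bot registry and the embed() helper below are hypothetical.

```python
import numpy as np

# Hypothetical registry: one short description per downstream chatbot.
BOTS = {
    "hr_policy_bot": "Answers questions about HR policies, benefits and leave.",
    "sales_nlq_bot": "Answers questions about sales figures by querying the sales database.",
    "it_tools_bot": "Resets passwords and opens IT support tickets via internal APIs.",
}

def build_routes(embed) -> dict:
    """Embed each bot description once; embed() is any text-embedding function."""
    return {name: np.array(embed(desc)) for name, desc in BOTS.items()}

def route(request: str, routes: dict, embed, top_k: int = 1) -> list[str]:
    """Return the bot(s) whose description is semantically closest to the request."""
    q = np.array(embed(request))
    scores = {
        name: float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec)))  # cosine similarity
        for name, vec in routes.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

The same top-k lookup can hand an agentic planner only the tools relevant to its current task, rather than every tool the organization has.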

The Future of Chatbot Coordination

As organizations continue to adopt chatbots as a key component of their technology stack, the need for efficient coordination and intent routing will become increasingly important. Semantic routers offer a promising solution to these challenges, providing a scalable, reliable, and accurate approach to managing the growing number of chatbots within organizations.

By embracing semantic routing, organizations can ensure that their chatbot ecosystems remain user-friendly, effective, and efficient, paving the way for a more seamless and integrated future of chatbot-based services.


Author Bio

Pat is a seasoned AI and chatbot expert who has been at the forefront of the chatbot revolution for over a decade. With a passion for innovation and a deep understanding of the latest advancements in AI, Pat is dedicated to helping organizations harness the power of chatbots to transform their businesses and improve customer experiences. In this blog post, Pat explores the concept of semantic routers and their potential to revolutionize chatbot coordination and intent routing.

  • Human Intervention: Oh my gawd this one was amazing. This is the first time it generated an author bio at the bottom of the generated blog post and I had to replace [Your Name] in a few places. Now it sounds like I’m shilling some kind of chatbot consulting services - To be clear, I’m a search consultant for MongoDB. Otherwise I had to explain prompt stuffing a bit better up top. The content was accurate, but still one of the weirder generations it produced. <3 it so much.

Facts Used:

    • Organizations have gone through the process of building their first science experiment of a Chatbot and learned enough and refined it enough to put it into production
    • After the first successful chatbot, these tools tend to proliferate in the org. After a while, dozens of chatbots are running. Usually a mix of Knowledge Domain Chatbots, Natural Language Query and Tool use chatbots
    • It’s starting to resemble the early days of Client/Server and n-tier applications moving to a coordinated microservice architecture.
    • Users eventually complain about having to remember where each chatbot is and what it does
    • This results in orgs trying to build coordination layers, or chatbot routing layers so users do not need to remember anything
    • This routing layer itself is a chatbot and tends to perform intent-routing, or figuring out the users intent and passing along the request to the proper downstream chatbot
    • Intent routing can be done by augmenting the LLM context window with a list of downstream chatbots and what they do. However, this is not ideal, as we observe the more choices it has the worse the reasoning tends to be and the error rate goes up.
    • Semantic routing is the idea that you can use semantic search to narrow down the exact set of downstream bots that could service the users request.
    • Semantic routers have nearly infinite scaling capability as they use semantic search to narrow down what ends up in the LLMs context window.
    • Semantic routing can also be used with an Agentic chatbot approach by allowing multi-step agentic planning bots to be augmented with only the tools needed to accomplish the task, increasing their own reliability and planning capability.

RAGonomics

RAGonomics: Optimizing Retrieval Augmented Generation Economics and Performance

Retrieval Augmented Generation (RAG) has emerged as a promising approach to combine the strengths of large language models (LLMs) with semantic retrieval techniques. This hybrid approach offers improved performance in various NLP tasks, such as question answering, summarization, and chatbots. However, the economics of RAG use cases have been challenging, especially with the high cost of using GPT-4 for input and output tokens. Fortunately, recent advancements in open-source models and cloud LLM providers offer more cost-effective and efficient solutions. In this blog post, we will explore the key factors that contribute to the economics and performance of RAG systems.

Open-Source Models: A Game-Changer for RAG Economics

Until recently, GPT-4 was the go-to choice for RAG use cases, but its high cost made it less accessible for many developers and organizations. Today, open-source models like Llama 3 8B, Mistral 7B, and Microsoft Phi-3 have become good enough for most RAG applications. These models offer a more affordable alternative while maintaining a high level of performance.

Cloud LLM Providers: Unlocking the Potential of Open-Source Models

The emergence of cloud LLM providers, such as Fireworks.ai, has further enhanced the economics of RAG systems. These providers offer excellent pricing on open-source models and allow users to host them in their own datacenters at a fraction of the cost of proprietary solutions. This not only reduces the overall expenses but also enables users to benefit from frequent model updates and improvements.

Focus on Semantic Retrieval for Optimal Performance

To optimize the performance of RAG systems, it is crucial to focus on semantic retrieval (or hybrid semantic/lexical) techniques. By optimizing for recall and precision, users can employ the minimum number of high-quality chunks, which in turn enhances the reasoning capabilities of smaller LLMs. This approach not only improves the efficiency of the system but also reduces the need for the largest models for basic RAG question/answer use cases.
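
As a small illustration of using the minimum number of high-quality chunks, retrieval results can be filtered by a score threshold and a hard cap before they reach a smaller model. The threshold and cap below are arbitrary example values.

```python
def select_chunks(results, min_score: float = 0.75, max_chunks: int = 3) -> list[str]:
    """Keep only high-confidence chunks, capped at a small number.

    `results` is assumed to be (score, chunk_text) pairs from vector or hybrid search,
    sorted by descending score. Fewer, better chunks keep the prompt short, which both
    lowers token costs and leaves a small open-source model less to reason over.
    """
    kept = [text for score, text in results if score >= min_score]
    return kept[:max_chunks]
```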

Invest in Chunk Retrieval Mechanisms for Better RAG Performance

Finally, the key to achieving excellent RAG performance is to invest in refining your chunk retrieval mechanisms. By focusing on improving this aspect, users can significantly enhance the overall performance of their RAG systems without relying on the largest models. This will enable them to develop more efficient and cost-effective solutions for various NLP applications.

In conclusion, RAGonomics is all about optimizing the economics and performance of retrieval augmented generation systems. By leveraging open-source models, cloud LLM providers, and focusing on semantic retrieval techniques, users can create more efficient and cost-effective RAG systems that deliver exceptional performance. As the field continues to evolve, we can expect further advancements in this space, enabling even more affordable and powerful RAG solutions for a wide range of applications.



  • Human Intervention: None

Facts Used:

    • The economics of RAG (retrieval augmented generation) use cases, as early as a year ago was not great. Everyone was using GPT4 which was very expensive for input and output tokens.
    • Open source models like LLama 3 8b and Mistral 7b and Microsoft Phi3 2.6b are now good enough for most RAG use cases
    • Cloud LLM (large language model) providers like Fireworks.ai have excellent pricing on all open source models. They can also be cheaply hosted in your own datacenter.
    • Focusing on the Semantic retrieval (or Hybrid Semantic/Lexical) and optimizing for recall and precision allows you to use the minimum number of high quality chunks, which optimizes the reasoning capabilities of smaller LLMs.
    • Spend the budget on fixing your chunk retrieval mechanisms and you won’t need the largest models for basic RAG question/answer use cases.

Long context the end of RAG?

Long Context: The End of RAG?

Recent advancements in Large Language Models (LLMs) have led to significant improvements in their context window, extending it to millions of tokens. This has sparked a question: do we still need Retrieval-Augmented Generation (RAG) with long context models? Could we just load our entire corpus into the context window instead? In this post, we’ll explore the role of RAG and its relevance in the context of long context models.

The Relevance of RAG with Long Context Models

Despite the advancements in long context models, RAG still holds its ground for several reasons:

  1. Computational Tricks and Shortcuts: Long context windows are created using techniques like sliding window attention, RoPE scaling, and ring attention. These methods help manage the computational complexity, but they are not without limitations.
  2. Natural Context Window: Every model has a natural context window, the exact size of the dense vector passed to the attention heads and the multi-layer perceptrons. Models reason best when the input fits within this natural window; beyond it, summarization-like techniques are used and information is lost. Long-context models do well on needle-in-a-haystack tests (finding something like a planted city name in a large data set) but perform poorly on dense information.
  3. Performance Trade-offs: Long context queries can be costly, ranging from $1 to $2 per query. This is not cost-effective for most RAG-style Question/Answer use cases at scale. Moreover, the latency associated with long context queries is significantly higher than that of RAG systems. A well-tuned RAG system can achieve retrieval latency as low as 250ms, while long context models can take up to 30 seconds just to ingest the corpus before answering.
  4. Relevance and Reasoning: A well-tuned RAG system provides only the most relevant information in the context window, resulting in the lowest latency retrieval and strongest reasoning capability from the LLM.

The Future of RAG and Long Context Models

In conclusion, while long context models have shown remarkable progress, RAG still plays a crucial role in providing efficient and effective question answering systems. RAG systems, when combined with semantic search or hybrid semantic/lexical approaches, offer the best of both worlds: low latency retrieval and strong reasoning capabilities. As we continue to advance in AI and NLP, it is likely that both long context models and RAG will coexist, each serving its unique purpose in different use cases.



  • Human Intervention: None

Facts Used:

    • Recent advances in models have extended context of Large Language Models (LLMs) into the millions of tokens
    • Retrieval Augmented Generation (RAG) has been the primary method to provide a small amount of relevant information to the LLM to answer users’ questions. This was critical when context windows of LLMs were limited to 2048 or 4096 tokens.
    • A question has come up recently: Do we even need RAG with long context models? Why not just load in our entire corpus into the context window?
    • RAG is still relevant for multiple reasons:
    • Long context windows are created using multiple computational tricks and shortcuts like Sliding Window Attention, ROPE scaling and Ring attention.
    • There is a natural context window to every model which is the exact size of the dense vector that passes to the attention heads and the multi-layer perceptrons
    • Models show the strongest reasoning capability when the input is within the natural context window. Anything beyond that is using summarization techniques and losing information. Long context performs well in needle-in-haystack tests (finding things like city names planted in a large data set) but very poorly with dense information
    • The cost to perform a long context query can vary from $1 to $2 per query. This is not cost effective for most RAG style Question/Answer use cases at scale.
    • The latency associated with long context queries is beyond what most users would find acceptable. RAG with semantic search or even Hybrid Semantic/Lexical can be as low as 250ms for retrieval. Long context models can take upwards of 30 seconds to just ingest the corpus before answering.
  • A well tuned RAG system will have only the most relevant information in the context window, providing the lowest latency retrieval and strongest reasoning capability from the LLM.

Customer Service Augmentation or Deferral?

Customer Service Augmentation or Deferral?

In recent years, the use of chatbots in customer service has become increasingly popular. One of the more interesting use cases for Retrieval Augmented Generation (RAG) chatbots is to help in customer service, which can be achieved through two main techniques: augmentation and deferral.

Augmentation in Customer Service

Many companies are actively building chatbot projects to augment customer service interactions. By arming customer support agents with a RAG chatbot containing the complete set of vectorized knowledge relevant to helping customers, agents can be more effective in their roles. This technique is called augmentation.

Augmentation chatbots need to be designed carefully. They should be treated like a Content Management System (CMS) and use techniques like Transactional Knowledge Management to allow senior staff to add, update, or correct information in real-time. These chatbots also need to be low latency and provide clear and concise answers without any additional context, as the agent is just looking for an immediate answer. Latency will be a big driver of user acceptance of these tools.

Deferral in Customer Service

On the other hand, deferral involves having a chatbot handle all the lower difficulty questions to prevent an agent from needing to talk to the customer. Deferral chatbots need to be designed for friendly, longer, detailed answers to questions and can be designed to prompt for additional information to ensure a better understanding of the user’s request.

Deferral chatbots should also escalate to a human agent if the question returns no high-scoring results in the vectorized knowledge base or if it detects unhappy responses from the customer. This ensures that customers receive the help they need when the chatbot cannot provide a satisfactory answer.
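
The escalation rule itself can be sketched simply: if retrieval confidence is low, or the customer’s messages read as negative, hand the conversation to a human. The score threshold and the detect_sentiment() helper below are hypothetical.

```python
def should_escalate(search_results, customer_message: str, detect_sentiment,
                    min_score: float = 0.7) -> bool:
    """Escalate when the knowledge base can't answer confidently or the customer is unhappy.

    `search_results` is assumed to be sorted by descending relevance score.
    """
    no_good_answer = not search_results or search_results[0]["score"] < min_score
    unhappy = detect_sentiment(customer_message) == "negative"
    return no_good_answer or unhappy
```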

The Value of Both Techniques

Both augmentation and deferral techniques have value in modernizing customer service for any organization. Augmentation allows agents to be more effective by providing them with the necessary knowledge, while deferral reduces the workload of agents by handling lower difficulty questions. However, these techniques need to be designed differently to ensure their effectiveness.

Conclusion

In conclusion, RAG chatbots offer a versatile approach to improving customer service interactions. By leveraging augmentation and deferral techniques, organizations can modernize their customer service and improve the overall experience for their customers. As technology continues to advance, we can expect to see more innovative uses of chatbots in various industries, further revolutionizing the way we interact with customer service.

  • Human Intervention: None

Facts Used:

    • One of the more interesting use cases for RAG (retrieval augmented generation) chatbots is to help in customer service
    • Many companies are actively building chatbot projects to either augment or defer customer service interactions
    • Augmentation is arming customer support agents with a RAG chatbot with the complete set of vectorized knowledge relevant to helping customers, allowing them to be more effective.
    • Deferral is having a chatbot handle all the lower difficulty questions to prevent an agent from needing to talk to the customer.
    • Both techniques have value, but need to be designed very differently.
    • Augmentation chatbots need to be treated like a CMS and use techniques like Transactional Knowledge Management to allow senior staff to be able to add/update/correct information in real time.
    • Augmentation also needs to be low latency and provide clear and concise answers without any additional context, as the agent is just looking for an immediate answer. Latency will be a big driver of user acceptance of these tools.
    • Deferral chatbots need to be designed for friendly, longer, detailed answers to questions and can be designed to prompt for additional information to ensure a better understanding of the users request.
    • Deferral chatbots should also escalate to a human agent if the question returns no high scoring results in the vectorized knowledge base or if it detects unhappy responses from the customer.
    • Overall both tools are useful in modernizing customer service for any organization.

Generative AI Landscape

Generative AI Landscape: Exploring the Current Landscape and Exciting Use Cases

The field of Generative AI has rapidly evolved over the years, with various models and applications emerging to revolutionize the way businesses operate. In this blog post, we will explore the current landscape of Generative AI, including popular models such as Diffusion Models, Large Language Models (LLMs), and ultra-specialized generative models for industries like drug discovery. We will also delve into the most valuable use cases and applications observed so far, covering areas such as customer support, augmented coding, and ideation.

Current Landscape of Generative AI

Diffusion Models

Diffusion models have gained significant popularity in creative spaces, such as music, video, and picture generation. However, their adoption in business settings remains limited. These models are primarily used for image and video generation, but their application in other domains is still under exploration.

Large Language Models (LLMs)

LLMs represent the majority of use cases in the business world, particularly in chatbot-style interactions. They are widely used for tasks such as answering questions, generating text, and processing natural language queries. The versatility of LLMs makes them suitable for a broad range of applications within organizations.

Specialty Models

Ultra-specialized generative models have shown immense value in specific problem domains, such as drug discovery. These models are designed to address unique challenges within industries and have proven to be highly valuable when applied to their intended use cases.

Types of Chatbots

Chatbots are a prevalent use case for LLMs in businesses, and they come in three major varieties:

  1. Knowledge Domain Chatbots: These chatbots are designed to authoritatively answer questions about a domain of knowledge. They utilize vectorized chunks of text for semantic retrieval and often incorporate hybrid search techniques. Knowledge Domain chatbots typically work with unstructured data, such as documents, policies, procedures, blog posts, and news articles.

  2. Natural Language Query (NLQ) Chatbots: NLQ chatbots leverage LLMs along with examples of a database schema to generate queries against structured data stores like databases. They then provide natural responses using the resulting data set.

  3. Tool Use Bots: These chatbots are similar to NLQ bots but are designed to work with REST APIs. The LLM is provided with a set of “tools” or APIs it can call to accomplish tasks or answer questions. It typically determines the parameters for the API, which is then called, and the LLM generates an answer using the returned values from the API.
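
A single tool-use turn can be sketched as follows: the LLM picks a tool and its parameters, the REST endpoint is called, and the LLM answers from the returned values. The tool registry, prompt format, and llm() callable are assumptions, not any specific framework’s API.

```python
import json
import requests

# Hypothetical tool registry: tool name -> REST endpoint and a short description.
TOOLS = {
    "get_order_status": {
        "url": "https://internal.example.com/orders/status",  # hypothetical endpoint
        "description": "Returns the status of an order. Parameters: order_id (string).",
    },
}

def tool_use_turn(question: str, llm) -> str:
    """Let the LLM choose a tool and its parameters, call it, then answer from the result."""
    tool_list = "\n".join(f"{name}: {t['description']}" for name, t in TOOLS.items())
    plan = json.loads(llm(
        f"Available tools:\n{tool_list}\n\nQuestion: {question}\n"
        'Reply as JSON: {"tool": "...", "parameters": {...}}'
    ))

    api_result = requests.get(TOOLS[plan["tool"]]["url"], params=plan["parameters"]).json()

    return llm(
        f"Question: {question}\nAPI returned: {json.dumps(api_result)}\n"
        "Answer the question using only the returned values."
    )
```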

Agentic Chatbots

Agentic chatbots combine the reasoning capabilities of LLMs with a set of tools, knowledge domain, or natural language query systems to service complex, multi-step requests. While their reliability is still improving, agentic approaches will become increasingly useful as underlying LLM models advance in complexity.

Exciting Use Cases for Generative AI

Generative AI has numerous applications in various domains, with some of the most valuable use cases including:

  1. Intelligence Augmented Workforce: Vectorizing all of a company’s knowledge and utilizing chatbots to work alongside employees can significantly accelerate productivity across the organization. This approach allows employees to focus on higher-level tasks while the chatbot handles routine inquiries.

  2. Personal Brains/External Brains: Transcribing and summarizing calls, along with building a vectorized database of customer interactions, can provide nearly perfect recall for salespeople. This capability enables them to access critical information quickly and efficiently.

  3. Customer Support Deferral and Augmentation: Chatbots can handle a significant portion of customer support inquiries, freeing up support teams to tackle more complex issues. This approach not only improves the customer experience but can also lead to cost savings for the organization.

  4. Augmented Coding Co-Pilots: Chatbots with access to an organization’s entire codebase can assist development teams with code consistency and development velocity. This collaboration can help ensure that code adheres to established standards and accelerates the development process.

  5. Ideation and Sounding Board Chatbots: These chatbots, equipped with critical business metrics, can provide valuable input on key decision-making processes. By taking emotion out of the equation, these chatbots can help organizations make more informed and objective choices.

In conclusion, the landscape of Generative AI is rapidly evolving, with new models and applications emerging continuously. By understanding the current landscape and exploring exciting use cases, organizations can harness the power of Generative AI to drive innovation and improve their operations. As the underlying technology continues to advance, the possibilities for applying Generative AI in businesses will only grow, making it a critical component of modern organizations.

  • Human Intervention: None

Facts Used:

    • Current landscape for GenAI comes down to Diffusion Models, Large Language Models (LLMs) and ultra-specialized generative models for industries like drug discovery
    • Diffusion models are popular in the creative spaces (music, video, picture generation) but rare in business settings.
    • Specialty models are highly valuable but only for specific problems
    • Large language models represent the majority of use cases in business and most of these are Chatbot style use cases.
    • Chatbot is a very broad term which includes user to chatbot interactions or even chatbot to chatbot or machine process to chatbot.
    • Chatbots come in 3 major varieties: Knowledge Domain (traditional RAG), Natural Language Query (NLQ), and Agentic/Tool Use bots
    • Knowledge Domain chatbots are designed to authoritatively answer questions about a domain of knowledge and are made up of vectorized chunks of text for semantic retrieval (and very often lexical with hybrid search). Typically unstructured data like documents, policies and procedures, blog posts and news posts.
    • Natural Language Query is using the LLM, along with examples of your database schema, to generate queries against structure data stores like databases and then provide natural responses using the result set.
    • Tool use bots are similar to natural language query, except they are meant to be used against REST APIs. The LLM is given a set of “tools” or APIs it can call to accomplish tasks or answer questions. It typically just figures out parameters for the API. The REST endpoint is called and the LLM produces an answer with the resulting returned values from the API.
    • Agentic bots use the reasoning capabilities of the LLM along with a set of tools, knowledge domain or natural language query systems to service complex multi-step requests.
    • Different routing methods are emerging to coordinate all these emerging chatbot implementations. Some organizations use Agents with a list of bots to figure out which bot can do what task. Sometimes they will be @ coded, and called like a slack or teams user. Others have developed semantic lookup with vector search to route to specific bots.
    • To this day, knowledge domain chatbots that represent a document or set of documents are still the most popular chatbots that organizations are producing, while NLQ is starting to become more popular. The reliability of agentic approaches is still quite low, so it’s not as popular.
    • Agentic approaches will become more useful as the underlying LLM models advance in complexity
    • Intelligence Augmented Workforce, the idea of vectorizing all of your company knowledge and having a chatbot work alongside every user to accelerate everyone in the org is one of the most valuable use cases observed so far.
    • Personal Brains/External Brains are not as popular, but can be game changing for the individual. Transcribing and summarizing calls and building up a vectorized database of all your customer interactions gives you nearly perfect recall. The perfect salesman.
    • Customer support deferral and augmentation are some of the most financially exciting use cases. We blogged about this previously
    • Augmented coding co-pilots that have access to the entire organization codebase are extremely valuable to development teams for code consistency and development velocity
    • Ideation and sounding board chatbots that have critical business metrics augmented into them can help take the emotion out of critical business decisions. This is an exciting, emerging use case.
    • Generative AI can help your organization in many ways, these are just a few that I talk about daily with our customers.

Graph Database or Natural Language Query?

Graph Database or Natural Language Query?

In recent months, there has been a growing interest in using graph databases and large language models (LLMs) for querying structured data. While both approaches have their advantages, there are some important considerations that need to be taken into account when deciding which one to use.

Querying Structured Data with LLMs

Querying structured data with LLMs, such as retrieval augmented generation (RAG), can be challenging. This is because structured data does not vectorize well, and when an LLM summarizes it into text, details like dates, times, and numeric values are often poorly represented in the dense vector output of the embedding model.

Graph Databases for Structured Data

To address this issue, some graph database vendors have proposed using graph structures to query structured data with vectorized text concepts as starting points to explore the graph. This approach involves ingesting structured data and converting it into a graph structure using LLMs that generate semantic triplets. However, this method is not without its problems.

The “Exploding Edge” Problem

One major challenge with using graphs to query structured data is the “exploding edge” problem. Any node in a graph can have an arbitrary number of edges, and each edge leads to another node with an arbitrary number of edges. It is impossible to determine up front how many levels deep to traverse the graph to get information relevant to a user’s question; for example, if each node averages ten edges, a three-level traversal can already touch on the order of a thousand nodes. Pulling in that much data can easily hit the LLM context window limits.

Natural Language Query for Structured Data

A more practical approach for working with structured data is to use natural language query (NLQ). With NLQ, the LLM is given the schema of the database and samples of the data, which allows it to generate queries on behalf of the user. The generated query is then executed, and the LLM uses the resulting data set, along with the original question, to provide an answer based on the results.

Conclusion

In summary, while graph databases and LLMs can be used to query structured data, they each have their limitations. Graph databases can help address some of the challenges of working with large, interconnected data sets, but the exploding edge problem can make them difficult to work with in practice. On the other hand, natural language query provides a more straightforward and effective way to query structured data using LLMs. Ultimately, the choice between graph databases and natural language query will depend on the specific use case and the data being worked with.


Keywords: Graph database, Natural Language Query, LLM, Retrieval Augmented Generation, structured data.

  • Human Intervention: Minor

Facts Used:

    • Querying structured data with LLMs (large language models), using RAG (retrieval augmented generation) is difficult.
    • Structured data does not vectorize well and if you summarize it with an LLM into text, details like dates/times and numeric values will be poorly represented in the dense vector output of the embedding model
    • Graph database vendors are proposing graph structures to query structured data with vectorized text concepts as starting points to exploring the graph
    • Structured data is ingested and converted into a graph structure with LLMs which generate semantic triplets.
    • The problem with graphs is the “exploding edge” problem where any node can have an arbitrary number of edges and each edge leads to another node with an arbitrary number of edges.
    • It’s impossible to determine up-front how many levels deep to traverse in the graph to get relevant information for the users question, so LLM context window limits can be hit very easily when pulling in too much data.
    • Most use cases on structured data need Natural Language Query: Give the LLM the schema of your database and samples of the data and let it generate queries for you.
    • Execute the generated query, get a result set back and have the LLM use the result set along with the original question to answer it based on the results.
    • Graph databases are not required for working with structured data, Natural Language query is a better choice.