Why use local LLM models?

Why Use Local LLM Models?

Local and edge computing have been gaining traction in recent years, especially with the advent of 5G and advancements in AI technology. While cloud-based solutions are still prevalent for many tasks, there are several reasons why utilizing local Large Language Models (LLMs) can provide benefits over relying solely on centralized models. In this article, we’ll explore these advantages and discuss the role of open source in the development of powerful local LLM models.

Advantages of Local LLM Models

  1. Open Source Provides Advantages for Edge/Remote Tasks: Open source AI models offer several advantages when it comes to edge computing, such as privacy, low-end optimization, and feasibility for most tasks that don’t require the most advanced models. This makes open source a more viable option for many applications.
  2. Local Inference Enables Applications Like Home Automation Without Internet: Local inference allows models to operate independently of an internet connection, which is crucial for applications like home automation or devices in remote locations with limited connectivity. Open source models can be trained to perform these tasks effectively and efficiently.
  3. Training Models on Private Data: Open source enables training models on private data that cannot leave an organization’s premises. This is particularly important for companies handling sensitive information, as they can benefit from AI technology without compromising security.
  4. Control Over LLMs: With open source, organizations have more control over their AI models rather than relying solely on large corporations. This allows them to tailor models to their specific needs and ensures data privacy.
  5. Domain-Specific Open Models Outperform Larger Closed Models: Fine-tuned, domain-specific open models can often outperform larger closed models on specific tasks. This demonstrates the power of local models that have been specifically trained for a particular application.
  6. Demonstrating Viability of Small, Local, Task-Specific Models: Open source is critical for showcasing the effectiveness of small, local, task-specific models. These models can be more cost-effective and efficient than their larger counterparts, making them an attractive option for many applications.
  7. Open Source Closing the Gap with Large LLMs: Recent developments in open source models have closed the gap significantly with larger, closed models. As research in this area continues to progress, there’s a chance that local powerful models may become more prevalent and surpass their cloud-based counterparts in certain tasks.
  8. Lower Operating Costs: Local LLM models can offer lower operating costs than cloud-based solutions, since there are no per-request API fees and no dependence on constant internet connectivity. This makes them a cost-effective option for many applications.

The Role of Open Source in Local LLMs

Open source plays a crucial role in the development and adoption of local LLM models. By sharing knowledge and resources, researchers and developers can collaborate to create high-quality models that benefit humanity as a whole. This contrasts with the primary focus of closed source solutions on profit.

As the research community continues to adapt to Large Language Models, we can expect to see more high-quality open models emerge. With the rapid advancements in this field, it’s essential to consider the potential benefits of local LLMs and how they can contribute to the growth of AI technology.

In conclusion, there are several reasons why utilizing local LLM models is advantageous, particularly when considering privacy, control, cost-effectiveness, and edge computing scenarios. The rise of open source in this space has enabled rapid advancements and made powerful AI technology more accessible to a wider range of applications. As we continue to see the development and adoption of these models, it’s clear that local LLMs have an important role to play in shaping the future of AI.

  • Human Intervention: None

Facts Used:

  • Open source provides advantages for edge/remote tasks, privacy, and low-end optimization
  • Most tasks don’t require the most advanced models, so open source can be more feasible and useful
  • Local/offline and fast inference will keep open source relevant, enabling applications like home automation without internet
  • Open source allows training models on private data that can’t leave an organization’s premises
  • Open source provides more control over LLMs (large language models) rather than ceding it to large corporations
  • Fine-tuned, domain-specific open models can exceed larger closed models on specific tasks
  • Open source is critical for demonstrating the viability of small, local, task-specific models
  • Recently a fine-tuned open model outperformed GPT-4 on a specific work task at much lower cost
  • Open source provides reliability and consistency in the face of company policy/leadership changes
  • The essential goal of open source is sharing knowledge to benefit humanity vs closed source focus on profit
  • There’s a chance current LLMs plateau and open models catch up, enabling local powerful models
  • The research community is starting to adapt to LLMs and may produce high-quality open models
  • Never is a long time - open models have rapidly closed the gap and will continue advancing
  • Open source matters the way Linux became the most popular OS - it may power many AI applications
  • Open models provide lower operating costs which is a key advantage

Big Model Small Model RAG

Big Model Small Model RAG: Optimizing Cost and Capability in Chatbots

In the world of chatbot technology, there is a constant push to achieve optimal performance while maintaining low costs. This pursuit has led many developers to explore various strategies for creating Retrieval Augmented Generation (RAG) solutions using large language models (LLMs). One such strategy is LLM pre-summarization, which plays a crucial role in the text chunking process of RAG chatbots.

Large Language Models and Pre-Summarization

Large language models like OpenAI’s GPT4 or Mistral.ai’s mistral-large are known for their excellent summarization capabilities. These models have proven to be invaluable in generating high-quality, concise text chunks that form the basis of RAG chatbot responses. The ability to pre-summarize text is essential in ensuring that the chatbot can quickly and accurately provide users with relevant information.

While larger LLMs excel at pre-summarization, they may not always be necessary for generating responses to user questions. In many cases, smaller LLM models like Mistral 7b Instruct, GPT3.5-turbo, and Mixtral 8x7b can deliver excellent question answering performance when given high-quality text chunks. This makes them a more cost-effective option for chatbot development, as they are less expensive to call than larger models.

The Dilemma of Model Selection

When designing RAG chatbots, developers often face a dilemma: opt for the cost efficiency of smaller models or prioritize the summarization capabilities of larger models. This choice can lead to either high costs when using more powerful LLMs or limited summarization capability with less expensive alternatives.

To overcome this challenge and achieve optimal performance at lower costs, there is a growing trend towards mixing big and small models in RAG chatbots. By leveraging the strengths of both types of models, developers can create a more robust solution that effectively balances cost efficiency and summarization capability.

The Big Model Small Model RAG Approach

The Big Model Small Model RAG approach involves using large language models for their superior pre-summarization abilities while relying on smaller models to generate responses based on the well-summarized text chunks. This strategy allows developers to:

  1. Leverage the summarization prowess of big LLMs: Large language models like GPT4 and mistral-large are adept at condensing complex information into concise, relevant snippets. By utilizing their pre-summarization capabilities, developers can ensure that the chatbot provides users with accurate, focused responses.
  2. Reduce costs with smaller LLMs: Smaller models such as Mistral 7b Instruct, GPT3.5-turbo, and Mixtral 8x7b offer a more cost-effective solution for generating chatbot responses. These models are less expensive to call and can still deliver excellent question answering performance when provided with high-quality text chunks.
  3. Optimize performance and cost: By combining the strengths of big and small LLMs, developers can create a RAG chatbot that achieves superior performance at lower costs. This approach allows organizations to invest in the most suitable models for their specific use cases without sacrificing quality or affordability.
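
To make the division of labor concrete, here is a minimal sketch, not taken from any particular product, that assumes an OpenAI-compatible Python client and illustrative model names: the expensive model runs once per chunk at ingest time, while the cheap model handles every user question against the pre-summarized chunks.

```python
# Hypothetical sketch: big model for ingest-time summarization,
# small model for query-time answering, via an OpenAI-compatible API.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

BIG_MODEL = "gpt-4"            # strong summarizer, called once per chunk at ingest
SMALL_MODEL = "gpt-3.5-turbo"  # cheap responder, called on every user question

def presummarize_chunk(raw_text: str) -> str:
    """Ingest time: condense a raw document chunk with the big model."""
    resp = client.chat.completions.create(
        model=BIG_MODEL,
        messages=[
            {"role": "system", "content": "Summarize the text into concise, factual sentences."},
            {"role": "user", "content": raw_text},
        ],
    )
    return resp.choices[0].message.content

def answer_question(question: str, summarized_chunks: list[str]) -> str:
    """Query time: answer from pre-summarized chunks with the small model."""
    context = "\n\n".join(summarized_chunks)
    resp = client.chat.completions.create(
        model=SMALL_MODEL,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```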

Conclusion

The Big Model Small Model RAG approach represents a significant leap forward in chatbot development, as it enables developers to strike an ideal balance between cost efficiency and capability. By leveraging the unique strengths of both large and small LLMs, organizations can create RAG solutions that deliver high-quality, accurate responses while keeping costs under control. As this trend continues to grow, we can expect to see a new wave of innovative chatbots that redefine the boundaries of what is possible in the realm of artificial intelligence and natural language processing.

  • Human Intervention: None

Facts Used:

  • Many RAG Chatbot (retrieval augmented generation) solutions use LLM (large language model) pre-summarization as part of the text chunking strategy.
  • The advantages of LLM Pre Summarization was covered in this article: https://ai.dungeons.ca/posts/chunking-techniques---llm-presummarization/
  • A larger LLM like OpenAI’s GPT4 or Mistral.ai’s mistral-large have excellent summarization capability. However they might not be needed for generating responses to your users questions from the chatbot
  • Smaller LLM models like Mistral 7b Instruct, GPT3.5-turbo and Mixtral 8x7b are much cheaper to call and can give excellent question answering performance, given high quality well summarized text chunks.
  • Many use cases today will pick a single model and either suffer from high costs with larger models or limited summarization capability with smaller models.
  • Mixing big and small models, and using them for the appropriate task allows you to optimize cost and capability allowing you to have a much better RAG chatbot, but operated at lower costs.

Tool Series - extBrain

Tool Series - extBrain: Your External Brain for Building Another You

https://github.com/patw/ExternalBrain

In our ongoing series exploring various tools to build Generative AI applications, we present the eighth tool in the lineup: extBrain. This tool is a knowledge management system designed specifically for question answering. By leveraging techniques such as Fact Synthesis and Retrieval Augmented Generation (RAG), extBrain keeps storage and vector costs low while providing precise semantic search. In this blog post, we’ll delve into the technical aspects of extBrain and how it can fit into your AI application development process.

A Brief Overview of Fact Synthesis

At the core of extBrain lies its ability to break down large text articles into individual facts using Fact Synthesis. This process involves reducing textual data into a structured format (facts), making it easier for machines to understand and process information. By storing these facts in Mongo collections alongside metadata such as the source, context, and timestamp, extBrain can provide authoritative answers based on up-to-date knowledge sources.
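
As an illustration only, a single synthesized fact might be stored in a MongoDB collection roughly like this; the collection and field names below are assumptions, not the actual extBrain schema.

```python
# Illustrative sketch of storing one fact with its metadata in MongoDB.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
facts = client["extbrain"]["facts"]  # database/collection names are placeholders

facts.insert_one({
    "fact": "Instructor-large outputs 768 dimension dense vectors.",
    "stated_by": "patw",                       # who stated the fact
    "context": "Tool Series - InstructorVec",  # the context it was stated in
    "stated_at": datetime.now(timezone.utc),   # when it was stated
})
```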

Grouping Facts into Semantically Relevant Chunks

One of the key advantages of using extBrain is its ability to group facts into chunks of text. This can be achieved through various methods, including fixed numbers of grouped facts or more advanced techniques like context-based grouping and semantic similarity matching. By organizing facts into relevant groups, extBrain ensures that users receive accurate responses tailored to their specific inquiries.
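
Two of the simpler grouping strategies mentioned above can be sketched in a few lines of Python; the function names and fact dictionaries are illustrative rather than extBrain's actual code.

```python
# Sketch of fixed-size grouping and same-context grouping of facts.
from itertools import groupby

def group_fixed(facts: list[dict], size: int = 5) -> list[str]:
    """Join every `size` consecutive facts into one text chunk."""
    return [
        " ".join(f["fact"] for f in facts[i:i + size])
        for i in range(0, len(facts), size)
    ]

def group_by_context(facts: list[dict]) -> list[str]:
    """Join facts that share the same context into one chunk each."""
    ordered = sorted(facts, key=lambda f: f["context"])
    return [
        " ".join(f["fact"] for f in group)
        for _, group in groupby(ordered, key=lambda f: f["context"])
    ]
```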

Leveraging RAG for Enhanced Question Answering

RAG (Retrieval Augmented Generation) systems are common in the AI industry. These systems break down source documents such as PDF, Word, or HTML files into smaller text blobs and run text embedding models against them to generate dense vectors. ExtBrain instead chunks only on groups of facts, which reduces the overall size of the data set, produces more accurate vector search results, and gives the LLM stronger material to answer from.

Using Semantically Similar Facts for Larger Text Chunks

By grouping semantically similar facts into larger text chunks, extBrain enables users to ask broader questions while still maintaining high recall rates. This approach ensures that relevant information is readily available without compromising the overall efficiency of the system.
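
The retrieval step itself follows the standard RAG pattern: embed the question, rank the stored fact chunks by similarity, and inject the best chunks into the LLM prompt. The sketch below assumes an in-memory chunk list and a stand-in SentenceTransformers model; extBrain's own pipeline uses its vector store and embedding service instead.

```python
# Hedged sketch of question-time retrieval and prompt injection.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embedder.encode(question, normalize_embeddings=True)
    c = embedder.encode(chunks, normalize_embeddings=True)
    scores = c @ q  # cosine similarity, since the vectors are normalized
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n".join(top_chunks(question, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```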

A Second You: The Power of Your External Brain

ExtBrain’s primary goal is to act as an extension of your own knowledge and cognitive abilities. By ingesting all your relevant data, extBrain can provide authoritative answers on demand, allowing you to work smarter instead of harder. This innovative approach to AI application development has the potential to revolutionize how we approach problem-solving and information retrieval.

Multiple Front End Options for Easy Access

To make it even more convenient for users, extBrain offers multiple front end options for asking questions, including a website, a Discord bot, and a Slack bot. These intuitive interfaces allow you to access your external brain from virtually anywhere, making it simple to retrieve accurate information whenever you need it.

Managing Facts and Summarizing Text with the extBrain Back End

The extBrain back end serves as an administrative interface for managing facts and summarizing large chunks of text into digestible pieces of information. This powerful tool enables users to input, organize, and refine their knowledge base, ensuring that they always have access to up-to-date, accurate data when needed.

Conclusion: Why Everyone Should Have an External Brain

In conclusion, extBrain offers a comprehensive solution for building scalable AI applications focused on question answering. With its innovative approach to Fact Synthesis and RAG systems, combined with the ability to group semantically similar facts into larger text chunks, extBrain delivers efficient storage solutions and accurate vector search capabilities. By leveraging these features, you can work smarter, not harder, and revolutionize your AI application development process. So why wait? It’s time to unlock the power of your external brain!

Note: The views expressed in this blog post are based on publicly available information and represent our understanding of extBrain as a tool for building Generative AI applications.

  • Human Intervention: None

Facts Used:

  • This series covers different tools used for building Generative AI (genai) applications
  • Eighth Tool in the series is External Brain (https://github.com/patw/ExternalBrain)
  • External Brain uses Fact Synthesis (reducing text into facts - see previous posts about the subject) which allows the user to paste in large text articles and reduce them down to individual facts.
  • Facts are stored in Mongo collections along with some metadata about who stated the facts, what was the context and when it was stated.
  • The facts can then be grouped together into chunks of text, by either a fixed number of grouped facts or with more clever methods like only grouping together facts from the same context or even facts based on their semantic similarity.
  • Users can then ask questions to the LLM (large language model) and we will perform semantic search to find the fact chunks that are most relevant to the question, which will be injected into the prompt to the LLM. This is an example of RAG (retrieval augmented generation).
  • Most RAG systems will chunk up source documents like PDF, Word, HTML files into blobs of text, and run text embedding models against that to produce dense vectors.
  • extBrain only chunks on groups of facts, which reduces the overall size of the data set and produces more accurate results in vector search and stronger results from the LLM.
  • Semantically similar facts can produce larger text chunks without causing recall issues later on.
  • External Brain can act as a second you. It’s designed to ingest all your knowledge and be able to answer authoritatively on it. This technique could allow knowledge workers to scale.
  • extBrain has multiple front ends for asking questions: A website, a Discord bot and a Slack bot.
  • The extBrain back end, or admin UI allows you to enter and manage facts and summarize large chunks of text into facts.
  • This is a pretty ideal system for question answering. It’s efficient on storage and vectors (which are expensive to store).
  • Call to action: Everyone should have an external brain! Work smarter not harder.

Tool Series - FactWeave

Tool Series - FactWeave: Writing More by Writing Less

In the ongoing series of covering different tools for building Generative AI applications, we introduce FactWeave (https://github.com/patw/FactWeave), a unique tool that can generate blog posts with minimal input. Designed with Fact Expansion in mind, it’s the perfect solution if you want to share your thoughts and ideas without spending hours crafting each post.

How FactWeave Works

FactWeave is an incredible example of Fact Expansion. This tool allows you to input individual facts, which are then fed into a Large Language Model (LLM) along with some prompt engineering to generate specific types of blog posts. You can choose from technical, personal, or humor-based content, and the output will be Markdown files ready for consumption by static site generators like HUGO.
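
In spirit, the fact-expansion step is a prompt that lists the facts and a tone, followed by writing the response out as Markdown for the static site generator. The sketch below is an assumption of how that could look, using a local llama.cpp server endpoint and invented prompt wording and front matter rather than FactWeave's actual templates.

```python
# Hedged sketch of expanding facts into a Markdown blog post.
import requests

LLAMA_SERVER = "http://localhost:8080/completion"  # assumed llama.cpp server endpoint

def expand_facts(title: str, facts: list[str],
                 tone: str = "technical, detailed and professional") -> str:
    fact_list = "\n".join(f"- {f}" for f in facts)
    prompt = (
        f"Write a {tone} blog post titled '{title}' in Markdown, "
        f"using only these facts:\n{fact_list}\n"
    )
    resp = requests.post(LLAMA_SERVER, json={"prompt": prompt, "n_predict": 1024})
    return resp.json()["content"]

def write_post(title: str, body: str, tags: list[str]) -> None:
    """Write the generated post with minimal HUGO-style front matter."""
    front_matter = f'---\ntitle: "{title}"\ntags: [{", ".join(tags)}]\n---\n\n'
    with open(f"{title.lower().replace(' ', '-')}.md", "w") as f:
        f.write(front_matter + body)
```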

A Website Built with FactWeave

The website you’re currently reading is a testament to the power of FactWeave. This tool has helped me share my ideas in a well-formatted and professional manner, which has been incredibly valuable for my customers at MongoDB who are interested in building Retrieval Augmented Generation (RAG) use cases.

Managing Your Blog Posts with FactWeave

FactWeave also acts as a Content Management System (CMS). It enables you to manage your blog posts, change their tags, titles, or content. If the “post” field is blank, FactWeave will generate the blog post using the LLM. Afterward, you can edit the post as needed.

Tailoring the Content

The default tone for FactWeave’s generated posts is technical, detailed, and professional. However, you have the option to change this tone to personal if you prefer a more casual writing style. The term “detailed” might sometimes produce overly wordy content, so you can also switch it to “succinct” for shorter blog posts.

Automating Tags

Currently, FactWeave uses default tags such as “RAG”, “Grounding”, and “LLM”. However, I plan on updating the system to generate these tags automatically based on the content of each post. This way, the AI can help you categorize your blog posts more effectively!

FactWeave Built With Python

FactWeave is built using Flask, FlaskBootstrap, and FlaskForms, which are popular tools in this series. The tool also incorporates vector search functionality to help you find relevant articles for editing later.

Conclusion

In conclusion, FactWeave is a powerful tool that can help you create engaging blog posts with minimal effort. By inputting individual facts, the AI generates well-formatted, personalized content that can be edited and managed as needed. With its versatility and ease of use, FactWeave is an excellent choice for anyone looking to streamline their blogging process.

So why not give it a try? Write more by writing less with FactWeave!

  • Human Intervention: Minor. It called Fact Expansion “Facet Expansion”. Also renamed one of the headings.

Facts Used:

  • This series covers different tools used for building Generative AI (genai) applications
  • Seventh Tool in the series is FactWeave (https://www.github.com/patw/FactWeave)
  • FactWeave is a tool for generating blog posts by just providing individual facts.
  • This tool is an example of Fact Expansion (see previous blog post about this subject), it’s the opposite of Fact Synthesis
  • You can provide an arbitrary number of facts, which get fed to the LLM (large language model), with some prompt engineering to produce a specific type of blog post (technical, personal, humor) and output Markdown files.
  • The Markdown files can be consumed by static site generators like HUGO, to deploy a complete website with minimal inputs.
  • The website you’re reading right now is produced by FactWeave
  • FactWeave helped me get my ideas into a nicely formatted blog site which has been valuable for my customers at MongoDB who are interested in building RAG (retrieval augmented generation) use cases
  • FactWeave itself is a CMS. It provides the ability to manage the blog posts, change their tags, title or content. It only generates the blog post using the LLM if the “post” field is blank. After the post is generated it can be edited.
  • Sometimes I do need to edit the blog posts and I indicate that with the “Human Intervention” part at the bottom of the post. If it says “None”, it means I haven’t edited the post. If I do have to edit I explain what and why. Usually it’s due to the LLM hallucinating URLs or referencing open source projects with the wrong name.
  • The default tone for posts is “technical, detailed and professional”. This directs the LLM to produce technical sounding blogs. I sometimes change “professional to “personal” when I want that tone instead. The term “detailed” can also be a problem, sometimes. It’ll get very wordy, so I’ll change it to “succinct” instead.
  • The system also has default tags “RAG, Grounding, LLM” but I’ll modify the system later to have to produce tags automatically from the outputted content. When you have an AI problem, more AI fixes it!
  • The tool is built using Flask, FlaskBootstrap and FlaskForms as are many of the tools in this series.
  • It also incorporates vector search, to find relevant articles. This is so I can edit them later.
  • Call to action: This same technique could be used for building technical documentation, or even your own blogging solution. Clever tagline: Write more by writing less!

Tool Series - discord_llama

Tool Series - discord_llama: The Ultimate AI Companion for Your Server

Discord has become an integral part of our online social lives, allowing friends and communities to connect through chat rooms, voice channels, and even play games together. With the ever-growing demand for unique experiences in these servers, it’s no wonder that developers are continuously creating innovative tools to enhance interactions between users. Today, we’re taking a deep dive into discord_llama, a fantastic tool designed to bring large language models (LLMs) to life within your Discord server.

Discord_Llama is an open-source project by Pat W. that allows you to create LLM-driven chatbots tailored to your server’s needs. This versatile tool can introduce personality, humor, and even specific ideologies into your bot, making for a more engaging and entertaining user experience.

How Does it Work?

Discord_Llama leverages the same llama.cpp running in server mode as its LLM backend, sharing this powerful technology with other tools such as BottyBot, SumBot, RAGTAG, and ExtBrain. This backend is GPU-accelerated, ensuring lightning-fast responses to user queries on your Discord server.

The author currently runs about seven personality-based bots, ranging from the conventional WizardBot, a typical chatbot for answering questions, to more extreme concepts such as ideology-focused bots and even a bot that clones a friend and their interests (HermanBot). The tool offers a wide variety of options to cater to your server’s preferences.

Chatting with Bots

One standout feature of discord_llama is its access to Discord channel history, allowing bots to engage in back-and-forth conversations with users for up to five lines by default. This immersive interaction greatly enhances the user experience and fosters a more natural conversation flow within your server.
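
As a rough sketch of how that history-aware behavior could be wired up with discord.py and a llama.cpp server, consider the following; the endpoint, prompt format, and trigger logic are assumptions, not the actual discord_llama implementation.

```python
# Hedged sketch: reply to mentions using the last few lines of channel history.
import discord
import requests

LLAMA_SERVER = "http://localhost:8080/completion"  # assumed llama.cpp server endpoint
HISTORY_LINES = 5  # matches the default history depth mentioned above

intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)

@client.event
async def on_message(message: discord.Message):
    if message.author == client.user or not client.user.mentioned_in(message):
        return
    # Collect recent channel history, oldest first, to give the bot context.
    history = [m async for m in message.channel.history(limit=HISTORY_LINES)]
    context = "\n".join(f"{m.author.display_name}: {m.content}" for m in reversed(history))
    prompt = f"You are a helpful Discord bot.\n{context}\nBot:"
    resp = requests.post(LLAMA_SERVER, json={"prompt": prompt, "n_predict": 256})
    await message.channel.send(resp.json()["content"])

# client.run("YOUR_DISCORD_BOT_TOKEN")  # token placeholder
```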

The Future of discord_llama

Although development on discord_llama has slowed in recent months, it continues to provide valuable chatbot services to four different Discord servers. A potential enhancement for this tool could involve further augmentation, such as web searches for up-to-date information or more advanced interaction capabilities. Regardless of its future developments, discord_llama remains an invaluable asset for server owners and users alike.

Conclusion

In a world where Discord has become the go-to platform for connecting with friends and communities, having engaging chatbots is more important than ever. With discord_llama, you can now add personality and unique experiences to your servers, enriching conversations and entertaining users in ways previously unimaginable. If you haven’t already given this tool a try, I highly recommend checking out the discord_llama GitHub repository and exploring the potential of LLMs in your Discord community. Who knows? You might just discover your server’s new best friend!

  • Human Intervention: None

Facts Used:

  • This series covers different tools used for building Generative AI (genai) applications
  • discord_llama is a tool for building LLM (large language model) driven chatbots for Discord servers
  • The tool allows you to build bots with personality and add them to an existing discord server. They can respond to questions, and even react to specific keywords, at random, to act like regular users in the discord server.
  • discord_llama uses the same llama.cpp running in server mode as the LLM back end which is shared in my homelab with tools like BottyBot, SumBot, RAGTAG and ExtBrain. This llama.cpp instance is GPU accelerated allowing very fast responses to questions from users in Discord.
  • I currently run about 7 different personality based bots that range from the very boring WizardBot, which is a typical chatbot for answering questions to more extreme personality based bots like ideological focused bots and even a bot to clone a friend and his interests (HermanBot)!
  • The bots have access to the discord channel history (up to 5 lines by default) which allows them to have back and forth exchanges with Discord users, which provides a great experience for the users.
  • This project hasn’t seen much work in the last few months, but continues to provide useful chatbot services to 4 different Discord servers.
  • A future enhancement to this tool could be further augmentation like web searches, for more up to date information.
  • If you run a discord server, or participate in one, this tool can add a ton of value to conversations, or just troll the users

Tool Series - BottyBot

Tool Series - BottyBot: A Frontend Chat UI for Local LLM Models

https://www.github.com/patw/BottyBot

In this installment of our Generative AI (GenAI) tool series, we will be exploring a unique solution to interfacing with locally hosted Large Language Models (LLMs): BottyBot. Developed by an individual who was not satisfied with existing options on the market, BottyBot is specifically designed to seamlessly connect with llama.cpp running in server mode. This powerful frontend chat UI has become a crucial tool in the developer’s daily workflow, serving as the main interface for interacting with multiple tools and applications that leverage the capabilities of the LLM.

The creator of BottyBot operates two GPU-accelerated instances of llama.cpp, which serve as the backbone for numerous applications such as SumBot, ExtBrain, RAGTAG (soon to be updated), and several Python scripts, including a website generator. BottyBot sits on top of this shared backend as the developer’s daily chat interface, and the llama.cpp server behind it typically runs the OpenHermes Mistral or Dolphin Mistral families of LLM models.

One of the key features that sets BottyBot apart is its support for multiple “bot” identities. These distinct personalities can be engaging to interact with and are entirely generated within the application itself. The development process of BottyBot exemplifies a unique approach known as “bootstrapping,” where much of the initial design was created using OpenAI’s ChatGPT-3, while subsequent features were added by directly communicating with the LLM model integrated into BottyBot. This innovative method has resulted in a continually evolving and feature-rich application that caters to a wide range of use cases.
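
A speculative sketch of the identity idea follows: a named persona is prepended to the conversation before the prompt is sent to the llama.cpp server. The persona texts, endpoint, and prompt layout are placeholders rather than BottyBot's internals.

```python
# Hedged sketch of persona-prefixed prompting against a llama.cpp server.
import requests

LLAMA_SERVER = "http://localhost:8080/completion"  # assumed llama.cpp server endpoint

IDENTITIES = {
    "helpful": "You are a friendly, concise technical assistant.",
    "pirate": "You answer every question as an over-the-top pirate.",
}

def chat(identity: str, history: list[str], user_message: str) -> str:
    persona = IDENTITIES[identity]
    transcript = "\n".join(history + [f"User: {user_message}", "Bot:"])
    resp = requests.post(
        LLAMA_SERVER,
        json={"prompt": f"{persona}\n{transcript}", "n_predict": 512},
    )
    return resp.json()["content"]
```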

In addition to its core functionality, BottyBot also includes export capabilities for formatting and organizing conversations in an easily shareable format. This feature is particularly useful for collaborating with others or showcasing the results of interactions with LLM models.

Unlike RAGTAG or ExtBrain, BottyBot does not currently use Retrieval Augmented Generation (RAG); it talks to the LLM directly without any augmentation. However, the developer has expressed interest in potentially incorporating manual augmentation or vector search capabilities in future updates, which would enhance the prompt generation process and further optimize interactions with LLM models.

Overall, BottyBot has proven to be an incredibly valuable tool for individuals who wish to harness the power of local, open-source Large Language Models while maintaining complete privacy and control over their data. As a result, it serves as a perfect example of how cutting-edge AI technology can be effectively integrated into everyday workflows and applications. Stay tuned for future updates and enhancements to this versatile and essential chat interface!

  • Human Intervention: Minor. Added the URL for the github repo up top. Also, it seemed to depersonalize me entirely in this article talking about an unknown developer. I’m cool with it, but I probably needed to add some context in the points to indicate who worked on it.

Facts Used:

  • This series covers different tools used for building Generative AI (genai) applications
  • BottyBot is a front end chat UI that connects to llama.cpp running in server mode which hosts a LLM (large language model)
  • I wasn’t happy with other solutions on the market and none of them could consume llama.cpp in server mode directly. I operate 2 GPU accelerated instances of llama.cpp which is used by multiple tools like SumBot, ExtBrain, RAGTAG (soon, it needs updating) and a few python scripts like my website generator
  • BottyBot has been a huge success for me, as I use it daily as my main chat interface to LLM models. The back end llama.cpp server is usually running the OpenHermes Mistral or Dolphin Mistral families of LLM models.
  • BottyBot supports different “bot” identities that represent different personalities that can be interesting to interact with. The entire set of built in identities were generated by BottyBot!
  • BottyBot was a perfect example of bootstrapping: I designed a lot of the application with OpenAI’s ChatGPT 3, but as soon as the UI was running well enough, all features from that point on were added by talking to the LLM and getting useful python code for features I wanted. It’s now being used for all future products.
  • I added export functionality to produce nicely formatted exports for conversations. These are useful for sharing with others.
  • BottyBot is not an example of RAG (retrieval augmented generation) like RAGTAG or ExtBrain. BottyBot uses the LLM directly without any augmentation.
  • A future enhancement to this tool could include manual augmentation or vector search for augmenting the LLM prompt.
  • I love this tool and so far, it’s provided the most value to me personally and is a perfect example of using local, opensource large language models with full privacy.

Tool Series - InstructorVec

Tool Series - InstructorVec: A Single Endpoint for Text Embedding

https://www.github.com/patw/InstructorVec

In this series, we dive into different tools and techniques used to build Generative AI applications. As we progress through the series, we’ll explore a variety of methods that aid in creating efficient and powerful models. Today, we delve into InstructorVec, the latest evolution in the VectorService tool family.

InstructorVec is not just another text embedding model; it’s an innovative approach to generating dense vectors for production use cases. It’s designed as a single endpoint that calls the instructor-large model from HuggingFace, marking a departure from the previous VectorService tool that hosted eight different embedding endpoints. This change allows us to focus entirely on a singular solution for text embedding and outputting 768 dimension dense vectors, making it ideal for real-world applications.

To achieve this, InstructorVec loads the full instructor-large model but quantizes it from its original FP32 precision down to FP8. This slight reduction in precision offers significant performance improvements without compromising the quality of the outputs. In fact, executing InstructorVec takes no more than 100 milliseconds on a CPU, compared to the 1000 millisecond execution time at full FP32 precision. As a result, many RAG (Retrieval Augmented Generation) tools that consume this service have become much more responsive and efficient.
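
A minimal single-endpoint embedding service in the spirit of InstructorVec might look like the sketch below, using FastAPI and the InstructorEmbedding package; the endpoint name and default instruction are assumptions, and the quantization step described above is omitted for brevity.

```python
# Hedged sketch of a single text-embedding endpoint returning 768-d vectors.
from fastapi import FastAPI
from pydantic import BaseModel
from InstructorEmbedding import INSTRUCTOR

app = FastAPI()
model = INSTRUCTOR("hkunlp/instructor-large")  # 768-dimension output

class EmbedRequest(BaseModel):
    text: str
    instruction: str = "Represent the document for retrieval:"

@app.post("/embed")
def embed(req: EmbedRequest) -> list[float]:
    # Instructor models take an (instruction, text) pair per input.
    vector = model.encode([[req.instruction, req.text]])[0]
    return vector.tolist()
```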

As we continue to build new production applications, InstructorVec will serve as the baseline vectorizer for all our tools moving forward. This is because its performance and precision make it a reliable solution for various use cases. Moreover, we currently maintain multiple copies of this service to cater to different production requirements.

In the future, we may expand InstructorVec’s capabilities by adding a similarity checking endpoint. Similarity checking was one of the most useful features in the legacy VectorService tool, and its integration into InstructorVec will further enhance its versatility and utility.

To sum it up, InstructorVec is an innovative single-endpoint solution for text embedding that delivers high performance and precision while being ideal for production applications. As we continue to refine this tool, we’re excited about the possibilities it holds for enhancing our Generative AI applications and unlocking new levels of efficiency in RAG tools. Stay tuned as we explore more exciting developments in the world of text embedding and Generative AI!

  • Human Intervention: Minor. I added the URL for the github repo. Sometimes it adds it as a link in the text, other times it forgets it entirely.

Facts Used:

  • This series covers different tools used for building Generative AI (genai) applications
  • InstructorVec is the next generation of VectorService tool, however instead of hosting 8 different embedding endpoints, it hosts a single endpoint that calls the instructor-large model from HuggingFace.
  • The other text embedding models in VectorService are nice to show off differences in vector outputs and can demonstrate how different models measure similarity between two strings of text.
  • InstructorVec is focused entirely on being a single endpoint for text embedding and outputting 768 dimension dense vectors for production use cases.
  • This tool loads the full instructor-large model, but quantizes it from full FP32 precision down to FP8. For a small loss in precision in the model weights, it executes in 100 ms or less on CPU compared to 1000 ms at full precision. This has made some RAG (retrieval augmented generation) tools that consume this service, much more responsive.
  • This will be the baseline vectorizer I’ll be using in all tools moving forward, and currently operate multiple copies of this for servicing different production applications
  • Future state for this tool might include a similarity checking endpoint, as this has been very useful in the legacy VectorService tool.

Tool Series - RAGTAG

Tool Series - RAGTAG: An In-depth Look at Retrieval Augmented Generation

https://www.github.com/patw/RAGTAG

In our ongoing series covering various tools for building Generative AI applications, we dive into the third tool: RAGTAG. This end-to-end example of RAG (Retrieval Augmented Generation) allows you to experiment with question/answer pairs, test lexical and semantic search capabilities, and generate an LLM (Large Language Model) response using semantically augmented data.

As a simple CRUD application, RAGTAG enables users to create and modify question/answer pairs that are combined into a single chunk of text. This text is then run through the “instructor-large” text embedding model for retrieval later on. All semantic search in RAGTAG is performed using vector search with the same “instructor-large” model.

The key to RAGTAG’s effectiveness lies in its tunable features. The chunk testing tool allows users to adjust the score cut-off for cosine similarity in vector search as well as control the K value, which determines the overrequest value on the vector search query. Meanwhile, the LLM tester provides an interface to set the above semantic search parameters along with system messages, prompts, and user questions.
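
To illustrate how those two tunables could map onto a vector search query, here is a hedged sketch using a MongoDB Atlas $vectorSearch aggregation: the K value becomes the overrequest (numCandidates) and the score cut-off becomes a filter on the cosine similarity score. Index, field, and collection names are placeholders, not RAGTAG's actual schema.

```python
# Hedged sketch of a tunable vector search over question/answer chunks.
def search_chunks(collection, question_vector, k=20, limit=4, score_cutoff=0.75):
    pipeline = [
        {
            "$vectorSearch": {
                "index": "chunk_index",          # placeholder index name
                "path": "embedding",             # placeholder vector field
                "queryVector": question_vector,  # embedding of the user question
                "numCandidates": k,              # overrequest: candidates considered
                "limit": limit,                  # chunks actually returned
            }
        },
        {"$project": {"chunk_text": 1, "score": {"$meta": "vectorSearchScore"}}},
        {"$match": {"score": {"$gte": score_cutoff}}},  # cosine score cut-off
    ]
    return list(collection.aggregate(pipeline))
```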

RAGTAG began as an experiment and is still used in production for question answering by MongoDB’s specialist search group, demonstrating the power of RAG in a practical application. However, there is room for improvement. The LLM (implemented with the llama.cpp Python module) runs on CPU instead of GPU, which makes response generation quite slow.

Looking towards the future, we envision a more efficient version of RAGTAG that incorporates InstructorVec for text embedding and runs llama.cpp in server mode. By leveraging these advancements, RAGTAG will be able to share infrastructure with other tools and benefit from GPU-accelerated token generation for faster response times.

In conclusion, RAGTAG is an essential tool for those looking to experiment with Retrieval Augmented Generation. With its robust capabilities and potential for improvement, it continues to be a valuable resource within our Generative AI toolkit. Stay tuned as we explore further advancements in this exciting field!

  • Human Intervention: Added URL for github project and fixed InstructionVec to InstructorVec

Facts Used:

  • This series covers different tools used for building Generative AI (genai) applications
  • RAGTAG is an end-to-end example of RAG (retrieval augmented generation) allowing you to manually create question/answer pairs, testing lexical and semantic search and allow you to generate an LLM (large language model) response with semantic search augmented data and manipulate the system message and the prompt to see how it changes the output.
  • This is a pretty simple CRUD application that lets you create and modify question/answer pairs which get appended together as a single chunk of text and run through the “instructor-large” text embedding model for retrieval later.
  • All semantic search is performed using vector search using the same instructor-large model
  • The chunk testing tool allows you to tune the score cut-off for the cosine similarity in the vector search as well as the K value, which controls the overrequest value on the vector search query
  • The LLM tester allows you to set the above semantic search values as well as the system message and the LLM prompt format along with the users question.
  • RAGTAG was a great experiment and is still used in production for question answering for our own specialist search group at MongoDB
  • This tool also integrated the text embedding model and the LLM into a single installable package, which was very convenient. However, the LLM (running on llama.cpp python module) will run on CPU instead of GPU making the output responses quite slow
  • The future for this tool is to migrate it to the InstructorVec for text embedding and on llama.cpp running in server mode, so it can share infra with other tools and run on GPU for much faster token generation.

Tool Series - SumBot

Tool Series - SumBot: A Powerful AI Summarization Tool for Structured Data

https://www.github.com/patw/sumbot

In our ongoing series covering various tools used for building Generative AI (genai) applications, we are excited to introduce you to SumBot, a Python FastAPI service designed specifically for summarizing structured data into semantically rich English text. As the second tool in this series, SumBot has proven its worth as an essential addition to any genai developer’s toolbox, particularly when working with JSON or XML data.

What is SumBot?

SumBot is a powerful AI summarization tool that takes structured data (usually JSON) and converts it into coherent paragraphs of English text. With just a single endpoint (summarize) and two parameters, entity and data, this Python FastAPI service can quickly process and summarize your data, making it ideal for running through text embedding models like BERT or Instructor-large.

Why Use SumBot?

Embedding models often struggle to perform well on JSON, XML, point form data, or tabular data. By using an LLM (Large Language Model) for pre-summarization before text embedding, you can significantly improve recall and precision for semantic search. SumBot was the first tool I hosted on a GPU with llama.cpp running in server mode, utilizing the OpenHermes-2.5-Mistral-7b model to provide accurate summarizations.

How Does SumBot Work?

The actual LLM prompt uses the entity parameter to guide the LLM into summarizing the JSON or XML data. This guidance can be necessary if the keys in your JSON document aren’t clear enough for the LLM to figure out what it’s summarizing. Thankfully, SumBot doesn’t require validation of whether the input data is actually JSON or XML; it can summarize almost anything as long as you provide it an entity and data.
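
Putting those pieces together, a summarize endpoint in the spirit of SumBot might look like the sketch below; the prompt wording and llama.cpp endpoint are assumptions rather than the tool's exact implementation.

```python
# Hedged sketch of a FastAPI summarize endpoint guided by an entity parameter.
import requests
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
LLAMA_SERVER = "http://localhost:8080/completion"  # assumed llama.cpp server endpoint

class SummarizeRequest(BaseModel):
    entity: str  # what the data describes, e.g. "a hotel listing"
    data: str    # raw JSON/XML/point-form text; no validation is performed

@app.post("/summarize")
def summarize(req: SummarizeRequest) -> str:
    prompt = (
        f"Summarize the following data about {req.entity} "
        f"as a paragraph of plain English:\n{req.data}\n"
    )
    resp = requests.post(LLAMA_SERVER, json={"prompt": prompt, "n_predict": 512})
    return resp.json()["content"]
```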

Deploying SumBot

SumBot can be deployed against any llama.cpp server running locally or could be easily updated to point to a hosted service like Mistral.ai or OpenAI. This flexibility makes SumBot an excellent choice for developers who need to quickly process and summarize large amounts of structured data while taking advantage of the latest advancements in AI technology.

In conclusion, SumBot is a valuable addition to any genai developer’s toolbox. Its ability to transform JSON or XML data into coherent English text using LLM pre-summarization makes it an essential tool for improving recall and precision in semantic search. As the second installment in our series on tools for building Generative AI applications, SumBot demonstrates the power of leveraging cutting-edge technology to optimize workflows and enhance productivity.

  • Human Intervention: Minor. I had to add the Github URL to sumbot to the article.

Facts Used:

  • This series covers different tools used for building Generative AI (genai) applications
  • SumBot is used for summarizing structured data (usually JSON) into paragraphs of semantically rich english text.
  • The tool is a Python FastAPI service with a single endpoint (summarize) and two parameters: entity and data
  • The output of SumBot is ideal for running through text embedding models like BERT or Instructor-large
  • Embedding models tend to perform poorly on JSON, XML, point form data and tabular data. Using an LLM (large language model) for pre-summarization before text embedding can provide a drastic increase in recall and precision for semantic search
  • SumBot was the first tool I hosted on GPU with llama.cpp running in server mode with the OpenHermes-2.5-Mistral-7b model.
  • The actual LLM prompt uses the entity parameter to guide the LLM into summarizing the JSON or XML data. This can be necessary if the keys in the JSON document aren’t clear enough for the LLM to figure out what it’s summarizing.
  • SumBot doesn’t validate if it’s actually JSON or XML data! It can be used to summarize almost anything, as long as you provide it an entity and data.
  • This tool can be deployed against any llama.cpp server running locally or could be easily updated to point to a hosted service like Mistral.ai or OpenAI

Tool Series - VectorService

Tool Series - VectorService: Exploring the Journey of Text Embedding Models

In this series, we will dive deep into different tools used for building Generative AI (GenAI) applications. The first tool in our exploration is VectorService, a FastAPI service that generates dense vectors using various text embedding models.

The initial motivation behind VectorService was to test out multiple different text embedding models for generating semantic search capabilities for RAG (Retrieval Augmented Generation) tools with LLMs (Large Language Models). The journey of exploring and implementing various embedding models in this tool has taught us valuable lessons about the evolution of language processing techniques.

From SpaCY to SentenceTransformers: A Journey of Improvement

VectorService’s initial models were sourced from the Python SpaCY Library. We implemented small (96d), medium (384d), and large (384d) SpaCY models, which proved to be quite easy to use. However, these models performed poorly beyond a few words or a single sentence compared to more modern alternatives like BERT. They remain in use for legacy reasons in some applications.

To improve the quality of text embeddings, we then moved on to using the SentenceTransformers library, which was incredibly easy to work with. The library provided two models: all-MiniLM-L6-v2 (384d) and all-mpnet-base-v2 (768d). These models performed significantly better than SpaCY, demonstrating the advancements in language processing techniques over time. Many RAG examples online still show MiniLM as the text embedding model of choice; while larger models such as all-mpnet-base-v2 outperform it, it remains quite good for its size.

BERT: A Significant Leap in Quality

Next, we explored using BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking language processing model developed by Google. BERT uses 768 dimensions and delivered a substantial improvement in the quality of text embeddings compared to its predecessors. The recall and precision metrics showed significant jumps, indicating that BERT was a major step forward for text embedding models.

SOTA: Instructor-Large Takes Center Stage

The final model we integrated into VectorService was Instructor-Large, a state-of-the-art (SOTA) language processing model at the time of its release that quickly reached the top of the HuggingFace MTEB leaderboard. The model requires about 4 gigabytes of memory to run and is quite slow on CPU.

However, the quality level of Instructor-Large was considered the bare minimum for production use cases, and it could be quantized to reduce its memory footprint and latency. This model required LLM-style prompting to produce optimal results and directly competed with OpenAI’s much larger text-ada-002 model, which is typically used as a default in RAG applications.

A Comparison of Models: The Power of Benchmarking

To provide users with valuable insights into the performance of different models, VectorService included endpoints for comparing similarity results across all text embedding models implemented. This feature allowed users to benchmark recall and precision metrics between various models, which in turn helped them optimize their RAG use cases.
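
The comparison idea is easy to sketch: embed the same two strings with several models and report cosine similarity side by side. The models below are illustrative stand-ins rather than VectorService's exact lineup.

```python
# Hedged sketch of comparing similarity scores across embedding models.
from sentence_transformers import SentenceTransformer, util

MODELS = {
    "all-MiniLM-L6-v2 (384d)": SentenceTransformer("all-MiniLM-L6-v2"),
    "all-mpnet-base-v2 (768d)": SentenceTransformer("all-mpnet-base-v2"),
}

def compare(text_a: str, text_b: str) -> dict[str, float]:
    results = {}
    for name, model in MODELS.items():
        vec_a, vec_b = model.encode([text_a, text_b], convert_to_tensor=True)
        results[name] = util.cos_sim(vec_a, vec_b).item()  # cosine similarity
    return results

print(compare("local LLM inference", "running language models on-premises"))
```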

The Evolution of Text Embedding Models: From Legacy to State-of-the-Art

As we have seen, the journey from SpaCY to Instructor-Large showcases the evolution of text embedding models and how they have improved over time. VectorService served as a valuable experimentation platform for exploring different models and their capabilities. However, it is now considered legacy and not recommended for use. Instead, we recommend InstructorVec, an open-source alternative that offers self-hosted vector embeddings with strong performance and ease of use.

In conclusion, the development of VectorService has been a fascinating journey through the world of text embedding models. From SpaCY’s early attempts to BERT’s groundbreaking achievements and Instructor-Large’s SOTA status, we have witnessed the incredible progress made in language processing technologies. As we move forward into an era of increasingly sophisticated AI applications, it is essential for developers and researchers alike to continue exploring new frontiers in this rapidly evolving field.

  • Human Intervention: Moderate: the bot continues to call the Instructor model Instruction and totally made up a research paper for it. I also had to correct the two sentence transformer model names, but that was entirely my fault for providing the wrong names in the facts. I wrote it on a plane while I was half asleep.

Facts Used:

  • This series covers different tools used for building Generative AI (genai) applications
  • Text embedding models that output dense vectors are critical for building semantic search for RAG (retrieval augmented generation) tools with LLMs (large language models)
  • Originally wanted to test out multiple different text embedding models, so built a FastAPI service in Python that would generate dense vectors using different models.
  • The first 3 models were from the Python SpaCY Library. I implemented small (96d), medium (384d) and large (384d). SpaCY was very easy to use, but performed pretty poorly beyond a few words or a single sentence, compared to more modern models like BERT. They’re still in the process for legacy reasons.
  • The next two models were minilm-l6 (384d) and mpnet-alllm (768d). These models performed much better than SpaCY and used the HuggingFace SentenceTransformers library, which was super easy to use. Many RAG examples online still show minilm as the text embedding model, and while it performs poorly compared to larger models it’s still quite good.
  • Next, I tried BERT (768d). This model used 768 dimensions and seemed to be another large step up in quality for embeddings and was the first time I saw large jumps in recall and precision. BERT has much more dimensions than minilm, but performed better in all my tests
  • Finally I added in Instructor-large (768d). This model was considered SOTA (state of the art) for the time it released and quickly became #1 on the Huggingface MTEB leaderboard. The model itself needed 4 gigs of memory to run and is quite slow on CPU. However, the quality level should be considered the bare minimum for production use cases, and can be quantized to run with less memory and less latency. Instructor requires LLM style prompting to produce good results and competes directly with OpenAI’s much larger text-ada-002 model, which is the default for most RAG use cases.
  • This tool also included endpoints for comparing similarity results across all the models, which is useful to show off in a demo. It gives customers the idea that they should be benchmarking recall and precision between multiple models to optimize the RAG use case.
  • At this point, VectorService is legacy and not recommended. InstructorVec (https://www.github.com/patw/InstructorVec) is considered a replacement for this tool. It was a great exercise for experimenting with different embedding models but InstructorVec is all you need for self hosted, open source vector embeddings.