Tool Series - AudioSumma

Introduction

AudioSumma is a powerful tool that records the global audio (input and output) on a laptop or desktop, transcribes the audio into a transcript, and uses a Large Language Model (LLM) to summarize the transcript. This tool is particularly useful for professionals who need to analyze, summarize, and extract key information from long audio conversations. In this blog post, we’ll dive into the details of how AudioSumma works, its features, and its limitations.

How AudioSumma Works

AudioSumma works entirely locally, using whisper.cpp for audio transcription and llama.cpp for calling the LLM that does the summarization. The transcript is broken into parts (about 12k of text, or roughly every 15 minutes of audio), and each part is summarized independently. This keeps each request within the LLM's context window and maximizes its reasoning capability.
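
A minimal sketch of that chunking step follows. The 12k figure comes from the post; whether it counts characters or tokens isn't specified, so characters are assumed here, and the function name and splitting logic are illustrative rather than the project's actual code:

```python
def split_transcript(transcript: str, max_chars: int = 12_000) -> list[str]:
    """Split a transcript into parts of at most max_chars characters,
    preferring to break at line boundaries so sentences stay intact.
    (A single line longer than max_chars becomes its own oversized part.)"""
    parts, current = [], ""
    for line in transcript.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            parts.append(current)
            current = ""
        current += line
    if current:
        parts.append(current)
    return parts
```

Each part would then be sent to the llama.cpp server as its own summarization request.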

To use AudioSumma, you'll need Whisper and llama.cpp running in server mode on their default ports, either on your laptop or on a separate machine. You'll also need the en_base model for whisper.cpp and a Llama 3 9B model for the summarization task (Microsoft Phi-3 can produce decent summaries too, but Llama 3 is still the suggested choice).

Features

AudioSumma offers three distinct summaries for each part of the call:

  1. Overall Summary: This summary provides an understanding of what was discussed, in what order, and why it mattered.
  2. Fact Summary: This summary includes exact facts stated in the call, such as project names or timelines.
  3. Sentiment Summary: This summary helps you understand the tone of the call and whether it was positive or negative.

These summaries are designed to help users quickly grasp the main points, facts, and emotional undertones of the conversation, making it easier to review and analyze the content.
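
As a rough illustration, the three summaries for each part could be produced by sending the same transcript chunk with three different instructions. The prompt wording and helper below are hypothetical, not taken from the project's code:

```python
# One instruction per summary type, mirroring the three sections described above.
SUMMARY_PROMPTS = {
    "overall": "Summarize what was discussed in this part of the call, "
               "in what order, and why it mattered.",
    "facts": "List only the exact facts stated in this part of the call, "
             "such as project names or timelines.",
    "sentiment": "Describe the tone of this part of the call and whether "
                 "it was positive or negative.",
}

def build_requests(chunk: str) -> dict[str, str]:
    """Return one full prompt per summary type for a single transcript chunk."""
    return {name: f"{instruction}\n\nTranscript:\n{chunk}"
            for name, instruction in SUMMARY_PROMPTS.items()}
```

Each prompt would then go to the llama.cpp server as a separate completion request, so one part of the call yields three independent summaries.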

Limitations

While AudioSumma is a fantastic tool, it has a few limitations to consider:

  1. Most laptops' default recording device will capture global audio, but some macOS devices and lower-end laptops record only from the microphone. This limitation may impact the tool's usefulness in those environments.
  2. The summarization process is divided into parts (roughly every 15 minutes of audio). This approach can result in some information being overlooked or not included in the summaries.

Conclusion

Overall, I love AudioSumma and use it daily. It's a valuable addition to my growing set of AI tools that help with work. If you're looking for a way to quickly analyze and summarize audio conversations, AudioSumma is definitely worth exploring.

You can find the project on GitHub at: https://github.com/patw/AudioSumma.

Note: AudioSumma is a tool that I have created and maintain. If you have any questions or suggestions, feel free to reach out to me on GitHub or via email.

  • Human Intervention: None

Facts Used:

    • AudioSumma is a tool to record the global audio (input and output) on a laptop or desktop, transcribe the audio into a transcript and use an LLM to summarize the transcript
    • It works entirely locally, using whisper.cpp for audio transcription and llama.cpp for calling the LLM for summarization.
    • whisper.cpp only needs the en_base model to work mostly accurately, and you could get decent summaries from Microsoft Phi-3, but I still suggest a Llama 3 9B model for the task.
    • Whisper and Llama.cpp must be running in server mode, on the default ports. This can be on your laptop or on a separate machine.
    • The summarization step will be broken down into different parts (12k of text or roughly every 15 min of audio) and each part is summarized independently. This was done to maximize the reasoning capability of the LLM and to ensure we didn’t exceed the LLM context window.
    • Each part of the summary includes three different sub-sections: an overall summary of that part of the call, a fact-only summary, and a sentiment summary.
    • The overall summary is useful for understanding what was discussed, in what order and why it mattered.
    • The fact summary is useful for seeing exact facts stated in the call, like project names or timelines.
    • The sentiment summary is useful to understand the tone of the call and if it was positive/negative. Some calls are difficult, and you need to capture that.
    • The entire thing is written in Python and uses PyAudio. It always uses recording device -1, which on Windows maps to the default recording device.
    • Most laptops' default recording device will record global audio, but macOS and some lower-end laptops seem to record only from the mic. This limits the usefulness of the tool in those environments.
    • Overall, I love this tool and use it daily. It’s a nice addition to my growing set of AI tools that help with work.