
ChatGPT that can understand speech is coming: 10 hours of recording thrown in, ask what you want

Author: Heart of the Machine Pro

Editor: Zhang Qian

You can now paste an audio file into the input box of a ChatGPT-like model.

Large language models (LLMs) are changing user expectations in every industry. However, building generative AI products around human speech remains difficult, because audio files pose a challenge for LLMs.

A key challenge in applying LLMs to audio files is the limited context window. Before an audio file can be fed to an LLM, it must be converted to text, and the longer the audio file, the greater the engineering challenge of working around the context window limit. Yet in real work scenarios, we often need an LLM to process very long recordings: extracting the core content from a multi-hour meeting recording, for example, or finding the answer to a question in an interview.
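To see why long recordings overflow a context window, a back-of-envelope estimate helps. The rates below (~150 spoken words per minute, ~1.3 tokens per English word) are common rules of thumb, not figures from AssemblyAI:

```python
# Rough estimate of how many LLM tokens a speech transcript produces.
# Both rates are rule-of-thumb assumptions, not measured values.
WORDS_PER_MINUTE = 150   # typical conversational speaking rate
TOKENS_PER_WORD = 1.3    # typical English tokenization overhead

def transcript_tokens(minutes: float) -> int:
    """Estimate the token count of `minutes` of transcribed speech."""
    return round(minutes * WORDS_PER_MINUTE * TOKENS_PER_WORD)

# A 45-minute recording already roughly fills an 8K-token context window:
print(transcript_tokens(45))       # prints 8775
# A 10-hour recording is far beyond it:
print(transcript_tokens(10 * 60))  # prints 117000
```

These estimates line up with the article's figures: about 45 minutes of audio per 8K-token window, and on the order of 100K+ tokens for 10 hours of speech.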

Recently, speech recognition AI company AssemblyAI launched a new model called LeMUR. Just as ChatGPT can process dozens of pages of PDF text, LeMUR can transcribe speech, process it, summarize its core content, and answer questions entered by the user.


Trial address: https://www.assemblyai.com/playground/v2/source

LeMUR, short for Leveraging Large Language Models to Understand Recognized Speech, is a new framework for applying powerful LLMs to transcribed speech. With just one line of code (via AssemblyAI's Python SDK), LeMUR can quickly process transcriptions of up to 10 hours of audio, effectively converting them into about 150,000 tokens. In contrast, an off-the-shelf, vanilla LLM can only fit about 8K tokens, or roughly 45 minutes of transcribed audio, within its context window.
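As a sketch of what that workflow looks like with AssemblyAI's Python SDK: the method names below (`Transcriber.transcribe`, `transcript.lemur.task`) follow the SDK's documented surface, but treat this as an illustration rather than a spec, since exact names can differ between SDK versions:

```python
# Hypothetical sketch: transcribe an audio file, then run a LeMUR prompt
# over the transcript. Requires an AssemblyAI API key and network access.
def summarize_audio(audio_url: str, prompt: str, api_key: str) -> str:
    """Transcribe `audio_url`, then ask LeMUR `prompt` about the transcript."""
    import assemblyai as aai  # pip install assemblyai

    aai.settings.api_key = api_key
    transcript = aai.Transcriber().transcribe(audio_url)
    result = transcript.lemur.task(prompt)
    return result.response
```

Called as, say, `summarize_audio(url, "Summarize the key decisions in this meeting.", key)`, the heavy lifting (segmentation, retrieval, prompting) happens server-side behind that single `task` call.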


To reduce the complexity of applying LLMs to transcribed audio files, LeMUR's pipeline consists of intelligent segmentation, a fast vector database, and several inference steps such as chain-of-thought prompting and self-assessment, as shown in the diagram below:


Figure 1: LeMUR's architecture enables users to send long and/or multiple audio transcriptions into LLM with a single API call.
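The first two stages of such a pipeline, segmentation and vector retrieval, can be illustrated with toy stand-ins. The chunker and bag-of-words "embeddings" below are deliberately naive placeholders (LeMUR's actual segmentation and vector database are not public); the point is only the shape of the pipeline: split a long transcript into chunks, then retrieve the chunks most relevant to the user's question:

```python
# Toy chunk-and-retrieve pipeline: a naive stand-in for LeMUR's
# segmentation + vector-database stages, using bag-of-words vectors.
import math
from collections import Counter

def chunk(text: str, max_words: int = 50) -> list[str]:
    """Split a transcript into fixed-size word windows (naive segmentation)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (toy vector store)."""
    q = Counter(query.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: cosine(Counter(c.lower().split()), q),
                    reverse=True)
    return ranked[:k]
```

Only the retrieved chunks, rather than the full 150K-token transcript, would then be packed into the LLM's context window along with the prompt.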

In the future, LeMUR is expected to be widely used in customer service and other fields.

"LeMUR unlocks some amazing new possibilities that I would have thought impossible just a few years ago. Being able to effortlessly extract valuable insights, such as determining the best next action or discerning call outcomes like sales, appointments, or call purposes, is truly amazing." — Ryan Johnson, Chief Product Officer at CallRail, a call tracking and analytics company

What possibilities does LeMUR unlock?

Apply LLM to multiple audio texts

LeMUR lets users run an LLM over multiple audio files at once, and over transcriptions of up to 10 hours of speech, which convert to roughly 150K tokens of text.


Reliable, safe output

Because LeMUR includes safety measures and content filters, its LLM responses are less likely to contain harmful or biased language.


Supplemental context

At inference time, users can supply additional contextual information that the LLM uses to generate personalized and more accurate results.


Modular, fast integration

LeMUR always returns structured data as processable JSON. Users can further customize LeMUR's output format to ensure the LLM responds in the format expected by the next piece of business logic (for example, converting an answer to a boolean value). This removes the need to write custom code to post-process the LLM's output.
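The boolean-conversion case above can be sketched as follows. The `answer` field name is illustrative, not LeMUR's actual schema; the point is that once the response is structured JSON, coercing it for downstream business logic is a few lines:

```python
# Coerce a structured LLM response into a boolean for downstream logic.
# The "answer" field name is a hypothetical schema, used for illustration.
import json

def answer_as_bool(llm_json: str, key: str = "answer") -> bool:
    """Parse an LLM's JSON response and coerce one field to True/False."""
    value = str(json.loads(llm_json)[key]).strip().lower()
    if value in {"yes", "true", "1"}:
        return True
    if value in {"no", "false", "0"}:
        return False
    raise ValueError(f"Unexpected answer value: {value!r}")

print(answer_as_bool('{"answer": "Yes"}'))  # prints True
```

With a guaranteed JSON shape, this kind of thin adapter is all the glue code the caller needs.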

Trial results

Heart of the Machine tested LeMUR using the trial link provided by AssemblyAI.

LeMUR's interface supports two ways to input files: uploading an audio or video file, or pasting a web link.


We used a recent interview with Hinton as input to test the performance of LeMUR.


After uploading, we are prompted to wait a while, since the speech must first be converted to text.


The interface after transcription is as follows:


On the right side of the page, we can ask LeMUR to summarize the interview or answer questions, which it handles with ease:


If the audio you're working with is a speech or a customer-service reply, you can also ask LeMUR for suggestions on how to improve it.


However, LeMUR does not yet seem to support Chinese. Interested readers can give it a try.
