
ChatGPT that can understand speech is coming: 10 hours of recording thrown in, ask what you want

Author: Heart of the Machine Pro

Editor: Zhang Qian

You can now paste an audio file into the input box of a ChatGPT-like model.

Large language models (LLMs) are changing user expectations in every industry. However, building generative AI products around human speech remains difficult, because audio files pose a challenge for LLMs.

A key challenge in applying LLMs to audio files is the limited context window. Before an audio file can be fed to an LLM, it must be converted to text, and the longer the audio file, the greater the engineering challenge of working around the context window limit. Yet in real work scenarios, we often need an LLM to process very long recordings: extracting the core content from a multi-hour meeting recording, for example, or finding the answer to a question in an interview.
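To see why long recordings overflow a context window, a back-of-envelope estimate helps. The rates below (~150 spoken words per minute, ~1.3 tokens per English word) are common rules of thumb, not figures from AssemblyAI:

```python
# Rough estimate of how many LLM tokens a speech transcript produces.
# Both rates are rule-of-thumb assumptions, not measured values.
WORDS_PER_MINUTE = 150   # typical conversational speaking rate
TOKENS_PER_WORD = 1.3    # typical English tokenization overhead

def transcript_tokens(minutes: float) -> int:
    """Estimate the token count of `minutes` of transcribed speech."""
    return round(minutes * WORDS_PER_MINUTE * TOKENS_PER_WORD)

# A 45-minute recording already roughly fills an 8K-token context window:
print(transcript_tokens(45))       # prints 8775
# A 10-hour recording is far beyond it:
print(transcript_tokens(10 * 60))  # prints 117000
```

These estimates line up with the article's figures: about 45 minutes of audio per 8K-token window, and on the order of 100K+ tokens for 10 hours of speech.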

Recently, speech recognition AI company AssemblyAI launched a new model called LeMUR. Just as ChatGPT can process dozens of pages of PDF text, LeMUR can transcribe speech, process it, summarize its core content, and answer questions entered by the user.


Trial address: https://www.assemblyai.com/playground/v2/source

LeMUR, short for Leveraging Large Language Models to Understand Recognized Speech, is a new framework for applying powerful LLMs to transcribed speech. With just one line of code (via AssemblyAI's Python SDK), LeMUR can quickly process transcriptions of up to 10 hours of audio, effectively converting them into about 150,000 tokens. In contrast, an off-the-shelf, vanilla LLM can only fit about 8K tokens, or roughly 45 minutes of transcribed audio, within its context window.
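As a sketch of what that workflow looks like with AssemblyAI's Python SDK: the method names below (`Transcriber.transcribe`, `transcript.lemur.task`) follow the SDK's documented surface, but treat this as an illustration rather than a spec, since exact names can differ between SDK versions:

```python
# Hypothetical sketch: transcribe an audio file, then run a LeMUR prompt
# over the transcript. Requires an AssemblyAI API key and network access.
def summarize_audio(audio_url: str, prompt: str, api_key: str) -> str:
    """Transcribe `audio_url`, then ask LeMUR `prompt` about the transcript."""
    import assemblyai as aai  # pip install assemblyai

    aai.settings.api_key = api_key
    transcript = aai.Transcriber().transcribe(audio_url)
    result = transcript.lemur.task(prompt)
    return result.response
```

Called as, say, `summarize_audio(url, "Summarize the key decisions in this meeting.", key)`, the heavy lifting (segmentation, retrieval, prompting) happens server-side behind that single `task` call.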


To reduce the complexity of applying LLMs to transcribed audio files, LeMUR's pipeline consists of intelligent segmentation, a fast vector database, and several inference steps such as chain-of-thought prompting and self-assessment, as shown in the diagram below:


Figure 1: LeMUR's architecture enables users to send long and/or multiple audio transcriptions into LLM with a single API call.
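The first two stages of such a pipeline, segmentation and vector retrieval, can be illustrated with toy stand-ins. The chunker and bag-of-words "embeddings" below are deliberately naive placeholders (LeMUR's actual segmentation and vector database are not public); the point is only the shape of the pipeline: split a long transcript into chunks, then retrieve the chunks most relevant to the user's question:

```python
# Toy chunk-and-retrieve pipeline: a naive stand-in for LeMUR's
# segmentation + vector-database stages, using bag-of-words vectors.
import math
from collections import Counter

def chunk(text: str, max_words: int = 50) -> list[str]:
    """Split a transcript into fixed-size word windows (naive segmentation)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (toy vector store)."""
    q = Counter(query.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: cosine(Counter(c.lower().split()), q),
                    reverse=True)
    return ranked[:k]
```

Only the retrieved chunks, rather than the full 150K-token transcript, would then be packed into the LLM's context window along with the prompt.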

In the future, LeMUR is expected to be widely used in customer service and other fields.

"LeMUR unlocks some amazing new possibilities that I would have thought impossible just a few years ago. Being able to effortlessly extract valuable insights, such as determining the best next action or discerning call outcomes like sales, appointments, or call purposes, is truly amazing." — Ryan Johnson, Chief Product Officer at CallRail, a call tracking and analytics company

What possibilities does LeMUR unlock?

Apply LLM to multiple audio texts

LeMUR lets users run an LLM over multiple audio files at once, and over transcriptions of up to 10 hours of speech, which convert to roughly 150K tokens of text.


Reliable, safe output

Because LeMUR includes safety measures and content filters, its LLM responses are less likely to contain harmful or biased language.


Supplemental context

At inference time, users can supply additional contextual information that the LLM uses to generate personalized and more accurate results.


Modular, fast integration

LeMUR always returns structured data as processable JSON. Users can further customize LeMUR's output format to ensure the LLM responds in the format expected by the next piece of business logic (for example, converting an answer to a boolean value). This removes the need to write custom code to post-process the LLM's output.
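The boolean-conversion case above can be sketched as follows. The `answer` field name is illustrative, not LeMUR's actual schema; the point is that once the response is structured JSON, coercing it for downstream business logic is a few lines:

```python
# Coerce a structured LLM response into a boolean for downstream logic.
# The "answer" field name is a hypothetical schema, used for illustration.
import json

def answer_as_bool(llm_json: str, key: str = "answer") -> bool:
    """Parse an LLM's JSON response and coerce one field to True/False."""
    value = str(json.loads(llm_json)[key]).strip().lower()
    if value in {"yes", "true", "1"}:
        return True
    if value in {"no", "false", "0"}:
        return False
    raise ValueError(f"Unexpected answer value: {value!r}")

print(answer_as_bool('{"answer": "Yes"}'))  # prints True
```

With a guaranteed JSON shape, this kind of thin adapter is all the glue code the caller needs.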

Trial results

Heart of the Machine tested LeMUR using the trial link provided by AssemblyAI.

LeMUR's interface supports two ways to input files: uploading an audio or video file, or pasting a web link.


We used a recent interview with Hinton as input to test the performance of LeMUR.


After uploading, we are prompted to wait a while, since the speech must first be converted to text.


The interface after transcription is as follows:


On the right side of the page, we can ask LeMUR to summarize the interview or answer questions, which it handles with ease:


If the audio you're working with is a speech or a customer-service reply, you can also ask LeMUR for suggestions on how to improve it.


However, LeMUR does not yet seem to support Chinese. Interested readers can give it a try.
