
Prompts unlock speech language model generation capabilities: SpeechGen handles speech translation and inpainting tasks

Author: Heart of the Machine Pro

Heart of the Machine column

Heart of the Machine Editorial Office

This paper proposes a unified framework called SpeechGen that can be used with arbitrary speech LMs for a variety of speech generation tasks, and it shows good potential.

Link to paper: https://arxiv.org/pdf/2306.02207.pdf

Demo page: https://ga642381.github.io/SpeechPrompt/speechgen.html

Code: https://github.com/ga642381/SpeechGen

Introduction and motivation

Large Language Models (LLMs) have attracted considerable attention for Artificial Intelligence Generated Content (AIGC), especially since the advent of ChatGPT.

However, how to process continuous speech with large language models is still an unsolved challenge, which hinders the application of large language models to speech generation. Because speech signals contain rich information beyond plain text, such as speaker identity and emotion, speech-based language models (speech LMs) are emerging.

Although speech language models are still in their early stages compared to text-based language models, they hold great promise precisely because speech data contains richer information than text.

Researchers are actively exploring the potential of the prompt paradigm to harness the power of pre-trained language models. Prompting guides a pre-trained language model to perform specific downstream tasks by fine-tuning only a small number of parameters. This technique is favored in NLP for its efficiency and effectiveness. In the field of speech processing, SpeechPrompt has demonstrated significant improvements in parameter efficiency and achieved competitive performance on various speech classification tasks.

However, whether prompts can help speech language models with generation tasks remains an open question. In this paper, we propose an innovative unified framework, SpeechGen, which aims to unleash the potential of speech language models for generation tasks. As shown in the figure below, by feeding speech together with a specific prompt to the speech LM, the model performs the corresponding task; for example, given the speech translation prompt (shown in red), the speech LM performs speech translation.


The framework we propose has the following advantages:

1. Textless: Our framework, and the speech language models it relies on, operate entirely without text data, which is invaluable. After all, obtaining labeled text paired with speech is time-consuming and tedious, and for some languages suitable text may not even exist. Being text-free allows our powerful speech generation capabilities to cover a wide variety of language needs, benefiting all of humanity.

2. Versatility: We have developed a framework that is extremely versatile and can be applied to a wide variety of speech generation tasks. The experiments in the paper use speech translation, speech inpainting, and speech continuation as examples.

3. Easy to follow: Our proposed framework provides a common solution for a wide range of speech generation tasks, making designing downstream models and loss functions a breeze.

4. Transferability: Our framework is not only easy to adapt to more advanced speech language models in the future, but also has great potential for further gains in efficiency and effectiveness. What is especially exciting is that our framework will become even more powerful as advanced speech language models arrive.

5. Affordability: Our framework is carefully designed to train only a small number of parameters rather than the entire huge language model. This greatly reduces the computational burden and allows training to be carried out on a GTX 2080 GPU; even a university lab can afford such computational overhead.

Introduction to SpeechGen


Our method builds a new framework, SpeechGen, which prompt-tunes speech language models (SLMs) for various downstream speech generation tasks. During training, the parameters of the SLM remain unchanged; our method focuses on learning task-specific prompt vectors. The SLM produces the output required for a specific speech generation task by conditioning on both the prompt vectors and the input units. The generated discrete units are then fed into a unit-based speech synthesizer to produce the corresponding waveform.
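
To make the flow concrete, here is a minimal sketch of the inference path. The component names (`speech_encoder`, `slm`, `unit_vocoder`) and the `generate`/`embed_units` interface are hypothetical stand-ins for illustration, not the released SpeechGen API.

```python
import torch

def speechgen_generate(waveform, prompt_vectors, speech_encoder, slm, unit_vocoder):
    """Hypothetical sketch of the SpeechGen inference flow:
    waveform -> discrete units -> (prompt + units) -> SLM -> units -> waveform."""
    # 1. Quantize the input waveform into a sequence of discrete unit IDs.
    units = speech_encoder(waveform)                      # LongTensor, shape [T]

    # 2. Embed the units and prepend the learned, task-specific prompt vectors.
    unit_embeds = slm.embed_units(units)                  # [T, d_model]
    conditioned = torch.cat([prompt_vectors, unit_embeds], dim=0)

    # 3. The frozen SLM generates the output unit sequence for the task.
    output_units = slm.generate(inputs_embeds=conditioned.unsqueeze(0))

    # 4. A unit-based vocoder converts the generated units back into a waveform.
    return unit_vocoder(output_units)
```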

Our SpeechGen framework consists of three elements: a speech encoder, an SLM, and a speech decoder.

First, the speech encoder takes the waveform as input and converts it into a sequence of units drawn from a finite vocabulary. To shorten the sequence, repeated contiguous units are collapsed, producing a compressed unit sequence. The SLM then acts as a language model over unit sequences, optimizing the likelihood of each unit given the preceding units. We apply prompt tuning to the SLM to guide it to generate units appropriate for the task. Finally, the units generated by the SLM are processed by the speech decoder, which converts them back into a waveform. In our prompt tuning strategy, prompt vectors are inserted at the beginning of the input sequence, steering the SLM during generation. The number of prompts inserted depends on the SLM architecture: in a sequence-to-sequence model, prompts are prepended to both the encoder and decoder inputs, whereas in an encoder-only or decoder-only architecture only a single prompt is prepended to the input sequence.
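
The de-duplication step mentioned above is plain run-length collapsing of repeated unit IDs. A minimal, self-contained illustration (not the authors' exact preprocessing code):

```python
from itertools import groupby

def collapse_repeats(units):
    """Remove repeated contiguous units, e.g. [5, 5, 5, 12, 12, 7] -> [5, 12, 7]."""
    return [u for u, _ in groupby(units)]

# Example with a HuBERT-style unit sequence before and after compression.
print(collapse_repeats([5, 5, 5, 12, 12, 7, 7, 7, 7]))  # [5, 12, 7]
```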

In sequence-to-sequence SLMs such as Unit mBART, we employ a self-supervised model such as HuBERT to process both the input and the target speech, producing discrete units for the input and corresponding discrete units for the target. We prepend prompt vectors to both the encoder and decoder inputs to construct the input sequences. In addition, we further strengthen the guidance of the prompts by inserting prompts at the key-value pairs of the attention mechanism.
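
The sketch below shows one way learnable prompt vectors could be prepended to the encoder and decoder inputs of a sequence-to-sequence unit LM. The wrapper, the `embed_units` method, and the `inputs_embeds`/`logits` interface are assumptions for illustration, and the deep key-value prompts described above are omitted.

```python
import torch
import torch.nn as nn

class PromptedSeq2SeqSLM(nn.Module):
    """Wraps a frozen seq2seq unit LM (e.g. a Unit mBART-like model) with
    learnable prompt vectors prepended to the encoder and decoder inputs."""

    def __init__(self, slm, num_prompts: int = 32, d_model: int = 1024):
        super().__init__()
        self.slm = slm
        for p in self.slm.parameters():          # the SLM itself stays frozen
            p.requires_grad = False
        # Task-specific prompt vectors: the only trainable parameters.
        self.enc_prompt = nn.Parameter(torch.randn(num_prompts, d_model) * 0.02)
        self.dec_prompt = nn.Parameter(torch.randn(num_prompts, d_model) * 0.02)

    def forward(self, input_units, target_units):
        B = input_units.size(0)
        enc = torch.cat([self.enc_prompt.expand(B, -1, -1),
                         self.slm.embed_units(input_units)], dim=1)
        dec = torch.cat([self.dec_prompt.expand(B, -1, -1),
                         self.slm.embed_units(target_units)], dim=1)
        out = self.slm(inputs_embeds=enc, decoder_inputs_embeds=dec)
        # Drop the logits at the prompt positions so the remaining positions
        # line up one-to-one with the target unit sequence.
        return out.logits[:, self.dec_prompt.size(0):]
```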

For training, we use cross-entropy loss as the objective for all generation tasks, computed between the model's predictions and the target discrete unit labels. The prompt vectors are the only parameters in the model that need to be trained, while the SLM's parameters remain unchanged during training, which ensures consistent model behavior. By inserting prompt vectors, we guide the SLM to extract task-specific information from the input and increase the likelihood of producing output that matches the target speech generation task. This approach lets us fine-tune and adjust the SLM's behavior without modifying its underlying parameters.
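
A corresponding training loop, assuming the wrapper sketched above: only the prompt parameters are handed to the optimizer, and the loss is cross-entropy between the predicted and target discrete units (the padding ID is an assumption).

```python
import torch
import torch.nn.functional as F

def train_prompts(model, dataloader, lr=1e-3, pad_id=0):
    """Prompt-tuning loop: the SLM stays frozen; only prompt vectors are updated."""
    prompt_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(prompt_params, lr=lr)

    for input_units, target_units in dataloader:
        logits = model(input_units, target_units)        # [B, T_tgt, vocab]
        # Teacher-forced next-unit prediction: position t predicts unit t+1.
        loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                               target_units[:, 1:].reshape(-1),
                               ignore_index=pad_id)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```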

In short, our method rests on the new SpeechGen framework, which trains prompt vectors to steer the generation process so that the model effectively produces output that satisfies a specific speech generation task.

Experiments

Our framework can be used with any speech LM and for various generation tasks, and has great potential. In our experiments, since VALL-E and AudioLM are not open source, we chose Unit mBART as the speech LM for our case study. We use speech translation, speech inpainting, and speech continuation as examples to demonstrate the capabilities of the framework. A schematic of these three tasks is shown in the figure below. All tasks take speech as input and produce speech as output, with no text required.


Speech translation

For speech translation we train on the Spanish-to-English task: the model receives Spanish speech and is expected to produce English speech, with no text involved anywhere in the process. Below are a few speech translation examples, showing the ground truth and the model's prediction. These demos show that the model's predictions capture the core meaning of the correct answer.


Speech inpainting

In our speech inpainting experiment, we select audio clips longer than 2.5 seconds as target speech and randomly choose a segment of 0.8 to 1.2 seconds within each clip. We then mask the selected segment to simulate the missing or damaged portion in the speech inpainting task. We use word error rate (WER) and character error rate (CER) as metrics to assess how well the damaged segment is repaired.
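
A minimal sketch of this masking procedure: take an utterance longer than 2.5 seconds and zero out a randomly placed 0.8-1.2 second span. The sample rate and the choice of zero as the mask value are assumptions for illustration.

```python
import random
import torch

def mask_random_segment(waveform, sample_rate=16_000,
                        min_len=0.8, max_len=1.2, min_total=2.5):
    """Zero out a random 0.8-1.2 s span of an utterance longer than 2.5 s,
    simulating the damaged region for speech inpainting."""
    assert waveform.size(-1) / sample_rate > min_total, "utterance too short"

    seg_len = int(random.uniform(min_len, max_len) * sample_rate)
    start = random.randint(0, waveform.size(-1) - seg_len)

    damaged = waveform.clone()
    damaged[..., start:start + seg_len] = 0.0    # masked (damaged) region
    return damaged, (start, start + seg_len)

# Example: mask a 3-second dummy utterance.
damaged, span = mask_random_segment(torch.randn(3 * 16_000))
```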

Comparing the output generated by SpeechGen with the damaged speech, our model substantially reconstructs the spoken words, reducing WER from 41.68% to 28.61% and CER from 25.10% to 10.75%, as shown in the table below. This means the proposed method can significantly improve speech reconstruction, ultimately improving the accuracy and intelligibility of the speech output.
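
WER and CER are computed on transcripts of the damaged and repaired speech; here is a small sketch using the `jiwer` package (the ASR system and text normalization used in the paper are not reproduced, and the strings below are made-up examples):

```python
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"   # transcript of repaired speech

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")
```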


The figure below shows an example: the upper panel is the damaged speech and the lower panel is the speech produced by SpeechGen. As can be seen, SpeechGen repairs the damaged speech well.


Speech continuation

We demonstrate the speech continuation task on LJSpeech. During prompt training, the model is shown only a seed segment of each utterance, whose proportion of the total speech length we call the condition ratio (r), and is asked to continue generating the subsequent speech.
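
A minimal sketch of how a seed segment could be carved out with a condition ratio r over a unit sequence; the function name and the default ratio are illustrative, not the paper's exact setup.

```python
def split_seed(units, condition_ratio=0.25):
    """Keep the first `condition_ratio` fraction of the unit sequence as the seed
    the model is shown; the remainder is what it is asked to continue."""
    cut = max(1, int(len(units) * condition_ratio))
    return units[:cut], units[cut:]

seed, continuation_target = split_seed(list(range(100)), condition_ratio=0.25)
print(len(seed), len(continuation_target))   # 25 75
```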

Below are some examples: black text represents the seed segment and red text is the continuation generated by SpeechGen (the text here is obtained by speech recognition; during training and inference the model performs purely speech-to-speech tasks and receives no text information at all). Different condition ratios let SpeechGen generate continuations of different lengths that remain coherent and complete the sentence. Qualitatively, the generated sentences are grammatically consistent with the seed segment and semantically related, although the generated speech still does not perfectly convey a complete meaning. We expect this problem to be addressed by more powerful speech language models in the future.


Limitations and future directions

Speech language models and speech generation are in a period of rapid development, and our framework offers a way to skillfully leverage powerful language models for speech generation. Nevertheless, the framework is not yet complete, and many issues deserve deeper study.

1. Compared to text-based language models, speech language models are still at an early stage of development. Although the prompt framework we propose can drive speech language models to perform speech generation tasks, it does not yet achieve outstanding performance. However, as speech language models advance, for example the large leap from GSLM to Unit mBART, prompt performance improves markedly: tasks that were previously challenging for GSLM now perform better under Unit mBART. We expect even more advanced speech language models to emerge in the future.

2. Beyond content information: Current speech language models do not fully capture speaker and emotion information, which makes it difficult for the current speech prompt framework to handle such information effectively. To overcome this limitation, we introduce plug-and-play modules that specifically inject speaker and emotion information into the framework. Looking ahead, we expect future speech language models to integrate and exploit information beyond content, to improve performance and better handle the speaker- and emotion-related aspects of speech generation tasks.

3. Possibilities for prompt generation: For prompt generation we have flexible options and can integrate various types of instructions, including text and image instructions. Imagine training a neural network that takes images or text as input, instead of using trained embeddings as prompts as in this work; the trained network would become a prompt generator and add variety to the framework. This would make prompt generation far more interesting and flexible.

Conclusion

In this article, we explored using prompts to unlock the capabilities of speech language models on various generation tasks. We propose a unified framework called SpeechGen with only about 10M trainable parameters. The framework has several desirable properties: it is textless, versatile, efficient, transferable, and affordable. To demonstrate its capabilities, we used Unit mBART as a case study and experimented on three different speech generation tasks: speech translation, speech inpainting, and speech continuation.

Around the time the paper was submitted to arXiv, Google proposed a more advanced speech language model, SPECTRON, which shows the possibility of speech language models modeling information such as speaker and emotion. This is certainly exciting news, and our unified framework has great potential as advanced speech language models continue to be proposed.
