Preface
Speech synthesis is a technology that converts text into speech; it has been widely applied in many fields, such as radio and television and network audiovisual services.
In traditional speech synthesis, human speech first had to be recorded and then processed into synthetic speech by computer algorithms.
With the development of artificial intelligence, speech synthesis technology has also advanced rapidly, and its application scenarios continue to broaden.
This paper introduces the development history, current research status, and development trends of AI speech synthesis technology abroad.
It then draws on the latest results of mainland China's research on artificial intelligence and media convergence to discuss the application of speech synthesis in radio and television and network audiovisual services, and looks ahead to future developments.
First, the development history of speech synthesis technology
(1) History of development abroad
The development of speech synthesis technology abroad can be traced back to the early 1960s, when some universities in the United States began to study how to use computers to synthesize artificial speech.
Early speech synthesis techniques were mainly based on rules and rule sets: the computer converted text into speech according to preset rules. This approach required extensive human intervention, and the synthesized speech was far from ideal.
With the continuous improvement of computer processing speed and storage capacity, speech synthesis technology has also developed rapidly.
In the 1990s, a speech synthesis method based on statistical parameters was proposed, which established the three essential modules of speech synthesis: the language model, the acoustic model, and the vocoder, as shown in Figure 1.
The task of the language model is to use natural language processing techniques to extract linguistic features from the input text; these features carry the linguistic information required by the back-end acoustic model.
The acoustic model converts the linguistic features into acoustic features, and a separate vocoder then converts the acoustic features into the final speech waveform.
Figure 1: Basic speech synthesis architecture
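To make this division of labor concrete, the three modules can be sketched as three Python functions. This is a toy illustration only: the "features" below (letters as linguistic features, one sine tone per letter) are invented for demonstration and do not correspond to any real system.

```python
import numpy as np

# Toy, runnable illustration of the three-module pipeline. Only the
# module boundaries matter; the mapping rules have no linguistic validity.

SR = 16000     # sample rate in Hz
FRAME = 0.05   # seconds of audio produced per "frame"

def language_model(text):
    """Front-end: text -> linguistic features (here, just its letters)."""
    return [c for c in text.lower() if c.isalpha()]

def acoustic_model(features):
    """Linguistic features -> acoustic features (here, one pitch per letter)."""
    return [100.0 + 10.0 * (ord(c) - ord("a")) for c in features]

def vocoder(pitches):
    """Acoustic features -> waveform (here, concatenated sine tones)."""
    t = np.arange(int(SR * FRAME)) / SR
    return np.concatenate([np.sin(2 * np.pi * f0 * t) for f0 in pitches])

waveform = vocoder(acoustic_model(language_model("Hello")))
print(waveform.shape)   # (4000,): 5 letters x 800 samples per frame
```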
With the development of AI deep learning technology, speech synthesis took a leap forward; the landmark representative is the Tacotron model proposed by Google in 2017.
As shown in Figure 2, Tacotron is an end-to-end speech synthesis model built around an attention mechanism: the input is text, a text encoder produces a robust contextual text representation, and an attention-based autoregressive decoder on the decoder side outputs N frames of mel spectrogram features at each step.
Figure 2: Google's Tacotron framework
So-called autoregressive decoding means that the N frames output at one step become the input to the next step; this process repeats until a complete mel spectrogram has been produced.
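The feedback loop can be sketched as follows; `decoder_step` is a dummy stand-in for Tacotron's attention-based decoder cell and simply emits random frames, since only the autoregressive wiring matters here.

```python
import numpy as np

# Sketch of Tacotron-style autoregressive decoding: the N (here N=3)
# frames produced at step t are fed back as the input at step t+1.

N_MELS, N = 80, 3   # mel bins per frame, frames emitted per decoder step

def decoder_step(prev_frames, rng):
    """Hypothetical decoder cell: previous N frames -> next N frames."""
    return rng.standard_normal((N, N_MELS))

rng = np.random.default_rng(0)
frames, prev = [], np.zeros((N, N_MELS))   # all-zero "go" frames start decoding
for _ in range(10):                        # a real model stops via a stop token
    prev = decoder_step(prev, rng)         # this output is the next input
    frames.append(prev)

mel = np.concatenate(frames)               # the complete mel spectrogram
print(mel.shape)                           # (30, 80): 10 steps x 3 frames
```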
The mel spectrogram is then converted to a linear spectrogram by Tacotron's final post-processing module, built from highway convolution layers, and the Griffin-Lim algorithm reconstructs the speech waveform from that linear spectrogram.
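The Griffin-Lim step is easy to try in isolation: librosa ships an implementation, and the sketch below round-trips a bundled example recording through a magnitude-only STFT. In Tacotron the magnitude spectrogram would come from the model rather than from a real recording.

```python
import numpy as np
import librosa

# Round-trip demonstration of Griffin-Lim: keep only the magnitude of an
# STFT (discarding phase, as a synthesis model does) and let Griffin-Lim
# iteratively estimate a phase so a waveform can be reconstructed.

y, sr = librosa.load(librosa.example("trumpet"))         # any mono recording
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # magnitude spectrogram
y_rec = librosa.griffinlim(S, n_iter=32, n_fft=1024, hop_length=256)
print(len(y), len(y_rec))   # reconstructed length is close to the original
```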
Subsequently, Tacotron 2, the second-generation model proposed by Google in 2018, replaced the first generation's highway convolution module with a 3-layer long short-term memory (LSTM) module and swapped the vocoder from the Griffin-Lim algorithm to the deep learning-based WaveNet algorithm. It is worth noting that in subjective evaluations, the model's synthesis quality reached the point of passing for real speech.
The Tacotron models can generate high-quality speech, but owing to their autoregressive generative structure, neither training speed nor inference speed is ideal.
Therefore, in 2018, TransformerTTS, proposed by the University of Electronic Science and Technology of China and Microsoft Research Asia, used the self-attention-based Transformer to replace the original recurrent networks and content-based attention mechanism; this parallelizes training, although generation itself remains autoregressive.
Subsequently, the FastSpeech 1 and FastSpeech 2 architectures, proposed by Zhejiang University and Microsoft Research Asia in 2019 and 2020 respectively, achieved end-to-end non-autoregressive generation. This not only improves inference speed: their duration predictor, pitch predictor, and energy predictor permit fine-grained control over the duration, pitch, and energy of the output speech, while also mitigating the word-skipping and word-repetition errors that occur in Tacotron 2.
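The heart of this non-autoregressive design is FastSpeech's length regulator: the predicted per-phoneme durations expand the phoneme-level hidden states to frame level, after which all mel frames can be generated in parallel. A minimal PyTorch sketch, with made-up dimensions and durations:

```python
import torch

# Sketch of FastSpeech's length regulator: each phoneme's hidden state is
# repeated according to its predicted duration (in mel frames), so the whole
# spectrogram can then be generated in parallel instead of frame by frame.

hidden = torch.randn(4, 256)             # 4 phonemes, 256-dim hidden states
durations = torch.tensor([3, 5, 2, 6])   # predicted frames per phoneme

expanded = torch.repeat_interleave(hidden, durations, dim=0)
print(expanded.shape)   # torch.Size([16, 256]) -- 3+5+2+6 = 16 frames
```

Because every phoneme is explicitly allotted a number of frames, none can be skipped or read twice, which is why the attention-related word errors of Tacotron 2 largely disappear.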
The VITS model, proposed in 2021, is a high-performance speech synthesis model that combines variational inference, normalizing flows, and adversarial training; most of the speech synthesizers used on major self-media platforms are built on this model.
Traditional models, including the Tacotron and FastSpeech families described above, map text, that is, characters or phonemes, to intermediate speech features such as mel spectrograms during inference, and usually require a vocoder to convert the predicted mel spectrogram into a speech waveform.
VITS, by contrast, is the first truly end-to-end speech synthesis model: it maps characters or phonemes directly to waveforms, with no additional vocoder needed to reconstruct them.
By connecting the acoustic model and the vocoder in series and training them jointly, instead of handing a spectrogram between separately trained models, this synthesis method also improves the diversity of the synthesized speech.
Figure 3: VITS model inference framework
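The contrast between the two-stage pipeline and VITS's end-to-end design can be summarized with two hypothetical inference interfaces; every name here is a placeholder rather than any library's actual API.

```python
# Hypothetical interfaces contrasting the two inference styles.

def two_stage_tts(phonemes, acoustic_model, vocoder):
    """Tacotron/FastSpeech style: text -> mel spectrogram -> waveform,
    with the acoustic model and the vocoder trained separately."""
    mel = acoustic_model(phonemes)   # intermediate mel spectrogram is exposed
    return vocoder(mel)              # a separately trained vocoder finishes

def end_to_end_tts(phonemes, vits_model):
    """VITS style: one jointly trained network maps phonemes directly to
    waveform samples; no intermediate spectrogram is handed between models."""
    return vits_model(phonemes)

# Dummy stand-ins, just to show the call pattern.
dummy_acoustic = lambda p: [[0.0] * 80 for _ in p]    # one mel frame per phoneme
dummy_vocoder = lambda mel: [0.0] * (len(mel) * 256)  # 256 samples per frame
print(len(two_stage_tts("helo", dummy_acoustic, dummy_vocoder)))  # 1024
```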
(2) History of development in mainland China
The development of AI speech synthesis in China can be traced back to the early 1990s, when the Natural Language Processing Laboratory of Tsinghua University took the lead in beginning research on speech synthesis.
Early speech synthesis systems were mainly based on template matching and concatenation techniques; although the results were limited, these systems could already deliver basic speech synthesis functions.
After entering the 21st century, with the development of deep learning technology, speech synthesis technology has developed rapidly.
In 2010, iFLYTEK successfully developed the first deep learning-based speech synthesis system, "iFLYTEK speech synthesis technology". The technology uses a deep neural network model and achieves more natural and fluent synthesized speech.
Since then, iFLYTEK has made major breakthroughs in the field of speech synthesis, and has successively launched multiple systems such as "iFLYTEK Intelligent Speech Synthesis System" and "iFLYTEK Hybrid Speech Synthesis System".
Baidu, another Internet giant, has also continued to increase its R&D investment in speech synthesis. In 2017, Baidu released DeepVoice, its first deep learning-based speech synthesis system.
The system uses a neural network model to perform speech synthesis and offers high naturalness and emotional expressiveness. In 2019, Baidu further launched "Baidu Super Speech Synthesis Technology", which can generate highly personalized speech and greatly improves the user experience.
In 2020, the Alibaba Natural Language Processing Lab proposed the "Meta-VoiceGAN" model, which uses a generative adversarial network (GAN)-based method to achieve high-fidelity speech synthesis by learning the mapping between speech features and speech signals.
In 2021, JD AI Lab released "JD Streaming Speech Synthesis Technology", which uses a Transformer-based neural network model and combines techniques such as pre-training and fine-tuning to achieve more natural and fluent synthesized speech, with high adaptability and flexibility.
At present, more and more research institutions in mainland China are investing heavily in the development of AI speech synthesis technology, and the space for future technical development and application is extremely broad.
Second, the application of speech synthesis technology
(1) Applications in the field of radio and television
Radio and television is an important application area for speech synthesis technology. With the continuous development of digital technology, the radio and television industry has come to rely more and more on automated production processes and digital tools.
The application of speech synthesis technology in the field of radio and television mainly involves news broadcasting, program dubbing, advertising and other aspects.
1. News broadcasting
News broadcasting is one of the most basic and important activities in radio and television. Traditional news broadcasts require manual voice recording, which is time-consuming and inefficient.
Moreover, because the result of manual recording depends heavily on the anchor's vocal quality, traditional news broadcasts face certain limitations in voice quality.
Speech synthesis technology can automatically generate speech from a given text, reducing labor costs and improving broadcast efficiency while producing natural, realistic speech.
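As a simplified, concrete example, an off-the-shelf engine such as pyttsx3 (chosen here purely for illustration; a broadcast system would use a far higher-quality neural model) can turn a news script into an audio file in a few lines:

```python
import pyttsx3

# Minimal illustration of text-to-speech for a news script. The script
# text and output file name are placeholders.

script = "Good evening. Here is tonight's news summary."

engine = pyttsx3.init()
engine.setProperty("rate", 160)               # speaking rate, words per minute
engine.save_to_file(script, "bulletin.wav")   # queue rendering to an audio file
engine.runAndWait()                           # block until synthesis finishes
```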
2. Program dubbing
Program dubbing is an important aspect of the application of speech synthesis technology in the field of radio and television. With the increasing richness of radio and television entertainment content, dubbing has gradually become an indispensable part of the broadcast and television industry.
Traditional dubbing requires manual recording, and requires the voice talent to have certain vocal characteristics and performance skills.
Speech synthesis technology can produce high-quality dubbing by adjusting characteristics such as pitch and speech rate, and these characteristics can even be tailored to different characters to improve the dubbing effect.
In dubbing, therefore, speech synthesis can improve efficiency and reduce production costs while producing more natural and realistic results that better hold the audience's attention.
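As a hedged sketch of such character-specific adjustment, the pitch and speaking rate of an already-synthesized track can be modified in post-processing with librosa; the input file name and adjustment amounts below are illustrative assumptions.

```python
import librosa
import soundfile as sf

# Post-processing a synthesized dub track for a specific character:
# shift the pitch and change the speaking rate independently.

y, sr = librosa.load("dub_track.wav", sr=None)               # synthesized track
y_shift = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)   # up 4 semitones
y_fast = librosa.effects.time_stretch(y_shift, rate=1.1)     # 10% faster
sf.write("dub_track_character.wav", y_fast, sr)
```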
3. Advertising
Advertising is an important application scenario in radio and television. Traditional advertising requires considerable time and labor to produce, and professional voice actors must be hired to record the audio.
Speech synthesis technology can automatically generate speech based on specific text, which greatly shortens the time of advertising production and reduces production costs.
Therefore, the application of AI speech synthesis technology in advertising production can improve production efficiency and quality, so as to better meet the needs of advertisers.
(2) Applications in the field of network audiovisual
1. Video dubbing
In the field of network audiovisual, video dubbing is a very important link. Traditional dubbing requires substantial manpower and material resources, and it can be affected by various factors, such as the quality of the recording equipment and the accents of the voice actors.
These factors can make dubbing quality unstable, which in turn affects the viewing experience. Speech synthesis technology can help solve these problems: it converts text into speech, realizing automatic dubbing.
This not only saves costs and increases efficiency, but also yields a more natural, realistic voice track. In online video, speech synthesis can be applied to many types of content, such as short videos, micro-films, and educational videos.
Through speech synthesis technology, the voice of the video can be made more natural, thereby improving the viewing experience of the audience.
2. Voice interaction
Voice interaction is a form of human-computer interaction and one of the important applications in the field of network audio-visual. Voice interaction technology can enable machines to produce natural and smooth speech, thereby improving the user's interactive experience.
At present, voice interaction technology has been widely used in smart home, intelligent customer service, intelligent navigation and other fields. Through speech synthesis technology, machines can produce more humanized and natural speech, thereby improving the interaction between users and machines.
In smart homes, voice technology enables devices to understand users' spoken instructions and respond with natural synthesized speech, realizing automatic control of the home.
In terms of intelligent customer service, speech synthesis technology can make human-computer interaction more convenient for users, thereby improving user satisfaction. In terms of intelligent navigation, speech synthesis technology can provide users with a more convenient navigation experience, and at the same time, it can avoid users' distraction during driving.
Third, the prospects of speech synthesis technology
(1) Technical aspects
The future development of speech synthesis technology will mainly rely on the continuous development of deep learning and neural network technology. With the continuous upgrading of hardware equipment and the continuous optimization of algorithms, the quality and naturalness of speech synthesis technology will continue to improve.
Current speech synthesis technology already achieves realistic synthesis, but some shortcomings remain; for example, the prosody and rhythm of the speech are still not natural enough.
Future speech synthesis technology will focus on improving these aspects to achieve more realistic synthesis, and it will also place greater emphasis on personalized services and experiences.
With the continued development of artificial intelligence, future speech synthesis systems will be able to personalize synthesis according to users' needs and preferences, and even synthesize any speaker's voice from a few samples or zero samples, providing speech synthesis services that sit closer to user needs.
Future speech synthesis technology will pay more attention to real-time speech synthesis. Real-time speech synthesis can provide users with a more natural and smooth voice interaction experience, and provide a broader application space for the development of voice interaction technology.
(2) Applications
With the development of 5G and artificial intelligence technology, the field of radio and television and network audiovisual will pay more and more attention to the personalized needs and experience of users.
Future speech synthesis technologies will be able to better deliver personalized services and experiences, such as personalized speech synthesis based on the user's needs and interests, thereby increasing user satisfaction and loyalty.
Speech synthesis technology is expected to support more languages and dialects, so as to better meet the needs of users in different countries and regions and achieve cross-cultural communication. In addition, speech synthesis technology will also realize automatic translation and conversion between multiple languages, providing users with more convenient and diversified services.
Future speech synthesis technology will also integrate better with other modalities, such as images, video, and text, to achieve richer and more vivid media expression. For example, in TV news, speech synthesis can be combined with video and text for more vivid and intuitive news broadcasts.
Future speech synthesis technology will be combined with augmented reality technology to achieve a smarter and more convenient user experience. For example, in the field of tourism, users can listen to the narration of a guided tour through smart glasses or mobile phone applications, so as to better understand the history and culture of tourist attractions.
Summary
In summary, AI speech synthesis technology has broad application prospects in the fields of radio and television and network audio-visual. With the continuous development of AI technology, AI speech synthesis technology will become an indispensable part of the field of radio and television and network audio-visual.
In the future, speech synthesis technology will pay more attention to speech quality and naturalness, pay more attention to personalized services and experience, pay more attention to multilingual support and cross-cultural communication, and pay more attention to commercialization and industrialization.