An era, after all, has come to an end.
On November 22, the Shanghai Third Intermediate People's Court held a public trial in the copyright-infringement case against the "Renren Film and Television Subtitle Group" and handed down a first-instance verdict in court.
The defendant, Liang Yongping, was sentenced to three years and six months in prison and fined RMB 1.5 million for copyright infringement.
His illegal gains were ordered recovered, and the personal property he used to commit the crime was confiscated.

Some time ago, the Korean dystopian drama "Squid Game" was a runaway hit, racking up 142 million views in its first month online and topping the charts in 90 countries and regions.
Netflix also offers subtitles and dubbing in up to 13 languages.
However, Korean-American comedian Youngmi Mayer found the official "Squid Game" subtitles so far off the mark as to be unacceptable.
For example, when an actress delivers a Korean line that roughly means "What are you looking at?", Netflix's English subtitles render it as "Go away."
With the rise of streaming platforms such as Netflix, non-English-language works like "Squid Game" are becoming more and more common.
However, the subtitling and dubbing industry is short of talent, especially for less widely spoken languages.
For example, to bring a show to the Spanish market, the English subtitles are usually exported first and the Spanish version is then translated from them.
That is, the quality of the subtitles in some languages depends entirely on the quality of the English translation, and this two-step conversion inevitably loses a great deal of nuance and detail.
According to statistics, more viewers watched the dubbed version of "Squid Game" than the subtitled version.
To this end, everyone from streaming giants like Netflix to small localization providers is exploring whether AI can replace human translators.
So, is AI up to the job?
Let's start with what Deepfake Voice actually is.
Deepfake Voice, also known as voice cloning or synthetic speech, is the commonly used technique for copying or cloning a person's voice: it aims to generate a person's voice using AI.
The technology has now advanced to the point where a human voice can be reproduced with striking accuracy in both tone and timbre.
Voice cloning is the process of using a computer to generate the voice of a real individual, with AI creating a replica of a specific, distinctive voice.
To clone someone's voice, training data must be fed to the AI model. This data usually consists of recordings of the target person speaking.
AI can use this data to render a realistic voice, for example generating a piece of speech from anything you can type, a process called text-to-speech (TTS).
In earlier text-to-speech systems, the training data was the key component controlling the generated speech output. In other words, the voice you hear is the voice captured in the dataset.
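To get a feel for plain TTS, here is a minimal sketch using the pyttsx3 library, which drives the operating system's built-in synthetic voices offline; it illustrates ordinary text-to-speech, not voice cloning:

```python
import pyttsx3

# pyttsx3 talks to the OS's native speech engine (SAPI5, NSSpeechSynthesizer, or eSpeak).
engine = pyttsx3.init()
engine.setProperty("rate", 160)  # speaking speed, in words per minute

engine.say("This sentence was typed, not recorded.")
engine.runAndWait()  # block until the utterance has been spoken
```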
Now, however, the latest AI techniques can analyze and extract deeper features of the target voice, such as properties of the speech waveform itself.
Synthetic voice is another term commonly associated with Deepfake Voice, and it is often used interchangeably with voice cloning.
Simply put, synthetic speech is computer-generated speech, also known as speech synthesis, and it is generally achieved through AI and deep learning.
There are two main ways to synthesize a voice: text-to-speech (TTS) and speech-to-speech (STS).
Text-to-speech has been described above; TTS software has long been used to help visually impaired people read digital text, as well as in applications such as voice assistants.
Speech-to-speech starts not from text but from an existing piece of speech, modifying its voice characteristics to create another, very realistic-sounding piece of synthesized speech.
Speech synthesis used to be unable to produce voices convincing enough to pass for the real thing. With advances in technology, that has changed.
Traditional speech synthesis typically relies on two basic techniques: concatenative synthesis and formant synthesis.
Concatenative synthesis stitches short samples of recorded speech, called units, together into a chain, which is then used to generate the desired sound patterns.
Formant synthesis, by contrast, is most commonly used to replicate the vowel-like sounds people make.
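To make the concatenative idea concrete, here is a toy sketch; the unit files and the sample rate are assumptions for illustration only:

```python
import numpy as np
import soundfile as sf

# Hypothetical unit inventory: one short recorded sample per speech unit,
# all recorded at the same sample rate (assumed 22.05 kHz here).
unit_names = ["heh", "lo"]
units = {name: sf.read(f"units/{name}.wav")[0] for name in unit_names}

# "Stitch" the units into a chain to form the target sound pattern.
word = np.concatenate([units["heh"], units["lo"]])
sf.write("hello.wav", word, 22050)
```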
The downside of these methods is that they occasionally produce sounds no human would actually make. The advent of deep learning and artificial intelligence, however, has taken TTS technology to new heights.
Often referred to as neural text-to-speech, AI text-to-speech uses neural networks and machine learning techniques to synthesize speech output from text.
First, the speech engine takes in audio and recognizes the sound waves produced by a human voice.
That information is then converted into linguistic data, a step known as automatic speech recognition (ASR). The engine must then analyze this data to understand the meaning of the words it has collected, which is where natural language processing (NLP) comes in.
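As a sketch of just the ASR step, the snippet below uses the SpeechRecognition library; the filename is a placeholder, and the NLP analysis would then operate on the returned text:

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("clip.wav") as source:  # "clip.wav" is a placeholder recording
    audio = recognizer.record(source)     # capture the sound waves from the file

# The ASR step: convert audio into linguistic data (text),
# here via Google's free web recognizer (requires internet access).
print(recognizer.recognize_google(audio))
```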
Finding training data is the first prerequisite for synthesizing a voice. Without clean recordings, there is no way to train an AI model that captures all the intricate details of a person's speech.
The recording process can take many hours, and a voice-solutions team will provide a comprehensive list of phrases designed to capture every characteristic of a person's voice.
Usually this list runs to no more than 4,000 phrases, but the real goal is to capture as much data about someone's unique voice as possible: the more data captured, the more accurate the clone.
Next, the AI models the speech data.
A neural network takes an ordered sequence of phonemes and converts it into a series of spectrograms. A spectrogram is a visual representation of a signal's frequency spectrum over time.
The network selects spectrograms whose frequency bands best characterize the acoustic features the human brain uses to understand speech. A neural vocoder then converts these spectrograms into speech waveforms that sound natural and realistic.
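This text-to-spectrogram-to-waveform pipeline can be tried with NVIDIA's pretrained Tacotron 2 (text to mel spectrogram) and WaveGlow (neural vocoder) models; the sketch below follows their published PyTorch Hub example and assumes a CUDA GPU plus an internet connection to fetch the weights:

```python
import torch
from scipy.io.wavfile import write

# Pretrained models published on NVIDIA's PyTorch Hub.
tacotron2 = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tacotron2", model_math="fp16")
tacotron2 = tacotron2.to("cuda").eval()

waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_waveglow", model_math="fp16")
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to("cuda").eval()

utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tts_utils")
sequences, lengths = utils.prepare_input_sequence(["The clone sounds just like me."])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel spectrogram
    audio = waveglow.infer(mel)                      # neural vocoder: spectrogram -> waveform

write("audio.wav", 22050, audio[0].data.cpu().numpy())  # these models generate 22.05 kHz audio
```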
In October, a project on GitHub quickly racked up 13k stars.
From just 5 seconds of audio, it can use AI to mimic a voice and generate arbitrary speech content, and it supports Chinese.
Judging from the demonstration video the author uploaded, the result sounds remarkably lifelike.
Key features of MockingBird include:
Supports Mandarin, tested on multiple Chinese datasets: aidatatang_200zh, magicdata, aishell3, biaobei, MozillaCommonVoice, etc.
Built on PyTorch; tested with version 1.9.0 on Tesla T4 and GTX 2060 GPUs
Runs on Windows and Linux (the community has also reported success on Apple M1 machines)
Simply download or train a new synthesizer (the synthesizer works well; the pretrained encoder/vocoder can be reused, or real-time HiFi-GAN used as the vocoder)
Provides a web server for viewing training results and making remote calls
Beyond the nanny-level tutorials and training tips the author shares in a Zhihu column, MockingBird is also very simple to use.
Start by installing PyTorch, ffmpeg, webrtcvad-wheels, and the remaining packages listed in requirements.txt.
The second step is to prepare a pretrained model, either one provided by the author or one trained by others.
The key data-processing step is audio and mel-spectrogram preprocessing: python pre.py <datasets_root>, which takes a --dataset {dataset} parameter and supports aidatatang_200zh, magicdata, and aishell3.
The third step is to launch the web program and debug directly in the browser, or launch the more full-featured toolbox application, as sketched below.
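Pulled together, the flow looks roughly like this; the commands follow the repo's README at the time of writing and may have changed, and <datasets_root> stands in for your dataset directory:

```bash
pip install -r requirements.txt                           # after installing PyTorch, ffmpeg, webrtcvad-wheels
python pre.py <datasets_root> --dataset aidatatang_200zh  # preprocess audio and mel spectrograms
python web.py                                             # launch the web program, then debug in the browser
python demo_toolbox.py -d <datasets_root>                 # or launch the full toolbox
```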
The author has also thoughtfully linked all the relevant papers and original code repositories for further study.
The repository's name, MockingBird, refers to the mockingbird, known in Chinese as the "anti-tongue bird" (反舌鸟) for its ability to imitate the calls of other birds, insects, and amphibians; it is also a bird that appears frequently in Western literature, film, and television.
Incidentally, the common Chinese title of the famous novel To Kill a Mockingbird, "To Kill a Robin" (《杀死一只知更鸟》), is actually a mistranslation: the English name for that bird is "robin," not "mockingbird."
Voice fraud enabled by Deepfake Voice is a serious problem.
In 2019, criminals cloned the voice of the CEO of a UK energy company and swindled $240,000, because the fake CEO sounded utterly convincing in both accent and tone. The incident was the first known cybercrime in Europe to make direct use of artificial intelligence.
Another incident occurred in 2020: a bank manager in the United Arab Emirates took a phone call from someone he believed was a company director and, taken in by the voice scam, mistakenly approved a $35 million transfer.
As the technology has evolved, Deepfake Voice scams have grown more sophisticated, and many people may already have run into fake Deepfake Voice audio on social media.
So, how do you prevent Deepfake Voice fraud?
There are two ways.
The first method is to build a detector that analyzes audio and determines whether it was produced with deepfake technology. Unfortunately, because Deepfake Voice technology keeps evolving, detectors cannot stay reliable forever.
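A crude sketch of what such a detector might look like, assuming you have folders of labeled real and fake clips (the filenames here are hypothetical); real detectors use far richer features and models:

```python
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def mfcc_features(path):
    # Average MFCCs over time as a crude, fixed-length fingerprint of a clip.
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

# Hypothetical labeled training clips: 0 = real voice, 1 = synthetic voice.
real_clips, fake_clips = ["real_0.wav"], ["fake_0.wav"]
X = np.stack([mfcc_features(p) for p in real_clips + fake_clips])
y = np.array([0] * len(real_clips) + [1] * len(fake_clips))

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([mfcc_features("suspect.wav")]))  # 1 = flagged as synthetic
```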
The second, more practical method is an audio watermark that listeners cannot hear and that cannot be edited out. An audio watermark is essentially a record of when a piece of audio was created, edited, and used, making it easier to tell whether a sound is synthetic.
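As a toy illustration of the idea, one could stamp a creation record into the samples like this; note that real audio watermarks use robust, psychoacoustically hidden signals that survive re-encoding, whereas this naive least-significant-bit version does not:

```python
import numpy as np
import soundfile as sf

def embed_watermark(in_path, out_path, message: bytes):
    """Toy watermark: hide the message bits in the least significant bit of each sample."""
    samples, rate = sf.read(in_path, dtype="int16")
    bits = np.unpackbits(np.frombuffer(message, dtype=np.uint8))
    flat = samples.flatten()
    flat[: len(bits)] = (flat[: len(bits)] & ~1) | bits  # overwrite the LSBs (inaudible)
    sf.write(out_path, flat.reshape(samples.shape), rate, subtype="PCM_16")

embed_watermark("voice.wav", "voice_marked.wav", b"synthetic:2021-11")
```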
Resources:
https://www.axios.com/artificial-intelligence-voice-dubbing-synthetic-14bfb3c6-99db-4406-920d-91b37d00a99a.html
https://www.businesswire.com/news/home/20210514005132/en/Veritone-Launches-MARVEL.ai-a-Complete-End-to-End-Voice-as-a-Service-Solution-to-Create-and-Monetize-Hyper-Realistic-Synthetic-Voice-Content-at-Commercial-Scale
https://www.veritone.com/blog/combining-conversational-ai-and-synthetic-media/
https://www.veritone.com/blog/everything-you-need-to-know-about-deepfake-voice/
https://www.veritone.com/blog/how-ai-companies-are-tackling-deepfake-voice-fraud/
https://www.veritone.com/blog/how-to-create-a-synthetic-voice/
Special thanks to ifanr
https://www.ifanr.com/1454818