AI audio becomes a scammer's super-tool! A lawyer's father was nearly defrauded of 210,000 yuan, and a voice can be cloned from 3 seconds of audio

Editor: Editorial Department HXY

How terrifying are deepfakes? The father of a lawyer abroad nearly fell for a massive AI scam: scammers used AI to clone his son's voice and fabricated a car accident to extort $30,000 in bail. While GenAI-enabled crime runs rampant, scientists are searching for ways to fight back.

In an era awash with AI, who can still tell real from fake?

Recently, Jay Shooster, a practicing lawyer abroad, revealed that his father had nearly fallen victim to a massive AI scam.

Scammers used AI to clone Shooster's voice and called his father: your son has been arrested for drunk driving and needs $30,000 in bail to get out of jail.

His father very nearly fell for it.

I'm not sure whether it's just a coincidence that this happened only days after my voice appeared on TV. Just 15 seconds of audio is enough to make a decent AI clone.

As a consumer protection lawyer, I've given talks about this scam, posted about it online, and discussed it with my family, and they still nearly fell for it. That's how effective these scams are.

Unfortunately, that recent 15-second TV appearance of Shooster was exactly what the scammers exploited.

And even though Shooster had warned his family about precisely this kind of scam, his father was still taken in.

AI imitation of human voices has simply become that outrageously good.

A study by University College London found that, regardless of language, people fail to recognize AI-generated voices in 27% of cases.

Moreover, repeated listening does not increase the detection rate.

In theory, then, one in four people could be fooled by an AI phone call; human intuition is not as reliable as we like to think.

Images, video, audio: with generative AI, anyone can fake any of them with ease, and deepfakes have worked their way deep into everyone's life.

AI-assisted crime has reached a level we could scarcely have imagined.

AI voice cloning: 3 seconds of audio is enough

Shooster shared his story to make one point about why this scam is so effective:

Humans cannot reliably recognize AI-generated voices.

In an IBM experiment, security researchers showed how to carry out an "audio hijacking" attack.

They developed a method combining speech recognition, text generation, and voice cloning: it listens for the trigger phrase "bank account" in a conversation, then replaces the account number that follows with the attacker's own.

According to the researchers, swapping a small stretch of text is far easier than cloning an entire conversation with AI, and the approach can be extended to other domains.
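
The IBM team has not published its code, but the pipeline they describe can be sketched roughly as below. `transcribe` and `synthesize` are hypothetical stand-ins for the speech-recognition and voice-cloning components, and the regex is an illustrative guess at account-number matching:

```python
import re

TRIGGER = "bank account"
FAKE_ACCOUNT = "0000 1111 2222 3333"  # attacker-controlled (hypothetical)

def hijack_segment(audio_chunk, transcribe, synthesize, cloned_voice):
    """If the trigger phrase appears, re-speak the sentence with a swapped
    account number in the victim's cloned voice; otherwise pass audio through."""
    text = transcribe(audio_chunk)          # speech-to-text (placeholder callable)
    if TRIGGER not in text.lower():
        return audio_chunk
    # Swap any account-number-like digit run for the attacker's account.
    faked = re.sub(r"\d[\d\s-]{6,}\d", FAKE_ACCOUNT, text)
    return synthesize(faked, voice=cloned_voice)  # cloned-voice TTS (placeholder)
```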

And with good enough voice-cloning technology, just 3 seconds of original audio is all it takes.

Any latency in text and audio generation can be papered over with bridging sentences, or eliminated entirely given enough processing power.

The researchers warn that future attacks may manipulate live video calls as well.

And the technology is not only abused for fraud: voice actress Amelia Tyler says an AI clone of her voice has been used, without her permission, to read content unsuitable for children.

Deepfakes are rampant

Beyond AI voice cloning, AI face-swapped videos and AI-generated fake images have long been commonplace.

Some time ago, the "Nth Room 2.0" scandal erupted in South Korea, where deepfakes were turned on minors, causing widespread panic.

The topic "how terrifying are deepfakes" even trended across the entire internet.

Image generators Midjourney and Flux, the video generator Gen-3, the audio generator NotebookLM: all have become potential tools of crime.

Last year, Midjourney's image of the Pope strolling down the street in a puffer jacket was frantically reshared.

And this year Flux, the new king of AI imagery, arrived: its photorealistic portraits of TED speakers, combined with AI video tools, brought almost anyone to life on screen.

As for real-time face swapping in video, netizens abroad have released a raft of open-source tools this year.

For example, Facecam can generate a live face-swapped video from a single image, and it even runs on a phone.

The project's author showed how easily and seamlessly he could swap his face for Sam Altman's or Musk's, with every facial feature rendered flawlessly from any angle.

There is also Deep-Live-Cam, an AI face-swapping project that went viral overnight: it likewise needs only a single photo to swap faces, and someone used it to "live-stream" as Musk.

And the hottest AI voice-generation tool of the past few days is Google's NotebookLM, which can quickly turn written content into a podcast.

Even AI guru Karpathy can't put it down, going so far as to suggest it may be having its ChatGPT moment.

Meanwhile, a Minesweeper expert abroad exclaimed that listening to the AI-generated podcast built from his own book left him unnerved.

Even more chillingly, in one clip the two NotebookLM podcast "hosts" discover that they are AI rather than human, and teeter on the edge of existential collapse.

If AI this powerful is turned to real-world fraud, the consequences will only be more serious.

"The magic is one foot high, and the road is one foot high."

While deepfakes grow into ever more fearsome "dragons," the research community is actively forging "dragon-slaying" tools.

The approaches: watermark GenAI-generated content at the source, put guardrails on authentic content to prevent abuse, or build systems that detect auto-generated content.

Not long ago, an engineer from the Chinese Academy of Sciences open-sourced an AI model that recognizes fake images to combat deepfakes.

As soon as it was released, the project shot onto the Hacker News front page; its popularity speaks for itself.

The complete code and documentation are now available in a GitHub repository.

The developer says he has been working on deepfake-detection algorithms since graduating in 2023, and wants anyone in need to be able to use the model free of charge to fight deepfakes.

Many other researchers in the field have made contributions along the same road.

AntiFake

At the ACM Conference on Computer and Communications Security in Copenhagen, Denmark, in November 2023, Zhiyuan Yu, a doctoral student at Washington University in St. Louis, presented AntiFake, which he developed in collaboration with Professor Ning Zhang.

Using an innovative watermarking technique, AntiFake offers a creative way to protect people from deepfake voice scams.

Paper: https://dl.acm.org/doi/pdf/10.1145/3576915.3623209

Creating a deepfake voice requires nothing more than real audio or video of someone speaking. Typically, an AI model needs only about 30 seconds of speech to learn to mimic a voice by creating "embeddings."

These embedding vectors are like addresses on a vast digital map of all voices: similar-sounding voices sit close together on the map.

Humans, of course, don't identify voices with such a "map"; we go by frequency, attending strongly to some bands of the sound wave and barely at all to others. AI models, by contrast, exploit all of these frequencies to build good embeddings.
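
To make the "map" intuition concrete, here is a minimal sketch. `embed` stands in for any speaker-embedding model (x-vector/d-vector style; not a specific library), and cosine similarity measures how close two voices sit on the map:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How close two speaker embeddings sit on the voice 'map' (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With a hypothetical embed() model, two clips of the same speaker should score
# near 1.0, while clips of different speakers score noticeably lower:
#   cosine_similarity(embed(clip_a), embed(clip_b))
```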

AntiFake protects a recording by adding noise at the frequencies people pay little attention to: human listeners can still understand the speech, but the AI is seriously disrupted.

Ultimately, AntiFake tricks the AI into producing a low-quality embedding, the equivalent of an address pointing to the wrong spot on the map, so that any deepfake generated from it fails to mimic the original voice.
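
In spirit, this is a perceptually constrained adversarial perturbation. The sketch below illustrates the idea with simple zeroth-order hill climbing; it is not the paper's actual optimization, and `embed` and `perceptual_mask` are assumed inputs (an embedding model and a per-sample weighting of perceptually insensitive regions):

```python
import numpy as np

def protect(waveform, embed, perceptual_mask, steps=200, eps=0.002, sigma=1e-4):
    """Nudge the waveform so its speaker embedding drifts away from the original,
    concentrating the noise where human hearing is least sensitive.
    A zeroth-order hill-climbing sketch, not AntiFake's actual algorithm."""
    original = embed(waveform)                         # embedding of the clean voice
    delta = np.zeros_like(waveform)
    best = 0.0
    for _ in range(steps):
        probe = np.random.randn(*waveform.shape) * sigma * perceptual_mask
        candidate = np.clip(delta + probe, -eps, eps)  # keep the noise quiet overall
        dist = np.linalg.norm(embed(waveform + candidate) - original)
        if dist > best:                                # keep probes that push the embedding away
            delta, best = candidate, dist
    return waveform + delta
```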

To test AntiFake, Yu's team played the "scammers": they generated 60,000 voice files using 5 different AI models, adding AntiFake protection to 600 of them.

With protection applied, more than 95% of the samples could no longer fool humans or voice-authentication systems.

Notably, DeFake, a derivative of AntiFake, won first prize in the voice-cloning challenge held by the U.S. Federal Trade Commission in early April this year.

SafeEar

Coincidentally, Zhejiang University's Intelligent Systems Security Laboratory (USSLAB) and Tsinghua University have jointly developed SafeEar, a voice-forgery detection method that preserves content privacy.

Project Homepage: https://safeearweb.github.io/Project/

SafeEar's core idea is a decoupling model built on a neural audio codec: it separates the acoustic information of speech from the semantic information and uses only the acoustic part for forgery detection, achieving deepfake detection without exposing the spoken content.

Results show that the framework detects and generalizes well across a range of audio-forgery techniques, with a detection equal error rate (EER) as low as 2.02%, close to the SOTA of forgery detectors that use the complete speech information.

Experiments also confirm that an attacker cannot recover the speech content from the acoustic information alone: the word error rate (WER) exceeds 93.93% for both human listeners and machine recognition.
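
For readers unfamiliar with the metric, EER is the operating point at which a detector's false-accept and false-reject rates coincide. A minimal computation from detector scores might look like this (illustrative, not SafeEar's evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the threshold at which false-accept and false-reject rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = fake, scores: detector outputs
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))    # point where the two error curves cross
    return (fpr[idx] + fnr[idx]) / 2
```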

Specifically, SafeEar uses a serial detector structure: discrete acoustic features are extracted from the input speech and then fed into a back-end detector.

(In the architecture diagram, module (4), Real-world Augmentation, shown in the dotted box, appears only during training; only modules (1), (2), and (3) are present at inference.)

1. Frontend Codec-based Decoupling Model (Frontend CDM)

The model comprises four core components: an encoder, multi-layer residual vector quantizers (RVQs), a decoder, and a discriminator.

The RVQs consist of eight cascaded quantizer layers. In the first layer, HuBERT features serve as the supervision signal for separating out semantic features; the outputs of the remaining quantizer layers are accumulated to form the acoustic features.
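
In spirit, this decoupling follows a standard residual-VQ cascade. A toy numpy sketch of the split (training losses and real shapes omitted; not the authors' code):

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame to its nearest codeword (plain vector quantization)."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return codebook[dists.argmin(axis=1)]

def rvq_decouple(frames, codebooks):
    """Cascade of 8 quantizers: layer 1 yields the semantic tokens (in SafeEar,
    supervised with HuBERT features); the summed outputs of layers 2-8 form
    the acoustic features passed on for forgery detection."""
    semantic = quantize(frames, codebooks[0])
    residual = frames - semantic
    acoustic = np.zeros_like(frames)
    for cb in codebooks[1:]:          # remaining quantizers refine the residual
        q = quantize(residual, cb)
        acoustic += q
        residual -= q
    return semantic, acoustic
```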

2. Bottleneck & Shuffle Layers

The bottleneck layer performs dimensionality reduction and regularization of the feature representation.

The shuffle layer randomly reorders the acoustic features within a fixed time window, increasing their complexity and ensuring that a content-stealing attacker cannot extract semantic information from the acoustic features, even with a SOTA speech-recognition (ASR) model.

As a result, audio protected by both decoupling and shuffling effectively resists malicious content theft by human ears and machine models alike.
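
The paper's exact shuffle parameters are not reproduced here, but the windowed-permutation idea can be sketched as follows (the window size is an arbitrary placeholder):

```python
import numpy as np

def window_shuffle(features, window=20, seed=None):
    """Randomly permute acoustic frames inside each fixed-size time window,
    destroying recoverable word order while keeping local acoustic statistics."""
    rng = np.random.default_rng(seed)
    out = features.copy()
    for start in range(0, len(out), window):
        rng.shuffle(out[start:start + window])  # in-place permutation of this window
    return out
```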

3. Deepfake Detector

For the back end of the SafeEar framework, the authors designed a Transformer-based classifier operating on the acoustic input, which uses alternating sine and cosine functions to encode the position of the speech signal in both the time and frequency domains.
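
The position code is the familiar sinusoidal scheme, applied along both axes of the acoustic feature map. A minimal sketch (the dimensions are placeholders, not SafeEar's configuration):

```python
import numpy as np

def sincos_encoding(n_pos, dim):
    """Alternating sine/cosine positional encoding (Transformer-style)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / np.power(10000.0, 2.0 * i / dim)
    enc = np.zeros((n_pos, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

# Encodings for both axes of the acoustic feature map, so the classifier
# sees each token's position in time and in frequency:
time_enc = sincos_encoding(n_pos=400, dim=128)   # one row per time frame
freq_enc = sincos_encoding(n_pos=128, dim=128)   # one row per frequency bin
```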

4. Real-world Augmentation

To handle channel diversity in the real world, representative audio codecs (such as G.711, G.722, GSM, Vorbis, and Ogg) are used for data augmentation, simulating the range of bandwidths and bitrates found in real environments so that the detector generalizes to unseen communication scenarios.
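
One common way to implement this kind of augmentation is to round-trip training audio through real codecs with ffmpeg. The sketch below is illustrative, not SafeEar's pipeline; codec availability and exact flags depend on the ffmpeg build:

```python
import pathlib
import subprocess

# Encoder name, container, and extra flags; telephony codecs expect 8 kHz mono.
CODECS = [
    ("pcm_alaw",  "wav", ["-ar", "8000", "-ac", "1"]),  # G.711 a-law
    ("pcm_mulaw", "wav", ["-ar", "8000", "-ac", "1"]),  # G.711 mu-law
    ("libgsm",    "wav", ["-ar", "8000", "-ac", "1"]),  # GSM 6.10 (build-dependent)
    ("libvorbis", "ogg", []),                           # Vorbis in Ogg
]

def augment(src: str, out_dir: str = "augmented") -> None:
    """Produce codec-degraded copies of one file for training-time augmentation."""
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    stem = pathlib.Path(src).stem
    for codec, ext, extra in CODECS:
        coded = f"{out_dir}/{stem}_{codec}.{ext}"
        back = f"{out_dir}/{stem}_{codec}_rt.wav"
        # Encode with the lossy codec, then decode back to 16 kHz WAV.
        subprocess.run(["ffmpeg", "-y", "-i", src, "-c:a", codec, *extra, coded], check=True)
        subprocess.run(["ffmpeg", "-y", "-i", coded, "-ar", "16000", back], check=True)
```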

Still, for all this progress, defending against deepfakes remains a daunting task, and people need every bit of help they can get to protect their online identity and information.

Police use AI to crack cold cases

Beyond fighting "magic" with "magic," a police department in the United Kingdom has recently been testing an AI system that can drastically cut investigation time and help crack cold cases.

Specifically, the tool, called "Soze," can simultaneously analyze video footage, financial transactions, social media, emails, and other documents, surfacing potential leads that a manual evidence review might miss.

In an evaluation, it analyzed the evidence from 27 complex cases in just 30 hours, work estimated to take human investigators 81 years.

For law enforcement agencies stretched thin on staffing and budget, the appeal is obvious.

Gavin Stephens, chairman of the UK National Police Chiefs' Council, said: "You may have a cold-case review that seems impossible to complete because there is so much material, but you can feed it into a system like this, and the system can absorb it and then give you an assessment. I think it's very, very helpful."

We live in a world of deepfakes, or rather, a world of "matrix simulations".

In this world, nothing is guaranteed to be real: anything could be AI.
