
Microsoft, Google, and Meta are betting on synthetic data to build AI models

Author: Sina Finance

Behind every clever chatbot response is a huge amount of data: in some cases, trillions of words drawn from articles, books, and online comments to teach the AI system to understand a user's query. The conventional industry wisdom is that building the next generation of AI products will require ever more information.

However, there is a big problem with this plan: there is a limit to the high-quality data available on the web. To get it, AI companies typically either pay publishers millions of dollars to license content or download it from websites, exposing themselves to copyright disputes. A growing number of top AI companies are exploring an alternative that divides the industry: synthetic data, which is, in essence, fake.

Here's how it works: tech companies can use their own AI systems to generate text and other media, then use that data to train future versions of those same systems. Anthropic's chief executive, Dario Amodei, has called this a potential "infinite data generation engine." In this way, AI companies can sidestep many legal, ethical, and privacy concerns.
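As a rough illustration of that loop, here is a minimal sketch of such a data engine in Python. It is an assumption-laden illustration, not any company's actual pipeline: `model_generate` stands in for a call to an existing model, and `prompts` for whatever seed instructions drive it.

```python
from typing import Callable, Iterable, Iterator

def synthetic_data_engine(
    model_generate: Callable[[str], str],
    prompts: Iterable[str],
) -> Iterator[str]:
    """Yield an open-ended stream of synthetic training text.

    `model_generate` stands in for a call to a company's existing
    model; `prompts` supplies the seed instructions. The resulting
    stream could then be fed into training the next model version.
    """
    for prompt in prompts:
        yield model_generate(prompt)

# Hypothetical usage: an echo-style stand-in model and two seed prompts.
if __name__ == "__main__":
    fake_model = lambda p: f"[generated text for: {p}]"
    for text in synthetic_data_engine(fake_model, ["prompt A", "prompt B"]):
        print(text)
```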

The idea of synthetic data in computing is not new: the technique has been used for decades in everything from de-identifying personal information to simulating road conditions for autonomous driving. But the rise of generative AI has made it easier to produce higher-quality synthetic data at scale, and has lent new urgency to the practice.

Anthropic said it used synthetic data to build the latest model powering its chatbot, Claude. Meta and Google have used it to develop their recent open-source models. Google DeepMind recently said it relied on the approach to help train a model that can solve Olympiad-level geometry problems. There has been much speculation about whether OpenAI used synthetic data to train its text-to-video generator, Sora. (OpenAI has said it is exploring the use of synthetic data but declined to confirm further details.)

At Microsoft, the generative AI research team used synthetic data in a recent project. They wanted to build a smaller, less resource-intensive AI model that still had strong language and reasoning capabilities. To do so, they tried to mimic the way children learn language by reading stories.

Instead of feeding the AI model a large number of children's books, the team listed 3,000 words that a four-year-old could understand. They then asked the AI model to create a children's story using one noun, one verb, and one adjective from that vocabulary. The researchers repeated the prompt millions of times over several days, generating millions of short stories that ultimately helped train another, more capable language model. Microsoft has open-sourced this new family of "small" language models, Phi-3, making it publicly available.
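To make the procedure concrete, here is a minimal sketch of that prompting loop, assuming a generic text-generation callable. The word lists and prompt wording below are illustrative placeholders, not Microsoft's actual vocabulary or prompts.

```python
import random
from typing import Callable

# Illustrative fragments of a ~3,000-word vocabulary a four-year-old
# might know. These example words are hypothetical, not Microsoft's list.
NOUNS = ["dog", "ball", "tree", "moon", "boat"]
VERBS = ["run", "jump", "find", "hug", "sing"]
ADJECTIVES = ["happy", "tiny", "red", "sleepy", "brave"]

PROMPT_TEMPLATE = (
    "Write a short children's story that a four-year-old could understand. "
    "The story must use the noun '{noun}', the verb '{verb}', "
    "and the adjective '{adjective}'."
)

def make_prompt() -> str:
    """Sample one noun, one verb, and one adjective, then build a prompt."""
    return PROMPT_TEMPLATE.format(
        noun=random.choice(NOUNS),
        verb=random.choice(VERBS),
        adjective=random.choice(ADJECTIVES),
    )

def generate_stories(generate: Callable[[str], str], n: int) -> list[str]:
    """Call a text-generation function n times with freshly sampled prompts.

    In the reported project, a loop like this ran millions of times
    over several days to produce the training corpus.
    """
    return [generate(make_prompt()) for _ in range(n)]
```

Varying the sampled word triple on every call is what keeps the millions of generated stories diverse despite the tiny fixed vocabulary.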

Sébastien Bubeck, Microsoft's vice president of generative AI, said: "All of a sudden, you have far more control than you used to. You can decide at a more granular level what you want your model to learn."

With synthetic data, Bubeck says, you can also add explanations to the data to better guide the AI system through the learning process, where raw data might otherwise confuse it.

However, some AI experts worry about the risks of the technique. A team of researchers from Oxford, Cambridge, and several other well-known universities published a paper last year explaining why building new AI models on synthetic data generated by ChatGPT can lead to what they call "model collapse."

In their experiments, AI models built on ChatGPT's output began to develop "irreversible defects" and appeared to lose memory of what they were originally trained on. For example, the researchers prompted a large language model with text about historic buildings in the United Kingdom. After retraining the model several times on its own synthetic output, it began producing nonsensical gibberish about jackrabbits.
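Schematically, the experiment amounts to a recursive loop in which each model generation is trained only on the previous generation's output. The sketch below is an assumption-heavy outline of that setup, with `train` and `sample` as stand-ins for the full LLM training and sampling pipeline described in the paper.

```python
from typing import Callable, List

def recursive_training(
    real_corpus: List[str],
    train: Callable[[List[str]], object],
    sample: Callable[[object], List[str]],
    generations: int = 5,
) -> List[object]:
    """Outline of the model-collapse setup.

    Generation 0 is trained on real text; every later generation is
    trained only on text sampled from its predecessor. `train` and
    `sample` are stand-ins for a real LLM training/sampling pipeline.
    The paper reports that output quality degrades generation by
    generation under this regime.
    """
    corpus = real_corpus
    models = []
    for _ in range(generations):
        model = train(corpus)    # fit a model on the current corpus
        corpus = sample(model)   # replace the data with model output
        models.append(model)
    return models
```

The key detail is that fresh real data never re-enters the loop; each generation inherits, and compounds, the errors of the last.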

The researchers also worry that synthetic data could amplify bias and toxicity in datasets. Some proponents counter that, with the right safeguards, models developed this way can be as accurate as, or even better than, those built on real data.

Zakhar Shumaylov, a Ph.D. student at the University of Cambridge and a co-author of the aforementioned model-collapse paper, said in an email: "Synthetic data can be useful if done properly. However, there is no clear answer as to how to handle it properly; some biases can be difficult for humans to detect."

There is also a more philosophical debate: if large language models get stuck in an endless loop of training on their own content, will AI stop being a machine that mimics human intelligence and become one that mimics the language of other machines?

Percy Liang, a professor of computer science at Stanford University, said that producing useful synthetic data still requires genuine human ingenuity in the form of books, articles, and program code. "Synthetic data is not real data, just as dreaming of climbing Mount Everest is not actually reaching the summit," Liang said in an email.

Even pioneers of synthetic data in artificial intelligence agree that humans cannot be excluded from the process: real people are still needed to build and refine artificial datasets.

"Synthetic data is not as simple as pressing a button and saying, 'Hey, help me generate some data,'" Bubeck said. "It's a very complicated process. The process of building synthetic data at scale requires a significant investment of manpower. ”
