
Microsoft, Google, and Meta are betting on synthetic data to build AI models

Author: Sina Finance

Behind every clever chatbot response is a huge amount of data: in some cases, trillions of words drawn from articles, books, and online comments to teach the AI system to understand a user's query. The conventional industry wisdom is that building the next generation of AI products will require ever more information.

However, there is a big problem with this plan: there is a limit to the high-quality data available on the web. To get it, AI companies typically either pay publishers millions of dollars to license content or download it from websites, exposing themselves to copyright disputes. A growing number of top AI companies are exploring an alternative that divides the industry: synthetic data, which is, in essence, fake.

Here's how it works: tech companies can use their own AI systems to generate text and other media, then use that data to train future versions of those same systems. Anthropic's chief executive, Dario Amodei, has called this a potential "infinite data generation engine." In this way, AI companies can sidestep many legal, ethical, and privacy concerns.
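As a rough illustration of that loop, here is a minimal sketch of such a data engine in Python. It is an assumption-laden illustration, not any company's actual pipeline: `model_generate` stands in for a call to an existing model, and `prompts` for whatever seed instructions drive it.

```python
from typing import Callable, Iterable, Iterator

def synthetic_data_engine(
    model_generate: Callable[[str], str],
    prompts: Iterable[str],
) -> Iterator[str]:
    """Yield an open-ended stream of synthetic training text.

    `model_generate` stands in for a call to a company's existing
    model; `prompts` supplies the seed instructions. The resulting
    stream could then be fed into training the next model version.
    """
    for prompt in prompts:
        yield model_generate(prompt)

# Hypothetical usage: an echo-style stand-in model and two seed prompts.
if __name__ == "__main__":
    fake_model = lambda p: f"[generated text for: {p}]"
    for text in synthetic_data_engine(fake_model, ["prompt A", "prompt B"]):
        print(text)
```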

The idea of synthetic data in computing is not new: the technique has been used for decades in everything from de-identifying personal information to simulating road conditions for autonomous driving. But the rise of generative AI has made it easier to produce higher-quality synthetic data at scale, and has lent new urgency to the practice.

Anthropic said it used synthetic data to build the latest model powering its chatbot, Claude. Meta and Google have used it to develop their recent open-source models. Google DeepMind recently said it relied on the approach to help train a model that can solve Olympiad-level geometry problems. There has been much speculation about whether OpenAI used synthetic data to train its text-to-video generator, Sora. (OpenAI has said it is exploring the use of synthetic data but declined to confirm further details.)

At Microsoft, the generative AI research team used synthetic data in a recent project. They wanted to build a smaller, less resource-intensive AI model that still had strong language and reasoning capabilities. To do so, they tried to mimic the way children learn language by reading stories.

Instead of feeding the AI model a large number of children's books, the team listed 3,000 words that a four-year-old could understand. They then asked the AI model to create a children's story using one noun, one verb, and one adjective from that vocabulary. The researchers repeated the prompt millions of times over several days, generating millions of short stories that ultimately helped train another, more capable language model. Microsoft has open-sourced this new family of "small" language models, Phi-3, making it publicly available.
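To make the procedure concrete, here is a minimal sketch of that prompting loop, assuming a generic text-generation callable. The word lists and prompt wording below are illustrative placeholders, not Microsoft's actual vocabulary or prompts.

```python
import random
from typing import Callable

# Illustrative fragments of a ~3,000-word vocabulary a four-year-old
# might know. These example words are hypothetical, not Microsoft's list.
NOUNS = ["dog", "ball", "tree", "moon", "boat"]
VERBS = ["run", "jump", "find", "hug", "sing"]
ADJECTIVES = ["happy", "tiny", "red", "sleepy", "brave"]

PROMPT_TEMPLATE = (
    "Write a short children's story that a four-year-old could understand. "
    "The story must use the noun '{noun}', the verb '{verb}', "
    "and the adjective '{adjective}'."
)

def make_prompt() -> str:
    """Sample one noun, one verb, and one adjective, then build a prompt."""
    return PROMPT_TEMPLATE.format(
        noun=random.choice(NOUNS),
        verb=random.choice(VERBS),
        adjective=random.choice(ADJECTIVES),
    )

def generate_stories(generate: Callable[[str], str], n: int) -> list[str]:
    """Call a text-generation function n times with freshly sampled prompts.

    In the reported project, a loop like this ran millions of times
    over several days to produce the training corpus.
    """
    return [generate(make_prompt()) for _ in range(n)]
```

Varying the sampled word triple on every call is what keeps the millions of generated stories diverse despite the tiny fixed vocabulary.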

Sébastien Bubeck, Microsoft's vice president of generative AI, said: "All of a sudden, you have far more control than you used to. You can decide at a more granular level what you want your model to learn."

With synthetic data, Bubeck says, you can also add explanations to the data to better guide the AI system through the learning process, where raw data might otherwise confuse it.

However, some AI experts worry about the risks of the technique. A team of researchers from Oxford, Cambridge, and several other well-known universities published a paper last year explaining why building new AI models on synthetic data generated by ChatGPT can lead to what they call "model collapse."

In their experiments, AI models built on ChatGPT's output began to develop "irreversible defects" and appeared to lose memory of what they were originally trained on. For example, the researchers prompted a large language model with text about historic buildings in the United Kingdom. After retraining the model several times on its own synthetic output, it began producing nonsensical gibberish about jackrabbits.
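Schematically, the experiment amounts to a recursive loop in which each model generation is trained only on the previous generation's output. The sketch below is an assumption-heavy outline of that setup, with `train` and `sample` as stand-ins for the full LLM training and sampling pipeline described in the paper.

```python
from typing import Callable, List

def recursive_training(
    real_corpus: List[str],
    train: Callable[[List[str]], object],
    sample: Callable[[object], List[str]],
    generations: int = 5,
) -> List[object]:
    """Outline of the model-collapse setup.

    Generation 0 is trained on real text; every later generation is
    trained only on text sampled from its predecessor. `train` and
    `sample` are stand-ins for a real LLM training/sampling pipeline.
    The paper reports that output quality degrades generation by
    generation under this regime.
    """
    corpus = real_corpus
    models = []
    for _ in range(generations):
        model = train(corpus)    # fit a model on the current corpus
        corpus = sample(model)   # replace the data with model output
        models.append(model)
    return models
```

The key detail is that fresh real data never re-enters the loop; each generation inherits, and compounds, the errors of the last.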

The researchers also worry that synthetic data could amplify bias and toxicity in datasets. Some proponents counter that, with the right safeguards, models developed this way can be as accurate as, or even better than, those built on real data.

Zakhar Shumaylov, a Ph.D. student at the University of Cambridge and a co-author of the aforementioned model-collapse paper, said in an email: "Synthetic data can be useful if done properly. However, there is no clear answer as to how to handle it properly; some biases can be difficult for humans to detect."

There is also a more philosophical debate: if large language models get stuck in an endless loop of training on their own content, will AI stop being a machine that mimics human intelligence and become one that mimics the language of other machines?

Percy Liang, a professor of computer science at Stanford University, said that producing useful synthetic data still requires genuine human ingenuity in the form of books, articles, and program code. "Synthetic data is not real data, just as dreaming of climbing Mount Everest is not actually reaching the summit," Liang said in an email.

Even pioneers of synthetic data in artificial intelligence agree that humans cannot be excluded from the process: real people are still needed to build and refine artificial datasets.

"Synthetic data is not as simple as pressing a button and saying, 'Hey, help me generate some data,'" Bubeck said. "It's a very complicated process. The process of building synthetic data at scale requires a significant investment of manpower. ”
