
Training AI with AI-generated synthetic data: new markets for AI model training emerge


Smart stuff

Compiled by Mingyi | Edited by Xu Shan

Companies such as Microsoft, OpenAI and Cohere are testing the use of synthetic data (computer-generated information) to train large language models (LLMs). Synthetic data is already being used to train LLMs and is currently the most sophisticated form of artificial data in use; training on it may help push AI models further.

Developers say that ordinary data scraped from the web is no longer enough to keep improving the performance of AI models, so several AI companies have turned their attention to synthetic data.

In May, OpenAI CEO Sam Altman was asked whether he was concerned about regulators investigating ChatGPT's potential privacy violations. Altman said he was "very confident that all data would be replaced with synthetic data."

First, synthetic data can meet the growing data needs of AI model training

Aidan Gomez, CEO of AI startup Cohere, once said: "If you could get all the data you need from the web, that would be great. But in reality, the web is noisy and chaotic, and it doesn't meet our need for data."


Pictured: Cohere CEO Aidan Gomez

To improve AI models and apply them in fields such as science, medicine, or business, developers need specialized, complex datasets for training. Such data is either created by experts such as scientists, doctors, or engineers, or obtained from large companies such as pharmaceutical firms, banks, and retailers. But "human-created data is very expensive," Gomez said.

Synthetic data avoids this expense: AI companies can use AI models to generate data related to, say, healthcare or finance, and then use it to train LLMs.

Gomez said Cohere and several other AI companies already use synthetic data that humans then fine-tune. "Even if it isn't widely publicized, the amount of synthetic data already in use is huge," Gomez said.

For example, to train an AI model, Cohere might have two AI models talk to each other, with one acting as a math teacher and the other acting as a student.

"The dialogue between the two AI models revolves around the trigonometry of mathematics, which is generated by AI." Gomez said, "All this conversation is just the imagination of AI models. The human would then look at the conversation, and if the model said something wrong, a human would step in and correct it. That's what we're doing. ”

Two recent studies by Microsoft Research have shown that synthetic data can be used to train models that are smaller and simpler than LLMs such as OpenAI's GPT-4 or Google's PaLM-2.

The first study used GPT-4 to generate a synthetic dataset of short stories containing only words that a typical four-year-old might understand. The dataset, called TinyStories, was then used to train a simple LLM capable of producing fluent, grammatically correct stories.

The other study synthesized Python code in the form of textbooks and exercises and used it to train an AI model, which the researchers found performed relatively well on coding tasks.
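A rough sketch of how textbook-style synthetic examples could be collected into a training file is shown below. The `call_llm` placeholder, the prompts, and the file name are invented for illustration and do not reproduce Microsoft's actual setup.

```python
# Illustrative sketch only: assemble "textbook-style" synthetic
# training examples (lesson + exercise + solution) into a JSONL file
# that a small model could later be fine-tuned on.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; returns a canned record here."""
    return '{"lesson": "...", "exercise": "def add(a, b): ...", "solution": "return a + b"}'

TOPICS = ["list comprehensions", "recursion", "string formatting"]

with open("synthetic_textbook.jsonl", "w") as f:
    for topic in TOPICS:
        prompt = (f"Write a short textbook section on {topic} in Python, "
                  "followed by one exercise and its solution, as JSON.")
        record = json.loads(call_llm(prompt))
        record["topic"] = topic
        f.write(json.dumps(record) + "\n")  # one training example per line
```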

In the emerging market for synthetic data, startups such as Scale AI and Gretel.ai have sprung up to provide synthetic data services. Founded by former intelligence analysts from the NSA and CIA, Gretel has worked with companies such as Google, HSBC, Riot Games and Illumina, synthesizing versions of their existing data to help train better AI models.

Second, the potential risks of synthetic data cannot be ignored

Ali Golshan, CEO of Gretel, said synthetic data can protect the privacy of the individuals in a dataset while preserving its statistical integrity.

He added that carefully adjusted synthetic data can also remove bias and imbalance from existing data. "For hedge funds, you can create AI models to observe black swan events (low-probability events that are hard to predict but strike suddenly, trigger chain reactions and cause enormous damage, in fields ranging from nature to economics and politics), say by creating a hundred variants to see if our model breaks," Golshan said. For banks, where fraud typically accounts for less than one percent of total data, Gretel's software can generate thousands of fraud edge-case scenarios and use them to train AI models.
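To make the rebalancing idea concrete, here is a deliberately naive sketch, not Gretel's actual method, that oversamples a rare fraud class by jittering the few observed fraud rows; all numbers and feature shapes are invented for illustration.

```python
# Naive illustration: boost a rare fraud class by fitting per-feature
# means/stddevs to the few real fraud rows and sampling jittered
# variants, so a downstream model sees far more positive examples than
# the original <1% share provides.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1,000 transactions with 3 numeric features; ~1% are fraud.
features = rng.normal(size=(1000, 3))
labels = (rng.random(1000) < 0.01).astype(int)

fraud = features[labels == 1]
mu, sigma = fraud.mean(axis=0), fraud.std(axis=0) + 1e-6

# Generate thousands of synthetic "edge case" fraud rows around the
# observed fraud distribution.
synthetic_fraud = rng.normal(loc=mu, scale=sigma, size=(5000, 3))

augmented_X = np.vstack([features, synthetic_fraud])
augmented_y = np.concatenate([labels, np.ones(len(synthetic_fraud), dtype=int)])
print("fraud share before:", labels.mean(), "after:", augmented_y.mean())
```

Commercial tools use far more sophisticated generative models, but the goal is the same: give the downstream model many more examples of the rare class than the raw data contains.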

However, critics point out that not all synthetic data faithfully reflects or improves on real-world data. As AI-generated text and images flood the internet, AI companies that keep scraping the web for training data will likely end up repeatedly ingesting raw content produced by earlier versions of their own models, a phenomenon known as "dog-fooding."


Recent research from universities including Oxford and Cambridge warns against exactly this. Training AI models on their own raw output, which may contain falsehoods or fabrications, can corrupt and degrade the technology over time, leading to "irreversible flaws," the researchers said.
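The feedback loop behind that warning can be shown with a toy simulation: repeatedly fit a simple one-dimensional Gaussian "model" to samples drawn from its previous generation and watch the estimates drift as sampling error compounds. Real LLM training dynamics are far more complex; this sketch only illustrates the mechanism.

```python
# Toy illustration of training a model on its own output, generation
# after generation. Each "model" is just a Gaussian fitted to a small
# sample drawn from the previous generation's model, so estimation
# error accumulates and the distribution drifts from the real data.
import numpy as np

rng = np.random.default_rng(42)

data = rng.normal(loc=0.0, scale=1.0, size=50)  # the original "real" data
for generation in range(31):
    mu, sigma = data.mean(), data.std()
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
    # The next generation never sees the real data, only the previous
    # model's samples.
    data = rng.normal(loc=mu, scale=sigma, size=50)
```

With each generation the fitted parameters tend to wander further from the original distribution, a crude analogue of the degradation described above.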

Golshan agrees that training on poor-quality synthetic data can hinder the iteration of AI models. "The web is being flooded with more and more AI-generated content, and I think this will lead to generative content degrading over time, because LLMs would just keep repeating existing knowledge without any new insight," he said.

Despite these risks, AI researchers such as Cohere's Gomez say synthetic data also has the potential to accelerate the development of superintelligent AI systems.

Gomez said: "What we really want are models that can teach themselves. You want them to be able to ask their own questions, discover new truths and create their own knowledge. That's the dream."

Conclusion: Whether AI companies will apply synthetic data on a large scale remains to be seen

At present, AI companies train their models mainly on general-purpose data. If they want new data for training, the main options are specialized domain databases and synthetic data. However, because of its commercial value and the personal privacy it contains, specialized domain data is difficult to obtain for AI model training. As a result, some AI companies will choose relatively low-cost synthetic data to train new models.

However, two points deserve vigilance when using synthetic data. The first is personal privacy: synthetic data must be generated from data that is legally obtained and used. The second is the repeated reuse of data, the "dog-fooding" problem: if the data repeatedly fed into an AI model does not meaningfully change between iterations, the model's performance may degrade or develop defects.