What to do about the AI "data shortage"? Companies such as Microsoft and Google are using "synthetic data" to train AI

AI chatbots need massive amounts of high-quality data to support them. Traditionally, AI systems have relied on large amounts of data extracted from various web sources, such as articles, books, and online reviews, to make sense of a user's queries and generate responses.

Obtaining more high-quality data has long been a major challenge for AI companies. The limited availability of data on the internet has prompted them to look for an alternative: synthetic data.

Synthetic data is artificial data generated by AI systems. Tech companies use their own AI models to generate synthetic data (sometimes described as fake data) and then use that data to train future iterations of their systems.

Generating synthetic data involves setting specific parameters and prompts for the AI model that creates the content, an approach that allows more precise control over the data used to train the AI system.

For example, researchers at Microsoft gave an AI model a list of 3,000 words that a four-year-old could understand, and then asked the model to create a children's story using a noun, a verb and an adjective from the vocabulary. Through millions of repeated prompts over a period of several days, the model ended up producing millions of short stories.
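The prompt-building step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual code: the tiny vocabulary, the prompt wording, and the function names `pick_words`, `build_story_prompt` and `make_prompts` are all hypothetical, and the step of sending each prompt to a model is left out.

```python
import random

# Hypothetical stand-in vocabulary; the real 3,000-word list is not reproduced here.
VOCAB = {
    "noun": ["dog", "ball", "tree", "boat"],
    "verb": ["jump", "find", "share", "sing"],
    "adjective": ["happy", "tiny", "red", "sleepy"],
}


def pick_words(vocab, rng):
    """Draw one noun, one verb and one adjective at random."""
    return {kind: rng.choice(words) for kind, words in vocab.items()}


def build_story_prompt(words):
    """Turn the chosen words into an instruction for a text-generating model."""
    return (
        "Write a short story that a four-year-old could understand. "
        f"The story must use the noun '{words['noun']}', "
        f"the verb '{words['verb']}' and the adjective '{words['adjective']}'."
    )


def make_prompts(count, seed=0):
    """Build `count` prompts; a fixed seed makes the run repeatable."""
    rng = random.Random(seed)
    return [build_story_prompt(pick_words(VOCAB, rng)) for _ in range(count)]


if __name__ == "__main__":
    # Each printed prompt would be sent to a model to produce one story.
    for prompt in make_prompts(3):
        print(prompt)
```

Because each prompt draws a different combination of words, repeating the loop many times yields varied stories while the vocabulary keeps every story within the intended reading level.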

While synthetic data in computing is not a new concept, the rise of generative AI has facilitated the creation of higher-quality synthetic data at scale.

Dario Amodei, CEO of AI startup Anthropic, calls this approach an "infinite data generation engine" that could avoid some of the copyright, privacy and other issues associated with traditional data collection methods.

Existing use cases and divergent views

Currently, major AI companies such as Meta, Google, and Microsoft have begun to develop advanced models using synthetic data, including chatbots and language processors.

For example, Anthropic uses synthetic data to power its chatbot, Claude; Google DeepMind uses this approach to train models capable of solving complex geometric problems; and Microsoft has made public small language models developed using synthetic data.

Some proponents argue that synthetic data can produce accurate and reliable models if implemented properly.

However, some AI experts have expressed concerns about the risks associated with synthetic data. Researchers at prestigious universities have observed examples of "model collapse," in which AI models trained on synthetic data develop irreversible flaws and produce absurd outputs. In addition, there are concerns that synthetic data may exacerbate bias and errors in datasets.

Zakhar Shumaylov, a Ph.D. student at the University of Cambridge, wrote in an email, "Synthetic data can be useful if done properly. However, there is no clear answer as to how this can be handled properly; some biases can be difficult for humans to detect."

In addition, there is a philosophical debate around the reliance on synthetic data, which raises questions about the nature of AI: if machine-synthesized data is used, is AI still a machine mimicking human intelligence?

Percy Liang, a professor at Stanford University, emphasized the importance of integrating genuine human intelligence into the data generation process and highlighted the complexity of creating synthetic data at scale. "Synthetic data is not real data, just as dreaming of climbing Mount Everest does not mean you actually reached the summit," he said.

There is currently no consensus on best practices for generating synthetic data, which highlights the need for further research and development in this area. As the field continues to evolve, collaboration between AI researchers and domain experts will be essential to realizing the full potential of synthetic data.

Source | Finance Associated Press
