
The AI industry's open secret: models around the world are copying one another

Author: Wall Street Sights

Plagiarism has become an open secret in the AI world.

According to an article published Monday by The Information, many startups' AI chatbots were likely developed using data from OpenAI and other companies. These bots rival GPT-4 on some tasks but cost a fraction of the price.

The startups did not disclose that they used OpenAI's technology during development. However, The Information reports that OpenAI CEO Sam Altman told startup founders last summer that it was acceptable for them to use OpenAI's technology in this way.

While Altman's response came as a relief to some startups, the practice ultimately undercuts OpenAI's own growth, and Altman could change his mind at any time.


Among startups, plagiarism has become the norm

Startups copy OpenAI by signing up for a GPT-4 subscription and asking it a series of questions, such as "What's wrong with this line of code?" They then use these questions and answers to train their own competing models.
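In rough terms, the collection step looks something like the sketch below. This is an illustrative example only, not code described in the article: the prompts, the file name distillation_pairs.jsonl, and the model choice are placeholders, and it assumes the official OpenAI Python SDK (v1+).

```python
# Hypothetical sketch: collect GPT-4 answers to a list of prompts and save
# them as prompt/response pairs for later fine-tuning of another model.
import json

from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder prompts; in practice these would number in the thousands.
prompts = [
    "What's wrong with this line of code: print('hello'",
    "Explain the difference between a list and a tuple in Python.",
]

with open("distillation_pairs.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content
        # One JSON object per line: a question paired with GPT-4's answer.
        f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")
```

The resulting file is the kind of question-and-answer dataset the article describes being fed into a competing model.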

Many startups have adopted this strategy.

Daniel Han, co-founder of Unsloth AI, estimates that about half of his clients take data from GPT-4 or Anthropic's Claude model and use it to improve their own models. Many companies also get this kind of data from ShareGPT, a website where developers share answers generated using OpenAI models.

Models from smaller developers are often based on popular open-source models freely available from Meta Platforms or Mistral AI, but incorporating answers from OpenAI's models can significantly improve the quality of their output. Han said some developers use a service called OpenPipe to automate this process.
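To illustrate the second half of that pipeline, here is a hedged sketch of fine-tuning a freely available base model on such collected pairs with Hugging Face Transformers. It is not from the article and is not OpenPipe's actual service; the base model name, file path, and hyperparameters are placeholders.

```python
# Hypothetical sketch: fine-tune an open-source base model on collected
# prompt/response pairs such as the JSONL file from the previous example.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder open-source base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Load the collected question/answer pairs (one JSON object per line).
dataset = load_dataset("json", data_files="distillation_pairs.jsonl")["train"]

def tokenize(example):
    # Concatenate each prompt and response into a single training text.
    text = f"### Question:\n{example['prompt']}\n\n### Answer:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilled-model",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    # mlm=False yields standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice developers would add parameter-efficient methods and far more data, but the shape of the workflow is the same: an open base model plus answers harvested from a stronger proprietary one.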

"This is what will happen in a new ecosystem where there are no clear rules in place," said Matt Murphy, managing director of Menlo Ventures, which invested in OpenAI's rival Anthropic. Murphy says:

If everyone is using the same data, how can you be better than everyone else?

It's unclear to what extent OpenAI, Google, Anthropic, and other big developers will allow startup rivals to use their data to catch up.

Rob Toews, a partner at Radical Ventures, said:

The quality and sourcing of training data for AI models is becoming one of the most pressing issues. No one knows exactly how things will play out, but any AI startup that hasn't thought through [its data sources] thoroughly and strategically is falling behind.

Developers who have quietly relied on other AI services while building their models may find themselves in an awkward position if that is exposed.

For example, Paris-based Mistral created its own AI using Meta's open-source AI model, Llama 2, but didn't disclose it until it was accidentally leaked, causing some developers to be unhappy. Mistral has raised hundreds of millions of dollars.


Do the big companies do the same?

In fact, when startups use OpenAI's data to train their models, they are doing the same thing that AI giants like OpenAI itself do with other companies' data.

OpenAI's chief technology officer, Mira Murati, appeared hesitant and confused last month when asked whether the company had used data from Google's YouTube, as well as Meta Platforms' Facebook and Instagram, to train Sora, its AI video generator.

It wouldn't be surprising if OpenAI actually used this data.

According to a recent report in The New York Times, OpenAI built a speech-recognition tool called Whisper to transcribe YouTube videos and used the transcripts to improve GPT-4. Earlier media reports said OpenAI had quietly used YouTube data to train its early AI models.

Just earlier this month, YouTube CEO Neal Mohan also said he disapproved of OpenAI using YouTube videos to develop a text-to-video model like Sora.

The practice has also drawn a copyright infringement lawsuit against OpenAI. The New York Times sued OpenAI and its biggest backer, Microsoft, in December, alleging that they illegally copied the newspaper's articles while training their models. OpenAI's chatbot "can generate Times content word for word," the lawsuit alleges.

In response, OpenAI argued that it has worked to establish partnerships with news publishers and that its training practices fall under the "fair use" doctrine of U.S. copyright law.

Still, both OpenAI and Google have struck multimillion-dollar licensing deals with publishers like Axel Springer and larger deals with major sites like Reddit.

Even the tech giants can't resist the lure of shortcuts.

The Information reported that Google transcribed YouTube videos, Meta hired contractors to summarize copyrighted books, and Adobe used Midjourney's AI to generate images, all to train their own AI models. A Google engineer reportedly resigned over concerns about the company's use of data from OpenAI's ChatGPT.

Sharon Zhou, CEO of the startup Lamini, said the rapid pace of AI development and fierce competition have pushed developers toward controversial sources of training data, such as copyrighted content or the output of other large language models.

Zhou says:

In this space, investors need to see very fast progress.


This article does not constitute personal investment advice and does not represent the views of the platform. Markets carry risk and investment requires caution; please exercise independent judgment and make your own decisions.
