Reported by the Heart of the Machine
Author: Dapan Chicken, Jiaqi
OpenAI: The New York Times' lawsuit is baseless.
At the end of 2023, The New York Times filed a copyright lawsuit against Microsoft and OpenAI, backed by what it presented as strong evidence. According to Cecilia Ziniti, who has served as general counsel at several tech companies, the Times has a very strong chance of winning.
Andrew Ng, a well-known machine learning scholar, posted two tweets laying out his views on the matter. In the first, he expressed sympathy for OpenAI and Microsoft, suspecting that many of the reproduced articles were actually generated through a mechanism similar to RAG (Retrieval-Augmented Generation) rather than from the model's trained weights alone.
Source: https://twitter.com/AndrewYNg/status/1744145064115446040
However, Ng's speculation has also been pushed back on. Gary Marcus, a professor at New York University, pointed out that "plagiarism" in the field of image generation has nothing to do with RAG.
Ng then tweeted again to elaborate on his earlier point. He made it clear that it is wrong for any company to copy someone else's copyrighted content at scale without permission or a valid fair-use justification. However, he argues that LLMs "regurgitate" training text only under specific prompts and only in rare cases, and that ordinary users almost never use such prompts. As for the specific prompting tricks that can get GPT-4 to reproduce New York Times text, Ng said this happens only rarely, and added that newer versions of ChatGPT appear to have patched the loophole.
Source: https://twitter.com/AndrewYNg/status/1744433663969022090
When Ng tried to replicate the most damning-looking infringement examples in the lawsuit, such as using ChatGPT to bypass paywalls or to reproduce Wirecutter results, he found that the attempts triggered GPT-4's web browsing capability, which suggests RAG may be involved. GPT-4 can browse the web, performing a search or downloading a specific article, and use that retrieved information to generate its response. He argues that the prominence of these examples in the lawsuit could create the misconception that training LLMs on New York Times text directly caused the copying; but if RAG is involved, the root cause of those reproductions is not that the LLM was trained on NYT text.
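To make Ng's distinction concrete, here is a minimal sketch of the RAG-style flow he describes, using hypothetical `fetch_article` and `llm_generate` placeholders rather than any real OpenAI API: the quoted text enters the prompt via retrieval at inference time, so near-verbatim output does not require the article to be memorized in the model's weights.

```python
# Minimal RAG-style sketch. `fetch_article` and `llm_generate` are hypothetical
# placeholders, not real OpenAI functions.

def fetch_article(url: str) -> str:
    """Hypothetical retrieval step, e.g. a browsing tool downloading the page."""
    import requests  # assumes the page is publicly reachable
    return requests.get(url, timeout=10).text

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to any chat/completions endpoint."""
    raise NotImplementedError("plug in your model client here")

def answer_with_browsing(question: str, url: str) -> str:
    # The retrieved article is injected into the context window verbatim,
    # so the model can quote it closely without having memorized it in training.
    article = fetch_article(url)
    prompt = f"Use the following article to answer.\n\n{article}\n\nQuestion: {question}"
    return llm_generate(prompt)
```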
With these two viewpoints on the table, and having already seen The New York Times' side of the argument, what is OpenAI's take on the matter and how has it responded? Let's take a look.
Blog address: https://openai.com/blog/openai-and-journalism
OpenAI takes a stand
OpenAI says its goal is to develop AI tools that empower people to solve problems that are out of reach. Their technology is being used by people around the world to improve everyday life.
OpenAI disagrees with the claims in the New York Times lawsuit, but sees it as an opportunity to shed light on the company's business, intentions, and how its technology is built. It summarizes its position in the following four points:
- Partnering with news organizations and creating new opportunities;
- Training is fair use, but OpenAI provides an option to opt out;
- "Regurgitation" is a rare bug that OpenAI is working to drive to zero;
- The New York Times' narrative is incomplete.
OpenAI also elaborated on these four points in its blog.
OpenAI partners with news organizations and creates new opportunities
OpenAI strives to support news organizations in its technology design process. It has met with a number of media outlets and leading industry organizations to discuss their needs and offer solutions. OpenAI's goal is to learn, educate, listen to feedback, and adapt, in order to support a healthy news ecosystem and create mutually beneficial opportunities.
Its partnerships with news organizations aim to:
- help reporters and editors with tedious, time-consuming work;
- let AI models learn about the world by training on additional historical, non-public content;
- display real-time content in ChatGPT with attribution, giving news publishers a new way to connect with their readers.
Training is fair use
but OpenAI provides an option to opt out
The position that training AI models on publicly available internet material is fair use is long-standing and widely supported. That support comes from a broad range of academics, library associations, civil society groups, start-ups, leading U.S. companies, creators, authors, and others who agree that model training qualifies as fair use. The European Union, Japan, Singapore, and Israel also have laws that permit training models on copyrighted content, which OpenAI describes as an advantage for AI innovation, progress, and investment.
OpenAI says it was the first in the AI industry to offer a simple opt-out process, which The New York Times adopted in August 2023 to prevent OpenAI's tools from accessing its website.
"Retelling" is a rare mistake
OpenAI is working to reduce it to zero
"Paraphrasing" is a rare glitch in the AI training process. If a specific piece of content appears more than once in the training data, such as when the same piece of content is repeatedly retweeted by different websites, "retelling" by AI models is more common. As a result, OpenAI has taken some steps to prevent duplicate content in the model's output.
OpenAI also designs its models to learn concepts and then apply them to new problems, and it wants them to absorb fresh information from around the world. Because the training data draws on the breadth of human knowledge, journalism is only a small slice of it, and any single data source, including The New York Times, is not significant for what the model is meant to learn.
The New York Times' narrative is incomplete
As of December 19 of last year, negotiations between OpenAI and The New York Times toward a partnership appeared to be progressing smoothly. The talks focused on having ChatGPT display real-time content from the Times with attribution in its answers, giving the paper a new way to connect with readers. During those negotiations, OpenAI had already explained to the Times that its content did not contribute meaningfully to the training of existing models and would not be significant for future training either.
The New York Times declined to share with OpenAI any examples of GPT allegedly regurgitating its stories. OpenAI had already shown good faith on this front: in July, after learning that ChatGPT could inadvertently reproduce real-time content from live web pages, it immediately took the relevant feature down.
However, the "plagiarism" provided by the New York Times seems to be articles from many years ago. These articles have been widely reposted and disseminated on several third-party websites. OpenAI believes that the New York Times may have deliberately manipulated the prompt words by typing in large excerpts from "plagiarized" articles to induce the AI to respond with a high degree of repetition of the original text. Even with such prompts, OpenAI's models don't typically have such a high rate of repetition in appeals. As a result, OpenAI speculates that The New York Times either manipulated the prompt words or carefully selected "examples" through trial and error.
Such deliberately engineered, repeated rounds of prompting also violate OpenAI's terms of use. OpenAI is continually hardening its systems against adversarial attempts to make models regurgitate training data, and says it has made significant progress recently.
OpenAI concluded its blog by saying that the New York Times lawsuit is baseless. They still want to have a constructive partnership with The New York Times and respect its long history.
The outcome of this dispute will be crucial for the future of AI: it could hamper the training of AI models, or it could open up new ways for AI and businesses such as news publishers to develop together. What do you think? Leave a comment below to discuss.