
A "data uprising" breaks out in the United States: Hollywood, literature, the press and social media rebel against AI

Author: The Paper

Eric Goldman, a professor at Santa Clara University School of Law, believes the wave of litigation has only just begun: "second and third waves" are coming that will define the future of artificial intelligence.


The Writers Guild of America has been on strike for more than 70 days, demanding higher wages, a larger share of streaming revenue, and regulation of artificial intelligence.

A "data uprising" is breaking out in the United States, with Hollywood, artists, writers, social media companies and news organizations all joining the revolt.

All of it is aimed at generative AI tools such as ChatGPT and Stable Diffusion, which are accused of illegally using the work of content creators to train large language models without permission or compensation.

At the heart of this "data uprising" is a new recognition that online information — stories, artwork, news articles, web posts, and photos — can have significant untapped value. The practice of scraping public content from the Internet has a long history, and most companies and nonprofits that do so do it openly. But before ChatGPT was released, data owners paid little attention and did not consider it a particularly serious problem. That changed as the public learned more about how AI models are trained.

"This is a fundamental reshaping of the value of data," Brandon Duderstadt, founder and CEO of Nomic, said in an interview. "Previously, people got value from their data by making it accessible to everyone and running ads. Now, people are thinking about protecting their data."

The tide rises

In recent months, social media companies such as Reddit and Twitter, news organizations such as The New York Times and NBC, science fiction writer Paul Tremblay and actress Sarah Silverman have taken action against the unauthorized collection of their work and data by artificial intelligence. This series of moves has been dubbed "Data Revolt" by the US media.

Last week, Silverman filed a lawsuit against OpenAI and Meta, accusing them of using pirated copies of her books as training data, because the two companies' chatbots can accurately summarize the books' contents. In addition, more than 5,000 writers, including Jodi Picoult, Margaret Atwood and Viet Thanh Nguyen, signed a petition demanding that tech companies obtain permission before using their books as training data, and give them credit and compensation.

To protect their work, writers and artists have adopted different forms of protest. Some lock their work away so AI cannot access it; some boycott sites that publish AI-generated content; others write disruptive content to interfere with AI training.

On July 13, SAG-AFTRA, one of Hollywood's three largest unions with 160,000 members, announced a strike, after the Writers Guild of America had already been on strike for more than 70 days. The New York Times said the strike brought the $134 billion U.S. film and television industry to a standstill. SAG-AFTRA demanded that the streaming giants provide fairer profit sharing and better working conditions, and that production companies guarantee they would not replace actors with AI- and computer-generated faces and voices.

At the same time, some news organizations are pushing back against AI. In June, in an internal memo on the use of generative AI, The New York Times said that "AI companies should respect our intellectual property." That same month, in a statement released by Digital Content Next, a trade group representing online publishers, publishers including The New York Times and The Washington Post argued that using copyrighted news articles as AI training data poses potential risks and legal issues, and called on AI companies to respect publishers' intellectual property and creative labor.

Social media companies have also taken a stand. In April, social news site Reddit said it would charge third parties for access to its application programming interface (API). Reddit CEO Steve Huffman said his company "doesn't need to give all of that value to some of the largest companies in the world for free." In July, Twitter owner Elon Musk said that some companies and organizations had "illegally" scraped Twitter data on a large scale, and that in response to "extreme levels of data scraping and system manipulation," Twitter would limit the number of tweets individual accounts could view.


Reddit co-founder and CEO Steve Huffman's plan to charge third parties for access to the site's application programming interface (API) sparked widespread protests from users.

This "data uprising" also includes a "wave of lawsuits," with some AI companies sued multiple times over how they collect and use data. Last November, a group of programmers filed a class-action lawsuit against Microsoft and OpenAI, alleging that the two companies violated their copyrights by using their code to train AI programming assistants. In June, the Los Angeles-based law firm Clarkson filed a 151-page class-action lawsuit against OpenAI and Microsoft, alleging that OpenAI collects data from minors and that its web scraping violates copyright law and constitutes "theft." The firm has since filed a similar lawsuit against Google.

Eric Goldman, a professor at Santa Clara University School of Law, said in an interview that the arguments in the lawsuits are too broad and unlikely to be accepted by the courts. But he believes the wave of litigation has only just begun, and that "second and third waves" are coming that will define the future of artificial intelligence.

Legal disputes

Generative AI systems such as OpenAI's ChatGPT and DALL·E, Google's Bard, and Stability AI's Stable Diffusion are all trained on massive numbers of news articles, books, images, videos, and blog posts scraped from the internet, many of them copyrighted.

In March, OpenAI released an analysis of its flagship language model showing that the text portion of its training data drew on news websites, Wikipedia, and a pirated-book database (LibGen) that has been shut down by the U.S. Department of Justice.

On July 13, the U.S. Federal Trade Commission (FTC) sent OpenAI a 20-page demand for records on the risk management, data security, and auditing of its AI models, as part of an investigation into whether the company has violated consumer protection laws.


On July 12, a U.S. Senate subcommittee held a hearing on artificial intelligence, intellectual property, and copyright, at which the witnesses were sworn in. The hearing heard testimony from the music industry, Photoshop maker Adobe, AI company Stability AI, and illustrator Karla Ortiz.

But in public appearances and in responses to the lawsuits, AI companies have argued that using copyrighted works to train AI is reasonable, invoking the concept of "transformative use" in U.S. copyright law, which provides an exception when material is used in a sufficiently "transformative" way.

"AI models are basically learning from all the information. It's like a student reading books in a library and then learning how to write and read," Kent Walker, Google's president of global affairs, said in an interview. "At the same time, you have to make sure that you don't copy someone else's work or do something that infringes copyright."

Halimah DeLaine Prado, Google's general counsel, told the media: "For years, it has been clear to everyone that we use data from public sources, such as information posted to the open web and public datasets, to train the AI models behind services such as Google Translate. U.S. law supports the use of public information to create new beneficial uses, and we look forward to refuting these baseless claims."

Andres Sawicki, a professor of intellectual property law at the University of Miami, said in an interview that there are precedents that could favor the tech companies, such as a 1992 U.S. Court of Appeals ruling that allowed companies to reverse engineer other companies' software code in order to design competing products. But many say it is intuitively unfair for large companies to use creators' work to build new moneymakers. "It's really hard to answer questions about generative AI," he said.

Jessica D. Litman, a professor of copyright law at the University of Michigan, said that fair use is a strong defense for AI companies, because most of the output of AI models does not explicitly resemble the work of any particular person. But she argues that if the creators suing AI companies can show enough examples of AI output that closely resemble their work, they will have a strong claim that their copyrights are being infringed.

AI companies are starting to respond

Sawicki said AI companies could avoid this by installing filters in their products to ensure they do not generate anything too similar to existing work. YouTube, for example, already uses technology to detect and automatically remove copyrighted works uploaded to its site. In theory, AI companies could build similar algorithms to flag outputs that are highly similar to existing works of art, music, or writing.
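The kind of output filter described above can be sketched in a few lines. The following is a minimal, illustrative example only — real systems such as YouTube's Content ID use far more robust fingerprinting — and every function name and the threshold value here are hypothetical, not drawn from any actual product. It compares generated text against a corpus of protected works using character n-gram overlap (Jaccard similarity) and blocks output that is too close to any of them:

```python
def ngrams(text: str, n: int = 5) -> set[str]:
    """Return the set of character n-grams of a whitespace-normalized text."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(len(t) - n + 1)}


def similarity(candidate: str, reference: str, n: int = 5) -> float:
    """Jaccard similarity between the n-gram sets of two texts (0.0 to 1.0)."""
    a, b = ngrams(candidate, n), ngrams(reference, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def should_block(candidate: str, protected_works: list[str],
                 threshold: float = 0.6) -> bool:
    """Block any output whose overlap with a protected work exceeds the threshold."""
    return any(similarity(candidate, work) >= threshold
               for work in protected_works)
```

A verbatim reproduction of a protected passage scores near 1.0 and is blocked, while unrelated text scores near 0.0 and passes; choosing the threshold is exactly the hard policy question the lawsuits are fighting over.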

This "data uprising" may not make waves in the long run. Tech giants like Google and Microsoft already hold vast amounts of proprietary data and have the means to acquire more. But as access to content becomes more difficult, startups and nonprofits hoping to compete with the big companies may not have enough data to train their systems.

In early July, Stuart Russell, a professor of computer science at the University of California, Berkeley and co-author of "Artificial Intelligence: A Modern Approach," warned that AI chatbots such as ChatGPT could soon "run out of text in the universe," and that the strategy of training them by collecting ever larger amounts of text "is starting to run into difficulties."

Some companies are also responding to the wave cooperatively. OpenAI said in a statement, "We respect the rights of creatives and authors and look forward to continuing to work with them to protect their interests." On July 14, the Associated Press agreed to license its post-1985 news archive to OpenAI, and will in turn gain access to OpenAI's technology and products.

Google also said in a statement that it is in negotiations over how publishers will manage their content in the future. "We believe that everyone benefits from a vibrant content ecosystem," the company said.

Margaret Mitchell, chief ethics scientist at AI company Hugging Face, said in an interview that "the whole data collection system needs to change, and unfortunately, it may have to happen through litigation — that is often what drives tech companies to change." She said she would not be surprised if OpenAI pulled one of its products entirely before the end of the year because of lawsuits or new regulations.
