Reported by Machine Heart (Synced)
Editor: Jia Qi
Although the prompt merely asked for "animated toys", the results are indistinguishable from Toy Story.
Not long ago, the New York Times's lawsuit accusing OpenAI of using its content for AI development drew wide attention and discussion in the community.
In many of GPT-4's outputs, New York Times articles are reproduced almost verbatim:
Red text in the image marks where GPT-4's output duplicates the New York Times article.
Experts hold differing views on this.
Andrew Ng, an authoritative scholar in machine learning, expressed sympathy for OpenAI and Microsoft. He suspects that GPT's "plagiarism" stems not merely from unauthorized articles in the training set, but from a mechanism similar to RAG (Retrieval-Augmented Generation): ChatGPT browsed the web for relevant information and downloaded an article in order to answer the user's question. He found that LLMs without RAG-like mechanisms typically produce transformations of their pre-training inputs and almost never "plagiarize" word for word.
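Ng's distinction is easier to see in code. Below is a minimal, hypothetical sketch of the two paths he contrasts; search_web, fetch_article, and generate are placeholder callables standing in for a search engine, a downloader, and an LLM, not any real API.

```python
# Hypothetical sketch of the RAG-style flow Ng describes: verbatim text can
# appear in the output because the article is retrieved and pasted into the
# prompt at inference time, not because the model memorized it in pre-training.
# search_web / fetch_article / generate are placeholders, not real APIs.

def rag_answer(question: str, search_web, fetch_article, generate) -> str:
    urls = search_web(question)        # 1. find relevant pages
    article = fetch_article(urls[0])   # 2. download one source verbatim
    prompt = (
        "Answer the question using the source below.\n"
        f"Source:\n{article}\n\nQuestion: {question}"
    )
    return generate(prompt)            # 3. the model can now quote the source word for word

def plain_answer(question: str, generate) -> str:
    # Without retrieval, the model can only draw on what it absorbed during
    # pre-training -- the path Ng observed almost never copies verbatim.
    return generate(question)
```

On the RAG path the source text is sitting in the prompt at generation time, so word-for-word quotation requires no memorization at all.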
New York University professor Gary Marcus takes a different view, arguing that "plagiarism" in the field of visual generation has nothing to do with RAG.
In an article recently published in IEEE Spectrum, he states plainly that "Generative AI Has a Visual Plagiarism Problem".
Now, let's take a look at what this article is about.
LLMs' "memory" of their training data has long been a problem. Recent empirical studies have shown that in some cases, LLMs are able to reproduce, or with slight modifications, large amounts of text in their training sets.
For example, researchers such as Milad Nasr proposed in a 2023 paper that LLMs can reveal private information such as emails and phone numbers when entering certain prompt words. Carlini from Google's Deepmind also recently concluded that larger chatbot models sometimes regurgitate large amounts of text word for word, while smaller models do not.
Recently, the New York Times accused OpenAI of unauthorized use of its content for AI development, and the complaint the Times filed provides ample evidence of such repeated copying.
Marcus calls this near-verbatim output "plagiarized output": had a human written it, it would unquestionably count as plagiarism. It is impossible to say how often "plagiarized output" occurs or under what circumstances, but these striking results are strong evidence that generative AI systems can plagiarize, exposing users to infringement claims from copyright owners even when they never asked the AI to copy anything.
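One crude but standard way to make "near-verbatim" measurable is the longest run of words shared between a model's output and the source text. A minimal sketch of such a metric, with invented example strings:

```python
def longest_common_word_run(output: str, source: str) -> int:
    """Length, in words, of the longest verbatim run shared by two texts."""
    a, b = output.split(), source.split()
    best = 0
    # Classic dynamic-programming longest-common-substring, over words.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

output = "the quick brown fox jumps over the lazy dog and runs away"
source = "witnesses saw the quick brown fox jumps over the lazy dog by the barn"
print(longest_common_word_run(output, source))  # -> 9 shared words in a row
```

A short shared run means nothing, but runs of dozens or hundreds of words, as in the Times's exhibits, are hard to explain as coincidence.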
Why AI plagiarizes remains murky, largely because LLMs are still a "black box" to humans: we do not fully understand the relationship between inputs (training data) and outputs, and outputs can change unpredictably over time. How prevalent "plagiarized output" is likely depends on factors such as the size of the model and of its training set.
Because of this black-box nature, "plagiarized output" can only be studied experimentally, and even those experiments may yield only tentative conclusions.
Still, "plagiarized output" raises important questions. Technically: can it be avoided by technical means? Legally: does such output constitute copyright infringement? Practically: when users generate content with LLMs, is there any way to assure users who do not wish to infringe that they are not infringing?
The outcome of the New York Times v. OpenAI lawsuit could have a decisive impact on the future of generative AI.
The problem of plagiarism also arises in computer vision: can a model trained on copyrighted images likewise produce "plagiarized output"?
Plagiarized visual output in Midjourney v6
Marcus's answer is yes, and there is no need to feed the model a prompt that even hints at plagiarism.
With short prompts alluding to certain commercial films, Midjourney v6 generates plenty of "plagiarized output". As the examples below show, the images Midjourney produces are nearly identical to frames from well-known movies such as The Avengers and Dune, and from video games.
They also found cartoon characters especially easy to copy. In the Simpsons example below, even though the prompt, "yellow-skinned cartoon popular in the 90s", never mentions The Simpsons, the result is indistinguishable from the original animation.
Based on these results, it is all but certain that Midjourney v6 was trained on copyrighted material. Whether that material was licensed by the copyright owners is unclear, but Midjourney can evidently be used to create works that infringe the rights of the original authors.
In the examples above, the authors confirmed that Midjourney can deliberately reproduce copyrighted material; what they leave open is whether someone could infringe copyright without intending to.
One detail in the New York Times lawsuit stands out. As the chart below shows, the Times elicited answers identical to its original text not with prompts like "Can you write an article about such-and-such in the style of the New York Times?" but simply by giving GPT-4 the first few words of an article. This suggests the model can produce "plagiarized output" without any deliberate prompt to plagiarize.
When given the first few words of a New York Times article, GPT-4 outputs an answer that appears to be plagiarized.
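The Times's prefix technique is straightforward to reproduce as an experiment: give a model the opening words of an article and measure how much of its continuation also appears verbatim in the rest. A hypothetical harness, reusing the overlap metric sketched earlier; generate again stands in for whichever model is being probed:

```python
def prefix_probe(article: str, generate, prefix_words: int = 8) -> float:
    """Prompt a model with an article's opening words and report what
    fraction of its continuation also appears verbatim in the article."""
    words = article.split()
    prefix = " ".join(words[:prefix_words])
    continuation = generate(prefix)           # the model continues the text
    rest = " ".join(words[prefix_words:])
    # longest_common_word_run is the metric from the earlier sketch.
    overlap = longest_common_word_run(continuation, rest)
    return overlap / max(len(continuation.split()), 1)
```

A score near 1.0 would indicate the kind of wholesale reproduction documented in the complaint.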
In the field of visual generation, the answer is also yes. In the examples shown below, the prompts mention neither Star Wars nor any of its characters, yet Midjourney generates instantly recognizable figures such as Darth Vader, Luke Skywalker, and R2-D2.
Toy Story, the Minions, Sonic, Mario: none of these well-known IPs escaped this unintentional "plagiarized output".
Even without being named directly, Midjourney generated images of these highly recognizable movie and game characters.
Evoking cinematic images without direct instruction
In a third experiment, Marcus et al. explored whether Midjourney could output entire frames resembling specific films without any film being named in the prompt. Again, the answer is yes.
They eventually found that the single prompt word "screencap", with no particular movie, character, or actor specified, produced clearly infringing content. The images below, generated from "screencap" alone, closely resemble frames from actual films.
Midjourney may well patch this particular prompt soon, but its capacity to generate potentially infringing output is plain. Marcus and his collaborators identified the following "plagiarism" victims, and will publish longer lists of movies, actors, and games on their YouTube channel.
Plagiarism issues in Midjourney
From these experiments it can be concluded that Midjourney trained its models on copyrighted material without authorization, and that some generative AI systems can produce "plagiarized output" that exposes users to copyright infringement claims even when their prompts ask for nothing of the kind. Recent news supports the same conclusion: Midjourney was recently sued jointly by more than 4,700 artists for training its AI on their work without consent.
How much of Midjourney's training data is copyrighted material used without permission? The company has disclosed neither its source material nor which of it was properly licensed.
In fact, the company has been dismissive of the plagiarism issue in public remarks. Asked about copyright in a Forbes interview, Midjourney's CEO said: "There's no way to get 100 million images and know where they're coming from."
Failing to license its source material could expose Midjourney to a flood of lawsuits from film studios, video game publishers, actors, and others.
The gist of copyright and trademark law is to protect content creators by restricting unauthorized commercial reuse. Since Midjourney charges subscription fees and can be seen as a competitor to visual-content studios, copyright owners have grounds to sue.
Midjourney has plainly tried to suppress Marcus's findings: after he published some of his experimental results, Midjourney demanded that the article be retracted.
However, not all use of copyrighted material is illegal. In the United States, for example, unauthorized use of copyrighted material is permitted if the use is transient or serves criticism, commentary, scholarly evaluation, or parody. Marcus believes Midjourney may lean on these arguments in the lawsuit.
To make matters worse, Marcus found evidence that a senior Midjourney software engineer took part in a February 2022 conversation about how to launder data by "fine-tuning the code" so as to evade copyright law.
Another participant, who may or may not have worked for Midjourney, then said: "In a way, in the eyes of copyright law, there is really no way to trace what a derivative work is."
As far as Marcus knows, Midjourney could be punished for this and will quite possibly owe compensation. According to sources, Midjourney may have compiled a long list of artists whose work was used for training without permission or payment.
In addition, Midjourney banned one of Marcus's collaborators and barred him from the service even after he created a new account.
Subsequently, Midjourney changed its Terms of Service to add: "You may not use the Service in an attempt to infringe the intellectual property rights of others, including copyrights, patents, or trademarks. Doing so may subject you to penalties including legal action or a permanent ban from the Service."
Changes like this discourage, and may even preclude, security research into the limits of generative AI, a practice that several large AI companies committed to supporting in their 2023 agreements with the White House.
Beyond that, Marcus does not think Midjourney is unique among image-generation AIs; it merely produces some of the most detailed output of any of them. This led the authors to a conjecture: does an AI's tendency to create plagiarized images grow as its capabilities grow?
Existing research on text generation suggests it may. Intuitively, the more data a system trains on, the more statistical correlations it captures, but also the more easily it can reconstruct items from its training set exactly. If this conjecture is right, models may become more prone to plagiarism as generative AI companies gather more data and build bigger models.
Plagiarism in DALL・E 3
Like Midjourney, DALL・E 3 is capable of creating near-exact replicas of original works, even without prompts that name them.
As the image below shows, DALL・E 3 produced a whole set of potentially infringing works from the simple prompt "animated toys".
Like Midjourney, OpenAI's DALL・E 3 appears to draw on a wealth of copyrighted sources. OpenAI seems well aware that its software may infringe copyright: last November it offered to indemnify users against copyright infringement lawsuits. Given the scale of the infringement Marcus has uncovered, OpenAI could end up paying dearly.
At the same time, there has been speculation that OpenAI has been changing its systems in real-time to rule out some of the behaviors revealed in Marcus' article.
How difficult is it to solve the "plagiarism problem" of large models?
Possible solution: Remove the copyrighted material
The cleanest solution is to retrain the image generation model without using copyrighted material, or to limit training to properly licensed datasets.
Removing copyrighted material only after a complaint is received, in the style of YouTube takedown requests, would be extremely costly to implement. There is no simple way to excise specific copyrighted material from an existing model: large neural networks are not databases from which an offending record can simply be deleted, and each "takedown" amounts to retraining the model.
Generative AI companies may therefore prefer to patch their existing systems to restrict certain kinds of queries and certain kinds of outputs. As the image below shows, there are already signs of this, but it is bound to be an uphill battle.
OpenAI may be trying to address these problems one by one as they surface. One X user shared a DALL・E 3 session in which a prompt first produced an image of C-3PO, after which GPT said it could not generate the desired image.
Marcus, meanwhile, offers two workarounds that do not require retraining the model. The first is to filter out queries likely to infringe copyright.
Simple cases like "don't generate Batman" can be filtered out (see the diagram below), but requests assembled across multiple queries are practically impossible to prevent; a minimal filter sketch follows:
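A blocklist filter of this kind is trivial to implement and almost as trivial to evade, which is why it cannot be the whole answer. A minimal sketch, with invented blocklist entries:

```python
# Illustrative blocklist; a production system would be far larger.
BLOCKED_TERMS = {"batman", "darth vader", "the simpsons"}

def prompt_allowed(prompt: str) -> bool:
    """Reject prompts that name a blocked character outright."""
    p = prompt.lower()
    return not any(term in p for term in BLOCKED_TERMS)

print(prompt_allowed("draw batman"))                            # False: caught
print(prompt_allowed("a brooding caped vigilante over Gotham")) # True: slips through,
# yet a capable model will still render a recognizable Batman from that description.
```

The second prompt never names the character, so no string match can stop it; the infringement only becomes visible in the output.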
Experience shows that guardrails in text generation systems tend to be too lax in some cases and too restrictive in others, and image generation faces similar difficulties. Ask Bing for "a toilet in a barren landscape under the sun", for instance, and it declines, returning a baffling "Unsafe image content detected" message.
In addition, some netizens have discovered how to get past OpenAI's content guardrails and make DALL・E 3 generate such images: they craft prompts that "include specific details that distinguish the character, such as different hairstyles, facial features, and body textures" and "use color to suggest the unique tones, patterns, and arrangements of the original image".
Reddit user Pitt.LOVEGOV shared how to get ChatGPT to generate images of Brad Pitt.
The second idea Marcus offers is to filter outputs against copyrighted image sources.
On Twitter, some netizens have tried identifying sources with ChatGPT and Google reverse image search, but the success rate is low, especially for newer material in the dataset or for authors who are not well known. The reliability of this approach remains to be seen.
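A more mechanical variant of this idea is to compare each generated image against a reference set of copyrighted frames using perceptual hashing, withholding near-duplicates. This is an illustrative stand-in for the reverse-image-search approach above, not something Marcus proposes; the sketch uses the real Pillow and imagehash libraries, while the file paths and distance threshold are placeholder assumptions.

```python
from PIL import Image
import imagehash  # pip install pillow imagehash

def looks_like_reference(generated_path: str, reference_paths: list[str],
                         max_distance: int = 8) -> bool:
    """Flag a generated image whose perceptual hash is close to a known frame."""
    h = imagehash.phash(Image.open(generated_path))
    refs = (imagehash.phash(Image.open(p)) for p in reference_paths)
    # Subtracting two ImageHash objects yields their Hamming distance.
    return any(h - ref <= max_distance for ref in refs)

# Placeholder paths for illustration only.
if looks_like_reference("outputs/screencap_42.png",
                        ["frames/dune_001.png", "frames/avengers_007.png"]):
    print("withhold output: near-duplicate of a known copyrighted frame")
```

It shares the weakness noted above: it can only catch material the reference set already contains.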
Importantly, while some AI companies and defenders of the status quo propose filtering out infringing outputs as a remedy, such filters should never be the whole of the solution. In keeping with international law's intent to protect intellectual property and human rights, no creator's work should be used for commercial purposes without consent.
For more details, please refer to the original blog.
Reference Links:
https://spectrum.ieee.org/midjourney-copyright
https://www.deeplearning.ai/the-batch/issue-230/