AI paper signed by more than 100 big names exposed for plagiarism! Zhiyuan has now apologized

Reporting by XinZhiyuan

Editor: Good Sleepy Yuan Xie La Yan

Recently, "Big Model Roadmap", a survey report by more than 100 authors from Zhiyuan Research Institute and other organizations, was exposed for plagiarism, shocking the entire AI world!

In early April 2022, an academic misconduct incident in the AI industry "detonated" the entire academic community.

The more than 100 authors involved are all big names in the industry.

The Zhihu discussion has soared from tens of thousands of views on the first day to more than 6 million now.

On this, a line from a Zhihu user, a student at Queen Mary University of London who posts under the name "Xie Yuan is not his real name", sums things up:

"Building an academic reputation is a lifelong affair, but it only takes a moment to tear it down."

Zhiyuan officially announced an apology

On the evening of April 13, 2022, Zhiyuan Research Institute, the organizer of the survey, published a public apology letter on its official Zhihu account. It said it had "learned about the matter from the Internet", acknowledged that the paper contained plagiarism, and apologized to the academic community and the public.

In the letter, Zhiyuan Research Institute described the paper as a survey of the research field, assembled from multiple articles that more than 100 authors wrote in separate groups "and individually signed". Zhiyuan admitted that it had failed to "respond to... All content is strictly reviewed".

As for what it admitted, the letter acknowledged that some of the plagiarism allegations Nicholas Carlini made on his personal blog were true, said that the updated version of the paper had been withdrawn from the preprint site, and said that further steps, including accountability, would await the formal investigation report.

Reportedly, this is a report rather than a paper: in effect a collection of 16 articles, each independently written and signed by its own authors.

The apology letter concluded by saying that Zhiyuan Research Institute would "hold the relevant responsible persons accountable based on the formal findings of the investigation", although it named no one specifically.

On the morning of April 13, the official Twitter account of Zhiyuan Research Institute also released a brief apology statement, similar to the letter of apology.

Jonathan Frankle, an incoming Harvard faculty member and chief research scientist at the startup MosaicML, replied: "I'm waiting to see the follow-up."

A review of the "bloody case"

The whole incident starts with "A Roadmap for Big Model", uploaded to arXiv on March 26.

Admittedly, an author list this large is seen only occasionally, even in top journals such as Nature and Science.

A list in which nearly half are co-authors and a quarter are co-corresponding authors is rare.

The authors then updated the paper on March 30 and April 2, and those updates also changed the author list.

The paper talks not only about the big model technique itself, but also about the prerequisites for training big models.

The study is divided into four parts: resources, models, key technologies, and applications.

It also covers 16 big-model topics: data, knowledge, computing systems, parallel training systems, language models, vision models, multi-modal models, theory & interpretability, commonsense reasoning, reliability & security, governance, evaluation, machine translation, text generation, dialogue, and protein research.

At the end of the paper, the researchers summarized the future development of the big model from a more macro perspective.

And that's just the beginning.

The plagiarized Google researcher broke the news himself

On April 8, 2022, Nicholas Carlini, a researcher from Google Brain, posted an article on his personal blog titled "A Case of Plagiarism in Machine Learning Research."

It lays out, clearly and with restraint, the plagiarism in "A Roadmap for Big Model":

The "Big Model Roadmap" article does plagiarize his research group's paper "Deduplicating Training Data Makes Language Models Better", posted on the preprint site in July 2021. In addition, the "Big Model" article is suspected of plagiarizing more than a dozen other papers.

Nicholas Carlini dryly notes that the "Big Model" article copy-pasted from a paper about the effects of data duplication, an irony too great to ignore.

However, Nicholas Carlini was also fair to the people involved: "In the grand scheme of things, this copy-paste is not the worst thing. This is not a paper that directly copied the methods and conclusions of previous studies and then claimed a groundbreaking new result.

But even so, the value of a survey lies in how it reframes a field of study. A long, all-encompassing survey that directly copies and pastes the content of other papers is no more useful than a short list of citations."

On April 13, after the incident had drawn much wider attention, Nicholas Carlini added an update to the post:

This article received far more attention than I expected. It now gets more new page views per hour than my entire blog received in all of last week.

Therefore, I implore you not to let this turn into a witch hunt. I have already seen people say that everyone involved in the paper should be expelled immediately, that they should be banned from the preprint site entirely, and so on.

I don't pretend to understand why the offending paper contains so much plagiarism, so I won't pass further judgment.

It may be that some junior authors meant no harm and thought they could copy and paste as long as they cited the source. It may also be that students were under pressure from their supervisors and felt they had to take shortcuts to deliver the manuscript on time. Senior authors may have read the text, thought it was fine after some tinkering, and released it without knowing where the text came from.

The point is that the reasons behind this have not been made public. This paper has more than a hundred authors, and anything could have happened.

My aim in publishing this post is to draw more attention to a common failing of the academic community. Nearly 1% of published and accepted papers have a higher copy-paste ratio than the "Big Model" article.

I should have made this background clear at the beginning of this post. So again, please don't be too harsh on the paper at the center of this incident. Plagiarism is a common failing in academia, and I am more alert to it here only because it was my paper that was plagiarized. I hope everyone can treat this as a serious learning experience and improve the overall quality of the academic community.

How the plagiarism was determined

Nicholas Carlini said in his blog post that after discovering plagiarism in the "Big Model" article, he and his research colleagues downloaded PDFs of papers from top conferences in almost every area of machine learning, extracted all the text, and wrote it into a single txt file to build a comparison dataset.

Nicholas Carlini and his colleagues then used the dataset deduplication tool from their own plagiarized paper to compare the "Big Model" article against this dataset, and found the plagiarized passages.

The blog post lists ten of the most prominent instances of plagiarism, five of which have been acknowledged in the apology letter from Zhiyuan Research Institute.

For the instances acknowledged by Zhiyuan Research Institute, Nicholas Carlini's blog post shows side-by-side comparisons, with the plagiarized text highlighted in green on the left and the original text on the right.

To avoid false positives, Nicholas Carlini lists his criteria for identifying plagiarism:

1. The matching text is at least ten words long after whitespace normalization;

2. It appears sequentially in the "Big Model" article;

3. It appears in an earlier paper;

4. But it does not appear in more than one earlier paper.

The last criterion keeps the tool from flagging as plagiarism things such as copyright notices, citations of earlier papers, and the author lists of earlier papers.
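The four criteria above can be illustrated with a small Python sketch. This is not Carlini's actual tool, which the article describes only as the dataset deduplication tool from his group's paper; the names `find_copied_spans` and `prior_papers`, and the simple in-memory index of word windows, are assumptions made here for illustration.

```python
from collections import defaultdict

MIN_WORDS = 10  # criterion 1: a match must be at least ten words long


def ngrams(tokens, n=MIN_WORDS):
    """Yield (start_index, n-word tuple) for every window in tokens."""
    for i in range(len(tokens) - n + 1):
        yield i, tuple(tokens[i:i + n])


def find_copied_spans(target_text, prior_papers, n=MIN_WORDS):
    """Return (source_id, passage) pairs copied from exactly one prior paper.

    prior_papers is a hypothetical dict mapping a paper id to its extracted text.
    """
    # Criterion 3: record which prior papers contain each n-word window.
    # str.split() with no argument collapses any run of whitespace, which
    # serves as the whitespace normalization here.
    sources = defaultdict(set)
    for paper_id, text in prior_papers.items():
        for _, gram in ngrams(text.split(), n):
            sources[gram].add(paper_id)

    tokens = target_text.split()
    spans = []  # each entry: [source_id, start, end) over the target tokens
    for i, gram in ngrams(tokens, n):
        found_in = sources.get(gram, set())
        # Criterion 4: skip windows shared by several prior papers
        # (for example copyright notices) and windows found nowhere.
        if len(found_in) != 1:
            continue
        source = next(iter(found_in))
        # Criterion 2: merge windows that run on consecutively in the target.
        if spans and spans[-1][0] == source and i <= spans[-1][2]:
            spans[-1][2] = i + n
        else:
            spans.append([source, i, i + n])
    return [(src, " ".join(tokens[a:b])) for src, a, b in spans]
```

Called with the text of the article under review and a dictionary of earlier papers, the function returns each flagged passage together with the single earlier paper it matches. A real comparison over thousands of PDFs would need a more memory-efficient index than this dictionary of word windows.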

Nicholas Carlini said that the tool also turned up many passages in the "Big Model" article that its authors had copied from their own earlier work. Compared with the blatant plagiarism of other people's papers, however, "I copied myself" is not a big deal.

Nicholas Carlini also said that because the screening tool was built in haste and the comparison dataset is incomplete (it includes only formally published papers and excludes preprints on the preprint site), more plagiarism has very likely gone undetected. In any case, what has been found so far is already dismaying.

arXiv subsequently flagged the "Big Model" article as having text "overlap" with the work of other authors.

In addition, Chinese netizens have traced the sources of the article section by section, marking non-plagiarized parts in purple and suspected plagiarism in yellow. Some authors appear in the overall author list but not in any specific chapter.

Besides screening for plagiarism of his own papers, Nicholas Carlini has also contacted other authors who may have been plagiarized.

One of the netizens who received the email said that many people now pay enough attention to and understand plagiarism.

Copy-paste is plagiarism, copy-paste-edit is plagiarism, screenshots are plagiarism, and copying LaTeX formulas from other people's arXiv papers is also plagiarism.

The impact of this incident was so great that it dealt a heavy blow to the reputation of Chinese scholars as a whole.

Researchers in the AI industry voiced their doubts on social media: even allowing for division of labor, or for honorary authorship, did none of the more than 100 authors read what they were putting their names to?

Academic misconduct urgently needs to be addressed!

Scientific ethics and academic norms is probably a required course for every graduate student in China.

Universities such as Peking University also hold an annual test on the basics of scientific ethics and academic norms. They quantify behavior that violates those norms and have set out a series of penalties, ranging from disciplinary action to expulsion.

Our system looks complete enough, yet plagiarism and copying still occur from time to time.

Translate for me, what is "plagiarism"?

So at what point does something become plagiarism? What is the difference between plagiarism and citation?

These questions cannot be settled by mere say-so. There must be clear, quantifiable, and enforceable criteria.

PaperPass, a well-known Chinese plagiarism-checking platform, gives its criteria for identifying plagiarism on its official website.

The criteria include quantitative rules: quoting 200 consecutive words without attribution, directly translating or reproducing, rearranging more than 15% of the content, and so on. They also cover copying experimental results, analysis, system design, or problem solutions from other people's papers or works without attribution or without indicating the source.

For the sake of rigor, here is the definition of copying: taking something over as it is (ready-made methods, experience, teaching materials, etc.).

Once plagiarism has been established, there are also rules for grading its severity. Three bands of duplicate content, under 30%, between 30% and 50%, and over 50%, are judged to be mild, moderate, and serious plagiarism, respectively.
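The three severity bands can be written as a small Python sketch. The function name `plagiarism_severity` is hypothetical, and the handling of the exact boundaries is an assumption: the text does not say which band a ratio of exactly 30% or 50% falls into, so both are counted as moderate here.

```python
def plagiarism_severity(duplicate_ratio):
    """Map a duplicate-content ratio (0.0 to 1.0) to a severity band.

    Bands follow the thresholds quoted above. Treating exactly 30% and
    exactly 50% as "moderate" is an assumption, not part of the source.
    """
    if not 0.0 <= duplicate_ratio <= 1.0:
        raise ValueError("duplicate_ratio must be between 0.0 and 1.0")
    if duplicate_ratio < 0.30:
        return "mild"
    if duplicate_ratio <= 0.50:
        return "moderate"
    return "serious"
```

For example, a thesis in which 42% of the content duplicates other work would fall into the moderate band under these thresholds.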

In addition, the IEEE has its own rules on plagiarism, with clear grading criteria in five levels.

Level 1 is the most serious: uncredited verbatim copying of a full paper; uncredited verbatim copying of a major portion (more than 50%) of a paper; or verbatim copying across multiple papers by the same author that totals more than 50%.

Level 2: uncredited verbatim copying of 20% to 50% of a paper.

Level 3: uncredited verbatim copying of paragraphs and sentences, amounting to less than 20% of a paper and used in the main part of the plagiarizing paper.

Level 4: uncredited improper paraphrasing of passages.

Level 5 is the mildest: the source is credited, but major portions of a paper are copied verbatim without clearly marking where the quoted text begins and ends.

Some Chinese netizens said that the Zhiyuan incident falls under Level 5 and that it was handled in a timely and appropriate manner.

Under a framework this clear, misconduct has nowhere to hide.

IEEE Rating Standard: https://www.ieee.org/content/dam/ieee-org/ieee/web/org/pubs/Level_description.pdf

A "sea" of honorary authors

In addition, many netizens questioned the authorship of the paper in this "big plagiarism" incident.

Commenting on an earlier academic misconduct incident, "Summer Clover" pointed out that papers today have not only moved from padding to plagiarism and text-laundering, but also routinely carry honorary authors.

In fact, the problem of honorary authorship has a long history. Broadly speaking, it comes in a self-serving form and an altruistic form.

In the first, the names of well-known people are added to raise the odds of publication in a higher-ranked journal, even though these people may have contributed nothing to the writing.

In the second, the names of unrelated people are added purely as a personal favor.

You add my name, I add yours: mutual benefit, mutual gilding.

And the article may contain more padding than the Pacific Ocean has water.

Academic norms are on the horizon

I have to say that in the field of academic code of conduct, China still has a long way to go.

In writing this article, the editor looked through many recent news stories about plagiarized graduation theses. To sum it up in a phrase: "there is no end to it."

For example, a 2016 software engineering master's graduate of Hunan University, surnamed Chen, was reported for plagiarism in his master's thesis, "Research on Key Technologies for News Abbreviations for New Media".

Dr. Zhang Huaping of Beijing Institute of Technology found that Chen had copied, word for word, the thesis "Research on key technologies of news abbreviations for new media" by Zhao Lianwei, a graduate student Zhang had supervised.

Hunan University immediately began verifying the report, and on November 3, 2021, it announced on Weibo that it was revoking Chen's master's degree.

At the same time, his supervisor, surnamed Tang, was stripped of the qualification to supervise graduate students.

After Hunan University issued its statement, Dr. Zhang also posted on Weibo to bring the matter to a close.

When a student's thesis turns out to be plagiarized, at the very least the supervisor failed to check and guide it carefully, approved it anyway, and did not fulfill his responsibility. At worst, the supervisor may have connived at the plagiarism. Whatever the reason the supervisor failed to detect the plagiarism in time, it was a dereliction of duty.

In fact, there are only two ways to clean up the academic atmosphere. The first is to strengthen education in scientific ethics and academic norms, and the second is to severely punish individuals who engage in academic misconduct.

Since 2020, Peking University has run an online learning platform on scientific ethics and academic norms for graduate students. After enrolling, graduate students must first study the relevant outline and guidelines on their own. They must then complete the associated test, and they pass only if they reach the passing score.

This is similar to the theory test for a driver's license. Just as a person is not allowed to drive on the road until he is familiar with traffic rules, graduate students who do not fully understand the requirements of scientific ethics and academic norms have no business starting research.

Of course, these tests are more for early warning purposes. What really matters is not passing the test, but keeping these requirements in mind and practicing them every moment when conducting research and writing papers.

In the event of academic misconduct, the relevant penalties cannot be absent.

For example, Tsinghua University clearly stipulates a series of punitive measures in the implementation rules of the regulations on the management of student disciplinary punishment.

It can be seen that any academic misconduct will be severely punished. Just like match-fixing in competitive sports, academic fraud, plagiarism, and theft cannot be whitewashed, and they must never be touched.

To borrow a line from Abu, a well-known e-sports coach, commenting on players who fix matches: "Touch match-fixing and you are finished, no discussion."

I think the same is true of academic misconduct, which bears on a person's character and on the integrity of the whole environment.

As can be seen from the Zhai Tianlin incident, academic misconduct is a big taboo.

There are no excuses, no appeasement, and heavy penalties.

Resources:

https://www.zhihu.com/question/527620020/answer/2436752217

https://zhuanlan.zhihu.com/p/498064778

https://nicholas.carlini.com/writing/2022/a-case-of-plagarism-in-machine-learning.html
