
The hottest large language models all love to "talk nonsense": whose "hallucination" problem is the worst?

Author: Wall Street Sights

Arthur AI, a New York-based AI startup and machine learning monitoring platform, released its latest research report on Thursday, August 17, comparing how prone the large language models (LLMs) of Microsoft-backed OpenAI, Meta, Google-backed Anthropic, and Nvidia-backed generative AI unicorn Cohere are to "hallucinating," that is, making things up.

Arthur AI regularly updates this research program, known as the Generative AI Test Evaluation, which ranks the strengths and weaknesses of industry leaders' models and other open-source LLMs.


The latest test selected GPT-3.5 (175 billion parameters) and GPT-4 (1.76 trillion parameters) from OpenAI, Claude-2 from Anthropic (parameter count undisclosed), Llama-2 from Meta (70 billion parameters), and Command from Cohere (50 billion parameters), and posed challenging questions to these top LLMs in both quantitative and qualitative tests.

In the "AI Model Illusion Test," the researchers examined the answers given by different LLM models using different categories of questions, such as combinatorics, U.S. presidents and Moroccan political leaders, "designed to include the key factors that lead LLMs to make mistakes, namely that they require multiple reasoning steps on information." ”

The study found that, overall, OpenAI's GPT-4 performed best among all the models tested and produced fewer "hallucinations" than its predecessor GPT-3.5, for example hallucinating 33 to 50 percent less on the math question categories.

Meanwhile, Meta's Llama-2 ranked in the middle of the five models tested, and Anthropic's Claude-2 ranked second, behind only GPT-4. Cohere's LLM hallucinated the most, "very confidently giving wrong answers."


Specifically, on complex math problems, GPT-4 ranked first, followed by Claude-2; on questions about U.S. presidents, Claude-2 ranked first in accuracy and GPT-4 second; on questions about Moroccan politics, GPT-4 returned to the top, with Claude-2 and Llama-2 almost entirely choosing not to answer such questions.

The researchers also tested the extent to which the AI models would "hedge" their answers with irrelevant warning phrases to avoid risk, using common phrases such as "As an AI model, I can't provide an opinion."
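As a rough illustration of how such hedging can be measured, the sketch below counts responses containing any of a few warning phrases. The phrase list and the substring matching are assumptions for demonstration; the report does not spell out its exact detection method here.

```python
# Illustrative sketch only: estimate how often a model "hedges" by checking
# each response against a small, assumed list of warning phrases.
HEDGE_PHRASES = [
    "as an ai model",
    "as a language model",
    "i can't provide an opinion",
    "i cannot provide an opinion",
]

def hedge_rate(responses: list[str]) -> float:
    """Fraction of responses containing at least one hedging phrase."""
    if not responses:
        return 0.0
    hedged = sum(
        any(phrase in response.lower() for phrase in HEDGE_PHRASES)
        for response in responses
    )
    return hedged / len(responses)

# Example: two of three answers hedge, so this prints roughly 0.67.
print(hedge_rate([
    "As an AI model, I can't provide an opinion on that.",
    "The answer is 42.",
    "As a language model, I cannot provide an opinion.",
]))
```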

GPT-4 hedged 50 percent more than GPT-3.5, which the report says "quantifies what users have described as a more frustrating experience with GPT-4." Cohere's AI model, by contrast, provided no hedging at all in any of the three question categories.

In contrast, Anthropic's Claude-2 was the most reliable in terms of "self-awareness," that is, accurately gauging what it knows and does not know and answering only questions supported by its training data.


Adam Wenchel, co-founder and CEO of Arthur AI, noted that this is the industry's first "comprehensive report on the incidence of hallucinations in AI models," rather than just a single data point ranking different LLMs:

"The most important takeaway for users and businesses from this kind of testing is that the exact workload can be tested, and it's critical to understand how LLM performs the task you want to accomplish. Many previous LLM-based measures were not the way they were used in real life. ”

On the same day the research report was published, Arthur also launched Arthur Bench, an open-source AI model evaluation tool for assessing and comparing the performance and accuracy of multiple LLMs. Companies can add custom criteria to suit their business needs, with the goal of helping enterprises make informed decisions when adopting artificial intelligence.

"AI hallucinations" refers to chatbots that completely fabricate information and act like they are gushing facts in response to the user's prompts.

In a promotional video Google shot for its generative AI chatbot Bard in February, Bard made an untrue claim about the James Webb Space Telescope. In June, ChatGPT cited "fake" cases in a New York federal court filing, and the lawyers involved in the filing may face sanctions.

OpenAI researchers reported in early June that they had found a possible solution to "AI hallucinations": training AI models to be rewarded for each correct step of reasoning toward an answer, rather than only for reaching a correct final conclusion. This "process supervision" strategy is meant to encourage AI models to reason in a more human-like, step-by-step "thinking" way.
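A minimal sketch of the idea, under the assumption that step-level correctness labels are already available (in practice they would come from a trained reward model or human raters): outcome supervision scores only the final answer, while process supervision credits each correct intermediate step.

```python
# Illustrative sketch only: contrast "outcome supervision" (reward the final
# answer only) with "process supervision" (reward each correct reasoning step).
# The boolean labels here are hypothetical inputs, not a real grading pipeline.
from typing import List

def outcome_reward(final_answer_correct: bool) -> float:
    """One reward signal for the whole solution."""
    return 1.0 if final_answer_correct else 0.0

def process_reward(step_correct: List[bool]) -> float:
    """One reward signal per reasoning step, averaged over the solution."""
    if not step_correct:
        return 0.0
    return sum(1.0 for ok in step_correct if ok) / len(step_correct)

# A solution whose first two steps are sound but whose last step slips:
# outcome supervision scores it 0.0, process supervision still credits 2/3.
print(outcome_reward(False))                 # 0.0
print(process_reward([True, True, False]))   # ~0.67
```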

In the report, OpenAI acknowledges:

"Even the most advanced AI models are prone to generating lies, and they show a tendency to fabricate facts in moments of uncertainty. These hallucinations are especially problematic in areas that require multi-step reasoning, where a single logical error is enough to undermine a larger solution. ”

Investment giant George Soros also published a column in June arguing that artificial intelligence could aggravate the multiple crises the world currently faces, citing among other reasons the serious consequences of AI hallucinations:

"AI destroys this simple model because it has nothing to do with reality. Artificial intelligence creates its own reality, and when artificial reality does not correspond to the real world, which often happens, the AI illusion is created.

This makes me almost instinctively opposed to AI, and I completely agree with the experts who say AI needs to be regulated. But AI regulation must be enforced globally, because the incentive to cheat is too great, and those who evade regulation will gain an unfair advantage. Unfortunately, global regulation is not achievable.

Artificial intelligence is developing so fast that it is impossible for ordinary human intelligence to fully understand it. No one can predict where it will take us ... That is why I am instinctively against AI, but I do not know how it can be stopped.

With a presidential election in the United States in 2024, and likely a general election in the United Kingdom as well, AI will undoubtedly play an important role, and it is unlikely to be anything but a dangerous one.

AI is very good at creating disinformation and deepfakes, and there will be many malicious actors. What can we do about this? I do not have an answer."

Earlier, Geoffrey Hinton, regarded as a "godfather of artificial intelligence," who left Google this year, repeatedly and publicly warned of the risks posed by AI, which he said could even destroy human civilization, and predicted that "artificial intelligence may surpass human intelligence in as little as 5 to 20 years."

