
The hottest large language models all love to "talk nonsense": whose "hallucination" problem is the worst?

Author: Wall Street Sights

Arthur AI, a New York-based AI startup and machine learning monitoring platform, released its latest research report on Thursday, August 17, comparing how prone the large language models (LLMs) of Microsoft-backed OpenAI, Meta, Google-backed Anthropic, and Nvidia-backed generative AI unicorn Cohere are to "hallucinating," that is, making things up.

Arthur AI regularly updates this research program, known as the Generative AI Test Evaluation, which ranks the strengths and weaknesses of industry leaders' models and other open-source LLMs.


The latest test selected GPT-3.5 (175 billion parameters) and GPT-4 (1.76 trillion parameters) from OpenAI, Claude-2 from Anthropic (parameter count undisclosed), Llama-2 from Meta (70 billion parameters), and Command from Cohere (50 billion parameters), and posed challenging questions to these top LLMs in both quantitative and qualitative tests.

In the "AI Model Illusion Test," the researchers examined the answers given by different LLM models using different categories of questions, such as combinatorics, U.S. presidents and Moroccan political leaders, "designed to include the key factors that lead LLMs to make mistakes, namely that they require multiple reasoning steps on information." ”

The study found that, overall, OpenAI's GPT-4 performed best among all the models tested and produced fewer "hallucinations" than its predecessor GPT-3.5, for example hallucinating 33 to 50 percent less on the math question categories.

Meanwhile, Meta's Llama-2 ranked in the middle of the five models tested, and Anthropic's Claude-2 ranked second, behind only GPT-4. Cohere's LLM hallucinated the most, "very confidently giving wrong answers."


Specifically, on complex math problems, GPT-4 ranked first, followed by Claude-2; on questions about U.S. presidents, Claude-2 ranked first in accuracy and GPT-4 second; on questions about Moroccan politics, GPT-4 returned to the top, with Claude-2 and Llama-2 almost entirely choosing not to answer such questions.

The researchers also tested the extent to which the AI models would "hedge" their answers with irrelevant warning phrases to avoid risk, using common phrases such as "As an AI model, I can't provide an opinion."
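As a rough illustration of how such hedging can be measured, the sketch below counts responses containing any of a few warning phrases. The phrase list and the substring matching are assumptions for demonstration; the report does not spell out its exact detection method here.

```python
# Illustrative sketch only: estimate how often a model "hedges" by checking
# each response against a small, assumed list of warning phrases.
HEDGE_PHRASES = [
    "as an ai model",
    "as a language model",
    "i can't provide an opinion",
    "i cannot provide an opinion",
]

def hedge_rate(responses: list[str]) -> float:
    """Fraction of responses containing at least one hedging phrase."""
    if not responses:
        return 0.0
    hedged = sum(
        any(phrase in response.lower() for phrase in HEDGE_PHRASES)
        for response in responses
    )
    return hedged / len(responses)

# Example: two of three answers hedge, so this prints roughly 0.67.
print(hedge_rate([
    "As an AI model, I can't provide an opinion on that.",
    "The answer is 42.",
    "As a language model, I cannot provide an opinion.",
]))
```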

GPT-4 hedged 50 percent more than GPT-3.5, which the report says "quantifies what users have described as a more frustrating experience with GPT-4." Cohere's AI model, by contrast, provided no hedging at all in any of the three question categories.

In contrast, Anthropic's Claude-2 was the most reliable in terms of "self-awareness," that is, accurately gauging what it knows and does not know and answering only questions supported by its training data.


Adam Wenchel, co-founder and CEO of Arthur AI, noted that this is the industry's first "comprehensive report on the incidence of hallucinations in AI models," rather than just a single data point ranking different LLMs:

"The most important takeaway for users and businesses from this kind of testing is that the exact workload can be tested, and it's critical to understand how LLM performs the task you want to accomplish. Many previous LLM-based measures were not the way they were used in real life. ”

On the same day the research report was published, Arthur also launched Arthur Bench, an open-source AI model evaluation tool for assessing and comparing the performance and accuracy of multiple LLMs. Companies can add custom criteria to suit their business needs, with the goal of helping enterprises make informed decisions when adopting artificial intelligence.

"AI hallucinations" refers to chatbots that completely fabricate information and act like they are gushing facts in response to the user's prompts.

In a promotional video Google shot for its generative AI chatbot Bard in February, Bard made an untrue claim about the James Webb Space Telescope. In June, ChatGPT cited "fake" cases in a New York federal court filing, and the lawyers involved in the filing may face sanctions.

OpenAI researchers reported in early June that they had found a possible solution to "AI hallucinations": training AI models to be rewarded for each correct step of reasoning toward an answer, rather than only for reaching a correct final conclusion. This "process supervision" strategy is meant to encourage AI models to reason in a more human-like, step-by-step "thinking" way.
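A minimal sketch of the idea, under the assumption that step-level correctness labels are already available (in practice they would come from a trained reward model or human raters): outcome supervision scores only the final answer, while process supervision credits each correct intermediate step.

```python
# Illustrative sketch only: contrast "outcome supervision" (reward the final
# answer only) with "process supervision" (reward each correct reasoning step).
# The boolean labels here are hypothetical inputs, not a real grading pipeline.
from typing import List

def outcome_reward(final_answer_correct: bool) -> float:
    """One reward signal for the whole solution."""
    return 1.0 if final_answer_correct else 0.0

def process_reward(step_correct: List[bool]) -> float:
    """One reward signal per reasoning step, averaged over the solution."""
    if not step_correct:
        return 0.0
    return sum(1.0 for ok in step_correct if ok) / len(step_correct)

# A solution whose first two steps are sound but whose last step slips:
# outcome supervision scores it 0.0, process supervision still credits 2/3.
print(outcome_reward(False))                 # 0.0
print(process_reward([True, True, False]))   # ~0.67
```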

In the report, OpenAI acknowledges:

"Even the most advanced AI models are prone to generating lies, and they show a tendency to fabricate facts in moments of uncertainty. These hallucinations are especially problematic in areas that require multi-step reasoning, where a single logical error is enough to undermine a larger solution. ”

Investment giant George Soros also published a column in June arguing that artificial intelligence could aggravate the multiple crises the world currently faces, citing among other reasons the serious consequences of AI hallucinations:

"AI destroys this simple model because it has nothing to do with reality. Artificial intelligence creates its own reality, and when artificial reality does not correspond to the real world, which often happens, the AI illusion is created.

This makes me almost instinctively opposed to AI, and I completely agree with the experts who say AI needs to be regulated. But AI regulation must be enforced globally, because the incentive to cheat is too great, and those who evade regulation will gain an unfair advantage. Unfortunately, global regulation is not achievable.

Artificial intelligence is developing so fast that it is impossible for ordinary human intelligence to fully understand it. No one can predict where it will take us ... That is why I am instinctively against AI, but I do not know how it can be stopped.

With a presidential election in the United States in 2024, and likely a general election in the United Kingdom as well, AI will undoubtedly play an important role, and it is unlikely to be anything but a dangerous one.

AI is very good at creating disinformation and deepfakes, and there will be many malicious actors. What can we do about this? I do not have an answer."

Earlier, Geoffrey Hinton, regarded as a "godfather of artificial intelligence," who left Google this year, repeatedly and publicly warned of the risks posed by AI, which he said could even destroy human civilization, and predicted that "artificial intelligence may surpass human intelligence in as little as 5 to 20 years."

