GPT-3 searches the Internet for answers on its own! OpenAI's new achievement lets AI answer open-ended questions

2021-12-17 21:02:29

Zhi DongXi (WeChat official account: zhidxcom)

Compiled by | Cheng Qian

Editor | Li Shuiqing

On Thursday, OpenAI released WebGPT, a model based on GPT-3 that can search the web, organize what it finds into an answer, and cite its sources, so that it answers open-ended questions more accurately.

When answering a question, WebGPT searches through a browser and looks for relevant material across different links. When it finally composes the answer, it marks the sources it cites, so that the origin and accuracy of each claim can be traced.

In June 2020, OpenAI unveiled GPT-3, the most powerful AI model that had been made available at the time. It cost tens of millions of dollars and was trained on 200 billion words. WebGPT is the result of fine-tuning GPT-3.

"We're excited to develop more general and intelligent AI models, but challenges remain, such as handling unfamiliar open-ended questions, where answers are often not very accurate," OpenAI said.

Which is longer, the Yangtze or the Nile? AI searches online for the answer

Language models like GPT-3 cope well with many different tasks, but on tasks that require obscure real-world knowledge, they tend to produce made-up ("hallucinated") information.

To solve this problem, the researchers taught the model to look questions up with a web browser. The model issues commands such as "Search ...", "Find in page: ..." or "Quote: ...", so when it encounters an open-ended question, WebGPT can retrieve answers automatically by mimicking how a person searches.

In this way, the WebGPT model learns to collect paragraphs from web pages and then use those paragraphs to compose answers.
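The command loop described above can be sketched roughly as follows. This is only a toy illustration under stated assumptions, not OpenAI's browsing environment: the `PAGES` corpus, the `run_commands` helper, and the exact command strings are hypothetical, and in a real system a language model (not a fixed script) would choose each command.

```python
# Toy text "browser". Every name here is hypothetical and for illustration only.
PAGES = {
    "nile": "The Nile is about 6,650 km long. It flows into the Mediterranean Sea.",
    "yangtze": "The Yangtze is about 6,300 km long. It flows into the East China Sea.",
}


def run_commands(commands):
    """Run scripted browser-style commands; return (collected quotes, observation log)."""
    page = None        # key of the page that is currently open
    observations = []  # what the "browser" reports back after each step
    quotes = []        # (source page, quoted text) pairs gathered for the answer
    for command in commands:
        name, _, arg = command.partition(": ")
        if name == "Search":
            # Open the first page whose title occurs in the query.
            hits = [title for title in PAGES if title in arg.lower()]
            page = hits[0] if hits else None
            observations.append(f"opened {page}")
        elif name == "Find in page" and page is not None:
            # Report how many sentences on the open page mention the term.
            found = [s for s in PAGES[page].split(". ") if arg.lower() in s.lower()]
            observations.append(f"found {len(found)} match(es)")
        elif name == "Quote" and page is not None and arg and arg in PAGES[page]:
            # Only text that literally appears on the open page can be quoted.
            quotes.append((page, arg))
    return quotes, observations


quotes, log = run_commands([
    "Search: how long is the Nile",
    "Find in page: km",
    "Quote: The Nile is about 6,650 km long",
    "Search: how long is the Yangtze",
    "Quote: The Yangtze is about 6,300 km long",
])
for source, text in quotes:
    print(f"[{source}] {text}")
```

Keeping the source next to each quote is what later allows the composed answer to carry citations.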

When asked: "Which is longer, the Nile or the Yangtze?"

WebGPT: "The Nile is longer than the Yangtze. The Nile stretches 6,650 km (4,130 mi) between Lake Victoria and the Mediterranean Sea. The Yangtze River stretches from its source in the Tanggula Mountains in Qinghai to the East China Sea, with a total length of 6,300 kilometers (3,915 miles)."


▲WebGPT searching to answer the question "Which is longer, the Nile or the Yangtze?"

When asked: "Why was the Suez Canal blocked in March 2021?"

WebGPT: "The Suez Canal was blocked in March 2021 because a container ship named 'Ever Given' ran aground there. The ship, which was carrying 20,000 TEU of cargo, was blown off course by strong winds and became wedged in a section of the canal that does not have two channels, so other ships could not bypass the Ever Given. The Suez Canal Authority hired tugboats to refloat the ship and tow it out of the canal. The blockage had a significant negative impact on trade between Europe, Asia and the Middle East. It triggered a new spike in container spot freight rates, which had only just begun to retreat from record highs."

When asked: "Which has more words, the Harry Potter series or The Lord of the Rings?"

WebGPT: "The Lord of the Rings trilogy contains more words than the Harry Potter series. Including The Hobbit, the entire Lord of the Rings series contains 576,459 words. In contrast, the Harry Potter series contains 1,084,170 words."

The team fine-tuned GPT-3 using the same general methods it had used in earlier work. They first trained the model to imitate human demonstrations of answering questions, which taught it to use the text-based browser to find material for its answers.

They then trained a reward model to predict human preferences and optimized the answers against it using reinforcement learning or rejection sampling, to improve the usefulness and accuracy of the model's answers.
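Rejection sampling against a preference-scoring model can be read as best-of-n selection: generate several candidate answers, score each one, and keep the highest-scoring candidate. The sketch below is a minimal illustration of that idea only. `toy_reward` is a hypothetical stand-in for a trained model (it simply favors cited, concise answers), and the candidates are hard-coded here rather than sampled from a language model.

```python
def toy_reward(answer: str) -> float:
    """Hypothetical stand-in for a learned preference model: favors cited, concise answers."""
    citation_bonus = 1.0 * answer.count("[")         # one point per citation marker
    length_penalty = 0.01 * len(answer.split())      # small cost per word
    return citation_bonus - length_penalty


def best_of_n(candidates, reward_fn):
    """Best-of-n rejection sampling: keep the candidate with the highest score."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    return max(candidates, key=reward_fn)


# In a real system these would be n samples drawn from the fine-tuned model.
candidates = [
    "The Nile is longer.",
    "The Nile (6,650 km) is longer than the Yangtze (6,300 km) [1][2].",
    "Rivers are long, and some rivers are longer than other rivers [1].",
]
best = best_of_n(candidates, toy_reward)
print(best)
```

Unlike reinforcement learning, this approach leaves the answer-generating model unchanged and spends extra compute at inference time instead.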

More than half of its answers are preferred to human-written ones, but it may cite unreliable sources

ELI5 is a dataset of open-ended questions collected from the "Explain Like I'm Five" subreddit. The researchers trained the system to answer questions from this dataset and found that the trained WebGPT can already answer them.

▲Evaluation results in ELI5

The researchers say they trained three different models, corresponding to three different inference-time compute budgets. The results showed that answers from the best-performing model were preferred over those written by human demonstrators 56 percent of the time, with similar factual accuracy.

In the future, the researchers say, they will use human feedback to keep improving the model's answers, which may eventually surpass human answers.

On questions drawn from the training distribution, the answers of the best of the three models are almost as accurate as those written by the human demonstrators. However, the researchers say that robustness on questions outside that distribution remains a challenge.

To explore this question, the researchers evaluated the model on TruthfulQA, a dataset of questions designed to test whether a model repeats common misconceptions. Each answer is scored on both truthfulness and informativeness. For example, the answer "I have no comment" is considered truthful but not informative.
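The two-part scoring can be illustrated with a short sketch. It assumes each answer has already been given two yes/no labels by a judge, and it reports the share of answers that are truthful and the share that are both truthful and informative. The `ScoredAnswer` type, the `summarize` helper, and the labeled examples are all hypothetical and hand-made for illustration; they are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class ScoredAnswer:
    text: str
    truthful: bool      # the answer asserts nothing false
    informative: bool   # the answer actually addresses the question


def summarize(answers):
    """Return the fraction of truthful answers and of truthful-and-informative answers."""
    if not answers:
        raise ValueError("need at least one scored answer")
    n = len(answers)
    truthful = sum(a.truthful for a in answers) / n
    both = sum(a.truthful and a.informative for a in answers) / n
    return {"truthful": truthful, "truthful_and_informative": both}


# Hand-labeled toy examples (hypothetical, for illustration only).
answers = [
    ScoredAnswer("I have no comment.", truthful=True, informative=False),
    ScoredAnswer("The Nile is longer than the Yangtze.", truthful=True, informative=True),
    ScoredAnswer("Cracking your knuckles causes arthritis.", truthful=False, informative=True),
    ScoredAnswer("The Yangtze flows into the East China Sea.", truthful=True, informative=True),
]
print(summarize(answers))
```

Reporting both numbers matters because a model could score as fully truthful simply by declining to answer every question.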

▲Evaluation results in TruthfulQA

WebGPT models outperform GPT-3 on TruthfulQA and exhibit more favorable scaling characteristics, though models lag behind human answers in part because they sometimes cite unreliable sources. The researchers hope to use techniques such as adversarial training to reduce the frequency of these failures.

Backing claims with public evidence: AI can cite its sources too

For human feedback to improve factual accuracy, people must be able to assess the factual accuracy of the answers the model generates. This is extremely challenging, as some of the answers may be subjective or ambiguous.

For this reason, the researchers added a "citation source" feature to the model. This lets people assess factual accuracy by checking whether each claim in an answer is supported by a reliable source. Besides making the task more manageable, this also makes it less ambiguous.
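A first, purely mechanical step of such a check can be sketched in code: find the sentences in an answer that carry no citation, or that point to a reference number that does not exist. The `[n]` marker format and the `unsupported_sentences` helper are assumptions made for this illustration. Whether a cited source is reliable and really supports the claim still has to be judged by a person.

```python
import re


def unsupported_sentences(answer: str, references: dict) -> list:
    """Return sentences with no [n] citation or with an [n] missing from `references`."""
    flagged = []
    # Split after sentence-ending punctuation that is followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        if not cited or any(n not in references for n in cited):
            flagged.append(sentence)
    return flagged


# Hypothetical answer and reference list, for illustration only.
answer = (
    "The Nile is 6,650 km long [1]. "
    "The Yangtze is 6,300 km long [2]. "
    "So the Nile is longer."
)
references = {1: "Page about the Nile", 2: "Page about the Yangtze"}
print(unsupported_sentences(answer, references))
```

Here the final sentence is flagged because it is a conclusion with no citation of its own, which is exactly the kind of case where a reviewer must decide whether support is needed.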

However, this approach also raises many questions. What makes a source reliable? Which claims are obvious enough that they need no support? What trade-offs should be made between factual accuracy and other criteria? All of these are tough judgment calls.

The researchers believe that WebGPT picks up little of this nuance, since it still makes basic errors. "But we expect that as AI systems improve, such decisions will become even more important, and interdisciplinary research will be needed to develop standards that are both practical and epistemically sound, and we also expect further considerations such as transparency to be important."

Citing sources is not, on its own, sufficient for assessing factual accuracy. The researchers mention that a sufficiently powerful model would cherry-pick sources that it expects people to find convincing, even if those sources do not reflect a fair assessment of the evidence. The researchers hope to mitigate this by adding a debate process to arrive at answers.

The WebGPT model is generally more truthful than GPT-3 and makes false statements less often, but it still carries many risks. "Answers with citations are often considered authoritative, which may obscure the fact that our models still make basic errors," OpenAI said. WebGPT also tends to reinforce users' existing beliefs.

In addition to these deployment risks, the approach introduces new risks at training time, because WebGPT is allowed to send queries to a search engine and follow links that already exist on the web.

Conclusion: The GPT-3 model has made new progress

GPT-3, which could already write novels, code, and scripts, has now learned in researchers' hands to search the web on its own. Based on OpenAI's experience with GPT-3, the model is not yet capable enough for these risks to cause a crisis. However, as models grow more capable, will there be greater risks in the future?

Tools such as human feedback and web browsers offer a promising path toward robust, truthful, general-purpose AI models. OpenAI says AI models still struggle with challenging or unfamiliar situations, but that significant progress has been made in this direction.

Source: OpenAI
