
Interview with AI Large Model Experts丨HKUST's He Junxian: Evaluation benchmarks are the compass for large model development

Author: Red Star News

On December 28, the 2023 Artificial Intelligence Large Model Benchmark Science and Technology Innovation Development Conference and Central and Western China Digital Economy Conference (hereinafter, the "Conference") will be officially held in Chengdu.

On the one hand, the conference has invited experts from authoritative institutions and universities to form a "Large Model Benchmarking Expert Committee," which will evaluate domestic large models to gain an in-depth understanding of their current capabilities and of how large model enterprises are developing. On the other hand, leading enterprises, experts and scholars, and domestic authoritative standard-setting institutions will gather at the conference to discuss industry trends, build a communication platform for the industry's upstream and downstream, and promote progress in large model technology.

On the eve of the conference, Red Star Capital Bureau spoke with He Junxian, an assistant professor in the Department of Computer Science at the Hong Kong University of Science and Technology, whose research focuses on the efficient adaptation, factuality, reasoning, and evaluation of large language models. He has served as an area chair for ACL and EMNLP; his papers were nominated for a best paper award at ACL 2019 and selected by Paper Digest among the most influential papers of ICLR 2022; and he has received honors including the Baidu AI Doctoral Scholarship and a place on the "AI Top 100 Chinese Rising Stars" list. He advised the student team that released C-Eval, the authoritative Chinese evaluation benchmark for large models, which has been downloaded more than 500,000 times since its release.

He Junxian said that large language models sit very close to the public, so new advances are quickly felt by ordinary users. The ultimate goal of his work is to achieve strong artificial intelligence in the true sense of the word.

The following is a transcript of the conversation:

Red Star Capital Bureau: ChatGPT is very popular, and it is exactly the kind of large language model you study. How do you measure how intelligent a large language model is?

He Junxian: True intelligence means the user can no longer tell whether the other side is an expert or a machine.

True intelligence is not just small talk, such as recommending products or asking about today's weather. It also means answering all kinds of questions about history, mathematics, and physics, or even taking an uploaded exam question and explaining how to solve it, or helping you write code and draft press releases.

If a model can do all of these things well, feels intelligent, has access to the world's knowledge, and has strong reasoning skills, then we feel it is no different from a real person.

Red Star Capital Bureau: You advised the students who released C-Eval, the authoritative Chinese evaluation benchmark for large models. How does it differ from previous evaluation leaderboards?

He Junxian: C-Eval is the first Chinese evaluation benchmark built for large models.

In the past, natural language processing already had many widely used Chinese datasets and evaluation benchmarks. But after large models like GPT were released at the end of last year, many of the earlier evaluations no longer seemed comprehensive, because large models had become too capable. The old benchmarks lacked discriminative power, the field suddenly underwent a major shift, and new benchmarks were urgently needed to guide model development.

Without a benchmark, development would be very difficult, like sailing without a compass. In the process of curating training data to develop a model, there is no standard to tell you whether your direction is right or wrong.

Previously, a traditional benchmark task was something like: given a product review, predict whether it is one star or five stars, positive or negative. That is relatively simple.

Now C-Eval's tasks come from real college entrance exams, postgraduate entrance exams, and undergraduate mathematics, physics, and biology questions from schools such as Tsinghua University and Peking University, covering more than 50 subjects. The difficulty is on an entirely different level.
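To make the contrast concrete, exam-style benchmarks like the one described above are typically scored as multiple-choice accuracy: the model picks one of four letters per question, and the score is the fraction it gets right. Below is a minimal sketch of that scoring loop; the sample items and the toy "model" are hypothetical illustrations, not part of the real C-Eval dataset or its evaluation harness.

```python
# Minimal sketch: scoring a model on exam-style four-choice questions.
# SAMPLE_ITEMS and toy_model are made-up stand-ins for illustration only.

SAMPLE_ITEMS = [
    {"question": "1 + 1 = ?",
     "choices": {"A": "1", "B": "2", "C": "3", "D": "4"},
     "answer": "B"},
    {"question": "3 * 3 = ?",
     "choices": {"A": "6", "B": "8", "C": "9", "D": "12"},
     "answer": "C"},
]

def toy_model(question: str, choices: dict) -> str:
    """Stand-in for a real LLM call; always guesses choice 'A'."""
    return "A"

def accuracy(items, model) -> float:
    """Fraction of items where the model's chosen letter matches the key."""
    correct = sum(model(it["question"], it["choices"]) == it["answer"]
                  for it in items)
    return correct / len(items)

print(accuracy(SAMPLE_ITEMS, toy_model))  # toy model misses both items: 0.0
```

A real harness would call an LLM in place of `toy_model` and report accuracy per subject, but the scoring itself stays this simple, which is part of why such benchmarks are easy to run and also easy to over-optimize for.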

Red Star Capital Bureau: To handle this higher difficulty, what new requirements does that place on large models?

He Junxian: Large models need to memorize more knowledge accurately and be able to carry out more complex reasoning.

Red Star Capital Bureau: Judging from C-Eval's questions, it tests not only the ability to store information, but also the ability to solve problems in mathematics and physics?

He Junxian: Beyond knowledge, it is very important for a model to have analytical ability, because we believe true intelligence requires reasoning.

On the one hand, in the Chinese context, the model needs to know a great deal, including history, politics, geography, and other knowledge tied to Chinese culture. This requires memorization, but memorization is relatively superficial; you only need to remember.

On the other hand, the logical reasoning demanded by mathematics and physics is hard. Knowing the principles is not enough; you must apply them with a certain logic to solve the problems. This is essentially a test of rigorous logic, which is often very demanding even for humans, because in a sense this is what true intelligence is about.

Red Star Capital Bureau: Among the nearly 100 models tested on the C-Eval leaderboard, what stage have Chinese large language models reached?

He Junxian: There is still a big gap with GPT-4, and that is despite the fact that GPT-4 cannot fully demonstrate its advantages on a Chinese benchmark.

When we ran the tests in May, GPT-4 was far ahead, well above second place. But now GPT-4 only ranks around the top 10 on our leaderboard. On the one hand, part of the C-Eval assessment requires rote memorization and tests Chinese culture, which GPT-4 is less good at. On the other hand, many domestic models have been specifically optimized for C-Eval, so their leaderboard scores are inflated; this is what we often call "gaming the leaderboard."

However, judging from broader evaluations and people's intuitive impressions, there is in fact still a big gap between domestic models and GPT-4. The user's own experience is the most direct measure, and users are hard to fool.

Red Star Capital Bureau: What gaps do Chinese large models still need to close?

He Junxian: The biggest gap between domestic models and ChatGPT is still strong reasoning ability. It concerns higher levels of intelligence; the real gap is not in the rote-memorization part.

The real gap is the ability to write code for you, to understand a long instruction, and then to reason on its own. This is the critical capability; the gap in rote memorization is not that large.

Red Star News reporter Cheng Luyang

Edited by Yu Dongmei


