
Scientists use a STEM dataset to evaluate the foundations of neural network models and accelerate the adoption of artificial intelligence

Author: DeepTech

STEM skills in science and engineering are the foundation for solving real-world problems such as exploring protein structures, proving mathematical theorems, and discovering new drugs. (Editor's note: STEM is an acronym for Science, Technology, Engineering, and Mathematics.)

For the field of artificial intelligence, understanding visual-text multimodal information is the key to mastering STEM skills.

However, existing datasets mainly test a model's ability to solve expert-level problems, which makes it hard to gauge the model's mastery of basic knowledge. Moreover, they tend to consider only textual information while ignoring visual information, or to cover only a single STEM subject.

In addition, because these datasets lack fine-grained information, researchers in this field cannot effectively analyze and improve the weaknesses of neural network models.

As a result, content generated by models under these conditions can neither be fully trusted nor help guide the direction of future model development.

More importantly, the lack of data on human performance deprives researchers of a meaningful reference point for model performance, which seriously hinders the healthy development of artificial intelligence.

To overcome these limitations, a research team from Peking University and Washington University in St. Louis recently built the first multimodal STEM dataset and used it to evaluate large language models and multimodal foundation models.

The results show that even the most advanced AI models still have substantial room for improvement in foundational STEM skills and are not yet capable of solving harder real-world problems. In other words, a gap remains between current artificial intelligence and human intelligence.


Figure | Overall evaluation results (Source: ICLR 2024)

Recently, the related paper was accepted to the 2024 International Conference on Learning Representations (ICLR 2024) under the title "Measuring Vision-Language STEM Skills of Neural Models" [1].

The conference will reportedly be held in Vienna, the capital of Austria, from May 7 to May 11 this year.

Resources related to the STEM dataset are listed below.

Evaluation leaderboard: https://huggingface.co/spaces/stemdataset/stem-leaderboard

Dataset page: https://huggingface.co/datasets/stemdataset/STEM

Code (GitHub): https://github.com/stemdataset/STEM

Jianhao Shen and Ye Yuan, Ph.D. students at Peking University, are co-first authors; Assistant Professor Chenguang Wang of Washington University in St. Louis and Professor Ming Zhang of Peking University are co-corresponding authors. Chenguang Wang received his Ph.D. from Peking University under the supervision of Ming Zhang.


Figure | The related paper (Source: ICLR 2024)


Building a STEM dataset to comprehensively evaluate the foundational STEM capabilities of neural network models

According to Chenguang Wang, after settling on the research goals and topic, the team began collecting data.

The team members, who had always specialized in algorithm research, inevitably ran into difficulties with writing crawlers, cleaning data, and deduplication. Still, they rose to the challenge, devised a variety of rules for data cleaning and deduplication, and ultimately completed the first multimodal STEM dataset.


From left: Chenguang Wang, Ming Zhang, Jianhao Shen, Ye Yuan, Srbuhi Mirzoyan (Source: Research Group)

Notably, the dataset covers 448 STEM skills and contains 1,073,146 questions in total, making it the multimodal STEM dataset with the broadest skill coverage and the largest number of questions to date.


Figure | The related paper (Source: ICLR 2024)

Next, they began to evaluate and analyze the dataset.

Since each question in the dataset carries labels along three dimensions, namely subject (science, technology, engineering, or mathematics), skill, and grade level, the researchers analyzed the data from these three dimensions, examining in detail the distributions of question counts, question types, and question lengths within each.

At the same time, for each subject they split the data into a training set, a validation set, and a test set with withheld labels, at a ratio of 6:2:2.
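To make the split concrete, here is a minimal sketch of a per-subject 6:2:2 split; the field name `subject` and the question layout are illustrative assumptions, not the team's released preprocessing code.

```python
# Hypothetical sketch of the per-subject 6:2:2 split; the field name
# "subject" and the data layout are illustrative assumptions.
import random
from collections import defaultdict

def split_by_subject(questions, seed=42):
    """Split question dicts into 60/20/20 train/val/test within each subject."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for q in questions:
        by_subject[q["subject"]].append(q)

    train, val, test = [], [], []
    for items in by_subject.values():
        rng.shuffle(items)
        n_train = int(0.6 * len(items))
        n_val = int(0.2 * len(items))
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])
    return train, val, test
```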

Subsequently, the researchers designed a model evaluation scheme.

In addition to accuracy, when selecting metrics they also drew on test scores from one of the world's most widely used online practice sites (https://www.ixl.com/).

These scores are derived from the real exercise records of the site's tens of millions of users and correlate positively with students' mastery of the material. A score of 90 or above generally indicates that a student, typically at the elementary school level, has mastered the skill.

"We had the model mimic the test taker's online answers, and then compared the resulting test scores with the results of real human exams. Wang Chenguang said.

This is one of the highlights of the work: in past comparisons of human performance with artificial intelligence, the human baseline was typically summarized from a relatively small sample (a few hundred to a few thousand people), whereas the team's results are based on data from tens of millions of users and are therefore more credible.

Then, in the model evaluation stage, the researchers chose mainstream large foundation models, including OpenAI's multimodal CLIP model and the GPT-3.5-Turbo version of the large language model ChatGPT.

The former selects an answer by scoring how well each answer option, paired with the question, matches the image, while the latter uses an image-captioning model to generate a description of the image and then has the language model select the answer.
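As an illustration of the CLIP-style evaluation, the sketch below scores each answer option against the question image using an off-the-shelf CLIP checkpoint from Hugging Face; the checkpoint and the prompt format (question concatenated with each option) are assumptions, not necessarily the paper's exact setup.

```python
# Hypothetical sketch of the CLIP-style evaluation: pair the question with
# each option, score every pairing against the image, and pick the best.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_answer(image_path, question, options):
    """Return the option whose text best matches the question image."""
    image = Image.open(image_path).convert("RGB")
    texts = [f"{question} {opt}" for opt in options]
    inputs = processor(text=texts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_options)
    return options[logits.argmax(dim=-1).item()]
```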

"We evaluated the CLIP model and the GPT3.5-Turbo model at different scales and found that the model had a high error rate at a 0 sample setting. This suggests that existing models are not able to grasp this knowledge directly. Wang Chenguang said.

They then fine-tuned the CLIP model on the training split and found that fine-tuning brought significant improvements, raising overall accuracy from 54.4% to 76.3%. However, this still leaves a clear gap to the 90-point mastery level.
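A fine-tuning step along these lines might look like the following sketch, which treats the per-option image-text scores as class logits and trains with cross-entropy against the correct option; the prompt format, optimizer, and learning rate are illustrative assumptions rather than the paper's recipe.

```python
# Minimal fine-tuning sketch (assumed setup: same prompts as above,
# AdamW, lr=1e-5): per-option scores become class logits for cross-entropy.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(image, question, options, answer_idx):
    """One gradient step on a single question; returns the loss value."""
    model.train()
    texts = [f"{question} {opt}" for opt in options]
    inputs = processor(text=texts, images=image,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image        # (1, num_options)
    loss = torch.nn.functional.cross_entropy(
        logits, torch.tensor([answer_idx]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```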

In addition, the research group analyzed the model's results from several angles.

Specifically, at the grade level, they found that the model's test scores decreased as the grade level of the questions increased, consistent with the expectation that higher-grade questions are more difficult.


Figure | Test scores by grade level (Source: ICLR 2024)

Second, by evaluating the model's performance on different skills, they found that it performed poorly on abstract knowledge and complex reasoning tasks.

In addition, experience suggests that a well-calibrated model should assign high predictive confidence to its correct answers.

"We found that the models that were fine-tuned on our dataset showed good calibration, with a clear correlation between model confidence and accuracy. Wang Chenguang said.

They also found a clear positive correlation when studying the relationship between model size and performance.

At the same time, they analyzed how model performance relates to other factors such as question length, question type, and the number of options, finding that performance declined as questions grew longer, as the number of options increased, and as the number of examples decreased.

In addition, they assessed the relationship between accuracy and test scores and found a significant positive correlation between the two.
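Such a relationship can be checked with, for example, a Pearson correlation across skills, as in the toy sketch below; the input numbers are hypothetical placeholders, not results from the paper.

```python
# Toy sketch of the accuracy-vs-test-score check using a Pearson correlation
# across skills; all numbers below are hypothetical placeholders.
import numpy as np
from scipy.stats import pearsonr

accuracy_per_skill = np.array([0.62, 0.71, 0.55, 0.80, 0.47])    # hypothetical
test_score_per_skill = np.array([68.0, 75.0, 60.0, 88.0, 52.0])  # hypothetical

r, p_value = pearsonr(accuracy_per_skill, test_score_per_skill)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```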

"In the end, in terms of overall evaluation indicators, we confirmed that even the fine-tuned model had a significant gap with the level of the human equivalent in the grade level. Based on this, we still need to find more effective ways to equip the model with STEM knowledge and skills. Wang Chenguang said.


Figure | Comparison with human performance (Source: ICLR 2024)


Introducing more datasets to evaluate large language models and accelerate progress toward artificial general intelligence

Clearly, the STEM dataset played a key role in this study.

Not only does it help models strengthen their foundational STEM knowledge, it also helps researchers assess how well a model has mastered basic STEM skills and improve the model in a targeted way through fine-grained data analysis.

Chenguang Wang said that he and his team hope the dataset will further advance research on multimodal large models, moving them closer to models that can fully understand STEM skills and solve STEM problems in real-world scenarios.

They also hope the released test set will be widely adopted by the community as one of the standard evaluations of foundation AI models' capabilities.

"More importantly, the comparison we provide with the real-world level of large-scale humans (mainly elementary school students) can be used as a target and reference for future model development to accelerate the process of achieving the goal of general AI. He said.

To date, using this dataset, the research group has evaluated neural network models' science and engineering abilities at the basic-education level.

Next, on the one hand, they plan to continue collecting data and to release datasets in areas such as the humanities and social sciences, so as to better evaluate large language models' abilities in other key disciplines.

Notably, the team recently proposed a new social-science dataset, Social, which contains large-scale text evaluation data for assessing the basic social-science capabilities of large language models.

They also designed a multi-agent interaction method that improves the performance of large language models on the Social dataset.

The related paper was accepted to the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) under the title "Measuring Social Norms of Large Language Models" [2].

The conference will reportedly be held in Mexico City, the capital of Mexico, from June 16 to June 21 this year.

On the other hand, they intend to identify where models fall short, and how they can be improved, by studying model performance on fine-grained datasets.

In addition, they hope to further strengthen models' foundational capabilities by combining retrieval-augmented generation (RAG) with purpose-built model architectures and training methods.

"We believe that only by achieving breakthroughs in the fields of basic science, engineering and liberal arts and laying a solid foundation can AI be further applied. Wang Chenguang said.

References:

1. Shen, J., Yuan, Y., Mirzoyan, S., et al. Measuring Vision-Language STEM Skills of Neural Models. ICLR 2024. https://openreview.net/forum?id=spvaV5LELF

2. Yuan, Y., et al. Measuring Social Norms of Large Language Models. NAACL 2024. https://arxiv.org/abs/2404.02491

Operation/Typesetting: He Chenlong
