
[Math Dataset Roundup] Datasets recommended by mathematician Terence Tao!

Author: HyperAI

Last week, the renowned mathematician Terence Tao published a resource list on his blog titled "AI for Math Resources", intended to help people who want to get started in AI for mathematics. The list was compiled by participants of the "AI to Assist Mathematical Reasoning" workshop, which was organized by the National Academies of Sciences, Engineering, and Medicine, with Tao serving as moderator.

The list is not yet final; Tao and other researchers are still adding to it. HyperAI has selected several of its datasets for you to download and use, and has supplemented them with other mathematical datasets to support AI for Math work.

1. OpenWebMath Web Math Dataset

Published by: University of Toronto, University of Cambridge, and others

Released: 2023

Estimated size: 44.21 GB

OpenWebMath contains most of the high-quality math text from the internet. It was filtered and extracted from over 200B HTML files on Common Crawl, resulting in a set of 6.3 million documents containing a total of 14.7B tokens.
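As a rough sanity check on the figures above, the average document length can be estimated directly from the quoted totals. This is only a back-of-the-envelope sketch: it assumes the rounded numbers in the text (14.7B tokens, 6.3 million documents), and the function name is made up for illustration.

```python
# Figures quoted in the text above (rounded, so the result is approximate).
TOTAL_TOKENS = 14.7e9      # 14.7B tokens
TOTAL_DOCUMENTS = 6.3e6    # 6.3 million documents


def average_tokens_per_document(tokens: float, documents: float) -> float:
    """Return the mean number of tokens per document."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return tokens / documents


avg_tokens = average_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)
print(f"about {avg_tokens:,.0f} tokens per document")
```

By this estimate a typical document runs to a couple of thousand tokens, roughly the length of a long forum thread or a short article.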

2. Ape210K Chinese Primary School Level Math Problems

Published by: Yuanfudao AI Lab, Northwest University

Released: 2020

Estimated size: 78.43 MB

Ape210K is a large-scale, template-rich dataset of math word problems. It contains 210K Chinese elementary-school-level problems, each paired with a reference answer and the equation needed to reach that answer.
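Because each problem comes with both an equation and an answer, the two can be cross-checked against each other. The sketch below illustrates that idea only: the record layout, field names, and sample problem are hypothetical (the real Ape210K schema may differ), and the evaluator handles just the four basic arithmetic operators.

```python
import ast
import operator

# Hypothetical record layout; the real Ape210K field names may differ.
record = {
    "question": "A box holds 12 pencils. How many pencils are in 5 boxes?",  # made-up example
    "equation": "x=12*5",
    "answer": "60",
}

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> float:
    """Evaluate a +, -, *, / arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)


def equation_matches_answer(rec: dict, tol: float = 1e-6) -> bool:
    """Check that the right-hand side of 'x=...' evaluates to the stored answer."""
    rhs = rec["equation"].split("=", 1)[1]
    return abs(evaluate(rhs) - float(rec["answer"])) <= tol


is_consistent = equation_matches_answer(record)
print(is_consistent)
```

Walking the parsed syntax tree instead of calling `eval()` keeps the check safe to run on text crawled from the web, at the cost of rejecting any equation that uses operators beyond the four listed.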

3. Proof-Pile-2 Mathematical Dataset

Published by: Princeton University

Released: 2023

Estimated size: 47.57 GB

Proof-Pile-2 is a 55-billion-token dataset of mathematical and scientific documents. It combines scientific papers, math-related web content, and mathematical code, with a knowledge cutoff of April 2023.

4. Orca-Math-200K Math Word Problem Dataset

Published by: Microsoft

Released: 2024

Estimated size: 70.88 MB

Orca-Math-200K is a high-quality dataset of math word problems created by Microsoft. It contains approximately 200,000 grade-school math problems, and all answers in the dataset were generated with Azure GPT-4 Turbo.

5. Mizar Mathematical Dataset

Published by: Mizar

Released: 2018

Mizar is a library of formalized mathematics written in the Mizar language, created and revised by many authors and maintainers over many years. The Mizar system has by now grown into the large Mizar Mathematical Library, which provides a solid basis for further work on mathematics and related problems.

6. Math23K Math Word Problem Solving Dataset

Published by: Tencent AI Lab

Released: 2017

Estimated size: 8.36 MB

Math23K is a dataset for math word problem solving. It contains 23,162 Chinese problems crawled from the internet.

7. MathVista Mathematical Reasoning Dataset

Published by: Microsoft, University of Washington

Released: 2023

Estimated size: 1.61 GB

MathVista is a comprehensive benchmark for mathematical reasoning in visual contexts. It includes three newly created datasets, IQTest, FunctionQA, and PaperQA, which evaluate logical reasoning on puzzle test figures, algebraic reasoning on function plots, and scientific reasoning on figures from academic papers, respectively.

8. MetaMathQA Mathematical Reasoning Dataset

Published by: Huawei, University of Cambridge

Released: 2023

Estimated size: 84.34 MB

MetaMathQA is a diverse, high-quality mathematical reasoning dataset consisting of 395K question-answer pairs, covering both forward and backward reasoning, generated by large language models.

9. AlgoPuzzleVQA Multimodal Algorithm Puzzle Dataset

Published by: Singapore University of Technology and Design

Released: 2024

Estimated size: 157.85 MB

The dataset contains 18 different puzzles covering diverse mathematical and algorithmic topics such as Boolean logic, combinatorics, graph theory, optimization, and search. The puzzles are generated automatically from human-written code, so both reasoning complexity and dataset size can be scaled arbitrarily.

10. TAL-SCQ5K Chinese Mathematics Competition Dataset

Published by: TAL Education Group (好未来)

Released: 2023

Estimated size: 11.4 MB

TAL-SCQ5K is a high-quality mathematics competition dataset containing 5K competition questions (3K for training and 2K for testing), available in both Chinese and English.

These are the 10 mathematics datasets HyperAI has compiled for you. If there are resources you would like to see included on the hyper.ai website, feel free to leave a message or submit a contribution!

About HyperAI (hyper.ai)

HyperAI (hyper.ai) is a leading artificial intelligence and high-performance computing community in China. It aims to serve as infrastructure for data science in China by providing developers there with rich, high-quality public resources.

* Provides accelerated download nodes in China for 1,200+ public datasets

* Includes 300+ classic and popular online tutorials

* Interprets 100+ AI4Science paper case studies

* Supports lookup of 500+ related terms

* Hosts the first complete Chinese documentation for Apache TVM in China

Visit the official website to start your learning journey: HyperAI (hyper.ai)
