A fiercely competitive 16-year-old high school student wrote a C++ machine learning library from scratch in 13,000+ lines of code

A report by Heart of the Machine

Heart of the Machine editorial department

Is it now fashionable in artificial intelligence, too, for high school students to save the world?

A teenager who loves computers can already build something substantial by the age of 16: developing a Cantonese programming language, winning a Kaggle competition, writing a game, developing a cryptocurrency trading bot, building a C++ machine learning library from scratch, and so on.

Today's story is about a 16-year-old (@novak-99) who built a C++ machine learning library from scratch. His post introducing the project received hundreds of upvotes on Reddit.

The library he built (ML++) has more than 13,000 lines of code and covers statistics, linear algebra, numerical analysis, machine learning, and deep learning.

Project Address: https://github.com/novak-99/MLPP

@novak-99 says he built the library because C++ is his language of choice, yet it is used very little on the front end of ML.

C++ is efficient and executes quickly, so most libraries (such as TensorFlow, PyTorch, or NumPy) use C/C++ or a C/C++-derived language under the hood to improve speed.

But when he looked at front-end implementations of various machine learning algorithms, he noticed that most were written in Python, MATLAB, R, or Octave. He believes C++ is used less on the ML front end mainly because of a lack of user support and the complexity of C++ syntax.

Compared to Python, C++ has few machine learning frameworks. Even in popular frameworks such as PyTorch or TensorFlow, the C++ implementation is not as complete as the Python one: documentation is lacking, not all major functions exist, few people are willing to contribute, and so on.

In addition, C++ lacks the key libraries of Python's ML suite. Neither pandas nor Matplotlib is available for C++. This increases the time it takes to implement ML algorithms, because data visualization and data analysis tools are harder to come by.

So he decided to write a machine learning library for C++ himself.

He also notes that because ML algorithms are so easy to use off the shelf, some engineers may overlook the implementation and the mathematical details behind them. This can cause problems, because it is not possible to tailor an ML algorithm to a specific use case without understanding its mathematics. So in addition to the library, he plans to publish comprehensive documentation explaining the mathematical background of each machine learning algorithm in the library, covering statistics, linear regression, Jacobian matrices, and backpropagation. Here's a section on statistics:

[Figure: excerpt from the documentation's statistics section]

Opening the project, we can see some of the details:

Covering 19 major topics, ML++ is large and comprehensive

Like most frameworks, the ML++ library created by this high school student is dynamic and constantly changing. This matters especially in machine learning, where new algorithms and techniques appear all the time.

Currently, the following models and techniques are being developed in the ML++ library:

Convolutional Neural Networks (CNNs)

Kernels for support vector machines (SVMs)

Support vector regression

Overall, the ML++ library covers 19 major topics, with subtopics as follows:

Regression (Linear Regression, Logistic Regression, Softmax Regression, Exponential Regression, Probit Regression, Cloglog Regression, Tanh Regression)

Deep, dynamically sized neural networks (activation functions, optimization algorithms, loss functions, regularization methods, weight initialization methods, learning rate schedulers)

Prebuilt neural networks (multilayer perceptrons, autoencoders, Softmax networks)

Generative modeling (tabular generative adversarial networks)

Natural language processing (Word2Vec, stemming, bag-of-words models, TF-IDF, auxiliary text processing functions)

Computer vision (convolution operations, max/min/average pooling, global max/min/average pooling, prebuilt feature detectors)

Principal component analysis

Naive Bayes classifiers (multinomial Naive Bayes, Bernoulli Naive Bayes, Gaussian Naive Bayes)

Support vector classification (primal formulation, dual formulation)

K-Means algorithm

K-nearest neighbors algorithm

Outlier finder (using standard scores)

Matrix decomposition (SVD, Cholesky decomposition, QR decomposition)

Numerical analysis (numerical differentiation, Jacobian vector calculator, Hessian matrix calculator, function approximator, differential equation solver)

Mathematical transforms (discrete cosine transform)

Linear algebra module

Statistics module

Data processing module (feature scaling, mean normalization, one-hot representation, reverse one-hot representation, supported color space conversion types)

Utilities (TP/FP/TN/FN functions, precision, recall, accuracy, F1 score)
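To make one item in the list concrete, the sketch below shows roughly how an outlier finder based on standard scores can work: a value is flagged when its distance from the mean, measured in standard deviations, exceeds a threshold. This is not code from ML++. The function name findOutliers, its signature, and the use of the population standard deviation are all assumptions made for this illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not the ML++ API.
// Returns the indices of values whose standard score
// |x - mean| / stddev is greater than `threshold`.
std::vector<std::size_t> findOutliers(const std::vector<double>& data,
                                      double threshold) {
    std::vector<std::size_t> outliers;
    if (data.size() < 2) return outliers;  // no meaningful spread to measure

    const double n = static_cast<double>(data.size());

    // Mean of the sample.
    double mean = 0.0;
    for (double x : data) mean += x;
    mean /= n;

    // Population standard deviation (divides by n); an assumption of this sketch.
    double sumSq = 0.0;
    for (double x : data) sumSq += (x - mean) * (x - mean);
    const double stddev = std::sqrt(sumSq / n);
    if (stddev == 0.0) return outliers;  // all values identical, nothing to flag

    // Flag every value whose absolute standard score exceeds the threshold.
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (std::abs(data[i] - mean) / stddev > threshold) {
            outliers.push_back(i);
        }
    }
    return outliers;
}
```

For the sample {1, 2, 3, 4, 100} and a threshold of 1.5, this sketch flags only index 4, the value 100. A small sample like this needs a low threshold: with the population standard deviation, no value among five can score higher than 2.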

For more details, please refer to the original project.

Netizens: with competition this fierce, what am I supposed to do?

Seeing a 16-year-old produce such an impressive project, some netizens could not help but sigh: what are high school students doing these days?! "At their age I was still chewing my fingers, and they have already presented papers at ICLR and NeurIPS..."

Some netizens said that if high school students are doing these things, imagine how intense PhD applications will be in a few years. By then, all you will need is more than three NeurIPS papers and a future Turing Award.

It sounds like a joke, but to some extent it reflects how fierce the competition already is.

However, some netizens pointed out that the project has 13,000 lines of code but no tests. Another netizen argued that this is a pet project built out of personal interest and not meant for real-world use, so testing is not important here.

Reference link:

https://www.reddit.com/r/MachineLearning/comments/srbvnc/p_c_machine_learning_library_built_from_scratch/
