What languages, frameworks, models are kaggle gods using? Here's a detailed statistic

Selected from Medium

By Eniola Olaleye

Compiled by Machine Heart

Editor: Qian Zhang

For machine learning learners and practitioners, competitions are a great way to practise and to earn some pocket money. So which platform hosts the most competitions, and which architectures and models do the best-performing teams use? In this article, data science enthusiast Eniola Olaleye presents the statistical results.

Statistics Website: https://mlcontests.com/

The author draws several important conclusions:

1. Kaggle still accounts for about a third of all competitions, and its prizes make up roughly half of the total $2.7 million prize pool;

2. 67 of the competitions were held on the top five platforms (Kaggle, AIcrowd, Tianchi, DrivenData and Zindi), while 8 were held on platforms that ran only a single competition last year;

3. Almost all winners used Python; only one winner used C++;

4. 77% of deep learning solutions used PyTorch (up from 72% last year);

5. All winning computer vision solutions used CNNs;

6. All winning NLP solutions used Transformers.

Here are the details of the survey:

Platform type

In this survey, the author counted 83 competitions across 16 platforms. Their combined prize pool exceeded $2.7 million, and the most lucrative contest was the Facebook AI Image Similarity Challenge: Matching Track, hosted by DrivenData, with prize money of $200,000.

Contest type

The survey shows that the most common competition types in 2021 were computer vision and natural language processing. This is a significant shift from 2020, when NLP competitions accounted for only 7.5% of the total.

Among the NLP competitions, the largest number were hosted by Zindi in partnership with AI4D (Artificial Intelligence for Development Africa); these included translating African languages into English or other languages, and sentiment analysis of African-language text.

Languages and frameworks

In this survey, the mainstream machine learning frameworks are still Python-based. Scikit-learn is particularly versatile and appears in almost every problem domain.

Unsurprisingly, the two most popular machine learning libraries are TensorFlow and PyTorch. PyTorch is the most popular in deep learning competitions, and its usage there has grown sharply since 2020, continuing a year-on-year rise.

Champion model

Supervised learning

In classic machine learning problems, gradient boosting models such as CatBoost and LightGBM dominate.

For example, in a Kaggle competition on indoor positioning and navigation, contestants had to design algorithms that predict a smartphone's indoor location from real-time sensor data. The champion solution considered three modelling approaches: neural networks, LightGBM, and k-nearest neighbours. In the final pipeline, however, they reached their highest score using only LightGBM and k-nearest neighbours.
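The k-nearest-neighbours component of such a pipeline can be illustrated with a minimal pure-Python sketch. This is an illustration only, not the winners' code: the data layout and the function name `knn_predict` are invented for the example. It predicts a phone's (x, y) position as the average position of the k closest sensor fingerprints in the training set.

```python
import math

def knn_predict(train, query, k=3):
    """Predict an (x, y) position as the mean position of the k
    training fingerprints closest to the query feature vector.

    train: list of (feature_vector, (x, y)) pairs; query: feature vector.
    """
    # Sort training points by Euclidean distance to the query.
    by_dist = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest = by_dist[:k]
    # Average the positions of the k nearest neighbours.
    x = sum(pos[0] for _, pos in nearest) / k
    y = sum(pos[1] for _, pos in nearest) / k
    return x, y
```

A real indoor-positioning pipeline would use far richer sensor features and, as the winners did, blend this output with a stronger model such as LightGBM.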

Computer Vision

Since AlexNet won the ImageNet competition in 2012, CNNs have been the go-to architecture for many deep learning problems, especially in computer vision.

Recurrent and convolutional neural networks are not mutually exclusive. Although they appear to solve different problems, both architectures can handle certain kinds of data. For example, an RNN takes a sequence as input, and sequences are not limited to text or music: a video is a collection of images and can also be treated as a sequence.

Recurrent neural networks such as LSTMs are used where data has temporal structure (such as time series) or is context-dependent (such as sentence completion), where the memory provided by feedback loops is key to good performance. RNNs have also been used successfully in the following areas of computer vision:

Image classification, e.g. labelling pictures as "daytime" or "night-time" (one-to-one RNN);

Image captioning (one-to-many RNN), i.e. assigning a caption to an image based on its content, such as "lion hunting deer";

Handwriting recognition.

Finally, RNNs and CNNs can be combined, arguably one of the most advanced approaches in computer vision. When the data suits a CNN but also has temporal structure, hybrid RNN-CNN techniques can be a favourable strategy.

Among other architectures, EfficientNet stands out for focusing on both accuracy and efficiency. EfficientNet uses a simple but effective technique, the compound coefficient, to scale models up, creating seven scaled variants of different sizes whose accuracy exceeds that of most state-of-the-art convolutional neural networks.
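The compound-coefficient idea can be shown in a few lines. The constants below (α = 1.2, β = 1.1, γ = 1.15) are the per-dimension scaling factors reported in the EfficientNet paper; the helper function itself is a sketch for illustration, not library code.

```python
# Compound scaling as described in the EfficientNet paper: depth, width
# and input resolution are all scaled together by a single compound
# coefficient phi, using fixed per-dimension factors.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution factors

def compound_scaling(phi):
    """Return the depth/width/resolution multipliers for coefficient phi."""
    return {
        "depth": ALPHA ** phi,       # more layers
        "width": BETA ** phi,        # more channels per layer
        "resolution": GAMMA ** phi,  # larger input images
    }

# The factors satisfy alpha * beta^2 * gamma^2 ~= 2, so each unit
# increase in phi roughly doubles the network's FLOPs.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2  # ~1.92
```

Each EfficientNet variant corresponds to a larger value of the compound coefficient, which is what produces the family of models of different sizes.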

NLP

As in 2020, the adoption of large language models such as Transformers in NLP continued to grow, reaching a record high in 2021. The author found about six winning NLP solutions, all of them based on Transformers.

Winning team situation

The author tracked the winners of 35 competitions in the dataset. Of those, only 9 had never won a competition prize before. As in 2020, experienced participants who have won many competitions keep winning, with only a few first-time winners; the percentages show no notable change.

Winning approaches

Among winning solutions in machine learning competitions, ensemble models have become one of the preferred methods. The most common ensembling approach is averaging: build several models and combine them by averaging their outputs to achieve more robust performance.

When tuning a model, once you hit diminishing returns, it is usually better to start building a new model that makes different kinds of errors and average their predictions.
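Averaging can be sketched in a few lines of plain Python. This is a minimal illustration with made-up probability values, not the code of any particular winning solution:

```python
def average_predictions(model_outputs):
    """Element-wise mean of per-class probabilities from several models.

    model_outputs: list of probability lists, one per model.
    """
    n_models = len(model_outputs)
    n_classes = len(model_outputs[0])
    return [
        sum(out[c] for out in model_outputs) / n_models
        for c in range(n_classes)
    ]

# Three hypothetical models disagree on a 3-class problem; the averaged
# distribution smooths out their individual errors.
avg = average_predictions([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.1, 0.4],
])  # averages to ~[0.6, 0.2, 0.2]
```

Because the models make different kinds of errors, the averaged output is usually more robust than any single model's prediction.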

Ensemble example

In a Kaggle competition on classifying cassava leaf diseases, contestants had to classify images of cassava leaves as healthy or as showing one of four diseases. The champion solution combined four different models (CropNet, EfficientNet-B4, ResNeXt50 and ViT) using an averaging approach.

The winner averages the class weights from the ResNeXt and ViT models, then combines that result with MobileNet and EfficientNet-B4 in a second stage.
