From beginner to advanced | 10 Kaggle datasets for data practitioners to practice on

Produced by CDA Data Analyst

By Andrew Lombarti

Compiled by: Mika

Kaggle is a popular data science competition platform. On it, you can not only take part in data analysis competitions but also practice your skills on real datasets from many industries.

In this article we introduce 10 datasets, ranging from those suited to complete beginners to those for advanced practitioners. They are interesting to work with and are also good practice before an interview.

Let's take a look!

01、Titanic dataset (elementary)

The Titanic dataset is one of the most popular datasets on Kaggle. It is a good starter dataset: the training file has 12 columns and 891 records, and the test file adds another 418 passengers. The dataset contains information about passengers aboard the Titanic.

The goal is to predict whether a passenger survived based on their characteristics. For example, the data shows that women had a much higher survival rate than men.

The variables in the dataset are:

age
gender
Number of siblings or spouses aboard
Ticket class (1st, 2nd, 3rd)
Port of embarkation (Cherbourg, Queenstown, Southampton)
Passenger ticket number
……

There are already many tutorials on how to handle this dataset online. If you want to challenge yourself, try to predict the survival rate of passengers boarding the ship at different locations.
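As a starting point for the embarkation challenge above, here is a minimal sketch that computes the survival rate per port using only the Python standard library. The column names `Survived`, `Sex` and `Embarked` follow the competition's train.csv; the six rows are made-up sample data for illustration, not real passengers, and `survival_rate_by` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the shape of the competition's train.csv (illustration only).
SAMPLE_CSV = """PassengerId,Survived,Sex,Embarked
1,0,male,S
2,1,female,C
3,1,female,S
4,0,male,Q
5,1,male,C
6,0,male,S
"""


def survival_rate_by(rows, key):
    """Return {group value: share of rows in that group with Survived == 1}."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row[key]] += 1
        survived[row[key]] += int(row["Survived"])
    return {group: survived[group] / total[group] for group in total}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rates = survival_rate_by(rows, "Embarked")
for port, rate in sorted(rates.items()):
    print(f"{port}: {rate:.2f}")
```

To run it on the real data, replace `io.StringIO(SAMPLE_CSV)` with an open file handle to the downloaded train.csv. Passing `"Sex"` instead of `"Embarked"` gives the same breakdown by gender.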

Titanic dataset link:

https://www.kaggle.com/c/titanic

02、Iris dataset (elementary)

This dataset is a classic multiclass classification problem. The aim is to predict which of three species an iris belongs to (Setosa, Versicolour or Virginica) from attributes such as sepal length and sepal width.

For example, Setosa irises have shorter petals and wider sepals than the other two species. If the petal length is less than about 2.5 cm, the flower almost certainly belongs to Setosa.

The variables in this dataset are as follows:

Sepal length
Sepal width
Petal length
Petal width
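A rule of thumb based on petal size can be sketched as a tiny hand-written classifier. The two cut-offs (2.5 cm for petal length, 1.75 cm for petal width) are assumptions chosen for illustration rather than values fitted here, and `classify_iris` is a hypothetical helper, so treat this as a baseline to beat rather than a finished model.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Rough two-threshold rule; the cut-offs are assumed, not fitted here."""
    if petal_length_cm < 2.5:
        return "setosa"
    if petal_width_cm < 1.75:
        return "versicolor"
    return "virginica"


# (petal length, petal width) in cm; illustrative measurements, one per species.
examples = [(1.4, 0.2), (4.7, 1.4), (6.0, 2.5)]
for length, width in examples:
    print(length, width, classify_iris(length, width))
```

A trained model should do at least as well as this rule, which makes it a handy sanity check when you follow a tutorial.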

There are also many tutorials for working with this dataset. One of the most popular is "Using Scikit Learn on iris Datasets". It is a good tutorial for beginners, showing how to use scikit-learn and its built-in tools for training models easily.

Iris dataset link:

https://www.kaggle.com/uciml/iris

03、Train dataset (elementary)

The train dataset is also a popular one on Kaggle. The dataset contains information about passengers on Amtrak trains that run between Boston and Washington, D.C.

The purpose is to predict whether a passenger will get off at a given station. For example, according to the dataset, passengers are more likely to get off in Baltimore than in Philadelphia.

The variables in the dataset are as follows:

Rail type (road, freight)
Weekends or holidays

Based on these variables, there are multiple ways to predict whether someone will get off at a certain stop.

Train dataset links:

https://www.kaggle.com/c/train-occupancy-prediction/data

04、Boston Housing Dataset (Elementary)

The Boston Housing dataset contains information about housing in the Boston area. The classic version has 506 records and 14 variables (13 features plus the target), and the goal is to predict the median home value in each area, which is a regression problem. To practice classification instead, you can bin the prices into three categories: expensive, normal, and cheap.

The variables include:

Per-capita crime rate
Average number of rooms per dwelling
Pupil-teacher ratio
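One way to obtain the expensive, normal and cheap categories is to cut the prices at their terciles. The sketch below does this with the standard library; the prices are hypothetical values in thousands of dollars, and `price_bands` is an illustrative helper name, not part of the dataset.

```python
import statistics


def price_bands(prices):
    """Split prices at the terciles and label each cheap/normal/expensive."""
    low, high = statistics.quantiles(prices, n=3)
    labels = []
    for price in prices:
        if price < low:
            labels.append("cheap")
        elif price < high:
            labels.append("normal")
        else:
            labels.append("expensive")
    return labels


# Hypothetical median home values in thousands of dollars.
prices = [12.0, 15.5, 19.0, 21.5, 23.0, 24.5, 28.0, 33.0, 41.0]
for price, band in zip(prices, price_bands(prices)):
    print(price, band)
```

Tercile cut-offs keep the three classes roughly the same size; fixed price thresholds are an equally valid choice if you prefer categories with a concrete meaning.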

If you're interested in data science, this dataset is a great place to start. The content is interesting and not too difficult.

Boston Housing Dataset Links:

https://www.kaggle.com/c/boston-housing

05、Relationship between alcohol and drugs (intermediate)

The Alcohol and Drug Relationship Dataset is an excellent dataset for practicing your data visualization skills. It contains information about the interactions between different drugs.

The goal of the dataset is to predict whether two drugs will interact based on their chemical structure. For example, ibuprofen and aspirin, which are both nonsteroidal anti-inflammatory drugs (NSAIDs), are known to interact.

Variables in the dataset include:

Structure of drug A (compound)
Structure of drug B (compound)
Activity of drugs A and B (yes / no)

This is a good dataset that you can use to practice your data visualization skills. You can try to create charts showing the interactions between different drugs.

Alcohol and Drug Dataset Links:

https://www.kaggle.com/jessicali9530/kuc-hackathon-winter-2018

06、Wisconsin breast cancer (intermediate)

For those with more experience in data science, the Wisconsin breast cancer dataset is a good challenge. It contains 569 records with 30 numeric features computed from digitized images of breast mass samples.

The goal is to predict whether a tumor is malignant or benign based on those features.

For example, you can see from the dataset that malignant tumors tend to have a larger mean radius, perimeter and area than benign ones.

The variables in the dataset are:

Mean radius
Mean texture
Mean concavity

There are some tutorials online on how to work with this dataset. If you want to challenge yourself, see how accurately you can predict the diagnosis from only a handful of features.

Wisconsin Breast Cancer Dataset Link:

https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

07、Pima Indians diabetes (intermediate)

This dataset is about predicting diabetes. It contains 768 patient records, and you need to predict whether a patient will develop diabetes (binary classification).

The variables are fairly simple: eight numeric features plus the outcome label, including:

Number of pregnancies
Glucose level
Blood pressure
Body mass index (BMI)
Age

The goal of this challenge is to predict whether a patient will develop diabetes within five years. This is a great way to practice your skills in binary classification problems.
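Because only about a third of the patients in this dataset are positive, accuracy alone can be misleading, so it helps to look at precision and recall as well. The following is a minimal sketch of those three metrics in plain Python; the label lists are hypothetical and `binary_metrics` is an illustrative helper name.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: 1 = diabetes, 0 = no diabetes.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Recall tells you how many of the actual diabetes cases the model caught, which is usually the number that matters most in a screening setting.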

Indian Diabetes Dataset Links:

https://www.kaggle.com/uciml/pima-indians-diabetes-database

08、Amazon Review Dataset (Intermediate)

The Amazon Review dataset is great for practicing text analytics. It contains reviews of products sold on Amazon's website.

The dataset is interesting because it includes both positive and negative reviews, and the goal is to predict whether a review is positive or negative.

The variables are:

Review text (a string)
Sentiment label (positive or negative)

There are also many tutorials on how to work with this dataset. If you want to make it more difficult, start with a simple sentiment baseline and then try to beat it with a more sophisticated model.
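A very simple baseline for this task is keyword counting. The sketch below assumes the fastText-style line format, in which each line starts with `__label__1` (negative) or `__label__2` (positive) followed by the review text; check the files you download, since that format is an assumption here. The two example lines and the word lists are made up for illustration.

```python
import re


def parse_line(line):
    """Split '__label__N review text' into (label, text)."""
    tag, _, text = line.strip().partition(" ")
    label = "positive" if tag == "__label__2" else "negative"
    return label, text


# Tiny hand-picked word lists (hypothetical, for illustration only).
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "broke", "refund"}


def keyword_sentiment(text):
    """Predict sentiment by counting positive and negative keywords."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score >= 0 else "negative"


# Made-up lines in the assumed format.
lines = [
    "__label__2 Great sound: I love these headphones",
    "__label__1 Broke after a week: asking for a refund",
]
for line in lines:
    label, text = parse_line(line)
    print(label, keyword_sentiment(text))
```

A baseline like this gives you a number to compare against before you invest in a trained text model.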

Amazon Review Dataset Link:

https://www.kaggle.com/bittlingmayer/amazonreviews

09、MNIST handwritten digit recognition (advanced)

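The dataset contains handwritten digit images, each 28x28 pixels in size. The original MNIST release has 60,000 training instances and 10,000 test instances; the Kaggle competition linked below re-splits them into 42,000 labeled training images and 28,000 unlabeled test images.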
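In CSV form, each 28x28 image is typically stored as one flat row of 784 pixel intensities from 0 to 255. The sketch below shows how such a row can be turned back into a 28x28 grid and scaled to the 0.0 to 1.0 range; the flat row is a hypothetical example, and `row_to_image` and `scale` are illustrative helper names.

```python
def row_to_image(pixels, side=28):
    """Turn a flat list of side*side values into a list of rows."""
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(pixels)}")
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


def scale(image):
    """Map 0-255 intensities to the 0.0-1.0 range."""
    return [[value / 255 for value in row] for row in image]


# Hypothetical flat row: a blank image with one bright pixel at row 2, column 3.
flat = [0] * 784
flat[2 * 28 + 3] = 255
image = scale(row_to_image(flat))
print(len(image), len(image[0]), image[2][3])
```

Scaling the intensities is a common preprocessing step before feeding the images to a neural network.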

The goal is to correctly classify each image as one of the digits 0 to 9. For this type of problem, a convolutional neural network (CNN) is usually used.
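To make the idea of a convolutional layer concrete, here is a minimal sketch of the sliding-window operation at its core, written in plain Python with no padding and a stride of 1. As in most deep learning libraries, the kernel is not flipped. The 4x4 image and the 2x2 kernel are toy values; a real CNN learns its kernels during training, and `conv2d_valid` is an illustrative helper name.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Toy 2x2 kernel that responds to dark-to-bright transitions from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

The output is largest exactly where the edge sits, which is the sense in which a convolutional layer detects local patterns. On a 28x28 digit, a 3x3 kernel applied this way produces a 26x26 feature map.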

There are a lot of tutorials online on how to deal with this type of problem, so I recommend that you start with the basics before moving on to more advanced methods.

MNIST handwritten digit dataset link:

https://www.kaggle.com/c/digit-recognizer

10、CIFAR-100 (Advanced)

The CIFAR-100 dataset is ideal for practicing machine learning skills. It contains 60,000 images in 100 classes, which are grouped into 20 superclasses such as fish, flowers, trees and people. Each image is 32x32 pixels and has three color channels (red, green, blue).

The goal is to predict which of the 100 classes each image falls into.

Each pixel is described by three variables:

Red channel
Green channel
Blue channel
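The three channels are often stored one after another in a single flat row per image: 1,024 red values, then 1,024 green, then 1,024 blue, each in row-major order. Assuming that layout (check the files you download), the sketch below regroups a flat row into a 32x32 grid of (red, green, blue) tuples. The all-red image is a hypothetical example and `unpack_rgb` is an illustrative helper name.

```python
def unpack_rgb(flat, side=32):
    """Return a side x side grid of (r, g, b) tuples from a channel-major row."""
    plane = side * side
    if len(flat) != 3 * plane:
        raise ValueError(f"expected {3 * plane} values, got {len(flat)}")
    red = flat[:plane]
    green = flat[plane:2 * plane]
    blue = flat[2 * plane:]
    return [
        [
            (red[r * side + c], green[r * side + c], blue[r * side + c])
            for c in range(side)
        ]
        for r in range(side)
    ]


# Hypothetical image: pure red everywhere.
flat = [255] * 1024 + [0] * 2048
grid = unpack_rgb(flat)
print(len(grid), len(grid[0]), grid[0][0])
```

Most image libraries expect this pixel-by-pixel layout, so a regrouping step like this is usually needed before plotting or augmenting the images.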

There are many tutorials on how to deal with this challenge. To make it more difficult, try predicting the labels of images that have been distorted or transformed in some way.

CIFAR-100 Dataset Link:

https://www.kaggle.com/fedesoriano/cifar100

Epilogue:

The 10 datasets listed in this article are great ways to hone your data analysis skills. If you are just getting started, begin with the simpler datasets and work your way up from easy to difficult.
