
Produced by CDA Data Analyst
By Andrew Lombarti
Compiled by: Mika
Kaggle is a popular data science competition platform. There, you can not only take part in a range of data analysis competitions but also practice your skills on real datasets from many industries.
In this article we introduce 10 datasets, ranging from ones suited to complete beginners to ones for advanced practitioners. They are interesting to work with and are also good practice before an interview.
Let's take a look!
01、Titanic dataset (elementary)
The Titanic dataset is one of the most popular datasets on Kaggle. It is a good starter dataset: the training file has 12 columns, and the training and test files together cover 1,309 passengers (891 and 418 respectively). The dataset contains information about passengers aboard the Titanic.
The goal is to predict whether a passenger survived based on their characteristics. The data shows, for example, that women had a much higher survival rate than men.
The variables in the dataset are:
- Age
- Sex
- Number of siblings or spouses aboard
- Ticket class (1st, 2nd, 3rd)
- Port of embarkation (Cherbourg, Queenstown, Southampton)
- Ticket number
- ……
There are already many tutorials online on how to handle this dataset. If you want to challenge yourself, try to predict the survival rate of passengers who boarded at different ports.
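As a starting point for that challenge, here is a minimal sketch that computes the survival rate per port of embarkation. It assumes the `Survived` and `Embarked` column names from Kaggle's `train.csv`; the six rows in `SAMPLE_CSV` are invented for illustration and are not real passengers.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the layout of train.csv (only three columns kept).
SAMPLE_CSV = """PassengerId,Survived,Embarked
1,0,S
2,1,C
3,1,S
4,0,Q
5,1,C
6,0,S
"""


def survival_rate_by(rows, column):
    """Return {group value: share of passengers with Survived == 1}."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        group = row[column]
        if not group:  # skip passengers whose value is missing
            continue
        total[group] += 1
        survived[group] += int(row["Survived"])
    return {group: survived[group] / total[group] for group in total}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rates = survival_rate_by(rows, "Embarked")
for port, rate in sorted(rates.items()):
    print(f"{port}: {rate:.2f}")
```

To run it on the real data, replace `io.StringIO(SAMPLE_CSV)` with an open handle to the downloaded `train.csv`. The same function works for other columns, for example `survival_rate_by(rows, "Sex")`.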
Titanic dataset link:
https://www.kaggle.com/c/titanic
02、Iris dataset (elementary)
This dataset is a classic multiclass classification problem. The aim is to predict which of three species an iris belongs to, Setosa (mountain iris), Versicolour (variegated iris), or Virginica (Virginia iris), from attributes such as sepal length and sepal width.
For example, Setosa irises have shorter petals and wider sepals than the other two species. If the petal length is under about 2 cm, the flower almost certainly belongs to Setosa.
The variables in this dataset are as follows:
- Sepal length
- Sepal width
- Petal length
- Petal width
There are also many tutorials for working with this dataset. One of the most popular is "Using Scikit Learn on iris Datasets". It is a good tutorial for beginners, showing how to use scikit-learn and its built-in tools to train models easily.
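Before reaching for a library, it can help to see how far a hand-written rule gets. The sketch below is not a trained model: the two thresholds are hand-picked assumptions, and `SAMPLES` is a small set of measurements typed in by hand for illustration.

```python
# Measurements in cm, in the order:
# (sepal length, sepal width, petal length, petal width, species).
SAMPLES = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
]


def classify(petal_length, petal_width):
    """Two-threshold rule; both cut-offs are illustrative assumptions."""
    if petal_length < 2.5:   # very short petals: Setosa
        return "setosa"
    if petal_width < 1.75:   # narrower petals: Versicolour
        return "versicolor"
    return "virginica"


correct = sum(
    classify(petal_length, petal_width) == species
    for _, _, petal_length, petal_width, species in SAMPLES
)
accuracy = correct / len(SAMPLES)
print(f"accuracy on the sample: {accuracy:.2f}")
```

A rule like this is a useful baseline: a model trained with scikit-learn should at least match it on the full dataset.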
Iris dataset link:
https://www.kaggle.com/uciml/iris
03、Train dataset (elementary)
The train dataset is also a popular one on Kaggle. The dataset contains information about passengers on Amtrak trains that run between Boston and Washington, D.C.
The purpose is to predict whether a passenger will get off at a given station. According to the dataset, passengers are more likely to get off the train in Baltimore than in Philadelphia.
The variables in the dataset are as follows:
- Rail type (road, freight)
- Weekends or holidays
Based on these variables, there are multiple ways to predict whether someone will get off at a certain stop.
Train dataset links:
https://www.kaggle.com/c/train-occupancy-prediction/data
04、Boston Housing Dataset (Elementary)
The Boston Housing dataset contains information about housing in the Boston area. The classic version has 506 records and 14 variables, and the goal is to predict the median home value of each area, which makes it a regression problem. If you prefer a classification exercise, you can bin the prices into three categories: expensive, normal, and cheap.
The variables include:
- Per-capita crime rate
- Pupil-teacher ratio
- Average number of rooms per dwelling
If you're interested in data science, this dataset is a good place to start. The content is interesting and not too difficult.
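A simple first baseline for house prices is a straight-line fit of price against the average number of rooms. The sketch below implements ordinary least squares for one variable by hand. The `rooms` and `prices` values are invented to keep the arithmetic easy to follow and do not come from the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: average rooms per dwelling and price (in thousands).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [14.0, 18.0, 22.0, 26.0, 30.0]

slope, intercept = fit_line(rooms, prices)


def predict(avg_rooms):
    """Predicted price for a given average number of rooms."""
    return slope * avg_rooms + intercept


print(f"price = {slope:.1f} * rooms + {intercept:.1f}")
print(f"predicted price for 6.5 rooms: {predict(6.5):.1f}")
```

On real data the relationship will be noisier, so it is worth comparing this one-variable line against models that use all the variables.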
Boston Housing Dataset Links:
https://www.kaggle.com/c/boston-housing
05、Relationship between alcohol and drugs (intermediate)
The Alcohol and Drug Relationship Dataset is an excellent dataset for practicing your data visualization skills. It contains information about the interactions between different drugs.
The goal of the dataset is to predict whether two drugs will interact based on their chemical structure. For example, a record might describe whether two common painkillers, such as ibuprofen (an NSAID) and paracetamol (which is not an NSAID), interact.
Variables in the dataset include:
- Structure of drug A (compound)
- Structure of drug B (compound)
- Activity of drugs A and B (yes / no)
As a visualization exercise, you can try creating charts that show the interactions between different drugs.
Alcohol and Drug Dataset Links:
https://www.kaggle.com/jessicali9530/kuc-hackathon-winter-2018
06、Wisconsin breast cancer (intermediate)
For those with more experience in data science, the Wisconsin breast cancer dataset is a worthwhile challenge. It contains measurements taken from breast tumor samples collected at the University of Wisconsin.
The goal is to predict whether a tumor is malignant or benign based on those measurements.
For example, you can see from the dataset that malignant tumors tend to have a larger mean radius and area than benign ones.
The variables in the dataset are:
- Radius of the cell nuclei
- Texture
- Smoothness and concavity
There are some tutorials online on how to work with this dataset. If you want to challenge yourself, you can try to work out which measurements best separate malignant tumors from benign ones.
Wisconsin Breast Cancer Dataset Link:
https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
07、Pima Indians diabetes (intermediate)
This dataset is about predicting diabetes. It has 768 records, and you need to predict whether a patient will develop diabetes (binary classification).
The variables are fairly simple: eight medical measurements plus the outcome label, including:
- Glucose concentration
- Blood pressure
- Body mass index (BMI)
- Age
The goal of this challenge is to predict whether a patient will develop diabetes within five years. This is a great way to practice your skills in binary classification problems.
Indian Diabetes Dataset Links:
https://www.kaggle.com/uciml/pima-indians-diabetes-database
08、Amazon Review Dataset (Intermediate)
The Amazon Review dataset is great for practicing text analytics. It contains reviews of products on Amazon's website.
This dataset is interesting because it includes both positive and negative reviews, and the goal is to predict whether a review is positive or negative.
The variables are:
- Review text (a string)
- Sentiment label (positive or negative)
There are also many tutorials on how to work with this dataset. If you want to make it more difficult, you can first run sentiment analysis on the reviews and then build a further model on top of those results.
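The first step in most of those tutorials is separating the label from the text. The sketch below assumes the files follow the fastText convention, where each line starts with `__label__1` (negative) or `__label__2` (positive); check the dataset description to confirm before relying on it. The two sample lines are invented.

```python
from collections import Counter

# Assumed mapping from fastText-style tags to sentiment names.
LABELS = {"__label__1": "negative", "__label__2": "positive"}


def parse_line(line):
    """Split one fastText-style line into (sentiment, review text)."""
    tag, _, text = line.strip().partition(" ")
    return LABELS[tag], text


# Hypothetical lines in the assumed format.
sample_lines = [
    "__label__2 Great value: works exactly as described.\n",
    "__label__1 Broke after a week: would not buy again.\n",
]

parsed = [parse_line(line) for line in sample_lines]
label_counts = Counter(sentiment for sentiment, _ in parsed)

for sentiment, text in parsed:
    print(f"{sentiment}: {text}")
```

Once the lines are parsed into pairs like this, the text can be passed to whichever vectorizer or model you want to practice with.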
Amazon Review Dataset Link:
https://www.kaggle.com/bittlingmayer/amazonreviews
09、MNIST handwritten digit recognition (advanced)
The dataset contains a large number of images of handwritten digits, each 28x28 pixels in size, with 60,000 training instances and 10,000 test instances.
The goal is to correctly classify the digits in the training and test sets. For this type of problem, a convolutional neural network (CNN) is usually used.
There are a lot of tutorials online on how to deal with this type of problem, so I recommend that you start with the basics before moving on to more advanced methods.
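One of those basics is getting the pixels into shape before any model sees them. The sketch below assumes each image arrives as a flat list of 784 grayscale values from 0 to 255, stored row by row, which is how the CSV files of the linked competition are usually described. The function name `row_to_image` is hypothetical.

```python
def row_to_image(pixels, side=28):
    """Scale a flat list of 0-255 values to 0.0-1.0 and reshape it to side x side."""
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(pixels)}")
    scaled = [value / 255.0 for value in pixels]
    return [scaled[r * side:(r + 1) * side] for r in range(side)]


# A blank image with a single white pixel at row 14, column 3.
flat = [0] * 784
flat[28 * 14 + 3] = 255

image = row_to_image(flat)
print(len(image), len(image[0]))  # number of rows and columns
print(image[14][3])               # the white pixel after scaling
```

Scaling to the 0.0-1.0 range and restoring the 2D layout are common preparation steps before feeding images to a CNN.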
MNIST handwritten digit dataset link:
https://www.kaggle.com/c/digit-recognizer
10、CIFAR-100 (advanced)
The CIFAR-100 dataset is ideal for practicing machine learning skills. It contains 60,000 images in 100 classes, which are grouped into 20 superclasses. Each image is 32x32 pixels and has three color channels (red, green, blue).
The goal is to predict which of the 100 classes each image falls into.
The variables are:
- Pixel
- Red channel
- Green channel
- Blue channel
There are many tutorials on how to deal with this challenge. To make it more difficult, try predicting labels for images that have been distorted or transformed in some way.
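A horizontal flip is one of the simplest transformations to try. The sketch below assumes the flat layout used by the original CIFAR Python files: 3,072 values per image, with all 1,024 red values first, then green, then blue, each stored row by row. The Kaggle upload linked here may store images differently, so treat the layout as an assumption.

```python
SIDE = 32
PLANE = SIDE * SIDE  # 1,024 values per color channel


def to_rows(flat):
    """Turn a flat 3,072-value image into 32 rows of (r, g, b) pixels."""
    if len(flat) != 3 * PLANE:
        raise ValueError(f"expected {3 * PLANE} values, got {len(flat)}")
    red = flat[:PLANE]
    green = flat[PLANE:2 * PLANE]
    blue = flat[2 * PLANE:]
    pixels = list(zip(red, green, blue))
    return [pixels[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]


def flip_horizontal(rows):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in rows]


# A black image whose top-left pixel is pure red.
flat = [0] * (3 * PLANE)
flat[0] = 255

image = to_rows(flat)
flipped = flip_horizontal(image)
print(image[0][0], flipped[0][SIDE - 1])
```

After the flip, the red pixel sits in the top-right corner. Testing a trained model on flipped copies is a quick check of how sensitive it is to this kind of change.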
CIFAR-100 Dataset Link:
https://www.kaggle.com/fedesoriano/cifar100
Epilogue:
The 10 datasets listed in this article are great ways to hone your data analysis skills. If you are just getting started, begin with the relatively simple datasets and work your way up to the more difficult ones.