From beginner to advanced | 10 Kaggle datasets for data people to practice on

Author: CDA Data Analyst
Produced by CDA Data Analyst

By Andrew Lombarti

Compiled by: Mika

Kaggle is a popular data science competition platform. On it, you can not only take part in various data analysis competitions, but also practice your skills on real datasets from many industries.

In this article we introduce 10 datasets, ranging from ones suitable for complete beginners to ones for advanced users. These datasets are interesting to work with, and they are also great for practicing your skills before an interview.

Let's take a look!

01. Titanic dataset (elementary)

The Titanic dataset is one of the most popular datasets on Kaggle. It is a good starter dataset: the training file has 12 columns and 891 records, and the test file covers another 418 passengers. The dataset contains information about passengers aboard the Titanic.

The goal is to predict whether a passenger survived based on their characteristics. From the data you can see, for example, that women had a much higher probability of survival than men.

The variables in the dataset include:

  • Age
  • Sex
  • Siblings/spouses aboard
  • Ticket class (1st, 2nd, 3rd)
  • Port of embarkation (Cherbourg, Queenstown, Southampton)
  • Ticket number
  • ……

There are already many tutorials on how to handle this dataset online. If you want to challenge yourself, try to predict the survival rate of passengers boarding the ship at different locations.
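As a starting point for that challenge, here is a minimal sketch using only the Python standard library. It assumes the `Survived` and `Embarked` columns of Kaggle's `train.csv`; the six inline rows are made up for illustration, and `survival_rate_by_port` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the shape of Kaggle's train.csv (only two columns kept).
# Embarked codes: C = Cherbourg, Q = Queenstown, S = Southampton.
SAMPLE_CSV = """Survived,Embarked
1,C
0,S
1,S
0,S
0,Q
1,C
"""

def survival_rate_by_port(csv_file):
    """Return {port: fraction of passengers who survived} for each port."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(csv_file):
        port = row["Embarked"]
        if not port:  # skip rows where the port is missing
            continue
        total[port] += 1
        survived[port] += int(row["Survived"])
    return {port: survived[port] / total[port] for port in total}

rates = survival_rate_by_port(io.StringIO(SAMPLE_CSV))
for port, rate in sorted(rates.items()):
    print(f"{port}: {rate:.2f}")
```

To try it on the real data, pass `open("train.csv", newline="")` instead of the `StringIO` object.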

Titanic dataset link:

https://www.kaggle.com/c/titanic

02. Iris dataset (elementary)

This dataset is a classic multi-class classification problem. The aim is to predict which of three species an iris belongs to (Setosa, Versicolour, or Virginica) from attributes such as sepal length and sepal width.

For example, Setosa irises have shorter petals and wider sepals. If the petal length is under about 2 cm, the flower is very likely a Setosa.

The variables in this dataset are as follows:

  • Sepal length
  • Sepal width
  • Petal length
  • Petal width

There are also many tutorials for working with this dataset. One of the most popular is "Using Scikit Learn on iris Datasets". It is a great tutorial for beginners, showing how to use scikit-learn and its built-in functions to train models easily.
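To make the classification task concrete, here is a toy sketch in plain Python rather than scikit-learn. The three reference rows are the first sample of each species in the classic data; the 1-nearest-neighbour rule and the `classify` helper are illustrative assumptions, not the method of the tutorial mentioned above.

```python
import math

# One reference flower per species:
# (sepal length, sepal width, petal length, petal width) in cm.
REFERENCE = {
    "setosa": (5.1, 3.5, 1.4, 0.2),
    "versicolor": (7.0, 3.2, 4.7, 1.4),
    "virginica": (6.3, 3.3, 6.0, 2.5),
}

def classify(flower):
    """Return the species whose reference flower is closest (Euclidean distance)."""
    return min(REFERENCE, key=lambda name: math.dist(REFERENCE[name], flower))

print(classify((4.9, 3.0, 1.4, 0.2)))
print(classify((6.4, 3.2, 4.5, 1.5)))
```

A real model would learn from all 150 samples instead of three hand-picked ones, but the structure (features in, species out) is the same.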

Iris dataset link:

https://www.kaggle.com/uciml/iris

03. Train dataset (elementary)

The train dataset is also a popular one on Kaggle. The dataset contains information about passengers on Amtrak trains that run between Boston and Washington, D.C.

The purpose is to predict whether a passenger will get off at a certain station. Based on the dataset, for example, passengers are more likely to get off in Baltimore than in Philadelphia.

The variables in the dataset are as follows:

  • Rail type (road, freight)
  • Weekends or holidays

Based on these variables, there are multiple ways to predict whether someone will get off at a certain stop.

Train dataset link:

https://www.kaggle.com/c/train-occupancy-prediction/data

04. Boston Housing dataset (elementary)

The Boston Housing dataset contains information about housing in the Boston area. The classic version is small, with 506 records and 14 variables, and the usual goal is to predict the median house value of an area, which is a regression problem. You can also bin the prices into categories such as expensive, normal, and cheap to practice classification.

The variables include:

  • Per-capita crime rate
  • Average number of rooms per dwelling
  • Pupil-teacher ratio

If you're interested in the field of data science, this dataset is a great place to start. The content is interesting and not too difficult.
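One simple way to get a feel for house-price prediction is to fit a straight line of price against the average number of rooms. The sketch below uses plain least squares on made-up numbers; `fit_line` is a hypothetical helper, and a real analysis would use all the variables.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up values: average rooms per dwelling and median price (in $1000s).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [14.0, 18.0, 22.0, 26.0, 30.0]

slope, intercept = fit_line(rooms, prices)
predicted = slope * 6.5 + intercept
print(f"price = {slope:.1f} * rooms + {intercept:.1f}; 6.5 rooms -> {predicted:.1f}")
```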

Boston Housing dataset link:

https://www.kaggle.com/c/boston-housing

05. Relationship between alcohol and drugs (intermediate)

The Alcohol and Drug Relationship Dataset is an excellent dataset for practicing your data visualization skills. It contains information about the interactions between different drugs.

The goal of the dataset is to predict whether two drugs will interact based on their chemical structure. For example, the dataset may indicate that ibuprofen and naproxen can interact, since both are non-steroidal anti-inflammatory drugs (NSAIDs).

Variables in the dataset include:

  • Structure of drug A (compound)
  • Structure of drug B (compound)
  • Activity of drugs A and B (yes / no)

This is a good dataset that you can use to practice your data visualization skills. You can try to create charts showing the interactions between different drugs.

Alcohol and drug dataset link:

https://www.kaggle.com/jessicali9530/kuc-hackathon-winter-2018

06. Wisconsin breast cancer (intermediate)

For those with more experience in data science, the Wisconsin breast cancer dataset is a good challenge. This dataset contains measurements of breast tumor samples collected in Wisconsin.

The goal of the dataset is to predict whether a tumor is malignant or benign based on the measured characteristics of its cell nuclei.

For example, you can see from the dataset that malignant tumors tend to have a larger mean radius, perimeter, and area than benign ones.

The variables in the dataset include:

  • Radius
  • Texture
  • Perimeter and area

There are some tutorials online on how to work with this dataset. If you want to challenge yourself, you can try to predict the diagnosis from the size-related features alone.

Wisconsin Breast Cancer Dataset Link:

https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

07. Pima Indians diabetes (intermediate)

This dataset is about predicting diabetes. It contains 768 patient records, and you need to predict whether a patient has diabetes (binary classification).

The variables are fairly simple: eight numeric features plus the label:

  • Pregnancies
  • Glucose
  • Blood pressure
  • Skin thickness
  • Insulin
  • BMI
  • Diabetes pedigree function
  • Age
  • Outcome (diabetes or not)

The goal of this challenge is to predict whether a patient will develop diabetes within five years. This is a great way to practice your skills in binary classification problems.
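When practicing binary classification, it helps to look beyond plain accuracy. Below is a minimal sketch that computes accuracy, precision, and recall; the labels and predictions are made up, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up example: 1 = diabetes, 0 = no diabetes.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Recall matters in a medical setting like this one, because a missed positive case is usually more costly than a false alarm.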

Pima Indians diabetes dataset link:

https://www.kaggle.com/uciml/pima-indians-diabetes-database

08. Amazon review dataset (intermediate)

The Amazon review dataset is great for practicing text analytics. It consists of customer reviews of products on Amazon's website.

This dataset is interesting because it has both positive and negative reviews, and the goal is to predict whether a review is positive or negative.

The variables are:

  • Review text (a string)

There are also many tutorials on how to work with this dataset. If you want to make it more difficult, you can try building your own sentiment features first and then train a model on top of them.
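Before any modeling, each review has to be split into a label and its text. The sketch below assumes a fastText-style layout in which every line starts with `__label__1` (negative) or `__label__2` (positive); check that assumption against the files you download. The two sample lines are made up, and `parse_review` is a hypothetical helper.

```python
def parse_review(line):
    """Split '__label__N some text' into (is_positive, text)."""
    label, _, text = line.strip().partition(" ")
    if label not in ("__label__1", "__label__2"):
        raise ValueError(f"unexpected label: {label!r}")
    return label == "__label__2", text

# Made-up lines in the assumed format.
lines = [
    "__label__2 Great sound: I love these headphones.",
    "__label__1 Broke after a week: very disappointed.",
]

parsed = [parse_review(line) for line in lines]
for is_positive, text in parsed:
    print("positive" if is_positive else "negative", "|", text)
```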

Amazon Review Dataset Link:

https://www.kaggle.com/bittlingmayer/amazonreviews

09. MNIST handwritten digit recognition (advanced)

The dataset contains images of handwritten digits, each 28x28 pixels in size. The original MNIST dataset has 60,000 training instances and 10,000 test instances; Kaggle's Digit Recognizer competition repackages it as 42,000 labeled training images and 28,000 unlabeled test images.

The goal is to correctly classify each image as one of the digits 0 to 9. For this type of problem, a convolutional neural network (CNN) is usually used.

There are a lot of tutorials online on how to deal with this type of problem, so I recommend that you start with the basics before moving on to more advanced methods.
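A basic first step is turning one CSV row into an image a model can consume. The sketch below assumes the Digit Recognizer layout of a `label` column followed by `pixel0` to `pixel783` in row-major order; the sample row is made up, and `row_to_image` is a hypothetical helper.

```python
def row_to_image(row):
    """Turn [label, pixel0, ..., pixel783] into (label, 28x28 image scaled to [0, 1])."""
    label, pixels = int(row[0]), row[1:]
    if len(pixels) != 28 * 28:
        raise ValueError(f"expected 784 pixels, got {len(pixels)}")
    scaled = [int(p) / 255.0 for p in pixels]
    image = [scaled[r * 28:(r + 1) * 28] for r in range(28)]
    return label, image

# Made-up row: label 7 with a single bright pixel at row 1, column 2.
row = ["7"] + ["0"] * 784
row[1 + 1 * 28 + 2] = "255"

label, image = row_to_image(row)
print(label, len(image), len(image[0]), image[1][2])
```

Scaling pixel values to the range 0 to 1 is a common preprocessing choice before feeding images to a neural network.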

MNIST handwritten digit dataset link:

https://www.kaggle.com/c/digit-recognizer

10. CIFAR-100 (advanced)

The CIFAR-100 dataset is ideal for practicing machine learning skills. It contains 60,000 images in 100 classes, which are grouped into 20 superclasses such as aquatic mammals, flowers, and insects. Each image is 32x32 pixels and has three color channels (red, green, blue).

The goal is to predict which of the 100 classes each image belongs to.

The variables are:

  • Pixel
  • Red channel
  • Green channel
  • Blue channel

There are many tutorials on how to deal with this challenge. To make it more difficult, try predicting image labels that are distorted or transformed in some way.
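One simple transformation to experiment with is a horizontal flip. The sketch below mirrors an image stored as nested lists of (r, g, b) tuples; the image itself is made up, and `flip_horizontal` is a hypothetical helper rather than part of any library.

```python
def flip_horizontal(image):
    """Mirror an image stored as rows of (r, g, b) pixels, left to right."""
    return [list(reversed(row)) for row in image]

# Made-up 32x32 image: black everywhere except a red pixel in the top-left corner.
image = [[(0, 0, 0) for _ in range(32)] for _ in range(32)]
image[0][0] = (255, 0, 0)

flipped = flip_horizontal(image)
print(flipped[0][31])  # the red pixel is now in the top-right corner
```

The function returns a new image and leaves the original untouched, so the same source image can be reused for several transformations.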

CIFAR-100 dataset link:

https://www.kaggle.com/fedesoriano/cifar100

Conclusion:

The 10 datasets listed in this article are great ways to hone your data analysis skills. If you are just getting started, begin with the simpler datasets and work your way up to the harder ones.
