Python Machine Learning: 8 projects for beginners

No amount of theory is a substitute for hands-on practice.

Textbooks and courses can lull you into a false sense of mastery because the material is right in front of you. But when you try to apply it, you may find it harder than it looks. Projects help you improve your applied ML skills quickly while giving you the chance to explore an interesting topic.

Plus, you can add projects to your portfolio, making it easier to find jobs, find cool career opportunities, and even negotiate higher salaries.

In this post, we will introduce 8 interesting machine learning projects for beginners. You can do any of them in a single weekend, or expand the ones you enjoy into longer projects.

1. Machine learning gladiator

We affectionately call this project "machine learning gladiator," but the idea isn't new. It is one of the fastest ways to build practical intuition around machine learning.

The goal is to take out-of-the-box models and apply them to different datasets. There are 3 main reasons why this project is great:

First, you'll build intuition for which models fit which problems. Which models are robust to missing data? Which models handle categorical features well? Yes, you can flip through textbooks to find out, but you'll learn better by doing.

Secondly, this project will teach you valuable skills in rapid prototyping. In the real world, it's often hard to know which model performs best without simply trying them.

Finally, this exercise can help you master the workflow of model building. For example, you will start practicing...

• Importing data

• Cleaning data

• Splitting it into train/test or cross-validation sets

• Preprocessing

• Transformations

• Feature engineering

Because you'll be using an out-of-the-box model, you'll have the opportunity to focus on honing these key steps.
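
As a rough illustration of the gladiator idea, the sketch below pits a few out-of-the-box scikit-learn models against each other using 5-fold cross-validation. The dataset (scikit-learn's bundled wine data), the choice of contenders, and the helper name run_tournament are our own assumptions for the example, not a prescribed setup.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Built-in dataset so the sketch runs offline; swap in any dataset you like.
X, y = load_wine(return_X_y=True)

# Out-of-the-box contenders. Scaling is wrapped in a pipeline so it is
# re-fit inside every cross-validation fold (no leakage from held-out data).
contenders = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}


def run_tournament(models, X, y, folds=5):
    """Return {model name: mean cross-validated accuracy}."""
    return {
        name: cross_val_score(model, X, y, cv=folds).mean()
        for name, model in models.items()
    }


scores = run_tournament(contenders, X, y)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>20}: {score:.3f}")
```

Re-running the same tournament on several datasets is where the intuition comes from: the ranking of models tends to change with the data.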

Check out the sklearn (Python) or caret (R) documentation page for instructions. You should practice regression, classification, and clustering algorithms.

Tutorials

• Python: sklearn – The official tutorial for the sklearn package

• Predict wine quality with Scikit-Learn – a step-by-step tutorial for training machine learning models

• R: caret – A webinar provided by the author of the caret package

Data source

• UCI Machine Learning Repository – Over 350 searchable datasets covering virtually any topic. You will surely find the dataset you are interested in.

• Kaggle Datasets – More than 100 datasets uploaded by the Kaggle community. There are some very interesting datasets here, including PokemonGo spawning sites and burritos in San Diego.

• data.gov – Open datasets published by the U.S. government. If you're interested in the social sciences, check it out.

2. Play Moneyball

In the book Moneyball, the Oakland A's revolutionized baseball through analytical player scouting. They built a competitive team while spending only 1/3 of what large-market teams like the Yankees paid in salaries.

First of all, if you haven't read the book yet, you should check it out. This is one of our favorites!

Fortunately, the sports world has a wealth of data to play with. Data on teams, games, scores, and players is tracked and freely available online.

There are plenty of interesting machine learning projects here for beginners. For example, you can try...

• Sports betting... Predict box scores based on available data before each new match.

• Talent scouting... Use college statistics to predict which players will have the best careers.

• General management... Create clusters of players based on their strengths to build a well-rounded team.
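
The clustering idea can be sketched with k-means. Everything in this example is an assumption made for illustration: the three stats (points, assists, rebounds), the three player archetypes, and the synthetic numbers, which stand in for real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical per-game stats (points, assists, rebounds) for three made-up
# player archetypes; real data would come from a sports statistics source.
archetype_means = np.array([
    [25.0, 4.0, 5.0],    # scorers
    [10.0, 9.0, 3.0],    # playmakers
    [8.0, 2.0, 12.0],    # rebounders
])
stats = np.vstack([rng.normal(mean, 1.0, size=(30, 3)) for mean in archetype_means])

# Standardize so no single stat dominates the distance calculation.
scaled = StandardScaler().fit_transform(stats)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

for cluster in range(3):
    members = stats[labels == cluster]
    pts, ast, reb = members.mean(axis=0)
    print(f"cluster {cluster}: {len(members)} players, "
          f"avg points={pts:.1f}, assists={ast:.1f}, rebounds={reb:.1f}")
```

With real players the groups will not be this clean, and choosing the number of clusters becomes part of the exercise.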

Sports are also an excellent area for practicing data visualization and exploratory analysis. You can use these skills to help you decide what types of data to include in your analysis.

Data source

• Sports Statistics Database – Sports statistics and historical data covering many professional sports and some collegiate sports. The clean interface makes web scraping easier.

• Sports Reference – Another sports statistics database. The interface is more cluttered, but individual tables can be exported as CSV files.

• cricsheet.org – Ball-by-ball data for international and IPL cricket matches. Provides CSV files for IPL and T20 international competitions.

3. Predict stock prices

For any data scientist interested in finance, the stock market is like candy land.

First, you have many types of data to choose from. You can find prices, fundamentals, global macroeconomic indicators, volatility indices, and more. The list goes on.

Second, the data can be very granular. You can easily get time series data for each company by day (or even by the minute), allowing you to think creatively about trading strategies.

Finally, financial markets generally have short feedback cycles. As a result, you can quickly validate your predictions on new data.

Some examples of beginner-friendly machine learning projects you can try include...

• Quantitative value investing... Forecast 6-month price movements based on fundamental indicators from companies' quarterly reports.

• Forecasting... Build time series models, or even recurrent neural networks, on the difference between implied and actual volatility.

• Statistical arbitrage... Find similar stocks based on price action and other factors, and look for periods when prices diverge.
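
To make the statistical arbitrage idea concrete, here is a minimal sketch that measures how far the spread between two related stocks has drifted from its recent average. The prices are synthetic, and the 60-day window and 2-standard-deviation threshold are arbitrary assumptions for the example, not recommended settings.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 500

# Synthetic log-prices: two hypothetical stocks share a common random-walk
# factor, so their spread is just noise around a constant level.
common = np.cumsum(rng.normal(0, 0.01, n_days))
log_a = np.log(50.0) + common + rng.normal(0, 0.005, n_days)
log_b = np.log(80.0) + common + rng.normal(0, 0.005, n_days)


def rolling_zscore(series, window):
    """Z-score of each point against the preceding `window` observations."""
    z = np.full(series.shape, np.nan)
    for t in range(window, len(series)):
        past = series[t - window:t]      # strictly earlier data: no look-ahead
        z[t] = (series[t] - past.mean()) / past.std()
    return z


spread = log_a - log_b
z = rolling_zscore(spread, window=60)

# Flag days where the spread sits more than 2 standard deviations from its
# recent mean; these are the divergence periods worth inspecting.
divergent_days = np.flatnonzero(np.abs(z) > 2.0)
daily_corr = np.corrcoef(np.diff(log_a), np.diff(log_b))[0, 1]

print(f"correlation of daily returns: {daily_corr:.2f}")
print(f"divergent days: {len(divergent_days)} of {n_days - 60}")
```

Using only past observations in each z-score matters: computing the mean over the full series would leak future information into every signal.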

Note that building trading models to practice machine learning is simple, but making them profitable is extremely difficult. Nothing here is financial advice, and we do not recommend trading real money.

Tutorials

• Python: sklearn for Investing – A YouTube video series that applies machine learning to investing.

• R: Quantitative Trading with R – Detailed class notes on quantitative finance using R.

Data source

• Quandl – A data marketplace that provides free (and high-quality) financial and economic data. For example, you can bulk download end-of-day stock prices for more than 3,000 U.S. companies or economic data from the Federal Reserve.

• Quantopian – A quantitative finance community that offers a free platform for developing trading algorithms. Includes datasets.

• US Fundamentals Archive – 5 years of fundamentals data for more than 5,000 U.S. companies.

4. Teach neural networks to read handwriting

Neural networks and deep learning are two success stories of modern artificial intelligence. They have led to major advances in image recognition, automatic text generation, and even self-driving cars.

To venture into this exciting field, you should start with manageable datasets.

The MNIST Handwritten Digit Classification Challenge is the classic entry point. Image data is generally harder to work with than "flat" relational data. The MNIST data is beginner-friendly and small enough to fit on a single computer.

Handwriting recognition will challenge you, but it doesn't require high computing power.

To start, we recommend the first chapter of the tutorial below. It will teach you how to build a neural network from scratch that solves the MNIST challenge with high accuracy.
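
To give a feel for what "from scratch" involves, here is a minimal sketch of a one-hidden-layer network trained with mini-batch gradient descent in NumPy. It is not the tutorial's code. As a stand-in for MNIST it uses scikit-learn's small bundled 8x8 digits dataset so that it runs without a download, and the layer size, learning rate, and epoch count are untuned assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits

# Stand-in for MNIST: scikit-learn's bundled 8x8 digit images (1,797 samples),
# used here only so the sketch runs without a download.
digits = load_digits()
X = digits.data / 16.0                      # pixel values range from 0 to 16
y = digits.target

rng = np.random.default_rng(0)
order = rng.permutation(len(X))
train, test = order[:1400], order[1400:]

n_in, n_hidden, n_out = 64, 32, 10
W1 = rng.normal(0, np.sqrt(2 / n_in), (n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, np.sqrt(2 / n_hidden), (n_hidden, n_out))
b2 = np.zeros(n_out)


def forward(inputs):
    """Return hidden activations and class probabilities for a batch."""
    hidden = np.maximum(0, inputs @ W1 + b1)            # ReLU
    logits = hidden @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)           # softmax
    return hidden, probs


batch_size, learning_rate = 32, 0.1
for epoch in range(30):
    shuffled = rng.permutation(train)
    for start in range(0, len(shuffled), batch_size):
        batch = shuffled[start:start + batch_size]
        hidden, probs = forward(X[batch])
        # Gradient of the mean cross-entropy loss with respect to the logits.
        d_logits = (probs - np.eye(n_out)[y[batch]]) / len(batch)
        # Backpropagate through the output layer and the ReLU.
        d_hidden = (d_logits @ W2.T) * (hidden > 0)
        W2 -= learning_rate * hidden.T @ d_logits
        b2 -= learning_rate * d_logits.sum(axis=0)
        W1 -= learning_rate * X[batch].T @ d_hidden
        b1 -= learning_rate * d_hidden.sum(axis=0)

_, test_probs = forward(X[test])
accuracy = (test_probs.argmax(axis=1) == y[test]).mean()
print(f"test accuracy: {accuracy:.3f}")
```

The same structure carries over to the real 28x28 MNIST images by changing the input size to 784 and loading the full dataset.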

Tutorials

• Neural Networks and Deep Learning (online book) – Chapter 1 walks through writing a neural network from scratch in Python to classify digits from MNIST. The author also gives a good explanation of the intuition behind neural networks.

Data source

• MNIST – MNIST is a modified subset of two datasets collected by the National Institute of Standards and Technology. It contains 70,000 labeled handwritten digit images.

5. Investigate Enron

The Enron scandal and collapse were among the biggest corporate crashes in history.

In 2000, Enron was one of the largest energy companies in the United States. Then, after being exposed for fraud, it spiraled down to bankruptcy within a year.

Luckily, we have the Enron email database. It contains 500,000 emails between 150 former Enron employees, mostly senior executives. It's also the only large public database of real emails, which makes it even more valuable.

In fact, data scientists have been using this dataset for education and research for years.

Examples of beginner machine learning projects you can try include...

• Anomaly detection... Map the distribution of emails sent and received by hour, and try to detect abnormal behavior leading up to the public scandal.

• Social network analysis... Build network graph models of the connections between employees to find key influencers.

• Natural language processing... Analyze message bodies in conjunction with email metadata to classify emails by their purpose.
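
The first two ideas both start from the same parsing step. The sketch below uses only Python's standard library to pull the sender, recipients, and timestamp out of raw messages, then counts emails per hour and per sender-recipient pair. The four messages are invented for illustration, and real archive messages will need more defensive handling (missing headers, malformed dates).

```python
from collections import Counter
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical messages in the plain RFC 822 style of an email archive;
# the addresses and contents are invented for illustration.
raw_messages = [
    "From: alice@example.com\nTo: bob@example.com\n"
    "Date: Mon, 14 May 2001 09:15:00 -0700\nSubject: Forecast\n\nNumbers attached.",
    "From: alice@example.com\nTo: bob@example.com, carol@example.com\n"
    "Date: Mon, 14 May 2001 09:40:00 -0700\nSubject: Re: Forecast\n\nUpdated numbers.",
    "From: bob@example.com\nTo: alice@example.com\n"
    "Date: Mon, 14 May 2001 14:05:00 -0700\nSubject: Lunch\n\nFree at noon tomorrow?",
    "From: alice@example.com\nTo: carol@example.com\n"
    "Date: Mon, 14 May 2001 23:30:00 -0700\nSubject: Urgent\n\nCall me tonight.",
]


def parse(raw):
    """Extract sender, recipient list, send time, and body from one message."""
    msg = message_from_string(raw)
    sent_at = parsedate_to_datetime(msg["Date"])
    recipients = [addr.strip() for addr in msg["To"].split(",")]
    return msg["From"], recipients, sent_at, msg.get_payload()


emails_per_hour = Counter()
edges = Counter()           # (sender, recipient) -> number of emails
for raw in raw_messages:
    sender, recipients, sent_at, _body = parse(raw)
    emails_per_hour[sent_at.hour] += 1
    for recipient in recipients:
        edges[(sender, recipient)] += 1

# Emails sent per person: a crude first proxy for influence in the network.
sent_by = Counter()
for (sender, _recipient), count in edges.items():
    sent_by[sender] += count

print("emails per hour:", dict(sorted(emails_per_hour.items())))
print("most active sender:", sent_by.most_common(1)[0])
```

The hourly counts feed the anomaly detection idea, while the edge counts can be loaded into a graph library for the network analysis.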

Data source

• Enron Email Dataset – This is an Enron email archive hosted by CMU.

• Enron Data Description (PDF) – An exploratory analysis of the Enron email data that can help you get your bearings.

6. Write ML algorithms from scratch

Writing machine learning algorithms from scratch is an excellent learning tool for two main reasons.

First, there is no better way to build a true understanding of their mechanisms. You will be forced to think about every step, which will lead to true mastery.

Secondly, you will learn how to convert mathematical instructions into working code. You will need this skill when tweaking algorithms from academic research.

We recommend starting with an algorithm that isn't too complex. Even the simplest algorithms involve many subtle decisions. Once you're comfortable building simple algorithms, try extending them with more functionality. For example, try extending a plain logistic regression into a lasso- or ridge-regularized version by adding a regularization parameter.

Finally, here's a tip that every beginner should know: don't be discouraged because your algorithm isn't as fast or fancy as the ones in existing packages. These packages are the result of years of development!
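
As one possible starting point, the sketch below implements logistic regression with batch gradient descent in NumPy and adds an optional ridge (L2) penalty through a single parameter. The function names, the synthetic data, and the hyperparameters are assumptions for illustration. A lasso (L1) penalty is not shown because it is not smooth at zero and needs a different update rule.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, l2=0.0, learning_rate=0.1, n_iter=2000):
    """Batch gradient descent on mean log-loss plus (l2 / 2) * ||w||^2.

    The intercept b is left unpenalized, as is conventional.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        error = sigmoid(X @ w + b) - y          # shape (n_samples,)
        w -= learning_rate * (X.T @ error / n_samples + l2 * w)
        b -= learning_rate * error.mean()
    return w, b


def predict(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)


# Synthetic, hypothetical data: two features, labels from a noisy linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

w_plain, b_plain = fit_logistic(X, y)
w_ridge, b_ridge = fit_logistic(X, y, l2=1.0)

print("unregularized weights:", w_plain.round(2))
print("ridge weights:        ", w_ridge.round(2))
print("training accuracy:    ", (predict(X, w_plain, b_plain) == y).mean())
```

Comparing the two weight vectors shows the effect of the penalty directly: the ridge weights are pulled toward zero.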

Tutorials

• Python: Logistic regression from scratch

• Python: k-nearest neighbors from scratch

• R: Logistic regression from scratch

7. Tap into social media sentiment

Due to the sheer volume of user-generated content, social media has become almost synonymous with "big data."

Mining this rich data can provide unprecedented ways to keep a pulse on opinions, trends, and public sentiment. Facebook, Twitter, YouTube, WeChat, WhatsApp, Reddit... The list goes on.

In addition, each generation spends more time on social media than the one before it. This means that social media data will become even more relevant for marketing, branding, and business as a whole.

While there are many popular social media platforms, Twitter is a classic entry point for practicing machine learning.

With Twitter data, you get an interesting mix of data (tweet content) and metadata (location, hashtags, users, retweets, etc.), opening up almost endless paths for analysis.
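
A first sentiment model on tweet text can be very small. The sketch below is one common baseline, a bag-of-words Naive Bayes classifier built with scikit-learn. The six training "tweets" and their labels are invented for illustration, and a real project would need thousands of labeled examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled tweets; real training data would be far larger.
train_texts = [
    "i love this phone",
    "what a great day",
    "so happy with the service",
    "i hate this phone",
    "what an awful day",
    "terrible service never again",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

# CountVectorizer turns each tweet into word counts; MultinomialNB learns
# which words are more frequent in each class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_tweets = ["love the great service", "awful terrible phone"]
predictions = model.predict(new_tweets)

for tweet, label in zip(new_tweets, predictions):
    print(f"{label:>8}: {tweet}")
```

Because the model only sees word counts, it misses negation and sarcasm, which makes it a useful baseline to improve on.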

Tutorials

• Python: Mining Twitter data – How to perform sentiment analysis on Twitter data

• R: Sentiment Analysis Using Machine Learning – Short and sweet sentiment analysis tutorial

Data source

• Twitter API – The Twitter API is a classic source of streaming data. You can track tweets, hashtags, and more.

• StockTwits API – StockTwits is like a Twitter for traders and investors. You can extend this dataset in many interesting ways by joining it to time series datasets on the timestamp and ticker symbol.

8. Improve health care

Another industry that is experiencing rapid change thanks to machine learning is global health and healthcare.

In most countries, becoming a doctor requires many years of education. It's a demanding field with long hours, high stakes, and an even higher barrier to entry.

Therefore, significant efforts have recently been made to reduce the workload of doctors with the help of machine learning and improve the overall efficiency of the healthcare system.

Use cases include:

• Preventive care... Predict disease outbreaks at the individual and community levels.

• Diagnostic care... Automatically classify image data such as scans, X-rays, etc.

• Insurance... Adjust premiums based on disclosed risk factors.
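
As a small taste of the diagnostic use case, the sketch below trains a classifier on scikit-learn's bundled breast cancer dataset, whose features were computed from digitized images of cell samples. The choice of dataset, model, and train/test split are assumptions for illustration, and nothing here is a clinically validated approach.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled Wisconsin breast cancer dataset: 30 numeric features computed from
# digitized images of cell nuclei. Target 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = model.score(X_test, y_test)
# Recall on the malignant class: the share of true malignant cases caught.
malignant_recall = recall_score(y_test, y_pred, pos_label=0)

print(f"accuracy: {accuracy:.3f}")
print(f"malignant recall: {malignant_recall:.3f}")
print(confusion_matrix(y_test, y_pred))
```

In health applications overall accuracy is rarely the right target. A missed malignant case usually costs far more than a false alarm, which is why the sketch reports recall separately.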

As hospitals continue to modernize patient records, and as we collect more granular health data, there will be plenty of low-hanging opportunities for data scientists to make a difference.

Tutorials

• R: Build meaningful machine learning models for disease prediction

• Machine Learning in Healthcare – A fascinating presentation from Microsoft Research

Data source

• Large health datasets – A collection of large health-related datasets

• data.gov/health – Health and healthcare-related datasets provided by the U.S. government.

• Health Nutrition and Population Statistics – Global health, nutrition, and population statistics from the World Bank.