
How to build a model to estimate the click-through rate of 10w+ advertorials?

Author: Big Data and Artificial Intelligence Sharing

A machine learning project roughly goes through 5 steps from start to finish: defining the problem, collecting and preprocessing data, selecting an algorithm and determining the model, training and fitting the model, and evaluating and optimizing model performance. These 5 steps form a cyclic, iterative process, as shown in the figure below:

[Figure: the 5 iterative steps of a machine learning project]

All of our projects follow these 5 steps, which I refer to as the 5 Steps in Action.

Step 1 Define the problem

Let's start with the first step: defining the problem. To define the problem, we need to analyze the business scenario, set a clear goal, and determine which type of machine learning problem we are facing. If we don't figure this out, we won't be able to choose a model later.

So first, let's understand the business scenario of this project. Suppose you have joined the operations department of "Easy Speed Flowers" and are analyzing the effectiveness of the promotional copy on its WeChat official account. You have collected a lot of advertorial data, including the number of likes, retweets, views, and so on, like this:

[Figure: sample rows of the collected advertorial data]

Once an article on a WeChat official account passes 100,000 views, the platform no longer displays the exact view count; it only shows "10w+". So the goal of our project is to build a machine learning model that estimates how many views an article actually achieved, based on metrics such as the number of likes and retweets.

Because the number of views is what we want to estimate, in this dataset the number of likes, the number of retweets, the popularity index, and the article rating are the 4 features, and the number of views is the label. Since we already have a label to estimate, this is a supervised learning problem. And since the label is a continuous numeric value, it is a regression problem.

It is not hard to see that there is a clear correlation between the features and the label in this dataset: articles with more likes and retweets tend to have more views. But what specific function describes this correlation? We don't know yet, so finding it is exactly the task of this project.

Now that we have the problem defined, the next step is data collection and preprocessing.

Step 2 Collect data and preprocess

"Collecting data and preprocessing" appears in every machine learning project, and its role is to provide good fuel for the machine learning model: the better the data, the better the model runs. This step sounds like a single sentence, but it actually contains 6 smaller sub-steps:

(1) collect data;

(2) data visualization;

(3) data cleaning;

(4) feature engineering;

(5) construct feature sets and label sets;

(6) split the training set, validation set, and test set.

You may not understand the meaning of these 6 steps at first glance. Don't worry, I'll explain them one by one.

1. Collect data

The first step is to collect data. In reality, data collection is usually hard work: you have to add plenty of tracking points (data burying) in the product's operation flow to capture behavioral data such as user purchases and interest preferences, and sometimes you also have to crawl data from the web. Once the dataset is in hand, the next thing to do is data visualization, that is, to observe the data visually and get a feel for which machine learning model might fit.

2. Data visualization

We'll run all of the code below in a Jupyter notebook. If you don't have it set up yet, install and launch it first:

# Install Jupyter first
pip3.10 install jupyter

# Launch the notebook
jupyter notebook

Data visualization is a versatile skill and can be done in many ways. For example, you can look at the possible relationships between features and labels, or check whether there are "dirty data" and "outliers" in the dataset.

However, before visualizing anything, we need to import the collected data into the runtime environment. For that we use the Pandas data processing toolkit. This package is a powerful tool for manipulating data, and we will use it in every project from now on. Let's import it with an import statement:

# Import the most basic data processing tool
import pandas as pd # import the Pandas data processing toolkit

Then, we use the following code to read the dataset of this project into the Python runtime environment and present it in the form of a DataFrame:

df_ads = pd.read_csv('xxx微信软文.csv') # read in the data
df_ads.head() # show the first few rows

A DataFrame is the 2D, table-like data structure most commonly used in machine learning. In the code above, I used the read_csv API to read the CSV file into a Pandas DataFrame and named it df_ads. The output of this code is as follows:

[Figure: the first few rows of df_ads]

That completes the data import, and we can officially start the visualization. As a rule of thumb, we would guess that there is most likely a linear relationship between "likes" and "views". Is that really the case? Let's draw a chart to verify it.

In this "validation" session, we need to use two packages: one is the Python drawing tool library "Matplotlib", and the other is the statistical data visualization tool library "Seaborn". Both packages are essential toolkits for Python data visualization, and they are part of the default installation package for Anaconda and do not require repeated installation of the pip install statement.

We import both packages with import statements as well. Note that, to keep the code short, I did not import the full matplotlib package, only its plotting module pyplot:

# Import the libraries needed for data visualization
import matplotlib.pyplot as plt # Matplotlib, Python's plotting library
import seaborn as sns # Seaborn, a statistical data visualization library

A linear relationship can be verified simply with a scatter plot. So let's use the plot API from the matplotlib package to draw a scatter plot of "likes" against "views" and see how they are distributed.

plt.plot(df_ads['点赞数'],df_ads['浏览量'],'r.', label='Training data') # draw the scatter plot with matplotlib.pyplot's plot method
plt.xlabel('点赞数') # x-axis label (likes)
plt.ylabel('浏览量') # y-axis label (views)
plt.legend() # show the legend
plt.show() # show the plot

The output result is shown in the following figure:

[Figure: scatter plot of likes (点赞数) vs. views (浏览量)]

As we can see from this chart, the data points are basically concentrated around a line, so there does seem to be a linear relationship between this feature and the label, which gives us useful reference information for choosing a model later.
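If you want to go beyond eyeballing the chart, a quick optional check (an extra sketch, not part of the original walkthrough) is to compute the Pearson correlation between the two columns:

# Optional: quantify the linear relationship with the Pearson correlation coefficient
print(df_ads['点赞数'].corr(df_ads['浏览量']))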

Next, I'll use Seaborn's boxplot tool to draw a boxplot and check whether there are any "outliers" in this dataset. I've chosen the popularity index feature here, but you can also try drawing boxplots for the other features.

data = pd.concat([df_ads['浏览量'], df_ads['热度指数']], axis=1) # views and popularity index
fig = sns.boxplot(x='热度指数', y="浏览量", data=data) # draw the boxplot with seaborn
fig.axis(ymin=0, ymax=800000); # set the y-axis range

Here's the boxplot we output:

[Figure: boxplot of views (浏览量) grouped by popularity index (热度指数)]

A boxplot is built from five numbers: the minimum (min), the lower quartile (Q1), the median, the upper quartile (Q3), and the maximum (max). In statistics this is called the five-number summary. These five values give us a clear picture of how the data is distributed and how spread out it is.

In this kind of plot, the lower quartile, the median, and the upper quartile form the "box". Extension lines reach out from the box toward the extremes; these are the so-called whiskers, whose ends mark the minimum and the maximum. Data points that fall beyond them are drawn separately as outliers.
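As an optional aside, you can also compute the five-number summary of the view counts directly with pandas; this is just a sketch on top of the df_ads we already loaded:

# Optional: the five-number summary of the view counts
df_ads['浏览量'].quantile([0, 0.25, 0.5, 0.75, 1.0])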

Looking at the boxplot above, it is not hard to see that the higher the popularity index, the larger the median number of views. We can also see some outlier data points whose view counts are far higher than those of other articles; these "outliers" are exactly the viral articles we care about.

At this point, the data visualization work is basically done. The next step is data cleaning.

3. Data cleaning

Many people compare data cleaning to "washing the vegetables" before "stir-frying": the cleaner the data, the better the model. Data cleaning generally covers 4 scenarios:

The first is handling missing data. If the missing values can still be found in a backup system, we fill them in from there; if not, we can either drop the incomplete records or fill in the missing values with the mean of other records, a random value, or 0. This filling-in process is called data imputation.

The second is handling duplicate data. If rows are exact duplicates, simply delete the extras. But if two different rows share the same primary key, for example the same ID number appears with two different addresses, we need to check whether any auxiliary information (such as a timestamp) can help us decide which one is correct; if not, we can only delete one at random or keep both.

The third is handling incorrect data. For example, if a product's sales volume or sales amount is negative, the record needs to be deleted or converted to a meaningful positive value. Likewise, a field that represents a percentage or probability is logically incorrect if its value is greater than 1.

The fourth is handling unusable data, which means unifying the data format. For example, some prices may be recorded in yuan and others in dollars, and they need to be unified first. Another common example is converting "yes" and "no" into 1 or 0 before feeding them into a machine learning model.
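To make these four scenarios concrete, here is a minimal, hypothetical pandas sketch; the DataFrame and column names below are invented for illustration and are not part of this project's dataset:

import pandas as pd

# Hypothetical data used only to illustrate the four cleaning scenarios
df = pd.DataFrame({'price': [10.0, None, -3.0, 10.0],
                   'paid': ['yes', 'no', 'yes', 'yes']})

# 1. Missing data: fill with the column mean (or drop with dropna)
df['price'] = df['price'].fillna(df['price'].mean())

# 2. Duplicate data: drop rows that are exact duplicates
df = df.drop_duplicates()

# 3. Incorrect data: a negative price is meaningless, so remove those rows
df = df[df['price'] >= 0]

# 4. Unusable format: map 'yes'/'no' to 1/0 so a model can consume it
df['paid'] = df['paid'].map({'yes': 1, 'no': 0})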

So how do you see if there is dirty data in the dataset?

As far as our project's dataset is concerned, you may have noticed in the DataFrame output that the retweet count in the row with index 6 is NaN, which stands for Not a Number. In Python it marks a value that cannot be represented or computed with, and it is typical dirty data.

[Figure: the DataFrame row with the NaN retweet count]

We can use the DataFrame's isna().sum() method to count all the NaN values. That way we can see both whether NaN values exist and how many there are. If there are too many, the dataset's quality is poor and we need to find out what is wrong with the data source.

df_ads.isna().sum() # count how many NaN values each column contains

The output is as follows:

[Figure: NaN counts for each column]

Here we simply drop the rows that contain NaN:

df_ads = df_ads.dropna() # drop the rows that contain NaN

What about the outliers we saw in the boxplot? If we removed them, the model would fit the ordinary data nicely. But such outliers do exist in real life, and a model trained without them would perform worse on exactly those articles. So it is a trade-off we have to weigh.

We can train one model with these outliers included and one without, and compare the two. Here, I propose keeping the "outliers".
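If you do want to run that comparison, one simple, hypothetical way to build the "without outliers" version is to filter out rows above a high quantile of the view count:

# Hypothetical experiment: drop the "viral" rows above the 99th percentile of views
threshold = df_ads['浏览量'].quantile(0.99)
df_no_outliers = df_ads[df_ads['浏览量'] <= threshold]
print(len(df_ads), len(df_no_outliers)) # how many rows the filter removes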

With that, we have finished a simple cleaning of this data. Different types of data call for different cleaning methods, so we won't cover them all here; we will go into more detail in later projects with their specific datasets. Let's move on to the next sub-step, which is converting the data into a format that is easier for machines to digest: feature engineering.

4. Feature engineering

Feature engineering is a specialized subfield of machine learning, and it is the most creative part of data processing. Let me give you an example of what feature engineering is. Do you know what BMI is? It is your weight divided by the square of your height, and that calculation is itself a piece of feature engineering.

Formula: BMI = weight (kg) ÷ height (m)²

BMI is a single index that replaces the original two features, weight and height, and it describes our body shape more objectively.

So, with this feature engineering, we can take the BMI index as a new feature and feed it into a machine learning model that assesses health.
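As a minimal sketch of that idea in pandas (the health data and column names below are invented for illustration):

import pandas as pd

# Hypothetical health data used only to illustrate BMI feature engineering
health = pd.DataFrame({'weight_kg': [70, 85, 60],
                       'height_m': [1.75, 1.80, 1.65]})

# Replace the two raw features with a single, more informative BMI feature
health['BMI'] = health['weight_kg'] / health['height_m'] ** 2
health = health.drop(['weight_kg', 'height_m'], axis=1)
print(health)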

What are the benefits of this, you might ask? Take the BMI example: it reduces the dimensionality of the feature dataset. The dimensionality is simply the number of features in the dataset. Every additional feature enlarges the feature space the model has to fit and increases the amount of computation. So eliminating redundant features and reducing the dimensionality makes machine learning models train faster.

This is just one of the many benefits of feature engineering, which can also better represent business logic and improve the performance of machine learning models.

Since the problem of this project is relatively simple, the requirements for feature engineering are not high, so we will not do feature engineering here for the time being. Next, let's move on to the next sub-step, which is to build the feature set and label set for machine learning.

5. Build feature sets and label sets

Features are the individual data points we collect; they are the variables fed into the machine learning model. Labels are what we want to predict, judge, or classify. Every supervised learning algorithm needs two sets of data as input: the feature set and the label set. So before building a machine learning model, we need to construct a feature dataset and a label dataset.

The construction process is simple: we just remove the fields we don't need from the original dataset. In this project the features are the number of likes, the number of retweets, the popularity index, and the article rating, so we only need to drop the views column from the original dataset to get the feature set:

X = df_ads.drop(['浏览量'],axis=1) # feature set: drop the label column (views)

The label, on the other hand, is the number of views we want to predict, so the label dataset keeps only the views field:

y = df_ads.浏览量 # label set (views)

Let's take a look at what is in the feature set and the label set.

X.head() # show the first few rows of the feature set
y.head() # show the first few rows of the label set

Because a notebook cell only displays the output of its last expression by default, I put these two statements in separate cells. Their output is shown in the image below:

[Figure: the first few rows of X and y]

As you can see, the feature dataset keeps every field except the number of views, while the label dataset keeps only the views. The original dataset has now been split into the feature set and the label set needed for machine learning.
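(If you prefer to keep both statements in one cell, a simple workaround is to print them explicitly:)

# Alternative: print both explicitly so a single cell shows both outputs
print(X.head())
print(y.head())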

Unsupervised learning algorithms do not require this step, because they have no labels at all.

However, after splitting the original dataset vertically, along the column dimension, into a feature set and a label set, we still need to split it horizontally along the row dimension. You may ask why. Because machine learning does not end once a model has been fitted on the training data: we need a validation set to check how good the model is, and then a test set to check whether the model works on new, unseen data.

So how do you split it? That's what we're going to do.

6. Split the training, validation, and testing sets.

Before splitting, let me point out that for learning projects the validation step is often omitted to keep things simple. Today's project is fairly simple, so we will also skip validation and split only a training set and a test set, with the test set doubling as the validation set.
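For reference, if you did want a separate validation set as well, one common pattern (not used in this project) is to call train_test_split twice, roughly like this:

from sklearn.model_selection import train_test_split # sketch only; the split we actually use follows below

# Roughly a 60/20/20 split: carve off the test set first,
# then split the remainder into training and validation sets
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)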

When splitting, the share of data set aside for testing is typically 20% or 30%. If you have a lot of data, say more than a million rows, you don't necessarily need to reserve that much; generally speaking, tens of thousands of test records are enough. Here I'll split the data in an 80/20 ratio, using the dataset-splitting tool train_test_split from the machine learning toolkit scikit-learn:

# Split the dataset into 80% (training set) and 20% (test set)
from sklearn.model_selection import train_test_split # import the train_test_split tool
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                   test_size=0.2, random_state=0)

Note that although the split is random, we still specify a random_state value so that the program produces the same training set and test set on every run. If the training and test sets were different each time, we would lose a fixed baseline for comparing the model before and after parameter tuning. With the training/test split done, you will see that the original data has become four datasets:

  • Feature Training Set (X_train)
  • Feature Test Set (X_test)
  • Label Training Set (y_train)
  • Label Test Set (y_test)
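A quick optional sanity check is to print the shapes and confirm the split is roughly 80/20:

# Optional sanity check: row counts should be roughly 80% / 20% of the original data
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)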

At this point, all of our data preprocessing is complete.


Step 3 Select an algorithm and build the model

Stay tuned...

Step 4 Train the model

Stay tuned...

Step 5 Model evaluation and optimization

Stay tuned...
