
How to build a model to estimate the click-through rate of 10w+ advertorials?

Author: Big Data and Artificial Intelligence Sharing

A machine learning project roughly goes through 5 steps from start to finish: defining the problem, collecting and preprocessing data, selecting an algorithm and determining the model, training and fitting the model, and evaluating and optimizing model performance. These 5 steps form a cyclic, iterative process, as shown in the figure below:

[Figure: the 5 iterative steps of a machine learning project]

All of our projects follow these 5 steps, which I refer to as the 5 Steps in Action.

Step 1 Define the problem

Let's start with the first step: defining the problem. To define the problem, we need to analyze the business scenario, set a clear goal, and determine which type of machine learning problem we are facing. If we don't figure this out, we won't be able to choose a model later.

So first, let's understand the business scenario of this project. Suppose you have joined the operations department of "Easy Speed Flowers" and are analyzing the effectiveness of the promotional copy on its WeChat official account. You have collected a lot of advertorial data, including the number of likes, retweets, views, and so on, like this:

[Figure: sample rows of the collected advertorial data]

Once an article on a WeChat official account passes 100,000 views, the platform no longer displays the exact view count; it only shows "10w+". So the goal of our project is to build a machine learning model that estimates how many views an article actually achieved, based on metrics such as the number of likes and retweets.

Because the number of views is what we want to estimate, in this dataset the number of likes, the number of retweets, the popularity index, and the article rating are the 4 features, and the number of views is the label. Since we already have a label to estimate, this is a supervised learning problem. And since the label is a continuous numeric value, it is a regression problem.

It is not hard to see that there is a clear correlation between the features and the label in this dataset: articles with more likes and retweets tend to have more views. But what specific function describes this correlation? We don't know yet, so finding it is exactly the task of this project.

Now that we have the problem defined, the next step is data collection and preprocessing.

Step 2 Collect data and preprocess

"Collecting data and preprocessing" appears in every machine learning project, and its role is to provide good fuel for the machine learning model: the better the data, the better the model runs. This step sounds like a single sentence, but it actually contains 6 smaller sub-steps:

(1) collect data;

(2) data visualization;

(3) data cleaning;

(4) feature engineering;

(5) construct feature sets and label sets;

(6) split the training set, validation set, and test set.

You may not understand the meaning of these 6 steps at first glance. Don't worry, I'll explain them one by one.

1. Collect data

The first step is to collect data. In reality, data collection is usually hard work: you have to add plenty of tracking points (data burying) in the product's operation flow to capture behavioral data such as user purchases and interest preferences, and sometimes you also have to crawl data from the web. Once the dataset is in hand, the next thing to do is data visualization, that is, to observe the data visually and get a feel for which machine learning model might fit.

2. Data visualization

We'll run all of the code below in a Jupyter notebook. If you don't have it set up yet, install and launch it first:

# Install Jupyter first
pip3.10 install jupyter

# Launch the notebook
jupyter notebook

Data visualization is a versatile skill and can be done in many ways. For example, you can look at the possible relationships between features and labels, or check whether there are "dirty data" and "outliers" in the dataset.

However, before visualizing anything, we need to import the collected data into the runtime environment. For that we use the Pandas data processing toolkit. This package is a powerful tool for manipulating data, and we will use it in every project from now on. Let's import it with an import statement:

# Import the most basic data processing tool
import pandas as pd # import the Pandas data processing toolkit

Then, we use the following code to read the dataset of this project into the Python runtime environment and present it in the form of a DataFrame:

df_ads = pd.read_csv('xxx微信软文.csv') # read in the data
df_ads.head() # show the first few rows

A DataFrame is the 2D, table-like data structure most commonly used in machine learning. In the code above, I used the read_csv API to read the CSV file into a Pandas DataFrame and named it df_ads. The output of this code is as follows:

[Figure: the first few rows of df_ads]

That completes the data import, and we can officially start the visualization. As a rule of thumb, we would guess that there is most likely a linear relationship between "likes" and "views". Is that really the case? Let's draw a chart to verify it.

In this "validation" session, we need to use two packages: one is the Python drawing tool library "Matplotlib", and the other is the statistical data visualization tool library "Seaborn". Both packages are essential toolkits for Python data visualization, and they are part of the default installation package for Anaconda and do not require repeated installation of the pip install statement.

We import both packages with import statements as well. Note that, to keep the code short, I did not import the full matplotlib package, only its plotting module pyplot:

# Import the libraries needed for data visualization
import matplotlib.pyplot as plt # Matplotlib, Python's plotting library
import seaborn as sns # Seaborn, a statistical data visualization library

A linear relationship can be verified simply with a scatter plot. So let's use the plot API from the matplotlib package to draw a scatter plot of "likes" against "views" and see how they are distributed.

plt.plot(df_ads['点赞数'],df_ads['浏览量'],'r.', label='Training data') # draw the scatter plot with matplotlib.pyplot's plot method
plt.xlabel('点赞数') # x-axis label (likes)
plt.ylabel('浏览量') # y-axis label (views)
plt.legend() # show the legend
plt.show() # show the plot

The output result is shown in the following figure:

[Figure: scatter plot of likes (点赞数) vs. views (浏览量)]

As we can see from this chart, the data points are basically concentrated around a line, so there does seem to be a linear relationship between this feature and the label, which gives us useful reference information for choosing a model later.
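If you want to go beyond eyeballing the chart, a quick optional check (an extra sketch, not part of the original walkthrough) is to compute the Pearson correlation between the two columns:

# Optional: quantify the linear relationship with the Pearson correlation coefficient
print(df_ads['点赞数'].corr(df_ads['浏览量']))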

Next, I'll use Seaborn's boxplot tool to draw a boxplot and check whether there are any "outliers" in this dataset. I've chosen the popularity index feature here, but you can also try drawing boxplots for the other features.

data = pd.concat([df_ads['浏览量'], df_ads['热度指数']], axis=1) # views and popularity index
fig = sns.boxplot(x='热度指数', y="浏览量", data=data) # draw the boxplot with seaborn
fig.axis(ymin=0, ymax=800000); # set the y-axis range

Here's the boxplot we output:

[Figure: boxplot of views (浏览量) grouped by popularity index (热度指数)]

A boxplot is built from five numbers: the minimum (min), the lower quartile (Q1), the median, the upper quartile (Q3), and the maximum (max). In statistics this is called the five-number summary. These five values give us a clear picture of how the data is distributed and how spread out it is.

In this kind of plot, the lower quartile, the median, and the upper quartile form the "box". Extension lines reach out from the box toward the extremes; these are the so-called whiskers, whose ends mark the minimum and the maximum. Data points that fall beyond them are drawn separately as outliers.
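As an optional aside, you can also compute the five-number summary of the view counts directly with pandas; this is just a sketch on top of the df_ads we already loaded:

# Optional: the five-number summary of the view counts
df_ads['浏览量'].quantile([0, 0.25, 0.5, 0.75, 1.0])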

Looking at the boxplot above, it is not hard to see that the higher the popularity index, the larger the median number of views. We can also see some outlier data points whose view counts are far higher than those of other articles; these "outliers" are exactly the viral articles we care about.

At this point, the data visualization work is basically done. The next step is data cleaning.

3. Data cleaning

Many people compare data cleaning to "washing the vegetables" before "stir-frying": the cleaner the data, the better the model. Data cleaning generally covers 4 scenarios:

The first is handling missing data. If the missing values can still be found in a backup system, we fill them in from there; if not, we can either drop the incomplete records or fill in the missing values with the mean of other records, a random value, or 0. This filling-in process is called data imputation.

The second is handling duplicate data. If rows are exact duplicates, simply delete the extras. But if two different rows share the same primary key, for example the same ID number appears with two different addresses, we need to check whether any auxiliary information (such as a timestamp) can help us decide which one is correct; if not, we can only delete one at random or keep both.

The third is handling incorrect data. For example, if a product's sales volume or sales amount is negative, the record needs to be deleted or converted to a meaningful positive value. Likewise, a field that represents a percentage or probability is logically incorrect if its value is greater than 1.

The fourth is handling unusable data, which means unifying the data format. For example, some prices may be recorded in yuan and others in dollars, and they need to be unified first. Another common example is converting "yes" and "no" into 1 or 0 before feeding them into a machine learning model.
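To make these four scenarios concrete, here is a minimal, hypothetical pandas sketch; the DataFrame and column names below are invented for illustration and are not part of this project's dataset:

import pandas as pd

# Hypothetical data used only to illustrate the four cleaning scenarios
df = pd.DataFrame({'price': [10.0, None, -3.0, 10.0],
                   'paid': ['yes', 'no', 'yes', 'yes']})

# 1. Missing data: fill with the column mean (or drop with dropna)
df['price'] = df['price'].fillna(df['price'].mean())

# 2. Duplicate data: drop rows that are exact duplicates
df = df.drop_duplicates()

# 3. Incorrect data: a negative price is meaningless, so remove those rows
df = df[df['price'] >= 0]

# 4. Unusable format: map 'yes'/'no' to 1/0 so a model can consume it
df['paid'] = df['paid'].map({'yes': 1, 'no': 0})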

So how do you see if there is dirty data in the dataset?

As far as our project's dataset is concerned, you may have noticed in the DataFrame output that the retweet count in the row with index 6 is NaN, which stands for Not a Number. In Python it marks a value that cannot be represented or computed with, and it is typical dirty data.

[Figure: the DataFrame row with the NaN retweet count]

We can use the DataFrame's isna().sum() method to count all the NaN values. That way we can see both whether NaN values exist and how many there are. If there are too many, the dataset's quality is poor and we need to find out what is wrong with the data source.

df_ads.isna().sum() # count how many NaN values each column contains

The output is as follows:

[Figure: NaN counts for each column]

Here we simply drop the rows that contain NaN:

df_ads = df_ads.dropna() # drop the rows that contain NaN

What about the outliers we saw in the boxplot? If we removed them, the model would fit the ordinary data nicely. But such outliers do exist in real life, and a model trained without them would perform worse on exactly those articles. So it is a trade-off we have to weigh.

We can train one model with these outliers included and one without, and compare the two. Here, I propose keeping the "outliers".
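If you do want to run that comparison, one simple, hypothetical way to build the "without outliers" version is to filter out rows above a high quantile of the view count:

# Hypothetical experiment: drop the "viral" rows above the 99th percentile of views
threshold = df_ads['浏览量'].quantile(0.99)
df_no_outliers = df_ads[df_ads['浏览量'] <= threshold]
print(len(df_ads), len(df_no_outliers)) # how many rows the filter removes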

With that, we have finished a simple cleaning of this data. Different types of data call for different cleaning methods, so we won't cover them all here; we will go into more detail in later projects with their specific datasets. Let's move on to the next sub-step, which is converting the data into a format that is easier for machines to digest: feature engineering.

4. Feature engineering

Feature engineering is a specialized subfield of machine learning, and it is the most creative part of data processing. Let me give you an example of what feature engineering is. Do you know what BMI is? It is your weight divided by the square of your height, and that calculation is itself a piece of feature engineering.

Formula: BMI = weight (kg) ÷ height (m)²

BMI is a single index that replaces the original two features, weight and height, and it describes our body shape more objectively.

So, with this feature engineering, we can take the BMI index as a new feature and feed it into a machine learning model that assesses health.
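As a minimal sketch of that idea in pandas (the health data and column names below are invented for illustration):

import pandas as pd

# Hypothetical health data used only to illustrate BMI feature engineering
health = pd.DataFrame({'weight_kg': [70, 85, 60],
                       'height_m': [1.75, 1.80, 1.65]})

# Replace the two raw features with a single, more informative BMI feature
health['BMI'] = health['weight_kg'] / health['height_m'] ** 2
health = health.drop(['weight_kg', 'height_m'], axis=1)
print(health)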

What are the benefits of this, you might ask? Take the BMI example: it reduces the dimensionality of the feature dataset. The dimensionality is simply the number of features in the dataset. Every additional feature enlarges the feature space the model has to fit and increases the amount of computation. So eliminating redundant features and reducing the dimensionality makes machine learning models train faster.

This is just one of the many benefits of feature engineering, which can also better represent business logic and improve the performance of machine learning models.

Since the problem of this project is relatively simple, the requirements for feature engineering are not high, so we will not do feature engineering here for the time being. Next, let's move on to the next sub-step, which is to build the feature set and label set for machine learning.

5. Build feature sets and label sets

Features are the individual data points we collect; they are the variables fed into the machine learning model. Labels are what we want to predict, judge, or classify. Every supervised learning algorithm needs two sets of data as input: the feature set and the label set. So before building a machine learning model, we need to construct a feature dataset and a label dataset.

The construction process is simple: we just remove the fields we don't need from the original dataset. In this project the features are the number of likes, the number of retweets, the popularity index, and the article rating, so we only need to drop the views column from the original dataset to get the feature set:

X = df_ads.drop(['浏览量'],axis=1) # feature set: drop the label column (views)

The label, on the other hand, is the number of views we want to predict, so the label dataset keeps only the views field:

y = df_ads.浏览量 # label set (views)

Let's take a look at what is in the feature set and the label set.

X.head() # show the first few rows of the feature set
y.head() # show the first few rows of the label set

Because a notebook cell only displays the output of its last expression by default, I put these two statements in separate cells. Their output is shown in the image below:

[Figure: the first few rows of X and y]

As you can see, the feature dataset keeps every field except the number of views, while the label dataset keeps only the views. The original dataset has now been split into the feature set and the label set needed for machine learning.
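(If you prefer to keep both statements in one cell, a simple workaround is to print them explicitly:)

# Alternative: print both explicitly so a single cell shows both outputs
print(X.head())
print(y.head())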

Unsupervised learning algorithms do not require this step, because they have no labels at all.

However, after splitting the original dataset vertically, along the column dimension, into a feature set and a label set, we still need to split it horizontally along the row dimension. You may ask why. Because machine learning does not end once a model has been fitted on the training data: we need a validation set to check how good the model is, and then a test set to check whether the model works on new, unseen data.

So how do you split it? That's what we're going to do.

6. Split the training, validation, and testing sets.

Before splitting, let me point out that for learning projects the validation step is often omitted to keep things simple. Today's project is fairly simple, so we will also skip validation and split only a training set and a test set, with the test set doubling as the validation set.
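For reference, if you did want a separate validation set as well, one common pattern (not used in this project) is to call train_test_split twice, roughly like this:

from sklearn.model_selection import train_test_split # sketch only; the split we actually use follows below

# Roughly a 60/20/20 split: carve off the test set first,
# then split the remainder into training and validation sets
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)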

When splitting, the share of data set aside for testing is typically 20% or 30%. If you have a lot of data, say more than a million rows, you don't necessarily need to reserve that much; generally speaking, tens of thousands of test records are enough. Here I'll split the data in an 80/20 ratio, using the dataset-splitting tool train_test_split from the machine learning toolkit scikit-learn:

# Split the dataset into 80% (training set) and 20% (test set)
from sklearn.model_selection import train_test_split # import the train_test_split tool
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                   test_size=0.2, random_state=0)

Note that although the split is random, we still specify a random_state value so that the program produces the same training set and test set on every run. If the training and test sets were different each time, we would lose a fixed baseline for comparing the model before and after parameter tuning. With the training/test split done, you will see that the original data has become four datasets:

  • Feature Training Set (X_train)
  • Feature Test Set (X_test)
  • Label Training Set (y_train)
  • Label Test Set (y_test)
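A quick optional sanity check is to print the shapes and confirm the split is roughly 80/20:

# Optional sanity check: row counts should be roughly 80% / 20% of the original data
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)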

At this point, all of our data preprocessing is complete.


Step 3 Select an algorithm and build the model

Stay tuned...

Step 4 Train the model

Stay tuned...

Step 5 Model evaluation and optimization

Stay tuned...
