Machine Learning: Univariate Linear Regression

Linear Regression (LR) is one of the main algorithms in Supervised Machine Learning. It solves many regression problems and is easy to implement. This paper is about Univariate Linear Regression (ULR), the simplest version of LR.

The paper contains the following topics:

  • The basics of datasets in Machine Learning;

  • What is Univariate Linear Regression?

  • How to represent the algorithm (hypothesis), graphs of functions;

  • Cost function (Loss function);

  • Gradient Descent.

The basics of datasets in Machine Learning

In ML problems, some data is provided beforehand to build the model upon. Datasets consist of rows and columns. Each row represents an example, while every column corresponds to a feature.

Then the data is divided into two parts: training and test sets. In percentage terms, the training set typically contains approximately 75% of the data, while the test set has the remaining 25%. The training set is used to build the model. After the model returns a success rate of about 90–95% on the training set, it is tested with the test set. The result on the test set is considered more valid, because the data in the test set is absolutely new to the model.

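To make the split concrete, here is a minimal sketch using scikit-learn's train_test_split; the salary/satisfaction numbers are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: Employee Salary (feature) and Satisfaction Rating (target).
X = np.array([[1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5]])
y = np.array([1.1, 1.6, 1.9, 2.4, 3.1, 3.4, 4.2, 4.4])

# Hold out 25% of the examples as the test set; train on the remaining 75%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```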

What is Univariate Linear Regression?

In Machine Learning problems, the complexity of the algorithm depends on the provided data. When LR is used to build the ML model, it is called Univariate LR if the number of features in the training set is one, and Multivariate LR if the number is higher than one. To learn Linear Regression, it is a good idea to start with Univariate Linear Regression, as it is simpler and better for building a first intuition about the algorithm.

Hypothesis, graphs

To build intuition about the algorithm, I will try to explain it with an example. The example is a set of data on Employee Satisfaction and Salary level.

Figure 1. Raw dataset

As is seen from the picture, there is a linear dependence between the two variables. Here Employee Salary is the “X value”, and Employee Satisfaction Rating is the “Y value”. In this particular case there is only one feature, so Univariate Linear Regression can be used to solve this problem.

In the following picture you will see three different lines.

Figure 2. Three lines on the dataset

This is an already implemented ULR example, but we have three solutions and we need to choose only one of them. Visually we can see that Line 2 is the best one among them, because it fits the data better than both Line 1 and Line 3. This is a rather easy decision to make, and most problems will be harder than that. The following paragraphs are about how to make these decisions precisely with the help of mathematical solutions and equations.

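One way to make that choice precise, previewing the Cost function discussed below, is to score each candidate line by its total squared error on the data. A minimal sketch (the dataset and the three intercept/slope pairs are hypothetical stand-ins for Lines 1–3):

```python
import numpy as np

# Hypothetical dataset and three candidate lines as (intercept, slope) pairs.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.1, 4.2])
candidates = {"Line 1": (0.5, 0.5), "Line 2": (0.0, 1.0), "Line 3": (-1.0, 2.0)}

for name, (b, m) in candidates.items():
    errors = (b + m * X) - y          # prediction minus true value, per point
    print(name, np.sum(errors ** 2))  # the smallest total squared error wins
```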

Now let’s see how to represent the solution of Linear Regression Models (lines) mathematically:

hθ(x) = θ0 + θ1 · x

Here,

  • hθ(x) — the answer of the hypothesis

  • θ0 and θ1 — parameters we have to calculate to fit the line to the data

  • x — the point from the dataset

This is exactly the same as the equation of a line: y = mx + b. As the solution of Univariate Linear Regression is a line, the equation of a line is used to represent the hypothesis (solution).

Let’s look at an example. For instance, there is a point in the provided training set, (x = 1.9; y = 1.9), and the hypothesis h(x) = -1.3 + 2x. When this hypothesis is applied to the point, we get an answer of approximately 2.5.

After the answer is obtained, it should be compared with the y value (1.9 in the example) to check how well the equation works. In this particular example there is a difference of 0.6 between the real value y and the hypothesis. For this particular case 0.6 is a big difference, and it means we need to improve the hypothesis in order to fit it to the dataset better.

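In code the check is just a couple of lines; a minimal sketch of the arithmetic above:

```python
def h(x, theta0=-1.3, theta1=2.0):
    # Univariate hypothesis: h(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

x, y = 1.9, 1.9
prediction = h(x)       # -1.3 + 2 * 1.9 = 2.5
error = prediction - y  # 2.5 - 1.9 = 0.6, a large miss for this dataset
```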

But here comes the question: how can the value of h(x) be manipulated to make it as close as possible to y? In order to answer the question, let's analyze the equation. There are three parameters: θ0, θ1, and x. X is from the dataset, so it cannot be changed (in the example the pair is (1.9; 1.9); if you get h(x) = 2.5, you cannot change the point to (1.9; 2.5)). So we are left with only two parameters (θ0 and θ1) to optimize the equation. In optimization, two functions play important roles: the Cost function, which measures how well the hypothesis fits the data, and Gradient descent, which improves the solution.

Cost function (Loss function)

In the examples above, we did some comparisons in order to determine whether the line fits the data or not. In the first one, it was just a choice between three lines; in the second, a simple subtraction. But how will we evaluate models for complicated datasets? This is where the Cost function comes to our aid. By a simple definition, the Cost function evaluates how well the model (a line in the case of LR) fits the training set. There are various versions of the Cost function, but we will use the one below for ULR:

J(θ0, θ1) = (1 / 2m) · Σ (hθ(x(i)) − y(i))²,   summing over the m training examples (i = 1, …, m)

Here,

  • m — number of examples in the training set;

  • h — answer of the hypothesis;

  • y — y values of points in the dataset.

The optimization level of the model is related to the value of the Cost function. The smaller the value is, the better the model is. Why? The answer is simple: the Cost is equal to the sum of the squared differences between the value of the hypothesis and y. If all the points were on the line, there would not be any difference and the answer would be zero. To put it another way, if the points were far away from the line, the answer would be a very large number. To sum up, the aim is to make it as small as possible.

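A direct translation of the formula into code might look like this sketch (vectorized with NumPy; the function name and data are illustrative):

```python
import numpy as np

def cost(theta0, theta1, X, y):
    # J(theta0, theta1) = (1 / 2m) * sum((h(x_i) - y_i) ** 2)
    m = len(X)
    predictions = theta0 + theta1 * X  # h applied to every example at once
    return np.sum((predictions - y) ** 2) / (2 * m)

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.1, 4.2])
print(cost(0.0, 1.0, X, y))  # small value: the line y = x fits these points well
```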

So, from this point, we will try to minimize the value of the Cost function.

Gradient Descent

In order to get proper intuition about the Gradient Descent algorithm, let's first look at some graphs.

[Graph: the Cost function J plotted against theta, a parabola with a single minimum]

This is the graph of the Cost function as it depends on theta. As mentioned above, the optimal solution is where the value of the Cost function is at its minimum. In Univariate Linear Regression the graph of the Cost function is always a parabola, and the solution is its minimum.

Gradient Descent is the algorithm that finds that minimum:

θj := θj − α · ∂J(θ0, θ1) / ∂θj    (repeated until convergence, updating every θj simultaneously)

Here,

  • α — learning rate;

The equation may seem a little bit confusing, so let's go over it step by step.

  1. What is this symbol — ‘:=’?

  • Firstly, it is not the same as ‘=’. ‘:=’ means “update the left-side value”; here it is not possible to use ‘=’ mathematically, because a number cannot be equal to itself minus something else (zero being the exception in this case).

2. What is ‘j’?

  • ‘j’ is related to the number of features in the dataset. In Univariate Linear Regression there is only one feature, so j takes 2 values (for θ0 and θ1): the number of parameters is the number of features + 1.

3. What is ‘alpha’?

  • ‘alpha’ is the learning rate. Its value is usually between 0.001 and 0.1, and it is a positive number. If it is too high, the algorithm may ‘jump’ over the minimum and diverge from the solution. If it is too low, convergence will be slow. In most cases several values of ‘alpha’ are tried and the best one is picked, as in the toy sketch after this item.

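A toy illustration of those failure modes: minimizing f(θ) = θ², whose derivative is 2θ, with three hypothetical learning rates (a sketch, not part of the original article's code):

```python
# Gradient descent on f(theta) = theta^2, so the update is
# theta := theta - alpha * 2 * theta.
for alpha in (0.01, 0.1, 1.1):
    theta = 5.0
    for _ in range(50):
        theta = theta - alpha * 2 * theta
    print(alpha, theta)  # 0.01: slow convergence; 0.1: near zero; 1.1: diverges
```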

4. The partial derivative term.

  • Cost function mentioned above:

    J(θ0, θ1) = (1 / 2m) · Σ (hθ(x(i)) − y(i))²
  • Cost function with definition of h(x) substituted:

    J(θ0, θ1) = (1 / 2m) · Σ (θ0 + θ1 · x(i) − y(i))²
  • Derivative of Cost function:

    ∂J / ∂θ0 = (1 / m) · Σ (hθ(x(i)) − y(i))
    ∂J / ∂θ1 = (1 / m) · Σ (hθ(x(i)) − y(i)) · x(i)
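As a quick sanity check on those formulas, the analytic derivative can be compared against a finite-difference approximation of the Cost function, a common debugging trick (the numbers here are illustrative):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.1, 4.2])
theta0, theta1, eps = 0.5, 0.5, 1e-6

def J(t0, t1):
    return np.sum((t0 + t1 * X - y) ** 2) / (2 * len(X))

analytic = np.sum(theta0 + theta1 * X - y) / len(X)  # dJ/d(theta0) from the formula
numeric = (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)
print(analytic, numeric)  # the two values agree to several decimal places
```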

5. Why is the derivative used, and why is the sign before alpha negative?

  • The value of the derivative is the slope. The example graphs below show why the derivative is so useful for finding the minimum.

[Graph: a point on the Cost function parabola where the slope (derivative) is positive]

In the first graph above, the slope (the derivative) is positive. As is seen, the intersection point of the line and the parabola should move towards the left in order to reach the optimum. For that, the X value (theta) should decrease. Now let's recall the equation of Gradient descent: alpha is positive, the derivative is positive (for this example), and the sign in front is negative. Overall the update term is negative, and theta will be decreased.

[Graph: a point on the Cost function parabola where the slope (derivative) is negative]

In the second example, the slope (the derivative) is negative. As is seen, the intersection point of the line and the parabola should move towards the right in order to reach the optimum. For that, the X value (theta) should increase. Now let's recall the equation of Gradient descent: alpha is positive, the derivative is negative (for this example), and the sign in front is negative. Overall the update term is positive, and theta will be increased.

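Putting the hypothesis, the Cost function, and the update rule together, here is a minimal end-to-end sketch of ULR trained by Gradient Descent (the data and hyperparameters are illustrative, not from the article):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by minimizing the squared-error cost."""
    m = len(X)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * X) - y  # h(x_i) - y_i for every example
        grad0 = np.sum(error) / m          # dJ/d(theta0)
        grad1 = np.sum(error * X) / m      # dJ/d(theta1)
        theta0 -= alpha * grad0            # ':=' in the equations becomes an
        theta1 -= alpha * grad1            # in-place update of both parameters
    return theta0, theta1

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.1, 4.2])
theta0, theta1 = gradient_descent(X, y)
print(theta0, theta1)  # close to y = x, matching the data above
```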

The coming section will be about Multivariate Linear Regression.

接下來的部分将涉及多元線性回歸。

Thank you.

Translated from: https://medium.com/swlh/machine-learning-univariate-linear-regression-1acddb85aa0b
