
Classification Algorithms - Logistic Regression

Introduction to Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The target (dependent) variable is dichotomous, meaning there are only two possible classes.

In simple terms, the dependent variable is binary, with data coded as either 1 (success/yes) or 0 (failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be used for various classification problems such as spam detection, diabetes prediction and cancer detection.

Types of Logistic Regression

Logistic regression usually means binary logistic regression, which has a binary target variable, but it can also predict two further categories of target variable. Based on the number and ordering of the categories, logistic regression can be divided into the following types:

Binary or Binomial

In this kind of classification, the dependent variable has only two possible types, 1 or 0. For example, these values may represent success or failure, yes or no, win or loss.

Multinomial

In this kind of classification, the dependent variable can have 3 or more possible unordered types, that is, types with no quantitative significance. For example, these values may represent "Type A", "Type B" or "Type C".

Ordinal

In this kind of classification, the dependent variable can have 3 or more possible ordered types, that is, types with a quantitative significance. For example, these values may represent "poor", "good", "very good" and "excellent", and each category can be given a score such as 0, 1, 2, 3.

Logistic Regression Assumptions

Before diving into the implementation of logistic regression, we must be aware of the following assumptions:

  • In binary logistic regression, the target variable must always be binary, and the desired outcome is represented by factor level 1.

  • There should be no multicollinearity in the model, which means the independent variables must not be highly correlated with each other.

  • We must include meaningful variables in our model.

  • We should choose a large sample size for logistic regression.
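The multicollinearity assumption above can be screened for with a simple pairwise check. The following is only a rough sketch in plain Python: the helper names, the made-up columns and the 0.9 cut-off are all assumptions for illustration, and pairwise correlation catches only the simplest form of multicollinearity.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_1, name_2, r) for every pair with |r| >= threshold.

    The 0.9 threshold is an arbitrary illustrative choice, not a standard.
    """
    names = list(columns)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(columns[names[i]], columns[names[j]])
            if abs(r) >= threshold:
                flagged.append((names[i], names[j], round(r, 3)))
    return flagged

# Hypothetical feature columns: the same height in two units, plus an unrelated age.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41, 23, 35, 52, 29],
}

for name_1, name_2, r in highly_correlated_pairs(columns):
    print(f"{name_1} and {name_2} are highly correlated (r = {r})")
```

Here the two height columns carry the same information, so one of them would normally be dropped before fitting the model.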

Binary Logistic Regression Model

The simplest form of logistic regression is binary or binomial logistic regression, in which the target or dependent variable can have only 2 possible types, 1 or 0. It allows us to model the relationship between multiple predictor variables and a binary/binomial target variable. In logistic regression, the linear function θᵀx is used as the input to another function g, as in the following relation:

$$h_{\theta}(x) = g(\theta^{T}x), \quad \text{where } 0 \le h_{\theta}(x) \le 1$$

Here, g is the logistic or sigmoid function, which is given as follows:

$$g(z) = \frac{1}{1+e^{-z}}, \quad \text{where } z = \theta^{T}x$$
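As a quick numeric illustration of the formula for g, here is a minimal sketch in plain Python. Branching on the sign of z is an extra precaution added here (it is not part of the article's own implementation) so that `math.exp` is never called with a large positive argument.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the algebraically equal form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Evaluate g at a few points: outputs stay between 0 and 1, and g(0) is 0.5.
for z in (-6, -2, 0, 2, 6):
    print(f"g({z:+d}) = {sigmoid(z):.4f}")
```

Large negative inputs map close to 0, large positive inputs map close to 1, and the function is symmetric in the sense that g(z) + g(-z) = 1.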

The sigmoid curve can be represented by the following graph. The values on the y-axis lie between 0 and 1, and the curve crosses the y-axis at 0.5, that is, g(0) = 0.5.

[Figure: the sigmoid curve]

The classes can be divided into positive and negative. The output, which lies between 0 and 1, is interpreted as the probability of the positive class. For our implementation, we interpret the output of the hypothesis function as positive if it is ≥ 0.5, and as negative otherwise.
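The decision rule just described can be sketched in a couple of lines. The function name `predict_label` and the sample probabilities are made up for illustration.

```python
def predict_label(prob_positive: float, threshold: float = 0.5) -> int:
    """Return 1 (positive) if the predicted probability reaches the threshold, else 0."""
    return 1 if prob_positive >= threshold else 0

# Hypothetical outputs of the hypothesis function for four samples.
probs = [0.08, 0.50, 0.73, 0.49]
labels = [predict_label(p) for p in probs]
print(labels)  # [0, 1, 1, 0]
```

Note that the implementation later in this article obtains labels with NumPy's `round()`, which rounds exactly 0.5 down to 0, so the two rules can differ on that single boundary value.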

We also need to define a loss function to measure how well the algorithm performs with a given set of weights, represented by theta, as follows:

$$h = g(X\theta)$$

$$J(\theta) = \frac{1}{m}\left(-y^{T}\log(h) - (1-y)^{T}\log(1-h)\right)$$

Here m is the number of training examples.
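To make the loss formula concrete, the sketch below evaluates J for hand-picked predictions h in plain Python. The function name `log_loss` and the example numbers are assumptions for illustration; the inputs are taken to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def log_loss(h, y):
    """Mean cross-entropy J = (1/m) * sum(-y*log(h) - (1 - y)*log(1 - h))."""
    m = len(y)
    total = 0.0
    for h_i, y_i in zip(h, y):
        total += -y_i * math.log(h_i) - (1 - y_i) * math.log(1 - h_i)
    return total / m

y = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]  # high probability on the true class
confident_wrong = [0.1, 0.9, 0.2, 0.8]  # high probability on the wrong class

print("loss when mostly right:", log_loss(confident_right, y))
print("loss when mostly wrong:", log_loss(confident_wrong, y))
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong, which is the behaviour gradient descent exploits.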

Having defined the loss function, our main goal is to minimize it. This is done by fitting the weights, that is, by increasing or decreasing them. The derivative of the loss function with respect to each weight tells us in which direction, and by how much, that weight should be adjusted.

The gradient used by gradient descent tells us how the loss would change if we modified the parameters. Collecting the partial derivatives ∂J(θ)/∂θ_j for all j into one vector gives:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} X^{T}\left(g(X\theta) - y\right)$$
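One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the loss. The sketch below does this in plain Python on a tiny made-up dataset; all function names, the data and the step size `eps` are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_all(X, theta):
    """h = g(X theta), computed row by row."""
    return [sigmoid(sum(x_j * t_j for x_j, t_j in zip(row, theta))) for row in X]

def loss(X, y, theta):
    h = predict_all(X, theta)
    return sum(-y_i * math.log(h_i) - (1 - y_i) * math.log(1 - h_i)
               for h_i, y_i in zip(h, y)) / len(y)

def gradient(X, y, theta):
    """Analytic gradient (1/m) * X^T (g(X theta) - y), one entry per weight."""
    m = len(y)
    h = predict_all(X, theta)
    return [sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m
            for j in range(len(theta))]

def numeric_gradient(X, y, theta, eps=1e-6):
    """Central finite differences of the loss, one weight at a time."""
    grads = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grads.append((loss(X, y, up) - loss(X, y, down)) / (2 * eps))
    return grads

# Toy data: the first column of ones plays the role of the intercept.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = [0.1, -0.2]

print("analytic:", gradient(X, y, theta))
print("numeric: ", numeric_gradient(X, y, theta))
```

The two printed vectors should agree to several decimal places; a mismatch would point to an error in either the loss or the gradient code.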

Implementation in Python

Now we will implement binomial logistic regression in Python. For this purpose, we use the multivariate flower dataset named 'iris', which has 3 classes of 50 instances each; every class represents a type of iris flower. We will use only the first two feature columns, and, to obtain a binary target, the code below labels the first class as 0 and merges the other two classes into label 1.

First, we need to import the necessary libraries as follows:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
           

Next, load the iris dataset as follows:

iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1
           

We can plot our training data as follows:

plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend();
           
[Figure: scatter plot of the training data, class 0 in green and class 1 in yellow]

Next, we will define the sigmoid function, the loss function and gradient descent as follows:

class LogisticRegression:
   def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
      self.lr = lr
      self.num_iter = num_iter
      self.fit_intercept = fit_intercept
      self.verbose = verbose
   def __add_intercept(self, X):
      intercept = np.ones((X.shape[0], 1))
      return np.concatenate((intercept, X), axis=1)
   def __sigmoid(self, z):
      return 1 / (1 + np.exp(-z))
   def __loss(self, h, y):
      return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
   def fit(self, X, y):
      if self.fit_intercept:
         X = self.__add_intercept(X)
           

Now, still inside the fit method, initialize the weights and run gradient descent as follows:

      # continuation of LogisticRegression.fit: start from all-zero weights
      self.theta = np.zeros(X.shape[1])
      for i in range(self.num_iter):
         z = np.dot(X, self.theta)
         h = self.__sigmoid(z)
         gradient = np.dot(X.T, (h - y)) / y.size
         self.theta -= self.lr * gradient
         z = np.dot(X, self.theta)
         h = self.__sigmoid(z)
         loss = self.__loss(h, y)
         if self.verbose and i % 10000 == 0:
            print(f'loss: {loss} \t')
           

With the help of the following methods, which also belong to the LogisticRegression class, we can predict the output probabilities and the class labels:

   # continuation of the LogisticRegression class
   def predict_prob(self, X):
      if self.fit_intercept:
         X = self.__add_intercept(X)
      return self.__sigmoid(np.dot(X, self.theta))
   def predict(self, X):
      return self.predict_prob(X).round()
           

Next, we can train the model, evaluate it and plot its decision boundary as follows:

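model = LogisticRegression(lr=0.1, num_iter=300000)
model.fit(X, y)  # train first; predict() needs the fitted weights in model.theta
preds = model.predict(X)
print((preds == y).mean())  # accuracy on the training data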

plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend()
x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_prob(grid).reshape(xx1.shape)
plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='red');
           
[Figure: training data with the decision boundary (the 0.5 probability contour) drawn in red]

Multinomial Logistic Regression Model

Another useful form of logistic regression is multinomial logistic regression, in which the target or dependent variable can have 3 or more possible unordered types, that is, types with no quantitative significance.
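One common way to extend the sigmoid to 3 or more unordered classes is the softmax function, which turns one linear score per class into probabilities that sum to 1. The article does not spell out this formulation, so the sketch below is an assumption about how such a model can be written, not a description of what the library used later does internally. The scores and class names are made up.

```python
import math

def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class_names = ["Type A", "Type B", "Type C"]
scores = [2.0, 1.0, 0.1]  # hypothetical linear scores, one per class

probs = softmax(scores)
predicted = class_names[probs.index(max(probs))]

for name, p in zip(class_names, probs):
    print(f"{name}: {p:.3f}")
print("predicted class:", predicted)
```

With only two classes and scores [z, 0], the first softmax probability equals the sigmoid g(z), so the binary model is a special case of this formulation.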

Implementation in Python

Now we will implement multinomial logistic regression in Python. For this purpose, we use the dataset from sklearn named digits.

First, we need to import the necessary libraries as follows:

import sklearn
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
           

Next, we need to load the digits dataset:

digits = datasets.load_digits()
           

Now, define the feature matrix (X) and response vector (y) as follows:

X = digits.data
y = digits.target
           

With the next line of code, we can split X and y into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
           

Now create a logistic regression object as follows:

digreg = linear_model.LogisticRegression()
           

Now, we need to train the model using the training set as follows:

digreg.fit(X_train, y_train)
           

Next, make predictions on the testing set as follows:

y_pred = digreg.predict(X_test)
           

Next, print the accuracy of the model as follows:

print("Accuracy of Logistic Regression model is:",
metrics.accuracy_score(y_test, y_pred)*100)
           

Output

Accuracy of Logistic Regression model is: 95.6884561891516
           

From the above output we can see that the accuracy of our model is around 96 percent.
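The accuracy figure is simply the fraction of test samples whose predicted label equals the true label. The sketch below recomputes that quantity by hand in plain Python; the helper name and the two short label lists are invented for illustration and are not taken from the digits run above.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical digit labels: 6 of the 8 predictions are correct.
y_true = [3, 8, 1, 1, 7, 0, 4, 9]
y_pred = [3, 8, 1, 7, 7, 0, 4, 4]

print("Accuracy:", accuracy(y_true, y_pred) * 100)  # Accuracy: 75.0
```

Accuracy treats every mistake equally, so for datasets with very uneven class sizes it is usually read together with per-class measures.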

Translated from: https://www.tutorialspoint.com/machine_learning_with_python/classification_algorithms_logistic_regression.htm