
Classification Algorithms - Logistic Regression

Introduction to Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The target, or dependent variable, is dichotomous, which means there are only two possible classes.

In simple words, the dependent variable is binary, with data coded as either 1 (success/yes) or 0 (failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be used for various classification problems such as spam detection, diabetes prediction and cancer detection.

Types of Logistic Regression

Generally, logistic regression means binary logistic regression with a binary target variable, but it can also predict target variables with more than two categories. Based on the number of categories, logistic regression can be divided into the following types −

Binary or Binomial

In this kind of classification, the dependent variable has only two possible types, either 1 or 0. For example, these values may represent success or failure, yes or no, win or loss.

Multinomial

In this kind of classification, the dependent variable can have 3 or more possible unordered types, that is, types with no quantitative significance. For example, these values may represent “Type A”, “Type B” or “Type C”.

Ordinal

In this kind of classification, the dependent variable can have 3 or more possible ordered types, that is, types with a quantitative significance. For example, these values may represent “poor”, “good”, “very good” and “excellent”, and each category can have a score such as 0, 1, 2, 3.

Logistic Regression Assumptions

Before diving into the implementation of logistic regression, we must be aware of the following assumptions it makes −

  • In binary logistic regression, the target variable must always be binary, and the desired outcome is represented by factor level 1.

  • There should not be any multicollinearity in the model, which means the independent variables should not be highly correlated with each other.

  • We must include meaningful variables in our model.

  • We should choose a large sample size for logistic regression.
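One simple way to look for multicollinearity is to check pairwise correlations between the independent variables. The following is only a minimal sketch using the standard library; the helper `pearson` and the toy columns `x1`, `x2`, `x3` are assumptions made for illustration, and pairwise correlation will not catch every form of collinearity.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical toy features: x2 is almost exactly 2 * x1, x3 is unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9]
x3 = [3.0, 1.0, 4.0, 1.0, 5.0]

r12 = pearson(x1, x2)
r13 = pearson(x1, x3)
print(f"corr(x1, x2) = {r12:.3f}")  # close to 1: x1 and x2 carry the same information
print(f"corr(x1, x3) = {r13:.3f}")  # much weaker relationship
```

A correlation close to 1 or -1 between two predictors suggests that one of them could be dropped or combined with the other before fitting the model.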

Binary Logistic Regression Model

The simplest form of logistic regression is binary or binomial logistic regression, in which the target or dependent variable can have only 2 possible types, either 1 or 0. It allows us to model the relationship between multiple predictor variables and a binary/binomial target variable. In logistic regression, the linear function is used as the input to another function g, as in the following relation −

$$h_{\theta}(x) = g(\theta^{T}x), \quad \text{where } 0 \le h_{\theta}(x) \le 1$$

Here, g is the logistic or sigmoid function, which is given as follows −

$$g(z) = \frac{1}{1+e^{-z}}, \quad \text{where } z = \theta^{T}x$$
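The sigmoid formula above can be tried out directly. This is a small sketch using only the standard library; the sample inputs are arbitrary, and `math.exp` would overflow for very large negative `z`, which this sketch does not guard against.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs map close to 0, large positive inputs close to 1.
for z in (-6, -2, 0, 2, 6):
    print(f"g({z:+d}) = {sigmoid(z):.4f}")
```

At z = 0 the function returns exactly 0.5, and g(z) + g(-z) = 1 for any z, which is why 0.5 is the natural midpoint of the curve.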

The sigmoid curve can be represented with the help of the following graph. We can see that its values on the y-axis lie between 0 and 1 and that the curve crosses the y-axis at 0.5.

[Figure: sigmoid curve]

The classes can be divided into positive and negative. The output of the hypothesis function lies between 0 and 1 and is interpreted as the probability of the positive class. For our implementation, we interpret the output as positive if it is ≥ 0.5, otherwise negative.
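The decision rule can be written as an explicit comparison. This is a minimal sketch; the function name `classify` and the sample probabilities are made up for illustration.

```python
def classify(prob, threshold=0.5):
    """Return 1 (positive) if prob >= threshold, else 0 (negative)."""
    return 1 if prob >= threshold else 0

probs = [0.03, 0.49, 0.5, 0.51, 0.97]
labels = [classify(p) for p in probs]
print(labels)  # [0, 0, 1, 1, 1]
```

Writing the comparison explicitly makes the boundary case unambiguous: a probability of exactly 0.5 is labelled positive here, whereas Python's built-in `round(0.5)` returns 0.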

We also need to define a loss function to measure how well the algorithm performs with a given set of weights, represented by theta, as follows −

$$h = g(X\theta)$$

$$J(\theta) = \frac{1}{m}\left(-y^{T}\log(h) - (1-y)^{T}\log(1-h)\right)$$
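To get a feel for the loss J(θ), it can be evaluated by hand on a few rows. The sketch below uses plain Python lists instead of matrices; the toy data and the two weight vectors are assumptions chosen only to show that better weights give a smaller loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(theta, X, y):
    """Average cross-entropy J(theta) over the rows of X."""
    total = 0.0
    for row, label in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, row)))
        total += -label * math.log(h) - (1 - label) * math.log(1 - h)
    return total / len(y)

# Hypothetical toy data: the first column is the intercept term (always 1).
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

loss_zero = log_loss([0.0, 0.0], X, y)   # all-zero weights: every prediction is 0.5
loss_good = log_loss([-4.5, 2.0], X, y)  # weights that separate the two classes

print(f"loss with zero weights:   {loss_zero:.4f}")
print(f"loss with better weights: {loss_good:.4f}")
```

With all-zero weights every prediction is 0.5, so the loss equals log(2); weights that push the predictions toward the correct labels lower it.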

Now, after defining the loss function, our prime goal is to minimize it. This is done by fitting the weights, that is, by increasing or decreasing them. The derivatives of the loss function with respect to each weight tell us which weights should be increased and which should be decreased.

The following gradient equation, used by gradient descent, tells us how the loss changes when we modify the parameters −

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m}X^{T}\left(g(X\theta) - y\right)$$
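A common sanity check for a gradient formula like this one is to compare it with a finite-difference approximation of the loss. The sketch below does that on made-up data with plain Python lists; the starting weights, the step `eps` and the learning rate are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(theta, X, y):
    """Average cross-entropy J(theta)."""
    total = 0.0
    for row, label in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, row)))
        total += -label * math.log(h) - (1 - label) * math.log(1 - h)
    return total / len(y)

def gradient(theta, X, y):
    """Analytic gradient (1/m) * X^T (g(X theta) - y), one entry per weight."""
    m = len(y)
    grad = [0.0] * len(theta)
    for row, label in zip(X, y):
        error = sigmoid(sum(t * x for t, x in zip(theta, row))) - label
        for j, x in enumerate(row):
            grad[j] += error * x / m
    return grad

def numeric_gradient(theta, X, y, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((loss(up, X, y) - loss(down, X, y)) / (2 * eps))
    return grad

# Hypothetical toy data: the first column is the intercept term (always 1).
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = [0.1, -0.2]

analytic = gradient(theta, X, y)
numeric = numeric_gradient(theta, X, y)
print("analytic:", analytic)
print("numeric: ", numeric)

# One gradient descent step with an assumed learning rate of 0.1.
lr = 0.1
new_theta = [t - lr * g for t, g in zip(theta, analytic)]
print("loss before:", loss(theta, X, y), "after:", loss(new_theta, X, y))
```

If the two gradients agree to several decimal places, the formula is implemented consistently with the loss, and a small step against the gradient should reduce the loss.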

Implementation in Python

Now we will implement the above concept of binomial logistic regression in Python. For this purpose, we use the multivariate flower dataset named ‘iris’, which has 3 classes of 50 instances each; every class represents a type of iris flower. We will use only the first two feature columns.

First, we need to import the necessary libraries as follows −

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
           

Next, load the iris dataset as follows. The target is recoded as binary: class 0 stays 0 and the other two classes become 1 −

iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1
           

We can plot our training data as follows −

plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend();
           
[Figure: scatter plot of the training data]

Next, we will define the sigmoid function, the loss function and gradient descent as follows −

class LogisticRegression:
   def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
      self.lr = lr
      self.num_iter = num_iter
      self.fit_intercept = fit_intercept
      self.verbose = verbose
   def __add_intercept(self, X):
      intercept = np.ones((X.shape[0], 1))
      return np.concatenate((intercept, X), axis=1)
   def __sigmoid(self, z):
      return 1 / (1 + np.exp(-z))
   def __loss(self, h, y):
      return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
   def fit(self, X, y):
      if self.fit_intercept:
         X = self.__add_intercept(X)
           

Now, still inside fit, initialize the weights to zero and update them by gradient descent as follows −

      # Continuation of fit(): start from all-zero weights
      self.theta = np.zeros(X.shape[1])
      for i in range(self.num_iter):
         # Gradient step: theta <- theta - lr * (1/m) X^T (h - y)
         z = np.dot(X, self.theta)
         h = self.__sigmoid(z)
         gradient = np.dot(X.T, (h - y)) / y.size
         self.theta -= self.lr * gradient
         # Recompute the loss with the updated weights for logging
         z = np.dot(X, self.theta)
         h = self.__sigmoid(z)
         loss = self.__loss(h, y)
         if self.verbose and i % 10000 == 0:
            print(f'loss: {loss} \t')
           

With the help of the following methods of the same class, we can predict the output probabilities and the class labels −

   def predict_prob(self, X):
      if self.fit_intercept:
         X = self.__add_intercept(X)
      return self.__sigmoid(np.dot(X, self.theta))
   def predict(self, X):
      return self.predict_prob(X).round()
           

Next, we can train and evaluate the model and plot its decision boundary as follows −

model = LogisticRegression(lr=0.1, num_iter=300000)
model.fit(X, y)  # train before predicting; predict() needs model.theta
preds = model.predict(X)
print((preds == y).mean())  # accuracy on the training data

plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend()
# Evaluate the model on a grid covering the data range
x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_prob(grid).reshape(xx1.shape)
# The 0.5 probability contour is the decision boundary
plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='red')
           
[Figure: training data with the decision boundary]

Multinomial Logistic Regression Model

Another useful form of logistic regression is multinomial logistic regression, in which the target or dependent variable can have 3 or more possible unordered types, i.e. types with no quantitative significance.

Implementation in Python

Now we will implement the above concept of multinomial logistic regression in Python. For this purpose, we use the digits dataset from sklearn.

First, we need to import the necessary libraries as follows −

import sklearn
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
           

Next, we need to load the digits dataset −

digits = datasets.load_digits()
           

Now, define the feature matrix (X) and response vector (y) as follows −

X = digits.data
y = digits.target
           

With the help of the next line of code, we can split X and y into training and testing sets, holding out 40% of the samples for testing −

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
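Conceptually, such a split shuffles the rows and sets a fraction of them aside. The sketch below illustrates the idea with the standard library only; `simple_split` is a hypothetical helper, not sklearn's implementation, so it will not select the same rows as `train_test_split` with `random_state=1`.

```python
import random

def simple_split(X, y, test_size=0.4, seed=1):
    """Shuffle row indices and hold out a fraction of the rows for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical toy data: 10 rows, label is the parity of the first feature.
X = [[i, i * i] for i in range(10)]
y = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = simple_split(X, y)
print(len(X_train), len(X_test))  # 6 4
```

The important property is that each row keeps its own label after shuffling and that no row appears in both sets.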
           

Now create a logistic regression object as follows −

digreg = linear_model.LogisticRegression()
           

Now, we need to train the model on the training set as follows −

digreg.fit(X_train, y_train)
           

Next, make predictions on the testing set as follows −

y_pred = digreg.predict(X_test)
           

Next, print the accuracy of the model as follows −

print("Accuracy of Logistic Regression model is:",
metrics.accuracy_score(y_test, y_pred)*100)
           

Output

Accuracy of Logistic Regression model is: 95.6884561891516
           

From the above output, we can see that the accuracy of our model is around 96 percent.
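The accuracy reported here is, as far as the default behaviour of `metrics.accuracy_score` goes, the fraction of test samples whose predicted label matches the true label. A hand-written equivalent, shown on made-up labels rather than the real digits predictions, might look like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels for eight digit images; the last two are misclassified.
y_true = [0, 1, 2, 3, 4, 5, 6, 7]
y_pred = [0, 1, 2, 3, 4, 5, 8, 1]

print(accuracy(y_true, y_pred) * 100)  # 75.0
```

Six of the eight predictions match, so the accuracy is 75 percent; the same counting over the real test set gives the figure printed above.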

Translated from: https://www.tutorialspoint.com/machine_learning_with_python/classification_algorithms_logistic_regression.htm