
Python Machine Learning

Chapter 3

A Tour of Machine Learning Classifiers Using Scikit-learn

This chapter covers three topics:

- Introduction to popular classification algorithms

- Using the scikit-learn machine learning library

- Questions to ask when selecting a machine learning algorithm

Choosing a classification algorithm

The five main steps involved in training a machine learning algorithm:

  1. Selection of features.
  2. Choosing a performance metric.
  3. Choosing a classifier and optimization algorithm.
  4. Evaluating the performance of the model.
  5. Tuning the algorithm.

First steps with scikit-learn

Now we will take a look at the scikit-learn API, which **combines a user-friendly interface with a highly optimized implementation of several classification algorithms**.

However, the scikit-learn library offers not only a large variety of learning algorithms, but also many convenient functions to preprocess data and to fine-tune and evaluate our models.

Training a perceptron via scikit-learn

The dataset is again Iris: we take the petal length and petal width of all 150 samples as the feature matrix X, and the corresponding class labels as the vector y:

>>> from sklearn import datasets
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [2, 3]]  # columns 2 and 3: petal length and petal width
>>> y = iris.target

Running `np.unique(y)` returns the distinct class labels stored in `iris.target`: the class names Iris-Setosa, Iris-Versicolor, and Iris-Virginica are already encoded as the integers 0, 1, and 2.
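As a quick sanity check (the output below is what the standard Iris dataset in scikit-learn yields):

>>> print('Class labels:', np.unique(y))
Class labels: [0 1 2]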

To evaluate how well the trained model performs on unseen data, we split the dataset into a test set (30%, 45 samples) and a training set (70%, 105 samples):

>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old scikit-learn versions (removed in 0.20)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0)  # randomly split the data: 30% test (45 samples), 70% training (105 samples)
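A quick way to verify the resulting split sizes (an illustrative check, not part of the original example):

>>> print('Training samples: %d, test samples: %d'
...       % (X_train.shape[0], X_test.shape[0]))
Training samples: 105, test samples: 45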

Feature standardization:

>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> sc.fit(X_train)  # estimate the parameters mu (sample mean) and sigma (standard deviation) from the training data
>>> X_train_std = sc.transform(X_train)  # standardize training and test data with the same mu and sigma
>>> X_test_std = sc.transform(X_test)
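Since StandardScaler applies the z-score transformation $x' = (x - \mu)/\sigma$, the standardized training features should have approximately zero mean and unit standard deviation; a minimal check (my own addition, not from the book):

>>> X_train_std.mean(axis=0)  # approximately [0., 0.]
>>> X_train_std.std(axis=0)   # approximately [1., 1.]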

Most algorithms in scikit-learn already support multiclass classification, by default via the One-vs.-Rest (OvR) method, so we can feed the three flower classes to the perceptron all at once:

>>> from sklearn.linear_model import Perceptron
>>> ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)  # the parameter was called n_iter in older scikit-learn versions
>>> ppn.fit(X_train_std, y_train)

The scikit-learn Perceptron is very much like the perceptron we implemented ourselves earlier: `eta0` corresponds to our `eta` and denotes the learning rate, and `n_iter` (`max_iter` in current versions) denotes the number of epochs (passes over the training set) in both. The `random_state` parameter makes the shuffling of the training data after each epoch reproducible.
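After fitting, the learned parameters can be inspected much like in our own implementation; `coef_` and `intercept_` are standard attributes of fitted scikit-learn linear models (shown here purely as an illustration):

>>> ppn.coef_       # weight matrix: one row of feature weights per class (OvR)
>>> ppn.intercept_  # one bias (threshold) term per class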

Making predictions:

>>> y_pred = ppn.predict(X_test_std)
>>> print('Misclassified samples: %d' % (y_test != y_pred).sum())
Misclassified samples: 4

Scikit-learn also implements many different performance metrics. For example, we can compute the classification accuracy of the perceptron on the test set:

>>> from sklearn.metrics import accuracy_score
>>> print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
Accuracy: 0.91

Here `y_test` and `y_pred` are the true and the predicted class labels of the test set, respectively.
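Equivalently, accuracy is simply the fraction of correctly classified samples, i.e. 1 minus the misclassification error; a one-line check (my own addition, not from the book):

>>> print('Accuracy: %.2f' % (y_test == y_pred).mean())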

Visualizing the results:

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

def plot_decision_regions(X, y, classifier,
                          test_idx=None, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot all samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx],
                    marker=markers[idx], label=cl)

    # highlight test samples as hollow circles
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1],
                    facecolors='none', edgecolors='black',  # c='' in the book; fails in newer matplotlib
                    alpha=1.0, linewidths=1, marker='o',
                    s=55, label='test set')

Now we specify the indices of the samples that we want to highlight as test samples in the resulting plot:

>>> X_combined_std = np.vstack((X_train_std, X_test_std))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X=X_combined_std,
...                       y=y_combined,
...                       classifier=ppn,
...                       test_idx=range(105, 150))  # the test samples are the last 45 of the 150
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()

The perceptron algorithm never converges on datasets that are not perfectly linearly separable, which is why its use is generally not recommended in practice.

Modeling class probabilities via logistic regression

As mentioned above, the perceptron algorithm never converges in that case: intuitively, because the weights are updated continuously, there is always at least one misclassified sample in every epoch.

Logistic regression intuition and conditional probabilities

The odds ratio is the odds in favor of a particular event. It can be written as $\frac{p}{1-p}$, where $p$ stands for the probability of the positive event, that is, the event we want to predict.
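To make the definition concrete, here is a minimal sketch (my own illustration, not from the book) that evaluates the odds $\frac{p}{1-p}$ for a few probabilities:

>>> def odds(p):
...     """Odds in favor of an event with probability p: p / (1 - p)."""
...     return p / (1.0 - p)
...
>>> for p in (0.1, 0.5, 0.9):
...     print('p=%.1f -> odds=%.2f' % (p, odds(p)))
p=0.1 -> odds=0.11
p=0.5 -> odds=1.00
p=0.9 -> odds=9.00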