Chapter 3
A Tour of Machine Learning Classifiers Using Scikit-learn
Three topics:
- Popular classification algorithms
- Using the scikit-learn machine learning library
- Questions to ask when selecting a machine learning algorithm
Choosing a classification algorithm
The five steps in training a machine learning algorithm:
- Selection of features.
- Choosing a performance metric.
- Choosing a classifier and optimization algorithm.
- Evaluating the performance of the model.
- Tuning the algorithm.
First steps with scikit-learn
Now we will take a look at the scikit-learn API, which **combines a user-friendly interface with a highly optimized implementation of several classification algorithms**.
Beyond its large variety of learning algorithms, the scikit-learn library also offers many convenient functions to preprocess data and to fine-tune and evaluate our models.
Training a perceptron via scikit-learn
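The dataset is again Iris. The petal length and petal width of the 150 flower samples form the feature matrix X, and the corresponding class labels form the vector y: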
>>> from sklearn import datasets
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [2, 3]]  # columns 2 and 3 hold petal length and petal width
>>> y = iris.target
Running np.unique(y) returns the distinct class labels stored in iris.target. The class names Iris-Setosa, Iris-Versicolor, and Iris-Virginica are encoded as the integers 0, 1, and 2.
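As a quick check of the label encoding described above, the following sketch lists the distinct labels together with the class names that scikit-learn ships with the dataset. The variable names `labels` and `counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
y = iris.target

# Distinct integer class labels stored in iris.target
labels = np.unique(y)
print(labels)

# Class names in the same order as the integer labels
print(iris.target_names)

# Number of samples per class
counts = np.bincount(y)
print(counts)
```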
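To evaluate how well the trained model performs, split the dataset into a test set (30%, 45 samples) and a training set (70%, 105 samples):
>>> # train_test_split randomly splits the data into a test set (30%) and a training set (70%).
>>> # Newer scikit-learn releases provide it in sklearn.model_selection (formerly sklearn.cross_validation).
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.3, random_state=0)  # random_state=0 is an assumed seed; the original value was lost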
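The 105/45 split stated above can be confirmed by inspecting the array shapes. This is a minimal sketch: it assumes the `sklearn.model_selection` import path of newer scikit-learn releases, and the seed `random_state=0` is an arbitrary choice for reproducibility, not a value taken from these notes.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]   # petal length, petal width
y = iris.target

# random_state=0 is an assumed seed, used only to make the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)
```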
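Feature scaling (standardization):
>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> sc.fit(X_train)  # fit estimates the parameters $\mu$ (mean) and $\sigma$ (standard deviation) of each feature
>>> X_train_std = sc.transform(X_train)  # standardize the training and test data with the same $\mu$ and $\sigma$
>>> X_test_std = sc.transform(X_test)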
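To make the role of $\mu$ and $\sigma$ concrete, the sketch below reproduces the transformation by hand as $(x - \mu) / \sigma$ and compares it with the `StandardScaler` output. The names `mu`, `sigma`, and `X_train_manual` are illustrative, and the split seed is an assumption.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # assumed seed

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Manual standardization using the training set statistics only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)          # population standard deviation (ddof=0)
X_train_manual = (X_train - mu) / sigma

print(np.allclose(X_train_manual, X_train_std))
print(np.allclose(sc.mean_, mu), np.allclose(sc.scale_, sigma))
```

Note that the test set is scaled with the training set's $\mu$ and $\sigma$, so its standardized features are comparable to the training data but will not have exactly zero mean.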
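Most algorithms in scikit-learn support multiclass classification by default via the One-vs.-Rest (OvR) method, so we can feed all three flower classes to the perceptron at once: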
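>>> from sklearn.linear_model import Perceptron
>>> # n_iter is named max_iter in recent scikit-learn releases.
>>> # The values 40, 0.1, and 0 are assumed; the original values were lost.
>>> ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)
>>> ppn.fit(X_train_std, y_train)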
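The Perceptron in scikit-learn closely resembles the perceptron we implemented ourselves earlier. eta0 corresponds to the eta used before; both denote the learning rate. n_iter (max_iter in recent scikit-learn releases) is the number of epochs, that is, passes over the training set. random_state makes the shuffling of the training data after each epoch reproducible.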
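One way to see the One-vs.-Rest scheme at work is to inspect the fitted model: with three classes and two features there is one weight vector per class. The sketch below assumes a recent scikit-learn release (hence `max_iter` rather than `n_iter`); the hyperparameter values and seeds are illustrative assumptions, not values from these notes.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # assumed seed

sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

# Assumed hyperparameters: 40 epochs at most, learning rate 0.1, fixed seed
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)
ppn.fit(X_train_std, y_train)

# One row of weights and one bias per class (One-vs.-Rest)
print(ppn.coef_.shape, ppn.intercept_.shape)

# The predicted class is the one whose binary classifier scores highest
scores = ppn.decision_function(X_test_std)
y_pred = ppn.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(y_pred, ppn.predict(X_test_std)))
```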
Prediction:
>>> y_pred = ppn.predict(X_test_std)
>>> print('Misclassified samples: %d' % (y_test != y_pred).sum())
Misclassified samples:
Scikit-learn also implements a large variety of performance metrics. For example, we can compute the classification accuracy on the test set:
>>> from sklearn.metrics import accuracy_score
>>> print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
y_test and y_pred are the true class labels and the predicted labels of the test set, respectively.
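The accuracy and the misclassification count are two views of the same quantity: accuracy = 1 - (misclassified / total). The sketch below illustrates this on small made-up label arrays; they are hypothetical and not results from the Iris experiment.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only
y_test = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 1, 2, 1])

misclassified = (y_test != y_pred).sum()
manual_accuracy = 1.0 - misclassified / len(y_test)

print('Misclassified samples: %d' % misclassified)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
print('Manual accuracy: %.2f' % manual_accuracy)
```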
Visualizing the results:
import numpy as np
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

# Note: the numeric defaults (resolution, margins, alpha, marker size) were
# lost from the original listing; the values below are assumed.
def plot_decision_regions(X, y, classifier,
                          test_idx=None, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface over a grid spanning both features
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot all samples, one color and marker per class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, color=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight test samples with hollow circles
    # (c='' is rejected by current matplotlib, so facecolors='none' is used)
    if test_idx is not None:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1], facecolors='none',
                    edgecolors='black', alpha=1.0, linewidths=1,
                    marker='o', s=55, label='test set')
Now specify the indices of the samples that we want to mark on the resulting plot:
>>> X_combined_std = np.vstack((X_train_std, X_test_std))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X=X_combined_std,
... y=y_combined,
... classifier=ppn,
... test_idx=range(105, 150))  # the 45 test samples follow the 105 training samples
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
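The perceptron algorithm never converges on datasets that are not perfectly linearly separable, which is why it is generally not recommended in practice.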
Modeling class probabilities via logistic regression
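As noted above, the perceptron never converges when the classes are not perfectly linearly separable. Intuitively, the weights are updated continually because every epoch contains at least one misclassified sample.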
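The non-convergence can be observed on a tiny dataset that no straight line can separate. The sketch below is a hand-written perceptron update loop, not scikit-learn's implementation; the XOR-style data, the learning rate, and the epoch count are chosen here purely for illustration.

```python
import numpy as np

# XOR-style data: not linearly separable
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])

w = np.zeros(X.shape[1])   # weights
b = 0.0                    # bias
eta = 0.1                  # learning rate (illustrative value)

errors_per_epoch = []
for epoch in range(50):
    errors = 0
    for xi, target in zip(X, y):
        prediction = 1 if xi @ w + b >= 0.0 else -1
        update = eta * (target - prediction)
        if update != 0.0:          # sample was misclassified
            w += update * xi
            b += update
            errors += 1
    errors_per_epoch.append(errors)

# Every epoch contains at least one misclassification,
# so the weights never stop changing
print(min(errors_per_epoch))
```

An epoch with zero updates would mean the current weights classify all four points correctly, which is impossible for this data, so the error count never reaches zero no matter how many epochs are run.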
Logistic regression intuition and conditional probabilities
The odds ratio is the odds in favor of a particular event. It can be written as $\frac{p}{1-p}$, $p$ da