
Python Machine Learning

Chapter 3

A Tour of Machine Learning Classifiers Using Scikit-learn

This chapter covers three topics:

- Introduction to popular classification algorithms

- Using the scikit-learn machine learning library

- Questions to ask when selecting a machine learning algorithm

Choosing a classification algorithm

The five main steps involved in training a machine learning algorithm:

  1. Selection of features.
  2. Choosing a performance metric.
  3. Choosing a classifier and optimization algorithm.
  4. Evaluating the performance of the model.
  5. Tuning the algorithm.

First steps with scikit-learn

Now we will take a look at the scikit-learn API, which **combines a user-friendly interface with a highly optimized implementation of several classification algorithms**.

However, the scikit-learn library offers not only a large variety of learning algorithms, but also many convenient functions to preprocess data and to fine-tune and evaluate our models.

Training a perceptron via scikit-learn

The dataset is again Iris: we take the petal length and petal width of all 150 samples as the feature matrix X, and the corresponding class labels as the vector y:

>>> from sklearn import datasets
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [2, 3]]  # columns 2 and 3: petal length and petal width
>>> y = iris.target

Running `np.unique(y)` returns the distinct class labels stored in `iris.target`: the class names Iris-Setosa, Iris-Versicolor, and Iris-Virginica are already encoded as the integers 0, 1, and 2.
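As a quick sanity check (the output below is what the standard Iris dataset in scikit-learn yields):

>>> print('Class labels:', np.unique(y))
Class labels: [0 1 2]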

To evaluate how well the trained model performs on unseen data, we split the dataset into a test set (30%, 45 samples) and a training set (70%, 105 samples):

>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old scikit-learn versions (removed in 0.20)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0)  # randomly split the data: 30% test (45 samples), 70% training (105 samples)
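A quick way to verify the resulting split sizes (an illustrative check, not part of the original example):

>>> print('Training samples: %d, test samples: %d'
...       % (X_train.shape[0], X_test.shape[0]))
Training samples: 105, test samples: 45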

Feature standardization:

>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> sc.fit(X_train)  # estimate the parameters mu (sample mean) and sigma (standard deviation) from the training data
>>> X_train_std = sc.transform(X_train)  # standardize training and test data with the same mu and sigma
>>> X_test_std = sc.transform(X_test)
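Since StandardScaler applies the z-score transformation $x' = (x - \mu)/\sigma$, the standardized training features should have approximately zero mean and unit standard deviation; a minimal check (my own addition, not from the book):

>>> X_train_std.mean(axis=0)  # approximately [0., 0.]
>>> X_train_std.std(axis=0)   # approximately [1., 1.]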

Most algorithms in scikit-learn already support multiclass classification, by default via the One-vs.-Rest (OvR) method, so we can feed the three flower classes to the perceptron all at once:

>>> from sklearn.linear_model import Perceptron
>>> ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)  # the parameter was called n_iter in older scikit-learn versions
>>> ppn.fit(X_train_std, y_train)

The scikit-learn Perceptron is very much like the perceptron we implemented ourselves earlier: `eta0` corresponds to our `eta` and denotes the learning rate, and `n_iter` (`max_iter` in current versions) denotes the number of epochs (passes over the training set) in both. The `random_state` parameter makes the shuffling of the training data after each epoch reproducible.
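After fitting, the learned parameters can be inspected much like in our own implementation; `coef_` and `intercept_` are standard attributes of fitted scikit-learn linear models (shown here purely as an illustration):

>>> ppn.coef_       # weight matrix: one row of feature weights per class (OvR)
>>> ppn.intercept_  # one bias (threshold) term per class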

Making predictions:

>>> y_pred = ppn.predict(X_test_std)
>>> print('Misclassified samples: %d' % (y_test != y_pred).sum())
Misclassified samples: 4

Scikit-learn also implements many different performance metrics. For example, we can compute the classification accuracy of the perceptron on the test set:

>>> from sklearn.metrics import accuracy_score
>>> print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
Accuracy: 0.91

Here `y_test` and `y_pred` are the true and the predicted class labels of the test set, respectively.
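Equivalently, accuracy is simply the fraction of correctly classified samples, i.e. 1 minus the misclassification error; a one-line check (my own addition, not from the book):

>>> print('Accuracy: %.2f' % (y_test == y_pred).mean())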

Visualizing the results:

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

def plot_decision_regions(X, y, classifier,
                          test_idx=None, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot all samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx],
                    marker=markers[idx], label=cl)

    # highlight test samples as hollow circles
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1],
                    facecolors='none', edgecolors='black',  # c='' in the book; fails in newer matplotlib
                    alpha=1.0, linewidths=1, marker='o',
                    s=55, label='test set')

Now we specify the indices of the samples that we want to highlight as test samples in the resulting plot:

>>> X_combined_std = np.vstack((X_train_std, X_test_std))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X=X_combined_std,
...                       y=y_combined,
...                       classifier=ppn,
...                       test_idx=range(105, 150))  # the test samples are the last 45 of the 150
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()

The perceptron algorithm never converges on datasets that are not perfectly linearly separable, which is why its use is generally not recommended in practice.

Modeling class probabilities via logistic regression

As mentioned above, the perceptron algorithm never converges in that case: intuitively, because the weights are updated continuously, there is always at least one misclassified sample in every epoch.

Logistic regression intuition and conditional probabilities

The odds ratio is the odds in favor of a particular event. It can be written as $\frac{p}{1-p}$, where $p$ stands for the probability of the positive event, that is, the event we want to predict.
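To make the definition concrete, here is a minimal sketch (my own illustration, not from the book) that evaluates the odds $\frac{p}{1-p}$ for a few probabilities:

>>> def odds(p):
...     """Odds in favor of an event with probability p: p / (1 - p)."""
...     return p / (1.0 - p)
...
>>> for p in (0.1, 0.5, 0.9):
...     print('p=%.1f -> odds=%.2f' % (p, odds(p)))
p=0.1 -> odds=0.11
p=0.5 -> odds=1.00
p=0.9 -> odds=9.00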