sklearn

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline1">1. Overview</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline2">2. Building Blocks</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline7">3. Supervised Learning</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline3">3.1. Support Vector Machines</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline4">3.2. Ensemble methods</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline5">3.3. Nearest Neighbors</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline6">3.4. Naive Bayes</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline14">4. Model selection and evaluation</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline8">4.1. Cross-validation: evaluating estimator performance</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline9">4.2. Grid Search: searching for estimator parameters</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline10">4.3. Pipeline: chaining estimators</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline11">4.4. Model evaluation: quantifying the quality of predictions</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline12">4.5. Model persistence</a>

<a href="http://www.cnblogs.com/hhh5460/p/5186197.html#orgheadline13">4.6. Validation curves: plotting scores to evaluate models</a>

<a href="http://www.cnblogs.com/images/scikit-learn-ml-map.png">A single diagram showing how to choose the right algorithm</a>

supervised learning

classification # Identifying which of a set of categories a new observation belongs to.

regression # Predicting a continuous value for a new example.

unsupervised learning

clustering # Automatic grouping of similar objects into sets.

dimensionality reduction # Reducing the number of random variables to consider.

model selection and evaluation # Comparing, validating and choosing parameters and models.

dataset transformations # Feature extraction and normalization.

dataset loading utilities

<a href="https://docs.scipy.org/doc/numpy-dev/user/quickstart.html">https://docs.scipy.org/doc/numpy-dev/user/quickstart.html</a>

<a href="http://cs231n.github.io/python-numpy-tutorial/">http://cs231n.github.io/python-numpy-tutorial/</a>

numpy: array/matrix representation and computation. # import numpy as np

numpy provides:

extension package to Python for multi-dimensional arrays

closer to hardware (efficiency)

designed for scientific computation (convenience)

also known as array oriented computing

array attributes

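ndim # number of dimensions

shape # size along each dimension

dtype # element storage type

T # transposed array

size # number of elements

itemsize # memory used by each element, in bytes

nbytes # total memory used, in bytes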

index array

a[d1, d2, …] # multi-dimensional access

a[<array>, …] # fancy indexing
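The attributes and indexing forms listed above can be tried on a small array. This is a minimal sketch; the array contents and variable names are made up for illustration.

```python
import numpy as np

# A 3x4 array of 64-bit integers: [[0..3], [4..7], [8..11]].
a = np.arange(12, dtype=np.int64).reshape(3, 4)

print(a.ndim)      # number of dimensions
print(a.shape)     # size along each dimension
print(a.dtype)     # element storage type
print(a.T.shape)   # the transpose swaps the two axes
print(a.size)      # number of elements
print(a.itemsize)  # bytes per element
print(a.nbytes)    # total bytes (size * itemsize)

elem = a[1, 2]         # multi-dimensional access: row 1, column 2
rows = a[[0, 2]]       # fancy indexing with a list of row indices
evens = a[a % 2 == 0]  # fancy indexing with a boolean mask
```

Fancy indexing returns a copy rather than a view, so modifying `rows` or `evens` leaves `a` untouched.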

pylab: plotting. # import pylab as plt

scipy: more advanced numerical routines.

The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc. scipy can be compared to other standard scientific-computing libraries, such as the GSL (GNU Scientific Library for C and C++), or Matlab’s toolboxes. scipy is the core package for scientific routines in Python; it is meant to operate efficiently on numpy arrays, so that numpy and scipy work hand in hand.

<a href="http://scikit-learn.org/stable/modules/svm.html">http://scikit-learn.org/stable/modules/svm.html</a>

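SVMs can be used for classification, regression, and outlier detection.

For classification there are three estimators: SVC, NuSVC, and LinearSVC. LinearSVC is similar to SVC with kernel='linear'; the difference is that SVC is built on libsvm while LinearSVC is built on liblinear. LinearSVC also does not expose support_ (the support vector indices) after fitting. For multiclass problems SVC uses one-vs-one, which means building C(n,2) classifiers. LinearSVC uses one-vs-rest, which means building n classifiers. LinearSVC also offers a more elaborate algorithm (multi_class='crammer_singer') that handles multiclass with a single classifier. For regression there are two estimators, SVR and NuSVR. The classifiers SVC and NuSVC can output probability estimates directly when constructed with probability=True. OneClassSVM is used for outlier detection.

Supported kernels: 1. linear 2. polynomial 3. rbf 4. sigmoid (tanh). For unbalanced problems the sklearn implementation lets you specify 1. class_weight 2. sample_weight. class_weight gives the weight of each class and has to be set when the classifier is constructed. If unsure, set it to 'auto' (renamed 'balanced' in later releases). sample_weight gives the weight of each sample and is passed when calling fit. Another important parameter is C (the penalty cost). 1.0 is usually good enough, but if the data is very noisy it is better to decrease it.

In terms of computational cost, an SVM is solved as a QP (quadratic programming) problem. The libsvm-based implementation scales between O(d * n^2) and O(d * n^3), depending on how effectively the cache is used, so with enough memory you can raise cache_size to speed up training. Here d is the number of features; for sparse data, d can be taken as the average number of non-zero features per sample. With libsvm it is best to keep the dataset under about 10k samples. liblinear is far more efficient and can easily train on datasets with millions of samples.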
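A minimal sketch of the options discussed above, assuming the iris dataset and a current scikit-learn release (where the class-weight shortcut is spelled 'balanced'). The parameter values and the uniform sample weights are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)

# SVC is backed by libsvm. class_weight is set at construction time,
# cache_size (in MB) can be raised when memory allows.
svc = SVC(kernel="rbf", C=1.0, class_weight="balanced", cache_size=500)

# sample_weight is passed at fit time; uniform weights are only a placeholder.
svc.fit(X, y, sample_weight=np.ones(len(y)))
print(svc.support_.shape)  # indices of the support vectors

# With one-vs-one, 3 classes give C(3, 2) = 3 pairwise decision values.
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovo_shape = ovo.decision_function(X[:1]).shape

# LinearSVC is backed by liblinear and trains one-vs-rest:
# one weight vector per class, and no support_ attribute.
lin = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
print(lin.coef_.shape)
```

On a dataset this small the two back ends train almost instantly; the efficiency gap only becomes visible with many more samples.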


<a href="http://scikit-learn.org/stable/modules/ensemble.html">http://scikit-learn.org/stable/modules/ensemble.html</a>

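Ensemble methods are usually divided into two families:

averaging methods. Build several different hypotheses and average their predictions. Each hypothesis is fairly good on its own but prone to overfitting; averaging reduces the variance. Each learner typically works on only part of the data. Examples are Bagging and Random Forest. sklearn provides a bagging meta-estimator that takes a base estimator and does the averaging automatically. Random forests also come in two versions; the second one additionally randomizes the choice of split threshold when growing each decision tree.

boosting methods. Apply the same algorithm repeatedly, each round correcting the previous ones, then combine the results. Each hypothesis is usually weak, but the combination yields a strong one. The algorithm typically works on the full dataset. Examples are AdaBoost and Gradient Boosting. In sklearn the default base estimator of AdaBoost is a decision tree, while GBDT always uses decision trees as its base estimator but lets you choose the loss function.

Another benefit of using decision trees for classification and regression is that you learn how important each feature is: the higher a feature sits in the tree, the more important it is. My understanding, though, is that this notion of feature importance only applies to tree-based training.

#note: judging from the results of my test program, GBDT scored slightly worse than RF and took noticeably longer to run. On the iris dataset the two perform about the same.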
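A small comparison along these lines can be sketched as follows. This is not the author's original program; the dataset (iris), the number of trees, and the 5-fold setup are assumptions. ExtraTreesClassifier is used here as the variant that also randomizes split thresholds.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each ensemble.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")

# Tree ensembles expose per-feature importances that sum to 1.
rf = models["random forest"].fit(X, y)
importances = rf.feature_importances_
print(importances)
```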

<a href="http://scikit-learn.org/stable/modules/neighbors.html">http://scikit-learn.org/stable/modules/neighbors.html</a>

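Nearest neighbors can be used for both supervised and unsupervised learning. Unsupervised nearest neighbors is the foundation of several other learning methods.

sklearn offers several algorithms for finding the nearest points: 1. brute-force 2. kd-tree 3. ball-tree 4. auto, where auto picks an algorithm based on the size of the data. brute-force does an exhaustive search, while kd-tree and ball-tree build an internal data structure to speed up lookups. With data dimension d and dataset size N, the time complexities of the three algorithms are O(dN), O(d*logN), and O(d*logN). If d is too large, however, kd-tree degrades to O(dN).

For small datasets 1 beats 2 and 3, so kd-tree/ball-tree switch to brute force once the set of points is small enough. That threshold is called leaf_size. leaf_size affects 1. index build time (inversely) 2. query time (a well-chosen leaf_size gives the optimum) 3. memory use (inversely). So make leaf_size as large as possible without hurting query time.

The classifiers and regressors are essentially a thin wrapper over these data structures. You can specify the distance function and the function that combines the neighbors once they are found. The default distance is minkowski (p=2, i.e. Euclidean distance); the combining function can be uniform or distance (weights inversely proportional to distance). KNeighborsClassifier picks the k nearest points, while RadiusNeighborsClassifier picks every point within radius. There is also a NearestCentroid classifier: if y has k classes, the samples are grouped into those k classes, a centroid is computed for each, and a new point is assigned the class of whichever centroid is closest.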
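The pieces described above map onto the sklearn.neighbors API roughly like this. A minimal sketch on the iris dataset; the split ratio, k, and leaf_size values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid, NearestNeighbors

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Unsupervised search: build a kd-tree index and query the 3 nearest points.
nn = NearestNeighbors(n_neighbors=3, algorithm="kd_tree", leaf_size=30).fit(X_train)
dist, idx = nn.kneighbors(X_test[:1])

# Supervised wrapper: Euclidean distance (minkowski with p=2),
# neighbors weighted by inverse distance.
knn = KNeighborsClassifier(
    n_neighbors=5, weights="distance", algorithm="ball_tree", p=2
).fit(X_train, y_train)
knn_acc = knn.score(X_test, y_test)

# NearestCentroid keeps one centroid per class.
centroid = NearestCentroid().fit(X_train, y_train)
centroid_acc = centroid.score(X_test, y_test)

print(idx, knn_acc, centroid_acc)
```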

<a href="http://scikit-learn.org/stable/modules/naive_bayes.html">http://scikit-learn.org/stable/modules/naive_bayes.html</a>

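Naive Bayes is used for classification. Its two main tasks are estimating 1. P(X|y) 2. P(y). Both are done by MLE (maximum likelihood estimation). P(y) is relatively easy to compute; there are three ways to compute P(X|y):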

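If Xi is continuous, Gaussian Naive Bayes. Take all the values of Xi where y=k and assume they follow a Gaussian distribution. Once the mean and std of that Gaussian are estimated, P(X|y=k) can be computed. The model has d * k coefficients per statistic (one mean and one std for each feature and class).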

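If Xi is discrete, Multinomial Naive Bayes. Then P(X=u|y=k) = P(X=u, y=k) / P(y=k). The model has k * ∑ {Xi} coefficients. It also has a smoothing parameter (alpha).

Further, if Xi is binary (0 or 1), Bernoulli Naive Bayes. We usually supply the binarize parameter, a threshold used to convert X into 0/1 values.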
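The three variants can be compared on synthetic data. The data below is generated only for illustration (two Gaussian blobs for the continuous case, Poisson counts for the discrete cases), and the binarize threshold of 2.0 is an arbitrary assumption.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.RandomState(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: one Gaussian blob per class.
X_cont = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(3.0, 1.0, (50, 3))])
gnb = GaussianNB().fit(X_cont, y)
print(gnb.theta_.shape)  # per-class, per-feature means: (k, d)

# Count features: each class favors a different column.
X_count = np.vstack(
    [rng.poisson([5, 1, 1], (50, 3)), rng.poisson([1, 1, 5], (50, 3))]
)
mnb = MultinomialNB(alpha=1.0).fit(X_count, y)  # alpha is the smoothing parameter

# Binary features: counts above the threshold become 1, the rest 0.
bnb = BernoulliNB(binarize=2.0).fit(X_count, y)

gnb_acc = gnb.score(X_cont, y)
mnb_acc = mnb.score(X_count, y)
bnb_acc = bnb.score(X_count, y)
print(gnb_acc, mnb_acc, bnb_acc)
```

The scores here are training accuracies on easy synthetic data, so they only show that each model fits its intended feature type.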

<a href="http://scikit-learn.org/stable/modules/cross_validation.html">http://scikit-learn.org/stable/modules/cross_validation.html</a>

Use train_test_split to separate the training set from the test set.

Use k-fold or a similar scheme to carve validation sets out of the training set for cross-validation.

Use cross_val_score to run the cross-validation and compute its scores.
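The three steps above can be sketched as follows, assuming a current scikit-learn release where these helpers live in sklearn.model_selection. The dataset, the linear SVC, and the split sizes are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 2: split the training set into 5 folds; each fold serves once as validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Step 3: fit and score on every fold.
clf = SVC(kernel="linear", C=1.0)
scores = cross_val_score(clf, X_train, y_train, cv=cv)
print(f"cv accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Final check on the held-out test set.
test_score = clf.fit(X_train, y_train).score(X_test, y_test)
print(f"test accuracy: {test_score:.3f}")
```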

<a href="http://scikit-learn.org/stable/modules/grid_search.html">http://scikit-learn.org/stable/modules/grid_search.html</a>

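Parameter-space search falls roughly into three kinds: 1. brute force 2. random 3. ad hoc. Of these, 2 and 3 depend on the specific algorithm.

At the end, the code applies the best model to the test data and prints the evaluation with classification_report.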
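A sketch of the brute-force and random searches, followed by the final evaluation step. This is not the author's original code; the dataset, the parameter grid, and n_iter are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Brute force: try all 9 combinations with 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X_train, y_train)

# Random: sample only 4 of the 9 combinations.
rand = RandomizedSearchCV(
    SVC(kernel="rbf"), param_grid, n_iter=4, cv=5, random_state=0
).fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)

# The search refits the best model on the whole training set,
# so it can be applied to the test data directly.
report = classification_report(y_test, grid.predict(X_test))
print(report)
```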

<a href="http://scikit-learn.org/stable/modules/pipeline.html">http://scikit-learn.org/stable/modules/pipeline.html</a>

Chains multiple stages together so that the whole workflow runs automatically.
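A minimal sketch of chaining stages, assuming a scaling step, a PCA step, and an SVC as the final estimator. The step names ("scale", "pca", "svc") are arbitrary labels chosen here.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

pipe = Pipeline(
    [
        ("scale", StandardScaler()),   # transformer
        ("pca", PCA(n_components=2)),  # transformer
        ("svc", SVC()),                # final estimator
    ]
)

# fit() runs fit_transform on each transformer in order, then fits the SVC.
pipe.fit(X_train, y_train)
pipe_acc = pipe.score(X_test, y_test)
print(pipe_acc)

# Parameters of a step are addressed as <step name>__<parameter>.
pipe.set_params(svc__C=10.0)
```

Because the pipeline behaves like a single estimator, it can be passed to cross_val_score or GridSearchCV, which keeps the preprocessing inside each cross-validation fold.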

<a href="http://scikit-learn.org/stable/modules/model_evaluation.html">http://scikit-learn.org/stable/modules/model_evaluation.html</a>

There are 3 different approaches to evaluate the quality of predictions of a model:

Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. # the model's own built-in evaluation, e.g. its loss function

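Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV, which lived in cross_validation and grid_search in older releases) rely on an internal scoring strategy. # evaluation during cross-validation, usually a single number, e.g. 'f1'.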

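Metric functions: The metrics module implements functions assessing prediction errors for specific purposes. # evaluation on test data; it can be a number, or text, plots, etc., e.g. classification_report.

Approaches 2 and 3 are closely related. The difference is that 3 is applied to test data that we want to analyze further, so it offers more kinds of evaluation, whereas 2 is still in the model-selection stage, so we prefer a single number.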
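The three approaches side by side, as a minimal sketch. The classifier and dataset are illustrative assumptions. Since iris has three classes, the scoring string is 'f1_macro' rather than plain 'f1', which only applies to binary targets.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000)

# 2. Scoring parameter: a single number per fold during model selection.
cv_f1 = cross_val_score(clf, X_train, y_train, cv=5, scoring="f1_macro")

clf.fit(X_train, y_train)

# 1. Estimator score method: the default criterion (accuracy for classifiers).
acc = clf.score(X_test, y_test)

# 3. Metric functions: applied to test predictions, numeric or textual.
y_pred = clf.predict(X_test)
macro_f1 = f1_score(y_test, y_pred, average="macro")
report = classification_report(y_test, y_pred)

print(cv_f1.mean(), acc, macro_f1)
print(report)
```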

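sklearn also provides dummy estimators (DummyClassifier and DummyRegressor). They offer only a handful of deliberately naive strategies and are mainly used to establish a baseline.

DummyClassifier implements several such simple strategies for classification:

'stratified' generates random predictions by respecting the training set’s class distribution,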

'most_frequent' always predicts the most frequent label in the training set,

'uniform' generates predictions uniformly at random.

'constant' always predicts a constant label that is provided by the user.

DummyRegressor also implements three simple rules of thumb for regression:

'mean' always predicts the mean of the training targets.

'median' always predicts the median of the training targets.

'constant' always predicts a constant value that is provided by the user.
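A short sketch of using these strategies as baselines. The data is synthetic and only meant to make the expected behavior easy to check by hand: an 80/20 class split, and regression targets 0 to 99.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

X = np.zeros((100, 1))             # dummy estimators ignore the features
y = np.array([0] * 80 + [1] * 20)  # 80% of the labels are class 0

# Always predicting the majority class is right 80% of the time here.
most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = most_frequent.score(X, y)

# 'constant' needs the label to predict.
constant = DummyClassifier(strategy="constant", constant=1).fit(X, y)
constant_pred = constant.predict(X[:3])

# The median of 0..99 is 49.5.
y_reg = np.arange(100, dtype=float)
median = DummyRegressor(strategy="median").fit(X, y_reg)
median_pred = median.predict(X[:1])

print(baseline_acc, constant_pred, median_pred)
```

A real model that cannot beat `baseline_acc` on the same data has not learned anything useful from the features.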

<a href="http://scikit-learn.org/stable/modules/model_persistence.html">http://scikit-learn.org/stable/modules/model_persistence.html</a>

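You can use Python's built-in pickle module or joblib (shipped with sklearn as sklearn.externals.joblib in older releases, now a standalone package). joblib serializes to disk more efficiently than pickle, but the drawback is that, unlike pickle, it cannot serialize to a string.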
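Both options can be sketched as below, assuming the standalone joblib package is installed (it is a dependency of current scikit-learn releases). The file name model.joblib is a hypothetical choice.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC().fit(X, y)

# pickle can serialize the fitted model to an in-memory byte string.
blob = pickle.dumps(clf)
from_pickle = pickle.loads(blob)

# joblib writes the model to a file on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")  # hypothetical file name
    joblib.dump(clf, path)
    from_joblib = joblib.load(path)

# Both restored models predict exactly like the original.
same = bool(
    (from_pickle.predict(X) == clf.predict(X)).all()
    and (from_joblib.predict(X) == clf.predict(X)).all()
)
print(same)
```

Either way, a saved model should only be loaded with a compatible scikit-learn version and from a trusted source, since unpickling can execute arbitrary code.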

<a href="http://scikit-learn.org/stable/modules/learning_curve.html">http://scikit-learn.org/stable/modules/learning_curve.html</a>

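Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias, variance and noise. The bias of an estimator is its average error for different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data. # bias is the model's error averaged over different training sets, variance is how sensitive the model is to changes in the training set, and noise is a property of the data itself. Together these three account for the prediction error.

#note: this feature seems to have been added in 0.15. The sklearn-0.14.1 I previously installed via apt-get did not have the learning_curve module.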

validation curve

If the training score and the validation score are both low, the estimator will be underfitting. If the training score is high and the validation score is low, the estimator is overfitting and otherwise it is working very well. A low training score and a high validation score is usually not possible. All three cases can be found in the plot below where we vary the parameter gamma on the digits dataset.

The cross-validation score starts to drop around gamma = 5 * 10^{-4} while the training score stays high, which indicates overfitting.
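The scores behind such a validation curve can be computed as follows. This is a sketch assuming a current scikit-learn release (validation_curve in sklearn.model_selection); the gamma range, the 3-fold setup, and the 600-sample subset of digits are choices made here to keep it fast, and it prints the scores instead of plotting them.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X, y = X[:600], y[:600]  # subset only to keep the sketch fast

gammas = np.logspace(-6, -1, 6)
train_scores, valid_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=gammas, cv=3
)

# One row per gamma value, one column per fold.
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
for g, tr, va in zip(gammas, train_mean, valid_mean):
    print(f"gamma={g:.0e}  train={tr:.3f}  validation={va:.3f}")
```

A high training score paired with a much lower validation score at the large-gamma end is the overfitting pattern described above.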

learning curve

The first plot is the learning curve for naive Bayes, which shows a high-bias situation. The second plot is the learning curve for an SVM (RBF kernel), which clearly learns better than naive Bayes.
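The numbers behind two such learning curves can be reproduced roughly as follows. This is a sketch under assumptions: the digits dataset, gamma=0.001 for the SVC, five training sizes, and 5-fold cross-validation are choices made here, and the scores are printed rather than plotted.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
fractions = np.linspace(0.1, 1.0, 5)  # fractions of the training folds to use

sizes, nb_train, nb_valid = learning_curve(
    GaussianNB(), X, y, train_sizes=fractions, cv=5
)
_, svc_train, svc_valid = learning_curve(
    SVC(kernel="rbf", gamma=0.001), X, y, train_sizes=fractions, cv=5
)

# Mean validation score at each training-set size.
nb_curve = nb_valid.mean(axis=1)
svc_curve = svc_valid.mean(axis=1)
for n, nb, sv in zip(sizes, nb_curve, svc_curve):
    print(f"n={n:4d}  naive bayes={nb:.3f}  svm={sv:.3f}")
```

A validation score that plateaus at a low level as the training size grows suggests high bias; more data will not help such a model much.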

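[Reposted from]: http://dirlt.com/sklearn.html

This article is reposted from Luo Bing's cnblogs blog. Original link: http://www.cnblogs.com/hhh5460/p/5186197.html. To repost it, please contact the original author.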
