sklearn: sklearn.preprocessing.StandardScaler and its fit_transform, transform, and inverse_transform methods: introduction and detailed usage guide

The mathematics of standardization/normalization and its implementation in code

Reference article: ML/FE: data processing, feature engineering: feature standardization (for the four data types: numeric, categorical, string, and time), normalization, and vectorization: introduction, implementation, and worked examples

StandardScaler: introduction and usage

Note: in sklearn.preprocessing, standardizing the training data and the test data uses two different calls:

On the training data, call fit_transform().

On the test data, call transform(), which reuses the statistics learned from the training data.
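The pattern above can be sketched with toy data (the values are chosen only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0., 0.], [0., 0.], [1., 1.], [1., 1.]])
X_test = np.array([[2., 2.]])

scaler = StandardScaler()
# Training data: learn mean_ and scale_ and transform in one step
X_train_scaled = scaler.fit_transform(X_train)
# Test data: reuse the statistics learned from the training set
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)  # rows of -1 and +1
print(X_test_scaled)   # [[3. 3.]] since (2 - 0.5) / 0.5 = 3
```

Fitting a second scaler on the test set would apply different statistics and make the training and test features incomparable, which is why only transform() is used there.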

Introduction to StandardScaler

     """Standardize features by removing the mean and scaling to unit variance

   Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the `transform` method.

   Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

   For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

   This scaler can also be applied to sparse CSR or CSC matrices by passing `with_mean=False` to avoid breaking the sparsity structure of the data.

   Read more in the :ref:`User Guide <preprocessing_scaler>`.

Note: the fitted scaler records the mean and standard deviation of every input feature, so transformed data can later be restored with inverse_transform.

   Parameters

   ----------

   copy : boolean, optional, default True

   If False, try to avoid a copy and do inplace scaling instead.

   This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be  returned.

   with_mean : boolean, True by default

   If True, center the data before scaling.

   This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

   with_std : boolean, True by default

   If True, scale the data to unit variance (or equivalently, unit standard deviation).

   Attributes

   scale_ : ndarray, shape (n_features,)   Per feature relative scaling of the data.

   .. versionadded:: 0.17

   *scale_*

   mean_ : array of floats with shape [n_features]

   The mean value for each feature in the training set.

   var_ : array of floats with shape [n_features]

   The variance for each feature in the training set. Used to compute `scale_`

   n_samples_seen_ : int

   The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across ``partial_fit`` calls.

Note: scale_ is the per-feature scaling factor, which is also the standard deviation of each feature in the training set (the square root of var_).
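These attributes are related: with with_std=True, scale_ is the square root of var_. A quick check on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0., 0.], [0., 0.], [1., 1.], [1., 1.]])
scaler = StandardScaler().fit(data)

print(scaler.mean_)            # [0.5 0.5]
print(scaler.var_)             # [0.25 0.25]
print(scaler.scale_)           # [0.5 0.5], i.e. sqrt(var_)
print(scaler.n_samples_seen_)  # 4
```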

   See also

   --------

   scale: Equivalent function without the estimator API.

   :class:`sklearn.decomposition.PCA`

   Further removes the linear correlation across features with 'whiten=True'.

   Notes

   -----

   For a comparison of the different scalers, transformers, and normalizers,

   see :ref:`examples/preprocessing/plot_all_scaling.py

   <sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.

Example usage of StandardScaler

from sklearn.preprocessing import StandardScaler

data = [[0, 0], [0, 0], [1, 1], [1, 1]]

scaler = StandardScaler()

print(scaler.fit(data))   # StandardScaler(copy=True, with_mean=True, with_std=True)

print(scaler.mean_)        # [ 0.5  0.5]

print(scaler.transform(data))

#     [[-1. -1.]

#     [-1. -1.]

#     [ 1.  1.]

#     [ 1.  1.]]

print(scaler.transform([[2, 2]])) #[[ 3.  3.]]
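Because mean_ and scale_ are stored on the fitted scaler, the transformation is reversible; continuing with the same data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler().fit(data)

scaled = scaler.transform(data)
restored = scaler.inverse_transform(scaled)
print(restored)  # the original [[0. 0.] [0. 0.] [1. 1.] [1. 1.]]
```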

The fit_transform method

Introduction to fit_transform

   """Fit to data, then transform it.

   Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

   Parameters

   ----------

   X : numpy array of shape [n_samples, n_features]

   Training set.

   y : numpy array of shape [n_samples]

   Target values.

   Returns

   -------

   X_new : numpy array of shape [n_samples, n_features_new]

   Transformed array.


   # non-optimized default implementation; override when a better method is possible for a given clustering algorithm

Implementation of fit_transform

def fit_transform, found at sklearn.base:

def fit_transform(self, X, y=None, **fit_params):
    # non-optimized default implementation; override when a better
    # method is possible for a given clustering algorithm
    if y is None:
        # fit method of arity 1 (unsupervised transformation)
        return self.fit(X, **fit_params).transform(X)
    else:
        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X)
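As the default implementation shows, for an unsupervised transformer fit_transform(X) is simply a shortcut for fit(X).transform(X); the two give identical results:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0., 0.], [0., 0.], [1., 1.], [1., 1.]])

a = StandardScaler().fit_transform(X)
b = StandardScaler().fit(X).transform(X)
print(np.allclose(a, b))  # True
```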

The transform method: introduction and usage

Introduction to transform

   """Perform standardization by centering and scaling

   X : array-like, shape [n_samples, n_features]

   The data used to scale along the features axis.

   y : (ignored)

   .. deprecated:: 0.19

   This parameter will be removed in 0.21.

   copy : bool, optional (default: None)

   Copy the input X or not.

   """

Implementation of transform

def transform, found at sklearn.preprocessing.data:

def transform(self, X, y='deprecated', copy=None):
    if not isinstance(y, string_types) or y != 'deprecated':
        warnings.warn("The parameter y on transform() is "
                      "deprecated since 0.19 and will be removed in 0.21",
                      DeprecationWarning)
    check_is_fitted(self, 'scale_')
    copy = copy if copy is not None else self.copy
    X = check_array(X, accept_sparse='csr', copy=copy, warn_on_dtype=True,
                    estimator=self, dtype=FLOAT_DTYPES)
    if sparse.issparse(X):
        if self.with_mean:
            raise ValueError(
                "Cannot center sparse matrices: pass `with_mean=False` "
                "instead. See docstring for motivation and alternatives.")
        if self.scale_ is not None:
            inplace_column_scale(X, 1 / self.scale_)
    else:
        if self.with_mean:
            X -= self.mean_
        if self.with_std:
            X /= self.scale_
    return X
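For dense input the method above boils down to the familiar formula z = (x - mean_) / scale_, which can be verified against transform directly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., 10.], [2., 20.], [3., 30.]])
scaler = StandardScaler().fit(X)

manual = (X - scaler.mean_) / scaler.scale_
print(np.allclose(manual, scaler.transform(X)))  # True
```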

The inverse_transform method: introduction and usage

Introduction to inverse_transform

   """Scale back the data to the original representation

   X_tr : array-like, shape [n_samples, n_features]

   Transformed array.

   """


Implementation of inverse_transform

def inverse_transform, found at sklearn.preprocessing.data:

def inverse_transform(self, X, copy=None):
    check_is_fitted(self, 'scale_')
    copy = copy if copy is not None else self.copy
    if sparse.issparse(X):
        if self.with_mean:
            raise ValueError(
                "Cannot uncenter sparse matrices: pass `with_mean=False` "
                "instead. See docstring for motivation and alternatives.")
        if not sparse.isspmatrix_csr(X):
            X = X.tocsr()
            copy = False
        if copy:
            X = X.copy()
        if self.scale_ is not None:
            inplace_column_scale(X, self.scale_)
    else:
        X = np.asarray(X)
        if copy:
            X = X.copy()
        if self.with_std:
            X *= self.scale_
        if self.with_mean:
            X += self.mean_
    return X
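For dense input this is the exact inverse of transform: x = z * scale_ + mean_. A quick round-trip check:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., 10.], [2., 20.], [3., 30.]])
scaler = StandardScaler().fit(X)

Z = scaler.transform(X)
print(np.allclose(Z * scaler.scale_ + scaler.mean_, X))  # True
print(np.allclose(scaler.inverse_transform(Z), X))       # True
```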