
Three-Day Quick Start to Python Machine Learning (Day One), continuously updated! Bookmarking recommended!

Author: AI AI Knowledge Base

Machine Learning Getting Started Catalog

  • 1. Overview of machine learning
  • 1.1 Overview of artificial intelligence
  • 1.1.1 Machine learning, artificial intelligence, and deep learning
  • 1.1.2 What machine learning and deep learning can do
  • 1.1.3 Artificial Intelligence Stage Course Schedule
  • 1.2 What is machine learning
  • 1.2.1 Definitions
  • 1.2.2 Interpretation
  • 1.2.3 Dataset Composition
  • 1.3 Machine learning algorithm classification
  • 1.3.1 Summary
  • 1.3.2 Exercises
  • 1.3.3 Machine learning algorithm classification
  • 1.4 Machine Learning Development Process
  • 1.5 Learning framework and information introduction
  • 1.5.1 Machine Learning Libraries and Frameworks
  • 1.5.2 Books
  • 1.5.3 Strengthening your fundamentals
  • 2. Feature engineering
  • 2.1 Datasets
  • 2.1.1 Available datasets
  • 2.1.2 sklearn dataset
  • 2.1.3 Division of data sets
  • 2.2 Introduction to feature engineering
  • 2.2.1 Why feature engineering is needed
  • 2.2.2 What is feature engineering?
  • 2.2.3 Comparison of the location of feature engineering and data processing
  • 2.3 Feature extraction
  • 2.3.1 What is feature extraction?
  • 2.3.2 Dictionary feature extraction
  • 2.3.3 Text feature extraction
  • 2.4 Feature Preprocessing
  • 2.4.1 What is feature preprocessing?
  • 2.4.2 Normalization
  • 2.4.3 Standardization
  • 2.5 Feature dimensionality reduction
  • 2.5.1 Dimensionality reduction
  • 2.5.2 Two ways to reduce dimensionality
  • 2.5.3 What is feature selection
  • 2.6 Principal Component Analysis
  • 2.6.1 What is Principal Component Analysis (PCA)
  • 2.6.2 Case study: dimensionality reduction of users' preferences for item categories
  • 2.7 Summary of the first day of machine learning

If you want the full set of "Machine Learning from Introduction to Practice" study notes, please share this post, then follow me and send the private message "666" to get them!

Follow me for more hands-on programming content!

1. Overview of machine learning

1.1 Overview of artificial intelligence

1.1.1 Machine learning, artificial intelligence, and deep learning

  • Machine learning is one way to achieve artificial intelligence
  • Deep learning is a method of machine learning

1.1.2 What machine learning and deep learning can do

  • Traditional forecasting: store sales forecasting, quantitative investment, advertising recommendation, enterprise customer classification, SQL statement security detection classification
  • Image recognition: street traffic sign detection, face recognition
  • Natural language processing: text classification, sentiment analysis, automated chat, text detection

1.1.3 Artificial Intelligence Stage Course Schedule


1.2 What is machine learning

1.2.1 Definitions

Machine learning automatically analyzes data to obtain a model, and then uses that model to make predictions on unknown data

1.2.2 Interpretation


1.2.3 Dataset Composition

Structure: feature values + target values


In the dataset:

  • For each row of data, we can call it a sample
  • Some datasets can have no target value

1.3 Machine learning algorithm classification

The first type:


Identifying cats and dogs:

Feature values: images

Target value: cat/dog (a category)

This is a classification problem

The second type:


House price prediction:

Feature values: the attribute information of each house

Target value: house price (continuous data)

This is a regression problem

The third type:


Feature values: the attribute information of each person

Target value: none

This is unsupervised learning

1.3.1 Summary

In short: supervised learning has target values (discrete targets → classification, continuous targets → regression); unsupervised learning has no target values.


1.3.2 Exercises

Classify each of the following problems:

  • 1. Predicting tomorrow's temperature? Regression
  • 2. Will tomorrow be cloudy, sunny, or rainy? Classification
  • 3. Predicting a person's age from a face photo? Classification or regression (depending on how age is framed)
  • 4. Face recognition? Classification

1.3.3 Machine learning algorithm classification

Supervised learning: Prediction

  • Definition: the input data consists of input features and target values; if the function's output is a continuous value the task is called regression, and if it is a discrete value the task is called classification
  • Classification: K-nearest neighbor algorithm, Bayesian classification, decision trees and random forests, logistic regression
  • Regression: linear regression, ridge regression

Unsupervised learning

  • Definition: the input data consists of feature values only (no target values)
  • Clustering: k-means
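
To make the three categories concrete, here is a minimal sketch (not part of the original notes) that fits one scikit-learn estimator of each kind on the iris data:

# One estimator per algorithm family, all on the iris features.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier  # classification: discrete target
from sklearn.linear_model import LinearRegression   # regression: continuous target
from sklearn.cluster import KMeans                  # clustering: no target at all

X, y = load_iris(return_X_y=True)

KNeighborsClassifier().fit(X, y)           # supervised: features + class labels
LinearRegression().fit(X[:, :3], X[:, 3])  # supervised: predict a continuous value (petal width)
KMeans(n_clusters=3).fit(X)                # unsupervised: features only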

1.4 Machine Learning Development Process


Process:

  • 1) Get the data
  • 2) Data processing
  • 3) Feature engineering
  • 4) Machine learning algorithm training - get the model
  • 5) Model evaluation
  • 6) Application
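
As a hedged end-to-end sketch of these six steps (assumptions: the built-in iris data stands in for step 1, and a KNN classifier stands in for the training step):

from sklearn.datasets import load_iris                # 1) get the data
from sklearn.model_selection import train_test_split  # 2) data processing: split
from sklearn.preprocessing import StandardScaler      # 3) feature engineering
from sklearn.neighbors import KNeighborsClassifier    # 4) algorithm training

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)      # reuse the training-set statistics

model = KNeighborsClassifier()
model.fit(x_train, y_train)
print("accuracy:", model.score(x_test, y_test))   # 5) model evaluation
# 6) application: call model.predict() on new samples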

1.5 Learning framework and information introduction

Most complex model and algorithm design is done by algorithm engineers; as practitioners, we mainly:

  • Analyze a lot of data
  • Analyze specific businesses
  • Apply common algorithms
  • Feature engineering, parameter adjustment, optimization

1.5.1 Machine Learning Libraries and Frameworks


1.5.2 Books


1.5.3 Strengthening your fundamentals


2. Feature engineering

2.1 Datasets

Goals:

  • Understand that a dataset is divided into a training set and a test set
  • Be able to use sklearn's built-in datasets

2.1.1 Available datasets

  • Kaggle website: https://www.kaggle.com/datasets
  • UCI Dataset URL: http://archive.ics.uci.edu/ml/
  • scikit-learn URL: http://scikit-learn.org/stable/datasets/index.html#datasets
  • https://scikit-learn.org.cn/

Scikit-Learn features:

  • 1. Small amount of data
  • 2. Easy to learn

UCI Features:

  • 1. Contains 360 datasets
  • 2. Covering science, life, economy and other fields
  • 3. Some datasets contain hundreds of thousands of samples

Kaggle features:

  • 1. Big data competition platform
  • 2. Around 800,000 data scientists
  • 3. The amount of data is huge

1. Introduction to Scikit-learn tools

  • Machine learning tools for the Python language
  • Scikit-learn includes implementations of many well-known machine learning algorithms
  • Scikit-learn is well documented and easy to use, with a rich API
  • The stable version at the time these notes were written was 0.19.1

2 Installation

pip install scikit-learn -i https://pypi.douban.com/simple           

After installation, you can use the following command to check whether the installation is successful:

import sklearn           

Note: scikit-learn depends on libraries such as NumPy and SciPy

3 What Scikit-learn contains

Broadly, scikit-learn covers classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

2.1.2 sklearn dataset

1 Introduction to the scikit-learn dataset API

  • sklearn.datasets.load_*(): loads small datasets that ship with the library
  • sklearn.datasets.fetch_*(data_home=None): fetches large datasets that must be downloaded from the network; the first parameter, data_home, is the directory the dataset is downloaded to, and defaults to ~/scikit_learn_data/

2 sklearn small dataset

  • sklearn.datasets.load_iris(): Loads and returns the iris dataset
  • sklearn.datasets.load_boston(): Loads and returns the Boston House Price dataset

3 sklearn large dataset

sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train')

  • subset: 'train', 'test', or 'all'; selects which part of the dataset to load
  • 'train' loads the training set, 'test' the test set, and 'all' both
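
A minimal usage sketch (the first call downloads the data to data_home, so it needs a network connection; the printed counts depend on the subset chosen):

from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(data_home=None, subset='train')  # 'train', 'test' or 'all'
print(len(news.data))          # number of documents in the chosen subset
print(news.target_names[:5])   # first few category names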

4 Use of the sklearn dataset

Introduction to the sklearn dataset return value

load* and fetch* both return a datasets.base.Bunch object (a dictionary-like format)

  • data: the feature data array, a two-dimensional numpy.ndarray of shape [n_samples, n_features]
  • target: the label array, a one-dimensional numpy.ndarray of length n_samples
  • DESCR: the dataset description
  • feature_names: the feature names (not provided for the news, handwritten-digit, and regression datasets)
  • target_names: the label names
from sklearn.datasets import load_iris

def datasets_demo():
    """
    Demonstrate using a sklearn dataset
    :return:
    """
    # Load the dataset
    iris = load_iris()
    print("鸢尾花数据集:\n", iris)
    print("查看数据集描述:\n", iris["DESCR"])  # description of the dataset
    print("查看特征值的名字:\n", iris.feature_names)
    print("查看特征值:\n", iris.data, iris.data.shape)  # shape: (150, 4)
    return None

if __name__ == "__main__":
    datasets_demo()           
查看特征值的名字:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']           

2.1.3 Division of data sets

A machine learning dataset is generally divided into two parts:

  • Training data: used for training and building models
  • Test data: Used during model testing to evaluate whether the model is valid

Typical split ratios:

  • Training set: 70%, 80%
  • Test set: 30%, 20%

Dataset partitioning API: sklearn.model_selection.train_test_split(arrays, *options)

  • x: the feature values of the dataset
  • y: the label (target) values of the dataset
  • test_size: the proportion of the test set, usually a float
  • random_state: the random seed; different seeds give different sampling results, while the same seed gives the same result
  • Returns: training-set features, test-set features, training-set targets, test-set targets
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def datasets_demo():
    """
    Demonstrate using and splitting a sklearn dataset
    :return:
    """
    # Load the dataset
    iris = load_iris()
    print("鸢尾花数据集:\n", iris)
    print("查看数据集描述:\n", iris["DESCR"])
    print("查看特征值的名字:\n", iris.feature_names)
    print("查看特征值:\n", iris.data, iris.data.shape)  # 150 samples
    # Split the dataset: x holds the features, y the labels
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=22)
    print("训练集的特征值:\n", x_train, x_train.shape)  # 120 samples
    return None

if __name__ == "__main__":
    datasets_demo()           

2.2 Introduction to feature engineering

2.2.1 Why feature engineering is needed

Data and features determine the upper limit of machine learning, and models and algorithms simply approximate this upper limit

2.2.2 What is feature engineering?

Feature engineering is the process of processing data using specialized background knowledge and skills so that features can work better on machine learning algorithms

Significance: feature engineering directly affects how well machine learning works

2.2.3 Comparison of the location of feature engineering and data processing

  • pandas: a very convenient tool for reading data and doing basic processing and formatting
  • sklearn: Provides a powerful interface for feature processing

Feature engineering consists of:

  • Feature extraction
  • Feature preprocessing
  • Feature dimensionality reduction

What is feature extraction?


2.3 Feature extraction

2.3.1 What is feature extraction?

1 Feature extraction converts arbitrary data, such as text or images, into numerical features that machine learning can use

Note: feature values help computers understand the data better

  • Dictionary feature extraction (feature discretization)
  • Text feature extraction
  • Image feature extraction (covered later with deep learning)

2 Feature extraction API

sklearn.feature_extraction           

2.3.2 Dictionary feature extraction

Purpose: vectorize dictionary data (turn dict features into numbers)

sklearn.feature_extraction.DictVectorizer(sparse=True, ...)

  • DictVectorizer.fit_transform(X), X: a dictionary or an iterator of dictionaries; returns a sparse matrix
  • DictVectorizer.inverse_transform(X), X: array or sparse matrix; returns the data in its format before conversion
  • DictVectorizer.get_feature_names(): returns the category names

1 Application

Feature extraction on this data converts the categories to one-hot encoding; the sparse representation saves memory and improves loading efficiency

from sklearn.feature_extraction import DictVectorizer

def dict_demo():
    """
    Dictionary feature extraction
    :return:
    """
    data = [{'city': '北京', 'temperature': 100},
            {'city': '上海', 'temperature': 60},
            {'city': '深圳', 'temperature': 30}]
    # 1. Instantiate a transformer class
    # transfer = DictVectorizer()  # returns a sparse matrix
    transfer = DictVectorizer(sparse=False)
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)  # the transformed data
    print("特征名字:\n", transfer.get_feature_names())
    return None

if __name__ == "__main__":
    dict_demo()           
data_new:
[[ 0. 1. 0. 100.]
[ 1. 0. 0. 60.]
[ 0. 0. 1. 30.]]
特征名字:
['city=上海', 'city=北京', 'city=深圳', 'temperature']           

2.3.3 Text feature extraction

Words as features

Purpose: vectorize text data

sklearn.feature_extraction.text.CountVectorizer(stop_words=[]): Returns the word frequency matrix

  • CountVectorizer.fit_transform(X), X: text or an iterable of text strings; returns a sparse matrix
  • CountVectorizer.inverse_transform(X), X: array or sparse matrix; returns the data in its format before conversion
  • CountVectorizer.get_feature_names(): returns the list of words

sklearn.feature_extraction.text.TfidfVectorizer (covered below)

1 Application

English text tokenization

from sklearn.feature_extraction.text import CountVectorizer

def count_demo():
    """
    Text feature extraction: CountVectorizer
    :return:
    """
    data = ['life is short,i like like python',
            'life is too long,i dislike python']
    # 1. Instantiate a transformer class
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())  # toarray converts the sparse matrix to a 2-D array
    print("特征名字:\n", transfer.get_feature_names())
    return None

if __name__ == "__main__":
    count_demo()           
data_new:
[[0 1 1 2 0 1 1 0]
[1 1 1 0 1 1 0 1]]
特征名字:
['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']           

Stop words: stop_words=[]

from sklearn.feature_extraction.text import CountVectorizer

def count_demo():
    """
    Text feature extraction: CountVectorizer with stop words
    :return:
    """
    data = ['life is short,i like like python',
            'life is too long,i dislike python']
    # 1. Instantiate a transformer class
    transfer = CountVectorizer(stop_words=['is', 'too'])
    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())  # toarray converts the sparse matrix to a 2-D array
    print("特征名字:\n", transfer.get_feature_names())
    return None

if __name__ == "__main__":
    count_demo()           
data_new:
[[0 1 2 0 1 1]
[1 1 0 1 1 0]]
特征名字:
['dislike', 'life', 'like', 'long', 'python', 'short']           

Chinese text word segmentation

Note: single Chinese characters are not counted (the default tokenizer only keeps tokens of two or more characters)!

This method counts the number of occurrences of feature words

from sklearn.feature_extraction.text import CountVectorizer

def count_demo():
    """
    Text feature extraction: CountVectorizer on pre-segmented Chinese text
    :return:
    """
    data = ['我 爱 北京 天安门',
            '天安门 上 太阳 升']
    # 1. Instantiate a transformer class
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())  # toarray converts the sparse matrix to a 2-D array
    print("特征名字:\n", transfer.get_feature_names())
    return None

if __name__ == "__main__":
    count_demo()           
data_new:
[[1 1 0]
[0 1 1]]
特征名字:
['北京', '天安门', '太阳']           

Example 2

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def count_chinese_demo2():
    """
    Chinese text feature extraction with automatic word segmentation
    :return:
    """
    data = ['一种还是一种今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。',
            '我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。',
            '如果只用一种方式了解某件事物,他就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。']
    data_new = []
    for sent in data:
        data_new.append(cut_word(sent))
    print(data_new)
    # 1. Instantiate a transformer class
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data_final = transfer.fit_transform(data_new)
    print("data_final:\n", data_final.toarray())
    print("特征名字:\n", transfer.get_feature_names())
    return None

def cut_word(text):
    """
    Chinese word segmentation: "我爱北京天安门" -> "我 爱 北京 天安门"
    :param text:
    :return:
    """
    return ' '.join(jieba.cut(text))

if __name__ == "__main__":
    count_chinese_demo2()
    # print(cut_word('我爱北京天安门'))           
['一种 还是 一种 今天 很 残酷 , 明天 更 残酷 , 后天 很 美好 , 但 绝对 大部分 是 死 在 明天 晚上 , 所以 每个 人 不要 放弃 今天 。', '我们 看到 的 从 很 远 星系 来 的 光是在 几百万年 之前 发出 的 , 这样 当 我们 看到 宇宙 时 , 我们 是 在 看 它 的 过去 。', '如果 只用 一种 方式 了解 某件事 物 , 他 就 不会 真正 了解 它 。 了解 事物 真正 含义 的 秘密 取决于 如何 将 其 与 我们 所 了解 的 事物 相 联系 。']
data_final:
[[2 0 1 0 0 0 2 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 2 0 1 0 2 1 0 0 0 1 1 0 0 1
0]
[0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0
1]
[1 1 0 0 4 2 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 2 1 0 0 1 0 0
0]]
特征名字:
['一种', '不会', '不要', '之前', '了解', '事物', '今天', '光是在', '几百万年', '发出', '取决于', '只用', '后天', '含义', '大部分', '如何', '如果', '宇宙', '我们', '所以', '放弃', '方式', '明天', '星系', '晚上', '某件事', '残酷', '每个', '看到', '真正', '秘密', '绝对', '美好', '联系', '过去', '还是', '这样']           

Keywords: words that appear frequently in articles of one category but rarely in articles of other categories

5 TF-IDF text feature extraction

The main idea of TF-IDF: if a word or phrase appears frequently in one article but rarely in other articles, it is considered to have good category-discriminating power and to be suitable for classification

  • TF-IDF role: used to evaluate how important a word is to one document in a document set or corpus
  • This method computes the importance of feature words
  • TF-IDF = TF × IDF: measures importance
  • TF: term frequency (how often the word appears in the document)
  • IDF: inverse document frequency, obtained by dividing the total number of documents by the number of documents containing the word and then taking the base-10 logarithm of the quotient (see the worked example below)
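
A quick worked example (numbers invented for illustration): suppose a corpus contains 1,000 documents and the word "economy" appears in 10 of them; in a particular 100-word article it appears 5 times. Then TF = 5 / 100 = 0.05, IDF = lg(1000 / 10) = 2, and TF-IDF = 0.05 × 2 = 0.1.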
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import jieba

def cut_word(text):
    """
    Chinese word segmentation: "我爱北京天安门" -> "我 爱 北京 天安门"
    :param text:
    :return:
    """
    return ' '.join(jieba.cut(text))

def tfidf_demo():
    """
    Text feature extraction with TF-IDF
    :return:
    """
    data = ['一种还是一种今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。',
            '我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。',
            '如果只用一种方式了解某件事物,他就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。']
    data_new = []
    for sent in data:
        data_new.append(cut_word(sent))
    print(data_new)
    # 1. Instantiate a transformer class
    transfer = TfidfVectorizer()
    # 2. Call fit_transform
    data_final = transfer.fit_transform(data_new)
    print("data_final:\n", data_final.toarray())
    print("特征名字:\n", transfer.get_feature_names())
    return None

if __name__ == "__main__":
    tfidf_demo()
    # print(cut_word('我爱北京天安门'))           
['一种 还是 一种 今天 很 残酷 , 明天 更 残酷 , 后天 很 美好 , 但 绝对 大部分 是 死 在 明天 晚上 , 所以 每个 人 不要 放弃 今天 。', '我们 看到 的 从 很 远 星系 来 的 光是在 几百万年 之前 发出 的 , 这样 当 我们 看到 宇宙 时 , 我们 是 在 看 它 的 过去 。', '如果 只用 一种 方式 了解 某件事 物 , 他 就 不会 真正 了解 它 。 了解 事物 真正 含义 的 秘密 取决于 如何 将 其 与 我们 所 了解 的 事物 相 联系 。']
data_final:
[[0.30847454 0. 0.20280347 0. 0. 0.
0.40560694 0. 0. 0. 0. 0.
0.20280347 0. 0.20280347 0. 0. 0.
0. 0.20280347 0.20280347 0. 0.40560694 0.
0.20280347 0. 0.40560694 0.20280347 0. 0.
0. 0.20280347 0.20280347 0. 0. 0.20280347
0. ]
[0. 0. 0. 0.2410822 0. 0.
0. 0.2410822 0.2410822 0.2410822 0. 0.
0. 0. 0. 0. 0. 0.2410822
0.55004769 0. 0. 0. 0. 0.2410822
0. 0. 0. 0. 0.48216441 0.
0. 0. 0. 0. 0.2410822 0.
0.2410822 ]
[0.12826533 0.16865349 0. 0. 0.67461397 0.33730698
0. 0. 0. 0. 0.16865349 0.16865349
0. 0.16865349 0. 0.16865349 0.16865349 0.
0.12826533 0. 0. 0.16865349 0. 0.
0. 0.16865349 0. 0. 0. 0.33730698
0.16865349 0. 0. 0.16865349 0. 0.
0. ]]
特征名字:
['一种', '不会', '不要', '之前', '了解', '事物', '今天', '光是在', '几百万年', '发出', '取决于', '只用', '后天', '含义', '大部分', '如何', '如果', '宇宙', '我们', '所以', '放弃', '方式', '明天', '星系', '晚上', '某件事', '残酷', '每个', '看到', '真正', '秘密', '绝对', '美好', '联系', '过去', '还是', '这样']           

2.4 Feature Preprocessing

2.4.1 What is feature preprocessing?

The process of converting feature data into feature data more suitable for the algorithm model through some conversion functions

Making numerical data dimensionless (putting features on a comparable scale):

  • Normalization
  • Standardization

2 Feature preprocessing API

sklearn.preprocessing           

Why normalize/standardize?

  • Features often differ greatly in unit or magnitude, or one feature is several orders of magnitude larger than the others; that feature then tends to dominate the target result and prevents some algorithms from learning the other features properly (see the example below)
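
For example, suppose two people are described by (height in metres, annual income in yuan): A = (1.75, 80000) and B = (1.60, 60000). A distance-based algorithm such as KNN computes (80000 - 60000)^2 + (1.75 - 1.60)^2, which is about 4×10^8 plus a negligible 0.02, so the income feature completely swamps the height feature until both are rescaled to a comparable range.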

2.4.2 Normalization

1 Definitions

Transforms (maps) the original data into a given interval (by default [0, 1])

2 Formula

X' = (x - min) / (max - min)
X'' = X' · (mx - mi) + mi

where max and min are the maximum and minimum of the feature column, and mx, mi are the bounds of the target interval (by default 1 and 0).
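
For example, for a feature column [90, 60, 75] with the default range [0, 1]: min = 60 and max = 90, so 90 maps to (90 - 60) / (90 - 60) = 1.0, 60 maps to 0.0, and 75 maps to (75 - 60) / 30 = 0.5.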

3 API

sklearn.preprocessing.MinMaxScaler(feature_range=(0,1)…)

MinMaxScaler.fit_transform(X), where X is data in numpy array format [n_samples, n_features]; returns the transformed array with the same shape

4 Data calculation

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def minmax_demo():
    """
    Normalization (min-max scaling)
    :return:
    """
    # 1. Load the data
    data = pd.read_csv("datingTestSet2.txt", sep='\t')
    data = data.iloc[:, :3]
    print("data:\n", data)
    # 2. Instantiate a transformer class
    transform = MinMaxScaler()
    # transform = MinMaxScaler(feature_range=[2, 3])
    # 3. Call fit_transform
    data_new = transform.fit_transform(data)
    print("data_new:\n", data_new)
    return None

if __name__ == "__main__":
    minmax_demo()           

Question: what happens if there are outliers?

An outlier becomes the column's maximum or minimum value, so it directly distorts the mapping.

5 Normalization summary

Note that the maximum and minimum values are variable, and in addition, the maximum and minimum values are very susceptible to outliers.

Therefore, this method is not very robust and is only suitable for traditional, small, clean-data scenarios

2.4.3 Standardization

1 Definitions

Transforms the original data so that each feature has a mean of 0 and a standard deviation of 1

2 Formula

X' = (x - mean) / σ

where mean is the mean of the feature column and σ is its standard deviation.
  • For normalization: an outlier that affects the maximum or minimum obviously changes the result
  • For standardization: given a reasonable amount of data, a few outliers have little effect on the mean, so the variance changes little (see the sketch below)
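
A small illustrative sketch of this difference (toy random data, not the dating dataset used below; the values in the comments are approximate):

# Toy comparison: one extreme value becomes the column's max, so it dominates
# MinMaxScaler's (x - min) / (max - min); with 1000 samples the same value shifts
# the mean and standard deviation used by StandardScaler far less.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
col = rng.normal(loc=50, scale=5, size=(1000, 1))  # 1000 "ordinary" samples
col[0, 0] = 500.0                                  # one outlier

normalized = MinMaxScaler().fit_transform(col)
standardized = StandardScaler().fit_transform(col)

# Spread that the ordinary samples keep after scaling (outlier excluded):
print(normalized[1:].max() - normalized[1:].min())     # roughly 0.07: squeezed near 0
print(standardized[1:].max() - standardized[1:].min()) # roughly 2: structure kept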

4 Code

sklearn.preprocessing.StandardScaler()

After processing, for each column, all data are clustered around 0 with a standard deviation of 1

StandardScaler.fit_transform(X), where X is data in numpy array format [n_samples, n_features]; returns the transformed array with the same shape

import pandas as pd
from sklearn.preprocessing import StandardScaler

def stand_demo():
    """
    Standardization
    :return:
    """
    # 1. Load the data
    data = pd.read_csv("datingTestSet2.txt", sep='\t')
    data = data.iloc[:, :3]
    print("data:\n", data)
    # 2. Instantiate a transformer class
    transform = StandardScaler()
    # 3. Call fit_transform
    data_new = transform.fit_transform(data)
    print("data_new:\n", data_new)
    return None

if __name__ == "__main__":
    stand_demo()           

5 Summary of standardization

Standardization is relatively stable when there are enough samples, and it suits modern, noisy, big-data scenarios

2.5 Feature dimensionality reduction

2.5.1 Dimensionality reduction

Dimensionality reduction is the process of reducing the number of random variables (features), under certain constraints, to obtain a set of "uncorrelated" principal variables

2.5.2 Two ways to reduce dimensionality

Feature selection

Principal component analysis (which can be understood as a form of feature extraction)

2.5.3 What is feature selection

1 Definitions

The data often contains redundant or correlated variables (features, attributes, indicators, etc.); feature selection aims to find the main features among the original ones


2 Methods

Filter methods: mainly examine the properties of the features themselves, the correlation between features, and the correlation between features and the target value

  • (1) Variance selection method: low variance feature filtering
  • (2) Correlation coefficient: the degree of correlation between features and features

Embedded: The algorithm automatically selects features (correlations between features and target values)

  • (1) Decision tree: information entropy, information gain
  • (2) Regularization: L1, L2
  • (3) Deep learning: convolution, etc

3 Modules

sklearn.feature_selection           

4 Filtered

4.1 Low variance feature filtering

Remove some characteristics of low variance

  • Small feature variance: Most samples of a feature have similar values
  • Large feature variance: The values of many samples of a feature are different

4.1.1 API

sklearn.feature_selection.VarianceThreshold(threshold=0.0)

Remove all low variance characteristics

VarianceThreshold.fit_transform(X), where X is data in numpy array format [n_samples, n_features]; features whose training-set variance is lower than threshold are removed. The default keeps all non-zero-variance features, i.e. it only removes features that have the same value in every sample

4.1.2 Data Calculation

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def variance_demo():
    """
    Low-variance feature filtering
    :return:
    """
    # 1. Load the data
    data = pd.read_csv('factor_returns.csv')
    print('data:\n', data)
    data = data.iloc[:, 1:-2]
    print('data:\n', data)
    # 2. Instantiate a transformer class
    # transform = VarianceThreshold()
    transform = VarianceThreshold(threshold=10)
    # 3. Call fit_transform
    data_new = transform.fit_transform(data)
    print("data_new\n", data_new, data_new.shape)
    return None

if __name__ == "__main__":
    variance_demo()           

4.2 Correlation coefficient

Pearson Correlation Coefficient: A statistical indicator that reflects the degree of correlation between variables

Formula:

r = (n·Σxy - Σx·Σy) / (√(n·Σx² - (Σx)²) · √(n·Σy² - (Σy)²))

Calculation process

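A quick check of the formula with toy numbers: for x = (1, 2, 3) and y = (2, 4, 6), n = 3, Σxy = 28, Σx = 6, Σy = 12, Σx² = 14 and Σy² = 56, so r = (3·28 - 6·12) / (√(3·14 - 6²) · √(3·56 - 12²)) = 12 / (√6 · √24) = 12 / 12 = 1, i.e. the two variables are perfectly positively correlated.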

4.2.3 Features

The value of the correlation coefficient is between -1 and +1, that is, -1<=r<=+1. Its nature is as follows:

  • When r > 0, the two variables are positively correlated; when r < 0, they are negatively correlated
  • When |r| = 1, the two variables are perfectly correlated; when r = 0, there is no linear correlation between them
  • When 0 < |r| < 1, the two variables are correlated to some degree; the closer |r| is to 1, the stronger the linear relationship, and the closer |r| is to 0, the weaker it is
  • Roughly three levels are distinguished: |r| < 0.4 is low correlation, 0.4 <= |r| < 0.7 is significant correlation, and 0.7 <= |r| < 1 is high linear correlation

4.2.4 API

from scipy.stats import pearsonr           
  • x: (N,) array_like
  • y: (N,) array_like; returns: (Pearson's correlation coefficient, p-value)
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from scipy.stats import pearsonr

def variance_demo():
    """
    Low-variance feature filtering and correlation analysis
    :return:
    """
    # 1. Load the data
    data = pd.read_csv('factor_returns.csv')
    print('data:\n', data)
    data = data.iloc[:, 1:-2]
    print('data:\n', data)
    # 2. Instantiate a transformer class
    # transform = VarianceThreshold()
    transform = VarianceThreshold(threshold=10)
    # 3. Call fit_transform
    data_new = transform.fit_transform(data)
    print("data_new\n", data_new, data_new.shape)
    # Compute the correlation coefficient between two variables
    r = pearsonr(data["pe_ratio"], data["pb_ratio"])
    print("相关系数:\n", r)
    return None

if __name__ == "__main__":
    variance_demo()           

In the returned tuple, the correlation coefficient is the first value (the second is the p-value).


If two features are highly correlated with each other, there are three options (the first two are sketched in the code below):

  • 1) Select one of them
  • 2) Weighted summation
  • 3) Principal component analysis
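
A minimal sketch of options 1) and 2); the toy DataFrame, its column names (echoing the factor_returns.csv example above), and the 0.5 weights are invented for illustration:

import pandas as pd

# Toy data: pe_ratio and pb_ratio are (here perfectly) correlated
data = pd.DataFrame({"pe_ratio": [10.0, 12.0, 14.0],
                     "pb_ratio": [1.0, 1.2, 1.4],
                     "revenue": [3.0, 5.0, 4.0]})

# 1) keep only one of the two correlated features
reduced = data.drop(columns=["pb_ratio"])

# 2) or replace the pair with a weighted sum
data["pe_pb_combined"] = 0.5 * data["pe_ratio"] + 0.5 * data["pb_ratio"]

print(reduced.columns.tolist())
print(data["pe_pb_combined"].tolist())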

2.6 Principal Component Analysis

2.6.1 What is Principal Component Analysis (PCA)

  • Definition: The process of converting high-dimensional data to low-dimensional data, in which the original data may be discarded and new variables created
  • Purpose: compress the dimensionality of the data, reducing the dimension (complexity) of the original data as much as possible while losing as little information as possible
  • Application: Regression analysis or cluster analysis

1 Understanding through a small example

  • Reducing two dimensions down to one dimension

2 Code

sklearn.decomposition.PCA(n_components=None)

Decompose the data into lower-dimensional spaces

n_components:

  • A decimal: the fraction of the information (variance) to keep
  • An integer: the number of features (components) to reduce to

PCA.fit_transform(X), where X is data in numpy array format [n_samples, n_features]; returns an array reduced to the specified number of dimensions

3 Data calculation

from sklearn.decomposition import PCA

def pca_demo():
    """
    PCA dimensionality reduction
    :return:
    """
    data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
    # 1. Instantiate a transformer class
    transform = PCA(n_components=2)  # reduce 4 features to 2
    # 2. Call fit_transform
    data_new = transform.fit_transform(data)
    print("data_new\n", data_new)
    transform2 = PCA(n_components=0.95)  # keep 95% of the information
    data_new2 = transform2.fit_transform(data)
    print("data_new2\n", data_new2)
    return None

if __name__ == "__main__":
    pca_demo()           
data_new
[[ 1.28620952e-15 3.82970843e+00]
[ 5.74456265e+00 -1.91485422e+00]
[-5.74456265e+00 -1.91485422e+00]]
data_new2
[[ 1.28620952e-15 3.82970843e+00]
[ 5.74456265e+00 -1.91485422e+00]
[-5.74456265e+00 -1.91485422e+00]]           

2.6.2 Case study: dimensionality reduction of users' preferences for item categories

Data:

  • 1) order_products__prior.csv: order and product information
  • Fields: order_id, product_id, add_to_cart_order, reordered
  • 2) products.csv: product information
  • Fields: product_id, product_name, aisle_id, department_id
  • 3) orders.csv: the users' order information
  • Fields: order_id, user_id, eval_set, order_number, ...
  • 4) aisles.csv: the specific category (aisle) each product belongs to
  • Fields: aisle_id, aisle

The data is processed as follows:


Requirements:

1) user_id and aisle need to be in the same table: merge

2) Find the relationship between user_id and aisle: crosstab (pivot table)

# 1. Load the data
# 2. Merge the tables
# 3. Find the relationship between user_id and aisle
# 4. Reduce dimensionality with PCA
import pandas as pd

# 1. Load the data
order_products = pd.read_csv('./instacart/order_products__prior.csv')  # 32434489 x 4
products = pd.read_csv('./instacart/products.csv')  # (49688, 4)
orders = pd.read_csv('./instacart/orders.csv')  # 3421083 rows x 7 columns
aisles = pd.read_csv('./instacart/aisles.csv')  # (134, 2)

# 2. Merge the tables
# merge aisles and products
tab1 = pd.merge(aisles, products, on="aisle_id")  # 49688 x 5
tab2 = pd.merge(tab1, order_products, on="product_id")  # 32434489 x 8
tab3 = pd.merge(tab2, orders, on="order_id")  # 32434489 x 14
# tab3.head()

# 3. Find the relationship between user_id and aisle
table = pd.crosstab(tab3["user_id"], tab3["aisle"])  # 206209 rows x 134 columns
data = table[:10000]  # 10000 rows x 134 columns

# 4. Reduce dimensionality with PCA
from sklearn.decomposition import PCA
# 1) Instantiate a transformer class
transfer = PCA(n_components=0.95)  # keep 95% of the information
# 2) Call fit_transform
data_new = transfer.fit_transform(data)  # (10000, 42): from 134 features down to 42           

2.7 Summary of the first day of machine learning

