Python Data Mining Projects: Practical Notes

These notes summarize the data processing methods used in the projects of the book "Python Data Analysis and Mining Practice":

Data preprocessing methods
Modeling methods
Plotting

For classification problems: classify with a model, compute and plot the confusion matrix, and inspect the ROC curve.

For clustering problems: choose the number of clusters, obtain the cluster centers, and describe the clusters with a parallel-coordinates plot.

(1) Data preprocessing methods

Freshly collected data usually contains erroneous values. The typical workflow is:

First, fill in the missing values.

Second, examine the distribution of the data.

Third, clean the data so that all values are reasonable.

Fourth, reduce the data by extracting the important attributes.

Fifth, normalize the data so that the K-Means algorithm can be used.

Data reading

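# SVM, Bayes, ANN, decision trees, etc. operate on matrices, so convert first
data = data.to_numpy()  # convert to a NumPy array (as_matrix() was removed in pandas 1.0)

from numpy.random import shuffle
shuffle(data)  # shuffle the rows in place
data_train = data[:int(0.8 * len(data)), :]  # assumed 80/20 split: first 80% for training
x_train = data_train[:, 2:] * 30  # scale up the features

import pickle
pickle.dump(model, open('../tmp/svm.model', 'wb'))  # save the model
model = pickle.load(open('../tmp/svm.model', 'rb'))  # load the model

# Save the data in a fixed file format
pd.DataFrame(cm_train, index=range(5), columns=range(5)).to_excel(outputfile1)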

1. Lagrange interpolation

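from scipy.interpolate import lagrange
# s: the Series, n: index of the missing value, k: number of neighbours taken on each side
# Take the k values before and after n (two index lists joined end to end);
# reindex tolerates positions that fall outside the Series
y = s.reindex(list(range(n - k, n)) + list(range(n + 1, n + 1 + k)))
y = y[y.notnull()]  # drop neighbours that are themselves missing
res = lagrange(y.index, list(y))(n)  # fit on (index, value) pairs and evaluate at n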
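
To see what the interpolation actually computes, here is a minimal pure-Python sketch of the Lagrange formula. The helper name `lagrange_eval` and the sample points are made up for illustration; in practice `scipy.interpolate.lagrange` does the same job.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial L_i(x): equals 1 at xi and 0 at every other node
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical column with a missing value at index 2;
# the known neighbours follow y = x^2 + x + 1
xs = [0, 1, 3, 4]
ys = [1, 3, 13, 21]
filled = lagrange_eval(xs, ys, 2)
print(filled)
```

Because the neighbours lie on a quadratic, the interpolated value at index 2 matches that quadratic. With many neighbours the polynomial degree grows and the fit can oscillate, which is one reason to keep k small.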

2. Built-in interpolation of Series and DataFrame

df = pd.DataFrame(data, columns=[0, 1, 2])
df = df.interpolate()  # linear interpolation by default; returns a new DataFrame

3. Describing the electricity trend

Draw a line chart and observe the downward trend in electricity usage.

4. Data exploration with describe(include='all').T

Data exploration reveals outliers and illogical values. If the dataset is large, such records can be discarded; otherwise they can be filled in.

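import numpy as np

explore = data.describe(include='all').T  # one row per attribute
explore['null'] = len(data) - explore['count']  # number of missing values per attribute
# describe() offers count, unique, top, freq, mean, std, min, 50%, max, etc.; keep a few
explore = explore[['null', 'max', 'min', 'std']]

# Correlation matrix: for n numeric columns the result is an n*n matrix
print(np.round(data.corr(method='pearson'), 2))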

5. Data cleaning

Treat the conditions like set operations: build a boolean index for each rule, combine the indexes, and exclude the records with illogical values.

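# Drop records where SUM_YR_1 or SUM_YR_2 is missing
data = data[data['SUM_YR_1'].notnull() & data['SUM_YR_2'].notnull()]
# Keep a record if SUM_YR_1 is non-zero, or SUM_YR_2 is non-zero,
# or both SEG_KM_SUM and avg_discount are 0
index1 = data['SUM_YR_1'] != 0
index2 = data['SUM_YR_2'] != 0
index3 = (data['SEG_KM_SUM'] == 0) & (data['avg_discount'] == 0)
data = data[index1 | index2 | index3]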
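
A small worked example may make the mask logic easier to follow. The four rows below are invented for illustration; only the column names come from the snippet above.

```python
import pandas as pd

# Hypothetical records (values invented for this sketch)
data = pd.DataFrame({
    'SUM_YR_1':     [100.0, 0.0, 0.0, None],
    'SUM_YR_2':     [0.0,   0.0, 0.0, 50.0],
    'SEG_KM_SUM':   [1000,  0,   500, 800],
    'avg_discount': [0.8,   0.0, 0.5, 0.9],
})

# Step 1: rows with a missing SUM_YR_1 or SUM_YR_2 are dropped (row 3 here)
data = data[data['SUM_YR_1'].notnull() & data['SUM_YR_2'].notnull()]

# Step 2: boolean masks, combined like a set union
index1 = data['SUM_YR_1'] != 0
index2 = data['SUM_YR_2'] != 0
index3 = (data['SEG_KM_SUM'] == 0) & (data['avg_discount'] == 0)
cleaned = data[index1 | index2 | index3]

kept = list(cleaned.index)
print(kept)
```

Row 2 is removed because both sums are zero while the mileage and discount are positive, which is the illogical combination the masks are meant to catch.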

6. Attribute conversion

Data reduction means selecting the useful attributes. The simplest way is to delete the unneeded columns in Excel.

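# Select rows by condition
data = data[data['TARGET_ID'] == 184].copy()  # copy of the rows matching the condition
data_group = data.groupby('COLLECTTIME')  # group by collection time

def attr_trans(x):  # attribute transformation applied to each group
    # Create a new Series whose index holds the output column labels
    result = pd.Series(index=['SYS_NAME', 'CWXT_DB:184:C:\\', 'CWXT_DB:184:D:\\', 'COLLECTTIME'], dtype=object)
    result['SYS_NAME'] = x['SYS_NAME'].iloc[0]  # identical within the group, take the first
    result['COLLECTTIME'] = x['COLLECTTIME'].iloc[0]  # identical within the group, take the first
    result['CWXT_DB:184:C:\\'] = x['VALUE'].iloc[0]  # value of attribute A
    result['CWXT_DB:184:D:\\'] = x['VALUE'].iloc[1]  # value of attribute B, and so on
    return result

data_processed = data_group.apply(attr_trans)  # process group by group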

7. Data normalization and standardization

Z-score standardization rescales each column to zero mean and unit standard deviation: (x - μ) / σ

data = (data - data.mean(axis=0)) / data.std(axis=0)  # column-wise mean and standard deviation, applied as a matrix operation

When the value ranges differ so much that they dominate the result or make calculation inconvenient, apply min-max normalization, which maps each column to [0, 1]:

data = (data - data.min()) / (data.max() - data.min())
data = data.reset_index()
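
The two rescalings can be checked on a tiny array. This is a NumPy-only sketch with invented numbers; note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default while NumPy defaults to `ddof=0`, so `ddof=1` is passed explicitly to match the pandas code above.

```python
import numpy as np

# Two hypothetical attributes on very different scales
x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Z-score: subtract the column mean, divide by the column standard deviation
zscore = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# Min-max: rescale every column to the range [0, 1]
minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(zscore)
print(minmax)
```

After either transformation both columns carry the same weight, which matters for distance-based methods such as K-Means.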

8. Data discretization

Mining frequent itemsets requires discrete data, so continuous attributes must be discretized first.

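import pandas as pd
from sklearn.cluster import KMeans

k = 4  # clusters per attribute, matching the index [1, 2, 3, 4] below
result = pd.DataFrame()
for key in keys:
    col = typelabel[key]
    # Discretize by clustering one attribute with k-means
    # (assumes each key is a column name of data)
    kmodel = KMeans(n_clusters=k)
    kmodel.fit(data[[key]].to_numpy())
    r1 = pd.DataFrame(kmodel.cluster_centers_, columns=[col])  # cluster centers, e.g. A
    r2 = pd.Series(kmodel.labels_).value_counts().rename(col + 'n')  # samples per cluster, e.g. An
    r = pd.concat([r1, r2], axis=1)  # match each center with its count
    r = r.sort_values(col)
    r.index = [1, 2, 3, 4]
    # Mean of two adjacent centers becomes the boundary point (pd.rolling_mean was removed from pandas)
    r[col] = r[col].rolling(2).mean()
    r.loc[1, col] = 0.0  # together with the line above, this turns the centers into boundary points
    result = pd.concat([result, r.T])  # append the transposed block

result = result.sort_index()  # sort by index (A, B, C, D, E, F) before saving
result.to_excel(processedfile)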
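
Once the boundary points exist, each continuous value still has to be mapped to a discrete label. A minimal sketch with `np.digitize`, assuming made-up boundaries and a label prefix `A` (neither is taken from the book's data):

```python
import numpy as np

# Hypothetical boundary points for one attribute, in ascending order
boundaries = [0.0, 0.3, 0.6, 0.8]
values = np.array([0.1, 0.35, 0.7, 0.95])

# digitize returns i such that boundaries[i-1] <= value < boundaries[i],
# so the first interval is labelled 1, the second 2, and so on
levels = np.digitize(values, boundaries)
labels = ['A' + str(i) for i in levels]
print(labels)
```

The resulting labels (one per record and attribute) are the kind of discrete items a frequent-itemset algorithm can consume.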

9. Image cropping and color moment extraction

Here $p_{ij}$ is the value of the j-th pixel in color channel i and N is the number of pixels.

1. First-order color moment: the first-order raw moment (the mean), which reflects the overall brightness of the image

$E_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij}$

2. Second-order color moment: reflects the spread of the image's color distribution

$\sigma_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^2 \right)^{1/2}$

3. Third-order color moment: reflects the symmetry of the image's color distribution
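
The three moments can be computed per channel in a few lines of NumPy. This is a sketch, not the book's code: the function name `color_moments` and the 2x2 test channel are invented, and the third moment uses the usual definition, the signed cube root of the mean cubed deviation.

```python
import numpy as np

def color_moments(channel):
    """Return the first three color moments of one color channel (a 2-D array)."""
    p = np.asarray(channel, dtype=float).ravel()
    mean = p.mean()                             # first order: average intensity
    std = np.sqrt(((p - mean) ** 2).mean())     # second order: spread
    skew = np.cbrt(((p - mean) ** 3).mean())    # third order: asymmetry (signed cube root)
    return mean, std, skew

# Hypothetical channel: three dark pixels and one bright pixel
channel = np.array([[0.0, 0.0],
                    [0.0, 4.0]])
mean, std, skew = color_moments(channel)
print(mean, std, skew)
```

For an RGB image, calling the function once per channel yields nine features that can be fed to a classifier.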

10. Time series algorithms

A time series model predicts future data from historical data.

The model is fitted on the input data, then tested and validated. The error between the predicted values and the validation data is computed with an error formula to decide whether it falls within the range the business can accept.

Model identification: AR, MA, ARMA

Workflow: stationarity test, white noise test, model identification, model test, model prediction, model evaluation, model application
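
The notes do not say which error formula is used for model evaluation. As an assumption, the sketch below computes three common ones (MAE, RMSE, and MAPE) between validation data and predictions; the function name and the numbers are invented.

```python
import math

def forecast_errors(actual, predicted):
    """Return (MAE, RMSE, MAPE) between validation data and predictions."""
    diffs = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    # MAPE is undefined when an actual value is 0
    mape = sum(abs(d / a) for d, a in zip(diffs, actual)) / len(diffs)
    return mae, rmse, mape

# Hypothetical validation data and model output
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 380.0]
mae, rmse, mape = forecast_errors(actual, predicted)
print(mae, rmse, mape)
```

Whether the result is acceptable depends on the business threshold, for example a MAPE below some agreed percentage.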

11. Behavior analysis and service recommendation

Connect to the database

Collaborative filtering is the main algorithm, supplemented by others.

Recommendation

Item similarity measures: cosine of the angle; Jaccard similarity coefficient; correlation coefficient

Get familiar with the item-based collaborative filtering algorithm:

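# Item-based collaborative filtering
import numpy as np

def Jaccard(a, b):  # Jaccard similarity of two 0/1 vectors
    return 1.0 * (a * b).sum() / (a + b - a * b).sum()

class Recommender():
    sim = None  # item-item similarity matrix

    def similarity(self, x, distance):  # pairwise similarity between the rows of x
        y = np.ones((len(x), len(x)))
        for i in range(len(x)):
            for j in range(len(x)):
                y[i, j] = distance(x[i], x[j])
        return y

    def fit(self, x, distance=Jaccard):  # x is a matrix (rows: items, columns: users)
        self.sim = self.similarity(x, distance)  # compute the similarities

    def recommend(self, a):  # a: purchase record of the target user, transposed to n * 1
        return np.dot(self.sim, a) * (1 - a)  # items already bought get a score of 0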
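
Cosine similarity, the first measure listed above, can be used in place of the Jaccard coefficient. The following self-contained sketch uses an invented 3-item, 4-user purchase matrix and a hypothetical helper `cosine`; it mirrors the scoring rule of `recommend` rather than reusing the class.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors (0 if either is all zeros)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm else 0.0

# Hypothetical 0/1 purchase matrix: rows are items, columns are users
x = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
])

# Item-item similarity matrix
sim = np.array([[cosine(xi, xj) for xj in x] for xi in x])

# Score the items for a user who bought only item 0;
# multiplying by (1 - a) zeroes out what the user already owns
a = np.array([1, 0, 0])
scores = sim @ a * (1 - a)
best = int(np.argmax(scores))
print(scores, best)
```

Item 1 shares two buyers with item 0 and therefore gets the highest score, while item 2 shares none.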

12. Variable selection and grey prediction

Use Lasso to select variables from the processed data.

Use grey prediction to forecast the values of the key influencing factors.

Use a neural network to forecast fiscal revenue.
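
Grey prediction is usually done with the GM(1,1) model. Below is a minimal sketch of one common formulation, not necessarily the book's exact code: accumulate the series, estimate the coefficients a and b by least squares, and restore predictions by differencing. The function name `gm11` and the sample series are invented.

```python
import numpy as np

def gm11(x0):
    """Fit a GM(1,1) grey model; return a function giving the k-th value (k = 1, 2, ...)."""
    x0 = np.asarray(x0, dtype=float)
    x1 = x0.cumsum()                         # accumulated series
    z1 = (x1[:-1] + x1[1:]) / 2.0            # mean of adjacent accumulated values
    B = np.column_stack([-z1, np.ones_like(z1)])
    Y = x0[1:]
    (a, b), *_ = np.linalg.lstsq(B, Y, rcond=None)  # least-squares estimate of a and b

    def predict(k):
        if k == 1:
            return x0[0]
        c = x0[0] - b / a                    # assumes a != 0 (series is not constant)
        return c * (np.exp(-a * (k - 1)) - np.exp(-a * (k - 2)))

    return predict

# Hypothetical factor growing about 10% per period
history = [100.0, 110.0, 121.0, 133.1]
f = gm11(history)
next_value = f(5)   # forecast for the period after the last observation
print(next_value)
```

On this near-exponential series the forecast lands close to a further 10% increase. The forecasts of each key factor would then serve as inputs to the neural network.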

13. Text preprocessing

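# Remove duplicate comments
l1 = len(data)
data = pd.DataFrame(data[0].unique())  # run unique() on the selected column
l2 = len(data)
data.to_csv(outputfile, index=False, header=False, encoding='utf-8')
print('Removed %s duplicate comments.' % (l1 - l2))

# Mechanical compression: remove consecutively repeated text, then delete short sentences; this filters out much of the junk

# Tokenize the comments
import jieba
mycut = lambda s: ' '.join(jieba.cut(s))  # simple custom tokenizer
data1 = data1[0].apply(mycut)  # tokenize the loaded data
data2 = data2[0].apply(mycut)  # apply() works element-wise ("broadcast" style), which speeds things up

# First separate the positive and negative comments (sentiment analysis and classification done with ROSTCM6), then run LDA topic analysis on each group

# Topic analysis of the positive comments
from gensim import corpora, models
pos_dict = corpora.Dictionary(pos[2])
pos_corpus = [pos_dict.doc2bow(i) for i in pos[2]]
pos_lda = models.LdaModel(pos_corpus, num_topics=3, id2word=pos_dict)
for i in range(3):
    print(pos_lda.print_topic(i))  # print each topic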
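
The mechanical compression step is only described in a comment above. A deliberately simplified sketch follows, assuming a single regular expression is good enough; the book's rule set is more elaborate, and the function name `compress`, the `min_len` threshold, and the sample comments are all invented.

```python
import re

def compress(text, min_len=4):
    """Collapse consecutively repeated fragments, then drop texts that end up too short."""
    # (.{2,}?)\1+ matches a fragment of 2+ characters followed by one or more copies of itself
    collapsed = re.sub(r'(.{2,}?)\1+', r'\1', text)
    return collapsed if len(collapsed) >= min_len else None

# Hypothetical comments
comments = ['好用好用好用好用', '安装很快很快很快，师傅态度好', '好好好']
cleaned = [c for c in map(compress, comments) if c is not None]
print(cleaned)
```

The first comment collapses to two characters and is dropped, the third is too short to begin with, and only the informative middle comment survives.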

(2) Modeling methods

1. Neural networks

2. Decision trees

3. K-Means

4. SVM

1. LM neural network

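API:

add(); compile(); fit(); save_weights(); predict() (predict_classes() in older Keras)

from keras.models import Sequential
from keras.layers import Dense
net = Sequential()
net.add(Dense(input_dim=3, units=10, activation='relu'))  # hidden layer
net.add(Dense(units=1, activation='sigmoid'))  # output layer for binary classification
net.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
net.fit(train[:, :3], train[:, 3], epochs=1000, batch_size=1)  # expects arrays; convert data read from Excel with .to_numpy()
net.save_weights(netfile)
# predict_classes() is gone from recent Keras releases; thresholding predict() is equivalent
predict_result = (net.predict(train[:, :3]) > 0.5).astype('int32').reshape(len(train))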

2. CART decision tree

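API:

fit(); predict();

# Build the CART decision tree model
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(train[:, :3], train[:, 3])
# cm_plot is a custom confusion-matrix plotting helper (not part of sklearn),
# assumed to return matplotlib.pyplot
plt = cm_plot(test[:, 3], tree.predict(test[:, :3]))  # plot the result
plt.show()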

3. K-Means clustering

from sklearn.cluster import KMeans
import pandas as pd
inputFile = '../data/zscoreddata.xls'
data = pd.read_excel(inputFile)
kmodel = KMeans(n_clusters=5)  # the n_jobs argument was removed in scikit-learn 1.0
kmodel.fit(data)
print(kmodel.cluster_centers_)  # one row per cluster center

4. SVM (support vector machine)

from sklearn import svm
smodel = svm.SVC()  # build the model
smodel.fit(x_train, y_train)  # train the model
res = smodel.predict(x_test)  # predict on the test set

(3) Plotting

After a model is built, visualize the results to judge whether the mining is reasonable and accurate.

Confusion matrix: shows which samples were classified correctly and incorrectly

ROC curve: shows the performance of a classification method

Cluster group plot: cluster the data into n classes and analyze the characteristics of each of the n groups

Confusion matrix

Prediction accuracy: RMSE; MAE

Classification accuracy: precision = TP/(TP+FP): the proportion of recommended products that the user is actually interested in

recall = TP/(TP+FN): the proportion of the products the user likes that were actually recommended

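from sklearn.metrics import confusion_matrix  # import the confusion matrix function
cm = confusion_matrix(y, yp)  # y: true labels, yp: predicted labels
# For binary labels 0/1, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
# e.g. TP=46, FP=2, FN=7, TN=4 gives cm == [[4, 2], [7, 46]]

cm_train = confusion_matrix(train_label, smodel.predict(trainSet))
cm_test = confusion_matrix(test_label, smodel.predict(testSet))
pd.DataFrame(cm_train).to_excel(outFile1)  # save the training confusion matrix
pd.DataFrame(cm_test).to_excel(outFile2)  # save the test confusion matrix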
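
Plugging the example counts quoted in the comment above (TP=46, FP=2, FN=7, TN=4) into the formulas gives a quick sanity check. Accuracy is added here for comparison; it is not mentioned in the notes.

```python
# Example counts from the confusion matrix comment
tp, fp, fn, tn = 46, 2, 7, 4

precision = tp / (tp + fp)                    # share of positive predictions that are correct
recall = tp / (tp + fn)                       # share of actual positives that were found
accuracy = (tp + tn) / (tp + fp + fn + tn)    # share of all predictions that are correct

print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```

Precision is high while recall is lower, meaning the model rarely raises a false alarm but misses some positives.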

ROC curve

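import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve  # import the ROC curve function
fpr, tpr, thresholds = roc_curve(test[:, 3], tree.predict_proba(test[:, :3])[:, 1], pos_label=1)  # score = probability of the positive class
plt.plot(fpr, tpr, linewidth=2, label='ROC of CART', color='green')  # draw the ROC curve
plt.legend()
plt.show()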
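
What `roc_curve` computes can be reproduced by hand: sort the samples by score and lower the threshold one sample at a time. The NumPy sketch below assumes binary 0/1 labels and distinct scores; the function name `roc_points` and the data are invented.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays; assumes binary 0/1 labels and distinct scores."""
    order = np.argsort(scores)[::-1]                 # highest score first
    y = np.asarray(y_true)[order]
    tp = np.concatenate([[0], np.cumsum(y)])         # true positives at each cut
    fp = np.concatenate([[0], np.cumsum(1 - y)])     # false positives at each cut
    return fp / fp[-1], tp / tp[-1]

# Hypothetical labels and predicted probabilities
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
fpr, tpr = roc_points(y_true, scores)

# Area under the curve by the trapezoid rule
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(auc)
```

The area equals the fraction of positive/negative pairs that the scores rank correctly, 8 of 9 here. A curve closer to the top-left corner means a better classifier.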

Cluster group diagram

import matplotlib.pyplot as plt
centers = kmodel.cluster_centers_
for i in range(5):
    # x positions 2..10 stand for the 5 attributes; y values are the i-th cluster center
    plt.plot([2, 4, 6, 8, 10], centers[i], label='group' + str(i), marker='o')
plt.ylabel('values')
plt.xlabel('index: L R F M C')
plt.legend()
plt.show()

Hierarchical clustering dendrogram

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
# SciPy's hierarchical clustering functions are used here

Z = linkage(data_udf, method='ward', metric='euclidean')  # build the cluster hierarchy
P = dendrogram(Z, 0)  # draw the dendrogram
plt.show()
