Python Data Mining Projects: Practical Notes
These notes summarize the data processing methods used in each project of the book "Python Data Analysis and Mining Practice":
- Data preprocessing methods
- Model methods
- Plotting
For classification problems: classify with a model, plot the confusion matrix, and inspect the ROC curve.
For clustering problems: choose the number of clusters, obtain the cluster centers, and describe the clusters with a parallel-coordinates plot.
(1) Data preprocessing methods
Freshly obtained data usually contains erroneous values. The typical steps are:
First, fill in missing values
Second, observe the distribution of the data
Third, clean the data so that all values are reasonable
Fourth, reduce the data by extracting the important attributes
Fifth, normalize the data so that the K-Means algorithm can be used
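Data reading
# SVM, naive Bayes, ANN, decision trees, etc. operate on matrices, so convert the data first
data = data.to_numpy()  # DataFrame to NumPy array for training (as_matrix() was removed in pandas 1.0)
from numpy.random import shuffle
shuffle(data)  # shuffle the rows in place
x_train = data_train[:, 2:] * 30  # scale up the features
import pickle
pickle.dump(model, open('../tmp/svm.model', 'wb'))  # save the model
model = pickle.load(open('../tmp/svm.model', 'rb'))  # load the model
# Save the results in a fixed file format
pd.DataFrame(cm_train, index=range(5), columns=range(5)).to_excel(outputfile1)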
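1. Lagrange interpolation
from scipy.interpolate import lagrange
# Take k values on each side of position n (two index lists concatenated)
y = s[list(range(n - k, n)) + list(range(n + 1, n + 1 + k))]
y = y[y.notnull()]  # drop neighbours that are missing themselves
res = lagrange(y.index, list(y))(n)  # fit on (index, value) pairs and evaluate at n, the position of the missing value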
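The snippet above can be wrapped into a small helper. The following is a minimal sketch, assuming a Series with the default integer index; the function name `fill_with_lagrange` and the toy data are hypothetical and not taken from the book.

```python
import pandas as pd
from scipy.interpolate import lagrange

def fill_with_lagrange(s, n, k=3):
    """Estimate s[n] from up to k valid neighbours on each side of position n."""
    # Neighbouring positions, clipped to the valid range of the Series
    idx = [i for i in list(range(n - k, n)) + list(range(n + 1, n + 1 + k))
           if 0 <= i < len(s)]
    y = s.iloc[idx]
    y = y[y.notnull()]  # ignore neighbours that are missing themselves
    return lagrange(list(y.index), list(y))(n)

# Hypothetical series following y = x**2 with one missing value at position 3
s = pd.Series([0.0, 1.0, 4.0, None, 16.0, 25.0, 36.0])
s[3] = fill_with_lagrange(s, 3)
print(s[3])
```

Lagrange polynomials of high degree can oscillate, so a small k is usually the safer choice.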
2. Built-in interpolation in pandas
df = pd.DataFrame(data, columns=[0, 1, 2])
df = df.interpolate()  # returns a new DataFrame; linear interpolation by default
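As a quick illustration of the built-in method, here is a minimal sketch on a made-up column (the data is hypothetical). With the default settings, a gap between two known values is filled linearly.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a single gap between two known values
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0]})

filled = df.interpolate()  # default: linear interpolation, column by column
print(filled["a"].tolist())
```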
3. Describing the electricity trend
Draw a line chart and observe the downward trend in electricity usage.
4. Data exploration with describe(include='all').T
Data exploration reveals outliers and illogical values. If the dataset is large, the affected rows can be dropped; otherwise fill them in.
explore = data.describe(include='all').T
explore['null'] = len(data) - explore['count']
explore = explore[['null', 'max', 'min', 'std']]  # keep a few of count, unique, top, freq, mean, std, min, 50%, max
# Compute the correlation matrix (for n numeric columns the result is n x n)
print(np.round(data.corr(method='pearson'), 2))
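To make the exploration step concrete, the following sketch runs it on a tiny made-up table (column names and values are hypothetical). It shows how the `null` column is derived from `count`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries
data = pd.DataFrame({
    "age": [23, 35, np.nan, 41],
    "fare": [100.0, np.nan, np.nan, 250.0],
})

explore = data.describe(include="all").T
explore["null"] = len(data) - explore["count"]  # missing values per column
explore = explore[["null", "max", "min"]]
print(explore)
```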
5. Data cleaning
Build boolean masks for the illogical values, combine them with set-like operators (&, |), and keep only the rows that pass.
data = data[data['SUM_YR_1'].notnull() & data['SUM_YR_2'].notnull()]
index1 = data['SUM_YR_1'] != 0
index2 = data['SUM_YR_2'] != 0
index3 = (data['SEG_KM_SUM'] == 0) & (data['avg_discount'] == 0)
data = data[index1 | index2 | index3]
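The effect of these masks is easier to see on a few rows. This is a minimal sketch with made-up records; only the column names come from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical records; column names follow the snippet above
data = pd.DataFrame({
    "SUM_YR_1":     [100.0, 0.0, np.nan, 0.0],
    "SUM_YR_2":     [200.0, 0.0, 50.0,   0.0],
    "SEG_KM_SUM":   [1500,  0,   800,    900],
    "avg_discount": [0.8,   0.0, 0.9,    0.7],
})

# Step 1: drop rows where either fare column is missing
data = data[data["SUM_YR_1"].notnull() & data["SUM_YR_2"].notnull()]

# Step 2: keep a row if at least one of the three conditions holds
index1 = data["SUM_YR_1"] != 0
index2 = data["SUM_YR_2"] != 0
index3 = (data["SEG_KM_SUM"] == 0) & (data["avg_discount"] == 0)
cleaned = data[index1 | index2 | index3]
print(cleaned.index.tolist())
```

Row 2 is dropped for its missing fare, and row 3 is dropped because both fares are zero while the mileage is not.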
6. Attribute transformation
Data reduction means keeping only the useful attributes; this can also be done by deleting the corresponding columns in Excel.
# Select the rows for one target
data = data[data['TARGET_ID'] == 184].copy()  # work on a copy of the matching rows
data_group = data.groupby('COLLECTTIME')  # group by collection time
def attr_trans(x):  # attribute transformation applied to each group
    # Build a new Series whose index holds the output column names
    result = pd.Series(index=['SYS_NAME', 'CWXT_DB:184:C:\\', 'CWXT_DB:184:D:\\', 'COLLECTTIME'], dtype=object)
    result['SYS_NAME'] = x['SYS_NAME'].iloc[0]  # identical within a group
    result['COLLECTTIME'] = x['COLLECTTIME'].iloc[0]  # identical within a group
    result['CWXT_DB:184:C:\\'] = x['VALUE'].iloc[0]  # first value in the group (attribute A)
    result['CWXT_DB:184:D:\\'] = x['VALUE'].iloc[1]  # second value in the group (attribute B), and so on
    return result
data_processed = data_group.apply(attr_trans)  # process group by group
7. Data normalization and standardization
Z-score standardization rescales each column to zero mean and unit standard deviation: (x - μ) / σ
data = (data - data.mean(axis=0)) / data.std(axis=0)  # column-wise mean and standard deviation, vectorized
When differences in value ranges dominate the result or make computation awkward, apply min-max normalization to rescale each column to [0, 1]:
data = (data - data.min()) / (data.max() - data.min())
data = data.reset_index()
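A small worked example may help compare the two rescalings. The sketch below uses made-up values; note that pandas `std()` uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical two-column data set
data = pd.DataFrame({"L": [10.0, 20.0, 30.0], "R": [1.0, 3.0, 5.0]})

# Z-score: zero mean and unit (sample) standard deviation per column
zscore = (data - data.mean(axis=0)) / data.std(axis=0)

# Min-max: each column rescaled to the interval [0, 1]
minmax = (data - data.min()) / (data.max() - data.min())

print(zscore)
print(minmax)
```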
8. Data discretization
Frequent-itemset mining works on discrete items, so continuous attributes must first be discretized.
for i in range(len(keys)):
    # Run k-means on attribute keys[i] to discretize it by clustering
    r1 = pd.DataFrame(kmodel.cluster_centers_, columns=[typelabel[keys[i]]])  # cluster centers, column A
    r2 = pd.Series(kmodel.labels_).value_counts()  # number of samples per cluster
    r2 = r2.to_frame(typelabel[keys[i]] + 'n')  # counts, column An
    r = pd.concat([r1, r2], axis=1)  # join each center with its sample count
    r = r.sort_values(typelabel[keys[i]])
    r.index = [1, 2, 3, 4]
    # The rolling mean of adjacent centers gives the boundary points (pd.rolling_mean() was removed from pandas)
    r[typelabel[keys[i]]] = r[typelabel[keys[i]]].rolling(2).mean()
    r.loc[1, typelabel[keys[i]]] = 0.0  # with the line above, this replaces the centers with boundary points
    result = pd.concat([result, r.T])  # append the transposed block (DataFrame.append() was removed in pandas 2.0)
result = result.sort_index()  # sort by index (A, B, C, D, E, F) before saving
result.to_excel(processedfile)
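The boundary-point idea can be shown in isolation. The following is a minimal sketch, assuming the one-dimensional cluster centers are already known (the numbers are made up); it turns them into bin edges and assigns each value to a category with `pd.cut`.

```python
import numpy as np
import pandas as pd

# Hypothetical 1-D cluster centers, e.g. taken from a fitted KMeans model
centers = pd.Series([0.9, 0.1, 0.5, 0.3]).sort_values().reset_index(drop=True)

# Midpoints between neighbouring centers act as category boundaries
bounds = centers.rolling(2).mean()
bounds.iloc[0] = 0.0                  # lower bound of the first category
edges = bounds.tolist() + [np.inf]    # the last category is open-ended

values = pd.Series([0.05, 0.25, 0.45, 0.95])
labels = pd.cut(values, bins=edges, labels=[1, 2, 3, 4], right=False)
print(labels.tolist())
```

The first bound is set to 0.0 on the assumption that the attribute is non-negative, as in the snippet above.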
9. Image cropping and color moment extraction
1. First-order color moment: the first-order raw moment (the mean), reflecting the overall brightness of the image
$E_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij}$, where $p_{ij}$ is the value of pixel $j$ in color channel $i$ and $N$ is the number of pixels
2. Second-order color moment: reflects the spread of the image's color distribution
$\sigma_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^2 \right)^{1/2}$
3. Third-order color moment: reflects the symmetry of the image's color distribution
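The three moments can be computed per channel with NumPy. This is a minimal sketch on a made-up 2 x 2 image; the third-order moment is taken here as the cube root of the mean cubed deviation, which is the usual definition but is not spelled out in the text above. The function name `color_moments` is hypothetical.

```python
import numpy as np

def color_moments(img):
    """Return first-, second- and third-order color moments per channel.

    img: array of shape (H, W, C) holding pixel values.
    """
    pixels = img.reshape(-1, img.shape[-1]).astype(float)  # N x C
    mean = pixels.mean(axis=0)                              # first order
    diff = pixels - mean
    std = np.sqrt((diff ** 2).mean(axis=0))                 # second order
    skew = np.cbrt((diff ** 3).mean(axis=0))                # third order (signed cube root)
    return mean, std, skew

# Hypothetical 2 x 2 image with 3 channels
img = np.array([[[0, 10, 5], [2, 10, 5]],
                [[4, 10, 5], [6, 10, 9]]])
mean, std, skew = color_moments(img)
print(mean, std, skew)
```

`np.cbrt` keeps the sign of a negative third moment, which a plain `** (1/3)` would not.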
10. Time series algorithms
A time series model predicts future values from historical data.
The model is fitted on the input data, tested, and validated. The error between the predicted values and the validation data is computed with the chosen error formula and checked against the range the business considers acceptable.
Candidate models: AR, MA, ARMA
Workflow: stationarity test, white noise test, model identification, model checking, forecasting, model evaluation, model application
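The fit, forecast, and evaluate loop described above can be sketched in a few lines. This is a deliberately simplified stand-in: an AR(1) model fitted by ordinary least squares on made-up data, with no stationarity or white noise tests and no ARMA order selection. The helper names `fit_ar1` and `forecast` are hypothetical.

```python
import numpy as np

def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    x_prev, x_next = series[:-1], series[1:]
    A = np.column_stack([np.ones_like(x_prev), x_prev])
    (c, phi), *_ = np.linalg.lstsq(A, x_next, rcond=None)
    return c, phi

def forecast(last, c, phi, steps):
    """Iterate the fitted recurrence `steps` times starting from `last`."""
    out = []
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return np.array(out)

# Hypothetical history generated by x[t] = 2 + 0.5 * x[t-1]
history = [10.0]
for _ in range(9):
    history.append(2 + 0.5 * history[-1])
history = np.array(history)
train, valid = history[:7], history[7:]   # hold out the last 3 points

c, phi = fit_ar1(train)
pred = forecast(train[-1], c, phi, steps=len(valid))

# Error between the predictions and the validation data
mae = np.mean(np.abs(pred - valid))
rmse = np.sqrt(np.mean((pred - valid) ** 2))
mape = np.mean(np.abs((pred - valid) / valid))
print(c, phi, mae, rmse, mape)
```

Whether the resulting error is acceptable is a business decision, for example a threshold on the mean absolute percentage error.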
11. Behavior analysis and service recommendation
Connect to the database
Collaborative filtering is the main algorithm, supplemented by others.
Recommendation
Item similarity measures: cosine similarity; Jaccard similarity coefficient; correlation coefficient
Get familiar with the item-based collaborative filtering algorithm
# Item-based collaborative filtering
def Jaccard(a, b):
    return 1.0 * (a * b).sum() / (a + b - a * b).sum()
class Recommender():
    sim = None
    def similarity(self, x, distance):
        y = np.ones((len(x), len(x)))
        for i in range(len(x)):
            for j in range(len(x)):
                y[i, j] = distance(x[i], x[j])
        return y
    def fit(self, x, distance=Jaccard):  # x is a matrix (rows: items, columns: users)
        self.sim = self.similarity(x, distance)  # compute the item-item similarity matrix
    def recommend(self, a):  # a is the target user's purchase record as a column (n x 1)
        return np.dot(self.sim, a) * (1 - a)
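To see the recommender at work, here is a self-contained sketch that repeats the class and runs it on a made-up item-by-user matrix. The data is hypothetical; the logic follows the code above.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity of two 0/1 vectors."""
    return 1.0 * (a * b).sum() / (a + b - a * b).sum()

class Recommender:
    sim = None

    def similarity(self, x, distance):
        y = np.ones((len(x), len(x)))
        for i in range(len(x)):
            for j in range(len(x)):
                y[i, j] = distance(x[i], x[j])
        return y

    def fit(self, x, distance=jaccard):
        self.sim = self.similarity(x, distance)

    def recommend(self, a):
        # Score every item, then zero out the items already purchased
        return np.dot(self.sim, a) * (1 - a)

# Hypothetical matrix: 3 items (rows) x 4 users (columns)
x = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]])
rec = Recommender()
rec.fit(x)

# A new user who has only bought item 0
scores = rec.recommend(np.array([1, 0, 0]))
print(scores)
```

Item 1 shares two of its three buyers with item 0, so it gets the only non-zero score; the factor `(1 - a)` removes item 0 itself from the recommendations.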
12. Variable selection and grey prediction
Use Lasso on the preprocessed data to select variables
Use grey prediction to forecast the key influencing factors
Use a neural network to forecast fiscal revenue
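The grey prediction step is commonly done with a GM(1,1) model. The sketch below is a minimal NumPy version under that assumption; the function name `gm11` and the sample series are hypothetical, and the posterior checks that usually accompany the model are omitted.

```python
import numpy as np

def gm11(x0, steps=1):
    """Fit a GM(1,1) grey model to x0 and forecast `steps` further values."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.cumsum(x0)                        # accumulated series
    z1 = 0.5 * (x1[1:] + x1[:-1])             # background values
    B = np.column_stack([-z1, np.ones_like(z1)])
    (a, b), *_ = np.linalg.lstsq(B, x0[1:], rcond=None)

    k = np.arange(len(x0) + steps)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a   # fitted accumulated series
    x0_hat = np.concatenate([[x0[0]], np.diff(x1_hat)]) # back to the original scale
    return x0_hat[len(x0):]

# Hypothetical history growing by 10% per period
history = [100.0, 110.0, 121.0, 133.1, 146.41]
pred = gm11(history, steps=2)
print(pred)
```

For this smooth series the forecasts stay close to the values that continued 10% growth would give; real data with shocks fits less well.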
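13. Text preprocessing
# Remove duplicate reviews
l1 = len(data)
data = pd.DataFrame(data[0].unique())  # keep the unique values of the review column
l2 = len(data)
data.to_csv(outputfile, index=False, header=False, encoding='utf-8')
print('Removed %s duplicate reviews.' % (l1 - l2))
# Mechanical compression removes consecutively repeated phrases; dropping very short sentences then filters out most junk
# Tokenize the reviews
import jieba
mycut = lambda s: ' '.join(jieba.cut(s))  # simple custom tokenizer
data1 = data1[0].apply(mycut)  # tokenize every review that was read in
data2 = data2[0].apply(mycut)  # apply() runs the tokenizer element-wise, which is faster than an explicit loop
# Split the reviews into positive and negative first, then run LDA topic analysis on each. The sentiment classification is done with ROSTCM6, which produces the positive and negative sets
# Topic analysis of the positive reviews
from gensim import corpora, models
pos_dict = corpora.Dictionary(pos[2])
pos_corpus = [pos_dict.doc2bow(i) for i in pos[2]]
pos_lda = models.LdaModel(pos_corpus, num_topics=3, id2word=pos_dict)
for i in range(3):
    print(pos_lda.print_topic(i))  # print each topic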
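The mechanical compression and short-sentence filtering mentioned above have no code in these notes. The following is a simplified stand-in using a regular expression, not the rule-based procedure from the book; the function names, the length threshold, and the sample reviews are all hypothetical.

```python
import re

def compress_repeats(text):
    """Collapse any substring that is repeated back to back into one copy."""
    return re.sub(r"(.+?)\1+", r"\1", text)

def clean_reviews(reviews, min_len=4):
    """Compress repeats, then drop reviews shorter than min_len characters."""
    compressed = (compress_repeats(r) for r in reviews)
    return [r for r in compressed if len(r) >= min_len]

reviews = [
    "很好很好很好",        # "very good" repeated three times
    "好好好用",            # padded with repeated characters
    "热水器加热速度快",    # "the water heater heats up quickly"
]
print(clean_reviews(reviews))
```

This pattern also collapses legitimate doubled characters, so it is only a rough filter and should be checked against real reviews before use.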
(2) Modeling methods
- Neural networks
- Decision trees
- K-Means
- SVM
1. LM neural network
API:
add(); compile(); fit(); save_weights(); predict_classes() (older Keras) or predict()
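from keras.models import Sequential
from keras.layers import Dense, Activation
net = Sequential()
net.add(Dense(input_dim=3, activation='relu', units=10))  # hidden layer: 3 inputs, 10 units
net.add(Dense(units=1, activation='sigmoid'))  # output layer, assumed here so that binary_crossentropy receives one probability per sample
net.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
net.fit(train[:, :3], train[:, 3], epochs=1000, batch_size=1)  # inputs must be arrays; convert data read from Excel with .to_numpy()
net.save_weights(netfile)
# predict_classes() is unavailable in newer Keras releases; thresholding predict() is the equivalent for a sigmoid output
predict_result = (net.predict(train[:, :3]) > 0.5).astype('int32').reshape(len(train))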
2. CART decision tree
API:
fit(); predict()
# Build the CART decision tree model
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(train[:, :3], train[:, 3])
plt = cm_plot(test[:, 3], tree.predict(test[:, :3]))  # cm_plot is a user-defined helper that plots the confusion matrix
plt.show()
3. K-Means clustering
from sklearn.cluster import KMeans
import pandas as pd
inputFile = '../data/zscoreddata.xls'
data = pd.read_excel(inputFile)
kmodel = KMeans(n_clusters=5)  # the n_jobs argument was removed from KMeans in scikit-learn 1.0
kmodel.fit(data)
print(kmodel.cluster_centers_)  # one row of coordinates per cluster
4. SVM (support vector machine)
from sklearn import svm
smodel = svm.SVC()  # build the model
smodel.fit(x_train, y_train)  # train the model
res = smodel.predict(x_test)  # predict on the test set
(3) Plotting
Once a model is built, visualize the results to judge whether the mining is reasonable and accurate
Confusion matrix: counts of correct and incorrect classifications
ROC curve: performance of a classifier
Cluster profile plot: cluster the data into n groups and analyze the characteristics of each group
Confusion matrix
Prediction accuracy: RMSE; MAE
Classification accuracy: precision = TP / (TP + FP): the probability that the user is interested in a recommended product
recall = TP / (TP + FN): the proportion of the products the user likes that were actually recommended
from sklearn.metrics import confusion_matrix  # import the confusion matrix function
cm = confusion_matrix(y, yp)  # rows are true labels, columns are predicted labels
# For binary labels 0/1, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
# Example counts: TP=46, FP=2, FN=7, TN=4
cm_train = confusion_matrix(train_label, smodel.predict(trainSet))
cm_test = confusion_matrix(test_label, smodel.predict(testSet))
pd.DataFrame(cm_train).to_excel(outFile1)
pd.DataFrame(cm_test).to_excel(outFile2)
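The precision and recall formulas can be checked by hand with the example counts given above (TP=46, FP=2, FN=7, TN=4). The helper name `precision_recall` is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Counts from the example above: TP=46, FP=2, FN=7, TN=4
tp, fp, fn, tn = 46, 2, 7, 4
precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```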
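ROC curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve  # import the ROC curve function
fpr, tpr, thresholds = roc_curve(test[:, 3], tree.predict_proba(test[:, :3])[:, 1], pos_label=1)
plt.plot(fpr, tpr, linewidth=2, label='ROC of CART', color='green')  # draw the ROC curve
plt.legend(loc=4)  # show the label in the lower-right corner
plt.show()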
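To show what the ROC curve measures, the sketch below computes its points by hand with NumPy on four made-up samples. It assumes all scores are distinct and is meant as an illustration of the idea, not a replacement for `roc_curve`; the function name `roc_points` is hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) obtained by sweeping the decision threshold.

    Assumes binary labels 0/1 and distinct scores.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = np.argsort(-scores)           # highest score first
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted == 1)        # true positives at each cut
    fps = np.cumsum(y_sorted == 0)        # false positives at each cut
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

# Hypothetical labels and classifier scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)

# Area under the curve by the trapezoidal rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(fpr, tpr, auc)
```

An area close to 1 means the classifier ranks positives above negatives almost always; 0.5 corresponds to random ranking.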
Cluster profile plot
import matplotlib.pyplot as plt
centers = kmodel.cluster_centers_
for i in range(5):
    plt.plot([2, 4, 6, 8, 10], centers[i], label='group' + str(i), marker='o')  # five x positions, one per feature
plt.ylabel('values')
plt.xlabel('index: L R F M C')
plt.legend()
plt.show()
Hierarchical clustering dendrogram
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
# Use SciPy's hierarchical clustering functions
Z = linkage(data_udf, method='ward', metric='euclidean')  # build the linkage (cluster tree)
P = dendrogram(Z, 0)  # draw the dendrogram
plt.show()