Predicting User Churn with Machine Learning

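I. Background and Goals

User operations are the most important part of CRM work. In the age of artificial intelligence, we can explore using AI to take over some of that work. I have previously written several articles on combining AI techniques with the fast-moving consumer goods (FMCG) industry: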

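1. Analyzing restaurant customers with the RFM model

2. Using the Apriori association algorithm to see what customers like to buy most

3. Forecasting sales with the ARMA algorithm

4. Classifying restaurant customers with deep learning and machine learning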

This time we explore using AI to predict user churn. The overall process is as follows:

[Figure: overall workflow]

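II. Data Collection and Preparation

Data collection is divided into several modules: basic user information, product preferences, channel preferences, LBS information, coupon usage, points, membership tier, and so on.

1. Collecting dimensions:

The data is stored in a database. Based on the information to be collected, we write SQL to build the dimension data, which is essentially the same as extracting a user profile. Using SQL Server as an example, the query below picks each user's favorite time of day, most-used channel, and favorite product. Only part of the SQL is shown as an example; 33 dimensions are collected in total: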

-- Count orders per user by order channel, daypart, and product type
-- (after UNION ALL the value column keeps the name orderchannel for all three tags)
select * into #product_prefer
 from(
  select userid,'orderchannel' as tag,orderchannel,count(1) ordercount
  from #orderdetail
 group by userid,orderchannel
 union all
  select userid,'daypart' as tag,daypart,count(1) ordercount
  from #orderdetail
 group by userid,daypart
 union all
   select userid,'product_type' as tag,product_type,count(1) ordercount
  from #orderdetail
 group by userid,product_type
  )y

 -- Keep each user's most-used order channel, daypart, and product type
 select *
 into  #base_tmp
from
  (select *,ROW_NUMBER() OVER ( PARTITION BY userid,tag order by userid,ordercount DESC ) 'rank' FROM  #product_prefer  ) a
where [rank] = 1
 -- Pivot rows into columns to form the user tags
select userid,max(product_type)product_type,max(daypart)dayparts,max(orderchannel)orderchannel
from(
SELECT *
FROM #base_tmp
PIVOT(max(orderchannel) FOR tag IN([product_type],[daypart],[orderchannel])) AS T
)x
group by userid
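
For readers who prefer to prototype outside the database, the same "most frequent value per user" logic can be sketched in pandas. This is only a minimal sketch on made-up rows: the column names mirror the SQL above, while the sample data and the helper name top_preference are assumptions for illustration.

```python
import pandas as pd

# Hypothetical order-detail rows; column names mirror the SQL above.
orderdetail = pd.DataFrame({
    "userid":       [1, 1, 1, 2, 2],
    "orderchannel": ["app", "app", "store", "store", "store"],
    "daypart":      ["lunch", "dinner", "lunch", "breakfast", "breakfast"],
    "product_type": ["burger", "burger", "drink", "drink", "burger"],
})

def top_preference(df, column):
    """Return each user's most frequent value in `column` (hypothetical helper)."""
    counts = (df.groupby(["userid", column]).size()
                .reset_index(name="ordercount"))
    # Highest count first; ties are broken by the value's sort order
    # so that the result is deterministic.
    counts = counts.sort_values(["userid", "ordercount", column],
                                ascending=[True, False, True])
    return counts.drop_duplicates("userid").set_index("userid")[column]

# One column per tag, one row per user.
user_tags = pd.concat(
    [top_preference(orderdetail, c)
     for c in ["orderchannel", "daypart", "product_type"]],
    axis=1,
)
print(user_tags)
```

Unlike ROW_NUMBER() without a tie-breaker, this sketch resolves ties explicitly, which keeps repeated runs consistent.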

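2. Generating labels

Because we use supervised learning, we need labels. How do we generate them? We can collect the dimension data up to the end of 2018 as training data, then define members who made no purchase in 2019 as churned and members who did as active; that gives us the labels. The 2020 data is then used for validation, which settles both the labeling and the validation questions. The collected data ends up looking roughly like this:

[Figure: sample of the collected data]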
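
The labeling rule described above can be sketched in a few lines of pandas. This is a minimal sketch under assumptions: the orders table, its column names (userid, orderdate), and the sample rows are hypothetical, and only the 2018/2019 windows come from the text.

```python
import pandas as pd

# Hypothetical order history: one row per order.
orders = pd.DataFrame({
    "userid":    [1, 1, 2, 3, 3],
    "orderdate": pd.to_datetime(["2018-03-01", "2019-06-10", "2018-11-20",
                                 "2017-05-05", "2019-01-15"]),
})

cutoff = pd.Timestamp("2019-01-01")     # features use orders before this date
label_end = pd.Timestamp("2020-01-01")  # labels use the year after the cutoff

# Members who already existed in the feature window.
members = orders.loc[orders["orderdate"] < cutoff, "userid"].unique()

# Members who ordered again during the label window count as active.
in_label_window = (orders["orderdate"] >= cutoff) & (orders["orderdate"] < label_end)
active = set(orders.loc[in_label_window, "userid"])

labels = pd.DataFrame({"userid": members})
labels["labels"] = ["active" if u in active else "churned"
                    for u in labels["userid"]]
print(labels)
```

Keeping the feature window strictly before the label window matters: if 2019 orders leaked into the features, the model would be predicting something it had already seen.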

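III. Data Preprocessing

1. Data mapping

Many of our dimensions are text and must be converted to numbers before they can be used for training. Here LabelEncoder does the mapping, turning the original DataFrame with text columns into a purely numeric DataFrame.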

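import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler
from collections import defaultdict

# X holds the feature columns and Y the label column of the collected data
d=defaultdict(LabelEncoder)  # one LabelEncoder per column, keyed by column name
x=X.apply(lambda z:d[z.name].fit_transform(z))
y=LabelEncoder().fit_transform(Y)
mergerdata=pd.concat([x,pd.DataFrame(y,columns=['labels'],index=x.index)],axis=1,ignore_index=False)  # merge features and labels into one DataFrame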
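
One benefit of the defaultdict(LabelEncoder) pattern is that every fitted encoder stays available under its column name. The sketch below, on a made-up two-column frame (X_demo and new_rows are assumptions, not the article's data), shows how the stored encoders can decode the integers again and encode new rows with the same mapping.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical text features standing in for the article's X.
X_demo = pd.DataFrame({
    "sex":      ["F", "M", "F", "M"],
    "dayparts": ["lunch", "dinner", "lunch", "breakfast"],
})

encoders = defaultdict(LabelEncoder)
# Each column gets its own LabelEncoder, stored under the column name.
x_demo = X_demo.apply(lambda col: encoders[col.name].fit_transform(col))

# The stored encoders can map the integer codes back to the original text ...
decoded = pd.DataFrame(
    {c: encoders[c].inverse_transform(x_demo[c]) for c in x_demo.columns}
)

# ... and encode new rows with the SAME mapping (transform, not fit_transform).
new_rows = pd.DataFrame({"sex": ["M"], "dayparts": ["lunch"]})
new_encoded = pd.DataFrame(
    {c: encoders[c].transform(new_rows[c]) for c in new_rows.columns}
)
print(x_demo)
print(new_encoded)
```

LabelEncoder assigns codes in sorted order of the values, so the codes carry no meaning beyond identity; tree-based models tolerate this reasonably well, while linear models usually need one-hot encoding instead.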

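2. Data exploration

With so many features, we want to train on the ones that affect the result most and drop the useless ones. We start by exploring manually.

a. The effect of sex and consumption daypart on customer churn: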

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data=pd.read_csv(r'E:\Python\train.csv')#.fillna(0)
# Data exploration: two count plots side by side
f,axes=plt.subplots(nrows=1,ncols=2,figsize=(10,10))
sns.countplot(x='sex',hue='labels',data=data,ax=axes[0])
axes[0].set_xlabel('sex')  # call the setter; assigning to plt.xlabel would overwrite the function
axes[0].set_title('Distribution by sex')

sns.countplot(x='dayparts',hue='labels',data=data,ax=axes[1])
axes[1].set_xlabel('dayparts')
axes[1].set_title('Distribution by dayparts')
plt.show()

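[Figure: count plots of churned and active users by sex and by daypart]

Sex does not make much difference, while customers who order in the lunch daypart churn more.

b. Correlation between features

A heatmap shows the correlation between features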

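# Label-encode the columns (integer codes, not one-hot)
d=defaultdict(LabelEncoder)
x=X.apply(lambda z:d[z.name].fit_transform(z))
y=LabelEncoder().fit_transform(Y)
# Inspect the correlations
corr=x.corr()
plt.figure(figsize=(20,20))
ax=sns.heatmap(corr,xticklabels=corr.columns,yticklabels=corr.columns,cmap='YlGnBu_r',annot=True) # annot=True prints the value in each cell; 'YlGnBu_r' is the reversed map, so larger values are lighter (use 'YlGnBu' for larger = darker)
plt.show()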

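[Figure: feature correlation heatmap]

The colors in the figure above show how strongly the features are correlated, and the annotated number in each cell gives the exact value. Note that the code uses the reversed colormap 'YlGnBu_r', where lighter cells mark larger values; with plain 'YlGnBu', darker cells would mark larger values.

If you hit this error:

Traceback (most recent call last):

TypeError: '<' not supported between instances of 'str' and 'int'

During handling of the above exception, another exception occurred:

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']

This happens because an object column can hold a mix of str, int, and other types, and the values have to be made uniform first. It can be fixed as follows:

Option 1: X[cat]=X[cat].astype('category') converts the column cat to the category dtype. If the error persists because the underlying values are still mixed, convert that column to str instead, which also works.

Option 2: X=X.apply(lambda z:z.astype(str)) treats every column as str.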
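
The error and the str-based fix can be reproduced on a tiny column. This is a minimal sketch; the Series mixed and its values are made up for illustration, and the exact wording of the error message may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A hypothetical object column that mixes integers and strings.
mixed = pd.Series([1, "store", 2, "app"], dtype=object)

try:
    LabelEncoder().fit_transform(mixed)
    failed = False
except TypeError as exc:
    failed = True
    print("LabelEncoder rejected the mixed column:", exc)

# Casting everything to str gives the encoder one uniform type to sort.
codes = LabelEncoder().fit_transform(mixed.astype(str))
print(codes)
```

After the cast, the integer 1 and the string "1" become the same category, which is usually what is wanted when the mix comes from inconsistent data entry.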

c. Correlation between each feature and user churn:

# Correlation between each feature and user churn
labelrelation=mergerdata.corr()['labels'].sort_values(ascending=False).plot(kind='bar')
plt.show()

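[Figure: bar chart of each feature's correlation with the churn label]

The figure above shows how each feature relates to user churn. The values in the middle are close to 0, meaning those features have little influence on churn and can be dropped.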
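
Reading the bar chart by eye can be turned into a simple rule: keep the features whose absolute correlation with the label exceeds a cut-off. The sketch below uses synthetic data; the feature names, the DataFrame demo, and the cut-off min_abs_corr = 0.2 are assumptions chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
labels = rng.integers(0, 2, size=n)

# Hypothetical encoded features: one tracks the label, one is pure noise.
demo = pd.DataFrame({
    "recent_orders": labels * 2 + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
    "labels": labels,
})

# Correlation of every feature with the label, ranked by absolute value
# so that strong negative correlations are kept as well.
label_corr = demo.corr()["labels"].drop("labels")
ranked = label_corr.abs().sort_values(ascending=False)

min_abs_corr = 0.2  # assumed cut-off; tune it for your data
keep = ranked[ranked > min_abs_corr].index.tolist()
print(ranked)
print("kept features:", keep)
```

Pearson correlation on label-encoded nominal columns is only a rough signal, because the integer codes have no natural order, so this is best treated as a first filter rather than a final decision.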

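3. Feature selection

Above we judged the influence of features on churn by hand or from plots, but sklearn offers other approaches as well, such as PCA (principal component analysis), SelectKBest, and random forests. Here we use a random forest to select features.

Code: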

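from sklearn.ensemble import RandomForestClassifier as RFC

# Select features with a random forest
threshold=0 # importance threshold; set it to suit your needs
rf=RFC(n_estimators=100,random_state=0,n_jobs=-1) # random_state fixes the random seed; n_jobs is the number of CPU cores, -1 uses all of them
rf.fit(x,y)
select_feature=[]
for i in sorted(zip(rf.feature_importances_,x.columns),reverse=True):
    #print(i)
    if i[0]>threshold:
        select_feature.append(i[1])
select_x=x[select_feature] # keep the features above the threshold as the new training data
print(select_x.head())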

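[Figure: features printed in order of importance]

The features are printed in order of importance; once the threshold is set, it determines how many dimensions go into training. PS: this step is very memory-hungry. With 1,000 trees on roughly 10 million rows it used close to 50 GB of RAM. If memory is a problem, rein in the parameters, for example by using fewer trees or a smaller tree depth.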
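
The same importance-threshold idea is also available as a ready-made scikit-learn transformer, SelectFromModel. The sketch below is one possible variant, not the article's code: it uses synthetic data from make_classification, a deliberately small and shallow forest to limit memory, and threshold="mean" as an assumed cut-off.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the encoded feature matrix x and labels y.
x_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

# Fewer, shallower trees keep memory use and runtime modest.
forest = RandomForestClassifier(n_estimators=50, max_depth=5,
                                random_state=0, n_jobs=-1)

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(forest, threshold="mean")
selector.fit(x_demo, y_demo)

mask = selector.get_support()           # True for every kept column
x_selected = selector.transform(x_demo)
print("kept columns:", mask.nonzero()[0], "new shape:", x_selected.shape)
```

Because the selector is a transformer, it can later be placed inside a Pipeline so that the same columns are picked automatically at prediction time.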

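4. Standardizing the data and splitting the training and test sets

The features may differ widely in scale, so we standardize them; StandardScaler does this directly. To split the data, you can use StratifiedShuffleSplit for stratified sampling or simply the usual train_test_split.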

from sklearn.model_selection import train_test_split

select_x=StandardScaler().fit_transform(select_x)  # returns a NumPy array
# Split into training and test sets
# StratifiedShuffleSplit gives stratified sampling (select_x is an array here, so index it directly)
# for train_index, test_index in StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0).split(select_x, y):
#     #print("train:", train_index, "test:", test_index)
#     x_train,x_test=select_x[train_index], select_x[test_index]
#     y_train,y_test=y[train_index], y[test_index]
# Plain random split
x_train, x_test, y_train, y_test = train_test_split(select_x, y, test_size = 0.3, random_state = 0)
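
A common refinement of this step is to split first and fit the scaler on the training part only, so that statistics from the test rows do not leak into training. The sketch below shows that ordering on synthetic, imbalanced data; the data and the names x_tr, x_te, and scaler are assumptions, and stratify is the built-in way to get a stratified split from train_test_split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for select_x and y (about 80% / 20%).
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.8, 0.2], random_state=0,
)

# stratify=y_demo keeps the class ratio the same in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0,
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(x_tr)
x_tr_scaled = scaler.transform(x_tr)
x_te_scaled = scaler.transform(x_te)

print("train churn rate:", y_tr.mean(), "test churn rate:", y_te.mean())
```

The fitted scaler object is also what should be saved and reused on production data later, rather than fitting a fresh one there.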

At this point the training data is ready.

IV. Machine Learning

1. Choosing the parameters

Use grid search to find the best parameters

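from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier as DTC

# Use grid search to find the best parameters
max_depth = [2, 3, 4, 5, 6]
min_samples_split = [2, 4, 6, 8, 10]
min_samples_leaf = [2, 4, 6, 8, 10]
parameters = {'max_depth':max_depth, 'min_samples_split':min_samples_split, 'min_samples_leaf':min_samples_leaf}
grid = GridSearchCV(estimator=DTC(), param_grid=parameters, cv=10)  # 10-fold cross-validation for every combination
grid.fit(x_train, y_train)
print(grid.best_params_)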

[Figure: best parameters printed by the grid search]

These parameters define the shape of our tree.
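
Instead of copying the printed parameters into a new model by hand, the fitted search object can be used directly. The sketch below runs a deliberately small grid on synthetic data (both are assumptions to keep it fast) and reads best_score_ and best_estimator_ from the result.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

x_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.3,
                                          random_state=0)

# A deliberately small grid so the sketch runs quickly.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [2, 6, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(x_tr, y_tr)

print("best params:", grid.best_params_)
print("mean CV accuracy of best params:", round(grid.best_score_, 3))

# With the default refit=True, best_estimator_ has already been
# retrained on all of x_tr using the best parameters.
best_tree = grid.best_estimator_
print("held-out accuracy:", round(best_tree.score(x_te, y_te), 3))
```

Note that the parameters found for a decision tree are not automatically the best ones for GBDT or a random forest; each model type would need its own search.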

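2. Training

Here we train three models: a decision tree, GBDT, and a random forest, and save each model to disk in compressed form. For usage details, see my earlier sklearn introduction, which walks through each of these machine learning methods.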

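import joblib
from sklearn.ensemble import GradientBoostingClassifier as GDBC, RandomForestClassifier as RFC
from sklearn.tree import DecisionTreeClassifier as DTC

# Fit a model, save it, and return its accuracy on the test split
def func(clf,filename):
    clf.fit(x_train, y_train)
    joblib.dump(value=clf,filename=filename,compress=True )  # save the model
    score = clf.score(x_test, y_test)
    return score
print('Decision tree score: {}'.format(func(DTC(min_samples_leaf=2,max_depth=6,min_samples_split=2),'DTC.gz')))
print('GBDT score: {}'.format(func(GDBC(n_estimators=200,max_depth=6,min_samples_leaf=2,min_samples_split=2),'GDBC.gz')))
print('Random forest score: {}'.format(func(RFC(n_estimators=200,max_depth=6,min_samples_leaf=2,min_samples_split=2),'RFC.gz')))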

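[Figure: accuracy scores of the three models]

The accuracy is suspiciously high, so the models are probably overfitting. Next we test them on real production data.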
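
Before moving to production data, a few quick checks can help judge whether a high score is trustworthy: compare training and test accuracy, compare both against a majority-class baseline, and look at recall on the churned class. The sketch below does this on synthetic, imbalanced data; the data and the 90/10 class ratio are assumptions, not the article's figures.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 90% active (0) and 10% churned (1).
x_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0,
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0,
)

clf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
clf.fit(x_tr, y_tr)

train_acc = clf.score(x_tr, y_tr)
test_acc = clf.score(x_te, y_te)
# Accuracy obtained by always predicting the most common class.
baseline_acc = np.bincount(y_te).max() / len(y_te)

y_pred = clf.predict(x_te)
churn_recall = recall_score(y_te, y_pred)  # share of churned users found
cm = confusion_matrix(y_te, y_pred)

print("train accuracy:", round(train_acc, 3))
print("test accuracy:", round(test_acc, 3))
print("majority-class baseline:", round(baseline_acc, 3))
print("recall on churned class:", round(churn_recall, 3))
print(cm)
```

A large gap between training and test accuracy points to overfitting, while a test accuracy barely above the baseline suggests the model has mostly learned the class imbalance.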

3. Predicting users

Load the models saved above, predict directly on production data, and compare against the actual outcomes.

import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler
from collections import defaultdict
import joblib
import pyodbc

# 1. Data preprocessing
sql ='''
select xxxxx from
xxxxx
'''
conn = get_db_connection()  # not defined here: supply your own database connection
print('Database connected')
data=pd.read_sql(sql,conn).fillna(0)
print('Data loaded')
X=data.iloc[:,1:17]
Y=data.iloc[:,17]
# Preprocessing
# Caveat: the encoders and scaler are re-fitted on the production data here,
# so the codes and scaling may not match what the models saw during training
d=defaultdict(LabelEncoder)
X=X.apply(lambda z:z.astype(str)) # avoid the mixed-type error
x=X.apply(lambda z:d[z.name].fit_transform(z))
y=LabelEncoder().fit_transform(Y)
x_test=StandardScaler().fit_transform(x) # standardize the data


# Churn prediction
# Load the random forest model
rfc_model=joblib.load('RFC.gz')
y_predict = rfc_model.predict(x_test)
result=pd.DataFrame({'predict':y_predict,'actual':y})
result['accuracy']=(result['predict']==result['actual'])
c=result['accuracy'].mean()  # share of correct predictions
print('Random forest prediction accuracy: {:.2f}%'.format(c*100))
# Load the decision tree model
dtc_model=joblib.load('DTC.gz')
y_predict = dtc_model.predict(x_test)
result=pd.DataFrame({'predict':y_predict,'actual':y})
result['accuracy']=(result['predict']==result['actual'])
c=result['accuracy'].mean()
print('Decision tree prediction accuracy: {:.2f}%'.format(c*100))
# Load the GBDT model
gbdt_model=joblib.load('GDBC.gz')
y_predict = gbdt_model.predict(x_test)
result=pd.DataFrame({'predict':y_predict,'actual':y})
result['accuracy']=(result['predict']==result['actual'])
c=result['accuracy'].mean()
print('GBDT prediction accuracy: {:.2f}%'.format(c*100))
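
One way to make sure production data is encoded and scaled exactly as in training is to save the preprocessing steps together with the model as a single Pipeline. The sketch below is a swapped-in technique, not the article's code: it uses OrdinalEncoder in place of the per-column LabelEncoder (its handle_unknown option needs scikit-learn 0.24 or later), and the tiny training frame, the file name churn_pipeline.gz, and the temporary directory are all assumptions.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical training data with text features, as in the article.
train = pd.DataFrame({
    "sex":      ["F", "M", "F", "M", "F", "M"],
    "dayparts": ["lunch", "dinner", "lunch", "breakfast", "dinner", "lunch"],
    "labels":   [1, 0, 1, 0, 0, 1],
})
features = ["sex", "dayparts"]

# Encoder, scaler and model are fitted together and saved as one object.
# Categories unseen during training are mapped to -1 instead of raising.
model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0),
)
model.fit(train[features], train["labels"])

path = os.path.join(tempfile.mkdtemp(), "churn_pipeline.gz")
joblib.dump(model, path, compress=True)

# Later, on production data: load and predict without re-fitting anything.
production = pd.DataFrame({"sex": ["F", "M"],
                           "dayparts": ["lunch", "late_night"]})
loaded = joblib.load(path)
predictions = loaded.predict(production[features])
print(predictions)
```

With this layout the prediction script needs only the saved file and the raw feature columns, which removes the risk of the production encoders assigning different integer codes than the training ones.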

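[Figure: prediction accuracy of the three models on production data]

Summary:

The accuracy is not very high, possibly because the collected tags are not comprehensive enough. Another possible factor is that the prediction script re-fits the encoders and the scaler on the production data instead of reusing the ones fitted during training. This was a first attempt at user prediction; next steps are to collect more user tags and tune the parameters to see whether the results become more precise. Once accuracy improves, we can work with the business teams on targeted marketing to prevent user churn.

Full code: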

import pandas as pd
from sklearn.feature_selection import SelectKBest,chi2
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler
from collections import defaultdict
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier as GDBC
from sklearn.tree import DecisionTreeClassifier as DTC
import joblib  # sklearn.externals.joblib has been removed from newer scikit-learn releases
from sklearn.model_selection import StratifiedShuffleSplit,GridSearchCV
import pyodbc

# 1. Data preprocessing
#data=pd.read_sql(sql,conn).fillna(0) # load the data from the database
data=pd.read_csv(r'E:\Python\train.csv').fillna(0) # load the data from a local file
pd.set_option('display.max_columns', None) # show all columns instead of truncating with ellipses
X=data.iloc[:,1:33]
Y=data.iloc[:,33]

X=X.drop(['membername'],axis=1)
# Data exploration: two count plots side by side
f,axes=plt.subplots(nrows=1,ncols=2,figsize=(10,10))
sns.countplot(x='sex',hue='labels',data=data,ax=axes[0])
axes[0].set_xlabel('sex')
axes[0].set_title('Distribution by sex')
sns.countplot(x='dayparts',hue='labels',data=data,ax=axes[1])
axes[1].set_xlabel('dayparts')
axes[1].set_title('Distribution by dayparts')
plt.show()
# Label-encode the columns (integer codes, not one-hot)
d=defaultdict(LabelEncoder)
x=X.apply(lambda z:d[z.name].fit_transform(z))
y=LabelEncoder().fit_transform(Y)
mergerdata=pd.concat([x,pd.DataFrame(y,columns=['labels'],index=x.index)],axis=1,ignore_index=False)  # merge features and labels into one DataFrame

# Inspect the correlations between features
corr=mergerdata.corr()
plt.figure(figsize=(20,20))
ax=sns.heatmap(corr,xticklabels=corr.columns,yticklabels=corr.columns,cmap='YlGnBu_r',annot=True) # annot=True prints the value in each cell; 'YlGnBu_r' is the reversed map, so larger values are lighter
plt.show()
# Correlation between each feature and user churn
labelrelation=mergerdata.corr()['labels'].sort_values(ascending=False).plot(kind='bar')
plt.show()
# Feature selection with PCA
# pca=PCA()
# pca.fit(x)
# print(pca.components_)
# print(pca.explained_variance_ratio_)
# Feature selection with SelectKBest
# test = SelectKBest(score_func=chi2, k=20)
# fit = test.fit(x,y)
# features = fit.transform(x)
# print(features[0:21, :])
# Feature selection with a random forest
threshold=0.001
rf=RFC(n_estimators=100,random_state=0,n_jobs=-1) # random_state fixes the random seed; n_jobs is the number of CPU cores, -1 uses all of them
rf.fit(x,y)
select_feature=[]
for i in sorted(zip(rf.feature_importances_,x.columns),reverse=True):
    print(i)
    if i[0]>threshold:
        select_feature.append(i[1])
# Build a new DataFrame from the features chosen by the random forest
select_x=x[select_feature]
select_x=StandardScaler().fit_transform(select_x) # standardize the data (returns a NumPy array)
# Split into training and test sets
# StratifiedShuffleSplit gives stratified sampling (select_x is an array here, so index it directly)
# for train_index, test_index in StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0).split(select_x, y):
#     #print("train:", train_index, "test:", test_index)
#     x_train,x_test=select_x[train_index], select_x[test_index]
#     y_train,y_test=y[train_index], y[test_index]
x_train, x_test, y_train, y_test = train_test_split(select_x, y, test_size = 0.3, random_state = 0) # plain random split

# 2. Machine learning
# Use grid search to find the best parameters
max_depth = [2, 3, 4, 5, 6]
min_samples_split = [2, 4, 6, 8, 10]
min_samples_leaf = [2, 4, 6, 8, 10]
parameters = {'max_depth':max_depth, 'min_samples_split':min_samples_split, 'min_samples_leaf':min_samples_leaf}
grid = GridSearchCV(estimator=DTC(), param_grid=parameters, cv=10)
grid.fit(x_train, y_train)
print(grid.best_params_)
# Fit a model, save it, and return its accuracy on the test split
def func(clf,filename):
    clf.fit(x_train, y_train)
    joblib.dump(value=clf,filename=filename,compress=True )  # save the model
    score = clf.score(x_test, y_test)
    return score
print('Decision tree score: {}'.format(func(DTC(min_samples_leaf=2,max_depth=6,min_samples_split=2),'DTC.gz')))
print('GBDT score: {}'.format(func(GDBC(n_estimators=200,max_depth=6,min_samples_leaf=2,min_samples_split=2),'GDBC.gz')))
print('Random forest score: {}'.format(func(RFC(n_estimators=200,max_depth=6,min_samples_leaf=2,min_samples_split=2),'RFC.gz')))
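# Churn prediction on the held-out test split
rfc_model=joblib.load('RFC.gz')  # separate name so the RFC class is not overwritten
y_predict = rfc_model.predict(x_test)
result=pd.DataFrame({'predict':y_predict,'actual':y_test})  # compare with y_test, which has the same length as x_test
result['accuracy']=(result['predict']==result['actual'])
c=result['accuracy'].mean()  # share of correct predictions
print('Random forest prediction accuracy: {:.2f}%'.format(c*100))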

