
Machine Learning Day 7: K-means Clustering

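# Accurate customer segmentation is an important basis for deciding how a company allocates its marketing resources. This article applies K-means clustering to a subset of an airline's data to segment its customers, identify the distinct customer groups and find the valuable ones, so that each value tier can be offered personalized service and a matching marketing strategy.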

# coding=utf-8
import pandas as pd
import numpy as np
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
           
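# Read the raw data as UTF-8 (if the file is in ANSI or another encoding, re-save it with a text editor first)
data = pd.read_csv(r'air_data - utf8.csv', encoding='utf-8')
# Inspect the data: transposed summary statistics and the first rows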
explore = data.describe(percentiles = [], include = 'all').T 
data.head()
           
MEMBER_NO FFP_DATE FIRST_FLIGHT_DATE GENDER FFP_TIER WORK_CITY WORK_PROVINCE WORK_COUNTRY AGE LOAD_TIME ... ADD_Point_SUM Eli_Add_Point_Sum L1Y_ELi_Add_Points Points_Sum L1Y_Points_Sum Ration_L1Y_Flight_Count Ration_P1Y_Flight_Count Ration_P1Y_BPS Ration_L1Y_BPS Point_NotFlight
0 54993 2006/11/2 2008/12/24 6 . 北京 CN 31.0 2014/3/31 ... 39992 114452 111100 619760 370211 0.509524 0.490476 0.487221 0.512777 50
1 28065 2007/2/19 2007/8/3 6 NaN 北京 CN 42.0 2014/3/31 ... 12000 53288 53288 415768 238410 0.514286 0.485714 0.489289 0.510708 33
2 55106 2007/2/1 2007/8/30 6 . 北京 CN 40.0 2014/3/31 ... 15491 55202 51711 406361 233798 0.518519 0.481481 0.481467 0.518530 26
3 21189 2008/8/22 2008/8/23 5 Los Angeles CA US 64.0 2014/3/31 ... 34890 34890 372204 186100 0.434783 0.565217 0.551722 0.448275 12
4 39546 2009/4/10 2009/4/15 6 貴陽 貴州 CN 48.0 2014/3/31 ... 22704 64969 64969 338813 210365 0.532895 0.467105 0.469054 0.530943 39

5 rows × 44 columns

explore
           
count unique top freq mean std min 50% max
MEMBER_NO 62988 NaN NaN NaN 31494.5 18183.2 1 31494.5 62988
FFP_DATE 62988 3068 2011/1/13 184 NaN NaN NaN NaN NaN
FIRST_FLIGHT_DATE 62988 3406 2013/2/16 96 NaN NaN NaN NaN NaN
GENDER 62985 2 48134 NaN NaN NaN NaN NaN
FFP_TIER 62988 NaN NaN NaN 4.10216 0.373856 4 4 6
WORK_CITY 60719 3309 廣州 9385 NaN NaN NaN NaN NaN
WORK_PROVINCE 59740 1183 廣東 17507 NaN NaN NaN NaN NaN
WORK_COUNTRY 62962 118 CN 57748 NaN NaN NaN NaN NaN
AGE 62568 NaN NaN NaN 42.4763 9.88591 6 41 110
LOAD_TIME 62988 1 2014/3/31 62988 NaN NaN NaN NaN NaN
FLIGHT_COUNT 62988 NaN NaN NaN 11.8394 14.0495 2 7 213
BP_SUM 62988 NaN NaN NaN 10925.1 16339.5 5700 505308
EP_SUM_YR_1 62988 NaN NaN NaN
EP_SUM_YR_2 62988 NaN NaN NaN 265.69 1645.7 74460
SUM_YR_1 62437 NaN NaN NaN 5355.38 8109.45 2800 239560
SUM_YR_2 62850 NaN NaN NaN 5604.03 8703.36 2773 234188
SEG_KM_SUM 62988 NaN NaN NaN 17123.9 20960.8 368 9994 580717
WEIGHTED_SEG_KM 62988 NaN NaN NaN 12777.2 17578.6 6978.26 558440
LAST_FLIGHT_DATE 62988 731 2014/3/31 959 NaN NaN NaN NaN NaN
AVG_FLIGHT_COUNT 62988 NaN NaN NaN 1.54215 1.787 0.25 0.875 26.625
AVG_BP_SUM 62988 NaN NaN NaN 1421.44 2083.12 752.375 63163.5
BEGIN_TO_FIRST 62988 NaN NaN NaN 120.145 159.573 50 729
LAST_TO_END 62988 NaN NaN NaN 176.12 183.822 1 108 731
AVG_INTERVAL 62988 NaN NaN NaN 67.7498 77.5179 44.6667 728
MAX_INTERVAL 62988 NaN NaN NaN 166.034 123.397 143 728
ADD_POINTS_SUM_YR_1 62988 NaN NaN NaN 540.317 3956.08 600000
ADD_POINTS_SUM_YR_2 62988 NaN NaN NaN 814.689 5121.8 728282
EXCHANGE_COUNT 62988 NaN NaN NaN 0.319775 1.136 46
avg_discount 62988 NaN NaN NaN 0.721558 0.185427 0.711856 1.5
P1Y_Flight_Count 62988 NaN NaN NaN 5.76626 7.21092 3 118
L1Y_Flight_Count 62988 NaN NaN NaN 6.07316 8.17513 3 111
P1Y_BP_SUM 62988 NaN NaN NaN 5366.72 8537.77 2692 246197
L1Y_BP_SUM 62988 NaN NaN NaN 5558.36 9351.96 2547 259111
EP_SUM 62988 NaN NaN NaN 265.69 1645.7 74460
ADD_Point_SUM 62988 NaN NaN NaN 1355.01 7868.48 984938
Eli_Add_Point_Sum 62988 NaN NaN NaN 1620.7 8294.4 984938
L1Y_ELi_Add_Points 62988 NaN NaN NaN 1080.38 5639.86 728282
Points_Sum 62988 NaN NaN NaN 12545.8 20507.8 6328.5 985572
L1Y_Points_Sum 62988 NaN NaN NaN 6638.74 12601.8 2860.5 728282
Ration_L1Y_Flight_Count 62988 NaN NaN NaN 0.486419 0.319105 0.5 1
Ration_P1Y_Flight_Count 62988 NaN NaN NaN 0.513581 0.319105 0.5 1
Ration_P1Y_BPS 62988 NaN NaN NaN 0.522293 0.339632 0.514252 0.999989
Ration_L1Y_BPS 62988 NaN NaN NaN 0.468422 0.338956 0.476747 0.999993
Point_NotFlight 62988 NaN NaN NaN 2.72815 7.36416 140
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62988 entries, 0 to 62987
Data columns (total 44 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   MEMBER_NO                62988 non-null  int64  
 1   FFP_DATE                 62988 non-null  object 
 2   FIRST_FLIGHT_DATE        62988 non-null  object 
 3   GENDER                   62985 non-null  object 
 4   FFP_TIER                 62988 non-null  int64  
 5   WORK_CITY                60719 non-null  object 
 6   WORK_PROVINCE            59740 non-null  object 
 7   WORK_COUNTRY             62962 non-null  object 
 8   AGE                      62568 non-null  float64
 9   LOAD_TIME                62988 non-null  object 
 10  FLIGHT_COUNT             62988 non-null  int64  
 11  BP_SUM                   62988 non-null  int64  
 12  EP_SUM_YR_1              62988 non-null  int64  
 13  EP_SUM_YR_2              62988 non-null  int64  
 14  SUM_YR_1                 62437 non-null  float64
 15  SUM_YR_2                 62850 non-null  float64
 16  SEG_KM_SUM               62988 non-null  int64  
 17  WEIGHTED_SEG_KM          62988 non-null  float64
 18  LAST_FLIGHT_DATE         62988 non-null  object 
 19  AVG_FLIGHT_COUNT         62988 non-null  float64
 20  AVG_BP_SUM               62988 non-null  float64
 21  BEGIN_TO_FIRST           62988 non-null  int64  
 22  LAST_TO_END              62988 non-null  int64  
 23  AVG_INTERVAL             62988 non-null  float64
 24  MAX_INTERVAL             62988 non-null  int64  
 25  ADD_POINTS_SUM_YR_1      62988 non-null  int64  
 26  ADD_POINTS_SUM_YR_2      62988 non-null  int64  
 27  EXCHANGE_COUNT           62988 non-null  int64  
 28  avg_discount             62988 non-null  float64
 29  P1Y_Flight_Count         62988 non-null  int64  
 30  L1Y_Flight_Count         62988 non-null  int64  
 31  P1Y_BP_SUM               62988 non-null  int64  
 32  L1Y_BP_SUM               62988 non-null  int64  
 33  EP_SUM                   62988 non-null  int64  
 34  ADD_Point_SUM            62988 non-null  int64  
 35  Eli_Add_Point_Sum        62988 non-null  int64  
 36  L1Y_ELi_Add_Points       62988 non-null  int64  
 37  Points_Sum               62988 non-null  int64  
 38  L1Y_Points_Sum           62988 non-null  int64  
 39  Ration_L1Y_Flight_Count  62988 non-null  float64
 40  Ration_P1Y_Flight_Count  62988 non-null  float64
 41  Ration_P1Y_BPS           62988 non-null  float64
 42  Ration_L1Y_BPS           62988 non-null  float64
 43  Point_NotFlight          62988 non-null  int64  
dtypes: float64(12), int64(24), object(8)
memory usage: 21.1+ MB
           
# Remove duplicate rows
data.drop_duplicates(inplace=True)
data.info()
           
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62988 entries, 0 to 62987
Data columns (total 44 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   MEMBER_NO                62988 non-null  int64  
 1   FFP_DATE                 62988 non-null  object 
 2   FIRST_FLIGHT_DATE        62988 non-null  object 
 3   GENDER                   62985 non-null  object 
 4   FFP_TIER                 62988 non-null  int64  
 5   WORK_CITY                60719 non-null  object 
 6   WORK_PROVINCE            59740 non-null  object 
 7   WORK_COUNTRY             62962 non-null  object 
 8   AGE                      62568 non-null  float64
 9   LOAD_TIME                62988 non-null  object 
 10  FLIGHT_COUNT             62988 non-null  int64  
 11  BP_SUM                   62988 non-null  int64  
 12  EP_SUM_YR_1              62988 non-null  int64  
 13  EP_SUM_YR_2              62988 non-null  int64  
 14  SUM_YR_1                 62437 non-null  float64
 15  SUM_YR_2                 62850 non-null  float64
 16  SEG_KM_SUM               62988 non-null  int64  
 17  WEIGHTED_SEG_KM          62988 non-null  float64
 18  LAST_FLIGHT_DATE         62988 non-null  object 
 19  AVG_FLIGHT_COUNT         62988 non-null  float64
 20  AVG_BP_SUM               62988 non-null  float64
 21  BEGIN_TO_FIRST           62988 non-null  int64  
 22  LAST_TO_END              62988 non-null  int64  
 23  AVG_INTERVAL             62988 non-null  float64
 24  MAX_INTERVAL             62988 non-null  int64  
 25  ADD_POINTS_SUM_YR_1      62988 non-null  int64  
 26  ADD_POINTS_SUM_YR_2      62988 non-null  int64  
 27  EXCHANGE_COUNT           62988 non-null  int64  
 28  avg_discount             62988 non-null  float64
 29  P1Y_Flight_Count         62988 non-null  int64  
 30  L1Y_Flight_Count         62988 non-null  int64  
 31  P1Y_BP_SUM               62988 non-null  int64  
 32  L1Y_BP_SUM               62988 non-null  int64  
 33  EP_SUM                   62988 non-null  int64  
 34  ADD_Point_SUM            62988 non-null  int64  
 35  Eli_Add_Point_Sum        62988 non-null  int64  
 36  L1Y_ELi_Add_Points       62988 non-null  int64  
 37  Points_Sum               62988 non-null  int64  
 38  L1Y_Points_Sum           62988 non-null  int64  
 39  Ration_L1Y_Flight_Count  62988 non-null  float64
 40  Ration_P1Y_Flight_Count  62988 non-null  float64
 41  Ration_P1Y_BPS           62988 non-null  float64
 42  Ration_L1Y_BPS           62988 non-null  float64
 43  Point_NotFlight          62988 non-null  int64  
dtypes: float64(12), int64(24), object(8)
memory usage: 21.6+ MB


MEMBER_NO                Membership card number
FFP_DATE                 Enrollment date
FIRST_FLIGHT_DATE        Date of first flight
GENDER                   Gender
FFP_TIER                 Membership tier
WORK_CITY                City
WORK_PROVINCE            Province
WORK_COUNTRY             Country
AGE                      Age
LOAD_TIME                End of the observation window
FLIGHT_COUNT             Number of flights in the observation window
BP_SUM                   Total base points
EP_SUM_YR_1
EP_SUM_YR_2
SUM_YR_1                 Total fare in the first year
SUM_YR_2                 Total fare in the second year
SEG_KM_SUM               Total kilometres flown in the observation window
WEIGHTED_SEG_KM
LAST_FLIGHT_DATE
AVG_FLIGHT_COUNT         Average number of flights
AVG_BP_SUM
BEGIN_TO_FIRST
LAST_TO_END
AVG_INTERVAL             Average interval between flights
MAX_INTERVAL             Maximum interval between flights
ADD_POINTS_SUM_YR_1
ADD_POINTS_SUM_YR_2
EXCHANGE_COUNT
avg_discount             Average discount rate
P1Y_Flight_Count
L1Y_Flight_Count
P1Y_BP_SUM
L1Y_BP_SUM
EP_SUM
ADD_Point_SUM
Eli_Add_Point_Sum
L1Y_ELi_Add_Points
Points_Sum
L1Y_Points_Sum
Ration_L1Y_Flight_Count
Ration_P1Y_Flight_Count
Ration_P1Y_BPS
Ration_L1Y_BPS
Point_NotFlight          Number of point changes not related to flights
           
男    48134
女    14851
Name: GENDER, dtype: int64
           
0        男
1        男
2        男
3        男
4        男
        ..
62983    女
62984    男
62985    女
62986    女
62987    女
Name: GENDER, Length: 62988, dtype: object
           

“FFP_DATE”, “LOAD_TIME”, “FLIGHT_COUNT”, “SUM_YR_1”, “SUM_YR_2”, “SEG_KM_SUM”, “AVG_INTERVAL” , “MAX_INTERVAL”, “avg_discount”

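FFP_DATE: enrollment date

LOAD_TIME: end of the observation window

FLIGHT_COUNT: number of flights in the observation window

SUM_YR_1: total fare in the first year

SUM_YR_2: total fare in the second year

SEG_KM_SUM: total kilometres flown in the observation window

AVG_INTERVAL: average interval between flights

MAX_INTERVAL: maximum interval between flights

avg_discount: average discount rate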

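The first-year fare, second-year fare and total kilometres flown in the observation window are selected in order to compute the average fare per kilometre. For an airline, a higher fare or a longer distance does not by itself mean more profit; on the contrary, customers who fly short distances in premium cabins generate the larger return.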
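To make the fare-per-kilometre argument concrete, here is a minimal sketch with two made-up customers. The customer names and all figures are invented for illustration and do not come from the dataset; only the column names SUM_YR_1, SUM_YR_2 and SEG_KM_SUM are taken from it.

```python
# Hypothetical customers; every number here is invented for illustration.
customers = {
    "short_haul_premium": {"SUM_YR_1": 9000.0, "SUM_YR_2": 9000.0, "SEG_KM_SUM": 12000},
    "long_haul_discount": {"SUM_YR_1": 9000.0, "SUM_YR_2": 9000.0, "SEG_KM_SUM": 60000},
}


def fare_per_km(c):
    """Total fare over both years divided by total kilometres flown."""
    return (c["SUM_YR_1"] + c["SUM_YR_2"]) / c["SEG_KM_SUM"]


rates = {name: fare_per_km(c) for name, c in customers.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} per km")
```

Both customers pay the same total fare, but the first pays five times as much per kilometre, which is the kind of difference this derived feature is meant to expose.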

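Total kilometres flown and the number of flights are, of course, also important indicators of a customer's value.

The time since enrollment shows whether a customer is a long-standing one and how loyal they are.

The average interval between flights and the maximum interval within the observation window indicate whether the customer flies at a regular frequency.

The average discount rate reflects the return a customer brings to the company; after all, the more valuable the customer, the higher the discount rate they enjoy.

The six derived features used for clustering are: “入會時間” (membership duration), “飛行次數” (flight count), “平均每公裡票價” (average fare per kilometre), “總裡程” (total mileage), “時間間隔內插補點” (interval difference, i.e. maximum interval minus average interval), “平均折扣率” (average discount rate)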

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62988 entries, 0 to 62987
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   FFP_DATE      62988 non-null  object 
 1   LOAD_TIME     62988 non-null  object 
 2   FLIGHT_COUNT  62988 non-null  int64  
 3   SUM_YR_1      62437 non-null  float64
 4   SUM_YR_2      62850 non-null  float64
 5   SEG_KM_SUM    62988 non-null  int64  
 6   AVG_INTERVAL  62988 non-null  float64
 7   MAX_INTERVAL  62988 non-null  int64  
 8   avg_discount  62988 non-null  float64
dtypes: float64(4), int64(3), object(2)
memory usage: 4.8+ MB
           
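# Fill missing fares with the column mean (assign back instead of calling inplace on a column selection)
filter_data['SUM_YR_1'] = filter_data['SUM_YR_1'].fillna(filter_data['SUM_YR_1'].mean())
filter_data['SUM_YR_2'] = filter_data['SUM_YR_2'].fillna(filter_data['SUM_YR_2'].mean())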
           
filter_data.describe([.02,.10,.25,.5,.75,.90,.99]).T
           
count mean std min 2% 10% 25% 50% 75% 90% 99% max
FLIGHT_COUNT 62988.0 11.839414 14.049471 2.0 2.0000 2.000000 3.000000 7.000000 15.000000 27.00 69.00 213.0
SUM_YR_1 62988.0 5355.376064 8073.902161 0.0 0.0000 0.000000 1020.000000 2844.000000 6524.250000 12939.00 37858.47 239560.0
SUM_YR_2 62988.0 5604.026014 8693.824796 0.0 0.0000 0.000000 785.000000 2784.000000 6826.250000 14065.90 41179.73 234188.0
SEG_KM_SUM 62988.0 17123.878691 20960.844623 368.0 1475.7400 2727.000000 4747.000000 9994.000000 21271.250000 39729.60 100841.28 580717.0
AVG_INTERVAL 62988.0 67.749788 77.517866 0.0 2.0000 9.729730 23.370370 44.666667 82.000000 146.00 412.00 728.0
MAX_INTERVAL 62988.0 166.033895 123.397180 0.0 2.0000 18.000000 79.000000 143.000000 228.000000 339.00 551.00 728.0
avg_discount 62988.0 0.721558 0.185427 0.0 0.3775 0.508989 0.611997 0.711856 0.809476 0.92 1.41 1.5
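# Parse the dates and derive the clustering features
data["LOAD_TIME"] = pd.to_datetime(data["LOAD_TIME"])
data["FFP_DATE"] = pd.to_datetime(data["FFP_DATE"])
# Membership duration: end of the observation window minus the enrollment date
data["入會時間"] = data["LOAD_TIME"] - data["FFP_DATE"]
# Average fare per kilometre; missing yearly fares are mean-filled first so the ratio contains no NaN
fare_1 = data["SUM_YR_1"].fillna(data["SUM_YR_1"].mean())
fare_2 = data["SUM_YR_2"].fillna(data["SUM_YR_2"].mean())
data["平均每公裡票價"] = (fare_1 + fare_2) / data["SEG_KM_SUM"]
# Interval difference: longest gap between flights minus the average gap
data["時間間隔內插補點"] = data["MAX_INTERVAL"] - data["AVG_INTERVAL"]
# Rename the remaining raw columns and keep only the six features (copy to avoid chained-assignment issues later)
deal_data = data.rename(
    columns = {"FLIGHT_COUNT" : "飛行次數", "SEG_KM_SUM" : "總裡程", "avg_discount" : "平均折扣率"},
    inplace = False
)
filter_data = deal_data[["入會時間", "飛行次數", "平均每公裡票價", "總裡程", "時間間隔內插補點", "平均折扣率"]].copy()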
           
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62988 entries, 0 to 62987
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype          
---  ------   --------------  -----          
 0   入會時間     62988 non-null  timedelta64[ns]
 1   飛行次數     62988 non-null  int64          
 2   平均每公裡票價  62988 non-null  float64        
 3   總裡程      62988 non-null  int64          
 4   時間間隔內插補點   62988 non-null  float64        
 5   平均折扣率    62988 non-null  float64        
dtypes: float64(3), int64(2), timedelta64[ns](1)
memory usage: 3.4 MB
           
filter_data['入會時間'].dt.days
           
0        2706
1        2597
2        2615
3        2047
4        1816
         ... 
62983    1046
62984    1484
62985    2923
62986     418
62987     407
Name: 入會時間, Length: 62988, dtype: int64
           
# Convert the membership duration from a timedelta to a whole number of days
#filter_data['入會時間'] = filter_data['入會時間']/(60*60*24*10**9)
filter_data['入會時間']=filter_data['入會時間'].dt.days
           
from sklearn.preprocessing import StandardScaler
           
standard = StandardScaler()
standard.fit(filter_data)
           
StandardScaler()
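The later cells use a variable named `S_data` whose definition does not appear in the post. A plausible reconstruction, offered only as an assumption, is to apply the fitted scaler and wrap the result in a DataFrame that keeps the original column names, which would be consistent with the all-float64, RangeIndex `info()` output that follows. The toy frame below (invented values, hypothetical English column names) stands in for `filter_data`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for filter_data; values and column names are invented.
toy = pd.DataFrame({
    "days_as_member": [2706, 2597, 418, 407],
    "flight_count": [210, 140, 2, 3],
    "total_km": [580717, 293678, 368, 1062],
})

scaler = StandardScaler()
scaler.fit(toy)

# transform() returns a NumPy array, so rebuild a DataFrame to keep the column names.
S_toy = pd.DataFrame(scaler.transform(toy), columns=toy.columns)

print(S_toy.mean().round(6))       # each column is centred on 0
print(S_toy.std(ddof=0).round(6))  # and has unit population standard deviation
```

In the post's setting the equivalent call would presumably be `S_data = pd.DataFrame(standard.transform(filter_data), columns=filter_data.columns)`; treat that line as a guess rather than the author's exact code.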
           
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62988 entries, 0 to 62987
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   入會時間     62988 non-null  float64
 1   飛行次數     62988 non-null  float64
 2   平均每公裡票價  62988 non-null  float64
 3   總裡程      62988 non-null  float64
 4   時間間隔內插補點   62988 non-null  float64
 5   平均折扣率    62988 non-null  float64
dtypes: float64(6)
memory usage: 2.9 MB
           
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score    # returns the mean silhouette coefficient over all samples
from sklearn.metrics import silhouette_samples  # returns the silhouette coefficient of each individual sample
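The relationship between the two silhouette functions can be checked directly. The sketch below uses two synthetic blobs (invented data, not the airline set) and shows that `silhouette_score` is the mean of the per-sample values returned by `silhouette_samples`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Two well-separated synthetic blobs; the data is generated, not taken from the post.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

per_sample = silhouette_samples(X, labels)  # one coefficient per sample, each in [-1, 1]
overall = silhouette_score(X, labels)       # a single number for the whole clustering

print(per_sample.shape)
print(np.isclose(overall, per_sample.mean()))
```

The per-sample values are useful when you want to see which points sit poorly in their cluster; the single score is what the loop over k records.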
           
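inertia = []
silhouette = []
for i in range(2,10):
    # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
    cluster = KMeans(n_clusters=i, n_init=10, random_state=0).fit(S_data)
    # Within-cluster sum of squared distances (SSE) for this k
    inertia.append(cluster.inertia_)
    # Mean silhouette coefficient for this k
    silhouette.append(silhouette_score(S_data,cluster.labels_))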
           
print(inertia)
print(silhouette)
           
[300992.94143881754, 249961.1287967345, 212929.66150454507, 187429.24421259848, 170776.80489673465, 154981.14913712352, 145834.9083294653, 138235.00566447436]
[0.3592867371629582, 0.21454689862059148, 0.20674237627094663, 0.2213072501911795, 0.2103711574222313, 0.21673681639729542, 0.20137242231980962, 0.2065838067406841]
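One rough way to read these two lists, sketched here as an aid rather than as the author's procedure, is to look at how much SSE each extra cluster removes and at which k the mean silhouette peaks. The numbers are the printed values above, rounded.

```python
# Values copied from the printed output, rounded to two and four decimals.
ks = list(range(2, 10))
inertia = [300992.94, 249961.13, 212929.66, 187429.24,
           170776.80, 154981.15, 145834.91, 138235.01]
silhouette = [0.3593, 0.2145, 0.2067, 0.2213, 0.2104, 0.2167, 0.2014, 0.2066]

# Drop in SSE gained by going from k-1 to k clusters.
drops = {k: round(prev - cur, 2) for k, prev, cur in zip(ks[1:], inertia, inertia[1:])}

# k with the highest mean silhouette coefficient.
best_k = max(zip(silhouette, ks))[1]

print(drops)
print(best_k)  # 2
```

The silhouette peaks at k = 2, while the SSE gain shrinks with every added cluster; beyond k = 2 the silhouette values stay within a narrow band around 0.2, so the choice among larger k rests more on how interpretable the resulting groups are.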
           
from matplotlib import pyplot as plt
           
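# Plot SSE (inertia) against k to look for a suitable number of clusters
# Use the ggplot style; apply it first so the font settings below are not overridden by it
plt.style.use('ggplot')
# Render Chinese characters and minus signs correctly
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['font.size'] = 12.0
plt.rcParams['axes.unicode_minus'] = False

fig=plt.figure(figsize=(10, 8))
ax=fig.add_subplot(1,1,1)
ax.plot(range(2,10),inertia,marker="+")
ax.set_xlabel("n_clusters", fontsize=18)
ax.set_ylabel("inertia (SSE)", fontsize=18)

fig.suptitle("KMeans", fontsize=20)
plt.show()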
           


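# Plot the mean silhouette coefficient against k to look for a suitable number of clusters
# Use the ggplot style; apply it first so the font settings below are not overridden by it
plt.style.use('ggplot')
# Render Chinese characters and minus signs correctly
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['font.size'] = 12.0
plt.rcParams['axes.unicode_minus'] = False

fig=plt.figure(figsize=(10, 8))
ax=fig.add_subplot(1,1,1)
ax.plot(range(2,10),silhouette,marker="+")
ax.set_xlabel("n_clusters", fontsize=18)
ax.set_ylabel("silhouette score", fontsize=18)

fig.suptitle("KMeans", fontsize=20)
plt.show()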
           


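for k in range(4,9,2):
    # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
    kmodel = KMeans(n_clusters=k, n_init=10)
    kmodel.fit(S_data)
    # Summarize the result
    r1 = pd.Series(kmodel.labels_).value_counts() # number of samples in each cluster
    r2 = pd.DataFrame(kmodel.cluster_centers_) # cluster centres
    # Largest and smallest coordinate over all cluster centres (used for the radial range);
    # named so that the built-ins max and min are not shadowed
    center_max = r2.values.max()
    center_min = r2.values.min()
    r = pd.concat([r2, r1], axis = 1) # join column-wise (0 would stack rows): each centre next to its cluster size
    r.columns = list(S_data.columns) + [u'類别數目'] # rename the columns; the last one holds the cluster size

    # Plot
    fig=plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, polar=True)
    center_num = r.values
    feature = ["入會時間", "飛行次數", "平均每公裡票價", "總裡程", "時間間隔內插補點", "平均折扣率"]
    N =len(feature)
    # Angles that split the circle evenly, one per feature
    angles=np.linspace(0, 2*np.pi, N, endpoint=False)
    # Repeat the first angle so each radar polygon closes
    closed_angles=np.concatenate((angles,[angles[0]]))
    for j, v in enumerate(center_num):
        # Drop the cluster size and repeat the first value to close the polygon
        center = np.concatenate((v[:-1],[v[0]]))
        # Draw the outline
        ax.plot(closed_angles, center, 'o-', linewidth=2, label = "Cluster %d, %d customers"% (j+1,v[-1]))
        # Fill the polygon
        ax.fill(closed_angles, center, alpha=0.25)
    # Label each axis with its feature name
    ax.set_thetagrids(angles * 180/np.pi, feature, fontsize=15)
    # Set the radial range of the radar chart
    ax.set_ylim(center_min-0.1, center_max+0.1)
    # Add the title
    plt.title('Customer segment feature profile', fontsize=20)
    # Add grid lines
    ax.grid(True)
    # Add the legend
    plt.legend(loc='upper right', bbox_to_anchor=(1.3,1.0),ncol=1,fancybox=True,shadow=True)

    # Show the figure
    plt.show()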
           


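Cluster 1 (9,991 customers): the defining trait is the largest interval difference. These are probably "seasonal customers" who fly many times during one part of the year and rarely travel otherwise. They should be retained and, on that basis, developed to some extent.

Cluster 2 (3,157 customers): both the average fare per kilometre and the average discount rate are the highest, so these are most likely affluent business travellers in premium cabins. They are a priority for both retention and development, and targeted offers should be used to increase how often they fly.

Cluster 3 (16,245 customers): membership duration is short, while the fare per kilometre and average discount rate are fairly high. These are new customers.

Cluster 4 (5,221 customers): total mileage and flight count are the highest, and the average fare per kilometre is also fairly high. These are key customers to retain.

Cluster 5 (14,357 customers): the defining trait is a long membership duration. As long-standing customers they would be expected to have a high average discount rate, yet within the observation window it is low, and neither total mileage nor flight count is high. These are probably customers who have churned.

Cluster 6 (14,027 customers): every metric is fairly low. These are ordinary or low-value customers.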
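Cluster centres computed on standardized data are in z-score units, which suits the radar charts but is harder to read as business numbers. One possible follow-up, sketched here on invented toy data with hypothetical column names, is to attach the labels to the unscaled features and average them per cluster.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented toy features standing in for the unscaled filter_data.
toy = pd.DataFrame({
    "flight_count": [2, 3, 4, 60, 70, 80],
    "total_km": [1500, 2000, 2500, 90000, 100000, 110000],
})

# Cluster on standardized values, as in the post.
scaled = StandardScaler().fit_transform(toy)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Per-cluster feature means in the original units, plus the cluster size.
grouped = toy.assign(cluster=labels).groupby("cluster")
summary = grouped.mean()
summary["size"] = grouped.size()

print(summary)
```

Reading the means in days, flights and kilometres makes it easier to check descriptions such as "highest mileage" or "short membership" against actual magnitudes.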