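# Accurate customer segmentation is an important basis for optimizing a company's marketing resources. This article applies K-means clustering to a sample of airline data to segment the airline's customers, identify distinct customer groups, find the valuable ones, and then offer personalized service and set marketing policies suited to each value tier.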
# coding=utf-8
import pandas as pd
import numpy as np
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
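# Read the raw data as UTF-8 (the pandas default); if the source file uses another encoding such as ANSI, convert it with a text editor first
data = pd.read_csv(r'air_data - utf8.csv')
# Inspect the data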
explore = data.describe(percentiles = [], include = 'all').T
data.head()
index | MEMBER_NO | FFP_DATE | FIRST_FLIGHT_DATE | GENDER | FFP_TIER | WORK_CITY | WORK_PROVINCE | WORK_COUNTRY | AGE | LOAD_TIME | ... | ADD_Point_SUM | Eli_Add_Point_Sum | L1Y_ELi_Add_Points | Points_Sum | L1Y_Points_Sum | Ration_L1Y_Flight_Count | Ration_P1Y_Flight_Count | Ration_P1Y_BPS | Ration_L1Y_BPS | Point_NotFlight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 54993 | 2006/11/2 | 2008/12/24 | 男 | 6 | . | 北京 | CN | 31.0 | 2014/3/31 | ... | 39992 | 114452 | 111100 | 619760 | 370211 | 0.509524 | 0.490476 | 0.487221 | 0.512777 | 50 |
1 | 28065 | 2007/2/19 | 2007/8/3 | 男 | 6 | NaN | 北京 | CN | 42.0 | 2014/3/31 | ... | 12000 | 53288 | 53288 | 415768 | 238410 | 0.514286 | 0.485714 | 0.489289 | 0.510708 | 33 |
2 | 55106 | 2007/2/1 | 2007/8/30 | 男 | 6 | . | 北京 | CN | 40.0 | 2014/3/31 | ... | 15491 | 55202 | 51711 | 406361 | 233798 | 0.518519 | 0.481481 | 0.481467 | 0.518530 | 26 |
3 | 21189 | 2008/8/22 | 2008/8/23 | 男 | 5 | Los Angeles | CA | US | 64.0 | 2014/3/31 | ... | 34890 | 34890 | 372204 | 186100 | 0.434783 | 0.565217 | 0.551722 | 0.448275 | 12 | |
4 | 39546 | 2009/4/10 | 2009/4/15 | 男 | 6 | 貴陽 | 貴州 | CN | 48.0 | 2014/3/31 | ... | 22704 | 64969 | 64969 | 338813 | 210365 | 0.532895 | 0.467105 | 0.469054 | 0.530943 | 39 |
5 rows × 44 columns
explore
column | count | unique | top | freq | mean | std | min | 50% | max |
---|---|---|---|---|---|---|---|---|---|
MEMBER_NO | 62988 | NaN | NaN | NaN | 31494.5 | 18183.2 | 1 | 31494.5 | 62988 |
FFP_DATE | 62988 | 3068 | 2011/1/13 | 184 | NaN | NaN | NaN | NaN | NaN |
FIRST_FLIGHT_DATE | 62988 | 3406 | 2013/2/16 | 96 | NaN | NaN | NaN | NaN | NaN |
GENDER | 62985 | 2 | 男 | 48134 | NaN | NaN | NaN | NaN | NaN |
FFP_TIER | 62988 | NaN | NaN | NaN | 4.10216 | 0.373856 | 4 | 4 | 6 |
WORK_CITY | 60719 | 3309 | 廣州 | 9385 | NaN | NaN | NaN | NaN | NaN |
WORK_PROVINCE | 59740 | 1183 | 廣東 | 17507 | NaN | NaN | NaN | NaN | NaN |
WORK_COUNTRY | 62962 | 118 | CN | 57748 | NaN | NaN | NaN | NaN | NaN |
AGE | 62568 | NaN | NaN | NaN | 42.4763 | 9.88591 | 6 | 41 | 110 |
LOAD_TIME | 62988 | 1 | 2014/3/31 | 62988 | NaN | NaN | NaN | NaN | NaN |
FLIGHT_COUNT | 62988 | NaN | NaN | NaN | 11.8394 | 14.0495 | 2 | 7 | 213 |
BP_SUM | 62988 | NaN | NaN | NaN | 10925.1 | 16339.5 | 0 | 5700 | 505308 |
EP_SUM_YR_1 | 62988 | NaN | NaN | NaN | 0 | 0 | 0 | 0 | 0 |
EP_SUM_YR_2 | 62988 | NaN | NaN | NaN | 265.69 | 1645.7 | 0 | 0 | 74460 |
SUM_YR_1 | 62437 | NaN | NaN | NaN | 5355.38 | 8109.45 | 0 | 2800 | 239560 |
SUM_YR_2 | 62850 | NaN | NaN | NaN | 5604.03 | 8703.36 | 0 | 2773 | 234188 |
SEG_KM_SUM | 62988 | NaN | NaN | NaN | 17123.9 | 20960.8 | 368 | 9994 | 580717 |
WEIGHTED_SEG_KM | 62988 | NaN | NaN | NaN | 12777.2 | 17578.6 | 0 | 6978.26 | 558440 |
LAST_FLIGHT_DATE | 62988 | 731 | 2014/3/31 | 959 | NaN | NaN | NaN | NaN | NaN |
AVG_FLIGHT_COUNT | 62988 | NaN | NaN | NaN | 1.54215 | 1.787 | 0.25 | 0.875 | 26.625 |
AVG_BP_SUM | 62988 | NaN | NaN | NaN | 1421.44 | 2083.12 | 0 | 752.375 | 63163.5 |
BEGIN_TO_FIRST | 62988 | NaN | NaN | NaN | 120.145 | 159.573 | 0 | 50 | 729 |
LAST_TO_END | 62988 | NaN | NaN | NaN | 176.12 | 183.822 | 1 | 108 | 731 |
AVG_INTERVAL | 62988 | NaN | NaN | NaN | 67.7498 | 77.5179 | 0 | 44.6667 | 728 |
MAX_INTERVAL | 62988 | NaN | NaN | NaN | 166.034 | 123.397 | 0 | 143 | 728 |
ADD_POINTS_SUM_YR_1 | 62988 | NaN | NaN | NaN | 540.317 | 3956.08 | 0 | 0 | 600000 |
ADD_POINTS_SUM_YR_2 | 62988 | NaN | NaN | NaN | 814.689 | 5121.8 | 0 | 0 | 728282 |
EXCHANGE_COUNT | 62988 | NaN | NaN | NaN | 0.319775 | 1.136 | 0 | 0 | 46 |
avg_discount | 62988 | NaN | NaN | NaN | 0.721558 | 0.185427 | 0 | 0.711856 | 1.5 |
P1Y_Flight_Count | 62988 | NaN | NaN | NaN | 5.76626 | 7.21092 | 0 | 3 | 118 |
L1Y_Flight_Count | 62988 | NaN | NaN | NaN | 6.07316 | 8.17513 | 0 | 3 | 111 |
P1Y_BP_SUM | 62988 | NaN | NaN | NaN | 5366.72 | 8537.77 | 0 | 2692 | 246197 |
L1Y_BP_SUM | 62988 | NaN | NaN | NaN | 5558.36 | 9351.96 | 0 | 2547 | 259111 |
EP_SUM | 62988 | NaN | NaN | NaN | 265.69 | 1645.7 | 0 | 0 | 74460 |
ADD_Point_SUM | 62988 | NaN | NaN | NaN | 1355.01 | 7868.48 | 0 | 0 | 984938 |
Eli_Add_Point_Sum | 62988 | NaN | NaN | NaN | 1620.7 | 8294.4 | 0 | 0 | 984938 |
L1Y_ELi_Add_Points | 62988 | NaN | NaN | NaN | 1080.38 | 5639.86 | 0 | 0 | 728282 |
Points_Sum | 62988 | NaN | NaN | NaN | 12545.8 | 20507.8 | 0 | 6328.5 | 985572 |
L1Y_Points_Sum | 62988 | NaN | NaN | NaN | 6638.74 | 12601.8 | 0 | 2860.5 | 728282 |
Ration_L1Y_Flight_Count | 62988 | NaN | NaN | NaN | 0.486419 | 0.319105 | 0 | 0.5 | 1 |
Ration_P1Y_Flight_Count | 62988 | NaN | NaN | NaN | 0.513581 | 0.319105 | 0 | 0.5 | 1 |
Ration_P1Y_BPS | 62988 | NaN | NaN | NaN | 0.522293 | 0.339632 | 0 | 0.514252 | 0.999989 |
Ration_L1Y_BPS | 62988 | NaN | NaN | NaN | 0.468422 | 0.338956 | 0 | 0.476747 | 0.999993 |
Point_NotFlight | 62988 | NaN | NaN | NaN | 2.72815 | 7.36416 | 0 | 0 | 140 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62988 entries, 0 to 62987
Data columns (total 44 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MEMBER_NO 62988 non-null int64
1 FFP_DATE 62988 non-null object
2 FIRST_FLIGHT_DATE 62988 non-null object
3 GENDER 62985 non-null object
4 FFP_TIER 62988 non-null int64
5 WORK_CITY 60719 non-null object
6 WORK_PROVINCE 59740 non-null object
7 WORK_COUNTRY 62962 non-null object
8 AGE 62568 non-null float64
9 LOAD_TIME 62988 non-null object
10 FLIGHT_COUNT 62988 non-null int64
11 BP_SUM 62988 non-null int64
12 EP_SUM_YR_1 62988 non-null int64
13 EP_SUM_YR_2 62988 non-null int64
14 SUM_YR_1 62437 non-null float64
15 SUM_YR_2 62850 non-null float64
16 SEG_KM_SUM 62988 non-null int64
17 WEIGHTED_SEG_KM 62988 non-null float64
18 LAST_FLIGHT_DATE 62988 non-null object
19 AVG_FLIGHT_COUNT 62988 non-null float64
20 AVG_BP_SUM 62988 non-null float64
21 BEGIN_TO_FIRST 62988 non-null int64
22 LAST_TO_END 62988 non-null int64
23 AVG_INTERVAL 62988 non-null float64
24 MAX_INTERVAL 62988 non-null int64
25 ADD_POINTS_SUM_YR_1 62988 non-null int64
26 ADD_POINTS_SUM_YR_2 62988 non-null int64
27 EXCHANGE_COUNT 62988 non-null int64
28 avg_discount 62988 non-null float64
29 P1Y_Flight_Count 62988 non-null int64
30 L1Y_Flight_Count 62988 non-null int64
31 P1Y_BP_SUM 62988 non-null int64
32 L1Y_BP_SUM 62988 non-null int64
33 EP_SUM 62988 non-null int64
34 ADD_Point_SUM 62988 non-null int64
35 Eli_Add_Point_Sum 62988 non-null int64
36 L1Y_ELi_Add_Points 62988 non-null int64
37 Points_Sum 62988 non-null int64
38 L1Y_Points_Sum 62988 non-null int64
39 Ration_L1Y_Flight_Count 62988 non-null float64
40 Ration_P1Y_Flight_Count 62988 non-null float64
41 Ration_P1Y_BPS 62988 non-null float64
42 Ration_L1Y_BPS 62988 non-null float64
43 Point_NotFlight 62988 non-null int64
dtypes: float64(12), int64(24), object(8)
memory usage: 21.1+ MB
# Remove duplicate rows
data.drop_duplicates(inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62988 entries, 0 to 62987
Data columns (total 44 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MEMBER_NO 62988 non-null int64
1 FFP_DATE 62988 non-null object
2 FIRST_FLIGHT_DATE 62988 non-null object
3 GENDER 62985 non-null object
4 FFP_TIER 62988 non-null int64
5 WORK_CITY 60719 non-null object
6 WORK_PROVINCE 59740 non-null object
7 WORK_COUNTRY 62962 non-null object
8 AGE 62568 non-null float64
9 LOAD_TIME 62988 non-null object
10 FLIGHT_COUNT 62988 non-null int64
11 BP_SUM 62988 non-null int64
12 EP_SUM_YR_1 62988 non-null int64
13 EP_SUM_YR_2 62988 non-null int64
14 SUM_YR_1 62437 non-null float64
15 SUM_YR_2 62850 non-null float64
16 SEG_KM_SUM 62988 non-null int64
17 WEIGHTED_SEG_KM 62988 non-null float64
18 LAST_FLIGHT_DATE 62988 non-null object
19 AVG_FLIGHT_COUNT 62988 non-null float64
20 AVG_BP_SUM 62988 non-null float64
21 BEGIN_TO_FIRST 62988 non-null int64
22 LAST_TO_END 62988 non-null int64
23 AVG_INTERVAL 62988 non-null float64
24 MAX_INTERVAL 62988 non-null int64
25 ADD_POINTS_SUM_YR_1 62988 non-null int64
26 ADD_POINTS_SUM_YR_2 62988 non-null int64
27 EXCHANGE_COUNT 62988 non-null int64
28 avg_discount 62988 non-null float64
29 P1Y_Flight_Count 62988 non-null int64
30 L1Y_Flight_Count 62988 non-null int64
31 P1Y_BP_SUM 62988 non-null int64
32 L1Y_BP_SUM 62988 non-null int64
33 EP_SUM 62988 non-null int64
34 ADD_Point_SUM 62988 non-null int64
35 Eli_Add_Point_Sum 62988 non-null int64
36 L1Y_ELi_Add_Points 62988 non-null int64
37 Points_Sum 62988 non-null int64
38 L1Y_Points_Sum 62988 non-null int64
39 Ration_L1Y_Flight_Count 62988 non-null float64
40 Ration_P1Y_Flight_Count 62988 non-null float64
41 Ration_P1Y_BPS 62988 non-null float64
42 Ration_L1Y_BPS 62988 non-null float64
43 Point_NotFlight 62988 non-null int64
dtypes: float64(12), int64(24), object(8)
memory usage: 21.6+ MB
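MEMBER_NO membership card number
FFP_DATE join date
FIRST_FLIGHT_DATE date of first flight
GENDER gender
FFP_TIER membership tier
WORK_CITY city
WORK_PROVINCE province
WORK_COUNTRY country
AGE age
LOAD_TIME end of the observation window
FLIGHT_COUNT number of flights within the observation window
BP_SUM total base points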
EP_SUM_YR_1
EP_SUM_YR_2
SUM_YR_1 total fare in the first year
SUM_YR_2 total fare in the second year
SEG_KM_SUM total kilometers flown within the observation window
WEIGHTED_SEG_KM
LAST_FLIGHT_DATE
AVG_FLIGHT_COUNT average number of flights
AVG_BP_SUM
BEGIN_TO_FIRST
LAST_TO_END
AVG_INTERVAL average interval between flights
MAX_INTERVAL maximum interval between flights
ADD_POINTS_SUM_YR_1
ADD_POINTS_SUM_YR_2
EXCHANGE_COUNT
avg_discount average discount rate
P1Y_Flight_Count
L1Y_Flight_Count
P1Y_BP_SUM
L1Y_BP_SUM
EP_SUM
ADD_Point_SUM
Eli_Add_Point_Sum
L1Y_ELi_Add_Points
Points_Sum
L1Y_Points_Sum
Ration_L1Y_Flight_Count
Ration_P1Y_Flight_Count
Ration_P1Y_BPS
Ration_L1Y_BPS
Point_NotFlight number of point changes not related to flights
男 48134
女 14851
Name: GENDER, dtype: int64
0 男
1 男
2 男
3 男
4 男
..
62983 女
62984 男
62985 女
62986 女
62987 女
Name: GENDER, Length: 62988, dtype: object
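"FFP_DATE", "LOAD_TIME", "FLIGHT_COUNT", "SUM_YR_1", "SUM_YR_2", "SEG_KM_SUM", "AVG_INTERVAL", "MAX_INTERVAL", "avg_discount"
FFP_DATE join date
LOAD_TIME end of the observation window
FLIGHT_COUNT number of flights within the observation window
SUM_YR_1 total fare in the first year
SUM_YR_2 total fare in the second year
SEG_KM_SUM total kilometers flown within the observation window
AVG_INTERVAL average interval between flights
MAX_INTERVAL maximum interval between flights
avg_discount average discount rate
First-year total fare, second-year total fare, and total kilometers flown in the observation window are selected in order to compute the average fare per kilometer. For an airline, a higher fare or a longer distance does not by itself mean more profit; on the contrary, short-haul customers flying in premium cabins generate more.
Total kilometers flown and number of flights are of course also important indicators of a customer's value.
The join date shows whether the customer is a long-standing one and how loyal they are.
The average interval between flights and the maximum interval within the observation window show whether the customer flies at a regular frequency.
The average discount rate reflects the revenue a customer brings to the company; after all, higher-value customers tend to have a higher average discount rate.
The derived features used for clustering are: "入會時間" (membership duration), "飛行次數" (number of flights), "平均每公裡票價" (average fare per kilometer), "總裡程" (total mileage), "時間間隔內插補點" (interval difference, that is, maximum interval minus average interval), "平均折扣率" (average discount rate)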
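To make the fare-per-kilometer reasoning concrete, here is a minimal sketch with two made-up customers. The row labels and all figures are assumptions chosen for illustration and are not taken from the airline data; only the formula (both years' fares divided by total kilometers) follows the article.

```python
import pandas as pd

# Two hypothetical customers; every value here is invented for illustration.
toy = pd.DataFrame(
    {
        "SUM_YR_1": [3000.0, 4000.0],
        "SUM_YR_2": [3000.0, 4000.0],
        "SEG_KM_SUM": [4000, 40000],
    },
    index=["short_haul_premium", "long_haul_economy"],
)

# Average fare per kilometer: total fare over both years / total kilometers flown.
toy["fare_per_km"] = (toy["SUM_YR_1"] + toy["SUM_YR_2"]) / toy["SEG_KM_SUM"]
print(toy["fare_per_km"])
```

The long-haul customer pays more in total and flies ten times as far, yet the short-haul customer pays far more per kilometer, which is the distinction this feature is meant to capture.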
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62988 entries, 0 to 62987
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 FFP_DATE 62988 non-null object
1 LOAD_TIME 62988 non-null object
2 FLIGHT_COUNT 62988 non-null int64
3 SUM_YR_1 62437 non-null float64
4 SUM_YR_2 62850 non-null float64
5 SEG_KM_SUM 62988 non-null int64
6 AVG_INTERVAL 62988 non-null float64
7 MAX_INTERVAL 62988 non-null int64
8 avg_discount 62988 non-null float64
dtypes: float64(4), int64(3), object(2)
memory usage: 4.8+ MB
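# Fill missing fares with the column mean; assigning the result back avoids in-place modification of a selected column
filter_data['SUM_YR_1'] = filter_data['SUM_YR_1'].fillna(filter_data['SUM_YR_1'].mean())
filter_data['SUM_YR_2'] = filter_data['SUM_YR_2'].fillna(filter_data['SUM_YR_2'].mean())
filter_data.describe([.02,.10,.25,.5,.75,.90,.99]).T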
column | count | mean | std | min | 2% | 10% | 25% | 50% | 75% | 90% | 99% | max |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FLIGHT_COUNT | 62988.0 | 11.839414 | 14.049471 | 2.0 | 2.0000 | 2.000000 | 3.000000 | 7.000000 | 15.000000 | 27.00 | 69.00 | 213.0 |
SUM_YR_1 | 62988.0 | 5355.376064 | 8073.902161 | 0.0 | 0.0000 | 0.000000 | 1020.000000 | 2844.000000 | 6524.250000 | 12939.00 | 37858.47 | 239560.0 |
SUM_YR_2 | 62988.0 | 5604.026014 | 8693.824796 | 0.0 | 0.0000 | 0.000000 | 785.000000 | 2784.000000 | 6826.250000 | 14065.90 | 41179.73 | 234188.0 |
SEG_KM_SUM | 62988.0 | 17123.878691 | 20960.844623 | 368.0 | 1475.7400 | 2727.000000 | 4747.000000 | 9994.000000 | 21271.250000 | 39729.60 | 100841.28 | 580717.0 |
AVG_INTERVAL | 62988.0 | 67.749788 | 77.517866 | 0.0 | 2.0000 | 9.729730 | 23.370370 | 44.666667 | 82.000000 | 146.00 | 412.00 | 728.0 |
MAX_INTERVAL | 62988.0 | 166.033895 | 123.397180 | 0.0 | 2.0000 | 18.000000 | 79.000000 | 143.000000 | 228.000000 | 339.00 | 551.00 | 728.0 |
avg_discount | 62988.0 | 0.721558 | 0.185427 | 0.0 | 0.3775 | 0.508989 | 0.611997 | 0.711856 | 0.809476 | 0.92 | 1.41 | 1.5 |
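The mean imputation of the two fare columns can be sketched on a tiny frame. The values below are hypothetical; the sketch only assumes that missing fares are replaced by the mean of the observed fares in the same column, as in the article.

```python
import numpy as np
import pandas as pd

# Hypothetical fares with missing entries; not the airline data.
fares = pd.DataFrame(
    {
        "SUM_YR_1": [1000.0, np.nan, 3000.0],
        "SUM_YR_2": [np.nan, 500.0, 1500.0],
    }
)

for col in ["SUM_YR_1", "SUM_YR_2"]:
    # mean() skips NaN by default, so the fill value is the mean of the observed fares.
    fares[col] = fares[col].fillna(fares[col].mean())

print(fares)
```

Mean imputation leaves the column mean unchanged but shrinks the spread. That matches the summaries above, where the mean of SUM_YR_1 stays at about 5355.38 while its standard deviation drops from 8109.45 to about 8073.90 after filling.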
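# Parse the date columns so that they can be subtracted
data["LOAD_TIME"] = pd.to_datetime(data["LOAD_TIME"])
data["FFP_DATE"] = pd.to_datetime(data["FFP_DATE"])
# Membership duration: end of the observation window minus the join date
data["入會時間"] = data["LOAD_TIME"] - data["FFP_DATE"]
# Average fare per kilometer: total fare over both years divided by total kilometers
data["平均每公裡票價"] = (data["SUM_YR_1"] + data["SUM_YR_2"]) / data["SEG_KM_SUM"]
# Interval difference: maximum interval minus average interval
data["時間間隔內插補點"] = data["MAX_INTERVAL"] - data["AVG_INTERVAL"]
# Rename the remaining features
deal_data = data.rename(
    columns = {"FLIGHT_COUNT" : "飛行次數", "SEG_KM_SUM" : "總裡程", "avg_discount" : "平均折扣率"},
    inplace = False
)
# Keep the six clustering features; .copy() makes later column assignments act on an independent frame
filter_data = deal_data[["入會時間", "飛行次數", "平均每公裡票價", "總裡程", "時間間隔內插補點", "平均折扣率"]].copy()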
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62988 entries, 0 to 62987
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 入會時間 62988 non-null timedelta64[ns]
1 飛行次數 62988 non-null int64
2 平均每公裡票價 62988 non-null float64
3 總裡程 62988 non-null int64
4 時間間隔內插補點 62988 non-null float64
5 平均折扣率 62988 non-null float64
dtypes: float64(3), int64(2), timedelta64[ns](1)
memory usage: 3.4 MB
filter_data['入會時間'].dt.days
0 2706
1 2597
2 2615
3 2047
4 1816
...
62983 1046
62984 1484
62985 2923
62986 418
62987 407
Name: 入會時間, Length: 62988, dtype: int64
#filter_data['入會時間'] = filter_data['入會時間']/(60*60*24*10**9)
filter_data['入會時間']=filter_data['入會時間'].dt.days
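The conversion from a time difference to a number of days can be checked in isolation. This sketch reuses the join dates of the first two sample rows and the common LOAD_TIME shown earlier; the variable names are assumptions made for the example.

```python
import pandas as pd

# Join dates of the first two sample rows and the shared end of the observation window.
ffp_date = pd.to_datetime(pd.Series(["2006/11/2", "2007/2/19"]))
load_time = pd.to_datetime(pd.Series(["2014/3/31", "2014/3/31"]))

duration = load_time - ffp_date              # dtype timedelta64[ns]
days_attr = duration.dt.days                 # whole days as integers
days_div = duration / pd.Timedelta(days=1)   # the same quantity as floats

print(days_attr.tolist())
```

Both forms give 2706 and 2597 days, matching the first two values printed above. Dividing the timedelta column by a plain integer, as in the commented-out line, yields another timedelta rather than a number, so `.dt.days` or division by `pd.Timedelta(days=1)` is the more direct route.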
from sklearn.preprocessing import StandardScaler
standard = StandardScaler()
standard.fit(filter_data)
StandardScaler()
# Assumed step, not shown in the listing: build S_data (used below) from the standardized values, keeping the column names
S_data = pd.DataFrame(standard.transform(filter_data), columns=filter_data.columns)
S_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62988 entries, 0 to 62987
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 入會時間 62988 non-null float64
1 飛行次數 62988 non-null float64
2 平均每公裡票價 62988 non-null float64
3 總裡程 62988 non-null float64
4 時間間隔內插補點 62988 non-null float64
5 平均折扣率 62988 non-null float64
dtypes: float64(6)
memory usage: 2.9 MB
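What the standardization step does can be sketched on a small made-up frame. The column names and values are assumptions for illustration; the point is that `StandardScaler` subtracts each column's mean and divides by its population standard deviation, so features measured in flights and in kilometers end up on a comparable scale before K-means computes distances.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
toy = pd.DataFrame(
    {
        "flights": [2, 7, 15, 60],
        "km": [1500.0, 9000.0, 21000.0, 100000.0],
    }
)

scaler = StandardScaler().fit(toy)
# transform() returns a NumPy array, so wrap it to keep the column names.
scaled = pd.DataFrame(scaler.transform(toy), columns=toy.columns)

# Manual z-score with the population standard deviation (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates the Euclidean distances only because of its units.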
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
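# silhouette_score returns the mean silhouette coefficient over all samples
from sklearn.metrics import silhouette_samples
# silhouette_samples returns the silhouette coefficient of each individual sample
inertia = []
silhouette = []
for i in range(2,10):
    # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
    cluster = KMeans(n_clusters=i,random_state=0).fit(S_data)
    inertia.append(cluster.inertia_)
    silhouette.append(silhouette_score(S_data,cluster.labels_))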
print(inertia)
print(silhouette)
[300992.94143881754, 249961.1287967345, 212929.66150454507, 187429.24421259848, 170776.80489673465, 154981.14913712352, 145834.9083294653, 138235.00566447436]
[0.3592867371629582, 0.21454689862059148, 0.20674237627094663, 0.2213072501911795, 0.2103711574222313, 0.21673681639729542, 0.20137242231980962, 0.2065838067406841]
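In the results above, inertia falls steadily without a sharp elbow and the silhouette score is highest at k = 2, so the choice of k is a judgment call on this data. The same selection loop is easier to read on data where the answer is known. The sketch below uses synthetic blobs with three well-separated groups (an assumption made purely for illustration) and also shows how `silhouette_score` relates to `silhouette_samples`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data with three well-separated groups; not the airline data.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

inertias = {}
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_                      # within-cluster sum of squares
    scores[k] = silhouette_score(X, model.labels_)    # mean silhouette coefficient

# Pick the k with the highest mean silhouette coefficient.
best_k = max(scores, key=scores.get)
print(best_k)

# silhouette_score is the mean of the per-sample values from silhouette_samples.
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
per_sample = silhouette_samples(X, labels)
```

On real customer data the two criteria rarely agree this cleanly, which is why the article goes on to compare several values of k visually.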
from matplotlib import pyplot as plt
# Plot SSE (inertia) against k to look for a suitable value of k
# Display Chinese characters and minus signs correctly
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['font.size'] = 12.0
plt.rcParams['axes.unicode_minus'] = False
# Use the ggplot plotting style
plt.style.use('ggplot')
fig=plt.figure(figsize=(10, 8))
ax=fig.add_subplot(1,1,1)
ax.plot(range(2,10),inertia,marker="+")
ax.set_xlabel("n_clusters", fontsize=18)
fig.suptitle("KMeans", fontsize=20)
plt.show()
# Plot the mean silhouette coefficient against k to look for a suitable value of k
# Display Chinese characters and minus signs correctly
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['font.size'] = 12.0
plt.rcParams['axes.unicode_minus'] = False
# Use the ggplot plotting style
plt.style.use('ggplot')
fig=plt.figure(figsize=(10, 8))
ax=fig.add_subplot(1,1,1)
ax.plot(range(2,10),silhouette,marker="+")
ax.set_xlabel("n_clusters", fontsize=18)
fig.suptitle("KMeans", fontsize=20)
plt.show()
for k in range(4,9,2):
    # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
    kmodel = KMeans(n_clusters=k)
    kmodel.fit(S_data)
    # Summarize the result
    r1 = pd.Series(kmodel.labels_).value_counts() # number of customers in each cluster
    r2 = pd.DataFrame(kmodel.cluster_centers_) # cluster centers
    # Largest and smallest coordinate over all cluster centers
    center_max = r2.values.max()
    center_min = r2.values.min()
    r = pd.concat([r2, r1], axis = 1) # join side by side (axis=0 would stack), pairing each center with its cluster size
    r.columns = list(S_data.columns) + [u'類别數目'] # rename the columns; the last one holds the cluster size
    # Plot
    fig=plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, polar=True)
    center_num = r.values
    feature = ["入會時間", "飛行次數", "平均每公裡票價", "總裡程", "時間間隔內插補點", "平均折扣率"]
    N =len(feature)
    for i, v in enumerate(center_num):
        # Angles that divide the circle evenly, one per feature
        angles=np.linspace(0, 2*np.pi, N, endpoint=False)
        # Repeat the first point at the end so that the radar polygon closes
        center = np.concatenate((v[:-1],[v[0]]))
        angles=np.concatenate((angles,[angles[0]]))
        # Draw the outline; the legend label reads "cluster i, n customers"
        ax.plot(angles, center, 'o-', linewidth=2, label = "第%d簇人群,%d人"% (i+1,v[-1]))
        # Fill the polygon
        ax.fill(angles, center, alpha=0.25)
    # Label each feature axis
    ax.set_thetagrids(angles[:-1] * 180/np.pi, feature, fontsize=15)
    # Set the radial range of the radar chart
    ax.set_ylim(center_min-0.1, center_max+0.1)
    # Title ("customer group feature analysis")
    plt.title('客戶群特征分析圖', fontsize=20)
    # Grid lines
    ax.grid(True)
    # Legend
    plt.legend(loc='upper right', bbox_to_anchor=(1.3,1.0),ncol=1,fancybox=True,shadow=True)
    # Show the figure
    plt.show()
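The step that closes the radar polygon is easy to get wrong, so here it is on its own. The cluster center below is invented; the sketch only assumes six features, as in the article.

```python
import numpy as np

# One hypothetical cluster center over six standardized features.
center = np.array([0.5, -0.2, 1.1, 0.3, -0.7, 0.9])
n_features = len(center)

# One angle per feature, evenly spaced; endpoint=False avoids placing
# a seventh axis at 2*pi on top of the first one at 0.
angles = np.linspace(0, 2 * np.pi, n_features, endpoint=False)

# Append the first point again so the last line segment returns to the start.
closed_center = np.concatenate((center, [center[0]]))
closed_angles = np.concatenate((angles, [angles[0]]))

print(np.degrees(angles))
```

The closed arrays have one more element than there are features, which is why the plotting code labels the axes with `angles[:-1]` rather than the full closed array.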
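Cluster 1, 9991 customers: the defining trait is the largest interval difference. These are probably "seasonal customers" who fly many times during one period of the year and travel little the rest of the time. They should be retained first and then developed further.
Cluster 2, 3157 customers: both the average fare per kilometer and the average discount rate are the highest. These are most likely well-off business travelers flying in premium cabins. They are a priority both to retain and to develop, and incentives should be offered actively to increase how often they fly.
Cluster 3, 16245 customers: membership duration is short, while fare per kilometer and average discount rate are fairly high. These are new customers.
Cluster 4, 5221 customers: total mileage and number of flights are the highest, and the average fare per kilometer is also fairly high. These are priority customers to retain.
Cluster 5, 14357 customers: the defining trait is a long membership duration. As long-standing customers they would be expected to have a high average discount rate, but within the observation window it is low, and neither total mileage nor number of flights is high. These may be customers who have churned.
Cluster 6, 14027 customers: low on every measure. These are ordinary or low-value customers.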
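Readings such as "highest fare per kilometer" come from comparing cluster centers feature by feature, which can also be tabulated instead of read off the radar chart. The centers and English column names below are hypothetical and stand in for `kmodel.cluster_centers_`; this is a sketch of the comparison, not the article's actual results.

```python
import pandas as pd

# Hypothetical standardized cluster centers (rows = clusters, columns = features).
centers = pd.DataFrame(
    {
        "membership_duration": [0.1, -0.3, -1.0, 1.2],
        "flight_count": [-0.2, 0.4, -0.1, 2.1],
        "fare_per_km": [0.0, 2.5, 0.6, -0.4],
    },
    index=[1, 2, 3, 4],
)

# For every feature, the cluster whose center is highest and lowest.
highest = centers.idxmax()
lowest = centers.idxmin()

print(highest.to_dict())
print(lowest.to_dict())
```

A table like this makes it harder to overlook a cluster that is extreme on a feature but visually crowded on the chart.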