筆者看到lifelines的文檔裡面涵蓋的生存分析的子產品以及講解，比能查到的任何地方都要完整，是以把這個庫作為學習素材。

當然，另外也可以參考一下SPSS軟體裡面的子產品，也是非常經典的一些學習對象。

文章目錄

0 lifelines介紹
- 0.1 lifelines幾個重要方法
- 0.2 完整的生存分析流程
1 生存機率估計
- 1.1 非參數：KM資料樣式——durations
- 1.2 繪制KM曲線
- 1.3 分組KM圖
- 1.4 分組KM曲線差異名額
- - 1.4.1 中位數50%存活率
  - 1.4.2 Logrank檢驗
- 1.5 參數估計：生存函數
2 Cox 比例風險回歸模型
- 2.1 資料集加載
- 2.2 比例風險Cox回歸
- 2.3 比例cox回歸中協變量值如何影響生存曲線
3 非比例風險模型——時變風險：Time varying survival regression
- 3.1 Aalen’s Additive model 模型
- 3.2 COX風險時變模型：Cox’s time varying proportional hazard model
- - 3.2.1 資料樣式
  - 3.2.2 CoxTimeVaryingFitter 模組化
- 3.3 完整比例cox -> CoxTimeVarying 探索模組化過程
- - 3.3.1 資料加載
  - 3.3.2 先建構比例COX模型
  - 3.3.3 檢驗PH假定：proportional_hazard_test
  - 3.3.4 分類變量未溝通PH檢驗處理方式：strata分層變量
  - 3.3.5 連續變量未溝通PH檢驗處理方式一：strata分層變量
  - 3.3.6 連續變量未溝通PH檢驗處理方式二：修改cox公式
  - 3.3.7 連續變量未溝通PH檢驗處理方式三：分段資料集
- 3.4 PH檢驗（proportional hazard assumption）是否一定需要？
4 非比例風險模型—— 參數估計模型 Weibull AFT model
5 lifelines相關工具函數
- 5.1 壽命表制作函數——survival_table_from_events
- 5.2 KM曲線資料樣式制作函數
- 5.3 COX 時變回歸模型中的資料樣式制作函數
- - 5.3.1 第一種：add_covariate_to_timeline
  - 5.3.2 第二種：to_episodic_format
6 累計風險函數：Cumulative hazard function
- 6.1 非參數估計：Nelson-Aalen 累計風險函數圖
- 6.2 參數估計：其他累計風險參數估計模型
7 如何選擇最佳的參數估計模型
- 7.1 QQ圖
- 7.2 AIC圖
8 模型一緻性、準确率以及校準
- 8.1 模型一緻性
- 8.2 模型校準性
- 8.3 精準校驗值：brier_score_loss
8 其他

本系列的續作：

生存分析——快手的基于深度學習架構的內建⽣存分析軟體KwaiSurvival（一）
生存分析——KM生存曲線、hazard比例、PH假定檢驗、非比例風險模型（分層/時變/參數模型）（二）
生存分析——跟着lifelines學生存分析模組化（三）

0 lifelines介紹

github位址：CamDavidsonPilon/lifelines

文檔位址：lifelines

生存分析最初是由精算師和醫學界大量開發和應用的。它的目的是回答為什麼事件發生在現在，而不是在不确定的情況下(事件可能涉及死亡，疾病緩解等)。

這對那些對測量壽命感興趣的研究人員來說是件好事:他們可以回答諸如什麼因素可能影響死亡之類的問題。但在醫學和精算科學之外，還有許多其他有趣和令人興奮的生存分析應用。

例如:

SaaS提供商感興趣的是度量訂閱者的生命周期，即首次操作所需的時間
庫存缺貨是對商品真正“需求”的審查事件。
社會學家對衡量政黨的壽命、關系或婚姻感興趣
A/B測試是為了确定不同的團隊執行一個動作需要多長時間。

還有用來判斷使用者流失以及快手有一篇來判定使用者活躍度（本質也是從判定流失開始）

0.1 lifelines幾個重要方法

在生存分析——KM生存曲線、hazard比例、PH假定檢驗、非比例風險模型（分層/時變/參數模型）（二）提及到幾個，這裡筆者自己總結一下lifelines中幾個比較核心的：

子產品	描述	類型	方法
survival function	研究對象從試驗開始直到某個特定時間點仍然存活的機率	參數估計	Exponential, Log-Logistic, Log-Normal and Splines
非參數估計	Kaplan-Meier估計
cumulative hazard	風險函數的估計值	參數估計	Exponential, Log-Logistic, Log-Normal and Splines
非參數估計	Nelson-Aalen估計
Survival regression	會加入額外的協變量(如年齡、國家等)與另一個變量進行回歸	比例回歸	Cox 比例風險回歸模型，指數回歸模型，Weibull回歸模型，Poisson回歸模型
非比例回歸	含參數與半參模型：Aalen’s Additive model 模型、 CoxTimeVarying時變模型、AFT(accelerated failure time model)加速失效模型

0.2 完整的生存分析流程

從Time-Dependent 生存模型分析使用者流失來看一個完整的生存分析可歸納為：

原始資料格式處理：把資料處理為使用者、生存時長、是否删失的資料格式。
KM估計及生存曲線的繪制。
判斷協變量是否存在時變變量，如果有，進行資料格式的二次處理，将資料打斷為使用者、起始時間、結束時間、是否删失的格式。
判斷協變量系數是否存在時變效果，即著名的PH假設檢驗。如果檢驗不通過，對時變效果進行繪制，并基于繪制結果進行資料分層（stratify）。
建立Extended Cox PH Model，對風險因子進行影響估計。

一個比較好的案例可以參考：【3.3 完整比例cox -> CoxTimeVarying 探索模組化過程

1 生存機率估計

從這篇資料分析系列：生存分析（生存曲線分析、Cox回歸分析）——附生存分析python代碼。開始來看一下lifelines實作KM方法，結合lifelines文檔，來完整看看KM流程。

生存函數估計有非常多方法：

非參估計：Kaplan-Meier
參數估計：WeibullFitter等

1.1 非參數：KM資料樣式——durations

資料集樣式：

from lifelines.datasets import load_waltons
from lifelines import KaplanMeierFitter
from lifelines.utils import median_survival_times
import matplotlib.pyplot as plt

# 資料載入
df = load_waltons()
print(df.head(),'\n')
print(df['T'].min(), df['T'].max(),'\n')
print(df['E'].value_counts(),'\n')
print(df['group'].value_counts(),'\n')

複制

生存分析——跟着lifelines學生存分析模組化（三）0 lifelines介紹1 生存機率估計2 Cox 比例風險回歸模型3 非比例風險模型——時變風險：Time varying survival regression4 非比例風險模型—— 參數估計模型 Weibull AFT model5 lifelines相關工具函數6 累計風險函數：Cumulative hazard function7 如何選擇最佳的參數估計模型8 模型一緻性、準确率以及校準8 其他

可以看到資料有三列，其中T代表min(T, C)，其中T為死亡時間，C為觀測截止時間。

E代表是否觀察到“死亡”，1代表觀測到了，0代表未觀測到，即生存分析中的删失資料，共7個。

group代表是否存在病毒， miR-137代表存在病毒，control代表為不存在即對照組，根據統計，存在miR-137病毒人數34人，不存在129人。

需要注意，該格式并非嚴格的壽命表，具體的轉化為壽命表可以看

[5.1 小節]

1.2 繪制KM曲線

利用此資料取拟合拟生存分析中的Kaplan Meier模型（專用于估計生存函數的模型）

# KM初始化
kmf = KaplanMeierFitter()
kmf.fit(df['T'], event_observed=df['E'])

# 全體: K-M 存活曲線 
kmf.survival_function_  # km 生存機率
kmf.plot_survival_function()

複制

圖中藍色實線為生存曲線，淺藍色帶代表了95%置信區間。

随着時間增加，存活機率S(t)越來越小，這是一定的，同時S(t)=0.5時，t的95%置信區間為[53, 58]。

這裡來看幾種畫法：

kmf.survival_function_  # km 生存機率
kmf.plot_survival_function()

複制

kmf.survival_function_

給出了随着時間增加，KM生存估計機率的趨勢。

同時還可以在圖中畫出KM值随着時間上升，原始的删失值以及發生機率：

kmf.plot_survival_function(at_risk_counts=True) # plot_survival_function + 把所有數字也顯示出來
plt.tight_layout()

複制

這裡還可以畫出一條，關于生存機率的，累計機率密度圖：

kmf.cumulative_density_   # 累計機率密度圖
kmf.plot_cumulative_density() # # 繪制累積密度函數的漂亮圖

複制

1.3 分組KM圖

這并不是我們關注的重點，我們真正要關注的實驗組（存在病毒）和對照組（未存在病毒）的生存曲線差異。

是以我們要按照group等于“miR-137”和“control”分組，分别觀察對應的生存曲線:

# 分組 : K-M 存活曲線
ax = plt.subplot(111)
kmf = KaplanMeierFitter()

for name, grouped_df in df.groupby('group'):
    kmf.fit(grouped_df["T"], grouped_df["E"], label=name)
    kmf.plot_survival_function(ax=ax)

複制

可以看到，帶有miR-137病毒的生存曲線在control組下方。說明其平均存活時間明顯小于control組。同時帶有miR-137病毒存活50%對應的存活時間95%置信區間為[19,29]，對應的control組為[56,60]。

差異較大，這個方法可以應用在分析使用者流失等場景，比如我們對一組人群實行了一些防止流行活動，我們可以通過此種方式分析我們活動是否有效。

1.4 分組KM曲線差異名額

除了直接看圖，還可以用什麼方式來區分不同組的KM曲線：

1.4.1 中位數50%存活率

另外還有中位數50%存活率下存活時間的置信區間：

median_ = kmf.median_survival_time_
median_confidence_interval_ = median_survival_times(kmf.confidence_interval_)
print(median_confidence_interval_) 
>>>     KM_estimate_lower_0.95  KM_estimate_upper_0.95
>>>0.5                    53.0                    58.0

複制

代表着，53-58天左右,存活着整體的一半人。

1.4.2 Logrank檢驗

Logrank 檢驗的零假設是指兩組的生存時間分布完全一緻，當我們通過計算拒絕零假設時，就可以認為兩組的生存時間分布存在統計學差異。

from lifelines.statistics import logrank_test

T = df['T']
E = df['E']

dem = (df['group'] == 'miR-137')

results = logrank_test(T[dem], T[~dem], E[dem], E[~dem], alpha=.99)
results.print_summary()
print(results.p_value)
print(results.test_statistic)

複制

輸出結果：

<lifelines.StatisticalResult: logrank_test>
               t_0 = -1
 null_distribution = chi squared
degrees_of_freedom = 1
             alpha = 0.99
         test_name = logrank_test

---
 test_statistic      p  -log2(p)
         122.25 <0.005     91.99
2.0359832222854986e-28
122.2491255730062

複制

P值小于0.05 拒絕原假設,存在差異

1.5 參數估計：生存函數

參考：Other parametric models: Exponential, Log-Logistic, Log-Normal and Splines

from lifelines.datasets import load_waltons
from lifelines import KaplanMeierFitter
from lifelines.utils import median_survival_times

# 資料載入
df = load_waltons()
T = df['T']
E = df['E']
print(df.head(),'\n')
print(df['T'].min(), df['T'].max(),'\n')
print(df['E'].value_counts(),'\n')
print(df['group'].value_counts(),'\n')


import matplotlib.pyplot as plt
import numpy as np
from lifelines import *

fig, axes = plt.subplots(3, 3, figsize=(13.5, 7.5))

# 多種參數模型
kmf = KaplanMeierFitter().fit(T, E, label='KaplanMeierFitter')
wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
ggf = GeneralizedGammaFitter().fit(T, E, label='GeneralizedGammaFitter')
sf = SplineFitter(np.percentile(T.loc[E.astype(bool)], [0, 50, 100])).fit(T, E, label='SplineFitter')

wbf.plot_survival_function(ax=axes[0][0])
exf.plot_survival_function(ax=axes[0][1])
lnf.plot_survival_function(ax=axes[0][2])
kmf.plot_survival_function(ax=axes[1][0])
llf.plot_survival_function(ax=axes[1][1])
pwf.plot_survival_function(ax=axes[1][2])
ggf.plot_survival_function(ax=axes[2][0])
sf.plot_survival_function(ax=axes[2][1])

複制

2 Cox 比例風險回歸模型

2.1 資料集加載

與KM的壽命表不太一樣，COX是需要協變量的。

其中T代表min(T, C)，其中T為死亡時間，C為觀測截止時間。E代表是否觀察到“死亡”，1代表觀測到了，0代表未觀測到，即生存分析中的**“删失”**資料，删失資料共11個。

var1,var2,var3代表了我們關系的變量，可以是是否為實驗組的虛拟變量，可以是一個使用者的管道路徑，也可以是使用者自身的屬性

2.2 比例風險Cox回歸

# 模型一初始化 —— Cox proportional hazard model 
cph = CoxPHFitter()
cph.fit(regression_dataset, 'T', event_col='E')
cph.print_summary()

複制

event_col代表事件發生情況，死亡/存活。

結果分析：從結果來看，我們認為var1和var3在5%的顯著性水準下是顯著的。認為var1水準越高，使用者的風險函數值越大，即存活時間越短（cox回歸是對風險函數模組化，這與死亡加速模型剛好相反，死亡加速模型是對存活時間模組化，兩個模型的參數符号相反）。同理，var3水準越高，使用者的風險函數值越大。

這裡還可以畫出每個參數的風險水準coef值：

cph.plot()

複制

一個友善的特征重要性的可視化和各種特征的風險的影響。

2.3 比例cox回歸中協變量值如何影響生存曲線

在文章使用python來進行使用者流失預測的實戰提到了

plot_covariate_groups

展示在不同協變量下的生存曲線情況，在lifelines的0.25版本後，函數為:

plot_partial_effects_on_outcome

def plot_partial_effects_on_outcome(
        self, covariates, values, plot_baseline=True, times=None, y="survival_function", ax=None, **kwargs
    ):

複制

其中參數含義：

covariates: string or list

a string (or list of strings) of the covariate(s) in the original dataset that we wish to vary.
values: 1d or 2d iterable

an iterable of the specific values we wish the covariate(s) to take on.
plot_baseline: bool

also display the baseline survival, defined as the survival at the mean of the original dataset.
y: str

one of “survival_function”, or “cumulative_hazard”

産生一個可視化的表示，将模型的基線生存曲線與一組中協變量值發生變化時發生的情況進行比較。

當我們改變協變量(s)時，在其他條件相同的情況下，這有助于比較受試者的生存期。

基線生存曲線等于原始資料集中所有平均值上的預測生存曲線。

直接來看官方的例子，單個變量：

from lifelines import datasets, WeibullAFTFitter
rossi = datasets.load_rossi()
wf = WeibullAFTFitter().fit(rossi, 'week', 'arrest')
wf.plot_partial_effects_on_outcome('prio', values=[1,50,113], cmap='coolwarm')

複制

也就是當其他協變量都不變的時候，prio的三個取值，生存差異如何，其中還有baseline；

橫軸是時間，縱軸是生存機率；

可以看到prio = 50這條風險極大，

prio=1風險高于平均值，

prio = 113相當于已經“流失”了

多個變量：

# multiple variables at once
wf.plot_partial_effects_on_outcome(['prio', 'paro'], values=[[0, 0], [5, 0], [10, 0], [0, 1], [5, 1], [10, 1]], cmap='coolwarm', y="hazard")

複制

y = hazard

是風險值，雙變量等于某值的時候進行hazard風險值判定。

3 非比例風險模型——時變風險：Time varying survival regression

3.1 Aalen’s Additive model 模型

# 資料加載
from lifelines.datasets import load_regression_dataset
from lifelines import CoxPHFitter
regression_dataset = load_regression_dataset()

print(regression_dataset.head())
print(regression_dataset['E'].value_counts())

# 使用模型 Aalen's Additive model
from lifelines import AalenAdditiveFitter
aaf = AalenAdditiveFitter(fit_intercept=False)
aaf.fit(regression_dataset, 'T', event_col='E')

複制

将

2.2 + 4.1 + 3.1

的模型一樣繪制在圖上：

X = regression_dataset.loc[0]

ax = wft.predict_survival_function(X).rename(columns={0:'WeibullAFT'}).plot()
cph.predict_survival_function(X).rename(columns={0:'CoxPHFitter'}).plot(ax=ax)
aaf.predict_survival_function(X).rename(columns={0:'AalenAdditive'}).plot(ax=ax)

複制

可以觀察到三個模型風險預測情況

3.2 COX風險時變模型：Cox’s time varying proportional hazard model

參考：

Cox’s time varying proportional hazard model
生存分析——KM生存曲線、hazard比例、PH假定檢驗、非比例風險模型（分層/時變/參數模型）（二）

3.2.1 資料樣式

一個典型的例子就是多療程治療下使用者的死亡時間，如果以使用者接受的藥劑量來做協變量，則屬于一個經典時變變量。

因為使用者活得越久，接受大療程越多，注入的要藥劑也越多。換而言之，藥劑量在使用者的生存期内，是随時間變化的，不像性别這些因素一樣保持不變。

這樣的問題在使用者流失中同樣存在，如優惠券的累計發放量，同樣與藥劑量類似。

對于這種變量，我們需要對原始資料做split。根據變化的時間節點，把原始資料打斷為多行。

import pandas as pd
from lifelines.utils import to_long_format
from lifelines.utils import add_covariate_to_timeline

base_df = pd.DataFrame([
    {'id': 1, 'duration': 10, 'event': True, 'var1': 0.1},
    {'id': 2, 'duration': 12, 'event': True, 'var1': 0.5}
])
base_df = to_long_format(base_df, duration_col="duration")

cv = pd.DataFrame([
    {'id': 1, 'time': 0, 'var2': 1.4},
    {'id': 1, 'time': 4, 'var2': 1.2},
    {'id': 1, 'time': 8, 'var2': 1.5},
    {'id': 2, 'time': 0, 'var2': 1.6},
])
base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="event", delay=5)\
            .fillna(0)

print(base_df)

複制

這裡解釋一下

base_df

是基礎上的資料，其中

ID=1

的事件，經曆了10天，然後event發生了，其中

var1

這裡不是時變量。

但是，發現有一個時變變量

var2

需要給入，而且這個時變量

var2

，将時間分割為：

[[0-4],[4-8],[8-10]]

，是以這裡通過

add_covariate_to_timeline

就可以構造這一特殊格式，資料樣式輸出為：

start  var1  var2  stop  id  event
0      0   0.1   NaN   5.0   1  False
1      5   0.1   1.4   9.0   1  False
2      9   0.1   1.2  10.0   1   True
3      0   0.5   NaN   5.0   2  False
4      5   0.5   1.6  12.0   2   True

複制

3.2.2 CoxTimeVaryingFitter 模組化

from lifelines import CoxTimeVaryingFitter

ctv = CoxTimeVaryingFitter(penalizer=0.1)
ctv.fit(base_df, id_col="id", event_col="event", start_col="start", stop_col="stop", show_progress=True)
ctv.print_summary()
ctv.plot()

複制

輸出的内容與之前COX一緻：

Iteration 5: norm_delta = 0.00000, step_size = 1.00000, ll = -0.35664, newton_decrement = 0.00000, seconds_since_start = 0.0Convergence completed after 5 iterations.
<lifelines.CoxTimeVaryingFitter: fitted with 5 periods, 2 subjects, 2 events>
         event col = 'event'
         penalizer = 0.1
number of subjects = 2
 number of periods = 5
  number of events = 2
partial log-likelihood = -0.36
  time fit was run = 2021-07-26 06:27:52 UTC

---
            coef  exp(coef)   se(coef)   coef lower 95%   coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%
covariate                                                                                                         
var1       -3.27       0.04       4.60           -12.28             5.74                 0.00               312.29
var2       -0.26       0.77       1.78            -3.74             3.23                 0.02                25.20

              z    p   -log2(p)
covariate                      
var1      -0.71 0.48       1.07
var2      -0.15 0.88       0.18
---
Partial AIC = 4.71
log-likelihood ratio test = 0.67 on 2 df
-log2(p) of ll-ratio test = 0.49

複制

同時附上

ctv.plot()

結果：

3.3 完整比例cox -> CoxTimeVarying 探索模組化過程

參考：Testing the proportional hazard assumptions

3.3.1 資料加載

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from matplotlib import pyplot as plt
from lifelines import CoxPHFitter
import numpy as np
import pandas as pd

rossi = load_rossi()

複制

資料集，這裡week 代表時間 T ， arrest 代表事件是否發生： event

3.3.2 先建構比例COX模型

cph = CoxPHFitter()
cph.fit(rossi, 'week', 'arrest')  # week = T /  arrest = event
cph.print_summary(model="untransformed variables", decimals=3)

複制

先建構比例COX模組化，來進行PH假定

3.3.3 檢驗PH假定：proportional_hazard_test

有兩種方式進行檢驗：

# 方式一
cph.check_assumptions(rossi, p_value_threshold=0.05, show_plots=True)
 
# 方式二：
from lifelines.statistics import proportional_hazard_test
results = proportional_hazard_test(cph, rossi, time_transform='rank')
results.print_summary(decimals=3, model="untransformed variables")

複制

方式一，這裡輸出的内容非常多，方拾二類似就不列舉了。

輸出的第一部分：變量檢驗表

輸出結果二：直接指出哪些變量有問題

輸出結果三：繪制有問題變量的scaled Schoenfeld residuals

最終結論：

age/wexp p值<0.05 則沒有通過假定,協變量相對于基線随時間變化的影響，需要做特殊處理

3.3.4 分類變量未溝通PH檢驗處理方式：strata分層變量

wexp 沒有通過proportional假定,而且wexp分類變量,适合作為strata分層變量

這裡不通過假定,可能是因為,樣本裡面不同分組的差異非常大,變量 -> 時間改變，

我們可以根據某些變量(我們稱之為strata分層變量)将資料集劃分為子樣本，對所有子樣本運作Cox模型，并比較它們的差異

cph.fit(rossi, 'week', 'arrest', strata=['wexp'])
cph.print_summary(model="wexp in strata")

複制

将wexp變為strata分層變量之後，再進行ph檢驗：

cph.check_assumptions(rossi, show_plots=True)

複制

發現wexp 已經檢驗通過，但是age還不行

3.3.5 連續變量未溝通PH檢驗處理方式一：strata分層變量

那麼這裡第一種方式還是把age變量分箱

pd.cut

，然後進行分層：

# age修複二:age進行分層 并作為分層變量 與wexp一緻的做法
rossi_strata_age = rossi.copy()
rossi_strata_age['age_strata'] = pd.cut(rossi_strata_age['age'], np.arange(0, 80, 3))

rossi_strata_age[['age', 'age_strata']].head()

rossi_strata_age = rossi_strata_age.drop('age', axis=1)
cph.fit(rossi_strata_age, 'week', 'arrest', strata=['age_strata', 'wexp'])

cph.print_summary(3, model="stratified age and wexp")
cph.plot()

cph.check_assumptions(rossi_strata_age)
>>> Proportional hazard assumption looks okay.

複制

這裡有一個問題就是，分箱越細，那麼結論越精準，分箱越粗，結論就會有資訊損失在分箱過程中。

這是一個需要平衡的問題。

最終檢驗，此時全部通過了檢驗。

3.3.6 連續變量未溝通PH檢驗處理方式二：修改cox公式

直接貼原文好了：

The proportional hazard test is very sensitive (i.e. lots of false positives) when the functional form of a variable is incorrect. For example, if the association between a covariate and the log-hazard is non-linear, but the model has only a linear term included, then the proportional hazard test can raise a false positive.

複制

這裡可以看到需要探索各類非線性age~log-hazard 是否可行：

# age修複一:修改cox公式
rossi['age**2'] = (rossi['age'] - rossi['age'].mean())**2
rossi['age**3'] = (rossi['age'] - rossi['age'].mean())**3


cph.fit(rossi, 'week', 'arrest', strata=['wexp'], formula="bs(age, df=4, lower_bound=10, upper_bound=50) + fin +race + mar + paro + prio")
cph.print_summary(model="spline_model"); print()
cph.check_assumptions(rossi, show_plots=True, p_value_threshold=0.05)

複制

這裡可以看到如果加入一些非線性age的形态，wexp之前造成的PH幹擾也在減小，

是以通常來說，改變公式形态，都會對其他變量産生有利影響，此時我們甚至可以移除作為分層變量的wexp（

strata=['wexp']

）

3.3.7 連續變量未溝通PH檢驗處理方式三：分段資料集

age修複三:Introduce time-varying covariates : 整改資料源，再次依據age進行時變整合

第一步：轉換成episodic格式
第二步：在年齡和stop時間節點上,構造一個時間互動變量
之後：使用TimeVaryingFitter進行分析

from lifelines.utils import to_episodic_format

# the time_gaps parameter specifies how large or small you want the periods to be.
rossi_long = to_episodic_format(rossi, duration_col='week', event_col='arrest', time_gaps=1.)
rossi_long.head(25)

rossi_long['time*age'] = rossi_long['age'] * rossi_long['stop']

from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()

ctv.fit(rossi_long,
        id_col='id',
        event_col='arrest',
        start_col='start',
        stop_col='stop',
        strata=['wexp'])

ctv.print_summary(3, model="age * time interaction")

ctv.plot()

複制

先來看一下

to_episodic_format

是一個什麼操作：

之前某一條記錄week = 20，那麼現在拆分成20條記錄，逐條記錄，從start -> stop，其餘協變量在20記錄中保持不變。

然後這裡

rossi_long['time*age'] = rossi_long['age'] * rossi_long['stop']

是一步非常有意思的操作，就是将age構造成随時間變化的

time-varying variable

了

最後使用

CoxTimeVaryingFitter

進行cox回歸檢驗

來看一下scaled Schoenfeld 殘差圖：

ctv.plot()

複制

在以上年齡的Schoenfeld殘差圖中，我們可以看到，時間值越大，風險為正，則有輕微的負效應。

這在coxtimevariyingfitter的輸出中得到了證明:我們看到

時間*年齡

的系數是-0.005。

3.4 PH檢驗（proportional hazard assumption）是否一定需要？

參考：Do I need to care about the proportional hazard assumption?

是否需要時變檢驗？以下情況可以忽略：

如果你在意生存預測，那麼風險的假定可以忽略
足夠多的樣本量，存在一定假定問題是正常的現象
有合理的理由假設所有的資料集都會違反比例風險假設【參考： Stensrud & Hernán’s “Why Test for Proportional Hazards?】

4 非比例風險模型—— 參數估計模型 Weibull AFT model

AFT(accelerated failure time model)加速失效模型

AFT模型可見文檔：the-log-normal-and-log-logistic-aft-models

模型詳細解釋：WeibullAFTFitter

與比例COX的例子一樣的資料集，來看一下Weibull AFT model：

from lifelines.datasets import load_regression_dataset
from lifelines import CoxPHFitter

regression_dataset = load_regression_dataset()
print(regression_dataset.head())
print(regression_dataset['E'].value_counts())

# 模型二初始化 —— Weibull accelerated failure time model
from lifelines import WeibullAFTFitter

wft = WeibullAFTFitter()
wft.fit(regression_dataset, 'T', event_col='E')
wft.print_summary()
wft.plot()

複制

結果展示：

"""
<lifelines.WeibullAFTFitter: fitted with 200 total observations, 11 right-censored observations>
             duration col = 'T'
                event col = 'E'
   number of observations = 200
number of events observed = 189
           log-likelihood = -504.48
         time fit was run = 2020-06-21 12:27:05 UTC

---
                     coef  exp(coef)   se(coef)   coef lower 95%   coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%
lambda_ var1        -0.08       0.92       0.02            -0.13            -0.04                 0.88                 0.97
        var2        -0.02       0.98       0.03            -0.07             0.04                 0.93                 1.04
        var3        -0.08       0.92       0.02            -0.13            -0.03                 0.88                 0.97
        Intercept   2.53      12.57       0.05             2.43             2.63                11.41                13.85
rho_    Intercept   1.09       2.98       0.05             0.99             1.20                 2.68                 3.32

                       z      p   -log2(p)
lambda_ var1       -3.45 <0.005      10.78
        var2       -0.56   0.57       0.80
        var3       -3.33 <0.005      10.15
        Intercept 51.12 <0.005        inf
rho_    Intercept 20.12 <0.005     296.66
---
Concordance = 0.58
AIC = 1018.97
log-likelihood ratio test = 19.73 on 3 df
-log2(p) of ll-ratio test = 12.34
"""

複制

最後畫圖：

5 lifelines相關工具函數

5.1 壽命表制作函數——survival_table_from_events

我們在1.1 小節看到KM的資料格式為：

from lifelines.datasets import load_waltons
from lifelines import KaplanMeierFitter
from lifelines.utils import median_survival_times
import matplotlib.pyplot as plt

# 資料載入
df = load_waltons()
print(df.head(),'\n')
print(df['T'].min(), df['T'].max(),'\n')
print(df['E'].value_counts(),'\n')
print(df['group'].value_counts(),'\n')

複制

這裡并非嚴格的壽命表的資料樣式，這裡lifelines可以有函數進行轉化：

# Perhaps you are interested in viewing the survival table given some durations and censoring vectors.
from lifelines.utils import survival_table_from_events
df = load_waltons()
T = df['T']
E = df['E']

table = survival_table_from_events(T, E)
print(table.head())

複制

進行

survival_table_from_events

可以變為壽命表的樣式。

5.2 KM曲線資料樣式制作函數

來看lifelines專門制作KM曲線資料集的函數，也就是剛剛貼的：

from lifelines.utils import datetimes_to_durations

start_dates = ['2015-01-01', '2015-04-01', '2014-04-05']
end_dates = ['2016-02-02', None, '2014-05-06']

T, E = datetimes_to_durations(start_dates, end_dates, freq="D")

>>> T # array([ 397., 1414.,   31.])
>>> E # array([ True, False,  True])

複制

比如這裡有三個樣本，開始時間為：

['2015-01-01', '2015-04-01', '2014-04-05']

，

其中有兩個在觀測期結束了，還有一個沒有結束，那麼就是

['2016-02-02', None, '2014-05-06']

最後給出的資料，T就是

結束 - 開始

的

durations

E就是最終事件是否發生的

event

5.3 COX 時變回歸模型中的資料樣式制作函數

5.3.1 第一種：add_covariate_to_timeline

這裡其實在3.2.1 已經提及了一種，就複制一下：

import pandas as pd
from lifelines.utils import to_long_format
from lifelines.utils import add_covariate_to_timeline

base_df = pd.DataFrame([
    {'id': 1, 'duration': 10, 'event': True, 'var1': 0.1},
    {'id': 2, 'duration': 12, 'event': True, 'var1': 0.5}
])
base_df = to_long_format(base_df, duration_col="duration")

cv = pd.DataFrame([
    {'id': 1, 'time': 0, 'var2': 1.4},
    {'id': 1, 'time': 4, 'var2': 1.2},
    {'id': 1, 'time': 8, 'var2': 1.5},
    {'id': 2, 'time': 0, 'var2': 1.6},
])
base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="event", delay=5)\
            .fillna(0)

print(base_df)

複制

base_df

相當于沒有時變協變量的情況下的資料樣本，

cv

是有時變協變量的資料樣本，

最後兩者拼接在一些

5.3.2 第二種：to_episodic_format

第二種就是在3.3.7中提及的發現某一連續變量age需要改成時變，然後通過

to_episodic_format

變為逐條。

同時，需要

age變量*stoptime

構造成為随時間變化的變量

from lifelines.utils import to_episodic_format

# the time_gaps parameter specifies how large or small you want the periods to be.
rossi_long = to_episodic_format(rossi, duration_col='week', event_col='arrest', time_gaps=1.)
rossi_long.head(25)

rossi_long['time*age'] = rossi_long['age'] * rossi_long['stop']

from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()

ctv.fit(rossi_long,
        id_col='id',
        event_col='arrest',
        start_col='start',
        stop_col='stop',
        strata=['wexp'])

ctv.print_summary(3, model="age * time interaction")

ctv.plot()

複制

6 累計風險函數：Cumulative hazard function

The cumulative hazard has less obvious understanding than the survival functions, but the hazard functions is the basis of more advanced techniques in survival analysis. Recall that we are estimating cumulative hazard functions, H(t). (Why? The sum of estimates is much more stable than the point-wise estimates.) Thus we know the rate of change of this curve is an estimate of the hazard function.

複制

6.1 非參數估計：Nelson-Aalen 累計風險函數圖

參考：生存分析論文

在有删失的情況下，可以根據累積死亡率與生存函數的關系，來估計累計風險函數圖

也可以參考之前文章中的：2.3 生存/風險函數兩者之間關系

其與與以KM估計式為基礎的估計式相比，具有更好的小樣本性質，由Nelson提出然後Aalen加以改進

主要作用有：

選擇事件發生時間的參數模型方面的應用
其二是為死亡率h(t)提供粗估計

在lifelines中的實作：

from lifelines.datasets import load_dd

data = load_dd()
data.head()

T = data["duration"]
E = data["observed"]

from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter()

naf.fit(T,event_observed=E)


print(naf.cumulative_hazard_.head())
naf.plot_cumulative_hazard()

複制

分組來看不同group的累計風險機率：

dem = (data["democracy"] == "Democracy")

naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_cumulative_hazard(loc=slice(0, 20))

naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot_cumulative_hazard(ax=ax, loc=slice(0, 20))

plt.title("Cumulative hazard function of different global regimes");

複制

6.2 參數估計：其他累計風險參數估計模型

與1.5一緻，累計風險函數圖和生存函數的估計，都有非參數以及參數估計的兩種：

參考：Other parametric models: Exponential, Log-Logistic, Log-Normal and Splines

from lifelines import (WeibullFitter, ExponentialFitter,
LogNormalFitter, LogLogisticFitter, NelsonAalenFitter,
PiecewiseExponentialFitter, GeneralizedGammaFitter, SplineFitter)

from lifelines.datasets import load_waltons
data = load_waltons()

fig, axes = plt.subplots(3, 3, figsize=(10, 7.5))

T = data['T']
E = data['E']

wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
naf = NelsonAalenFitter().fit(T, E, label='NelsonAalenFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
gg = GeneralizedGammaFitter().fit(T, E, label='GeneralizedGammaFitter')
spf = SplineFitter([6, 20, 40, 75]).fit(T, E, label='SplineFitter')

wbf.plot_cumulative_hazard(ax=axes[0][0])
exf.plot_cumulative_hazard(ax=axes[0][1])
lnf.plot_cumulative_hazard(ax=axes[0][2])
naf.plot_cumulative_hazard(ax=axes[1][0])
llf.plot_cumulative_hazard(ax=axes[1][1])
pwf.plot_cumulative_hazard(ax=axes[1][2])
gg.plot_cumulative_hazard(ax=axes[2][0])
spf.plot_cumulative_hazard(ax=axes[2][1])

複制

7 如何選擇最佳的參數估計模型

從第一節和第六節來看，生存函數以及累計風險函數都有參數估計以及非參數估計，

參數估計也有非常多的分布可以選擇，那麼如何哪一種是最好的呢？

lifelines 提供了兩種：

qq-plots QQ圖： Selecting a parametric model using QQ plots
AIC對比圖：Selecting a parametric model using AIC

7.1 QQ圖

from lifelines import *

from lifelines.plotting import qq_plot

# generate some fake log-normal data
N = 1000
T_actual = np.exp(np.random.randn(N))
C = np.exp(np.random.randn(N))
E = T_actual < C
T = np.minimum(T_actual, C)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes = axes.reshape(4,)

for i, model in enumerate([WeibullFitter(), LogNormalFitter(), LogLogisticFitter(), ExponentialFitter()]):
    model.fit(T, E)
    qq_plot(model, ax=axes[i])

複制

這個圖形測試可以用來使模型失效。

例如，在上圖中，我們可以看到隻有對數正态(log-normal)參數模型是合适的(我們預計尾部會有偏差，但不會太多)。

7.2 AIC圖

from lifelines.utils import find_best_parametric_model
from lifelines.datasets import load_lymph_node

T = load_lymph_node()['rectime']
E = load_lymph_node()['censrec']

best_model, best_aic_ = find_best_parametric_model(T, E, scoring_method="AIC")

print(best_model)
# <lifelines.SplineFitter:"Spline_estimate", fitted with 686 total observations, 387 right-censored observations>

best_model.plot_hazard()

複制

這裡輸出結果的時候會有提示，也就是比較好的是

SplineFitter

估計：

# <lifelines.SplineFitter:"Spline_estimate", fitted with 686 total observations, 387 right-censored observations>

複制

8 模型一緻性、準确率以及校準

8.1 模型一緻性

簡單地說，一緻性是對模型内部一緻性的評估 —— 如果它說某個特征增加了風險，那麼具有該特征的觀測結果應該風險會高。如果它們是這樣的，那麼一緻性會上升，如果不是，那麼一緻性會下降。

from lifelines.datasets import load_regression_dataset
from lifelines import CoxPHFitter


regression_dataset = load_regression_dataset()

print(regression_dataset.head())
print(regression_dataset['E'].value_counts())

# 模型一初始化 —— Cox proportional hazard model 
cph = CoxPHFitter()
cph.fit(regression_dataset, 'T', event_col='E')
cph.print_summary()

複制

可以從summary看到，

duration col = 'week'
                event col = 'arrest'
      baseline estimation = breslow
   number of observations = 432
number of events observed = 114
   partial log-likelihood = -658.75
         time fit was run = 2021-08-01 03:07:18 UTC

---
            coef  exp(coef)   se(coef)   coef lower 95%   coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%
covariate                                                                                                         
fin        -0.38       0.68       0.19            -0.75            -0.00                 0.47                 1.00
age        -0.06       0.94       0.02            -0.10            -0.01                 0.90                 0.99
race        0.31       1.37       0.31            -0.29             0.92                 0.75                 2.50
wexp       -0.15       0.86       0.21            -0.57             0.27                 0.57                 1.30
mar        -0.43       0.65       0.38            -1.18             0.31                 0.31                 1.37
paro       -0.08       0.92       0.20            -0.47             0.30                 0.63                 1.35
prio        0.09       1.10       0.03             0.04             0.15                 1.04                 1.16

              z      p   -log2(p)
covariate                        
fin       -1.98   0.05       4.40
age       -2.61   0.01       6.79
race       1.02   0.31       1.70
wexp      -0.71   0.48       1.06
mar       -1.14   0.26       1.97
paro      -0.43   0.66       0.59
prio       3.19 <0.005       9.48
---
Concordance = 0.64
Partial AIC = 1331.50
log-likelihood ratio test = 33.27 on 7 df
-log2(p) of ll-ratio test = 15.37

複制

Concordance = 0.64 為一緻性檢驗，該模型并不高。

8.2 模型校準性

在文章使用python來進行使用者流失預測的實戰提到了模型校準。

我們知道我們的Cox模型是一個很好的模型，但這在實際中意味着什麼呢？它有多精确？

當你從機率的角度看待像流失(或欺詐或盜竊)這樣的事件時，檢查校準性比檢查準确性更重要。校準性是模型獲得機率随時間變化的傾向。

就像這樣，一個天氣預報服務是經過校準的話，如果在所有的時間裡它說有40%的可能性下雨，實際上就有40%的可能性下雨。

在Scikit-Learn中，我們可以使用calibration_curve方法從機率預測和資料集的真實值中獲得這個值：

自己舉一個例子：

from lifelines import datasets, CoxPHFitter
from numpy import dot, einsum, log, exp, zeros, arange, multiply, ndarray
import numpy as np
rossi = datasets.load_rossi()

cph = CoxPHFitter().fit(rossi, 'week', 'arrest')

from sklearn.calibration import calibration_curve 
from matplotlib import pyplot as plt 
plt.figure(figsize=(10, 10))
 
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2) 
ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")

probs = 1-np.array(cph.predict_survival_function(rossi).loc[13])

actual = rossi['arrest'] 

fraction_of_positives, mean_predicted_value = calibration_curve(actual, probs, n_bins=10, normalize=False) 

ax1.plot(mean_predicted_value, fraction_of_positives, "s-", label="%s" % ("CoxPH",)) 

ax1.set_ylabel("Fraction of positives") 
ax1.set_ylim([-0.05, 1.05]) 
ax1.legend(loc="lower right") 
ax1.set_title('Calibration plots (reliability curve)')

複制

截取第13天（

cph.predict_survival_function(rossi).loc[13]

）生存機率與真實機率（rossi[‘arrest’] = [0,1,1,0,0,1]）進行對比：

這裡整體曲線不接近虛線，效果不好；

可以看到藍色曲線在标準線（虛線）之上，說明模型一直在低估風險(<50%的流失率)；

如果在虛線一下，就說明高估了風險。

比如這個圖：

它非常接近對角線，代表完美的校準。然而，我們的模型似乎在低端低估了風險(<50%的流失率)和在高端輕微高估了風險(>50%的客戶流失率)。

8.3 精準校驗值：brier_score_loss

為了從數字上了解距離完美校準有多遠，我們可以使用Scikit-Learn包中的brier_score_loss：

from sklearn.metrics import brier_score_loss
import pandas as pd
loss_dict = {} 
event_times = 29 #cph.predict_survival_function(rossi).shape[0]
for i in range(1,event_times): 
    score = brier_score_loss( 
        rossi['arrest'], 1 -    
        np.array(cph.predict_survival_function(rossi).loc[i]),   
        pos_label=1 ) 
    loss_dict[i] = [score] 
    
loss_df = pd.DataFrame(loss_dict).T 

fig, ax = plt.subplots() 
ax.plot(loss_df.index, list(loss_df[0])) 
ax.set(xlabel='Prediction Time', ylabel='Calibration Loss', title='Cox PH Model Calibration Loss / Time') 
ax.grid() 

plt.show()

複制

截取了[1-29]天的結果進行校準值判定，

一開始校準的非常差，随着時間遞增校準恢複良性水準。

後續還可以讓我們為讓客戶做出改變所帶來的預期投資回報建立一個上界和下界：

loss_df.columns = ['loss'] 
temp_df = actions.reset_index().set_index('PaymentMethod_Credit card (automatic)').join(loss_df) 
temp_df = temp_df.set_index('index') 
actions['CreditCard Lower'] = temp_df['CreditCard Diff'] - (temp_df['loss'] * temp_df['CreditCard Diff']) 
actions['CreditCard Upper'] = temp_df['CreditCard Diff'] + (temp_df['loss'] * temp_df['CreditCard Diff']) 
temp_df = actions.reset_index().set_index('PaymentMethod_Bank transfer (automatic)').join(loss_df) 
temp_df = temp_df.set_index('index') 
actions['BankTransfer Lower'] = temp_df['BankTransfer Diff'] - (.5 * temp_df['loss'] * temp_df['BankTransfer Diff']) actions['BankTransfer Upper'] = temp_df['BankTransfer Diff'] + (.5 * temp_df['loss'] * temp_df['BankTransfer Diff']) 

temp_df = actions.reset_index().set_index('Contract_One year').join(loss_df) 
temp_df = temp_df.set_index('index') 
actions['1yrContract Lower'] = temp_df['1yrContract Diff'] - (.5 * temp_df['loss'] * temp_df['1yrContract Diff']) actions['1yrContract Upper'] = temp_df['1yrContract Diff'] + (.5 * temp_df['loss'] * temp_df['1yrContract Diff']) 

temp_df = actions.reset_index().set_index('Contract_Two year').join(loss_df) 
temp_df = temp_df.set_index('index') 
actions['2yrContract Lower'] = temp_df['2yrContract Diff'] - (.5 * temp_df['loss'] * temp_df['2yrContract Diff']) actions['2yrContract Upper'] = temp_df['2yrContract Diff'] + (.5 * temp_df['loss'] * temp_df['2yrContract Diff'])

複制

這裡我們對之前的值打了個折來考慮校準的不确定性。我們檢視模型在每個時間段的校準情況，我們預測會有一個特定的提升，并産生和建立一個投資回報的上下界。

8 其他

學不廢了，這個庫還有很多高階的可以學習的，本篇就到這了。。

還有：

Piecewise exponential models and creating custom models

分段指數模型以及可以自定義一些生存函數
Time-lagged conversion rates and cure models

譬如犯罪，有的犯罪的人第二次基本不會再犯了，那麼就是長期生存者or治愈者，在充分的随訪時間中，這些人不過發生某一特定事件，有非常長的截尾時間；

如果繼續使用參數估計會有造成問題：

There’s a serious fault in using parametric models for these types of problems that non-parametric models don’t have. The most common parametric models like Weibull, Log-Normal, etc. all have strictly increasing cumulative hazard functions, which means the corresponding survival function will always converge to 0.

複制