
Python notes: what is the essential difference between sklearn's r2_score and explained_variance_score?

python version 3.8.6

numpy version 1.19.2

sklearn version 0.23.2

Q: I know that r2_score represents the percentage of the total variance that the model explains. But how does explained_variance_score differ from it?

A: From the perspective of the formulas: when the mean of the residuals is 0, the two are identical. Which one to use depends on whether you assume the residuals have zero mean.

Answered by CT Zhu:

1. First, an example where the residual mean is not 0:
import numpy as np
from sklearn import metrics

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(metrics.explained_variance_score(y_true, y_pred))
print(metrics.r2_score(y_true, y_pred))

# Output:
0.9571734475374732
0.9486081370449679

# Note: the residual mean here is not 0
print((np.array(y_true) - np.array(y_pred)).mean())
# Output:
-0.25
           
  • explained_variance_score and r² are actually:

$explained\_variance\_score = 1 - \frac{Variance_{(Y_{true}-Y_{pred})}}{Variance_{Y_{true}}}$

$r2 = 1 - \frac{\sum SquaredResiduals / N}{Variance_{Y_{true}}} = 1 - \frac{\sum SquaredResiduals}{N \cdot Variance_{Y_{true}}}$

The key point: $Variance_{(Y_{true}-Y_{pred})} = \frac{\sum SquaredResiduals}{N} - MeanError^2$, i.e. the variance of the residuals is the mean squared residual minus the squared mean residual, so the two scores differ exactly when the mean residual is not 0.
# Implementing the example above in numpy:
explained_variance_score = 1 - np.var(np.array(y_true) - np.array(y_pred)) / np.var(y_true)
r2 = 1 - ((np.array(y_true) - np.array(y_pred)) ** 2).sum() / (4 * np.array(y_true).var())

print(explained_variance_score)
print(r2)

# Output:
0.9571734475374732
0.9486081370449679
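A minimal numpy check of this decomposition, reusing the arrays from the example above (the observation about the gap between the two scores is an added note, not part of the original answer):

```python
import numpy as np
from sklearn import metrics

# values from the example above
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])
e = y_true - y_pred
N = len(e)

# Var(e) = mean squared residual minus squared mean residual
print(np.var(e))                            # 0.3125
print((e ** 2).sum() / N - e.mean() ** 2)   # 0.3125

# Consequently the gap between the two scores is exactly mean(e)^2 / Var(y_true),
# which also shows explained_variance_score >= r2_score in general.
gap = e.mean() ** 2 / np.var(y_true)
evs = metrics.explained_variance_score(y_true, y_pred)
r2 = metrics.r2_score(y_true, y_pred)
print(np.isclose(gap, evs - r2))            # True
```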
           
1) Another way to read the r2 denominator 4 * np.array(y_true).var(): from R2 = 1 - Sum_of_Squares_for_Error / Sum_of_Squares_for_Total, the denominator should be the total sum of squares SST, i.e. 4 * np.array(y_true).var() = ((y - y.mean())**2).sum(), where y stands for np.array(y_true).

2) Equivalently: explained_variance_score = 1 - np.cov(np.array(y_pred) - np.array(y_true)) / np.cov(y_true)
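Since np.cov of a 1-D array is just the sample variance, the (N-1) factors cancel in the ratio and the cov-based form reproduces the score; a quick check with the first example's data:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# np.cov of a 1-D array is the sample variance (ddof=1); the (N - 1)
# denominators cancel in the ratio, so this matches the np.var version.
evs_cov = 1 - np.cov(y_pred - y_true) / np.cov(y_true)
evs_var = 1 - np.var(y_pred - y_true) / np.var(y_true)

print(float(evs_cov))  # ~0.9571734475374732
print(float(evs_var))  # same value
print(metrics.explained_variance_score(y_true, y_pred))
```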

2. Next, an example where the residual mean is 0:
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 7]

print((np.array(y_true) - np.array(y_pred)).mean())
# Output:
0.0

print(metrics.explained_variance_score(y_true, y_pred))
print(metrics.r2_score(y_true, y_pred))
# Output:
0.9828693790149893
0.9828693790149893
           
Note: for one-dimensional data, the only difference between covariance (cov) and variance (var) is the degrees of freedom; in other words, the former is the sample variance and the latter the population variance. For example:
a = [1, 2, 3, 45]
print(np.cov(a))
print(np.var(a) * len(a) / (len(a) - 1))   # i.e. cov = squared deviations / (n - 1), var = squared deviations / n
# Output:
462.91666666666663
462.9166666666667
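np.var also exposes this degrees-of-freedom correction directly through its ddof parameter, so the relationship can be checked without the manual n/(n-1) factor (a minimal sketch):

```python
import numpy as np

a = [1, 2, 3, 45]

# For 1-D data, np.cov(a) is the sample variance, i.e. np.var with ddof=1;
# plain np.var(a) divides by n instead of (n - 1).
print(float(np.cov(a)))     # sample variance, ~462.9167
print(np.var(a, ddof=1))    # same value
print(np.var(a))            # population variance, 347.1875
```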
           

From the perspective of meaning (Answered by Yahya):

  • First, R², the coefficient of determination:

– In formula terms: Variance_{y_true} × R² = Variance_{y_pred}; clearly, the closer R² is to 1, the better the fit.

– The meaning of R², viewed from the least-squares (squared-error) standpoint, is the proportion of the variance of the actual y values that is explained by the predicted y values.
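For an ordinary least-squares fit with an intercept, evaluated on its own training data, this identity is exact: R² = Var(y_pred) / Var(y_true). A small sketch on synthetic data (the data below is an illustrative assumption, not from the original post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# synthetic 1-D regression problem (illustration only)
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X.ravel() + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# For OLS with intercept, on the training set: R^2 = Var(y_hat) / Var(y)
print(r2_score(y, y_hat))
print(np.var(y_hat) / np.var(y))
```

Note this equality holds for OLS on the training data; for an arbitrary predictor, or on held-out data, R² is not simply the variance ratio.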