
The Multivariate Linear Regression derivative problem that puzzled me for so long

Suppose we have $(\vec{x}^{(i)}, y^{(i)})$ with sample size $N$, where $\vec{x}^{(i)} \in \mathbb{R}^D$.

$$\hat{y} = \sum_{j=1}^D \beta_j x_j$$

$$\mathcal{L}(a, b) = \frac{1}{2}(a - b)^2$$

$$\begin{aligned} \varepsilon(\beta_0, \beta_1, \ldots, \beta_D) &= \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \\ &= \frac{1}{2N} \sum_{i=1}^{N}\big(\hat{y}^{(i)} - y^{(i)}\big)^2 \\ &= \frac{1}{2N} \sum_{i=1}^N \Big(\sum_{j=1}^D \beta_j x_j^{(i)} - y^{(i)}\Big)^2 \end{aligned}$$

Take the derivative with respect to $\beta_j$:

$$\begin{aligned} \frac{\partial \varepsilon}{\partial \beta_j} &= \frac{1}{N} \sum_{i=1}^N x_j^{(i)}\big(\hat{y}^{(i)} - y^{(i)}\big) \\ &= \frac{1}{N} \sum_{i=1}^N x_j^{(i)}\Big(\sum_{j'=1}^D \beta_{j'} x_{j'}^{(i)} - y^{(i)}\Big) \quad \text{(note: this is exactly the step that was confusing)} \\ &= \frac{1}{N} \sum_{j'=1}^D \Big(\sum_{i=1}^{N} x_j^{(i)} x_{j'}^{(i)}\Big)\beta_{j'} - \frac{1}{N}\sum_{i=1}^N x_j^{(i)} y^{(i)} \end{aligned}$$
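The confusing step above is just swapping the order of the two finite sums: $\sum_i x_j^{(i)} \sum_{j'} \beta_{j'} x_{j'}^{(i)} = \sum_{j'} \big(\sum_i x_j^{(i)} x_{j'}^{(i)}\big)\beta_{j'}$. A quick numerical check on a small hypothetical example (the sizes $N=5$, $D=3$ are arbitrary):

```python
import numpy as np

# Verify the sum-reordering identity from the derivation:
# sum_i x_j^(i) * (sum_j' beta_j' x_j'^(i))
#   == sum_j' (sum_i x_j^(i) x_j'^(i)) * beta_j'
rng = np.random.default_rng(0)
N, D = 5, 3                      # hypothetical small sizes
X = rng.normal(size=(N, D))      # rows are samples x^(i)
beta = rng.normal(size=D)
j = 1                            # pick any feature index

lhs = sum(X[i, j] * sum(beta[jp] * X[i, jp] for jp in range(D))
          for i in range(N))
rhs = sum(sum(X[i, j] * X[i, jp] for i in range(N)) * beta[jp]
          for jp in range(D))
assert np.isclose(lhs, rhs)      # the two orderings agree
```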

Let $A_{jj'} = \frac{1}{N} \sum_{i=1}^N x_j^{(i)} x_{j'}^{(i)}$ (so $A \in \mathbb{R}^{D \times D}$) and $c_j = \frac{1}{N}\sum_{i=1}^N x_j^{(i)} y^{(i)}$ (so $c \in \mathbb{R}^{D}$). Then:

$$\begin{aligned} \frac{\partial \varepsilon}{\partial \beta_j} &= \frac{1}{N} \sum_{j'=1}^D \Big(\sum_{i=1}^{N} x_j^{(i)} x_{j'}^{(i)}\Big)\beta_{j'} - \frac{1}{N}\sum_{i=1}^N x_j^{(i)} y^{(i)} \\ &= \sum_{j'=1}^D A_{jj'}\beta_{j'} - c_j \stackrel{\text{set}}{=} 0 \end{aligned}$$

(Note the factor $\frac{1}{N}$ is already absorbed into the definitions of $A_{jj'}$ and $c_j$.)
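The per-coordinate derivative can be sanity-checked against a central finite difference of the risk; this is a minimal sketch with arbitrary synthetic data:

```python
import numpy as np

# Finite-difference check of d(eps)/d(beta_j) = (1/N) sum_i x_j^(i)(yhat^(i) - y^(i)).
rng = np.random.default_rng(2)
N, D = 8, 3                      # hypothetical small sizes
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
beta = rng.normal(size=D)

def risk(b):
    """Empirical risk eps(beta) = (1/2N) ||X b - y||^2."""
    r = X @ b - y
    return (r @ r) / (2 * N)

# Analytic gradient: all D partial derivatives at once.
grad_analytic = X.T @ (X @ beta - y) / N

# Central finite differences, one coordinate at a time.
h = 1e-6
grad_fd = np.array([
    (risk(beta + h * np.eye(D)[j]) - risk(beta - h * np.eye(D)[j])) / (2 * h)
    for j in range(D)
])
assert np.allclose(grad_analytic, grad_fd, atol=1e-6)
```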

Let $X \in \mathbb{R}^{N \times D}$, $A = \frac{1}{N}X^T X$, and $c = \frac{1}{N} X^T y$, where

$$X = \begin{bmatrix} x^{(1)T} \\ x^{(2)T} \\ \vdots \\ x^{(N)T} \end{bmatrix} \tag{3}$$

Stacking the $D$ scalar equations into vector form (with $A$ and $c$ as defined above):

$$\nabla_\beta \varepsilon = A\beta - c \stackrel{\text{set}}{=} 0$$

$$\hat{\beta} = A^{-1}c = (X^T X)^{-1} X^T y$$
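The closed-form solution is easy to try out in NumPy. This is a minimal sketch on synthetic data (the names `beta_true` and the noise level are illustrative assumptions); using `solve` rather than an explicit inverse is numerically safer:

```python
import numpy as np

# Sketch of beta_hat = (X^T X)^{-1} X^T y on synthetic data.
rng = np.random.default_rng(42)
N, D = 100, 3
X = rng.normal(size=(N, D))
beta_true = np.array([1.0, -2.0, 0.5])          # hypothetical ground truth
y = X @ beta_true + 0.01 * rng.normal(size=N)   # small Gaussian noise

A = X.T @ X / N                  # A = (1/N) X^T X
c = X.T @ y / N                  # c = (1/N) X^T y
beta_hat = np.linalg.solve(A, c) # solves A beta = c; avoids forming A^{-1}

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Note that the $\frac{1}{N}$ factors in $A$ and $c$ cancel, which is why $\hat{\beta} = (X^TX)^{-1}X^Ty$ has no $N$ in it.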

Finally solved it!

A simpler approach is to rewrite the risk directly in matrix form:

$$\begin{aligned} \varepsilon(\beta_0, \beta_1, \ldots, \beta_D) &= \frac{1}{2N} \sum_{i=1}^N \Big(\sum_{j=1}^D \beta_j x_j^{(i)} - y^{(i)}\Big)^2 \\ &= \frac{1}{2N}\,(X\beta - y)^T (X\beta - y) \end{aligned}$$
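The equality of the summation form and the matrix form can also be checked numerically on arbitrary data:

```python
import numpy as np

# Check (1/2N) sum_i (x^(i)·beta - y^(i))^2 == (1/2N) (X beta - y)^T (X beta - y).
rng = np.random.default_rng(1)
N, D = 6, 4                      # hypothetical small sizes
X = rng.normal(size=(N, D))
beta = rng.normal(size=D)
y = rng.normal(size=N)

risk_sum = sum((X[i] @ beta - y[i]) ** 2 for i in range(N)) / (2 * N)
r = X @ beta - y
risk_matrix = (r @ r) / (2 * N)  # squared Euclidean norm of the residual
assert np.isclose(risk_sum, risk_matrix)
```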

Setting the gradient of this quadratic form to zero gives the same least-squares estimate $\hat{\beta} = (X^TX)^{-1}X^Ty$ (which coincides with the MLE under a Gaussian noise assumption).

This is only an estimate from a single training set, but what we really want is the true error, or prediction error, which can be defined as:

$$\begin{aligned} \varepsilon_{\text{true}}(\beta_0, \beta_1, \ldots, \beta_D) &= \frac{1}{2}\, \mathbb{E}\Big(\sum_{j=1}^D \beta_j \mathbf{x}_j - \mathbf{y}\Big)^2 \\ &= \frac{1}{2} \int \Big(\sum_{j=1}^D \beta_j \mathbf{x}_j - \mathbf{y}\Big)^2 p(\mathbf{x}, \mathbf{y})\, d\mathbf{x}\, d\mathbf{y} \end{aligned}$$

where the expectation is taken over the joint distribution $p(\mathbf{x}, \mathbf{y})$ of a fresh test point.
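In practice $p(\mathbf{x}, \mathbf{y})$ is unknown, but when we pretend to know it we can approximate the integral by Monte Carlo. This sketch assumes a hypothetical generating process (standard normal $\mathbf{x}$, unit Gaussian noise) and hypothetical fitted coefficients `beta_hat`:

```python
import numpy as np

# Monte Carlo approximation of the true risk (1/2) E[(x·beta_hat - y)^2],
# under an ASSUMED data distribution chosen for illustration.
rng = np.random.default_rng(7)
D = 3
beta_true = np.array([1.0, -2.0, 0.5])  # assumed true coefficients
beta_hat = np.array([1.1, -1.9, 0.4])   # some fitted coefficients (hypothetical)

M = 200_000                              # number of Monte Carlo draws
X = rng.normal(size=(M, D))              # assumed p(x): standard normal
y = X @ beta_true + rng.normal(size=M)   # assumed noise: N(0, 1)

true_risk_mc = 0.5 * np.mean((X @ beta_hat - y) ** 2)
# Analytically this converges to 0.5 * (||beta_hat - beta_true||^2 + sigma^2)
# = 0.5 * (0.03 + 1) = 0.515 for this setup.
```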

If you want to read more about bias-variance in the linear regression model, see the following:

https://courses.cs.washington.edu/courses/cse546/12wi/slides/cse546wi12LinearRegression.pdf
