The relationship between matrix differentials and matrix derivatives

The derivative of a scalar $f$ with respect to a matrix $X$ is defined as $\frac{\partial f}{\partial X}=\left[\frac{\partial f}{\partial X_{ij}}\right]$: $f$ is differentiated with respect to each element of $X$, and the results are arranged into a matrix of the same size as $X$.

In single-variable calculus, the derivative and the differential are related by $df=f'(x)dx$. In multivariable calculus, the gradient (the derivative of a scalar with respect to a vector) is likewise tied to the differential: $$df=\sum_{i=1}^n \frac{\partial f}{\partial x_i}dx_i=\left[\frac{\partial f}{\partial \boldsymbol x}\right]^Td\boldsymbol{x}$$ Here the first equality is the total-differential formula, and the second expresses the link between gradient and differential:

The total differential $df$ is the inner product of the gradient vector $\frac{\partial f}{\partial \boldsymbol{x}}$ ($n\times1$) and the differential vector $d\boldsymbol{x}$ ($n\times1$).

Inspired by this, we connect the matrix derivative with the matrix differential:
$$df=\sum_{i=1}^m\sum_{j=1}^n\frac{\partial f}{\partial X_{ij}}dX_{ij}=\operatorname{tr}\left(\left[\frac{\partial f}{\partial X}\right]^TdX\right)$$ where $\operatorname{tr}$ denotes the trace, the sum of the diagonal elements of a square matrix. It has the property that for matrices $A,B$ of the same size, $\operatorname{tr}(A^TB)=\sum_{i,j}A_{ij}B_{ij}$; that is, $\operatorname{tr}(A^TB)$ is the inner product of $A$ and $B$.
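As a quick sanity check (not part of the original derivation), the identity $\operatorname{tr}(A^TB)=\sum_{i,j}A_{ij}B_{ij}$ is easy to confirm numerically; a minimal NumPy sketch with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

# tr(A^T B) equals the elementwise inner product sum_ij A_ij B_ij
lhs = np.trace(A.T @ B)
rhs = np.sum(A * B)
assert np.isclose(lhs, rhs)
```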

As with the gradient, the first equality is the total-differential formula, and the second expresses the link between matrix derivative and differential: the total differential $df$ is the inner product of the derivative $\frac{\partial f}{\partial X}$ ($m\times n$) and the differential matrix $dX$ ($m\times n$).

Rules for matrix differentials

When we meet a fairly complicated one-variable function such as $f=\log(2+\sin x)e^{\sqrt{x}}$, how do we differentiate it? Usually not by taking limits from the definition; instead we first establish differentiation rules for elementary functions, arithmetic operations, and composition, and then apply those rules. Accordingly, let us establish the common rules for matrix differentials:

  • Addition/subtraction: $d(X\pm Y)=dX\pm dY$
  • Matrix product: $d(XY)=(dX)Y+X\,dY$
  • Transpose: $d(X^T)=(dX)^T$
  • Trace: $d\operatorname{tr}(X)=\operatorname{tr}(dX)$
  • Inverse: $dX^{-1}=-X^{-1}(dX)X^{-1}$; this can be proved by differentiating both sides of $XX^{-1}=I$.
  • Determinant: $d\lvert X\rvert=\operatorname{tr}(X^{\#}dX)$, where $X^{\#}$ is the adjugate of $X$; when $X$ is invertible this can also be written $d\lvert X\rvert=\lvert X\rvert\operatorname{tr}(X^{-1}dX)$. It can be proved via the Laplace expansion; see Zhang Xianda, Matrix Analysis and Applications, p. 279.
  • Elementwise product: $d(X\odot Y)=dX\odot Y+X\odot dY$, where $\odot$ denotes the elementwise product of matrices $X,Y$ of the same size.
  • Elementwise function: $d\sigma(X)=\sigma'(X)\odot dX$, where $\sigma(X)=\left[\sigma(X_{ij})\right]$ applies a scalar function $\sigma$ elementwise and $\sigma'(X)=\left[\sigma'(X_{ij})\right]$ is the elementwise derivative. For example,
    $$X=\begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix},\quad d\sin(X)=\begin{bmatrix} \cos X_{11}\,dX_{11} & \cos X_{12}\,dX_{12} \\ \cos X_{21}\,dX_{21} & \cos X_{22}\,dX_{22} \end{bmatrix}=\cos(X)\odot dX$$
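These rules lend themselves to finite-difference checks. For instance, a minimal NumPy sketch of the determinant rule $d\lvert X\rvert=\lvert X\rvert\operatorname{tr}(X^{-1}dX)$, using a small random perturbation as an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 4))
dX = 1e-6 * rng.standard_normal((4, 4))

# First-order prediction from the rule d|X| = |X| tr(X^{-1} dX)
pred = np.linalg.det(X) * np.trace(np.linalg.solve(X, dX))
actual = np.linalg.det(X + dX) - np.linalg.det(X)
assert np.isclose(pred, actual, rtol=1e-3)
```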

Computing matrix derivatives via the trace

We now exploit the link between derivative and differential, $df=\operatorname{tr}\left[\left(\cfrac{\partial f}{\partial X}\right)^TdX\right]$: after computing the differential $df$ on the left, how do we massage it into the form on the right and read off the derivative? This requires a few trace tricks:

  • Wrap a scalar in a trace: $a=\operatorname{tr}(a)$
  • Transpose: $\operatorname{tr}(A^T)=\operatorname{tr}(A)$
  • Linearity: $\operatorname{tr}(A\pm B)=\operatorname{tr}(A)\pm\operatorname{tr}(B)$
  • Cyclic exchange in a matrix product: $\operatorname{tr}(AB)=\operatorname{tr}(BA)$, where $A$ and $B^T$ have the same size; both sides equal $\sum_{ij}A_{ij}B_{ji}$.
  • Exchanging a matrix product with an elementwise product: $\operatorname{tr}(A^T(B\odot C))=\operatorname{tr}((A\odot B)^TC)$, where $A,B,C$ have the same size; both sides equal $\sum_{ij}A_{ij}B_{ij}C_{ij}$. Writing it out for $n\times n$ matrices: $$\begin{aligned} A^T(B\odot C)&=\begin{bmatrix} a_{11} & a_{21} & \dots & a_{n1} \\ a_{12} & a_{22} & \dots & a_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \dots & a_{nn} \end{bmatrix}\begin{bmatrix} b_{11}c_{11} & b_{12}c_{12} & \dots & b_{1n}c_{1n} \\ b_{21}c_{21} & b_{22}c_{22} & \dots & b_{2n}c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1}c_{n1} & b_{n2}c_{n2} & \dots & b_{nn}c_{nn} \end{bmatrix}\\ &=\begin{bmatrix} \sum_{j}a_{j1}b_{j1}c_{j1} & \sum_{j}a_{j1}b_{j2}c_{j2} & \dots & \sum_{j}a_{j1}b_{jn}c_{jn}\\ \sum_{j}a_{j2}b_{j1}c_{j1} & \sum_{j}a_{j2}b_{j2}c_{j2} & \dots & \sum_{j}a_{j2}b_{jn}c_{jn}\\ \vdots & \vdots & \ddots & \vdots \\ \sum_{j}a_{jn}b_{j1}c_{j1} & \sum_{j}a_{jn}b_{j2}c_{j2} & \dots & \sum_{j}a_{jn}b_{jn}c_{jn} \end{bmatrix} \end{aligned}$$
    It is then easy to see that $\operatorname{tr}\left[A^T(B\odot C)\right]=\sum_i\sum_j a_{ji}b_{ji}c_{ji}=\operatorname{tr}\left[(A\odot B)^TC\right]$.
    From these observations we can assert:
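Both exchange identities are easy to confirm numerically; a minimal NumPy sketch with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
A, B, C = (rng.standard_normal((5, 5)) for _ in range(3))

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# tr(A^T (B ⊙ C)) = tr((A ⊙ B)^T C) = sum_ij A_ij B_ij C_ij
t1 = np.trace(A.T @ (B * C))
t2 = np.trace((A * B).T @ C)
assert np.isclose(t1, t2)
assert np.isclose(t1, np.sum(A * B * C))
```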

If a scalar function $f$ is built from a matrix $X$ through addition, subtraction, multiplication, inversion, determinants, elementwise functions, and so on, then: compute the differential of $f$ using the corresponding rules, wrap $df$ in a trace and use the trace tricks to move every other factor to the left of $dX$, and compare with the link $df=\operatorname{tr}\left[\left(\cfrac{\partial f}{\partial X}\right)^TdX\right]$ to read off the derivative.
In particular, if the matrix degenerates to a vector, compare with the link $df=\left(\cfrac{\partial f}{\partial \boldsymbol x}\right)^Td\boldsymbol{x}$ to read off the derivative.

To close out the rules, a word on composition: suppose we have already computed $\cfrac{\partial f}{\partial Y}$, and $Y$ is a function of $X$; how do we obtain $\cfrac{\partial f}{\partial X}$? Scalar calculus has the chain rule $\cfrac{\partial f}{\partial x}=\cfrac{\partial f}{\partial y}\cfrac{\partial y}{\partial x}$, but we cannot blindly carry it over here, because the derivative of a matrix with respect to a matrix, $\cfrac{\partial Y}{\partial X}$, is so far undefined. So we trace the chain rule back to its source, which is again the differential, and build the composition rule directly from differentials: first write $df=\operatorname{tr}\left[\left(\cfrac{\partial f}{\partial Y}\right)^TdY\right]$, then express $dY$ in terms of $dX$ and substitute, and use the trace tricks to move the other factors to the left of $dX$; this yields $\cfrac{\partial f}{\partial X}$.

Some examples

Example 1

Given $f=a^TXb$, find $\cfrac{\partial f}{\partial X}$,
where $a$ is an $m\times1$ column vector, $X$ is an $m\times n$ matrix, $b$ is an $n\times1$ column vector, and $f$ is a scalar.
Take the differential:
$df=da^T\,Xb+a^T\,dX\,b+a^TX\,db=a^T\,dX\,b$, since $a,b$ are constants, so $da=0$ and $db=0$.
Because $df$ is a scalar, $df=\operatorname{tr}(df)=\operatorname{tr}(a^T\,dX\,b)=\operatorname{tr}(b\,a^T\,dX)=\operatorname{tr}((ab^T)^T\,dX)$, giving $\cfrac{\partial f}{\partial X}=ab^T$.
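The result $ab^T$ can be checked against a direct perturbation (a minimal NumPy sketch; the random data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

grad = np.outer(a, b)            # claimed derivative: a b^T
dX = 1e-6 * rng.standard_normal((m, n))
df = a @ (X + dX) @ b - a @ X @ b
assert np.isclose(df, np.sum(grad * dX))   # df = tr(grad^T dX)
```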

Example 2

Given $f=a^T\exp(Xb)$, find $\cfrac{\partial f}{\partial X}$, where $a$ is an $m\times1$ column vector, $X$ is an $m\times n$ matrix, $b$ is an $n\times1$ column vector, $\exp$ is applied elementwise, and $f$ is a scalar.
Take the differential:
$$df=a^T\big(\exp(Xb)\odot(dX\,b)\big)=\operatorname{tr}\big(a^T(\exp(Xb)\odot(dX\,b))\big)=\operatorname{tr}\big((a\odot\exp(Xb))^T\,dX\,b\big)=\operatorname{tr}\Big(\big((a\odot\exp(Xb))b^T\big)^T\,dX\Big)\ \Rightarrow\ \cfrac{\partial f}{\partial X}=(a\odot\exp(Xb))b^T$$
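A finite-difference check of this gradient (a minimal NumPy sketch under illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 4
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

f = lambda X: a @ np.exp(X @ b)
grad = np.outer(a * np.exp(X @ b), b)      # (a ⊙ exp(Xb)) b^T

dX = 1e-6 * rng.standard_normal((m, n))
assert np.isclose(f(X + dX) - f(X), np.sum(grad * dX), rtol=1e-4)
```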

Example 3

Given $f=\operatorname{tr}(Y^TMY)$ with $Y=\sigma(WX)$, find $\cfrac{\partial f}{\partial X}$,
where $W$ is an $l\times m$ matrix, $X$ is an $m\times n$ matrix, $Y$ is an $l\times n$ matrix, $M$ is an $l\times l$ symmetric matrix, $\sigma$ is the elementwise sigmoid function, and $f$ is a scalar.
Take the differential: $df=\operatorname{tr}((dY)^TMY)+\operatorname{tr}(Y^TM\,dY)=\operatorname{tr}(Y^TM^T\,dY+Y^TM\,dY)=\operatorname{tr}(Y^T(M^T+M)\,dY)$,
which gives $\cfrac{\partial f}{\partial Y}=(M+M^T)Y=2MY$, using the symmetry of $M$.

Continue by composing with $Y=\sigma(WX)$, where $\sigma'(WX)=\sigma(WX)\odot(1-\sigma(WX))$:
$$\begin{aligned} df&=\operatorname{tr}\left(\left(\cfrac{\partial f}{\partial Y}\right)^TdY\right)=\operatorname{tr}\left(\left(\cfrac{\partial f}{\partial Y}\right)^T\big(\sigma'(WX)\odot(W\,dX)\big)\right)\\ &=\operatorname{tr}\left(\left(\cfrac{\partial f}{\partial Y}\odot\sigma'(WX)\right)^TW\,dX\right)=\operatorname{tr}\left(\left(W^T\left(\cfrac{\partial f}{\partial Y}\odot\sigma'(WX)\right)\right)^TdX\right)\\ &\Rightarrow \cfrac{\partial f}{\partial X}=W^T\left(\cfrac{\partial f}{\partial Y}\odot\sigma'(WX)\right)=W^T\big(2MY\odot\sigma(WX)\odot(1-\sigma(WX))\big) \end{aligned}$$
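This composed gradient can also be checked by perturbation (a minimal NumPy sketch; the symmetric $M$ and random data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
l, m, n = 3, 4, 5
W = rng.standard_normal((l, m))
X = rng.standard_normal((m, n))
M0 = rng.standard_normal((l, l)); M = M0 + M0.T   # symmetric M

sig = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda X: np.trace(sig(W @ X).T @ M @ sig(W @ X))

Y = sig(W @ X)
grad = W.T @ (2 * M @ Y * Y * (1 - Y))            # W^T (2MY ⊙ σ'(WX))

dX = 1e-6 * rng.standard_normal((m, n))
assert np.isclose(f(X + dX) - f(X), np.sum(grad * dX), rtol=1e-3)
```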

Linear regression

Given $l=\lVert Xw-y\rVert^2$, find the least-squares estimate of $w$, i.e., the zero of $\cfrac{\partial l}{\partial w}$,
where $y$ is an $m\times1$ column vector, $X$ is an $m\times n$ matrix, $w$ is an $n\times1$ column vector, and $l$ is a scalar.

This is the derivative of a scalar with respect to a vector, but a vector can be treated as a special case of a matrix.
$l=(Xw-y)^T(Xw-y)$

Take the differential, using the matrix-product and transpose rules:
$dl=(X\,dw)^T(Xw-y)+(Xw-y)^T(X\,dw)$

Since two column vectors satisfy $u^Tv=v^Tu$:
$dl=2(Xw-y)^T(X\,dw)=\operatorname{tr}(dl)=\operatorname{tr}\left(\left(2X^T(Xw-y)\right)^Tdw\right)$
Comparing with the scalar-to-vector differential formula:
$$\cfrac{\partial l}{\partial w}=2X^T(Xw-y)=0 \ \Rightarrow\ w=(X^TX)^{-1}X^Ty$$
(assuming $X^TX$ is invertible).
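The normal-equation solution can be verified against NumPy's least-squares solver (a minimal sketch under illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 20, 5
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

w = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
# the gradient 2 X^T (Xw - y) vanishes at the solution
assert np.allclose(2 * X.T @ (X @ w - y), 0)
# and it matches the library least-squares solver
assert np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0])
```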

Extending linear regression

Given $l=\cfrac{1}{N}\sum_{i=1}^N\lVert x_iw+b-y_i\rVert^2$, find the least-squares estimates of $w,b$, i.e., the zeros of $\cfrac{\partial l}{\partial w}$ and $\cfrac{\partial l}{\partial b}$,
where each $y_i$ is a $1\times n$ row vector, each $x_i$ is a $1\times m$ row vector, $w$ is an $m\times n$ matrix, $b$ is a $1\times n$ row vector, and $l$ is a scalar.
Substituting $a_i=x_iw+b$ gives $l=\cfrac{1}{N}\sum_{i=1}^N\lVert a_i-y_i\rVert^2$.
Take the total differential (for a row vector $u$, $\lVert u\rVert^2=uu^T$):
$$dl=d\left(\cfrac{1}{N}\sum_{i=1}^N(a_i-y_i)(a_i-y_i)^T\right)=\cfrac{1}{N}\sum_{i=1}^N2(a_i-y_i)(da_i)^T$$
which gives
$$\cfrac{\partial l}{\partial a_i}=\cfrac{2}{N}(a_i-y_i)$$

By the composition rule:
$$dl=\sum_{i=1}^N\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_i}\right)^T(x_i\,dw+db)\right)=\sum_{i=1}^N\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_i}\right)^Tx_i\,dw+\left(\cfrac{\partial l}{\partial a_i}\right)^Tdb\right)$$
Setting the partial derivatives with respect to $w,b$ to zero, and writing $\mu_x=\cfrac{1}{N}\sum_{i=1}^N x_i$, $\mu_y=\cfrac{1}{N}\sum_{i=1}^N y_i$:
$$\cfrac{\partial l}{\partial b}=\sum_{i=1}^N\cfrac{2}{N}(x_iw+b-y_i)=0\quad\Rightarrow\quad b=\mu_y-\mu_xw$$
$$\begin{aligned} \cfrac{\partial l}{\partial w}&=\sum_{i=1}^Nx_i^T\cfrac{\partial l}{\partial a_i}=\cfrac{2}{N}\sum_{i=1}^Nx_i^T(x_iw+b-y_i)\\ &=\cfrac{2}{N}\sum_{i=1}^Nx_i^T(x_iw+\mu_y-\mu_xw-y_i)\\ &=\cfrac{2}{N}\sum_{i=1}^N(x_i-\mu_x)^T\big((x_i-\mu_x)w-(y_i-\mu_y)\big)\quad\left(\text{since }\cfrac{2}{N}\sum_{i=1}^N\mu_x^T(x_iw+\mu_y-\mu_xw-y_i)=0\right)\\ &=2(\Sigma_{xx}w-\Sigma_{xy})=0\\ &\Rightarrow w=\Sigma_{xx}^{-1}\Sigma_{xy},\quad b=\mu_y-\mu_x\Sigma_{xx}^{-1}\Sigma_{xy} \end{aligned}$$
where $\Sigma_{xx}=\cfrac{1}{N}\sum_{i=1}^N(x_i-\mu_x)^T(x_i-\mu_x)$ and $\Sigma_{xy}=\cfrac{1}{N}\sum_{i=1}^N(x_i-\mu_x)^T(y_i-\mu_y)$.
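The centered solution can be checked by confirming both gradients vanish at $(w,b)$ (a minimal NumPy sketch; stacking the row vectors $x_i,y_i$ into matrices is an illustrative convention):

```python
import numpy as np

rng = np.random.default_rng(7)
N, m, n = 50, 4, 2
X = rng.standard_normal((N, m))     # rows are the x_i
Y = rng.standard_normal((N, n))     # rows are the y_i

mu_x, mu_y = X.mean(0), Y.mean(0)
Xc, Yc = X - mu_x, Y - mu_y
Sxx = Xc.T @ Xc / N
Sxy = Xc.T @ Yc / N
w = np.linalg.solve(Sxx, Sxy)
b = mu_y - mu_x @ w

# both gradients vanish at (w, b)
R = X @ w + b - Y
assert np.allclose(X.T @ R, 0, atol=1e-8)
assert np.allclose(R.sum(0), 0, atol=1e-8)
```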

Multivariate normal distribution

The density of an $m$-dimensional Gaussian random vector is:
$$p(\boldsymbol x)=(2\pi)^{-m/2}\lvert \Sigma \rvert^{-1/2}e^{-\frac{1}{2}(\boldsymbol x-\mu)^T\Sigma^{-1}(\boldsymbol x-\mu)}$$
For maximum-likelihood estimation, the log-likelihood is
$$\begin{aligned} l&=\log \prod_{i=1}^{N}p(x_i)\\ &=\log\left[ (2\pi)^{-mN/2} \lvert\Sigma\rvert^{-N/2}e^{-\frac{1}{2}\sum_{i=1}^N(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)} \right]\\ &=-\frac{mN}{2}\log(2\pi)-\frac{N}{2}\left[\log\lvert\Sigma\rvert + \frac{1}{N}\sum_{i=1}^N(x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right] \end{aligned}$$
where each $x_i$ is an $m\times1$ vector and $\mu$ is the $m\times1$ mean vector.
Maximizing $l$ over $\Sigma$ therefore amounts to finding the extremum of (overloading $l$ for the reduced objective)
$$l=\log\lvert\Sigma\rvert + \frac{1}{N}\sum_{i=1}^N(x_i-\mu)^T \Sigma^{-1}(x_i-\mu)$$
Take the differential, starting with the $\log\lvert\Sigma\rvert$ term:
$$d\log\lvert\Sigma\rvert=\lvert\Sigma\rvert^{-1}d\lvert\Sigma\rvert=\lvert\Sigma\rvert^{-1}\lvert\Sigma\rvert\operatorname{tr}(\Sigma^{-1}d\Sigma)=\operatorname{tr}(\Sigma^{-1}d\Sigma)$$
For the second term, writing $\Sigma_{xx}=\frac{1}{N}\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T$ for the sample covariance:
$$\begin{aligned} d\left[\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^T \Sigma^{-1}(x_i-\mu)\right]&=-\cfrac{1}{N}\sum_{i=1}^N(x_i-\mu)^T\Sigma^{-1}(d\Sigma)\Sigma^{-1}(x_i-\mu)\\ &=-\cfrac{1}{N}\sum_{i=1}^N\operatorname{tr}\big(\Sigma^{-1}(x_i-\mu)(x_i-\mu)^T\Sigma^{-1}d\Sigma\big)\\ &=-\operatorname{tr}(\Sigma^{-1}\Sigma_{xx}\Sigma^{-1}d\Sigma) \end{aligned}$$
Adding the two differentials:
$$\begin{aligned} dl&=\operatorname{tr}(\Sigma^{-1}d\Sigma-\Sigma^{-1}\Sigma_{xx}\Sigma^{-1}d\Sigma)=\operatorname{tr}\big(\left(\Sigma^{-1}-\Sigma^{-1}\Sigma_{xx}\Sigma^{-1}\right)d\Sigma\big)\\ &\Rightarrow \cfrac{\partial l}{\partial \Sigma}=\left(\Sigma^{-1}-\Sigma^{-1}\Sigma_{xx}\Sigma^{-1}\right)^T\\ &\Rightarrow \cfrac{\partial l}{\partial \Sigma}=0 \Rightarrow \Sigma=\Sigma_{xx} \end{aligned}$$
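The resulting MLE is exactly the biased sample covariance, which can be checked against `np.cov` (a minimal NumPy sketch with illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(8)
N, m = 1000, 3
X = rng.standard_normal((N, m)) @ rng.standard_normal((m, m))

mu = X.mean(0)
Sigma = (X - mu).T @ (X - mu) / N     # MLE: the (biased) sample covariance
assert np.allclose(Sigma, np.cov(X.T, bias=True))
```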

Multiclass logistic regression

Given $l=-y^T\log\operatorname{softmax}(Wx)$, find $\cfrac{\partial l}{\partial W}$,
where $y$ is an $m\times1$ one-hot column vector (one element is 1 and the rest are 0), $W$ is an $m\times n$ matrix, $x$ is an $n\times1$ column vector, and $l$ is a scalar.
$\log$ denotes the natural logarithm, $\operatorname{softmax}(a)=\cfrac{\exp(a)}{1^T\exp(a)}$, where $\exp$ is applied elementwise and $1$ is the all-ones vector.
Substitute $a=Wx$ and the definition of $\operatorname{softmax}$, then take the differential:
$$\begin{aligned} dl&=d\left[-y^T\big(a-1\log(1^T\exp(a))\big)\right]\\ &=d\left[-y^Ta+\log(1^T\exp(a))\right]\qquad(\text{using }y^T1=1)\\ &=\cfrac{1^T\big(\exp(a)\odot da\big)}{1^T\exp(a)}-y^Tda\\ &=\cfrac{\exp(a)^Tda}{1^T\exp(a)}-y^Tda\\ &=\operatorname{tr}\big((\operatorname{softmax}(a)-y)^Tda\big)\\ &\Rightarrow \cfrac{\partial l}{\partial a}=\operatorname{softmax}(a)-y \end{aligned}$$
Now substitute $a=Wx$ back and take the differential again:
$$\begin{aligned} dl&=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a}\right)^Tda\right)=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a}\right)^TdW\,x\right)=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a}x^T\right)^TdW\right)\\ &\Rightarrow \cfrac{\partial l}{\partial W}=\cfrac{\partial l}{\partial a}x^T=\left(\operatorname{softmax}(Wx)-y\right)x^T \end{aligned}$$
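The softmax cross-entropy gradient $(\operatorname{softmax}(Wx)-y)x^T$ admits a simple finite-difference check (a minimal NumPy sketch; the one-hot label position is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 4, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = np.zeros(m); y[1] = 1.0               # one-hot label

def loss(W):
    a = W @ x
    return -(y @ (a - np.log(np.sum(np.exp(a)))))   # -y^T log softmax(Wx)

grad = np.outer(np.exp(W @ x) / np.sum(np.exp(W @ x)) - y, x)
dW = 1e-6 * rng.standard_normal((m, n))
assert np.isclose(loss(W + dW) - loss(W), np.sum(grad * dW), rtol=1e-3)
```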

A two-layer neural network

Given $l=-y^T\log \operatorname{softmax}(W_2\sigma(W_1x))$, find $\cfrac{\partial l}{\partial W_1}$ and $\cfrac{\partial l}{\partial W_2}$,
where $y$ is an $m\times1$ one-hot column vector (one element is 1 and the rest are 0), $W_2$ is an $m\times p$ matrix, $W_1$ is a $p\times n$ matrix, $x$ is an $n\times1$ column vector, and $l$ is a scalar.
$\log$ denotes the natural logarithm, $\operatorname{softmax}(a)=\cfrac{\exp(a)}{1^T\exp(a)}$, and $\sigma$ is the elementwise sigmoid function $\sigma(a)=\cfrac{1}{1+\exp(-a)}$, with derivative $\sigma'(a)=\sigma(a)\odot(1-\sigma(a))$.

a 2 = W 2 σ ( W 1 x ) a_2=W_2\sigma(W_1x) a2=W2σ(W1x),根据上面的逻辑回归求导结果
∂ l ∂ a 2 = softmax ( a 2 ) − y \cfrac{\partial l}{\partial a_2}=\text{softmax}(a_2)-y a2l=softmax(a2)y

a 1 = σ ( W 1 x ) a_1=\sigma(W_1x) a1=σ(W1x)继续求微分
d l = tr ( ∂ l ∂ a 2 T d a 2 ) = tr ( ∂ l ∂ a 2 T d W 2 a 1 + ∂ l ∂ a 2 T W 2 d a 1 ) dl=\text{tr}(\cfrac{\partial l}{\partial a_2}^Tda_2)=\text{tr}\left(\cfrac{\partial l}{\partial a_2}^T dW_2a_1+\cfrac{\partial l}{\partial a_2}^TW_2da_1\right) dl=tr(a2lTda2)=tr(a2lTdW2a1+a2lTW2da1)
∂ l ∂ W 2 = ∂ l ∂ a 2 a 1 T \cfrac{\partial l}{\partial W_2}=\cfrac{\partial l}{\partial a_2}a_1^T W2l=a2la1T

Continue with the second term, using $\sigma'(W_1x)=\sigma(W_1x)\odot(1-\sigma(W_1x))$:
$$\begin{aligned} \operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_2}\right)^TW_2\,da_1\right) &=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_2}\right)^TW_2\,d\sigma(W_1x)\right)\\ &=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_2}\right)^TW_2\big(\sigma'(W_1x)\odot (dW_1\,x)\big)\right)\\ &=\operatorname{tr}\left(\left(W_2^T\cfrac{\partial l}{\partial a_2}\odot \sigma'(W_1x)\right)^T dW_1\,x\right)\\ &=\operatorname{tr}\left(\left(\left(W_2^T\cfrac{\partial l}{\partial a_2}\odot \sigma'(W_1x)\right)x^T\right)^T dW_1\right)\\ &\Rightarrow \cfrac{\partial l}{\partial W_1}=\left(W_2^T\cfrac{\partial l}{\partial a_2}\odot \sigma'(W_1x)\right)x^T \end{aligned}$$
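Both layer gradients can be validated together with one joint perturbation (a minimal NumPy sketch under illustrative random weights and a one-hot label):

```python
import numpy as np

rng = np.random.default_rng(10)
m, p, n = 4, 6, 3
W2, W1 = rng.standard_normal((m, p)), rng.standard_normal((p, n))
x = rng.standard_normal(n)
y = np.zeros(m); y[0] = 1.0

sig = lambda t: 1.0 / (1.0 + np.exp(-t))
def loss(W1, W2):
    a2 = W2 @ sig(W1 @ x)
    return -(y @ (a2 - np.log(np.sum(np.exp(a2)))))

a1 = sig(W1 @ x)
a2 = W2 @ a1
g2 = np.exp(a2) / np.sum(np.exp(a2)) - y              # ∂l/∂a2
gW2 = np.outer(g2, a1)                                # ∂l/∂W2
gW1 = np.outer(W2.T @ g2 * a1 * (1 - a1), x)          # ∂l/∂W1

d1 = 1e-6 * rng.standard_normal((p, n))
d2 = 1e-6 * rng.standard_normal((m, p))
assert np.isclose(loss(W1 + d1, W2 + d2) - loss(W1, W2),
                  np.sum(gW1 * d1) + np.sum(gW2 * d2), rtol=1e-3)
```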
Generalization: for samples $(x_1,y_1),\dots,(x_N,y_N)$, $l=-\sum_{i=1}^Ny_i^T\log \operatorname{softmax}(W_2\sigma(W_1x_i+b_1)+b_2)$, where $b_1$ is a $p\times1$ column vector and $b_2$ is an $m\times1$ column vector.

Solution 1: define $a_{1,i}=W_1x_i+b_1$ and $a_{2,i}=W_2\sigma(a_{1,i})+b_2$.
Then $l=-\sum_{i=1}^Ny_i^T\log\operatorname{softmax}(a_{2,i})$, and as above $\cfrac{\partial l}{\partial a_{2,i}}=\operatorname{softmax}(a_{2,i})-y_i$, so
$$\begin{aligned} dl&=\sum_{i=1}^N\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_{2,i}}\right)^Tda_{2,i}\right)\\ &=\sum_{i=1}^N\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_{2,i}}\right)^T dW_2\,\sigma(a_{1,i})+\left(\cfrac{\partial l}{\partial a_{2,i}}\right)^TW_2\,d\sigma(a_{1,i})+\left(\cfrac{\partial l}{\partial a_{2,i}}\right)^Tdb_2\right)\\ &\Rightarrow \cfrac{\partial l}{\partial W_2}=\sum_{i=1}^N\cfrac{\partial l}{\partial a_{2,i}}\sigma(a_{1,i})^T,\quad \cfrac{\partial l}{\partial b_2}=\sum_{i=1}^N\cfrac{\partial l}{\partial a_{2,i}} \end{aligned}$$
To find $\cfrac{\partial l}{\partial W_1}$ and $\cfrac{\partial l}{\partial b_1}$, write $dl_2$ for the middle term above, and use $\sigma'(a_{1,i})=\sigma(a_{1,i})\odot(1-\sigma(a_{1,i}))$:
$$\begin{aligned} dl_2&=\sum_{i=1}^N\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_{2,i}}\right)^TW_2\,d\sigma(a_{1,i})\right)\\ &=\sum_{i=1}^N\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_{2,i}}\right)^TW_2\big(\sigma'(a_{1,i})\odot da_{1,i}\big)\right)\\ &=\sum_{i=1}^N\operatorname{tr} \left(\left(W_2^T\cfrac{\partial l}{\partial a_{2,i}} \odot \sigma'(a_{1,i})\right)^Tda_{1,i}\right)\\ &\Rightarrow \cfrac{\partial l}{\partial a_{1,i}}=W_2^T\cfrac{\partial l}{\partial a_{2,i}} \odot \sigma'(a_{1,i}) \end{aligned}$$
Then, since $da_{1,i}=dW_1\,x_i+db_1$:
$$\begin{aligned} dl_2&=\sum_{i=1}^N\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_{1,i}}\right)^Tda_{1,i}\right)=\sum_{i=1}^N\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial a_{1,i}}\right)^TdW_1\,x_i + \left(\cfrac{\partial l}{\partial a_{1,i}}\right)^Tdb_1 \right)\\ &\Rightarrow \cfrac{\partial l}{\partial W_1}=\sum_{i=1}^N\cfrac{\partial l}{\partial a_{1,i}}x_i^T,\quad \cfrac{\partial l}{\partial b_1}=\sum_{i=1}^N\cfrac{\partial l}{\partial a_{1,i}} \end{aligned}$$
Solution 2: the $N$ samples can be packed into matrices to simplify the notation. Define
$$X=[x_1,x_2,\dots,x_N],\quad A_1=[a_{1,1},a_{1,2},\dots,a_{1,N}]=W_1X+b_11^T,\quad A_2=[a_{2,1},a_{2,2},\dots,a_{2,N}]=W_2\sigma(A_1)+b_21^T,\quad \sigma(A_1)=[\sigma(a_{1,1}),\sigma(a_{1,2}),\dots,\sigma(a_{1,N})]$$
where $1$ is the $N\times1$ all-ones vector.

Then $\cfrac{\partial l}{\partial A_2}=\left[\operatorname{softmax}(a_{2,1}) - y_1,\operatorname{softmax}(a_{2,2}) - y_2,\dots, \operatorname{softmax}(a_{2,N}) - y_N \right]$, and
$$\begin{aligned} dl&=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial A_2}\right)^TdA_2\right)\\ &=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial A_2}\right)^T dW_2\,\sigma(A_1)+\left(\cfrac{\partial l}{\partial A_2}\right)^TW_2\,d\sigma(A_1)+\left(\cfrac{\partial l}{\partial A_2}\right)^Tdb_21^T\right)\\ &\Rightarrow \cfrac{\partial l}{\partial W_2}=\cfrac{\partial l}{\partial A_2}\sigma(A_1)^T,\quad \cfrac{\partial l}{\partial b_2}=\cfrac{\partial l}{\partial A_2}1 \end{aligned}$$
To find $\cfrac{\partial l}{\partial A_1}$, write $dl_2$ for the middle term, with $\sigma'(A_1)=\sigma(A_1)\odot(1-\sigma(A_1))$:
$$\begin{aligned} dl_2&=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial A_2}\right)^TW_2\,d\sigma(A_1)\right)\\ &=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial A_2}\right)^TW_2\big(\sigma'(A_1)\odot dA_1\big) \right)\\ &=\operatorname{tr}\left(\left(W_2^T\cfrac{\partial l}{\partial A_2}\odot\sigma'(A_1)\right)^T dA_1 \right)\\ &\Rightarrow \cfrac{\partial l}{\partial A_1}=W_2^T\cfrac{\partial l}{\partial A_2}\odot\sigma'(A_1) \end{aligned}$$
Finally, for $\cfrac{\partial l}{\partial W_1}$ and $\cfrac{\partial l}{\partial b_1}$, since $dA_1=dW_1\,X+db_11^T$:
$$\begin{aligned} dl_2&=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial A_1}\right)^TdA_1\right)=\operatorname{tr}\left(\left(\cfrac{\partial l}{\partial A_1}\right)^TdW_1\,X+\left(\cfrac{\partial l}{\partial A_1}\right)^Tdb_11^T \right)\\ &\Rightarrow \cfrac{\partial l}{\partial W_1}=\cfrac{\partial l}{\partial A_1}X^T,\quad \cfrac{\partial l}{\partial b_1}=\cfrac{\partial l}{\partial A_1}1 \end{aligned}$$
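The batched form can be checked in one shot (a minimal NumPy sketch; packing samples as columns and drawing random one-hot labels are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(11)
m, p, n, N = 3, 5, 4, 8
W2, W1 = rng.standard_normal((m, p)), rng.standard_normal((p, n))
b2, b1 = rng.standard_normal(m), rng.standard_normal(p)
X = rng.standard_normal((n, N))                      # columns are samples
Y = np.eye(m)[rng.integers(0, m, N)].T               # one-hot label columns

sig = lambda t: 1.0 / (1.0 + np.exp(-t))
A1 = W1 @ X + b1[:, None]
S1 = sig(A1)
A2 = W2 @ S1 + b2[:, None]
P = np.exp(A2) / np.sum(np.exp(A2), axis=0)          # columnwise softmax

G2 = P - Y                                           # ∂l/∂A2
gW2, gb2 = G2 @ S1.T, G2 @ np.ones(N)
G1 = (W2.T @ G2) * S1 * (1 - S1)                     # ∂l/∂A1
gW1, gb1 = G1 @ X.T, G1 @ np.ones(N)

def loss(W1, b1, W2, b2):
    A2 = W2 @ sig(W1 @ X + b1[:, None]) + b2[:, None]
    return -np.sum(Y * (A2 - np.log(np.sum(np.exp(A2), axis=0))))

dW1 = 1e-6 * rng.standard_normal(W1.shape)
pred = np.sum(gW1 * dW1)
assert np.isclose(loss(W1 + dW1, b1, W2, b2) - loss(W1, b1, W2, b2), pred, rtol=1e-3)
```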

矩阵求导术(上) (Matrix Derivatives, Part 1)
