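These are study notes based on a WildML tutorial.
The original tutorial sets out to implement a recurrent neural network that learns the patterns in which words occur in natural-language sentences, so that it can eventually generate plausible sentences.
Interpretation: the \(n\)-th element of \(y\) is the expected output for the first \(n\) elements of \(x\), that is, the word that should follow them.
Each input \(x_t\) has 8000 dimensions, but only one of them is nonzero: it is 1 at the index of the token for the \(t\)-th word.
For example, if the token for "what" is 51, then \(x_t\) has a 1 at position 51 and 0 everywhere else.
This is called a one-hot vector.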
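As a minimal sketch of the one-hot encoding described above: the helper name `one_hot` is hypothetical, and 0-based indexing is an assumption (the text only says "position 51").

```python
def one_hot(token_index, vocab_size):
    """Return a list of length vocab_size with a single 1 at token_index."""
    vec = [0] * vocab_size
    vec[token_index] = 1
    return vec

# Values taken from the example in the text: 8000 tokens, "what" -> 51.
# Treating 51 as a 0-based index is an assumption.
x_t = one_hot(51, 8000)
```

Because only one component is nonzero, multiplying \(x_t\) by a weight matrix simply selects one row of that matrix, so an implementation never has to build the full vector.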
Output: the probability of each token.
The state has 100 dimensions.
\[
s_t = \tanh(x_tU + s_{t-1}W) \\
o_t = \mathrm{softmax}(s_tV) \\
\text{where} \\
x_t \in \mathbb{R}^{8000} \\
o_t \in \mathbb{R}^{8000} \\
s_t \in \mathbb{R}^{100} \\
U \in \mathbb{R}^{8000 \times 100} : x_tU \text{ is a 100-dimensional vector} \\
W \in \mathbb{R}^{100 \times 100} : s_{t-1}W \text{ is a 100-dimensional vector} \\
V \in \mathbb{R}^{100 \times 8000} : s_tV \text{ is an 8000-dimensional vector}
\]
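The forward pass defined by these equations can be sketched in plain Python. This is only an illustration under stated assumptions: vectors are treated as row vectors (so \(U\) has one row per vocabulary entry), the sizes are shrunk to 8 tokens and 4 state dimensions, the weights are random, and the helper names `vec_mat`, `forward` and `rand_matrix` are hypothetical.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax of a list of numbers."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def vec_mat(v, M):
    """Row vector v (length n) times matrix M (n rows, m columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

def forward(tokens, U, V, W):
    """Run the RNN over a token sequence; return all states and outputs."""
    s_prev = [0.0] * len(W)            # initial state is all zeros
    states, outputs = [], []
    for tok in tokens:
        x_u = U[tok]                   # x_t is one-hot, so x_t U is row `tok` of U
        s_w = vec_mat(s_prev, W)       # s_{t-1} W
        s_t = [math.tanh(a + b) for a, b in zip(x_u, s_w)]
        o_t = softmax(vec_mat(s_t, V)) # probability of each token
        states.append(s_t)
        outputs.append(o_t)
        s_prev = s_t
    return states, outputs

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumed): 8 tokens and 4 state dimensions instead of 8000 and 100.
random.seed(0)
vocab_size, hidden_size = 8, 4
U = rand_matrix(vocab_size, hidden_size)
V = rand_matrix(hidden_size, vocab_size)
W = rand_matrix(hidden_size, hidden_size)
states, outputs = forward([1, 5, 2], U, V, W)
```

Each entry of `outputs` is a probability distribution over the vocabulary, and each entry of `states` is the state vector passed on to the next time step.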
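The training process:
As the above shows, the backpropagation algorithm is the key to training, because the methods for the other steps are already known.
The purpose of backpropagation is to compute the partial derivatives of the loss with respect to the weights of the prediction model.
For proofs of the derivatives of the activation and loss functions, see:
The sigmoid function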
\[
\sigma(x) = \frac{1}{1 + e^{-x}} \\
\sigma'(x) = (1 - \sigma(x))\sigma(x)
\]
The tanh function
\[
\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1} \\
\tanh'(x) = 1 - \tanh^2(x)
\]
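The two derivative identities above can be checked numerically. The sketch below compares them with a central finite difference; the helper name `numeric_derivative` and the step size `h` are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Closed form: (1 - sigma(x)) * sigma(x)."""
    s = sigmoid(x)
    return (1.0 - s) * s

def tanh_prime(x):
    """Closed form: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Differences between the closed forms and the numerical estimates.
sigmoid_error = abs(sigmoid_prime(0.3) - numeric_derivative(sigmoid, 0.3))
tanh_error = abs(tanh_prime(0.3) - numeric_derivative(math.tanh, 0.3))
```

Both errors should be tiny, limited only by floating-point rounding in the finite difference.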
The softmax function
\[ \hat{y_{t_i}} = \mathrm{softmax}(z_t)_i = \frac{e^{z_{t_i}}}{\sum_{k}e^{z_{t_k}}} \\ \hat{y_t} = \mathrm{softmax}(z_t) = \begin{bmatrix} \cdots & \frac{e^{z_{t_i}}}{\sum_{k}e^{z_{t_k}}} & \cdots \end{bmatrix} \\ \\ \mathrm{softmax}'(z_t) = \frac{\partial \hat{y_{t_i}}}{\partial z_{t_j}} = \begin{cases} \hat{y_{t_i}}(1 - \hat{y_{t_i}}), & \text{if } i = j \\ -\hat{y_{t_i}} \hat{y_{t_j}}, & \text{if } i \ne j \end{cases} \\ \text{where } z_{t_i} \text{ is the } i\text{-th component of } z_t = s_tV \]
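The two cases of the softmax derivative form a full Jacobian matrix. A small sketch of it follows; the function name `softmax_jacobian` and the input values are assumptions chosen for illustration.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of numbers."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = d y_i / d z_j, built from the two cases of the formula."""
    y = softmax(z)
    n = len(y)
    return [[y[i] * (1.0 - y[i]) if i == j else -y[i] * y[j]
             for j in range(n)]
            for i in range(n)]

z = [1.0, 2.0, 0.5]      # arbitrary example input
J = softmax_jacobian(z)
```

Each row of `J` sums to zero, because the softmax outputs always sum to 1 and so cannot all increase together.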
The cross-entropy loss function
\[
L_t(y_t, \hat{y_t}) = - y_t \log \hat{y_t} \\
L(y, \hat{y}) = - \sum_{t} y_t \log \hat{y_t} \\
\frac{ \partial L_t } { \partial z_t } = \hat{y_t} - y_t \\
\text{where} \\
z_t = s_tV \\
\hat{y_t} = \mathrm{softmax}(z_t) \\
y_t \text{ : the expected output at time } t \text{ for training input } x \text{, taken from the training data}
\]
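The result \(\partial L_t / \partial z_t = \hat{y_t} - y_t\) is what keeps backpropagation simple, and it can be verified numerically. The sketch below assumes a one-hot \(y_t\); the example values and the helper names `loss_gradient` and `numeric_gradient` are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, y_hat):
    """L_t = - sum_i y_i * log(y_hat_i)."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))

def loss_gradient(z, y):
    """Analytic gradient of the loss with respect to z: softmax(z) - y."""
    return [pi - yi for pi, yi in zip(softmax(z), y)]

def numeric_gradient(z, y, h=1e-6):
    """Central finite-difference gradient of the loss with respect to z."""
    grads = []
    for i in range(len(z)):
        z_plus = list(z)
        z_plus[i] += h
        z_minus = list(z)
        z_minus[i] -= h
        diff = cross_entropy(y, softmax(z_plus)) - cross_entropy(y, softmax(z_minus))
        grads.append(diff / (2.0 * h))
    return grads

z = [0.2, -1.0, 0.7, 0.1]   # arbitrary example values for z_t
y = [0, 0, 1, 0]            # one-hot expected output
analytic = loss_gradient(z, y)
numeric = numeric_gradient(z, y)
```

The two lists should agree to within rounding error, which confirms that the softmax Jacobian and the log in the loss cancel into a simple difference.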
The prediction formulas are the same as before. To make the backpropagation calculation easier, we write them as:
\[ s_t = \tanh(x_tU + s_{t-1}W) \\ z_t = s_tV \\ \hat{y_t} = \mathrm{softmax}(z_t) \\ \text{where} \\ s_{-1} = [0 \cdots 0] \]
\[ L_t(y_t, \hat{y_t}) = - y_t \log \hat{y_t} \\ L(y, \hat{y}) = - \sum_{t} y_t \log \hat{y_t} \\ \text{where } y_t \text{ : the expected output at time } t \text{ for training input } x \text{, taken from the training data} \]
\[ W_{new} = W - s \cdot dW \\ \text{where} \\ s \text{ : step size (learning rate), a value in } (0, 1) \text{; not the state } s_t \\ dW = \frac{\partial L}{\partial W} \text{ : the gradient of the loss } L \text{ with respect to } W \]
Note: stochastic gradient descent works the same way for \(U\), \(V\), and \(W\).
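The update rule is the same element-wise operation for every weight matrix. A minimal sketch follows; the function name `sgd_step` and the numbers are made up for illustration.

```python
def sgd_step(weights, gradient, step_size):
    """Return weights - step_size * gradient, element by element."""
    return [[w - step_size * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, gradient)]

# Hypothetical 2x2 weight matrix and gradient, chosen only for illustration.
W = [[0.5, -0.3], [0.1, 0.8]]
dW = [[0.2, -0.4], [0.0, 1.0]]
W_new = sgd_step(W, dW, step_size=0.1)
```

The same function would be called once each for \(U\), \(V\), and \(W\) after their gradients have been computed.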
All that remains is to compute the partial derivatives of the loss with respect to \(U\), \(V\), and \(W\).
\[ \begin{align} \frac{\partial L_t}{\partial V} & = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial V} \\ & = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} \frac{\partial z_t}{\partial V} \\ & = \frac{\partial L_t}{\partial z_t} \frac{\partial z_t}{\partial V} \\ & = s_t \otimes (\hat{y_t} - y_t) \end{align} \\ \text{where } \otimes \text{ is the outer product; with } z_t = s_tV \text{ the result has the same shape as } V \]
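The gradient with respect to \(V\) is the outer product of \(s_t\) and \(\hat{y_t} - y_t\). The sketch below arranges it to have the same shape as \(V\) under the \(z_t = s_tV\) convention and checks it against finite differences. The tiny sizes, the values, and the names `grad_V` and `loss` are assumptions for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def vec_mat(v, M):
    """Row vector v times matrix M."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

def loss(s_t, V, target):
    """Cross-entropy loss at one time step for a one-hot target index."""
    return -math.log(softmax(vec_mat(s_t, V))[target])

def grad_V(s_t, V, target):
    """Entry (i, j) is s_t[i] * (y_hat[j] - y[j])."""
    delta = softmax(vec_mat(s_t, V))
    delta[target] -= 1.0               # y_hat - y for a one-hot y
    return [[s_i * d_j for d_j in delta] for s_i in s_t]

# Toy example (assumed): 2 state dimensions, 3 tokens.
s_t = [0.3, -0.8]
V = [[0.1, -0.2, 0.4], [0.5, 0.3, -0.6]]
target = 2
dV = grad_V(s_t, V, target)
```

Unlike the gradients for \(W\) and \(U\), this one involves only the current time step, because \(V\) does not feed back into the state.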
Formula
\[
\frac{\partial L_t}{\partial W}
= (\hat{y_t} - y_t) V (1 - s_t^2) \left ( s_{t-1} + W \frac{\partial (s_{t-1})}{\partial W} \right ) \\
\frac{\partial s_t}{\partial W}
= (1 - s_t^2) \left ( s_{t-1} + W \frac{\partial (s_{t-1})}{\partial W} \right )
\]
Proof
Before computing the partial derivative of \(L_t\) with respect to \(W\), we need some auxiliary calculations.
\[
\begin{align}
\frac{\partial s_t}{\partial W}
& = \frac{\partial (\tanh(x_tU + s_{t-1}W))}{\partial W} \\
& \because \text{derivative of tanh and the chain rule} \\
& = (1 - s_t^2) \frac{\partial (x_tU + s_{t-1}W)}{\partial W} \\
& \because \text{sum rule; } x_tU \text{ does not depend on } W \\
& = (1 - s_t^2) \frac{\partial (s_{t-1}W)}{\partial W} \\
& \because \text{product rule} \\
& = (1 - s_t^2) \left ( \frac{\partial (s_{t-1})}{\partial W}W + s_{t-1}\frac{\partial W}{\partial W} \right ) \\
& = (1 - s_t^2) \left ( s_{t-1} + W \frac{\partial (s_{t-1})}{\partial W} \right ) \\
\end{align} \\
\because s_{t-1} \text{ is itself a function of } W \text{, so the product rule is needed and the derivative is recursive.}
\]
\[ \begin{align} \frac{\partial z_t}{\partial s_t} & = \frac{\partial (s_tV )}{\partial s_t} \\ & = V \end{align} \]
\[ \begin{align} \frac{\partial L_t}{\partial W} & = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial W} \\ & = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} \frac{\partial z_t}{\partial W} \\ & = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} \frac{\partial z_t}{\partial s_t} \frac{\partial s_t}{\partial W} \\ & = \frac{\partial L_t}{\partial z_t} \frac{\partial z_t}{\partial s_t} \frac{\partial s_t}{\partial W} \\ & = (\hat{y_t} - y_t) V \frac{\partial s_t}{\partial W} \\ & = (\hat{y_t} - y_t) V (1 - s_t^2) \left( s_{t-1} + W \frac{\partial s_{t-1}}{\partial W} \right) \\ & = (\hat{y_t} - y_t) V \sum_{k=0}^{t} \left( \prod_{j=k+1}^{t} (1 - s_j^2) W \right) (1 - s_k^2) s_{k-1} \end{align} \]
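In the scalar case, where \(x_t\), \(s_t\), \(U\), and \(W\) are all single numbers, the recursion for \(\partial s_t / \partial W\) holds exactly and can be checked against a finite difference. The sketch below makes that scalar simplification as an assumption. The lowercase `u` and `w`, the input values, and the function names are all hypothetical.

```python
import math

def forward_scalar(xs, u, w):
    """Scalar RNN: s_t = tanh(x_t * u + s_{t-1} * w), with s_{-1} = 0."""
    s_prev, states = 0.0, []
    for x in xs:
        s_prev = math.tanh(x * u + s_prev * w)
        states.append(s_prev)
    return states

def ds_dw(xs, u, w):
    """d s_t / d w for every t, using
    d s_t/d w = (1 - s_t^2) * (s_{t-1} + w * d s_{t-1}/d w)."""
    states = forward_scalar(xs, u, w)
    s_prev, d_prev, derivs = 0.0, 0.0, []
    for s_t in states:
        d_t = (1.0 - s_t ** 2) * (s_prev + w * d_prev)
        derivs.append(d_t)
        s_prev, d_prev = s_t, d_t
    return derivs

xs = [0.5, -1.0, 0.3, 0.8]   # arbitrary scalar inputs
u, w = 0.7, -0.4             # arbitrary scalar weights
analytic = ds_dw(xs, u, w)

# Central finite difference in w, for comparison.
h = 1e-6
numeric = [(a - b) / (2.0 * h)
           for a, b in zip(forward_scalar(xs, u, w + h),
                           forward_scalar(xs, u, w - h))]
```

The first derivative is zero because \(s_{-1} = 0\) does not depend on \(W\); later derivatives accumulate one extra term per time step, which is why the computation must walk back through the whole sequence.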
Formula
\[
\frac{\partial L_t}{\partial U}
= (\hat{y_t} - y_t) V (1 - s_t^2) \left( x_t + W \frac{\partial s_{t-1}}{\partial U} \right ) \\
\frac{\partial s_t}{\partial U}
= (1 - s_t^2) \left( x_t + W \frac{\partial s_{t-1}}{\partial U} \right )
\]
Proof
\[
\begin{align}
\frac{\partial s_t}{\partial U}
& = \frac{\partial (\tanh(x_tU + s_{t-1}W))}{\partial U} \\
& = (1 - s_t^2) \left( x_t + \frac{\partial (s_{t-1}W)}{\partial U} \right) \\
& = (1 - s_t^2) \left( x_t + W \frac{\partial s_{t-1}}{\partial U} \right) \\
\end{align} \\
\because s_{t-1} \text{ is itself a function of } U \text{, so the chain rule must be applied recursively.}
\]
\[ \begin{align} \frac{\partial L_t}{\partial U} & = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial U} \\ & = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} \frac{\partial z_t}{\partial U} \\ & = \frac{\partial L_t}{\partial \hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} \frac{\partial z_t}{\partial s_t} \frac{\partial s_t}{\partial U} \\ & = \frac{\partial L_t}{\partial z_t} \frac{\partial z_t}{\partial s_t} \frac{\partial s_t}{\partial U} \\ & = (\hat{y_t} - y_t) V \frac{\partial s_t}{\partial U} \end{align} \]
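The three derivatives can be combined into one backward sweep over the sequence, usually called backpropagation through time. The sketch below is one possible arrangement, not the tutorial's code. It assumes row vectors, one-hot inputs given as token indices, and toy sizes; all helper names are hypothetical. Instead of expanding the recursion forward, it carries \(\partial L / \partial s_t\) backward, which gives the same sums.

```python
import math
import random

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def vec_mat(v, M):
    """Row vector v times matrix M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def forward(tokens, U, V, W):
    s_prev, states, probs = [0.0] * len(W), [], []
    for tok in tokens:
        pre = [a + b for a, b in zip(U[tok], vec_mat(s_prev, W))]  # x_t U + s_{t-1} W
        s_prev = [math.tanh(p) for p in pre]
        states.append(s_prev)
        probs.append(softmax(vec_mat(s_prev, V)))
    return states, probs

def total_loss(tokens, targets, U, V, W):
    _, probs = forward(tokens, U, V, W)
    return -sum(math.log(p[y]) for p, y in zip(probs, targets))

def zeros_like(M):
    return [[0.0] * len(M[0]) for _ in M]

def bptt(tokens, targets, U, V, W):
    """Gradients of the summed loss with respect to U, V and W."""
    states, probs = forward(tokens, U, V, W)
    hidden = len(W)
    dU, dV, dW = zeros_like(U), zeros_like(V), zeros_like(W)
    carry = [0.0] * hidden                 # dL/ds_t arriving from later steps
    for t in reversed(range(len(tokens))):
        s_t = states[t]
        s_before = states[t - 1] if t > 0 else [0.0] * hidden
        dz = list(probs[t])
        dz[targets[t]] -= 1.0              # y_hat_t - y_t
        for i in range(hidden):
            for j in range(len(dz)):
                dV[i][j] += s_t[i] * dz[j]
        ds = [sum(dz[j] * V[i][j] for j in range(len(dz))) + carry[i]
              for i in range(hidden)]
        da = [ds[i] * (1.0 - s_t[i] ** 2) for i in range(hidden)]  # through tanh
        for j in range(hidden):
            dU[tokens[t]][j] += da[j]      # x_t is one-hot: only one row of U is used
            for i in range(hidden):
                dW[i][j] += s_before[i] * da[j]
        carry = [sum(da[j] * W[i][j] for j in range(hidden)) for i in range(hidden)]
    return dU, dV, dW

def numeric_grad(M, i, j, tokens, targets, U, V, W, h=1e-5):
    """Central finite difference for one entry of M (M is U, V or W)."""
    old = M[i][j]
    M[i][j] = old + h
    loss_plus = total_loss(tokens, targets, U, V, W)
    M[i][j] = old - h
    loss_minus = total_loss(tokens, targets, U, V, W)
    M[i][j] = old
    return (loss_plus - loss_minus) / (2.0 * h)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes and data (assumed): 5 tokens, 3 state dimensions.
random.seed(1)
U, V, W = rand_matrix(5, 3), rand_matrix(3, 5), rand_matrix(3, 3)
tokens, targets = [0, 3, 1, 4], [3, 1, 4, 2]
dU, dV, dW = bptt(tokens, targets, U, V, W)
```

Comparing `dW[i][j]` with `numeric_grad(W, i, j, tokens, targets, U, V, W)`, and likewise for `U` and `V`, is a quick way to catch indexing mistakes in a hand-written backward pass.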
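Suddenly it feels as if everything comes to nothing in the end.
RNNs have a vanishing gradient problem, which I have not studied closely. The main cause is the tanh activation function: when \(s_t\) saturates near \(\pm 1\), the factor \((1 - s_t^2)\) approaches 0. The gradient across many time steps is a product of such factors, so it shrinks toward zero and distant time steps contribute almost nothing to the partial derivatives.
This problem can be mitigated by using the ReLU activation function.
LSTM and GRU architectures provide another solution.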
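The shrinking product can be seen in a scalar tanh RNN, where the derivative of each state with respect to the previous one is \((1 - s_t^2) W\). The sketch below tracks the running product of these factors. The weights `u = 1.0` and `w = 0.5`, the constant input, and the function name are arbitrary assumptions chosen so that tanh saturates.

```python
import math

def gradient_magnitudes(num_steps, u=1.0, w=0.5, x=1.0):
    """Return |d s_t / d s_0| for t = 1..num_steps in a scalar tanh RNN."""
    s = math.tanh(x * u)                  # s_0, starting from s_{-1} = 0
    product, magnitudes = 1.0, []
    for _ in range(num_steps):
        s = math.tanh(x * u + s * w)
        product *= (1.0 - s ** 2) * w     # one factor per time step
        magnitudes.append(abs(product))
    return magnitudes

magnitudes = gradient_magnitudes(20)
```

With these values every factor is below 0.5 in magnitude, so the influence of the first state on later ones decays at least geometrically.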
Source: http://www.cnblogs.com/steven-yang/p/6407445.html