Arguably, temporal dependencies (at different time scales) are important when modeling music computationally. Recurrent neural networks (RNNs) seem a reasonable fit for modeling them, since RNNs can explicitly encode temporal dependencies. That is because their latent representation $\mathbf{h}_{(1)}^{(t)}$ depends not only on the current time-step $t$:
$\mathbf{h}_{(1)}^{(t)}=f(\mathbf{x}^{(t)}),$
but also on the previous time-step:
$\mathbf{h}_{(1)}^{(t)}=f(\mathbf{h}_{(1)}^{(t-1)},\mathbf{x}^{(t)}).$
For that reason, RNNs have historically been used to model time series. The traditional formulation of RNNs (also known as vanilla RNNs) is as follows:
$\mathbf{\hat{y}}^{(t)}= \sigma_{(1)}(\mathbf{W}_{(1)} \mathbf{h}_{(1)}^{(t)}+\mathbf{b}_{(1)})$
$\mathbf{h}_{(1)}^{(t)}=\sigma_{(0)}(\mathbf{W}_{(0)}\mathbf{x}^{(t)}+\mathbf{W}_{rec}\mathbf{h}_{(1)}^{(t-1)}+\mathbf{b}_{(0)})$,
where $\mathbf{W}_{rec} \in \Re^{d_{(l)} \times d_{(l)}}$ denotes the recurrent weights that explicitly encode temporal dependencies. Note that $\mathbf{\hat{y}}^{(t)}$ remains the same as for the MLP; only $\mathbf{h}_{(l)}^{(t)}$ changes to be recursive. Although for this example we only considered a single-layered RNN ($L=1$), one can stack several RNN layers ($L>1$) that can even be set to encode opposite temporal directions. For example, a bi-directional RNN considers representations from the past $\mathbf{h}^{(t-1)}$ and from the future $\mathbf{h}^{(t+1)}$.
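For concreteness, below is a minimal numpy sketch of this forward pass. The dimensions, the $\tanh$/sigmoid choices for $\sigma_{(0)}$/$\sigma_{(1)}$, and the random parameters are illustrative assumptions, not part of the formulation above.

```python
import numpy as np

def rnn_forward(x_seq, W0, W_rec, b0, W1, b1):
    """x_seq: array of shape (T, d_in); returns one prediction per time-step."""
    d_h = W_rec.shape[0]
    h = np.zeros(d_h)  # h^{(0)}: initial latent representation
    y_hat = []
    for x_t in x_seq:
        # h^{(t)} = sigma_0(W_0 x^{(t)} + W_rec h^{(t-1)} + b_0); here sigma_0 = tanh
        h = np.tanh(W0 @ x_t + W_rec @ h + b0)
        # y_hat^{(t)} = sigma_1(W_1 h^{(t)} + b_1); here sigma_1 = sigmoid
        y_hat.append(1.0 / (1.0 + np.exp(-(W1 @ h + b1))))
    return np.stack(y_hat)

# Usage with random parameters: 10 time-steps of 4-dimensional input.
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 8, 2, 10
y = rnn_forward(rng.normal(size=(T, d_in)),
                rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
                np.zeros(d_h), rng.normal(size=(d_out, d_h)), np.zeros(d_out))
print(y.shape)  # (10, 2): one prediction per time-step
```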
Note that adding the recurrent connection $\mathbf{W}_{rec}$ seems a promising step towards modeling the long-term temporal dependencies that are so relevant in music. However, this goal is still out of reach: unfortunately, RNNs have difficulties learning long-term dependencies due to the "vanishing/exploding gradient" problem.
Although it is out of our scope to discuss the "vanishing/exploding gradient" problem in depth, for our discussion it suffices to know that long-term dependencies are hardly reachable at time $t$ because latent representations at time $t-n$ are only accessible via a problematic path defined by $\textbf{W}_{rec}$. Note that to access $\mathbf{h}^{(t-n)}$ at time $t$ one needs to recursively go through ${\textbf{W}_{rec}}$ in a multiplicative fashion for $n$ steps: $\approx{\textbf{W}_{rec}}^n\mathbf{h}^{(t-n)}$. Consequently, if ${\textbf{W}_{rec}}$'s values are too small, ${\textbf{W}_{rec}}^n\mathbf{h}^{(t-n)}$ will vanish; and if ${\textbf{W}_{rec}}$'s values are too big, ${\textbf{W}_{rec}}^n\mathbf{h}^{(t-n)}$ will explode. Even though for this didactic simplification we assume that the learning phase already occurred, this problem actually takes place during training: the gradients used for guiding the learning are dominated by ${\textbf{W}_{rec}}^n$, and these can vanish or explode.
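One can simulate this effect numerically. The toy sketch below only mimics the repeated multiplication by $\textbf{W}_{rec}$ (not the actual gradient computation), and its dimensions and scales are arbitrary assumptions; it shows how the norm of ${\textbf{W}_{rec}}^n\mathbf{h}$ shrinks or blows up exponentially with $n$ depending on the scale of $\textbf{W}_{rec}$.

```python
import numpy as np

# Toy illustration of why W_rec^n is problematic: repeatedly multiplying
# by W_rec makes the result vanish or explode, depending on its scale.
rng = np.random.default_rng(1)
h = rng.normal(size=8)
W = rng.normal(size=(8, 8)) / np.sqrt(8)  # roughly unit-scale weights

for scale, label in [(0.5, "small W_rec -> vanishing"),
                     (1.5, "big W_rec   -> exploding")]:
    v = h.copy()
    for _ in range(50):  # n = 50 recursive steps through W_rec
        v = (scale * W) @ v
    print(label, "| norm after 50 steps:", np.linalg.norm(v))
```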
Following the previous discussion, vanilla RNNs struggle to learn long-term dependencies because they access past information via a problematic path defined by $\textbf{W}_{rec}$. A practical solution to this problem is to allow direct paths to the past. LSTMs, an RNN variant, explore this solution by defining a state representation ( ${\color{magenta}\textbf{s}^{(t)}}$) that explicitly carries information from the past ( ${\color{magenta}\textbf{s}^{(t-1)}}$).
LSTMs are defined as follows:
Input: ${\color{red}\textbf{i}^{(t)}}=\sigma(\mathbf{W} \mathbf{x}^{(t)} + \mathbf{W}_{rec} \mathbf{h}^{(t-1)} + \textbf{b})$
State: ${\color{magenta}\textbf{s}^{(t)}}={\color{blue}\textbf{g}_i^{(t)}}{\color{red}\textbf{i}^{(t)}}+{\color{blue}\textbf{g}_f^{(t)}}{\color{magenta}\textbf{s}^{(t-1)}}$
Output: $\textbf{h}^{(t)}=\sigma({\color{magenta}\textbf{s}^{(t)}}){\color{blue}\textbf{g}_o^{(t)}}$
While the path to the past coming from $\textbf{W}_{rec}\mathbf{h}^{(t-1)}$ (as defined in ${\color{red}\textbf{i}^{(t)}}$) is again multiplied by $\textbf{W}_{rec}$, the other path to the past ${\color{blue}\textbf{g}_f^{(t)}}{\color{magenta}\textbf{s}^{(t-1)}}$ (as defined in ${\color{magenta}\textbf{s}^{(t)}}$) is directly accessible: one only needs to pass through an element-wise sigmoidal gate ${\color{blue}\textbf{g}_f^{(t)}}$, with no multiplication by $\textbf{W}_{rec}$. We omitted the layer sub-indices $(l)$ for clarity, and the $\sigma(\cdot)$'s can be any non-linearity, although $\tanh(\cdot)$ is oftentimes used.
Note that the input representation ${\color{red}\textbf{i}^{(t)}}$ has the same form as in the traditional RNN; the state representation decides what is important to keep: "present" information ( ${\color{red}\textbf{i}^{(t)}}$) or "past" information ( ${\color{magenta}\textbf{s}^{(t-1)}}$); and the output representation controls what is visible at the output. This behavior is governed by a set of sigmoidal gates: ${\color{blue}\textbf{g}_i^{(t)}}$ (which reads the input), ${\color{blue}\textbf{g}_f^{(t)}}$ (which forgets the past), and ${\color{blue}\textbf{g}_o^{(t)}}$ (which writes to the output), defined as follows:
Gates: ${\color{blue}\textbf{g}_?^{(t)}}=\sigma(\textbf{b}_? + \mathbf{W}_? \mathbf{x}^{(t)} + \mathbf{W}_{rec_?} \mathbf{h}^{(t-1)})$,
where the $?$ sub-index is a placeholder for $i$ (standing for input), $f$ (standing for forget), and $o$ (standing for output). Hence, each gate has its own independent parameters (i.e., $\mathbf{W}_i \neq \mathbf{W}_f \neq \mathbf{W}_o$).
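To tie the pieces together, here is a minimal numpy sketch of one LSTM step implementing ${\color{red}\textbf{i}^{(t)}}$, ${\color{magenta}\textbf{s}^{(t)}}$, $\textbf{h}^{(t)}$, and the three gates. The parameter container `p`, the dimensions, and the choice of $\tanh$ for the generic $\sigma(\cdot)$'s are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM time-step; p holds W, W_rec, b plus W_?, W_rec_?, b_? per gate."""
    # Input representation i^{(t)}: same form as the vanilla RNN layer.
    i = np.tanh(p["W"] @ x_t + p["W_rec"] @ h_prev + p["b"])
    # Sigmoidal gates g_?^{(t)} for ? in {i, f, o}, each with its own parameters.
    g = {q: sigmoid(p["b_" + q] + p["W_" + q] @ x_t + p["W_rec_" + q] @ h_prev)
         for q in ("i", "f", "o")}
    # State s^{(t)}: element-wise mix of the present (i) and the past (s_prev).
    s = g["i"] * i + g["f"] * s_prev
    # Output h^{(t)}: the output gate controls what is visible from the state.
    h = np.tanh(s) * g["o"]
    return h, s

# Usage: unroll over 10 random time-steps with random parameters.
rng = np.random.default_rng(2)
d_in, d_h = 4, 8
p = {"W": rng.normal(size=(d_h, d_in)),
     "W_rec": rng.normal(size=(d_h, d_h)), "b": np.zeros(d_h)}
for q in ("i", "f", "o"):
    p["W_" + q] = rng.normal(size=(d_h, d_in))
    p["W_rec_" + q] = rng.normal(size=(d_h, d_h))
    p["b_" + q] = np.zeros(d_h)
h, s = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(10, d_in)):
    h, s = lstm_step(x_t, h, s, p)
print(h.shape, s.shape)  # (8,) (8,)
```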
Finally, it is important to remark that LSTMs (or similar gated recurrent models, like GRUs) are the RNNs that most practitioners use, because these are the ones that actually work in practice.
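Indeed, one rarely implements these cells by hand. As a sketch of common practice, PyTorch's built-in `nn.LSTM` provides stacked and bi-directional variants (as discussed above) out of the box; the sizes below are arbitrary.

```python
import torch
import torch.nn as nn

# 2 stacked LSTM layers, bi-directional (encoding past and future).
lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(1, 10, 4)     # (batch, time-steps, features)
out, (h_n, s_n) = lstm(x)     # h_n: last hidden states; s_n: last cell states
print(out.shape)              # (1, 10, 16): 2 directions x 8 hidden units
```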