Discussion: spectrogram-based deep learning

The building blocks of modern deep learning architectures are the multilayer perceptron (MLP), the convolutional neural network (CNN), the recurrent neural network (RNN), and attention. In the previous introductory sections, we have seen that MLPs compute predictions by considering their entire input, that CNNs preserve locality by computing filter-selective feature maps, that RNNs can capture short-term temporal dependencies, and that attention can capture long-term temporal dependencies. In the following, we further develop these ideas.

In the figures below, we illustrate all these intuitions graphically with a spectrogram example. Given that spectrogram inputs are widely used due to their good performance in several relevant music tasks (including auto-tagging, source separation, and transcription), we introduce this discussion to guide the reader through the process of designing a state-of-the-art artificial neural network.

Before all else, spectrograms need to be pre-processed and normalized so that artificial neural networks can deliver good performance. A common setup consists of using log-mel spectrograms and normalizing them to have zero mean and unit variance. The mel mapping reduces the size of the input by providing less frequency resolution to the perceptually less relevant parts of the spectrogram, and the logarithmic compression reduces the dynamic range of the input. Finally, the zero-mean and unit-variance normalization centers the data around zero, which facilitates learning.
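
As a minimal sketch of this pre-processing step (assuming librosa; the sample rate, FFT size, hop size and number of mel bands are illustrative choices, not prescribed here):

```python
import librosa
import numpy as np

# Hypothetical audio file and illustrative analysis parameters.
y, sr = librosa.load("song.wav", sr=16000)

# Mel mapping: fewer bins for the perceptually less relevant frequencies.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256, n_mels=96)

# Logarithmic compression reduces the dynamic range.
log_mel = librosa.power_to_db(mel)

# Zero-mean and unit-variance normalization centers the data around zero.
log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
```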

When an MLP is employed, all of its input is used to compute a weighted-average output. For the spectrogram example, for every output one needs to learn as many weights as there are spectrogram bins in the input. Consequently, for high-dimensional data like spectrograms, MLPs do not seem a good idea: the resulting model can easily become huge and one incurs the risk of overfitting. A common solution to this issue consists in using CNNs when processing high-dimensional data like spectrograms.
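
A back-of-the-envelope calculation makes this concrete (the input size of 96 mel bands × 187 frames and the layer sizes are illustrative assumptions):

```python
# One fully connected (MLP) layer: every output sees every spectrogram bin.
mel_bands, frames, hidden_units = 96, 187, 50
mlp_weights = mel_bands * frames * hidden_units
print(mlp_weights)        # 897,600 weights for a single MLP layer

# One CNN layer: the same filter weights are shared across time and frequency.
cnn_filters, filter_height, filter_width = 50, 96, 7
cnn_weights = cnn_filters * filter_height * filter_width
print(cnn_weights)        # 33,600 weights for a single CNN layer
```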

Figure: the black area represents the receptive field of a CNN filter.

The picture above depicts the receptive field of a vertical CNN filter capable of encoding timbral traces. The black area (the receptive field) can be interpreted as the number of parameters to be learned by this model. Hence, it is not difficult to see that the size of the model is significantly reduced when using CNNs (only the black area vs. all the spectrogram bins). A direct consequence of having a smaller model is that it may have more chances to generalize. CNN models are smaller because the convolution operation shifts CNN filters horizontally and vertically throughout the spectrogram. Consequently, for spectrogram-based CNNs, the filter weights are shared across time and frequency. The arrows in the figure above denote the horizontal and vertical shifts performed by the CNN. Since spectrogram-based CNNs convolve across time (horizontal shift) and frequency (vertical shift), they can capture representations that are time- and frequency-invariant by construction.
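
A minimal sketch of such a vertical filter (assuming PyTorch; the 96-band input and the filter shape spanning most of the frequency axis are illustrative choices):

```python
import torch
import torch.nn as nn

# One CNN layer whose filters span most of the frequency axis (vertical filters),
# so each filter can encode timbral traces. The convolution shifts every filter
# across time and across the remaining frequency positions, sharing its weights.
vertical_conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=(86, 7))

log_mel = torch.randn(1, 1, 96, 187)     # (batch, channel, mel bands, frames)
feature_maps = vertical_conv(log_mel)    # -> (1, 32, 11, 181) filter-selective maps
```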

Figure: the RNNs' capacity to model long-term dependencies decreases with time.

The previous figure graphically illustrates how RNNs are not able to learn long-term dependencies due to the "vanishing/exploding gradient" problem. We highlight this phenomenon with the black & green lines above the spectrogram, which showcase the amount of past & future information that could be employed by an RNN at time t (denoted by the red vertical line). Note that one can feed future information into RNNs by simply changing the temporal direction we look at. For example, by changing $\mathbf{h}^{(t-1)} \to \mathbf{h}^{(t+1)}$ as in the following RNN equations:

$\mathbf{h}_{(1)}^{(t)}=f(\mathbf{\color{red}h}_{(1)}^{\color{red}(t-1)},\mathbf{x}^{(t)}) \hspace{5mm} \to \hspace{5mm} \mathbf{h}_{(1)}^{(t)}=f(\mathbf{x}^{(t)},\mathbf{\color{red}h}_{(1)}^{\color{red}(t+1)}).$

$\hspace{-7mm}\mathbf{h}_{(1)}^{(t)}=\sigma_{(0)}(\mathbf{W}_{(0)}\mathbf{x}^{(t)}+\mathbf{W}_{rec}\mathbf{\color{red}h}_{(1)}^{\color{red}(t-1)}+\mathbf{b}_{(0)}) \hspace{2mm} \to \hspace{2mm} \mathbf{h}_{(1)}^{(t)}=\sigma_{(0)}(\mathbf{W}_{(0)}\mathbf{x}^{(t)}+\mathbf{W}_{rec}\mathbf{\color{red}h}_{(1)}^{\color{red}(t+1)}+\mathbf{b}_{(0)}).$
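
A minimal sketch of this idea in practice (assuming PyTorch; a bidirectional GRU runs one recurrence forward in time, using $\mathbf{h}^{(t-1)}$, and another backward, using $\mathbf{h}^{(t+1)}$; sizes are illustrative):

```python
import torch
import torch.nn as nn

# Bidirectional RNN over the time axis of a spectrogram-like input:
# the forward direction conditions on h^(t-1), the backward one on h^(t+1).
rnn = nn.GRU(input_size=96, hidden_size=64, batch_first=True, bidirectional=True)

x = torch.randn(1, 187, 96)   # (batch, frames, mel bands)
h, _ = rnn(x)                 # -> (1, 187, 128): forward + backward hidden states
```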


Figure: attention-based models can encode long-term dependencies from the future and the past.

In contrast, this last diagram shows that for a given time t an attention-based model can employ information from any time position along the spectrogram. This is depicted by the black line above the spectrogram, which shows that the attention weights $\alpha_{({\color{red}t},n)}$ can access information from any time position along the input spectrogram. Remember that the resulting representations of an attention layer are obtained via a weighted average (through time) that considers information from any time position:

$\textbf{h}_{(l)}^{(t)} = \sum_{n=0}^{T} \mathbf{\alpha}_{(t,n)} \textbf{h}^{(n)}_{(l-1)}$.

The weighted-average mechanism (through time, considering the $\alpha_{({\color{red}t},n)}$ values) employed by attention models enables a direct path through which long-term dependencies can flow. Remember that $\alpha_{({\color{red}t},n)}$ is computed considering a context that can be arbitrarily defined, for example: $f(\textbf{h}^{(t)}_{(l-1)}, \textbf{h}^{(n)}_{(l-1)})$. This means that the amount of attention $\alpha_{({\color{red}t},n)}$ allocated at time ${\color{red}t}$ is estimated considering the information available at time ${\color{red}t}$ and at time $n$, or at any other arbitrary timestep (either close to or far away from ${\color{red}t}$) that might be useful for the task at hand.
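
As a minimal sketch (NumPy, using scaled dot products as one possible choice for the context function $f$; the sizes are illustrative), the $\alpha_{(t,n)}$ values and the weighted average above could be computed as follows:

```python
import numpy as np

def attention_layer(h_prev):
    """h_prev: (T, d) array with the representations h_(l-1) for every time step."""
    scores = h_prev @ h_prev.T                        # f(h^(t), h^(n)) as dot products
    scores = scores / np.sqrt(h_prev.shape[1])        # scale for numerical stability
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over n: alpha_(t, n)
    return alpha @ h_prev                             # h_(l)^(t) = sum_n alpha_(t,n) h_(l-1)^(n)

h = attention_layer(np.random.randn(187, 64))         # every t can attend to any n
```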

Finally, we want to remark that the shapes of the black & green lines depicted above were invented just for didactic purposes.

Although in the previous paragraphs we assumed single-layer models processing spectrograms and discussed each architecture separately, these are normally stacked one on top of the other to construct deep models. For example, a common pipeline consists of (i) stacking several CNN layers to extract local features from spectrograms, (ii) using RNNs or attention to aggregate the CNN feature maps across time, and (iii) feeding the temporally aggregated signal to an MLP that computes the final prediction. Note that this deeper pipeline allows the model to learn musical characteristics at different time-scales, and to take advantage of the merits of each architecture.
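
A compact sketch of such a pipeline (assuming PyTorch; the layer sizes, the vertical filter shape and the mean-pooling over time are illustrative choices, not a prescribed recipe):

```python
import torch
import torch.nn as nn

class SpectrogramTagger(nn.Module):
    """(i) CNN local features -> (ii) RNN temporal aggregation -> (iii) MLP prediction."""
    def __init__(self, n_mels=96, n_tags=50):
        super().__init__()
        self.cnn = nn.Sequential(                        # (i) local (timbral) features
            nn.Conv2d(1, 32, kernel_size=(n_mels, 7)),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(32, 64, batch_first=True,      # (ii) temporal aggregation
                          bidirectional=True)
        self.mlp = nn.Sequential(                        # (iii) final prediction
            nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, n_tags),
        )

    def forward(self, log_mel):                  # log_mel: (batch, 1, n_mels, frames)
        z = self.cnn(log_mel)                    # -> (batch, 32, 1, frames')
        z = z.squeeze(2).transpose(1, 2)         # -> (batch, frames', 32)
        h, _ = self.rnn(z)                       # -> (batch, frames', 128)
        return self.mlp(h.mean(dim=1))           # temporal pooling + MLP prediction

tagger = SpectrogramTagger()
logits = tagger(torch.randn(1, 1, 96, 187))      # -> (1, 50) tag logits
```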