Deep learning architectures for audio classification: a personal (re)view

One can divide deep learning models into two parts: front-end and back-end (see Figure 1). The front-end is the part of the model that interacts with the input signal and maps it into a latent space, and the back-end predicts the output given the representation obtained by the front-end.

Figure 1 – Deep learning pipeline.
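To make this split concrete, below is a minimal sketch of the pipeline in Figure 1, written in PyTorch. The layer sizes, the spectrogram input and the number of classes are illustrative assumptions for this blog post, not the configuration of any particular paper.

```python
# Minimal front-end / back-end split (illustrative sizes, not a paper's model).
import torch
import torch.nn as nn

class FrontEnd(nn.Module):
    """Maps the input (here: a log-mel spectrogram patch) to a latent space."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(7, 7), padding=3),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )

    def forward(self, x):                 # x: (batch, 1, n_mels, n_frames)
        return self.conv(x)               # latent feature map

class BackEnd(nn.Module):
    """Turns the latent representation into class predictions."""
    def __init__(self, n_classes=50):     # 50 tags is an arbitrary example
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse time and frequency
        self.fc = nn.Linear(32, n_classes)

    def forward(self, z):
        z = self.pool(z).flatten(1)
        return self.fc(z)

model = nn.Sequential(FrontEnd(), BackEnd())
logits = model(torch.randn(4, 1, 96, 187))    # 4 random spectrogram patches
```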

In the following, we discuss the different front- and back-ends we identified in the audio classification literature.

Front-ends to process the audio input

Front-ends are generally based on convolutional neural networks (CNNs), since these can learn efficient representations by sharing weights (which constitute the learnt feature representations) along the signal. Figure 2 depicts six different CNN front-end paradigms, which can be divided into two groups depending on the input signal they use: waveforms or pre-processed waveforms (such as spectrograms). Further, the design of the filters can either be based on domain knowledge or not.

For example, one leverages domain knowledge when the frame-level single-shape front-end for waveforms is designed so that the length of the filters matches the window length of an STFT. Or, for a spectrogram front-end, one can use vertical filters to learn timbral representations or horizontal filters to learn longer temporal cues. Generally, a single filter shape is used in the first CNN layer, but some recent work reported performance gains when using several filter shapes in that first layer. Using many filters promotes richer feature extraction in the first layer and facilitates leveraging domain knowledge when designing the filters' shapes. For example: a frame-level many-shapes front-end for waveforms can be motivated from a multi-resolution time-frequency transform perspective (the Constant-Q Transform is an example of such a transform); or, since it is known that some patterns in spectrograms occur at different time-frequency scales, one can intuitively incorporate many vertical and/or horizontal filters in a spectrogram front-end. As seen, using domain knowledge when designing the models allows one to naturally connect the deep learning literature with previous relevant signal processing work.

On the other hand, when domain knowledge is not used, it is common to employ a stack of small filters, e.g.: 3×1 in the sample-level front-end for waveforms, or 3×3 in the small-rectangular filters front-end for spectrograms. These VGG-like models make minimal assumptions over the local stationarities of the signal, so that any structure can be learnt via hierarchically combining small-context representations. Note that audio structures are not confined to a single time-frequency scale; architectures with small filters are flexible models able to potentially learn any structure given enough depth and data.

Figure 2 – CNN front-ends for audio classification. Throughout the text, the different front-end classes are highlighted (underlined italics).
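As an illustration of the waveform paradigms above, the following hypothetical PyTorch sketch contrasts a frame-level single-shape front-end, whose long filters and hop-sized stride mimic an STFT analysis window, with a sample-level front-end built from a stack of small filters. Filter counts, kernel sizes and strides are assumptions for the example, not the values used in the cited works.

```python
# Two waveform front-ends from Figure 2, sketched with illustrative parameters.
import torch
import torch.nn as nn

# Frame-level single-shape: one layer of long filters with a hop-sized stride,
# mimicking an STFT analysis window (e.g. 512 samples, hop 256).
frame_level = nn.Sequential(
    nn.Conv1d(1, 128, kernel_size=512, stride=256),
    nn.ReLU(),
)

# Sample-level: small (3x1) filters stacked hierarchically, so structure is
# learnt by combining small-context representations.
def sample_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
        nn.MaxPool1d(3),
    )

sample_level = nn.Sequential(
    nn.Conv1d(1, 128, kernel_size=3, stride=3),    # strided first layer
    *[sample_block(128, 128) for _ in range(5)],   # deeper small-filter stack
)

waveform = torch.randn(4, 1, 59049)                # ~1.3 s at 44.1 kHz
print(frame_level(waveform).shape)                 # (4, 128, 229)
print(sample_level(waveform).shape)                # (4, 128, 81)
```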

Our experience: front-ends for music audio tagging

After experimenting with the different front-ends depicted in Figure 2 for the task of music audio tagging, we drew different conclusions for each considered input signal: waveforms and spectrograms.

For waveforms, we observed that the sample-level front-end was remarkably superior to the other waveform-based front-ends, as shown in the original paper. We also found that the frame-level many-shapes front-end performed better than the frame-level single-shape front-end. In other words:

sample-level >> frame-level many-shapes > frame-level single-shape.

For spectrograms, we found domain-knowledge intuitions to be valid guides for designing front-ends. For example, we observed that models based on many vertical and horizontal (musically motivated) filters were consistently superior to models based on a single vertical filter modeling timbre. In addition, the small-rectangular filters front-end achieved performance equivalent to the other front-ends when input segments were shorter than 10 s. But when considering models with longer inputs (which yielded better performance), the small-rectangular filters front-end became impractical, since one starts paying the computational cost of this deeper model: longer inputs mean larger feature maps and, therefore, more GPU memory consumption. For that reason we discarded the small-rectangular filters front-end: in practice, our 12 GB of VRAM were not enough. Note, then, that domain knowledge also provides guidance for minimizing the computational cost of the model.
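As a rough illustration of such a musically motivated spectrogram front-end, the sketch below combines vertical (timbral) and horizontal (temporal) filters in a single first layer. The filter shapes and counts are assumptions for the example and do not reproduce the architecture of our paper.

```python
# Spectrogram front-end with vertical and horizontal filters (illustrative).
import torch
import torch.nn as nn

class VerticalHorizontalFrontEnd(nn.Module):
    def __init__(self, n_mels=96):
        super().__init__()
        # Vertical filters: wide in frequency, narrow in time -> timbre.
        self.vertical = nn.Conv2d(1, 32, kernel_size=(int(0.9 * n_mels), 7),
                                  padding=(0, 3))
        # Horizontal filters: narrow in frequency, wide in time -> temporal cues.
        self.horizontal = nn.Conv2d(1, 32, kernel_size=(7, 165),
                                    padding=(3, 82))
        self.relu = nn.ReLU()

    def forward(self, x):                      # x: (batch, 1, n_mels, n_frames)
        v = self.relu(self.vertical(x))        # (batch, 32, f_v, n_frames)
        h = self.relu(self.horizontal(x))      # (batch, 32, n_mels, n_frames)
        # Max-pool across frequency so both branches share the same shape
        # and can be concatenated along the channel axis.
        v = torch.max(v, dim=2, keepdim=True).values
        h = torch.max(h, dim=2, keepdim=True).values
        return torch.cat([v, h], dim=1)        # (batch, 64, 1, n_frames)

front_end = VerticalHorizontalFrontEnd()
print(front_end(torch.randn(4, 1, 96, 187)).shape)   # torch.Size([4, 64, 1, 187])
```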

This discussion motivated the architectures of our ML4Audio@NIPS paper: End-to-end learning for music audio tagging at scale. Or, alternatively, see my previous blog post for a more interactive read!

Wait, but we want to go deep! Back-ends discussion

So far, the discussion has centered on finding the best way to approach the signal (with the front-end part). Now let's focus on how the deeper layers of the model (the back-end) can turn the latent space extracted by the front-end into useful predictions.

Many back-ends could be used, and among the different options we identified two main groups: (i) fixed-length input back-ends, and (ii) variable-length input back-ends. The former group (i) assumes that the input length of the model is kept constant; examples are feed-forward neural networks or fully-convolutional stacks. The second group (ii) can deal with different input lengths, since the model is flexible in at least one of its input dimensions; examples are back-ends using temporal-aggregation strategies such as max-pooling, average-pooling, attention models or recurrent neural networks. Interestingly, the generally convolutional nature of the front-end naturally accommodates different input lengths, and in that case the back-end adapts the resulting variable-length feature map to a fixed output size. Given that audio clips vary in length, these back-ends are ideal candidates for music/audio processing.
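As an example of the second group, the following sketch shows a variable-length input back-end that uses temporal max- and mean-pooling to collapse a feature map of arbitrary length into a fixed-size vector before the classifier. The layer sizes are illustrative assumptions.

```python
# Variable-length input back-end via temporal pooling (illustrative sizes).
import torch
import torch.nn as nn

class TemporalPoolingBackEnd(nn.Module):
    def __init__(self, n_features=64, n_classes=50):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * n_features, 128),   # max-pool + mean-pool, concatenated
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, z):                     # z: (batch, n_features, n_frames)
        pooled = torch.cat([z.max(dim=2).values, z.mean(dim=2)], dim=1)
        return self.classifier(pooled)

back_end = TemporalPoolingBackEnd()
short_clip = torch.randn(4, 64, 100)          # feature maps of different lengths
long_clip = torch.randn(4, 64, 1000)
print(back_end(short_clip).shape, back_end(long_clip).shape)   # both (4, 50)
```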

Unfortunately, our experience with the back-ends listed above is not as exhaustive as for the front-ends, so I cannot share findings from experimenting with the different back-ends we identified in the literature. However, recent publications on drums transcription and recent trends in the DCASE challenge clearly indicate that there is potential in variable-length input back-ends (which are more suitable for audio processing), although, currently, most methods rely on back-ends assuming fixed-length inputs.

Academic reference: Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann and Xavier Serra. End-to-end learning for music audio tagging at scale. In Proceedings of the Workshop on Machine Learning for Audio Signal Processing (ML4Audio) at NIPS, 2017.

Acknowledgements: Special thanks to Oriol Nieto for his valuable feedback on the text and its structure, and to Yann Bayle for his ideas to make Figure 2 more clear!