Deep end-to-end learning for music audio tagging at Pandora

TL;DR – Summary:

Machine listening is a research area where deep supervised learning is delivering promising advances. However, the lack of data tends to limit the outcomes of deep learning research – especially when dealing with end-to-end learning stacks that process raw data such as waveforms. In this study we train models with musical labels annotated for one million tracks, which provides novel insights into the audio tagging task, since the largest commonly used (academic) dataset is composed of ≈ 200k songs. This large amount of data allows us to freely explore different deep learning paradigms for the task of auto-tagging: from assumption-free models (using waveforms as input with very small convolutional filters) to models that rely on domain knowledge (log-mel spectrograms processed with a convolutional neural network designed to learn temporal and timbral features). Results suggest that, while spectrogram-based models surpass their waveform-based counterparts, the difference in performance shrinks as more data are employed.

We also compare our deep learning models with a traditional method based on feature design, namely the Gradient Boosted Trees (GBT) + features model. Results show that the proposed deep models are capable of outperforming the traditional method when trained with 1M tracks; however, they underperform the baseline when trained with only 100k tracks. This result aligns with the notion that deep learning models require large datasets to outperform strong (traditional) methods based on feature design.

Let’s see what our best performing model (a musically motivated convolutional neural network processing spectrograms) yields when fed with a J.S. Bach aria:

Top10: Human-labels
Female vocals, triple meter, acoustic, classical music, baroque period, lead vocals, string ensemble, major, compositional dominance of: lead vocals and melody.
Top10: Deep learning
Acoustic, string ensemble, classical music, baroque period, major, compositional dominance of: the arrangement, form, performance, rhythm and lead vocals.
Top10: Traditional method (GBT + features)
Acoustic, triple meter, string ensemble, classical music, baroque period, classic period, string solo, major, compositional dominance of: melody and form.

Italicized tags are not in the human-labels list – but these might still be valid tags.

At first glance, the models seem to do a decent job – let’s see how to do that!


The music audio tagging task consists of automatically estimating the musical attributes of a song. These attributes may include: moods, language of the lyrics, year of composition, genre(s), instruments, harmony traits, or rhythmic traits. Many approaches have been considered for this task (mostly based on feature extraction + model), with recent publications showing promising results using deep learning – which enables end-to-end learning pipelines. In this work we confirm this trend, and we study how different deep learning architectures scale with a large music collection. To this end, we compare our models with a traditional method based on feature extraction + model, and we train our models with one million tracks annotated by musicologists.

We divide deep learning models into two parts: front-end and back-end – see Figure 1. The front-end is the part of the model that interacts with the input signal in order to map it into a latent-space, and the back-end predicts the output given the representation obtained by the front-end. In the following, we discuss different front- and back-ends.

Figure 1 – Deep learning pipeline.

Experimental setup

We aim to study how superior/inferior waveform front-ends are when compared to spectrogram-based ones given the unprecedented amount of data used for this study – 1M tracks for training, 100k for validation, and 100k for test. In the following, we describe the models that performed the best in our preliminary experiments for the two considered inputs: waveforms and spectrograms. Experiments below share the same back-end, which enables a fair comparison among different front-ends. Implementation details for the presented models are available online.

Shared back-end. It consists of three CNN layers (with 512 filters each and two residual connections), two pooling layers and a dense layer – see Figure 2. We introduced residual connections in our model to explore very deep architectures, such that we can take advantage of the large amount of data available. Although adding more residual layers did not drastically improve our results, we observed that adding these residual connections stabilized learning while slightly improving performance. The 1D-CNN filters used are computationally efficient and shaped such that all extracted features are considered across a reasonable amount of temporal context (note the 7xM’ filter shapes, representing time x all features). We also make heavy use of temporal pooling: firstly, by down-sampling x2 the temporal dimensionality of the feature maps; and secondly, by making use of a global pooling layer with mean and max statistics – as a result, the proposed model allows for variable-length inputs. Finally, a dense layer with 500 units connects the pooled features to the output. The study of alternative temporal aggregation strategies (such as recurrent neural networks or attention) is left for future work.

Figure 2 – Back-end. M’ stands for the feature map’s vertical axis, BN for batch norm, and MP for max-pool.
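
To make the variable-length property of the back-end concrete, here is a minimal sketch of global pooling with mean and max statistics. It is not the implementation used in the study: the function name `global_mean_max_pool` and the toy feature maps are assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from statistics import fmean


def global_mean_max_pool(feature_map):
    """Pool a (time x channels) feature map over the whole time axis.

    Returns one fixed-size vector: the per-channel means followed by the
    per-channel maxima, regardless of how many time frames were given.
    """
    channels = list(zip(*feature_map))  # transpose to channel-major order
    means = [fmean(channel) for channel in channels]
    maxima = [max(channel) for channel in channels]
    return means + maxima


# Toy feature maps (hypothetical values): same channel count, different lengths.
short_clip = [[0.0, 1.0], [2.0, 3.0]]               # 2 frames, 2 channels
long_clip = [[0.0, 1.0], [2.0, 3.0], [4.0, -1.0]]   # 3 frames, 2 channels

pooled_short = global_mean_max_pool(short_clip)
pooled_long = global_mean_max_pool(long_clip)
print(pooled_short, pooled_long)
```

Both clips are mapped to a vector of length 2 x channels, which is why a dense layer can follow the pooling step even when the input duration changes.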

Waveform front-end. It is based on the sample-level front-end proposed by Lee et al., and is composed of a stack of 7 CNNs (3×1 filters), batch norm, and max pool layers – see Figure 3. Each layer has 64, 64, 64, 128, 128, 128 and 256 filters, respectively. By hierarchically combining small-context representations and making use of max pooling, the sample-level front-end yields a feature map for an audio segment of 15 seconds (down-sampled to 16kHz) which is further processed by the previously described back-end.

Figure 3 – Waveform front-end.
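
The following sketch traces how the time axis shrinks through such a stack. The sample rate, segment length, filter size and filter counts come from the text above. The pooling size of 3 and the length-preserving ("same") convolutions are assumptions, since neither is stated here, so the resulting numbers are only indicative.

```python
SAMPLE_RATE = 16_000                         # from the text: 16 kHz audio
SEGMENT_SECONDS = 15                         # from the text: 15-second segments
FILTERS = [64, 64, 64, 128, 128, 128, 256]   # from the text: filters per layer
KERNEL = 3                                   # from the text: 3x1 filters
POOL = 3                                     # ASSUMPTION: pool size/stride is not given


def trace_front_end(n_samples):
    """Return the (time steps, channels) shape after each layer and the
    receptive field (in samples) of one output frame."""
    length, jump, receptive_field = n_samples, 1, 1
    shapes = []
    for n_filters in FILTERS:
        receptive_field += (KERNEL - 1) * jump  # convolution (assumed "same" padding)
        receptive_field += (POOL - 1) * jump    # max-pool window
        jump *= POOL                            # pooling stride widens the step
        length //= POOL                         # pooling shortens the time axis
        shapes.append((length, n_filters))
    return shapes, receptive_field


shapes, receptive_field = trace_front_end(SAMPLE_RATE * SEGMENT_SECONDS)
for layer, shape in enumerate(shapes, start=1):
    print(f"layer {layer}: {shape[0]} time steps x {shape[1]} channels")
print(f"receptive field: {receptive_field} samples "
      f"({receptive_field / SAMPLE_RATE:.3f} s)")
```

Under these assumptions, each output frame summarizes roughly a quarter of a second of audio, built up hierarchically from three-sample filters.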

Spectrogram front-end. Firstly, audio segments are converted to log-mel magnitude spectrograms (15 seconds and 90 mel bins) and normalized to have zero mean and unit variance. Secondly, we propose using vertical and horizontal filters explicitly designed to facilitate learning the timbral and temporal patterns present in spectrograms. Note in Figure 4 that the proposed front-end is a single-layer CNN with many filter shapes that are grouped into two branches: (i) top branch – timbral features; and (ii) lower branch – temporal features. The top branch is designed to capture pitch-invariant timbral features that occur at different time-frequency scales in the spectrogram. Pitch invariance is enforced by allowing the CNN filters to convolve along the frequency axis, and by max-pooling the feature map across its vertical axis. Note that several filter shapes are used to efficiently capture many different time-frequency patterns, e.g.: kick-drums (with small rectangular filters capturing sub-band information for a short period of time), or string ensemble instruments (with long vertical filters capturing timbral patterns spread along the frequency axis). The lower branch is meant to learn temporal features, and it is designed to efficiently capture different time-scale representations by using several filter shapes. These CNN filters operate over an energy envelope (not directly over the spectrogram) obtained by mean-pooling the frequency axis of the spectrogram. By computing the energy envelope in this way, we consider high and low frequencies together while minimizing the computations of the model – note that no frequency/vertical convolutions are performed, only 1D convolutions are computed. Therefore, domain knowledge also provides guidance to minimize the computational cost of the model. The outputs of these two branches are merged, and the previously described back-end is used for going deep.

Figure 4 – A musically motivated CNN is our spectrogram front-end.
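
The two ideas behind this front-end can be sketched on toy data: the energy envelope used by the temporal branch, and the pitch invariance obtained by convolving along frequency and then max-pooling. This is a simplified illustration rather than the model itself: the helper names, the toy spectrogram and the filter values are all hypothetical, and a single hand-written filter stands in for learned ones.

```python
from statistics import fmean


def energy_envelope(spectrogram):
    """Mean-pool the frequency axis: one energy value per time frame."""
    return [fmean(frame) for frame in spectrogram]


def conv1d_valid(signal, kernel):
    """1D cross-correlation without padding."""
    k = len(kernel)
    return [sum(signal[t + i] * kernel[i] for i in range(k))
            for t in range(len(signal) - k + 1)]


# Temporal branch: 1D filters run over the envelope, not over the spectrogram.
toy_spectrogram = [            # 4 time frames x 3 frequency bins (made-up values)
    [1.0, 3.0, 2.0],
    [0.0, 0.0, 0.0],
    [4.0, 4.0, 4.0],
    [2.0, 0.0, 1.0],
]
envelope = energy_envelope(toy_spectrogram)
temporal_response = conv1d_valid(envelope, [1.0, -1.0])  # hypothetical filter

# Timbral branch: the same spectral shape at two different pitches gives the
# same value once the filter response is max-pooled across frequency.
vertical_filter = [1.0, 2.0, 1.0]                        # hypothetical filter
low_pitch_frame = [0.0, 5.0, 0.0, 0.0, 0.0]
high_pitch_frame = [0.0, 0.0, 0.0, 5.0, 0.0]
low_response = max(conv1d_valid(low_pitch_frame, vertical_filter))
high_response = max(conv1d_valid(high_pitch_frame, vertical_filter))
print(envelope, temporal_response, low_response, high_response)
```

The temporal branch only ever touches a 1D signal, which is where the computational saving mentioned above comes from.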

Additional settings: we use 50% dropout before every dense layer and ReLUs as non-linearities, and we train our models with stochastic gradient descent, minimizing the MSE with the Adam optimizer (initial learning rate of 0.001, batch size of 16). We optimize MSE instead of cross-entropy because part of our target tags (annotations) are not bi-modal. During training our data are converted to audio patches of 15 seconds, but during prediction one aims to consider the whole song. To this end, several predictions are computed for a song (with a moving window of 15 seconds) and then averaged. Although our model is capable of predicting tags for variable-length inputs, we use fixed-length patches since predicting the whole song at once yielded worse results than averaging several 15-second patch predictions. In future work we aim to further study this behavior, to better exploit the whole song during prediction.
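
The song-level prediction scheme described above can be sketched as follows. The hop between consecutive windows is not specified in the text, so `hop` is an assumption, and `toy_model` is a hypothetical stand-in for a trained network that maps a patch to a vector of tag scores.

```python
def split_into_patches(n_frames, patch_len, hop):
    """Start/end indices of fixed-length windows sliding over a song."""
    return [(start, start + patch_len)
            for start in range(0, n_frames - patch_len + 1, hop)]


def predict_song(song, patch_len, hop, predict_patch):
    """Average the per-patch tag predictions into one song-level prediction."""
    patches = split_into_patches(len(song), patch_len, hop)
    if not patches:
        raise ValueError("song is shorter than one patch")
    predictions = [predict_patch(song[start:end]) for start, end in patches]
    n_tags = len(predictions[0])
    return [sum(p[tag] for p in predictions) / len(predictions)
            for tag in range(n_tags)]


def toy_model(patch):
    """Hypothetical 'model' with two outputs: patch mean and patch maximum."""
    return [sum(patch) / len(patch), max(patch)]


# Toy "song" of 6 frames, cut into non-overlapping patches of 2 frames.
song = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
song_prediction = predict_song(song, patch_len=2, hop=2, predict_patch=toy_model)
print(song_prediction)
```

In the setting of this post, `patch_len` would correspond to 15 seconds of audio and `predict_patch` to the trained front-end plus back-end.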


As our baseline we set a system consisting of a music feature extractor (in essence: timbre, rhythm and harmony descriptors) and a model based on gradient boosted trees (GBT) for predicting each of the tags. By predicting each tag individually, one aims to turn a hard problem into multiple (hopefully simpler) problems. A careful inspection of our dataset reveals that, among tags, two different data distributions dominate the annotations: (i) tags with classifiable bi-modal distributions, where most of the annotations are zero; and (ii) tags with pseudo-uniform distributions that can be regressed. An example of a classification tag is any genre; and an example of a regression tag is ‘acoustic’ – which indicates how acoustic a song is (from zero to one, e.g.: zero being an electronic music song and one a string quartet). We use two sets of performance measurements: ROC-AUC and PR-AUC for the classification tags, and error (√MSE) for the regression tags – although note that the trained deep models learn to jointly predict both classification and regression tags via optimizing the MSE. Furthermore, note that ROC-AUC can lead to over-optimistic scores in cases where data are unbalanced. Given that classification tags are highly unbalanced, we also consider the PR-AUC metric since it is more indicative than ROC-AUC in these cases. We also depict the performance difference (Δ) between spectrogram and waveform models for every measurement. Results are presented in the following table:

The spectrogram-based model trained with 1M tracks achieves better results than the baseline in every measurement. However, the deep learning models (waveform and spectrogram) trained with 100k tracks performed worse than the baseline. This result confirms that deep learning models require large datasets to clearly outperform strong (traditional) methods based on feature design – although such large datasets are generally not available for most audio tasks. Moreover, the biggest performance improvement w.r.t. the baseline is seen for PR-AUC, which provides a more informative picture of the performance when the dataset is unbalanced. And finally, note that there is room for improving the proposed models – e.g., one could explicitly address the data imbalance problem during training, or improve the back-end by exploring alternative temporal aggregation strategies.

Also note that the waveform model achieves remarkable results. For most measurements it achieves better results than the baseline, yet worse than those of the spectrogram model. However, a closer inspection of the results reveals that the differences between spectrogram and waveform models shrink as more data are used for training – such differences (Δ) are halved when models are trained with more data.
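
To illustrate why PR-AUC is reported alongside ROC-AUC for unbalanced tags, the sketch below computes both on made-up data, together with the √MSE error used for regression tags. The data are invented for illustration and are unrelated to the results of the study. PR-AUC is estimated here with average precision, which is one common estimator and not necessarily the one used in the study.

```python
import math


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def rmse(targets, predictions):
    """Root of the mean squared error, for the regression tags."""
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up unbalanced tag: 2 positives among 20 tracks, ranked 3rd and 5th.
scores = [1.0 - 0.05 * i for i in range(20)]  # strictly decreasing scores
labels = [0] * 20
labels[2] = labels[4] = 1

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC: {auc:.3f}  average precision: {ap:.3f}")
```

On this toy tag the ROC-AUC looks comfortable while the average precision stays low, because the few positives are still outranked by several negatives.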

Qualitative results

Previously, we showed the Bach aria example to illustrate how capable the studied systems are: the proposed deep models, but also the baseline. In the following, we attach another example to further discuss our results. Let’s listen to some music from Kendrick Lamar: Complexion (A Zulu Love)!

Top10: Human-labels
English, male vocals, rap, East Coast, breathy vocal, joyful lyrics and compositional dominance of: lyrics, melody, rhythm, accompanying vocals.
Top10: Deep learning
English, lead vocals, male vocals, rap, accompanying vocals, danceable and compositional dominance of: accompanying vocals, lead vocals, rhythm, lyrics.
Top10: Traditional method (GBT + features)
East Coast, West Coast, rap, hardcore, lead vocals, funk, old school, drums, electronic drums, party.

Italicized tags are not in the human-labels list – but these might still be valid tags.

In these rap track estimations we observe that the traditional method predicts some unexpected tags. Firstly, we also see that the proposed deep learning model is biased towards predicting popular tags – such as ‘lead vocals’, ‘English’ or ‘male vocals’. Note that this is expected since we are not addressing the data imbalance issue during training.

Secondly, the baseline model (which predicts the probability of each tag with an independent GBT model) predicts mutually exclusive tags with high confidence. For example, it predicted with high scores: ‘East Coast’ and ‘West Coast’ for the East Coast rap; or ‘baroque period’ and ‘classic period’ for the Bach aria. However, the deep learning model (predicting the probability of all tags together) was able to better differentiate these similar but mutually exclusive tags. Following the same examples: ‘East Coast’ was predicted with roughly twice the confidence of ‘West Coast’ (0.23 and 0.12, respectively); and ‘baroque period’ is predicted with a high score while ‘classic period’ is out of the Top10 tags. This suggests that deep learning models have an advantage when compared to traditional approaches, since these mutually exclusive relations can be jointly encoded within the model.

And finally, note that the traditional method (GBT + features) positioned the ‘triple meter’ tag within the Top10 tags of the Bach aria, while the deep learning model did not. We speculate that this is because the (musical) features of the baseline approach can be very explicit about the definition of musical concepts, while deep learning has to learn these concepts from scratch, directly from data. Although the proposed spectrogram model is tailored towards learning temporal features, the learnt representations will be tempo-variant due to the nature of the filters – and therefore, it might be challenging to learn the ‘triple meter’ concept. However, interestingly, the deep learning model has predicted ‘triple meter’ with a score of 0.37, which is clearly far from 0.


The two proposed models are based on two conceptually different design principles. The first is based on a waveform front-end, and no domain knowledge inspired its design. Note that the assumptions of this model are reduced to a minimum: raw audio is set as input, and the CNN used makes minimal assumptions about the structure of the data due to its set of very small filters. For the second model, with a spectrogram front-end, we make use of domain knowledge to guide the model’s design – and our best performing model was designed following this strategy. The proposed models are capable of outperforming the baseline based on a traditional method, and our results indicate that spectrogram-based architectures are still superior to waveform-based models. However, the gap between waveform-based and spectrogram-based models is reduced when training with more data.

Github code:
Reference: Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann and Xavier Serra. End-to-end learning for music audio tagging at scale. In proceedings of the Workshop on Machine Learning for Audio Signal Processing (ML4Audio) at NIPS, 2017.
Acknowledgements: I want to thank Pandora for allowing me to publish these preliminary results I got during my summer internship – especially Oriol Nieto (my mentor and close collaborator) and Matthew Prockup, Erik Schmidt and Andreas Ehmann (for sharing their knowledge about the baseline and data).