ISMIR article: End-to-end learning for music audio tagging at scale

Our accepted ISMIR paper on music auto-tagging at scale is now online – read it on arXiv, and listen to our demo!

TL;DR:
1) Given enough training data, waveform models (SampleCNN) outperform spectrogram models (musically motivated CNNs).
2) But spectrogram models outperform waveform models when sizable training data are not available.
3) Musically motivated CNNs achieve state-of-the-art results for the MTT & MSD datasets.

Abstract. The lack of data tends to limit the outcomes of deep learning research, particularly when dealing with end-to-end learning stacks processing raw data such as waveforms. In this study, 1.2M tracks annotated with musical labels are available to train our end-to-end models. This large amount of data allows us to freely explore two different design paradigms for music auto-tagging: assumption-free models – using waveforms as input with very small convolutional filters; and models that rely on domain knowledge – log-mel spectrograms processed by a convolutional neural network designed to learn timbral and temporal features. Our work focuses on how these two types of deep architectures perform when datasets of variable size are available for training: the MagnaTagATune dataset (25k songs), the Million Song Dataset (240k songs), and a private dataset of 1.2M songs. Our experiments suggest that music domain assumptions are relevant when not enough training data are available, and show how waveform-based models outperform spectrogram-based ones in large-scale data scenarios.
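To give an intuition for the assumption-free (waveform) front end, here is a minimal numpy sketch of a sample-level stack of small strided convolutions. It assumes a SampleCNN-style setup – filter size 3, stride 3, single channel – with dummy filter weights; the real models in the repo are multi-channel and trained end-to-end, so this only illustrates how the temporal resolution shrinks by 3x per layer.

```python
import numpy as np

def conv1d_valid(x, w, stride):
    """Plain strided 1D convolution (single channel, 'valid' padding)."""
    n = (len(x) - len(w)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(w)], w) for i in range(n)])

# SampleCNN-style front end: stacked filter-size-3, stride-3 convolutions.
x = np.random.randn(3 ** 10)  # 59049 raw audio samples (a few seconds of audio)
w = np.ones(3) / 3.0          # dummy filter; learned in the actual model

lengths = [len(x)]
for _ in range(10):
    x = np.maximum(conv1d_valid(x, w, stride=3), 0.0)  # conv + ReLU
    lengths.append(len(x))

# Temporal resolution shrinks by 3x per layer: 59049 -> 19683 -> ... -> 1
print(lengths)
```

In contrast, the domain-knowledge models replace this stack of tiny filters with a fixed log-mel spectrogram front end and vertical/horizontal filters shaped to capture timbre and temporal patterns.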

Reproduce our results! Here is a GitHub link with the TensorFlow implementation of our models. To reproduce our results on the public datasets, you also need to download the data – some useful links: for the MagnaTagATune dataset (mirg.city.ac.uk/codeapps/the-magnatagatune-dataset and github.com/keunwoochoi/magnatagatune-list), and for the Million Song Dataset (github.com/jongpillee/music_dataset_split).

Acknowledgments. This work was partially done during my internship at Pandora (summer 2017). Part of the writing and the experiments with public data were supported by the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502) – and the Universitat Pompeu Fabra is grateful for the GPUs donated by NVIDIA.