Currently, successful neural network audio classifiers use log-mel spectrograms as input. Given a mel-spectrogram matrix X, the logarithmic compression is computed as follows:
f(X) = log(α·X + β).
Common pairs of (α,β) are (1, eps) or (10000,1). In this post we investigate the possibility of learning (α,β). To this end, we study two log-mel spectrogram variants:
- Log-learn: The logarithmic compression of the mel spectrogram X is optimized via SGD together with the rest of the parameters of the model. We use exponential and softplus gates to control the pace of α and β, respectively. We set the initial pre-gate values to 7 and 1, which results in out-of-gate α and β initial values of 1096.63 and 1.31, respectively.
- Log-EPS: As a baseline, we use a log-mel spectrogram that does not learn the logarithmic compression: (α,β) are fixed to (1, eps). Note that eps stands for “machine epsilon”, a very small number.
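To make the gating concrete, here is a minimal sketch in plain Python of how the pre-gate values map to (α,β) and how the compression is then applied. It is not our training code: the function names (`softplus`, `gated_params`, `log_compress`) and the toy 2×2 spectrogram are made up for illustration, and using the float64 machine epsilon for Log-EPS is an assumption. In a real model the two pre-gate values would be trainable parameters updated by SGD.

```python
import math
import sys


def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def gated_params(alpha_pre, beta_pre):
    """Map unconstrained pre-gate values to strictly positive (alpha, beta)."""
    alpha = math.exp(alpha_pre)   # exponential gate
    beta = softplus(beta_pre)     # softplus gate
    return alpha, beta


def log_compress(mel, alpha, beta):
    """Apply f(X) = log(alpha * X + beta) element-wise to a 2-D list."""
    return [[math.log(alpha * v + beta) for v in row] for row in mel]


# Log-learn initialisation: pre-gate values 7 and 1.
alpha, beta = gated_params(7.0, 1.0)
print(round(alpha, 2), round(beta, 2))  # 1096.63 1.31

# Toy mel spectrogram (hypothetical values, non-negative energies).
mel = [[0.0, 0.5], [1.0, 2.0]]
log_learn_init = log_compress(mel, alpha, beta)

# Log-EPS baseline: (alpha, beta) fixed to (1, eps); float64 eps is assumed here.
log_eps = log_compress(mel, 1.0, sys.float_info.epsilon)
```

Both gates keep α and β strictly positive whatever value the optimizer gives the pre-gate parameters, so the argument of the logarithm stays positive for non-negative mel energies.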
TL;DR: We are publishing a negative result: log-learn did not improve our results! 🙂
Last summer, I was a research intern at Telefónica Research (Barcelona). This article is the outcome of that short (but intense!) collaboration with Joan Serrà, in which we explore how to train deep learning models with just 1, 2 or 10 audio examples per class. Check it out on arXiv, and reproduce our results by running our code!
This last year I have been collaborating with Francesc Lluís, a master's student in our research group who worked on “A Wavenet for Music Source Separation”. For more info about our investigation, you can read his thesis or our arXiv paper. Code and some separations are also available for you!
Our accepted ISMIR paper on music auto-tagging at scale is now online – read it on arXiv, and listen to our demo!
1) Given that enough training data is available: waveform models (sampleCNN) > spectrogram models (musically motivated CNN).
2) But spectrogram models > waveform models when no sizable dataset is available.
3) Musically motivated CNNs achieve state-of-the-art results for the MTT & MSD datasets.
A few weeks ago Olga Slizovskaya and I were invited to give a talk at the Centre for Digital Music (C4DM) @ Queen Mary University of London, one of the most renowned music technology research institutions in Europe, and possibly in the world. It’s been an honor and a pleasure to share our thoughts (and some beers) with you!
Download the slides!
The talk centered on our recent work on music audio tagging, which is available on arXiv, where we study how non-trained (randomly weighted) convolutional neural networks perform as feature extractors for (music) audio classification tasks.