Learning the logarithmic compression of the mel spectrogram

Currently, successful neural network audio classifiers use log-mel spectrograms as input. Given a mel-spectrogram matrix X, the logarithmic compression is computed as follows:

f(x) = log(α·X + β).

Common pairs of (α,β) are (1, eps) or (10000,1). In this post we investigate the possibility of learning (α,β). To this end, we study two log-mel spectrogram variants:

  • Log-learn: The logarithmic compression of the mel spectrogram X is optimized via SGD together with the rest of the parameters of the model. We use exponential and softplus gates to control the pace of α and β, respectively. We set the initial pre-gate values to 7 and 1, what results in out-of-gate α and β initial values of 1096.63 and 1.31, respectively.
  • Log-EPS: We set as baseline a log-mel spectrogram which does not learn the logarithmic compression. (α,β) are set to (1, eps). Note eps stands for “machine epsilon”, a very small number.

TL;DR: We are publishing a negative result,
log-learn did not improve our results! 🙂

Continue reading

arXiv paper & slides: Randomly weighted CNNs for (music) audio classification

A few weeks ago Olga Slizovskaya and I were invited to give a talk to the Centre for Digital Music (C4DM) @ Queen Mary Universtity of London – one of the most renowned music technology research institutions in Europe, and possibly in the world. It’s been an honor, and a pleasure to share our thoughts (and some beers) with you!

Download the slides!

The talk was centered in our recent work on music audio tagging, which is available on arXiv, where we study how non-trained (randomly weighted) convolutional neural networks perform as feature extractors for (music) audio classification tasks.

Slides: Deep learning architectures for music audio classification: a personal (re)view

I was invited to give a talk to the Deep Learning for Speech and Language Winter Seminar at the UPC in Barcelona. Since UPC is the university where I did my undergraduate studies, it was a great pleasure to give a talk there!

Download the slides!

The talk was centered in my recent work on music audio tagging, which is available on arXiv and is summarized in these previous posts: deep learning architectures for music audio classification, and deep end-to-end learning for music audio tagging at Pandora.

Thanks to @DocXavi for the picture!