Last summer I was a research intern at Telefónica Research (Barcelona). The article “Training neural audio classifiers with few data” is the outcome of this short (but intense!) collaboration with Joan Serrà, where we explored how to train deep learning models with just 1, 2 or 10 audio examples per class. Check it out on arXiv, and reproduce our results by running our code! These slides are the extended version of what I will be presenting next week at ICASSP! See you in Brighton 🙂
In this series of posts I have written a couple of articles discussing the pros & cons of spectrogram-based VGG architectures, reflecting on the role of computer vision deep learning architectures in the audio field. Now it’s time to discuss what’s up with waveform-based VGGs!
- Post I: Why do spectrogram-based VGGs suck?
- Post II: Why do spectrogram-based VGGs rock?
- Post III: What’s up with waveform-based VGGs? [this post]
Me: VGGs suck because they are computationally inefficient, and because they are a naive adoption of a computer vision architecture.
Random person on the Internet: Jordi, you might be wrong. People use VGGs a lot!
Currently, successful neural network audio classifiers use log-mel spectrograms as input. Given a mel-spectrogram matrix X, the logarithmic compression is computed as follows:
f(X) = log(α·X + β).
Common pairs of (α,β) are (1, eps) or (10000, 1). In this post we investigate the possibility of learning (α,β). To this end, we study two log-mel spectrogram variants:
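To make the compression concrete, here is a minimal NumPy sketch of f(X) = log(α·X + β) applied to a toy mel-spectrogram matrix (the values of X are made up for illustration; this is not the code from the paper):

```python
import numpy as np

def log_compress(X, alpha, beta):
    # f(X) = log(alpha * X + beta), applied element-wise
    return np.log(alpha * X + beta)

# a toy mel-spectrogram matrix (hypothetical values)
X = np.array([[0.0, 0.5],
              [1.0, 2.0]])

# the two common (alpha, beta) settings mentioned above
log_eps = log_compress(X, 1.0, np.finfo(np.float32).eps)  # (1, eps)
log_10k = log_compress(X, 10000.0, 1.0)                   # (10000, 1)
```

Note how the (10000, 1) setting strongly amplifies small spectrogram values before the log, while the (1, eps) setting is essentially a plain log with numerical stabilization.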
- Log-learn: The logarithmic compression of the mel spectrogram X is optimized via SGD together with the rest of the parameters of the model. We use exponential and softplus gates to control the pace of α and β, respectively. We set the initial pre-gate values to 7 and 1, which results in out-of-gate initial values of α = 1096.63 and β = 1.31, respectively.
- Log-EPS: As a baseline, we use a log-mel spectrogram that does not learn the logarithmic compression: (α,β) are set to (1, eps). Note eps stands for “machine epsilon”, a very small number.
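The gating in log-learn can be sketched as follows: SGD updates the unconstrained pre-gate parameters, and the gates map them to the (α,β) actually used inside f(X) = log(α·X + β). This NumPy sketch only illustrates the mapping with the initial values from the text; it is not the training code:

```python
import numpy as np

def softplus(x):
    # softplus(x) = log(1 + e^x), a smooth way to keep beta positive
    return np.log1p(np.exp(x))

# pre-gate values (these are what SGD would actually optimize)
pre_alpha, pre_beta = 7.0, 1.0

# out-of-gate values fed into f(X) = log(alpha * X + beta)
alpha = np.exp(pre_alpha)   # exponential gate -> alpha ≈ 1096.63
beta = softplus(pre_beta)   # softplus gate    -> beta  ≈ 1.31
```

Both gates keep α and β strictly positive, so the argument of the log stays well-defined throughout training.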
TL;DR: We are publishing a negative result:
log-learn did not improve our results! 🙂