Our accepted ISMIR paper on music auto-tagging at scale is now online – read it on arXiv, and listen to our demo!
1) Given that enough training data is available: waveform models (sampleCNN) > spectrogram models (musically motivated CNN).
2) But spectrogram models > waveform models when no sizable data are available.
3) Musically motivated CNNs achieve state-of-the-art results for the MTT & MSD datasets.
After attending the Google Speech Summit 2018, I would venture to say that Google’s speech interests for the future are: (i) to continue improving their automatic speech recognition (w/ Listen, Attend and Spell, a seq2seq model) and speech synthesis (w/ Tacotron 2 + WaveNet/WaveRNN) systems so that a robust interface is available for their conversational agent; (ii) to keep simplifying pipelines – having fewer “separated” blocks in order to be end-to-end whenever possible; (iii) to better control some aspects of their end-to-end models – for example, with style tokens they aim to control some Tacotron (synthesis) parameters; and (iv) to put a lot of effort into building the Google Assistant, a conversational agent that I guess will be the basis of their next generation of products.
The following lines aim to summarize (by topic) what I found relevant – and, ideally, describe some details that are not in the papers.
This year’s ICASSP keywords are: generative adversarial networks (GANs), WaveNet, speech enhancement, source separation, industry, music transcription, cover song identification, sampleCNN, monophonic pitch tracking, and gated/dilated CNNs. This time, passionate scientific discussions happened in random sports bars in downtown Calgary – next to dirty, melting snow piles.
This has been my first ISMIR ever, and I am thrilled to be part of this amazing community. It was fun to put faces (and height, and weight) to these names I respect so much!
All awarded papers were amazing, and these are definitely on my list of highlights:
- Choi et al. – every time I re-read this paper I am more impressed by the effort they put into assessing the generalization capabilities of deep learning models. This work sets a high evaluation standard for those working on deep auto-tagging models!
- Bittner et al. propose a fully-convolutional model for tracking f0 contours in polyphonic music. The article has a brilliant introduction drawing parallels between their proposed fully-convolutional architecture and previous traditional models – making clear that it is worth building bridges between deep learning works and previous signal processing literature.
- Oramas et al. – deep learning makes it easy to combine information from many sources, such as audio, text or images. They do so by combining representations extracted from audio-spectrograms, word-embeddings and ImageNet-based features. Moreover, they released a new dataset: MuMu, with 147,295 songs belonging to 31,471 albums.
- Jansson et al.’s work proposes a U-net model for singing voice separation. Adding connections between encoder and decoder layers at the same hierarchical level, for reconstructing masked audio signals, seems to be a good idea, since several papers have already reported good results with this setup.
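To make the skip-connection idea in the U-net bullet concrete, here is a toy sketch in NumPy. It is not Jansson et al.'s architecture: the averaging/repeating layers stand in for strided and transposed convolutions, the weights are random and untrained, and every name (`downsample`, `upsample`, `tiny_unet_mask`) is hypothetical. It only shows how encoder outputs are concatenated with decoder inputs at the same level before a soft mask is applied to the input spectrogram.

```python
import numpy as np

def downsample(x):
    """Halve the time axis by averaging adjacent frames (stand-in for a strided conv)."""
    return 0.5 * (x[:, 0::2] + x[:, 1::2])

def upsample(x):
    """Double the time axis by repeating frames (stand-in for a transposed conv)."""
    return np.repeat(x, 2, axis=1)

def tiny_unet_mask(spec, rng):
    """Two-level toy U-net returning a soft mask with the same shape as `spec`.

    `spec` is a (freq, time) magnitude spectrogram; time must be divisible by 4.
    Weights are random (untrained): this illustrates the wiring only.
    """
    n_freq = spec.shape[0]
    # Encoder: keep each level's output so the decoder can reuse it.
    e1 = spec                     # (freq, T)
    e2 = downsample(e1)           # (freq, T/2)
    bottleneck = downsample(e2)   # (freq, T/4)

    # Decoder level 2: upsample, then concatenate the same-level encoder output.
    w2 = 0.1 * rng.standard_normal((n_freq, 2 * n_freq))
    d2 = np.tanh(w2 @ np.concatenate([upsample(bottleneck), e2], axis=0))

    # Decoder level 1: same pattern, skip connection from e1.
    w1 = 0.1 * rng.standard_normal((n_freq, 2 * n_freq))
    logits = w1 @ np.concatenate([upsample(d2), e1], axis=0)

    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps the mask in (0, 1)

rng = np.random.default_rng(0)
spec = rng.random((8, 16))          # fake mixture spectrogram (assumption)
mask = tiny_unet_mask(spec, rng)
vocals = mask * spec                # masked reconstruction of the target source
```

The skip connections let the decoder see the fine time-frequency detail that the downsampling path discards, which is plausibly why this wiring helps when the output has to line up bin by bin with the input.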
But there were many other inspiring papers.
The signal processing community is very into machine learning. Although I am not sure of the implications of this fact, this intersection has already produced very interesting results – such as Smaragdis et al.’s work. Lots of papers related to deep learning were presented. Although in many cases people were naively applying DNNs or LSTMs to a new problem, there was also (of course) amazing work with inspiring ideas – I highlight some:
- Koizumi et al. propose using reinforcement learning for source separation. This work shows how reinforcement learning can be used for audio signal processing.
- Ewert et al. propose a variant of dropout that induces models to learn specific structures by using information from weak labels.
- Ting-Wei et al. propose doing frame-level predictions with a fully convolutional model that also uses Gaussian kernel filters (first introduced by them) trained with clip-level annotations in a weakly-supervised learning setup.
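The weakly-supervised setup in the last bullet (frame-level predictions, clip-level labels) can be sketched roughly as follows. This is a generic illustration, not Ting-Wei et al.'s model: it leaves out their Gaussian kernel filters, uses a per-frame linear layer (equivalent to a 1x1 convolution over time) with random weights, and assumes max pooling over time to obtain the clip-level prediction. All function names are hypothetical.

```python
import numpy as np

def frame_probs(features, weights, bias):
    """Per-frame tag probabilities from (T, D) features.

    A linear layer applied independently to every frame is a 1x1 convolution
    over time, so it works for clips of any length.
    """
    logits = features @ weights + bias        # (T, n_tags)
    return 1.0 / (1.0 + np.exp(-logits))

def clip_probs(frame_p):
    """Pool frame-level probabilities into one prediction per tag.

    Max pooling is an assumption here: a tag counts as present in the clip
    if it is detected in at least one frame.
    """
    return frame_p.max(axis=0)                # (n_tags,)

def clip_bce(clip_p, clip_labels, eps=1e-7):
    """Binary cross-entropy computed only against clip-level labels."""
    p = np.clip(clip_p, eps, 1.0 - eps)
    return float(-np.mean(clip_labels * np.log(p)
                          + (1.0 - clip_labels) * np.log(1.0 - p)))

rng = np.random.default_rng(0)
features = rng.standard_normal((100, 16))     # 100 frames, 16-dim features (fake data)
weights = 0.1 * rng.standard_normal((16, 3))  # 3 tags, untrained weights
bias = np.zeros(3)
labels = np.array([1.0, 0.0, 1.0])            # weak labels: one per tag for the whole clip

frames = frame_probs(features, weights, bias) # available at test time for localization
clip = clip_probs(frames)
loss = clip_bce(clip, labels)                 # the only training signal
```

The loss never sees frame-level annotations, yet the `frames` matrix is still there at inference time, which is what makes this kind of setup attractive when only clip-level tags exist.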