Our accepted ISMIR paper on music auto-tagging at scale is now online – read it on arXiv, and listen to our demo!
1) Given that enough training data is available: waveform models (sampleCNN) > spectrogram models (musically motivated CNN).
2) But spectrogram models > waveform models when sizable training data is not available.
3) Musically motivated CNNs achieve state-of-the-art results for the MTT & MSD datasets.
A few weeks ago Olga Slizovskaya and I were invited to give a talk at the Centre for Digital Music (C4DM) @ Queen Mary University of London – one of the most renowned music technology research institutions in Europe, and possibly in the world. It was an honor and a pleasure to share our thoughts (and some beers) with you!
Download the slides!
The talk centered on our recent work on music audio tagging, which is available on arXiv, where we study how non-trained (randomly weighted) convolutional neural networks perform as feature extractors for (music) audio classification tasks.
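To make the idea of a non-trained feature extractor concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the setup from the paper: the filter count, filter length, synthetic sine "audio", and the nearest-centroid classifier are all hypothetical choices made for illustration. The only point it shows is that the convolutional filters are drawn at random and never updated, and that only the classifier on top is fit to the data.

```python
import math
import random


def random_filters(num_filters, filter_len, rng):
    """Draw 1D filters from a Gaussian. They are never trained."""
    return [[rng.gauss(0.0, 1.0) for _ in range(filter_len)]
            for _ in range(num_filters)]


def extract_features(signal, filters):
    """Valid 1D convolution, ReLU, then global max-pooling: one value per filter."""
    features = []
    for f in filters:
        best = 0.0  # starting at 0 applies the ReLU floor to the pooled value
        for start in range(len(signal) - len(f) + 1):
            act = sum(w * signal[start + k] for k, w in enumerate(f))
            best = max(best, act)
        features.append(best)
    return features


def fit_centroids(features, labels):
    """Stand-in classifier (an assumption): per-class mean of the feature vectors."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)))


def sine(freq, length, rng):
    """Toy 'audio': a sine with a random phase (hypothetical data)."""
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [math.sin(2.0 * math.pi * freq * n / length + phase)
            for n in range(length)]


rng = random.Random(0)
filters = random_filters(num_filters=16, filter_len=8, rng=rng)

# Two toy classes that differ only in frequency.
train = [(sine(f, 64, rng), label)
         for label, f in (("low", 2), ("high", 12))
         for _ in range(10)]
train_features = [extract_features(s, filters) for s, _ in train]
centroids = fit_centroids(train_features, [y for _, y in train])

test_signal = sine(12, 64, rng)
test_features = extract_features(test_signal, filters)
print(predict(centroids, test_features))
```

Swapping the nearest-centroid step for any other classifier leaves the extractor untouched, which is what makes randomly weighted networks a cheap way to compare architectures.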
After attending the Google Speech Summit 2018, I would venture to say that Google’s speech interests for the future are: (i) to continue improving their automatic speech recognition (w/ Listen, Attend and Spell, a seq2seq model) and speech synthesis (w/ Tacotron 2 + Wavenet/WaveRNN) systems so that a robust interface is available for their conversational agent; (ii) to keep simplifying pipelines – having fewer “separated” blocks in order to be end-to-end whenever possible; (iii) to better control some aspects of their end-to-end models – for example, with style tokens they aim to control some Tacotron (synthesis) parameters; and (iv) to keep putting lots of effort into building the Google Assistant, a conversational agent that I guess will be the basis of their next generation of products.
The following lines aim to summarize (by topics) what I found relevant – and, ideally, describe some details that are not in the papers.
This year’s ICASSP keywords are: generative adversarial networks (GANs), wavenet, speech enhancement, source separation, industry, music transcription, cover song identification, sampleCNN, monophonic pitch tracking, and gated/dilated CNNs. This time, passionate scientific discussions happened in random sports bars in downtown Calgary – next to dirty snow piles that were melting.
Extreme Learning Machines (ELMs) are very controversial and very fast machine learning models that perform very well. Of course, very is in italics because what the word means depends on your background or application field. However, this sentence gives an idea of what ELMs can deliver – and why they might be interesting for an audio community that rarely uses them.