This post was written in collaboration with, and sponsored by, Exxact (@Exxactcorp).
Many things have happened between the pioneering papers written by Lewis and Todd in the 80s and the current wave of GAN composers. Along that journey, the connectionists’ work was forgotten during the AI winter, very influential names (like Schmidhuber or Ng) contributed seminal publications and, in the meantime, researchers have made tons of awesome progress.
I won’t be going through every single paper in the field of neural networks for music, nor diving into technicalities, but I’ll cover the milestones that helped shape the current state of music AI – this being a nice excuse to give credit to those wild researchers who decided to care about a signal that is nothing else but cool. Let’s start!
Here is my first personal AMA interview! But wait, what’s an AMA interview? AMA stands for “Ask Me Anything” in Reddit jargon. After reading this interview you will know a bit more about my life and way of thinking 🙂 This interview is a dissemination effort by the María de Maeztu program (which funds my PhD research) and the AI Grant (which supports our Freesound Datasets project). Let’s start!
Our accepted ISMIR paper on music auto-tagging at scale is now online – read it on arXiv, and listen to our demo!
1) Given that enough training data is available: waveform models (sampleCNN) > spectrogram models (musically motivated CNN).
2) But spectrogram models > waveform models when no sizable data are available.
3) Musically motivated CNNs achieve state-of-the-art results for the MTT & MSD datasets.
A few weeks ago Olga Slizovskaya and I were invited to give a talk at the Centre for Digital Music (C4DM) @ Queen Mary University of London – one of the most renowned music technology research institutions in Europe, and possibly in the world. It’s been an honor and a pleasure to share our thoughts (and some beers) with you!
Download the slides!
The talk was centered on our recent work on music audio tagging, which is available on arXiv, where we study how non-trained (randomly weighted) convolutional neural networks perform as feature extractors for (music) audio classification tasks.
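To make the idea of a non-trained feature extractor more concrete, here is a minimal NumPy sketch. It is not the architecture from the paper: the function name `random_cnn_features`, the filter shape, the number of filters, and the max/mean pooling are all assumptions chosen for illustration. It only shows the general recipe: draw convolutional filters at random, never train them, and use the pooled activations as features for a separate classifier.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def random_cnn_features(spec, n_filters=16, filter_shape=(8, 4), seed=0):
    """Hypothetical helper: features from a single randomly weighted conv layer.

    spec is a 2D array shaped (n_freq_bins, n_time_frames), e.g. a spectrogram.
    The filters are sampled once from a Gaussian and never trained.
    """
    rng = np.random.default_rng(seed)
    filters = rng.standard_normal((n_filters, *filter_shape))

    # All (freq, time) patches of the input, shape (F', T', fh, fw).
    windows = sliding_window_view(spec, filter_shape)

    # Valid 2D cross-correlation of every filter with every patch.
    activations = np.einsum("ftij,kij->kft", windows, filters)

    # ReLU non-linearity.
    activations = np.maximum(activations, 0.0)

    # Global max and mean pooling give one fixed-size vector per input.
    max_pool = activations.max(axis=(1, 2))
    mean_pool = activations.mean(axis=(1, 2))
    return np.concatenate([max_pool, mean_pool])


# Toy input standing in for a spectrogram (random values, not real audio).
spec = np.random.default_rng(42).random((40, 100))
features = random_cnn_features(spec)
print(features.shape)
```

In this setup only the classifier that sits on top of `features` would be trained; the convolutional weights stay at their random initial values.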
After attending the Google Speech Summit 2018, I can venture to say that Google’s speech interests for the future are: (i) to continue improving their automatic speech recognition (w/ Listen, Attend and Spell, a seq2seq model) and speech synthesis (w/ Tacotron 2 + Wavenet/WaveRNN) systems so that a robust interface is available for their conversational agent; (ii) to keep simplifying pipelines – having fewer “separated” blocks in order to be end-to-end whenever possible; (iii) to study how to better control some aspects of their end-to-end models – for example, with style tokens they aim to control some Tacotron (synthesis) parameters; and (iv) to build the Google Assistant, a conversational agent that is receiving a lot of effort and that I guess will be the basis of their next generation of products.
The following lines aim to summarize (by topics) what I found relevant – and, ideally, describe some details that are not in the papers.