Post written in collaboration with, and sponsored by, Exxact (@Exxactcorp).
Many things have happened between the pioneering papers written by Lewis and Todd in the 80s and the current wave of GAN composers. Along that journey, connectionists’ work was forgotten during the AI winter, very influential names (like Schmidhuber or Ng) contributed seminal publications and, in the meantime, researchers have made tons of awesome progress.
I won’t go through every single paper in the field of neural networks for music, nor dive into technicalities, but I’ll cover the milestones that helped shape the current state of music AI – a nice excuse to give credit to these wild researchers who decided to care about a signal that is nothing else but cool. Let’s start!
Our accepted ISMIR paper on music auto-tagging at scale is now online – read it on arXiv, and listen to our demo!
1) Given that enough training data is available: waveform models (sampleCNN) > spectrogram models (musically motivated CNN).
2) But spectrogram models > waveform models when sizable data are not available.
3) Musically motivated CNNs achieve state-of-the-art results for the MTT & MSD datasets.
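The distinction behind findings 1 and 2 is the input representation each model family consumes. As a minimal sketch (all sample rates, frame sizes, and shapes below are illustrative choices, not the ones used in the paper), a waveform model like sampleCNN ingests the raw samples, while a spectrogram model first maps the audio to a time-frequency representation:

```python
import numpy as np

# Hypothetical 1-second audio clip at 16 kHz (values are random placeholders).
sr = 16000
waveform = np.random.randn(sr).astype(np.float32)

# Waveform models (e.g. sampleCNN) consume the raw samples directly:
waveform_input = waveform[np.newaxis, :]  # shape: (1, 16000)

# Spectrogram models consume a time-frequency representation instead.
# Minimal magnitude STFT via framing + FFT (no external DSP library assumed):
frame, hop = 512, 256
n_frames = 1 + (len(waveform) - frame) // hop
frames = np.stack([waveform[i * hop : i * hop + frame] for i in range(n_frames)])
spectrogram = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))

print(waveform_input.shape)  # (1, 16000)
print(spectrogram.shape)     # (61, 257): time frames x frequency bins
```

The practical consequence of the findings: with lots of data, letting the network learn its own representation from samples pays off; with less data, the inductive bias baked into the spectrogram (and into musically motivated filter shapes) helps.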
This post aims to share our experience setting up our deep learning server – thanks to NVIDIA for the two Titan X Pascal GPUs, and thanks to the Maria de Maeztu Research Program for the machine! 🙂 The text is divided into two parts: bringing the pieces together, and installing TensorFlow. Let’s start!
I was invited to give a talk at the Deep Learning for Speech and Language Winter Seminar at UPC in Barcelona. Since UPC is the university where I did my undergraduate studies, it was a great pleasure to give a talk there!
Download the slides!
The talk was centered on my recent work on music audio tagging, which is available on arXiv and is summarized in these previous posts: deep learning architectures for music audio classification, and deep end-to-end learning for music audio tagging at Pandora.
Thanks to @DocXavi for the picture!
One can divide deep learning models into two parts: the front-end and the back-end – see Figure 1. The front-end is the part of the model that interacts with the input signal in order to map it into a latent space, and the back-end predicts the output given the representation obtained by the front-end.
Figure 1 – Deep learning pipeline.
In the following, we discuss the different front- and back-ends we identified in the audio classification literature.
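The front-end/back-end division can be sketched as two composed functions. This is a toy sketch, not the paper's architecture: the layer choices, shapes, and the names `front_end`/`back_end` are illustrative assumptions (a single dense layer stands in for a CNN front-end, and mean pooling plus a sigmoid layer stands in for a tagging back-end):

```python
import numpy as np

rng = np.random.default_rng(0)

def front_end(spectrogram, W):
    """Maps the input signal into a latent space (here: one dense layer + ReLU)."""
    return np.maximum(spectrogram @ W, 0.0)

def back_end(latent, V):
    """Predicts the output (tag scores) from the front-end's representation."""
    logits = latent.mean(axis=0) @ V       # temporal pooling, then a linear layer
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid for multi-label tagging

# Illustrative shapes: 61 time frames x 257 frequency bins, 64-d latent, 10 tags.
x = rng.standard_normal((61, 257))
W = rng.standard_normal((257, 64)) * 0.01
V = rng.standard_normal((64, 10)) * 0.01

tags = back_end(front_end(x, W), V)
print(tags.shape)  # (10,): one score per tag, each in (0, 1)
```

The point of the division is modularity: front-ends can be swapped (waveform vs. spectrogram, small vs. musically motivated filters) while keeping the same back-end, which is what makes the design space easy to compare.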