Our accepted ISMIR paper on music auto-tagging at scale is now online – read it on arXiv, and listen to our demo!
TL;DR: 1) Given that enough training data is available: waveform models (sampleCNN) > spectrogram models (musically motivated CNN). 2) But spectrogram models > waveform models when no sizable data are available. 3) Musically motivated CNNs achieve state-of-the-art results for the MTT & MSD datasets.
A few weeks ago Olga Slizovskaya and I were invited to give a talk to the Centre for Digital Music (C4DM) @ Queen Mary Universtity of London – one of the most renowned music technology research institutions in Europe, and possibly in the world. It’s been an honor, and a pleasure to share our thoughts (and some beers) with you!
The talk was centered in our recent work on music audio tagging, which is available on arXiv, where we study how non-trained (randomly weighted) convolutional neural networks perform as feature extractors for (music) audio classification tasks.
These last months have been very intense for us – and, as a result, three papers were recently uploaded to arXiv. Two of those have been accepted for presentation in ISMIR, and are the result of a collaboration with Rong – who is an amazing PhD student (also advised by Xavier) working on Jingju music:
Our EUSIPCO 2017 paper got accepted! This paper was done in collaboration with Olga Slizovskaia, Rong Gong, Emilia Gómez and Xavier Serra. And it is entitled: “Timbre Analysis of Music Audio Signals with Convolutional Neural Networks”.
And I have been awarded with one of the AI Grants given by Nat Friedman for creating a dataset of sounds from Freesound and using it in my research. The AI grants are an initiative of Nat Friedman, Cofounder/CEO of Xamarin, to support open-source AI projects. The project I proposed is part of an initiative of the MTG to promote the use of Freesound.org for research. The goal is to create a large dataset of sounds, following the same principles as Imagenet – in order to make audio AI more accessible to everyone. The project will contribute in developing an infrastructure to organize a crowdsource tool to convert Freesound into a research dataset. The following video presents the aforementioned project:
Abstract. The focus of this work is to study how to efficiently tailor Convolutional Neural Networks (CNNs) towards learning timbre representations from log-mel magnitude spectrograms. We first review the trends when designing CNN architectures. Through this literature overview we discuss which are the crucial points to consider for efficiently learning timbre representations using CNNs. From this discussion we propose a design strategy meant to capture the relevant time-frequency contexts for learning timbre, which permits using domain knowledge for designing architectures. In addition, one of our main goals is to design efficient CNN architectures – what reduces the risk of these models to over-fit, since CNNs’ number of parameters is minimized. Several architectures based on the design principles we propose are successfully assessed for different research tasks related to timbre: singing voice phoneme classification, musical instrument recognition and music auto-tagging.