These last months have been very intense for us – and, as a result, three papers were recently uploaded to arXiv. Two of those have been accepted for presentation in ISMIR, and are the result of a collaboration with Rong – who is an amazing PhD student (also advised by Xavier) working on Jingju music:
Our EUSIPCO 2017 paper got accepted! This paper was done in collaboration with Olga Slizovskaia, Rong Gong, Emilia Gómez and Xavier Serra. And it is entitled: “Timbre Analysis of Music Audio Signals with Convolutional Neural Networks”.
And I have been awarded with one of the AI Grants given by Nat Friedman for creating a dataset of sounds from Freesound and using it in my research. The AI grants are an initiative of Nat Friedman, Cofounder/CEO of Xamarin, to support open-source AI projects. The project I proposed is part of an initiative of the MTG to promote the use of Freesound.org for research. The goal is to create a large dataset of sounds, following the same principles as Imagenet – in order to make audio AI more accessible to everyone. The project will contribute in developing an infrastructure to organize a crowdsource tool to convert Freesound into a research dataset. The following video presents the aforementioned project:
Abstract. The focus of this work is to study how to efficiently tailor Convolutional Neural Networks (CNNs) towards learning timbre representations from log-mel magnitude spectrograms. We first review the trends when designing CNN architectures. Through this literature overview we discuss which are the crucial points to consider for efficiently learning timbre representations using CNNs. From this discussion we propose a design strategy meant to capture the relevant time-frequency contexts for learning timbre, which permits using domain knowledge for designing architectures. In addition, one of our main goals is to design efficient CNN architectures – what reduces the risk of these models to over-fit, since CNNs’ number of parameters is minimized. Several architectures based on the design principles we propose are successfully assessed for different research tasks related to timbre: singing voice phoneme classification, musical instrument recognition and music auto-tagging.
The signal processing community is very into machine learning. Although I am not sure of the implications of this fact, this intersection already produced very interesting results – such as Smaragdis et al.’s work. Lots of papers related to deep learning were presented. Although in many cases people were naively applying DNN or LSTMs to a new problem, there also was (of course) amazing work with inspiring ideas – I highlight some:
Koizumi et al. propose using reinforcement learning for source separation. This work introduces how to use reinforcement learning for audio signal processing.
Ewert et al. propose using a variant of dropout that can be used to induce models to learn specific structures by using information from weak labels.
Ting-Wei et al. propose doing frame-level predictions with a fully convolutional model that also uses gaussian kernel filters (first introduced by them) trained with clip-level annotations in a weakly-supervised learning setup.