CNNs | Jordi Pons

Slides: A Wavenet for Speech Denoising

By Jordi Pons in CNNs, Deep learning, Slides August 23, 2017

These last weeks we have been disseminating our recent work: “A Wavenet for Speech Denoising”. To this end, I gave two talks in the San Francisco Bay Area: one at Dolby Laboratories and the other at Pandora Radio, where I am currently doing an internship.

Here are my slides.

Dario (coauthor of the paper) also gave a talk at the Technical University of Munich, and I am excited to share his slides with you, since they have fantastic and very clarifying figures!

Here is Dario’s slide deck.

Hopefully, checking our complementary views will help folks better understand our work.

Three new arXiv articles

By Jordi Pons in CNNs, Deep learning, Paper is out, Results July 13, 2017

These last months have been very intense for us – and, as a result, three papers were recently uploaded to arXiv. Two of them have been accepted for presentation at ISMIR, and are the result of a collaboration with Rong – who is an amazing PhD student (also advised by Xavier) working on Jingju music:

The third paper was done in collaboration with Dario (an excellent master’s student!), who was interested in using deep learning models operating directly on the audio:

A Wavenet for Speech Denoising [code][audio examples]

AI Grant and EUSIPCO paper accepted!

By Jordi Pons in CNNs, Datasets, Deep learning, Paper is out June 1, 2017

Our EUSIPCO 2017 paper got accepted! This paper was done in collaboration with Olga Slizovskaia, Rong Gong, Emilia Gómez and Xavier Serra, and it is entitled “Timbre Analysis of Music Audio Signals with Convolutional Neural Networks”.

Paper blogpost with further details!

Link to the paper!

I have also been awarded one of the AI Grants for creating a dataset of sounds from Freesound and using it in my research. The AI Grants are an initiative of Nat Friedman, cofounder/CEO of Xamarin, to support open-source AI projects. The project I proposed is part of an MTG initiative to promote the use of Freesound.org for research. The goal is to create a large dataset of sounds, following the same principles as ImageNet – in order to make audio AI more accessible to everyone. The project will contribute to developing the infrastructure for a crowdsourcing tool that converts Freesound into a research dataset. The following video presents the project:

arXiv article: Timbre Analysis of Music Audio Signals with Convolutional Neural Networks

By Jordi Pons in CNNs, Deep learning, Paper is out March 21, 2017

Abstract. The focus of this work is to study how to efficiently tailor Convolutional Neural Networks (CNNs) towards learning timbre representations from log-mel magnitude spectrograms. We first review the trends when designing CNN architectures. Through this literature overview we discuss the crucial points to consider for efficiently learning timbre representations using CNNs. From this discussion we propose a design strategy meant to capture the relevant time-frequency contexts for learning timbre, which permits using domain knowledge for designing architectures. In addition, one of our main goals is to design efficient CNN architectures – which reduces the risk of these models over-fitting, since the CNNs’ number of parameters is minimized. Several architectures based on the design principles we propose are successfully assessed for different research tasks related to timbre: singing voice phoneme classification, musical instrument recognition and music auto-tagging.
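To make the link between filter shape and model size concrete, here is a minimal sketch that counts the trainable parameters of a single 2D convolutional layer for a few first-layer filter shapes. The function name `conv2d_params`, the filter shapes and the number of output channels are illustrative assumptions, not the architectures evaluated in the paper.

```python
def conv2d_params(filter_freq, filter_time, in_channels, out_channels, bias=True):
    """Trainable parameters of one 2D conv layer: one weight per filter cell,
    input channel and output channel, plus one bias per output channel."""
    weights = filter_freq * filter_time * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Hypothetical first-layer filters for a single-channel log-mel spectrogram.
# Shapes are (frequency bins, time frames); the values are made up for illustration.
candidates = {
    "small square 3x3": (3, 3),
    "temporal 1x60": (1, 60),
    "frequency 40x1": (40, 1),
    "time-frequency 40x7": (40, 7),
}

for name, (n_freq, n_time) in candidates.items():
    n_params = conv2d_params(n_freq, n_time, in_channels=1, out_channels=32)
    print(f"{name}: {n_params} parameters")
```

The count grows with the area of the filter, so a filter that spans a large time-frequency context costs many more parameters than one that spans only time or only frequency. That trade-off is what a design strategy of this kind has to balance.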

Link to the paper!

Conference paper: Designing efficient architectures for modeling temporal features with CNNs

By Jordi Pons in CNNs, Deep learning, Paper is out, Results December 15, 2016

Abstract – Many researchers use convolutional neural networks with small rectangular filters for music (spectrograms) classification. First, we discuss why there is no reason to use this filter setup by default and second, we point out that more efficient architectures could be implemented if the characteristics of the music features are considered during the design process. Specifically, we propose a novel design strategy that might promote more expressive and intuitive deep learning architectures by efficiently exploiting the representational capacity of the first layer – using different filter shapes adapted to fit musical concepts within the first layer. The proposed architectures are assessed by measuring their accuracy in predicting the classes of the Ballroom dataset. We also make available the code used (together with the audio-data) so that this research is fully reproducible.
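As a rough illustration of what "different filter shapes" means in practice, the sketch below applies a filter that spans only time and a filter that spans only frequency to a toy spectrogram. It uses a naive "valid" cross-correlation (the operation CNN layers compute) in pure Python. The helper `correlate2d_valid`, the toy data and the two filters are assumptions made for this example and are not taken from the paper's code.

```python
def correlate2d_valid(spec, kernel):
    """Naive 'valid' 2D cross-correlation.
    Both arguments are lists of rows, laid out as frequency bins x time frames."""
    k_freq, k_time = len(kernel), len(kernel[0])
    out_freq = len(spec) - k_freq + 1
    out_time = len(spec[0]) - k_time + 1
    return [
        [
            sum(
                spec[i + a][j + b] * kernel[a][b]
                for a in range(k_freq)
                for b in range(k_time)
            )
            for j in range(out_time)
        ]
        for i in range(out_freq)
    ]


# Toy 4x6 "spectrogram" (frequency bins x time frames); values are made up.
spec = [
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

temporal_filter = [[-1, 1]]    # 1x2: spans time only, reacts to energy changes
frequency_filter = [[1], [1]]  # 2x1: spans frequency only, sums adjacent bins

onsets = correlate2d_valid(spec, temporal_filter)
bands = correlate2d_valid(spec, frequency_filter)

print(len(onsets), "x", len(onsets[0]))  # temporal filter keeps all frequency bins
print(len(bands), "x", len(bands[0]))    # frequency filter keeps all time frames
```

The temporal filter responds where energy appears or disappears within a bin and ignores the sustained bottom row, while the frequency filter responds to energy spread across neighbouring bins. Each shape is sensitive to a different kind of structure in the same input.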