Deep learning architectures for audio classification: a personal (re)view

One can divide deep learning models into two parts: front-end and back-end – see Figure 1. The front-end is the part of the model that interacts with the input signal in order to map it into a latent space, and the back-end predicts the output given the representation produced by the front-end.

Figure 1 – Deep learning pipeline.
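To make the split concrete, here is a minimal NumPy sketch of the two stages. It is only an illustration under assumed choices: the function names, the random (untrained) weights, the filter length, the hop size and the mean-pooling back-end are all hypothetical, and are not taken from any particular model in the literature.

```python
import numpy as np

rng = np.random.default_rng(0)

def front_end(waveform, filters, hop):
    """Map a 1-D signal into a latent space: strided 1-D filters followed by a ReLU."""
    width = filters.shape[1]
    n_frames = 1 + (len(waveform) - width) // hop
    # Slice the signal into overlapping frames of `width` samples.
    frames = np.stack([waveform[i * hop : i * hop + width] for i in range(n_frames)])
    return np.maximum(frames @ filters.T, 0.0)  # shape: (n_frames, n_filters)

def back_end(latent, weights, bias):
    """Predict class probabilities from the latent representation."""
    pooled = latent.mean(axis=0)           # summarize over time
    logits = weights @ pooled + bias       # linear classifier
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()

# Hypothetical setup: 1 s of noise at 16 kHz, 8 filters of 64 samples, 3 classes.
waveform = rng.standard_normal(16000)
filters = rng.standard_normal((8, 64)) * 0.1
weights = rng.standard_normal((3, 8)) * 0.1
bias = np.zeros(3)

latent = front_end(waveform, filters, hop=32)
probs = back_end(latent, weights, bias)
print(latent.shape, probs)
```

In a real model both stages would be trained jointly; the point here is only that the front-end owns everything that touches the input signal, while the back-end sees nothing but the latent representation.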

In the following, we discuss the different front- and back-ends we identified in the audio classification literature.

Deep end-to-end learning for music audio tagging at Pandora

TL;DR – Summary:

Machine listening is a research area where deep supervised learning is delivering promising advances. However, the lack of data tends to limit the outcomes of deep learning research – especially when dealing with end-to-end learning stacks processing raw data such as waveforms. In this study we train models with musical labels annotated for one million tracks, which provides novel insights into the audio tagging task, since the largest commonly used (academic) dataset is composed of ≈ 200k songs. This large amount of data allows us to freely explore different deep learning paradigms for the task of auto-tagging: from assumption-free models – using waveforms as input with very small convolutional filters – to models that rely on domain knowledge – log-mel spectrograms processed with a convolutional neural network designed to learn temporal and timbral features. Results suggest that, while spectrogram-based models surpass their waveform-based counterparts, the difference in performance shrinks as more data are employed.
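As a rough illustration of the two kinds of input discussed above, the following NumPy sketch computes a log-mel spectrogram and, separately, applies a single very short filter directly to the waveform. Every setting here is an assumption made for the example (16 kHz sample rate, 512-point FFT, hop of 256, 40 mel bands, a fixed 3-tap kernel) and not the configuration used in the study; in a trained waveform model the small filters would be learned rather than fixed.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(waveform, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Domain-knowledge input: windowed FFT power, mel projection, log compression."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack(
        [waveform[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, n_fft // 2 + 1)
    return np.log1p(power @ mel_filterbank(n_mels, n_fft, sr).T)  # (n_frames, n_mels)

def small_filter_conv(waveform, kernel):
    """Assumption-free input: one convolution with a very short filter on raw samples."""
    return np.convolve(waveform, kernel, mode="valid")

# Hypothetical test signal: one second of a 440 Hz sine.
sr = 16000
t = np.arange(sr) / sr
waveform = np.sin(2 * np.pi * 440.0 * t)

spec_input = log_mel_spectrogram(waveform, sr=sr)
wave_features = small_filter_conv(waveform, np.array([0.25, 0.5, 0.25]))
print(spec_input.shape, wave_features.shape)
```

The spectrogram path bakes in assumptions about what matters in audio (frequency resolution, perceptual scaling), whereas the waveform path leaves almost everything to be learned, which is one way to read why it benefits more from additional data.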

We also compare our deep learning models with a traditional method based on feature design, namely the Gradient Boosted Trees (GBT) + features model. Results show that the proposed deep models outperform the traditional method when trained with 1M tracks; however, they underperform this baseline when trained with only 100K tracks. This result aligns with the notion that deep learning models require large datasets to outperform strong (traditional) methods based on feature design.

Let’s see what our best performing model (a musically motivated convolutional neural network processing spectrograms) yields when fed with a J.S. Bach aria:

Top10: Human-labels
Female vocals, triple meter, acoustic, classical music, baroque period, lead vocals, string ensemble, major, compositional dominance of: lead vocals and melody.
Top10: Deep learning
Acoustic, string ensemble, classical music, baroque period, major, compositional dominance of: the arrangement, form, performance, rhythm and lead vocals.

Slides: A Wavenet for Speech Denoising

These last weeks we have been disseminating our recent work: “A Wavenet for Speech Denoising”. To this end, I gave two talks in the San Francisco Bay Area: one at Dolby Laboratories and the other at Pandora Radio, where I am currently doing an internship.

Here are my slides.

Dario (coauthor of the paper) also gave a talk at the Technical University of Munich, and I am excited to share his slides with you, since they have fantastic and very clarifying figures!

Here is Dario’s slide deck.

Hopefully, checking our complementary views will help folks better understand our work.

Three new arXiv articles

These last months have been very intense for us – and, as a result, three papers were recently uploaded to arXiv. Two of those have been accepted for presentation at ISMIR, and are the result of a collaboration with Rong – who is an amazing PhD student (also advised by Xavier) working on Jingju music:

The third paper was done in collaboration with Dario (an excellent master’s student!), who was interested in using deep learning models operating directly on the audio: