Deep learning architectures for audio classification: a personal (re)view

One can divide deep learning models into two parts: front-end and back-end – see Figure 1. The front-end is the part of the model that interacts with the input signal in order to map it into a latent space, and the back-end predicts the output given the representation obtained by the front-end.

Figure 1 – Deep learning pipeline.
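
To make the front-end/back-end split concrete, here is a minimal NumPy sketch. Everything in it is an assumption for illustration only: the function names `front_end` and `back_end`, the filter and hop sizes, and the random (untrained) weights are not taken from any particular paper. The front-end is a strided 1D convolution over the waveform, and the back-end is temporal mean pooling followed by a linear layer and a softmax.

```python
import numpy as np

def front_end(waveform, filters, hop=128):
    """Hypothetical front-end: strided 1D convolution + ReLU over the raw waveform."""
    frame_len = filters.shape[0]
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Slice the signal into overlapping frames, one row per frame.
    frames = np.stack([waveform[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # Each column of `filters` acts as one convolutional filter.
    return np.maximum(frames @ filters, 0.0)  # shape: (n_frames, n_filters)

def back_end(latent, weights, bias):
    """Hypothetical back-end: mean pooling over time + linear layer + softmax."""
    pooled = latent.mean(axis=0)
    logits = pooled @ weights + bias
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)             # stand-in for one second of audio
filters = rng.standard_normal((256, 32)) * 0.01   # untrained front-end filters
weights = rng.standard_normal((32, 10)) * 0.01    # untrained back-end weights
bias = np.zeros(10)

latent = front_end(waveform, filters)    # latent representation
probs = back_end(latent, weights, bias)  # class probabilities
```

In a real model both parts would be trained jointly; the point of the sketch is only that the back-end never sees the input signal, just the latent representation.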

In the following, we discuss the different front- and back-ends we identified in the audio classification literature.

ISMIR 2017 highlights

This was my first ISMIR ever, and I am thrilled to be part of this amazing community. It was fun to put faces (and height, and weight) to the names I respect so much!

All awarded papers were amazing, and these are definitely on my list of highlights:
  • Choi et al. – every time I re-read this paper I am more impressed by the effort they put into assessing the generalization capabilities of deep learning models. This work sets a high evaluation standard for those working on deep auto-tagging models!
  • Bittner et al. propose a fully-convolutional model for tracking f0 contours in polyphonic music. The article has a brilliant introduction drawing parallels between their fully-convolutional architecture and previous traditional models – making clear that it is worth building bridges between deep learning work and the earlier signal processing literature.
  • Oramas et al. – deep learning makes it easy to combine information from many sources, such as audio, text or images. They do so by combining representations extracted from audio spectrograms, word embeddings and ImageNet-based features. Moreover, they released a new dataset: MuMu, with 147,295 songs belonging to 31,471 albums.
  • Jansson et al.'s work proposes a U-net model for singing voice separation. Adding connections between layers at the same hierarchical level in the encoder and decoder seems to be a good idea for reconstructing masked audio signals, since several papers have already reported good results using this setup.
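
The encoder-decoder connections mentioned in the last item can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the architecture from the paper: the helper names (`down`, `up`, `layer`), the layer sizes, and the random weights are all hypothetical, and per-frame linear maps stand in for the 2D convolutions a real U-net would use.

```python
import numpy as np

def down(x):
    """Halve the time axis by averaging adjacent frames."""
    return 0.5 * (x[:, 0::2] + x[:, 1::2])

def up(x):
    """Double the time axis by repeating each frame."""
    return np.repeat(x, 2, axis=1)

def layer(x, w):
    """Per-frame linear map over the channel axis, followed by ReLU."""
    return np.maximum(w @ x, 0.0)

rng = np.random.default_rng(0)
n_bins, n_frames = 64, 16
mix = np.abs(rng.standard_normal((n_bins, n_frames)))  # stand-in magnitude spectrogram

# Untrained weights, for shape illustration only.
w_enc1 = rng.standard_normal((32, n_bins)) * 0.1
w_enc2 = rng.standard_normal((16, 32)) * 0.1
w_dec1 = rng.standard_normal((32, 16)) * 0.1
w_out = rng.standard_normal((n_bins, 32 + 32)) * 0.1

e1 = layer(mix, w_enc1)        # encoder, level 1: (32, 16)
e2 = layer(down(e1), w_enc2)   # encoder, level 2: (16, 8)
d1 = layer(up(e2), w_dec1)     # decoder, level 1: (32, 16)

# Skip connection: concatenate encoder and decoder features of the same level.
skip = np.concatenate([d1, e1], axis=0)        # (64, 16)
mask = 1.0 / (1.0 + np.exp(-(w_out @ skip)))   # soft mask with values in (0, 1)
estimate = mask * mix                          # masked spectrogram
```

The concatenation lets the output layer see the fine time resolution of `e1` directly, instead of relying only on the upsampled, coarser `d1`.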

But there were many other inspiring papers.

Impressions from ICASSP 2017

The signal processing community is very into machine learning. Although I am not sure of the implications of this fact, this intersection has already produced very interesting results – such as Smaragdis et al.’s work. Lots of papers related to deep learning were presented. Although in many cases people were naively applying DNNs or LSTMs to a new problem, there was also (of course) amazing work with inspiring ideas – I highlight some:

  • Koizumi et al. propose using reinforcement learning for source separation. This work shows how reinforcement learning can be applied to audio signal processing.
  • Ewert et al. propose a variant of dropout that induces models to learn specific structures by using information from weak labels.
  • Ting-Wei et al. propose doing frame-level predictions with a fully convolutional model that also uses Gaussian kernel filters (first introduced by them), trained with clip-level annotations in a weakly-supervised learning setup.
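
A generic way to train frame-level predictions from clip-level annotations, as in the last item, is to pool the frame-level outputs into one clip-level score and compute the loss against the clip label. The sketch below assumes max pooling over time and binary cross-entropy; it is a common weak-supervision recipe, not necessarily the one used in the paper, and the name `clip_level_loss` is hypothetical.

```python
import numpy as np

def clip_level_loss(frame_probs, clip_label, eps=1e-7):
    """Binary cross-entropy between a max-pooled clip score and a clip-level label."""
    clip_prob = frame_probs.max()                 # pool frame predictions over time
    clip_prob = np.clip(clip_prob, eps, 1 - eps)  # avoid log(0)
    return -(clip_label * np.log(clip_prob)
             + (1 - clip_label) * np.log(1 - clip_prob))

# Frame-level probabilities for one clip; only the third frame is active.
frame_probs = np.array([0.05, 0.10, 0.90, 0.20])

loss_pos = clip_level_loss(frame_probs, 1)  # clip annotated as containing the event
loss_neg = clip_level_loss(frame_probs, 0)  # clip annotated as not containing it
```

With max pooling, a single confident frame is enough to explain a positive clip label, which is what allows the model to localize events in time without frame-level annotations.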


Slides: Deep learning for music data processing – a personal (re)view

I was invited to give a talk at the Deep Learning for Speech and Language Winter Seminar @ UPC, Barcelona. Since UPC is the university where I did my undergraduate studies, it was a great pleasure to give an introductory talk about how our community is using deep learning to approach music technology problems.

Download the slides!

Overall, the talk centered on reviewing the state of the art (1988-2016) in deep learning for music data processing in order to spark some discussion about current trends. Several key papers were listed chronologically and briefly described: pioneering papers using MLPs [1], RNNs [2], LSTMs [3] and CNNs [4] for music data processing; and pioneering papers using symbolic data [1], spectrograms [5] and waveforms [6] – among others.
