ISMIR article: End-to-end learning for music audio tagging at scale

Our accepted ISMIR paper on music auto-tagging at scale is now online – read it on arXiv, and listen to our demo!

1) When enough training data is available: waveform models (sampleCNN) > spectrogram models (musically motivated CNN).
2) But spectrogram models > waveform models when sizable training data is not available.
3) Musically motivated CNNs achieve state-of-the-art results on the MagnaTagATune (MTT) and Million Song Dataset (MSD) benchmarks.

Takeaways from the Google Speech Summit 2018

After attending the Google Speech Summit 2018, I venture to say that Google’s speech interests for the future are: (i) to continue improving their automatic speech recognition (w/ Listen, Attend and Spell, a seq2seq model) and speech synthesis (w/ Tacotron 2 + WaveNet/WaveRNN) systems so that a robust interface is available for their conversational agent; (ii) to keep simplifying pipelines – having fewer “separated” blocks in order to be end-to-end whenever possible; (iii) to better control some aspects of their end-to-end models – for example, with style tokens they aim to control some Tacotron (synthesis) parameters; and (iv) to build the Google Assistant, a conversational agent that is receiving a lot of effort and that I guess will be the basis of their next generation of products.

The following lines aim to summarize (by topic) what I found relevant – and, ideally, to describe some details that are not in the papers.

My ICASSP 2018 highlights

This year’s ICASSP keywords are: generative adversarial networks (GANs), WaveNet, speech enhancement, source separation, industry, music transcription, cover song identification, sampleCNN, monophonic pitch tracking, and gated/dilated CNNs. This time, passionate scientific discussions happened in random sports bars in downtown Calgary – next to dirty piles of melting snow.

ISMIR 2017 highlights

This was my first ISMIR ever, and I am thrilled to be part of this amazing community. It was fun to put faces (and height, and weight) to the names I respect so much!

All awarded papers were amazing, and these are definitely on my list of highlights:
  • Choi et al. – every time I re-read this paper I am more impressed by the effort they put into assessing the generalization capabilities of deep learning models. This work sets a high evaluation standard for those working on deep auto-tagging models!
  • Bittner et al. propose a fully-convolutional model for tracking f0 contours in polyphonic music. The article has a brilliant introduction drawing parallels between their proposed fully-convolutional architecture and previous traditional models – making clear that it is worth building bridges between deep learning works and previous signal processing literature.
  • Oramas et al. – deep learning makes it easy to combine information from many sources, such as audio, text or images. They do so by combining representations extracted from audio spectrograms, word embeddings and ImageNet-based features. Moreover, they released a new dataset: MuMu, with 147,295 songs belonging to 31,471 albums.
  • Jansson et al. propose a U-net model for singing voice separation. Adding connections between encoder and decoder layers at the same hierarchical level seems to be a good idea for reconstructing masked audio signals, since several papers have already reported good results using this setup.

But there were many other inspiring papers.