ICASSP 2020: my selection!

It’s unfortunate that ICASSP 2020 was held online, because we were excited to share our beloved Barcelona with you, folks. I’m sure, though, that we’ll have other chances to discover the best tapas in town!

This year I attended ICASSP to “present” a couple of papers:

  • TensorFlow models in Essentia: Essentia is a reference open-source C++/Python library for audio and music analysis. In this work, we present a set of Essentia algorithms that use TensorFlow to run predictions with pre-trained deep learning models, and some of those are based on musicnn! Here is a link to a nice post on how to use it.
  • An empirical study of Conv-TasNet: We propose a (deep) non-linear variant of the Conv-TasNet encoder/decoder that is based on a deep stack of small filters. We also challenge the generalisation capabilities of Conv-TasNet: we report a LARGE performance drop when using cross-dataset evaluation. This result is important because it showcases the limitations of the current evaluation setup.

Speech and singing synthesis

Many speech synthesis works focus on improving the control and expressiveness (prosody, style, etc.) of the generated signals.

Others investigated the use of quantised latents for neural vocoders (VQ-VAE style).
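To make "quantised latents" concrete, here is a minimal sketch of the nearest-codebook lookup at the core of VQ-VAE-style models. It is not taken from any of the ICASSP papers: the `quantise` helper, the toy `codebook` and the toy `latents` are all hypothetical, and a real vocoder would learn the codebook jointly with its encoder and decoder.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(latent, codebook):
    """Return (index, code vector) of the codebook entry closest to `latent`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, codebook[index]


# Toy (hypothetical) codebook with three 2-D entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Toy continuous latents, standing in for the output of an encoder.
latents = [[0.9, 1.2], [-0.1, 0.2], [-1.3, 0.8]]

# Each latent is replaced by the index of its nearest code.
indices = [quantise(z, codebook)[0] for z in latents]
quantised = [codebook[i] for i in indices]
```

The discrete `indices` are what a downstream model would predict or transmit, while `quantised` is what the decoder would actually consume.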

Speech synthesis is also actively looking at non-autoregressive models for fast (parallel) synthesis.

During ICASSP 2020, several improvements to LPCNet were proposed:

I was very glad to learn more about ESPnet-TTS, an open-source toolkit for reproducible TTS.

Music and audio synthesis

While neural speech synthesis engines are capable of generating high-quality samples, this is not the case for music and general audio synthesis models. Here is a list of interesting works trying to push the boundaries of what’s possible in this (very challenging) area:

As in speech synthesis, this community is also working on improving the control and expressiveness of the generated signals.

Source separation

Researchers are investigating how to approach source separation in an unsupervised fashion:

Two closely related works propose to learn source separation models from weak labels for universal sound separation:

These works on source separation were also interesting:

Audio classification

Some works explored how to incorporate taxonomies within the deep learning framework:

Neural classifiers trained with few data

I’m happy to see that the community is embracing a research question that we also find relevant: how far can we get when training with just a few examples?

Interestingly, many of these works also found Prototypical Networks to work very well.
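For readers who have not met Prototypical Networks, the idea can be sketched in a few lines: each class is represented by the mean of the embeddings of its few labelled examples (its "prototype"), and a query is assigned to the class with the nearest prototype. The sketch below is only an illustration under simplifying assumptions: the embeddings are hand-written toy vectors (in practice a trained neural network would produce them), and the class names and helper functions are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support):
    """Map each class label to the mean embedding of its support examples."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes,
               key=lambda label: squared_distance(query, prototypes[label]))


# Toy support set: two classes with two (already embedded) examples each.
support = {
    "dog_bark": [[0.9, 0.1], [1.1, -0.1]],
    "siren": [[-1.0, 1.0], [-0.8, 1.2]],
}

prototypes = build_prototypes(support)
prediction = classify([0.7, 0.2], prototypes)
```

Because the prototypes are just averages, adding a new class only requires embedding a handful of examples, which is part of why this approach suits the few-data regime.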

Automatic Speech Recognition with SincNet