It’s unfortunate that ICASSP 2020 was held online, because we were excited to share our beloved Barcelona with you, folks. I’m sure, though, that we’ll have other chances to discover the best tapas in town!

This year I attended ICASSP to “present” a couple of papers:
- TensorFlow models in Essentia: Essentia is a reference open-source C++/Python library for audio and music analysis. In this work we present a set of Essentia algorithms that wrap TensorFlow and allow predictions with pre-trained deep learning models, some of them based on musicnn! Here is a link to a nice post on how to use it, and a minimal usage sketch follows this list.
- An empirical study of Conv-TasNet: We propose a (deep) non-linear variant of Conv-TasNet’s encoder/decoder, based on a deep stack of small filters (see the encoder sketch after this list). We also challenge the generalisation capabilities of Conv-TasNet: we report a LARGE performance drop under cross-dataset evaluation, an important result that showcases the limitations of the current evaluation setup.
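Here is a minimal usage sketch for the Essentia TensorFlow algorithms, doing auto-tagging with a pre-trained musicnn model. Treat it as an illustration rather than documentation: the algorithm name `TensorflowPredictMusiCNN` and the model filename are assumptions on my side, so check the linked post for the exact API.

```python
# Sketch (not official docs): auto-tagging with a pre-trained musicnn graph
# through Essentia's TensorFlow wrappers.
# Assumptions: essentia.standard exposes MonoLoader and TensorflowPredictMusiCNN,
# and 'msd-musicnn-1.pb' is a pre-trained model file downloaded beforehand.
from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

audio = MonoLoader(filename='song.mp3', sampleRate=16000)()  # musicnn models expect 16 kHz mono
activations = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb')(audio)
print(activations.shape)  # (num_patches, num_tags): tag activations over time
```

And a rough sketch of the Conv-TasNet encoder idea: replacing the single linear 1-D convolution with a deep stack of small non-linear filters. The layer sizes below are illustrative only, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

# Illustrative deep non-linear encoder: a first strided conv followed by a stack
# of small (kernel_size=3) non-linear filters. Hyper-parameters are placeholders.
class DeepEncoder(nn.Module):
    def __init__(self, channels=512, depth=4, kernel_size=3):
        super().__init__()
        layers = [nn.Conv1d(1, channels, kernel_size=16, stride=8), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, waveform):   # waveform: (batch, 1, samples)
        return self.net(waveform)  # latent: (batch, channels, frames)

latent = DeepEncoder()(torch.randn(2, 1, 16000))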
Speech and singing synthesis
Many speech synthesis works focus on improving the control and expressiveness (prosody, style, etc.) of the generated signals.
- MELLOTRON: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens
- Many-to-many Voice Conversion Using Conditional Cycle-Consistent Adversarial Networks
Others investigated the use of quantised latents for neural vocoders (VQ-VAE style).
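For context, this is the kind of quantisation bottleneck those works build on: a generic VQ-VAE-style sketch with illustrative sizes and a straight-through gradient, not code from any particular paper.

```python
import torch

# Minimal vector-quantisation (VQ-VAE style) bottleneck: each latent frame is
# replaced by its nearest codebook entry; the straight-through estimator copies
# gradients around the non-differentiable lookup.
def quantise(latents, codebook):
    # latents: (frames, dim), codebook: (codes, dim)
    distances = torch.cdist(latents, codebook)   # (frames, codes)
    indices = distances.argmin(dim=1)            # nearest code per frame
    quantised = codebook[indices]                # (frames, dim)
    # forward uses the quantised values, backward passes gradients to the latents
    return latents + (quantised - latents).detach(), indices

codebook = torch.randn(256, 64, requires_grad=True)
z_q, ids = quantise(torch.randn(100, 64), codebook)
```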
The speech synthesis community is also actively looking at non-autoregressive models for fast (parallel) synthesis.
- Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram
- Flow-TTS: A Non-autoregressive Network for Text To Speech Based on Flow
During ICASSP 2020, several improvements to LPCNet were proposed:
- Gaussian LPCNet for Multisample Speech Synthesis
- Improving LPCNet-based Text-To-Speech With Linear Prediction-Structured Mixture Density Networks
I was very glad to learn more about ESPnet-TTS, an open-source toolkit for reproducible TTS.
Music and audio synthesis
While neural speech synthesis engines are capable of generating high-quality samples, this is not the case for music and general audio synthesis models. Here is a list of interesting works trying to push the boundaries of what’s possible in this (very challenging) area:
- Transferring neural speech waveform synthesizers to musical instrument sounds generation
- Source Coding of Audio Signals with a Generative Model
- Sound Texture Synthesis Using RI Spectrograms
As in speech synthesis, this community is also working on improving the control and expressiveness of the generated signals.
- Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System
- Neural Percussive Synthesis Parametrised by High-Level Timbral Features
Source separation
Researchers are investigating how to approach source separation in an unsupervised fashion:
- Unsupervised Training for Deep Speech Source Separation with Kullback-Leibler Divergence Based Probabilistic Loss
- Mixup-Breakdown: A Consistency Training Method for Improving Generalization of Speech Separation Models
Two closely related works propose to learn source separation models from weak labels for universal sound separation:
- Learning to Separate Sounds from Weakly Labeled Scenes
- Source separation with weakly labelled data: an approach to computational auditory scene analysis
These works on source separation were also interesting:
- SincNet for source separation: Filterbank Design for End-to-end Speech Separation
- Approaching the task holistically: Source separation, counting, and diarization system
- Yet another work on end-to-end music source separation: Improving Singing Voice Separation with the Wave-U-Net Using Minimum Hyperspherical Energy
- tPIT + clustering > uPIT? Deep CASA for Talker-Independent Monaural Speech Separation (a minimal uPIT sketch follows this list)
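On that last item: utterance-level PIT (uPIT) picks the speaker permutation that minimises the loss over the whole utterance, while frame-level tPIT picks the best permutation per frame and then needs a clustering/assignment step at inference. A minimal uPIT sketch, with illustrative shapes and a plain MSE loss:

```python
import itertools
import torch

# Utterance-level permutation invariant training (uPIT): evaluate the loss for
# every speaker permutation over the whole utterance and keep the best one.
def upit_loss(estimates, targets):
    # estimates, targets: (speakers, samples)
    n = targets.shape[0]
    losses = []
    for perm in itertools.permutations(range(n)):
        losses.append(torch.mean((estimates[list(perm)] - targets) ** 2))
    return torch.min(torch.stack(losses))

loss = upit_loss(torch.randn(2, 16000), torch.randn(2, 16000))
```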
Audio classification
- A-CRNN: a Domain Adaptation Model for Sound Event Detection: this is an adversarial domain adaptation model that uses a two-step training procedure: 1) warm-up, 2) adversarial domain adaptation (a minimal sketch of the adversarial building block follows this list).
- Data-Driven Harmonic Filters for Audio Representation Learning: they propose a trainable front-end module for representation learning that exploits the inherent harmonic structure of audio signals.
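On the domain adaptation item: I don’t know whether A-CRNN uses exactly this mechanism or a separate discriminator loss, but a gradient reversal layer is the classic building block for the adversarial step, so here is a generic sketch:

```python
import torch

# Gradient reversal layer, a common building block for adversarial domain adaptation:
# the feature extractor is trained to fool a domain classifier by receiving the
# domain loss gradients with flipped sign.
class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # pass gradients through with negated sign (scaled by lambd)
        return -ctx.lambd * grad_output, None

features = torch.randn(8, 128, requires_grad=True)
reversed_features = GradReverse.apply(features, 1.0)  # feed this to the domain classifier
```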
Some works were exploring how to incorporate taxonomies within the deep learning framework:
- Chirping up the Right Tree: Incorporating Biological Taxonomies into Deep Bioacoustic Classifiers
- An Ontology-Aware Framework for Audio Event Classification
Neural classifiers trained with little data
I’m happy to see that the community is embracing a research question that we also find relevant: how far can we get by training with just a little data?
- Metric Learning with Background Noise Class for Few-Shot Detection of Rare Sound Events
- Few-Shot Sound Event Detection
- SPIDERnet: Attention Network for One-Shot Anomaly Detection in Sounds
- Few-shot Acoustic Event Detection via Meta Learning
Interestingly, many of these works also found Prototypical Networks to work very well.
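For reference, the Prototypical Networks idea in a few lines: class prototypes are the mean embedding of the few labelled support examples, and queries are classified by their distance to each prototype. This is a generic sketch with placeholder embeddings, not code from any of these papers.

```python
import torch

# Prototypical Networks in a nutshell: each class prototype is the mean embedding
# of its support examples; a query is scored by (negative) distance to each prototype.
def prototypical_logits(support, support_labels, queries, num_classes):
    # support: (n_support, dim), queries: (n_query, dim)
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(num_classes)])  # (classes, dim)
    return -torch.cdist(queries, prototypes)                 # higher = closer

support = torch.randn(10, 64)                      # placeholder embeddings
labels = torch.arange(5).repeat_interleave(2)      # 2 support examples per class
logits = prototypical_logits(support, labels, torch.randn(3, 64), num_classes=5)
```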
Automatic Speech Recognition with SincNet
- Self-supervised learning for ASR: Multi-task Self-supervised Learning for Robust Speech Recognition
- End-to-end SincNet for ASR: E2E-SINCNET: Toward Fully End-to-end Speech Recognition (a simplified sketch of the sinc filterbank idea follows)
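For context, SincNet’s core trick: each filter in the learnable front-end is a band-pass built as the difference of two parametrised sinc low-pass filters, so only the cut-off frequencies are learned. A simplified sketch, without the windowing and constraints of the real implementations:

```python
import torch

# Simplified SincNet-style band-pass kernel: difference of two ideal sinc low-pass
# filters, parametrised only by its (normalised) low/high cut-off frequencies.
def sinc_bandpass(f_low, f_high, kernel_size=251):
    n = torch.arange(kernel_size) - kernel_size // 2
    def lowpass(f):
        return 2 * f * torch.sinc(2 * f * n)  # ideal low-pass with cut-off f
    return lowpass(f_high) - lowpass(f_low)

f_low = torch.tensor(0.05, requires_grad=True)   # learnable cut-offs
f_high = torch.tensor(0.15, requires_grad=True)
kernel = sinc_bandpass(f_low, f_high)            # 1-D conv kernel applied to the waveform
```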
Miscellaneous
- On latent disentanglement: Disentangled Multidimensional Metric Learning for Music Similarity
- Self-supervised learning: SPICE: Pitch Estimation via Self-Supervision
- Cover song detection: Accurate and Scalable Version Identification using Musically-Motivated Embeddings
- Speech enhancement: Speaker Independence of Neural Vocoders and Their Effect on Parametric Resynthesis Speech Enhancement
- A nice paper on beam-forming: Beam-TasNet: Time-domain Audio Separation Network Meets Frequency-domain Beamformer
- Discussion on using the phase for source separation: Mask-dependent Phase Estimation for Monaural Speaker Separation