ICASSP 2021 is all you need

Actually, what I really need is fewer papers with “all you need” in the title, and to share a (non-virtual) beer with you folks!! Here are some of the papers I enjoyed, together with the papers we presented. You’ll see that I don’t include classification/tagging papers; I guess I need a break from my PhD topic 🙂 Enjoy!

Source separation

  • (our work) Multichannel-based learning for audio object extraction. Link: arxiv.
    • Learn to extract objects without defining a loss at the object level. Instead, the loss is defined after rendering to multichannel formats (like 5.1 or stereo).
  • (our work) On permutation invariant training for speech source separation. Link: arxiv.
    • Speaker permutation errors are a known problem in speech source separation; we explicitly investigate and discuss this issue.
  • Towards listening to 10 people simultaneously: an efficient permutation invariant training of audio source separation using Sinkhorn’s algorithm. Link: arxiv.
    • As clearly noted in the title, an efficient alternative for permutation invariant training in audio source separation.
  • The following works investigate MISI, the “Griffin-Lim algorithm for source separation”.
    • Online spectrogram inversion for low-latency audio source separation. Link: arxiv.
    • Phase recovery with Bregman divergences for audio source separation. Link: arxiv.
  • One-shot conditional audio filtering of arbitrary sound. Links: arxiv, audios.
    • Universal sound separation by example, but it can be tricky to find the “right” example for the sound that is in the audio at hand.
  • Transcription is all you need: learning to separate musical mixtures with score as supervision. Link: arxiv.
    • They perform weakly-supervised source separation, and find that using a transcription system to guide the separations works better than a classifier.
    • They also use harmonic masks derived from the musical score, as well as adversarial losses on the transcriber. Hence, transcription is not all you need 🙂
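
For readers new to permutation invariant training (PIT), which two of the papers above revolve around, here is a minimal sketch in plain Python. It brute-forces every speaker ordering and keeps the one with the lowest mean squared error. The function names and toy signals are my own illustrative assumptions, not code from any of the papers. As I understand it, the factorial cost of this exhaustive search is what the Sinkhorn-based paper tries to avoid.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, targets):
    """Return (best_loss, best_permutation) over all source orderings.

    best_permutation[i] is the index of the target matched to estimates[i].
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):  # n! candidate orderings
        loss = sum(mse(estimates[i], targets[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the model outputs the two speakers in swapped order.
targets = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
estimates = [[0.5, 0.5, 0.5], [1.0, 0.0, -1.0]]
loss, perm = pit_loss(estimates, targets)  # loss is 0.0, perm is (1, 0)
```

A loss with a fixed speaker order would penalise the swapped outputs heavily, while the permutation-invariant loss is zero because the swap is resolved before scoring.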

Audio synthesis

  • (our work) Upsampling artifacts in neural audio synthesis. Links: arxiv, code.
    • If you work on audio synthesis, you might have noticed that upsampling layers (e.g., transposed or subpixel CNNs, or nearest neighbour) introduce artefacts, which are documented in this article.
  • LOOPNET: musical loop synthesis conditioned on intuitive musical parameters. Links: arxiv, audio.
    • Study intuitive controls for composers to control a generative model of music audio loops. Arguably, the future of sampling culture will rely on sampling distributions 🙂
  • Semi-supervised learning for singing synthesis timbre. Link: arxiv.
    • They map acoustic and linguistic features into the same latent space, such that one can use one or another seamlessly for different purposes.
  • Context-aware prosody correction for text-based speech editing. Link: arxiv.
    • When replacing some speech with another speech audio snippet, prosody mismatch can be problematic. They adapt the prosody via standard DSP pitch transformation + neural enhancement.
  • Efficient Adversarial Audio Synthesis Via Progressive Upsampling. Link: IEEE.
    • Similarly to progressiveGAN (producing high-resolution images from low-resolution images), they propose a progressive upsampling GAN (from low to high sampling rates).
  • StyleMelGAN: an efficient high-fidelity adversarial vocoder with temporal adaptive normalization. Link: arxiv.
    • Use mel-spectrogram conditioning at all layers to modulate the latent, which is the result of transforming noise. They call this operation “to style a low-dimensional noise vector”.
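
To make the upsampling artefacts mentioned above a bit more tangible, here is a toy sketch in plain Python. It only illustrates one mechanism (uneven kernel overlap in a transposed convolution), with unit weights and a constant input chosen by me for readability; it is not the analysis from the article.

```python
def transposed_conv1d(x, kernel, stride):
    """1-D transposed convolution: each input sample adds a scaled kernel copy."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, sample in enumerate(x):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += sample * weight
    return out


def nearest_neighbour_upsample(x, factor):
    """Repeat every sample `factor` times."""
    return [sample for sample in x for _ in range(factor)]


constant = [1.0] * 6
# Kernel length 3 is not a multiple of stride 2, so the overlaps are uneven.
tconv_out = transposed_conv1d(constant, kernel=[1.0, 1.0, 1.0], stride=2)
nn_out = nearest_neighbour_upsample(constant, factor=2)
# tconv_out: [1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1] (as floats)
# nn_out:    twelve samples, all 1.0
```

The constant input comes out of the transposed convolution with a periodic 2, 1, 2, 1 pattern, that is, a tone that was never in the input. Nearest-neighbour upsampling keeps this particular input flat, although this toy example says nothing about its other side effects.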

Neural audio effects

  • (our work) Automatic multitrack mixing with a differentiable mixing console of neural audio effects. Links: arxiv, demo.
    • Proposes learning the behaviour of the channel strips of a mixing console with a neural network, to have a fully differentiable mixing console for learning to mix in an end-to-end fashion.
  • Differentiable signal processing with black-box audio effects. Link: arxiv.
    • While the work above employs neural networks to emulate audio effects, they propose a framework that allows (directly) using plugins in your neural networks.
  • Lightweight and Interpretable Neural Modeling of an Audio Distortion Effect Using Hyperconditioned differentiable Biquads. Link: arxiv.
    • While the works above employ neural networks or audio plugins to have “audio effects layers”, they propose a framework based on differentiable digital signal processing (with differentiable biquadratic filters).
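
In case biquads are new to you: a biquadratic filter is a second-order IIR filter defined by five coefficients. Below is a minimal sketch of its forward pass in plain Python. The coefficient values are hypothetical, picked only to give an impulse response that is easy to check by hand. Making it differentiable would require writing the same recursion in an autodiff framework, which is omitted here.

```python
def biquad(x, b0, b1, b2, a1, a2):
    """Direct form I second-order IIR filter (a0 normalised to 1).

    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    """
    y = []
    x1 = x2 = y1 = y2 = 0.0  # previous inputs and outputs
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y


impulse = [1.0, 0.0, 0.0, 0.0]
# Hypothetical coefficients (single pole at 0.5), for illustration only.
response = biquad(impulse, b0=0.5, b1=0.5, b2=0.0, a1=-0.5, a2=0.0)
# response: [0.5, 0.75, 0.375, 0.1875]
```

Since the whole filter is described by five numbers, a model that predicts those numbers stays small and its behaviour can be read off the coefficients, which is presumably where the “lightweight and interpretable” in the title comes from.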

Self-supervised learning

  • Multi-task self-supervised pre-training for music classification. Link: arxiv.
    • Extend the multi-task self-supervised learning originally proposed for speech, to music classification. It includes: MFCCs for timbre, Chroma for harmonic, and Tempogram for rhythmic attributes.
  • Learning contextual tag embeddings for cross-modal alignment of audio and tags. Link: arxiv.
    • A method for cross-modal alignment of general audio (not only speech) and tags, using contrastive learning and a pre-trained language model.
  • HuBERT: how much can a bad teacher benefit ASR pre-training? Link: IEEE.
    • Proposes the Hidden-Unit BERT (HuBERT) model, which uses a “cheap” k-means clustering step to provide aligned target labels for pre-training a BERTish ASR model.
  • Contrast learning for general audio representations. Link: arxiv.
    • Learns representations via assigning high similarity to audio segments extracted from the same recording while assigning lower similarity to segments from different recordings.
    • The larger the batch the better! Luckily, they didn’t title the paper “large batch sizes is all you need in contrastive learning”.
  • Optimizing short-time Fourier transform parameters via gradient descent. Links: arxiv, code.
    • While this is not, strictly speaking, self-supervised learning, they study how to learn the parameters of the STFT.
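
The contrastive objective described above (high similarity for segments from the same recording, lower for segments from different recordings) can be sketched in a few lines of plain Python. This is a generic InfoNCE-style loss under my own assumptions: cosine similarity, a temperature of 0.1, and two-dimensional toy embeddings. The paper's exact similarity measure and hyperparameters may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, temperature=0.1):
    """Cross-entropy over similarities: anchor i should pick positive i.

    anchors[i] and positives[i] embed two segments of the same recording;
    every other positives[j] in the batch acts as a negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        log_norm = math.log(sum(math.exp(l) for l in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
swapped = contrastive_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])
# aligned is close to zero, swapped is much larger
```

Each extra recording in the batch adds one more negative per anchor, which gives some intuition for why larger batches help.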

Speech quality metrics

  • (our work) SESQA: semi-supervised learning for speech quality assessment. Link: arxiv.
    • Uses labelled data and self-supervised learning ideas to learn to predict speech quality. It can act as either a reference-based or a reference-free metric.
  • CDPAM: contrastive learning for perceptual audio similarity. Links: arxiv, code.
    • It improves their previous work via combining contrastive learning and multi-dimensional representations to build robust models from limited data.

Speech enhancement

  • ICASSP 2021 deep noise suppression challenge: decoupling magnitude and phase optimization with a two-stage deep network. Link: arxiv.
    • A two-stage model: (i) train to estimate “clean magnitude”, and (ii) train to estimate “clean real and imaginary” from “clean magnitude + noisy phase” jointly with loss (i).
    • Uses a signal-processing-based post-processing step.
  • Bandwidth extension is all you need. Link: IEEE.
    • Very good results on bandwidth extension, based on a feed-forward wavenet with deep feature matching and adversarial training.
    • They argue that bandwidth extension is all you need for high-fidelity audio synthesis, since you can train an efficient system (e.g., a vocoder or enhancement model) at 8kHz or 16kHz and later use bandwidth extension. Since you still need that vocoder or enhancement model, bandwidth extension is not all you need 🙂
  • Enhancing into the codec: noise robust speech coding with vector-quantized autoencoders. Link: arxiv.
    • Learnt codecs (trained on clean speech) are not necessarily robust to noisy speech. In line with that, they propose to jointly tackle speech enhancement and coding.
  • Cascaded time + time-frequency Unet for speech enhancement: jointly addressing clipping, codec distortions, and gaps. Link: IEEE.
    • This work jointly addresses three speech distortions: clipping, codec distortions, and gaps in speech (PLCish).
  • High fidelity speech regeneration with application to speech enhancement. Link: arxiv.
    • They approach speech enhancement via resynthesis. Steps: (i) denoising; (ii) extracting speech features like loudness, f0, and d-vector; and (iii) speech synthesis with a GAN-TTS vocoder.
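
A small aside on the “clean magnitude + noisy phase” input of the two-stage deep noise suppression entry above: combining the two boils down to a polar-to-rectangular conversion per STFT bin. Here is a sketch in plain Python, with the networks themselves omitted. The toy bins and the “estimated” magnitudes are made-up stand-ins, and the function name is my own.

```python
import cmath


def combine_magnitude_and_phase(magnitudes, noisy_bins):
    """Pair estimated magnitudes with the phase of the noisy STFT bins."""
    return [
        cmath.rect(mag, cmath.phase(z))
        for mag, z in zip(magnitudes, noisy_bins)
    ]


noisy_bins = [3 + 4j, -2j, 1 + 0j]      # toy noisy STFT frame
estimated_magnitudes = [2.5, 1.0, 0.5]  # stand-in for the stage (i) output
stage_two_input = combine_magnitude_and_phase(estimated_magnitudes, noisy_bins)
# The first bin keeps the direction of 3+4j but has magnitude 2.5 (about 1.5+2j).
```

The second stage then works on the real and imaginary parts of these bins, so it can also correct the phase that the first stage left untouched.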

Low complexity for speech enhancement

  • Low-complexity, real-time joint neural echo control and speech enhancement based on PercepNet. Link: arxiv.
    • They combine a traditional acoustic echo canceller with a low-complexity joint residual echo and noise suppressor based on a hybrid DSP/DNN approach.
  • Towards efficient models for real-time deep noise suppression. Link: arxiv.
    • A very simple U-net for speech enhancement, but they explain interesting tricks to build an efficient model.

Feel free to contact me on Clubhouse (handle: @jordipons) to organize a discussion around these papers!