ICASSP 2022 – my learnings

My biggest learning this year: I’LL NOT SURVIVE ANOTHER ONLINE CONFERENCE 💔 I really miss in-person discussion in exotic places! This year I attended ICASSP to present two papers:

  • “On Loss Functions and Evaluation Metrics for Music Source Separation” by Enric Gusó, Jordi Pons, Santiago Pascual, Joan Serrà [Zenodo, arXiv].
  • “PixInWav: Residual Steganography for Hiding Pixels in Audio” by Margarita Geleta, Cristina Punti, Kevin McGuinness, Jordi Pons, Cristian Canton, Xavier Giro-i-Nieto [arXiv].

General trends

Let’s learn to predict the parameters of systems that are well known to audio and synth nerds.
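
For intuition, here is a minimal sketch of what “predicting the parameters” can look like (all names and shapes are mine, not from any specific paper): a small network regressing normalized synth/effect parameters from a mel spectrogram.

```python
# Hypothetical sketch: regress synthesizer/effect parameters from audio features.
import torch
import torch.nn as nn

class ParamEstimator(nn.Module):
    def __init__(self, n_mels=80, n_params=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, n_params), nn.Sigmoid(),  # parameters normalized to [0, 1]
        )

    def forward(self, mel):        # mel: (batch, n_mels, frames)
        return self.net(mel)

# Train with a plain regression loss against the ground-truth parameters
# used to render each training sound.
model = ParamEstimator()
mel = torch.randn(4, 80, 200)      # dummy batch of mel spectrograms
pred = model(mel)                  # (4, 8) predicted parameter values
loss = nn.functional.mse_loss(pred, torch.rand(4, 8))
```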

Audio companies are interested in enhancing amateur music recordings.

The focus of self-supervised learning is moving from speech to general audio, and from classification to synthesis.

Several researchers are exploring “complex-valued” neural networks to process complex-valued spectrograms. Also, many speech enhancement systems are hybrid: signal processing + deep learning.
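
As a toy illustration of the complex-valued idea (my own sketch, not any particular paper’s architecture): predict a complex-valued mask and apply it to the complex STFT, so real and imaginary parts are handled jointly instead of magnitude only.

```python
# Sketch: a network outputs the real/imaginary parts of a complex ratio mask
# that is applied to the complex STFT of a noisy signal.
import torch
import torch.nn as nn

n_fft, hop = 512, 128
x = torch.randn(1, 16000)                                   # 1 s of noisy audio
window = torch.hann_window(n_fft)
X = torch.stft(x, n_fft, hop, window=window, return_complex=True)  # (1, freq, time)

net = nn.Conv2d(2, 2, kernel_size=3, padding=1)             # toy mask estimator
inp = torch.stack([X.real, X.imag], dim=1)                  # (1, 2, freq, time)
m_re, m_im = net(inp).unbind(dim=1)
mask = torch.complex(m_re, m_im)

Y = X * mask                                                # complex multiplication
y = torch.istft(Y, n_fft, hop, window=window, length=x.shape[-1])
```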

Source separation trends

Let’s guide source separation with textual or high-level descriptions.

Source separation in out-of-distribution real-world data remains a challenging problem.

Specific machine learning techniques

Adaptive instance normalization – I now know where these ideas of adapting the mean (and variance) of the feature maps to change style/identity come from!
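
For reference, a minimal AdaIN sketch (my own, assuming per-channel statistics over the time axis): normalize the content features, then re-inject the style’s mean and standard deviation.

```python
import torch

def adain(content, style, eps=1e-5):
    # content, style: (batch, channels, time) feature maps
    c_mean = content.mean(dim=-1, keepdim=True)
    c_std = content.std(dim=-1, keepdim=True) + eps
    s_mean = style.mean(dim=-1, keepdim=True)
    s_std = style.std(dim=-1, keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean      # content structure, style statistics

content = torch.randn(2, 64, 100)
style = torch.randn(2, 64, 120)   # statistics are per channel, so lengths may differ
out = adain(content, style)
```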

Deep equilibrium models – these allow “keeping the expressivity” while reducing the number of trainable parameters of the model.
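
A toy sketch of the deep-equilibrium idea (forward pass only; real DEQs use root-finding and implicit differentiation, this just shows the fixed-point iteration): one weight-tied layer applied until the hidden state stops changing.

```python
import torch
import torch.nn as nn

class DEQLayer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.lin_z = nn.Linear(dim, dim)
        self.lin_x = nn.Linear(dim, dim)

    def forward(self, z, x):
        return torch.tanh(self.lin_z(z) + self.lin_x(x))

def forward_fixed_point(f, x, n_iters=50, tol=1e-4):
    z = torch.zeros_like(x)
    for _ in range(n_iters):
        z_new = f(z, x)                 # same weights at every "layer"
        if (z_new - z).norm() < tol:
            break
        z = z_new
    return z                            # the equilibrium hidden state z* = f(z*, x)

f = DEQLayer()
x = torch.randn(8, 64)
z_star = forward_fixed_point(f, x)
```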

Implicit neural representations – parametrise the (observed) audio so that one can easily interpolate (upsample) that parametrisation, e.g. for sampling rate conversion.
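
A toy sketch of how I understand the workflow (not any specific paper’s model; real work typically uses sinusoidal activations or Fourier features to make the fit work well): overfit an MLP mapping time to amplitude on the observed signal, then query it on a denser time grid.

```python
import math
import torch
import torch.nn as nn

sr_in, sr_out, dur = 8000, 16000, 0.1
t_in = torch.linspace(0, dur, int(sr_in * dur)).unsqueeze(-1)   # observed time stamps
x_in = torch.sin(2 * math.pi * 440 * t_in)                      # toy "recording"

mlp = nn.Sequential(nn.Linear(1, 128), nn.Tanh(),
                    nn.Linear(128, 128), nn.Tanh(),
                    nn.Linear(128, 1))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

for _ in range(2000):                      # overfit the MLP to this one signal
    opt.zero_grad()
    loss = nn.functional.mse_loss(mlp(t_in), x_in)
    loss.backward()
    opt.step()

t_out = torch.linspace(0, dur, int(sr_out * dur)).unsqueeze(-1)  # denser time grid
x_out = mlp(t_out).detach()                                      # "upsampled" audio
```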

Distribution augmentation – works by flagging any data augmentation applied during training. During inference, these flags are set to zero (the non-flag symbol) so that distortions from unrealistic augmentations do not affect inference.
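
My own minimal sketch of the flagging mechanism (the augmentation names and the model are made up for illustration):

```python
import random
import torch
import torch.nn as nn

AUGMENTATIONS = ["pitch_shift", "time_stretch", "reverse"]   # illustrative names

def augment(x):
    flags = torch.zeros(len(AUGMENTATIONS))
    if random.random() < 0.5:                 # e.g. randomly reverse the signal
        x = torch.flip(x, dims=[-1])
        flags[AUGMENTATIONS.index("reverse")] = 1.0
    return x, flags

class FlaggedModel(nn.Module):
    def __init__(self, n_in=1024, n_flags=len(AUGMENTATIONS)):
        super().__init__()
        self.net = nn.Linear(n_in + n_flags, 10)

    def forward(self, x, flags):
        return self.net(torch.cat([x, flags], dim=-1))

model = FlaggedModel()
x = torch.randn(1024)
x_aug, flags = augment(x)
train_out = model(x_aug, flags)                        # training: flags describe the augmentation
test_out = model(x, torch.zeros(len(AUGMENTATIONS)))   # inference: all flags set to zero
```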

Orderless NADE with Gibbs sampling – to make the source separation step iterative.
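
Roughly how I picture the Gibbs-style loop (a sketch, with a stand-in for the learned conditional model): keep current estimates of all sources and repeatedly re-estimate one randomly chosen source conditioned on the mixture and the other current estimates.

```python
import random
import torch

def reestimate(mixture, others):
    # Stand-in for a learned conditional p(source_k | mixture, other sources):
    # here we simply assign the residual to the chosen source.
    return mixture - sum(others)

def gibbs_separate(mixture, n_sources=4, n_sweeps=10):
    estimates = [mixture / n_sources for _ in range(n_sources)]   # crude init
    for _ in range(n_sweeps):
        k = random.randrange(n_sources)                           # pick one source
        others = [s for i, s in enumerate(estimates) if i != k]
        estimates[k] = reestimate(mixture, others)                # Gibbs-style update
    return estimates

mixture = torch.randn(16000)
sources = gibbs_separate(mixture)
```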

Personal takes

I enjoyed this “simple” baseline based on masking for text-based speech editing. In short: mask and generate the part you wish to edit (conditioned on text).
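
A toy sketch of that baseline as I understand it (the generator here is a placeholder, not the paper’s model): mask the spectrogram frames to edit and let a text-conditioned generator infill them.

```python
import torch

def edit(spec, frame_start, frame_end, text_embedding, generator):
    masked = spec.clone()
    masked[:, frame_start:frame_end] = 0.0                 # mask the region to edit
    infill = generator(masked, text_embedding)             # generate, conditioned on text
    edited = spec.clone()
    edited[:, frame_start:frame_end] = infill[:, frame_start:frame_end]
    return edited

generator = lambda spec, text: torch.zeros_like(spec)      # placeholder for a trained model
spec = torch.rand(80, 400)                                 # (mels, frames)
text_emb = torch.randn(256)
out = edit(spec, 120, 180, text_emb, generator)
```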

A couple of papers on SDR present the basics well.
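
As a quick reference, the plain “energy of target over energy of error” SDR that these basics build on (my minimal implementation, ignoring the allowed-distortion projections of BSS Eval):

```python
import torch

def sdr(reference, estimate, eps=1e-8):
    num = (reference ** 2).sum()                    # target energy
    den = ((reference - estimate) ** 2).sum() + eps # error energy
    return 10 * torch.log10(num / den + eps)        # in dB, higher is better

ref = torch.randn(16000)
est = ref + 0.1 * torch.randn(16000)
print(sdr(ref, est))
```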

I liked the idea of also using negative conditioning examples for FiLM layers.
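
For context, a minimal FiLM layer (my sketch): the conditioning embedding, e.g. of a positive or negative example, produces a per-channel scale and shift applied to the feature maps.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim, n_channels):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, n_channels)
        self.to_beta = nn.Linear(cond_dim, n_channels)

    def forward(self, features, cond):
        # features: (batch, channels, time), cond: (batch, cond_dim)
        gamma = self.to_gamma(cond).unsqueeze(-1)   # (batch, channels, 1)
        beta = self.to_beta(cond).unsqueeze(-1)
        return gamma * features + beta

film = FiLM(cond_dim=32, n_channels=64)
feats = torch.randn(2, 64, 100)
pos_cond = torch.randn(2, 32)    # e.g. embedding of "what to extract"
neg_cond = torch.randn(2, 32)    # e.g. embedding of "what to suppress"
out = film(feats, pos_cond)      # the twist: also train with negative conditioning
```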

Let’s make sure the adversarial loss captures perceptually relevant parts of the audio, like transients, by using a percussive discriminator.

The more losses, the better?

Valin et al. use “hierarchical sampling” to improve the (computational) efficiency of the softmax layer.
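
I’m not sure this matches their exact formulation, but the general coarse-to-fine idea looks something like this sketch: instead of one big 256-way softmax, first sample a coarse group, then sample the fine value inside that group, so each step works over a much smaller distribution.

```python
import torch
import torch.nn as nn

n_groups, group_size = 16, 16            # 16 x 16 = 256 output levels
hidden = torch.randn(1, 128)             # hypothetical decoder state

coarse_head = nn.Linear(128, n_groups)
fine_head = nn.Linear(128 + n_groups, group_size)   # fine step sees the coarse choice

coarse_probs = torch.softmax(coarse_head(hidden), dim=-1)
g = torch.multinomial(coarse_probs, 1)                        # sample the group
g_onehot = torch.nn.functional.one_hot(g.squeeze(-1), n_groups).float()
fine_probs = torch.softmax(fine_head(torch.cat([hidden, g_onehot], dim=-1)), dim=-1)
f = torch.multinomial(fine_probs, 1)                          # sample within the group
sample = g * group_size + f                                   # index in [0, 255]
```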

New state-of-the-art in audio tagging on AudioSet, using an efficient transformer + CNN output.