My biggest learning this year: I’LL NOT SURVIVE ANOTHER ONLINE CONFERENCE 💔 I really miss in-person discussion in exotic places! This year I attended ICASSP to present two papers:
- “On Loss Functions and Evaluation Metrics for Music Source Separation” by Enric Gusó, Jordi Pons, Santiago Pascual, Joan Serrà [Zenodo, arXiv].
- “PixInWav: Residual Steganography for Hiding Pixels in Audio” by Margarita Geleta, Cristina Punti, Kevin McGuinness, Jordi Pons, Cristian Canton, Xavier Giro-i-Nieto [arXiv].

General trends
Let’s learn to predict the parameters of systems that are well known to audio and synth nerds (see the sketch after the list).
- “Direct design of biquad filter cascades with deep learning by sampling random polynomials” by Colonel et al.
- “Differentiable wavetable synthesis” by Shan et al.
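To make this concrete, here is a minimal, hypothetical sketch of this kind of parameter prediction: an MLP maps a target magnitude response to the coefficients of a single biquad, and the loss is computed on the filter’s differentiable frequency response. The network, shapes and training loop are my own assumptions, not taken from the papers above.

```python
import math
import torch
import torch.nn as nn

N_FREQS = 128  # frequency bins where the magnitude response is evaluated

class BiquadPredictor(nn.Module):
    """Maps a target magnitude response to the 5 coefficients of one biquad."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FREQS, 256), nn.ReLU(),
            nn.Linear(256, 5),  # b0, b1, b2, a1, a2 (filter stability is ignored here)
        )

    def forward(self, target_mag):
        return self.net(target_mag)

def biquad_mag_response(coeffs, n_freqs=N_FREQS):
    """Differentiable magnitude response of a single biquad section."""
    b0, b1, b2, a1, a2 = coeffs.unbind(dim=-1)
    w = torch.linspace(0, math.pi, n_freqs)                 # digital frequencies
    z1 = torch.exp(torch.complex(torch.zeros_like(w), -w))  # e^{-jw}
    z2 = z1 * z1                                            # e^{-2jw}
    num = b0[..., None] + b1[..., None] * z1 + b2[..., None] * z2
    den = 1.0 + a1[..., None] * z1 + a2[..., None] * z2
    return (num / den).abs()

# toy training step: match a random target magnitude response
model = BiquadPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
target = torch.rand(4, N_FREQS)                             # fake batch of targets
opt.zero_grad()
loss = torch.nn.functional.mse_loss(biquad_mag_response(model(target)), target)
loss.backward()
opt.step()
```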
Audio companies are interested in enhancing amateur music recordings.
- “Music enhancement via image translation and vocoding” by Kandpal et al.
- Dolby On, an app for recording live music that enhances amateur recordings.
The focus of self-supervised learning is moving from speech to general audio, and from classification to synthesis:
- “Towards learning universal audio representations” by Wang et al.
- “Investigating self-supervised learning for speech enhancement and separation” by Huang et al.
Several researchers are exploring the use of “complex-valued” neural networks to process “complex-valued” spectrograms (see the sketch after the list). Also, many speech enhancement systems are hybrid: signal processing + deep learning.
- “End-to-end complex-valued multidilated convolutional neural network for joint acoustic echo cancellation and noise suppression” by Watcharasupat et al.
- “Attention-based fusion for bone-conducted and air-conducted speech enhancement in the complex domain” by Wang et al.
- Main “complex-valued” speech enhancement reference: the DCCRN paper.
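As a toy illustration of the “complex-valued” idea (my own example, not the DCCRN or any of the architectures above), here is a sketch that predicts a complex ratio mask and applies it to the complex STFT, so both magnitude and phase get modified:

```python
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128

class ComplexMasker(nn.Module):
    """Predicts a complex ratio mask from the complex spectrogram."""
    def __init__(self):
        super().__init__()
        # real-valued conv on stacked (real, imag) channels; a "truly"
        # complex-valued network would use complex weights instead
        self.net = nn.Conv2d(2, 2, kernel_size=3, padding=1)

    def forward(self, spec):                            # (batch, bins, frames), complex
        x = torch.stack([spec.real, spec.imag], dim=1)  # (batch, 2, bins, frames)
        m = self.net(x)
        mask = torch.complex(m[:, 0], m[:, 1])          # complex ratio mask
        return mask * spec                              # modifies magnitude AND phase

noisy = torch.randn(1, 16000)                           # 1 s of fake audio
window = torch.hann_window(N_FFT)
spec = torch.stft(noisy, N_FFT, HOP, window=window, return_complex=True)
enhanced_spec = ComplexMasker()(spec)
enhanced = torch.istft(enhanced_spec, N_FFT, HOP, window=window, length=16000)
```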
Source separation trends
Let’s guide source separation with textual or high-level descriptions.
- “Environmental Sound Extraction Using Onomatopoeic Words” by Okamoto et al.
- “Unsupervised source separation by steering pretrained music models” by Manilow et al.
Source separation in out-of-distribution real-world data remains a challenging problem.
- “Adapting speech separation to real-world meetings using mixture invariant training” by Sivaraman et al.
- “REAL-M: towards speech separation on real mixtures” by Subakan et al.
- “Remix-cycle-consistent learning on adversarially learned separator for accurate and stable unsupervised speech separation” by Saijo and Ogawa.
Specific machine learning techniques
Adaptive instance normalization – I now know where these ideas of adapting the mean and variance (of the feature maps) to change style/identity come from! See the sketch after the references.
- “Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning” by Li et al.
- The original idea comes from computer vision (YouTube presentation).
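For reference, this is the core AdaIN operation: the content feature maps are whitened and then re-scaled with the per-channel statistics of the style input. Shapes and names here are my own assumptions:

```python
import torch

def adain(content, style, eps=1e-5):
    # content, style: (batch, channels, time) feature maps
    c_mean = content.mean(dim=-1, keepdim=True)
    c_std = content.std(dim=-1, keepdim=True) + eps
    s_mean = style.mean(dim=-1, keepdim=True)
    s_std = style.std(dim=-1, keepdim=True) + eps
    normalized = (content - c_mean) / c_std      # whiten the content statistics
    return normalized * s_std + s_mean           # impose the style statistics

content = torch.randn(2, 64, 100)
style = torch.randn(2, 64, 80)                   # time lengths may differ
stylized = adain(content, style)                 # (2, 64, 100)
```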
Deep equilibrium models – these allow “keeping the expressivity” while reducing the number of trainable parameters of the model (see the sketch after the reference).
- “Music source separation with deep equilibrium models” by Koyama et al.
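A toy sketch of the deep-equilibrium idea: a single weight-tied layer is iterated towards a fixed point z* = f(z*, x), so the “depth” grows without adding parameters. Real DEQs solve the fixed point with root finding and backpropagate via implicit differentiation; this naive forward iteration only illustrates the parameter sharing:

```python
import torch
import torch.nn as nn

class ToyDEQ(nn.Module):
    def __init__(self, dim=64, n_iters=30):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())  # shared weights
        self.n_iters = n_iters

    def forward(self, x):
        z = torch.zeros_like(x)
        for _ in range(self.n_iters):                 # iterate towards z* = f(z*, x)
            z = self.f(torch.cat([z, x], dim=-1))
        return z

x = torch.randn(8, 64)
out = ToyDEQ()(x)   # the "depth" of 30 iterations, the parameters of one layer
```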
Implicit neural representations – parametrise the (observed) audio such that one can easily interpolate (upsample) such a parametrisation, e.g. for sampling rate conversion (sketch below).
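A minimal sketch of the idea, with an intentionally plain MLP (practical systems use sinusoidal activations or Fourier features): fit a network that maps a time coordinate to a sample value, then query it on a denser time grid to resample. Everything below is a toy assumption, not any specific paper’s setup.

```python
import math
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(1, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 1))

# toy "recording": 1 s of a 440 Hz sine observed at 16 kHz
t = torch.linspace(0, 1, 16000).unsqueeze(-1)          # observed time coordinates
audio = torch.sin(2 * math.pi * 440 * t).squeeze(-1)

opt = torch.optim.Adam(mlp.parameters(), lr=1e-4)
for _ in range(100):                                   # fit the representation
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(mlp(t).squeeze(-1), audio)
    loss.backward()
    opt.step()

# "resample" to 48 kHz by evaluating the MLP on a denser grid of coordinates
t_dense = torch.linspace(0, 1, 48000).unsqueeze(-1)
upsampled = mlp(t_dense).squeeze(-1)
```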
Distribution augmentation – works by flagging any data augmentation during training. During inference, such flags are set to zero (the non-flag symbol) to make sure distortions due to unrealistic augmentations do not affect inference (see the sketch after the reference).
- “Distribution augmentation for low-resource expressive text-to-speech” by Lajszczak et al.
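A hypothetical sketch of how such flags could be wired in: each augmentation gets a one-hot flag appended to the conditioning during training, and all flags are zeroed at inference. The augmentation set, shapes and model are my own assumptions:

```python
import torch
import torch.nn as nn

AUGMENTATIONS = ["pitch_shift", "time_stretch", "reverse"]  # example set

def augmentation_flags(applied, training=True):
    """One-hot flags of the applied augmentations; all-zero at inference."""
    flags = torch.zeros(len(AUGMENTATIONS))
    if training:
        for name in applied:
            flags[AUGMENTATIONS.index(name)] = 1.0
    return flags

class ConditionedModel(nn.Module):
    def __init__(self, feat_dim=80):
        super().__init__()
        self.net = nn.Linear(feat_dim + len(AUGMENTATIONS), feat_dim)

    def forward(self, features, flags):
        flags = flags.expand(features.shape[0], -1)          # broadcast over frames
        return self.net(torch.cat([features, flags], dim=-1))

model = ConditionedModel()
frames = torch.randn(100, 80)                                # fake acoustic features
y_train = model(frames, augmentation_flags(["pitch_shift"], training=True))
y_infer = model(frames, augmentation_flags([], training=False))  # non-flag symbol
```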
Orderless NADE and Gibbs sampling – used to include an iterative source separation step (see the sketch after the reference).
- “Improving source separation by explicitly modeling dependencies between sources” by Manilow et al.
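A sketch of the Gibbs-sampling-style refinement: each source is repeatedly re-estimated conditioned on the mixture and the current estimates of the other sources. The ConditionalSourceModel below is a dummy stand-in of my own; the paper uses an orderless-NADE-style network for this conditional.

```python
import torch
import torch.nn as nn

N_SOURCES, N_SAMPLES = 4, 512                           # toy sizes

class ConditionalSourceModel(nn.Module):
    """Dummy p(source_i | mixture, other sources) estimator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(N_SOURCES * N_SAMPLES, N_SAMPLES)

    def forward(self, mixture, others):
        x = torch.cat([mixture] + others, dim=-1)
        return self.net(x)

def gibbs_separate(model, mixture, n_rounds=5):
    # start from a crude initialization: split the mixture evenly
    sources = [mixture / N_SOURCES for _ in range(N_SOURCES)]
    for _ in range(n_rounds):
        for i in range(N_SOURCES):                      # re-estimate one source at a time
            others = [s for j, s in enumerate(sources) if j != i]
            sources[i] = model(mixture, others)
    return sources

mixture = torch.randn(N_SAMPLES)
estimates = gibbs_separate(ConditionalSourceModel(), mixture)
```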
Personal takes
I enjoyed this “simple” baseline based on masking for text-based speech editing. In short: mask and generate the part you wish to edit (conditioned on text).
A couple of papers on SDR that present the basics well (see the sketch after the list):
- Nice summary of the SDR-related losses/problems: “SA-SDR: A novel loss function for separation of meeting style data” by Neumann et al.
- Recent update on making SDR faster: “SDR – medium rare with fast computations” by Scheibler.
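For reference, a plain numpy implementation of scale-invariant SDR, the quantity these papers build on (SA-SDR is a variant of this idea for meeting-style data):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    # project the estimate onto the reference to find the optimal scaling
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference                  # optimally scaled reference
    error = estimate - target
    # energy ratio between the scaled target and the residual error, in dB
    return 10 * np.log10((target @ target + eps) / (error @ error + eps))

ref = np.random.randn(16000)
est = ref + 0.1 * np.random.randn(16000)        # a slightly noisy estimate
print(si_sdr(est, ref))                         # roughly 20 dB for this noise level
```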
I liked the idea of also using negative conditioning examples for FiLM layers (see the sketch after the reference).
- “Few-Shot Musical Source Separation” by Wang et al.
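A minimal FiLM (feature-wise linear modulation) layer sketch: a conditioning embedding predicts a per-channel scale and shift for the feature maps. The way I pool and combine the positive/negative example embeddings below is just an assumption for illustration:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim=128, n_channels=64):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, n_channels)
        self.to_beta = nn.Linear(cond_dim, n_channels)

    def forward(self, features, cond):
        # features: (batch, channels, time), cond: (batch, cond_dim)
        gamma = self.to_gamma(cond).unsqueeze(-1)   # (batch, channels, 1)
        beta = self.to_beta(cond).unsqueeze(-1)
        return gamma * features + beta

features = torch.randn(2, 64, 100)
pos = torch.randn(2, 3, 128).mean(dim=1)            # pooled positive examples
neg = torch.randn(2, 3, 128).mean(dim=1)            # pooled negative examples
modulated = FiLM()(features, pos - neg)             # toy way to combine both
```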
Let’s make sure the adversarial loss captures perceptually relevant parts of the audio, like the transients, by using a percussive discriminator.
The more losses, the better?
- “KaraSinger: score-free singing voice synthesis with VQ-VAE using mel-spectrograms” by Liao et al. Cool application! And the CTC loss is key to encouraging the quantized layers to carry phoneme-related information (see the sketch after this list).
- “Unsupervised speech enhancement with speech recognition embedding and disentanglement losses” by Trinh et al. (MixIT + embedding losses).
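A sketch of how an auxiliary CTC loss can be attached to a latent sequence so it carries phoneme information, using torch’s built-in CTC loss. The vocabulary size, shapes and loss weighting are assumptions for illustration:

```python
import torch
import torch.nn as nn

N_PHONEMES = 40                                     # plus a blank symbol at index 0
batch, frames, latent_dim = 4, 120, 256

latents = torch.randn(batch, frames, latent_dim)    # e.g. a VQ-VAE encoder output
to_phonemes = nn.Linear(latent_dim, N_PHONEMES + 1)
log_probs = to_phonemes(latents).log_softmax(-1).transpose(0, 1)  # (T, N, C)

targets = torch.randint(1, N_PHONEMES + 1, (batch, 20))           # phoneme ids
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
aux_loss = ctc(log_probs, targets, input_lengths, target_lengths)
# total_loss = reconstruction_loss + lambda_ctc * aux_loss  (lambda_ctc is a weight)
```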
Valin et al. use “hierarchical sampling” to improve the (computational) efficiency of the softmax layer.
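My reading of “hierarchical sampling”, as a generic two-level sketch (not necessarily the exact scheme in the paper): factorize the output distribution over K values into a coarse distribution over groups and a fine distribution within the selected group, so sampling only touches small softmaxes instead of one K-sized softmax:

```python
import torch

K, G = 4096, 64                                     # vocabulary size, 64 groups of 64

coarse_logits = torch.randn(G)                      # logits of p(group)
fine_logits = torch.randn(G, K // G)                # logits of p(value | group)

def hierarchical_sample():
    group = torch.multinomial(coarse_logits.softmax(-1), 1).item()
    within = torch.multinomial(fine_logits[group].softmax(-1), 1).item()
    return group * (K // G) + within                # index in the full vocabulary

sample = hierarchical_sample()
```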
New state of the art in audio tagging on AudioSet, using an efficient transformer + CNN output: