Actually, what I really need is fewer papers with “all you need” in the title, and to share a (non-virtual) beer with you folks!! Here are some of the papers I enjoyed, together with the papers we presented. You’ll see that I don’t include classification/tagging papers; I guess I need a break from my PhD topic :) Enjoy!

Source separation
- (our work) Multichannel-based learning for audio object extraction. Link: arxiv.
- They learn to extract objects without defining a loss at the object level. Instead, the loss is defined after rendering to multichannel (like 5.1 or stereo).
- (our work) On permutation invariant training for speech source separation. Link: arxiv.
- Speaker permutation errors are a known problem in speech source separation; we explicitly investigate and discuss this issue (a minimal sketch of permutation invariant training appears after this list).
- Towards listening to 10 people simultaneously: an efficient permutation invariant training of audio source separation using Sinkhorn’s algorithm. Link: arxiv.
- As clearly noted in the title, an efficient alternative to permutation invariant training in audio source separation.
- The following works investigate MISI, the “Griffin-Lim algorithm for source separation”.
- One-shot conditional audio filtering of arbitrary sound. Links: arxiv, audios.
- Universal sound separation by example, but it can be tricky to find the “right” example matching the sound present in the audio at hand.
- Transcription is all you need: learning to separate musical mixtures with score as supervision. Link: arxiv.
- They perform weakly-supervised source separation, and find that using a transcription system to guide the separations works better than a classifier.
- They also use harmonic masks derived from the musical score, as well as adversarial losses on the transcriptor. Hence, transcription is not all you need :)
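Since permutation invariant training (PIT) shows up in several of the papers above, here is a minimal PyTorch sketch of the core idea, under my own assumptions (MSE criterion, toy tensor shapes): the loss is the minimum reconstruction error over all assignments of estimated sources to reference sources, so the model is not penalised for outputting the speakers in a different order.

```python
import itertools
import torch

def pit_loss(estimates, references):
    """estimates, references: (batch, n_src, time) waveforms."""
    n_src = references.shape[1]
    per_perm = []
    for perm in itertools.permutations(range(n_src)):
        permuted = estimates[:, list(perm), :]                 # reorder estimated sources
        per_perm.append(((permuted - references) ** 2).mean(dim=(1, 2)))
    # pick the best source assignment per example, then average over the batch
    return torch.stack(per_perm, dim=1).min(dim=1).values.mean()

# toy usage with two speakers and 1 s of 16 kHz audio
est = torch.randn(8, 2, 16000, requires_grad=True)
ref = torch.randn(8, 2, 16000)
loss = pit_loss(est, ref)
loss.backward()
```

Note that enumerating all permutations scales factorially with the number of sources, which is precisely what the Sinkhorn-based paper above tries to avoid.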
Synthesis
- (our work) Upsampling artifacts in neural audio synthesis. Links: arxiv, code.
- If you work on audio synthesis, you might have noticed that upsampling layers (e.g., transposed or subpixel CNNs, or nearest neighbour interpolation) introduce artifacts, which are documented in this article (see the toy sketch after this list).
- LOOPNET: musical loop synthesis conditioned on intuitive musical parameters. Links: arxiv, audio.
- They study intuitive controls that allow composers to steer a generative model of musical audio loops. Arguably, the future of sampling culture will rely on sampling distributions :)
- Semi-supervised learning for singing synthesis timbre. Link: arxiv.
- They map acoustic and linguistic features into the same latent space, such that one can use one or another seamlessly for different purposes.
- Context-aware prosody correction for text-based speech editing. Link: arxiv.
- When replacing some speech with another speech audio snippet, prosody mismatch can be problematic. They adapt the prosody via standard DSP pitch transformation plus neural enhancement.
- Efficient Adversarial Audio Synthesis Via Progressive Upsampling. Link: IEEE.
- Similarly to ProgressiveGAN (which produces high-resolution images from low-resolution ones), they propose a progressive upsampling GAN (going from low to high sampling rates).
- StyleMelGAN: an efficient high-fidelity adversarial vocoder with temporal adaptive normalization. Link: arxiv.
- They use mel-spectrogram conditioning at all layers to modulate the latent, which is the result of transforming noise. They call this operation “to style a low-dimensional noise vector”.
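As a toy illustration of the upsampling artifacts mentioned in our paper above, here is a small sketch (assuming PyTorch; the layer sizes are arbitrary choices of mine): a randomly initialised transposed convolution applied to a constant input already produces a pattern that repeats with period equal to the stride, one flavour of the tonal artifacts discussed in the article.

```python
import torch

torch.manual_seed(0)
# a randomly initialised upsampling layer, as commonly used in neural synthesizers
upsample = torch.nn.ConvTranspose1d(1, 1, kernel_size=16, stride=4, bias=False)

x = torch.ones(1, 1, 64)                     # constant input "feature map"
y = upsample(x).detach().squeeze()

# away from the borders, the output repeats every `stride` samples
print(y[20:28])                              # a repeating 4-sample pattern
print(torch.allclose(y[20:24], y[24:28]))    # True: periodic (tonal) artifact
```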
Neural audio effects
- (our work) Automatic multitrack mixing with a differentiable mixing console of neural audio effects. Links: arxiv, demo.
- Proposes learning the behaviour of the channel strips of a mixing console with a neural network, to have a fully differentiable mixing console for learning to mix in an end-to-end fashion.
- Differentiable signal processing with black-box audio effects. Link: arxiv.
- While the work above employs neural networks to emulate audio effects, they propose a framework that allows (directly) using plugins in your neural networks.
- Lightweight and interpretable neural modeling of an audio distortion effect using hyperconditioned differentiable biquads. Link: arxiv.
- While the works above employ neural networks or audio plugins as “audio effect layers”, they propose a framework based on differentiable digital signal processing (with differentiable biquadratic filters); a minimal sketch of a differentiable biquad follows this list.
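To make the differentiable-biquad idea a bit more concrete, here is a minimal sketch of my own (assuming PyTorch): the biquad recursion is written directly in the framework so that gradients flow to the filter coefficients. It is a naive, slow time-domain loop; the paper's hyperconditioning (predicting the coefficients from user controls) and its efficient implementation are omitted.

```python
import torch

class Biquad(torch.nn.Module):
    """A single biquad with learnable coefficients (a0 is fixed to 1)."""
    def __init__(self):
        super().__init__()
        self.b = torch.nn.Parameter(torch.tensor([1.0, 0.0, 0.0]))  # feed-forward taps
        self.a = torch.nn.Parameter(torch.tensor([0.0, 0.0]))       # feedback taps a1, a2

    def forward(self, x):
        # x: (batch, time); zero initial conditions
        x = torch.nn.functional.pad(x, (2, 0))
        y_1 = y_2 = torch.zeros(x.shape[0])
        out = []
        for n in range(2, x.shape[-1]):
            y_n = (self.b[0] * x[:, n] + self.b[1] * x[:, n - 1] + self.b[2] * x[:, n - 2]
                   - self.a[0] * y_1 - self.a[1] * y_2)
            y_2, y_1 = y_1, y_n
            out.append(y_n)
        return torch.stack(out, dim=-1)

# toy usage: fit the biquad to a dummy target by gradient descent
model = Biquad()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.randn(4, 512)
target = torch.roll(x, 1, dims=-1)
for _ in range(10):
    loss = torch.nn.functional.mse_loss(model(x), target)
    opt.zero_grad(); loss.backward(); opt.step()
```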
Self-supervised learning
- Multi-task self-supervised pre-training for music classification. Link: arxiv.
- They extend the multi-task self-supervised learning originally proposed for speech to music classification. The targets include MFCCs for timbre, chroma for harmonic, and tempogram for rhythmic attributes.
- Learning contextual tag embeddings for cross-modal alignment of audio and tags. Link: arxiv.
- A method for cross-modal alignment of general audio (not only speech) and tags, using contrastive learning and a pre-trained language model.
- HuBERT: how much can a bad teacher benefit ASR pre-training? Link: IEEE.
- A Hidden-Unit BERT (HuBERT) model that uses a “cheap” k-means clustering step to provide aligned target labels for pre-training a BERT-ish ASR model.
- Contrastive learning for general audio representations. Link: arxiv.
- Learns representations by assigning high similarity to audio segments extracted from the same recording, and lower similarity to segments from different recordings (a minimal sketch of this setup appears after this list).
- The larger the batch, the better! Luckily, they didn’t title the paper “large batch sizes are all you need in contrastive learning”.
- Optimizing short-time Fourier transform parameters via gradient descent. Links: arxiv, code.
- While this is not, strictly speaking, self-supervised learning, they study how to learn the parameters of the STFT (a toy sketch also follows this list).
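Regarding the contrastive recipe described above (high similarity for segments of the same recording, lower for segments of different recordings), here is a minimal sketch with an InfoNCE/NT-Xent-style loss, assuming PyTorch; the linear “encoder”, segment lengths and temperature are placeholders of mine, not the paper’s actual choices.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.1):
    """z_a, z_b: (batch, dim) embeddings of two segments per recording."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature     # (batch, batch) pairwise similarities
    labels = torch.arange(z_a.shape[0])      # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

encoder = torch.nn.Linear(16000, 128)        # stand-in for a real audio encoder

recordings = torch.randn(32, 32000)          # a batch of 2 s "recordings" at 16 kHz
seg_a, seg_b = recordings[:, :16000], recordings[:, 16000:]   # two crops per recording
loss = contrastive_loss(encoder(seg_a), encoder(seg_b))
loss.backward()
```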
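And for the STFT paper, a toy sketch of learning spectrogram parameters by gradient descent, assuming PyTorch: here I only learn the width of a Gaussian window by backpropagating through an unfold + rFFT spectrogram, with a dummy target; the paper optimises the actual STFT parameters with a more careful formulation.

```python
import torch

def spectrogram(x, window, hop=256):
    frames = x.unfold(-1, window.numel(), hop)      # (n_frames, win_len)
    return torch.fft.rfft(frames * window, dim=-1)  # complex spectrogram

win_len = 1024
n = torch.arange(win_len) - (win_len - 1) / 2
log_sigma = torch.tensor(5.0, requires_grad=True)   # learnable window width
opt = torch.optim.Adam([log_sigma], lr=1e-2)

x = torch.randn(16000)                              # dummy 1 s signal
target = torch.rand(59, 513)                        # dummy target magnitudes

for _ in range(100):
    window = torch.exp(-0.5 * (n / log_sigma.exp()) ** 2)   # Gaussian window
    loss = torch.nn.functional.mse_loss(spectrogram(x, window).abs(), target)
    opt.zero_grad(); loss.backward(); opt.step()
```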
Speech quality metrics
- (our work) SESQA: semi-supervised learning for speech quality assessment. Link: arxiv.
- Uses labelled data and self-supervised learning ideas to learn to predict speech quality. It can be used both as a reference-based and as a reference-free metric.
- CDPAM: contrastive learning for perceptual audio similarity. Links: arxiv, code.
- It improves on their previous work by combining contrastive learning and multi-dimensional representations to build robust models from limited data.
Speech enhancement
- ICASSP 2021 deep noise suppression challenge: decoupling magnitude and phase optimization with a two-stage deep network. Link: arxiv.
- A two-stage model: (i) train to estimate the “clean magnitude”, and (ii) train to estimate the “clean real and imaginary” parts from the “clean magnitude + noisy phase”, jointly with loss (i). A rough sketch of this two-stage idea appears after this list.
- It also uses a signal-processing-based post-processing step.
- Bandwidth extension is all you need. Link: IEEE.
- Very good results on bandwidth extension, based on a feed-forward wavenet with deep feature matching and adversarial training.
- They argue it is key for high-fidelity audio synthesis, since you can train an efficient system (e.g., a vocoder or enhancement model) at 8 kHz or 16 kHz and later apply bandwidth extension. Since you might still need that vocoder or enhancement model, bandwidth extension is not all you need :)
- Enhancing into the codec: noise robust speech coding with vector-quantized autoencoders. Link: arxiv.
- Learnt codecs (trained on clean speech) are not necessarily robust to noisy speech. In line with that, they propose to jointly tackle speech enhancement and coding.
- Cascaded time + time-frequency Unet for speech enhancement: jointly addressing clipping, codec distortions, and gaps. Link: IEEE.
- This work jointly addresses three speech distortions: clipping, codec distortions, and gaps in speech (PLCish).
- High fidelity speech regeneration with application to speech enhancement. Link: arxiv.
- They approach speech enhancement via resynthesis. Steps: (i) denoising; (ii) extracting speech features such as loudness, f0 and a d-vector; and (iii) speech synthesis with a GAN-TTS vocoder.
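To illustrate the two-stage decoupling described above for the DNS-challenge paper, here is a rough sketch assuming PyTorch: stage 1 predicts a clean magnitude from the noisy magnitude, stage 2 refines the real/imaginary parts from the stage-1 magnitude combined with the noisy phase, and both losses are used jointly. The linear modules and STFT settings are placeholders of mine, not the paper’s architecture.

```python
import torch

n_fft, hop, n_bins = 512, 128, 257
stage1 = torch.nn.Sequential(torch.nn.Linear(n_bins, n_bins), torch.nn.Softplus())
stage2 = torch.nn.Linear(2 * n_bins, 2 * n_bins)

noisy, clean = torch.randn(4, 16000), torch.randn(4, 16000)
win = torch.hann_window(n_fft)
N = torch.stft(noisy, n_fft, hop, window=win, return_complex=True).transpose(1, 2)
C = torch.stft(clean, n_fft, hop, window=win, return_complex=True).transpose(1, 2)

mag1 = stage1(N.abs())                                 # stage 1: clean magnitude
loss1 = torch.nn.functional.mse_loss(mag1, C.abs())

coarse = mag1 * torch.exp(1j * N.angle())              # stage-1 magnitude + noisy phase
ri = stage2(torch.cat([coarse.real, coarse.imag], dim=-1))   # stage 2: refine real/imag
est_real, est_imag = ri[..., :n_bins], ri[..., n_bins:]
loss2 = (torch.nn.functional.mse_loss(est_real, C.real)
         + torch.nn.functional.mse_loss(est_imag, C.imag))

(loss1 + loss2).backward()
```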
Low complexity for speech enhancement
- Low-complexity, real-time joint neural echo control and speech enhancement based on PercepNet. Link: arxiv.
- They combine a traditional acoustic echo canceller with a low-complexity joint residual echo and noise suppressor based on a hybrid DSP/DNN approach.
- Towards efficient models for real-time deep noise suppression. Link: arxiv.
- A very simple U-net for speech enhancement, but they explain interesting tricks to build an efficient model.
WANT TO FURTHER DISCUSS THOSE TOPICS ONLINE?
Feel free to contact me on Clubhouse (handle: @jordipons), to organize a discussion around those papers!