Today we release “PodcastMix”, a dataset for separating music and speech in podcasts.
In this work we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners.
During his internship at Dolby, Enric ran an exhaustive evaluation of various loss functions for music source separation. After evaluating those losses objectively and subjectively, we recommend training with the following spectrogram-based losses: L2freq, SISDRfreq, LOGL2freq or LOGL1freq, potentially combined with phase-sensitive objectives and adversarial regularizers.
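As a minimal sketch of what spectrogram-domain losses of this kind compute, the snippet below implements an L2 loss on magnitude spectrograms (in the spirit of L2freq) and an L1 loss on log-magnitude spectrograms (in the spirit of LOGL1freq). The STFT front end and function names here are our own illustrative choices, not the exact formulation from the paper:

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    # Naive magnitude STFT: windowed frames + real FFT (illustrative only).
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def l2_freq(est, ref):
    # L2 loss between magnitude spectrograms (L2freq-style).
    return np.mean((stft_mag(est) - stft_mag(ref)) ** 2)

def log_l1_freq(est, ref, eps=1e-8):
    # L1 loss between log-magnitude spectrograms (LOGL1freq-style).
    return np.mean(np.abs(np.log(stft_mag(est) + eps)
                          - np.log(stft_mag(ref) + eps)))
```

Computing the loss in the (log-)magnitude spectrogram domain discards phase, which is why the text suggests pairing such losses with phase-sensitive objectives when phase fidelity matters.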
These are the papers we will be presenting at ICASSP 2021! Infinite thanks to all my collaborators for the amazing work 🙂
- On Loss Functions and Evaluation Metrics for Music Source Separation by Enric Gusó, Jordi Pons, Santiago Pascual, Joan Serrà [Zenodo, arXiv].
- PixInWav: Residual Steganography for Hiding Pixels in Audio by Margarita Geleta, Cristina Punti, Kevin McGuinness, Jordi Pons, Cristian Canton, Xavier Giro-i-Nieto [arXiv].
How can we extract audio objects with deep learning – without explicitly learning to extract them? In our ICASSP paper we propose multichannel-based learning, a technique closely related to self-supervised learning, differentiable digital signal processing, and universal sound separation.