How can you extract audio objects with deep learning without explicitly learning to extract them? In our ICASSP paper we propose multichannel-based learning, a technique closely related to self-supervised learning, differentiable digital signal processing, and universal sound separation.
Why related to differentiable digital signal processing? Multichannel-based learning is enabled by the structure of the model: because our decoder is a differentiable implementation of Dolby Atmos, it constrains our encoder to output this format.
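To make the idea of a differentiable renderer-as-decoder concrete, here is a minimal numpy sketch. The actual Dolby Atmos renderer is not public, so this assumes a toy constant-power stereo pan; the function names and the panning scheme are illustrative, not from the paper.

```python
import numpy as np

def render(objects: np.ndarray, pans: np.ndarray) -> np.ndarray:
    """Toy differentiable renderer (stand-in for the real decoder).

    objects: (n_obj, n_samples) mono object signals.
    pans:    (n_obj,) pan positions in [0, 1] (0 = left, 1 = right).
    Returns a (2, n_samples) stereo render.
    """
    theta = pans * (np.pi / 2)                        # pan position -> angle
    gains = np.stack([np.cos(theta), np.sin(theta)])  # (2, n_obj), constant power
    # A plain matrix product: every operation here is differentiable, so
    # gradients from a loss on the render flow back into the objects.
    return gains @ objects

objs = np.random.randn(3, 16)
mix = render(objs, np.array([0.0, 0.5, 1.0]))
print(mix.shape)  # (2, 16)
```

Because the decoder is fixed and differentiable, any loss computed on `mix` back-propagates through `render` into whatever network produced `objs` and the pan positions, which is what pushes the encoder to emit signals in the renderer's expected format.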
Why related to universal sound separation? Our model is not source-specific (i.e., it is not a music or speech source separation model). It extracts the three most prominent sources plus a multichannel remainder, called "bed channels", containing the audio not captured by the extracted objects.
Why related to self-supervised learning? Instead of learning to extract objects directly, we train against a proxy signal: the loss is computed on multichannel renders.
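The proxy objective can be sketched as follows: rather than comparing estimated objects to ground-truth objects (which are not available), compare the multichannel render of the estimate against a reference multichannel mix. The renderer, the L1 distance, and all names here are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def render(objects, gains):
    # gains: (n_ch, n_obj) fixed panning matrix; objects: (n_obj, n_samples).
    return gains @ objects

def proxy_loss(est_objects, bed, reference_mix, gains):
    """L1 distance between the render of the estimated objects plus the
    bed channels and the reference multichannel mix."""
    est_mix = render(est_objects, gains) + bed
    return np.mean(np.abs(est_mix - reference_mix))

rng = np.random.default_rng(0)
gains = rng.random((5, 3))                # 5-channel render of 3 objects
true_objs = rng.standard_normal((3, 32))
bed = rng.standard_normal((5, 32))        # residual "bed" channels
reference = render(true_objs, gains) + bed

# A perfect estimate drives the proxy loss to zero.
print(proxy_loss(true_objs, bed, reference, gains))  # 0.0
```

The point of the proxy: supervision comes entirely from the multichannel mix itself, so no per-object ground truth is ever required during training.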