End-to-end Music Source Separation

Is it possible in the waveform domain?


Most of the currently successful source separation techniques use the magnitude spectrogram as input, and therefore discard part of the signal by default: the phase. To avoid omitting potentially useful information, we study the viability of end-to-end models for music source separation. By operating directly on the waveform, these models take into account all the information available in the raw audio signal, including the phase. Our results show that waveform-based models can outperform a recent spectrogram-based deep learning model: both a novel Wavenet-based model we propose and Wave-U-Net outperform DeepConvSep. This suggests that end-to-end learning has great potential for the problem of music source separation.

Read our paper on arXiv!


DeepConvSep

This CNN estimates time-frequency soft masks from magnitude spectrograms

GitHub · Article
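
To make the contrast with waveform models concrete, here is a minimal sketch of the time-frequency soft-masking approach that spectrogram-based models like DeepConvSep build on. The estimate_magnitudes function is a hypothetical stand-in for the CNN, and the STFT parameters are illustrative.

```python
import numpy as np
import librosa

def estimate_magnitudes(mix_mag):
    """Hypothetical stand-in for the CNN: any model mapping the mixture
    magnitude to per-source magnitude estimates of shape (n_src, freq, time)."""
    return np.stack([mix_mag * 0.5, mix_mag * 0.5])  # placeholder so the sketch runs

def separate_with_soft_masks(mixture, n_fft=1024, hop=256):
    # Complex STFT of the mixture; only its magnitude is shown to the model.
    mix_stft = librosa.stft(mixture, n_fft=n_fft, hop_length=hop)
    mix_mag = np.abs(mix_stft)

    # Per-source magnitude estimates, shape (n_sources, freq, time).
    est_mags = estimate_magnitudes(mix_mag)

    # Soft (ratio) masks: each source's share of the total estimated magnitude.
    masks = est_mags / (est_mags.sum(axis=0, keepdims=True) + 1e-8)

    # Masking the complex mixture STFT means every separated source simply
    # reuses the mixture phase, the information an end-to-end waveform
    # model could exploit instead.
    return [librosa.istft(mask * mix_stft, hop_length=hop) for mask in masks]
```

Whatever the quality of the masks, the resynthesis step above reuses the mixture phase, which is precisely the limitation the two waveform models below avoid.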

Wavenet-based

A discriminative, non-causal Wavenet performing end-to-end source separation

GitHub · Article
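
As a rough illustration of the core idea, here is a PyTorch sketch of a discriminative, non-causal dilated convolution stack. The layer count, channel width, and plain residual connections are our simplifications, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class NonCausalDilatedStack(nn.Module):
    """Illustrative stack of non-causal dilated 1D convolutions."""
    def __init__(self, channels=64, n_layers=10):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        # Dilations 1, 2, 4, ... grow the receptive field exponentially;
        # symmetric padding (padding == dilation for kernel size 3) keeps
        # every layer non-causal, unlike the generative Wavenet: each
        # output sample sees context both before and after it.
        self.dilated = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=2 ** i, padding=2 ** i)
            for i in range(n_layers))
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                  # x: (batch, 1, samples)
        h = torch.tanh(self.inp(x))
        for conv in self.dilated:
            h = h + torch.tanh(conv(h))    # residual connection per layer
        return self.out(h)                 # separated source, same length
```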

Wave-U-Net

Wave-U-Net is the 1D adaptation of U-Net for end-to-end source separation

GitHub · Article
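
Likewise, a rough PyTorch sketch of the Wave-U-Net idea: 1D convolutions with decimation on the way down, and upsampling plus skip connections on the way up, all operating directly on samples. Depth, channel sizes, kernels, and the resampling scheme are simplified relative to the actual Wave-U-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveUNet(nn.Module):
    """Illustrative 1D U-Net; input length must be divisible by 2**depth."""
    def __init__(self, depth=3, base=24):
        super().__init__()
        chans = [1] + [base * (i + 1) for i in range(depth)]  # [1, 24, 48, 72]
        self.down = nn.ModuleList(
            nn.Conv1d(chans[i], chans[i + 1], kernel_size=15, padding=7)
            for i in range(depth))
        self.up = nn.ModuleList(
            nn.Conv1d(2 * chans[i + 1], chans[i], kernel_size=5, padding=2)
            for i in reversed(range(depth)))

    def forward(self, x):                   # x: (batch, 1, samples)
        skips = []
        for conv in self.down:              # encoder: convolve, then decimate
            x = torch.relu(conv(x))
            skips.append(x)                 # keep full-resolution features
            x = x[:, :, ::2]                # halve the temporal resolution
        for i, conv in enumerate(self.up):  # decoder: upsample, concat skip
            x = F.interpolate(x, scale_factor=2, mode='linear',
                              align_corners=False)
            x = conv(torch.cat([x, skips.pop()], dim=1))
            if i < len(self.up) - 1:
                x = torch.relu(x)
        return torch.tanh(x)                # bounded estimate of one source
```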

Singing Voice Source Separation experiment

[Audio players for Songs 1–5: original mixture, Wavenet vocals, Wave-U-Net vocals, and the clean vocals signal.]

This experiment shows the potential of two end-to-end learning models for singing voice source separation: both waveform-based models are capable of performing the task decently.

While Wave-U-Net is more conservative, the Wavenet-based model seems to better remove the accompanying sources, at the cost of some noticeable artifacts. Conversely, Wave-U-Net has difficulties producing silence in parts where the singing voice is not present, which results in a smoother separation with fewer artifacts.

Multi-instrument Source Separation experiment

[Audio players for Songs 1–5: original mixture; vocals, drums and bass as separated by Wavenet; vocals, drums and bass as separated by DeepConvSep; and the clean vocals, drums and bass signals.]

In this experiment we compare waveform-based and spectrogram-based models on the task of multi-instrument source separation.

The Wavenet-based model achieves separations comparable to (if not better than) DeepConvSep's. The results it achieves when separating drums and bass are particularly remarkable. Overall, one can observe that DeepConvSep is even more conservative than Wave-U-Net, possibly due to the mask-based approach used for filtering the spectrograms. And although the Wavenet-based model better removes the accompanying sources, it does so at the cost of introducing some noticeable artifacts.
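
For readers who want to put numbers on comparisons like these, separations are commonly scored with the BSS Eval metrics. Below is a minimal sketch using the mir_eval package; this particular setup and the placeholder file names are assumptions of ours, not something specified on this page.

```python
import numpy as np
import soundfile as sf
import mir_eval

sources = ('vocals', 'drums', 'bass')
# Placeholder file names; all files are assumed mono, same length and rate.
refs = np.stack([sf.read(f'clean_{s}.wav')[0] for s in sources])
ests = np.stack([sf.read(f'estimated_{s}.wav')[0] for s in sources])

# SDR summarizes overall quality; SIR rewards removing the other sources
# (where the Wavenet model is aggressive) and SAR penalizes artifacts
# (where the conservative models fare better).
sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(refs, ests)
for name, d, i, a in zip(sources, sdr, sir, sar):
    print(f'{name}: SDR={d:.2f} dB  SIR={i:.2f} dB  SAR={a:.2f} dB')
```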

Acknowledgments — Work partially funded by the Maria de Maeztu Programme (MDM-2015-0502). We are grateful to NVidia for the donated GPUs, to Foxnice for hosting our demos, and special thanks to Daniel Balcells for his valuable corrections.