Most of the currently successful source separation techniques use the magnitude spectrogram as input and therefore discard part of the signal by default: the phase. To avoid omitting potentially useful information, we study the viability of end-to-end models for music source separation. By operating directly on the waveform, these models take into account all the information available in the raw audio signal, including the phase. Our results show that waveform-based models can outperform a recent spectrogram-based deep learning model: both a novel Wavenet-based model we propose and Wave-U-Net outperform DeepConvSep. This suggests that end-to-end learning has great potential for the problem of music source separation.
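The point about discarded phase can be illustrated with a toy example (ours, not from the paper): two signals that differ only in phase produce the same magnitude spectrum, so any model fed only magnitudes cannot tell them apart.

```python
import numpy as np

# Toy illustration: a 440 Hz sine and a phase-shifted copy of it.
sr = 8000
t = np.arange(sr) / sr
x1 = np.sin(2 * np.pi * 440 * t)                # 440 Hz sine
x2 = np.sin(2 * np.pi * 440 * t + np.pi / 3)    # same sine, shifted phase

# Magnitude spectra are (numerically) identical; the waveforms are not.
mag1 = np.abs(np.fft.rfft(x1))
mag2 = np.abs(np.fft.rfft(x2))

print(np.allclose(mag1, mag2, atol=1e-6))  # -> True
print(np.allclose(x1, x2))                 # -> False
```

The phase difference that distinguishes the two waveforms is exactly what a magnitude-spectrogram pipeline throws away, and what a waveform-domain model keeps.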
This experiment shows the potential of two end-to-end learning models for singing voice source separation: both waveform-based models are capable of performing the task decently.
While Wave-U-Net is more conservative, the Wavenet-based model seems better at removing the accompanying sources (at the cost of some noticeable artifacts). Accordingly, Wave-U-Net has difficulties producing silence in parts where the singing voice is not present, which results in a smoother separation with fewer artifacts.
Multi-instrument Source Separation experiment
Clean signal: vocals
Clean signal: drums
Clean signal: bass
In this experiment we compare waveform-based and spectrogram-based models on the task of multi-instrument source separation.
The Wavenet-based model achieves separations comparable to (if not better than) those of DeepConvSep. Particularly remarkable are the results achieved by the Wavenet-based model when separating drums and bass. Overall, one can observe that DeepConvSep is even more conservative than Wave-U-Net, possibly due to the mask-based approach used for filtering the spectrograms. Although the Wavenet-based model seems better at removing the accompanying sources, it does so at the cost of introducing some noticeable artifacts.
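To make the mask-based filtering concrete, here is a hypothetical sketch (not DeepConvSep's actual code) of the general approach: a soft mask over the mixture's magnitude spectrogram scales each time-frequency bin, and the mixture's own phase is reused for reconstruction. For illustration we compute the ideal ratio mask from the ground-truth sources; in practice a network estimates the mask from the mixture alone.

```python
import numpy as np
from scipy.signal import stft, istft

# Synthetic stand-ins for two sources at well-separated frequencies.
sr, nper = 8000, 512
t = np.arange(2 * sr) / sr
vocals = np.sin(2 * np.pi * 440 * t)          # stand-in "vocals"
accomp = 0.5 * np.sin(2 * np.pi * 110 * t)    # stand-in "accompaniment"
mix = vocals + accomp

_, _, Z_mix = stft(mix, fs=sr, nperseg=nper)
_, _, Z_voc = stft(vocals, fs=sr, nperseg=nper)
_, _, Z_acc = stft(accomp, fs=sr, nperseg=nper)

# Ideal ratio mask (oracle); a real system predicts this with a network.
mask = np.abs(Z_voc) / (np.abs(Z_voc) + np.abs(Z_acc) + 1e-8)

# Masking the complex mixture STFT keeps the mixture phase.
_, voc_hat = istft(mask * Z_mix, fs=sr, nperseg=nper)
voc_hat = voc_hat[: len(vocals)]

err = np.sqrt(np.mean((voc_hat - vocals) ** 2))
print(f"RMSE of masked estimate: {err:.4f}")
```

Because the mask can only attenuate bins of the mixture (values in [0, 1]) and the phase is never corrected, such systems tend toward conservative separations, consistent with the behavior described above.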
Acknowledgments — Work partially funded by the Maria de Maeztu Programme (MDM-2015-0502). We are grateful to NVidia for the donated GPUs, to Foxnice
for hosting our demos, and special thanks to Daniel Balcells for his valuable corrections.