Preprint: On permutation invariant training for speech source separation

Our ICASSP paper studying permutation ambiguity in speaker-independent source separation models is now available on arXiv:

TL;DR #1: we found that STFT-based models are more effective at reducing permutation errors than waveform-based models, a perspective overlooked in recent studies.

TL;DR #2: tPIT+clustering was originally proposed in the STFT domain by Deep CASA. We adapted it to work with Conv-TasNet, a waveform-based model, and propose a major change to the clustering algorithm so that it scales to waveform-based models.
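To make the idea concrete, here is a minimal sketch of the frame-level permutation invariant training (tPIT) loss that tPIT+clustering builds on: at each frame, the loss is evaluated under every permutation of the estimated sources and the minimum is kept. This is an illustrative NumPy implementation with a plain MSE criterion, not the exact loss or code from the paper; the function name and shapes are assumptions.

```python
import itertools
import numpy as np

def tpit_loss(estimates, targets):
    """Frame-level permutation invariant training (tPIT) loss sketch.

    estimates, targets: arrays of shape (num_sources, num_frames, feat_dim).
    For each frame, the MSE is computed under every permutation of the
    estimated sources and the minimum is kept, so the model is free to
    swap speaker assignments from frame to frame (the ambiguity that the
    clustering step later resolves at the utterance level).
    """
    num_sources, num_frames, _ = estimates.shape
    perms = list(itertools.permutations(range(num_sources)))
    total = 0.0
    for t in range(num_frames):
        # MSE for each candidate source permutation at this frame
        frame_losses = [
            np.mean((estimates[list(p), t] - targets[:, t]) ** 2)
            for p in perms
        ]
        total += min(frame_losses)
    return total / num_frames
```

Because the best permutation is chosen per frame, the raw outputs can swap speakers mid-utterance; the clustering stage is what stitches frames back into consistent speaker streams.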

TL;DR #3: Following up on our previous work, we also investigate the generalisation capabilities of such models and discuss their relation to permutation errors.