A dataset and pre-trained models to
separate music and speech in podcasts

podcast mixture

We introduce PodcastMix, a dataset formalizing the task of separating background music and foreground speech in podcasts. Throughout our experiments, we find that separating the background music from podcasts can be challenging, since speech tends to be louder in the foreground. Although background signals are in general more difficult to separate, we want to underline that this situation is ubiquitous in podcasts — which contrasts with other source separation tasks like speech source separation (where speakers communicate at similar levels) or speech enhancement (where the background noise is removed rather than separated). We aim to define a benchmark suitable for training and evaluating (deep learning) source separation models for podcasts. To that end, we release a large and diverse dataset and pre-trained models.


Training data

Creative Commons audio dataset

Eval data

Creative Commons audio via Zenodo

Inference code

Models to separate podcasts

Research code

Retrain, evaluate, reproduce results

Research article

Read our scientific article



Current (deep learning) models can suffer from generalization issues, especially when trained on synthetic data. To expose such potential generalization issues, we release different evaluation sets based on real and synthetic podcasts.

PodcastMix-synth: train and test sets

Here, large and diverse training and evaluation sets are programmatically generated. PodcastMix-synth consists of 44,455 speech utterances from VCTK and 19,370 music recordings from Jamendo that are programmatically mixed, with standardized train / validation / test partitions. Follow our instructions on Github to download it (480GB). These synthetic podcasts sound as follows:
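To illustrate the kind of programmatic mixing described above — speech kept louder in the foreground, music scaled into the background — here is a minimal sketch. The helper `mix_at_snr` and its fixed speech-to-music ratio are illustrative assumptions, not the actual PodcastMix-synth generation code:

```python
import numpy as np

def mix_at_snr(speech, music, snr_db):
    """Scale `music` so the speech-to-music ratio is `snr_db` dB, then sum.

    Hypothetical helper (not the PodcastMix generation script): a positive
    `snr_db` keeps speech louder in the foreground, as in real podcasts.
    Returns the mixture plus the two stems that compose it.
    """
    speech_power = np.mean(speech ** 2)
    music_power = np.mean(music ** 2)
    # Gain such that 10*log10(speech_power / (gain**2 * music_power)) == snr_db
    gain = np.sqrt(speech_power / (music_power * 10 ** (snr_db / 10)))
    scaled_music = gain * music
    return speech + scaled_music, speech, scaled_music

# Example with random stand-ins for a VCTK utterance and a Jamendo excerpt:
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
music = rng.standard_normal(16000)
mixture, speech_stem, music_stem = mix_at_snr(speech, music, 6.0)
```

Keeping the scaled stems alongside the mixture is what makes such a synthetic set usable for supervised training: the model sees the mixture and regresses toward the stems.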

PodcastMix-real: with-reference

This test set is composed of real podcasts with reference stems to compute evaluation metrics. In total, we recorded 20 podcast excerpts of 20 seconds each, from 6 speakers speaking 3 different languages: Portuguese, Italian, and Spanish. Note that while the training set only contains English speech, this evaluation set contains podcasts in Portuguese, Italian, and Spanish. It can be downloaded from Zenodo and sounds as follows:

Mixture Speech Music
Spanish podcast
Italian podcast
Portuguese podcast
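With reference stems available, separation quality can be scored objectively. A common choice in source separation is SI-SDR (scale-invariant signal-to-distortion ratio); the sketch below assumes that metric and is not necessarily the exact evaluation script from the repository:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (Le Roux et al., 2019) — an illustrative
    implementation, not the repository's evaluation code."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to isolate the target component.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))
```

Because the metric is scale-invariant, a separated stem scores the same regardless of its overall gain — only the distortion relative to the reference stem matters.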

PodcastMix-real: no-reference

A test set of real podcasts that contains only the podcast mixes, for subjective evaluation. It contains 8 podcast excerpts of 18 seconds each, distributed under Creative Commons licenses, in which 6 different speakers speak 3 different languages: English, Chinese, and Spanish. Since no reference stems are available, these can serve as a standardized set of audios for sharing separations on demo websites. For this reason, we share all our separations here, to encourage other researchers to do the same.

Mixture U-Net: speech U-Net: music Conv-TasNet: speech Conv-TasNet: music
English podcast
Spanish podcast
English podcast
Chinese podcast
English podcast
English podcast
English podcast
English podcast


Standardized web-MUSHRA subjective tests

In addition, 'PodcastMix-real no-reference' can be used as a standardized set for subjective evaluations. To that end, we use web-MUSHRA, as it makes it easy to run subjective tests online. Together with the dataset and the pre-trained models, we also release web-MUSHRA templates to standardize and facilitate subjective evaluation:

OVRL web-MUSHRA template

SIG and BAK web-MUSHRA template

Throughout our experiments with real podcasts, we find that current (deep learning) models may have generalization issues. Yet they can perform competently: for example, our best baseline (U-Net) separates speech with a mean opinion score of 3.84 (rating “overall separation quality” from 1 to 5 in the OVRL test above). We released a pre-trained U-Net baseline that can be easily used; check our Github repository.

To support our work, please acknowledge PodcastMix if you use it for academic research: N. Schmidt, J. Pons, M. Miron, "PodcastMix: a dataset for separating music and speech in podcasts", submitted to ICASSP (2022).