In this paper we present DAG: a full-band (48 kHz) waveform synthesizer built on diffusion-based generative modeling! And style transfer comes for free. Check out our demo! This is great work led by Santi.
I’m very proud of our recent work, because by simply improving the loss (keeping the same model and dataset) we obtain an improvement of 1.4 dB SI-SNRi! 1 dB in source separation is a lot, and is perceptually noticeable. This is great work led by Emilian, who worked with us as an intern during the summer of 2022.
I’m happy to share the highlights of my first paper with Dolby! We will be presenting this work at ICASSP 2020, in Barcelona.
Several improvements have been proposed to Conv-TasNet, but they mostly focus on the separator, leaving its encoder/decoder as a (shallow) linear operator. We propose a (deep) non-linear variant of the encoder/decoder, based on a deep stack of small filters. With this change, we improve SI-SNRi by 0.6-0.9 dB.
Currently, successful neural network audio classifiers use log-mel spectrograms as input. Given a mel-spectrogram matrix X, the logarithmic compression is computed as follows:
f(X) = log(α·X + β).
Common pairs of (α,β) are (1, eps) or (10000,1). In this post we investigate the possibility of learning (α,β). To this end, we study two log-mel spectrogram variants:
- Log-learn: The logarithmic compression of the mel spectrogram X is optimized via SGD together with the rest of the parameters of the model. We use exponential and softplus gates to control the pace of α and β, respectively. We set the initial pre-gate values to 7 and 1, which results in out-of-gate α and β initial values of 1096.63 and 1.31, respectively.
- Log-EPS: As a baseline, we use a log-mel spectrogram that does not learn the logarithmic compression: (α,β) are fixed to (1, eps). Note that eps stands for “machine epsilon”, a very small number.
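To make the two variants concrete, here is a minimal forward-pass sketch in plain Python. It only illustrates the gating and the compression; it is not our training code. The function names (`gated_params`, `log_compress`) and the toy mel values are made up for illustration, and taking eps to be the float64 machine epsilon is an assumption. In an actual model the pre-gate values would be trainable parameters updated by SGD.

```python
import math
import sys


def softplus(x):
    """Softplus gate: log(1 + exp(x)), always positive."""
    return math.log1p(math.exp(x))


def gated_params(alpha_pre=7.0, beta_pre=1.0):
    """Map pre-gate values to (alpha, beta).

    alpha goes through an exponential gate and beta through a softplus gate,
    so both stay positive. In log-learn the pre-gate values would be learned;
    here they are fixed to the initial values from the post.
    """
    return math.exp(alpha_pre), softplus(beta_pre)


def log_compress(mel, alpha, beta):
    """Apply f(X) = log(alpha * X + beta) element-wise to a 2D list."""
    return [[math.log(alpha * value + beta) for value in frame] for frame in mel]


# Toy (hypothetical) mel-spectrogram: 2 frames x 3 mel bands, non-negative.
mel = [[0.0, 0.001, 0.5], [1.0, 2.0, 0.0]]

# Log-learn at initialization: alpha = exp(7), beta = softplus(1).
alpha, beta = gated_params()
log_learn = log_compress(mel, alpha, beta)

# Log-EPS baseline: (alpha, beta) = (1, eps); float64 epsilon is an assumption.
eps = sys.float_info.epsilon
log_eps = log_compress(mel, 1.0, eps)

print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")  # alpha = 1096.63, beta = 1.31
```

Note how the two variants treat silence: with (1, eps) a zero-energy bin maps to log(eps), a large negative number, while at the log-learn initialization it maps to log(β), which is close to zero.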
TL;DR: We are publishing a negative result: log-learn did not improve our results! 🙂
Last summer, I was a research intern at Telefónica Research (Barcelona). This article is the outcome of this short (but intense!) collaboration with Joan Serrà, where we explore how to train deep learning models with just 1, 2 or 10 audio examples per class. Check it out on arXiv, and reproduce our results by running our code!