This was my first Interspeech, and I was interested in understanding the field through the eyes of a “speech researcher”, instead of looking at it from the music/audio perspective that is my field of expertise. After attending Interspeech, I realized how sensitive the community is to languages and how diverse it is. The best part of the conference? One of the longest slides in the world was in town.

We (Francesc and I) were at Interspeech to present the adaptation of our Wavenet for Speech Denoising to music source separation:
- End-to-end Music Source Separation: is it possible in the waveform domain? While waveform-based methods (like ConvTasNet) are state-of-the-art for speech source separation, this is not the case for music source separation, because separating musical sources is harder (musicians play synchronized). In short, in this article we depict the state of the art in end-to-end music source separation.
The proceedings are already online. There, one can search for terms like: Adversarial (32 matches), GAN (12 matches), non-parallel (8 matches), wavenet (7 matches), music (6 matches), audiovisual (5 matches) or flow (0 matches).
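For the curious, here is a rough way to reproduce such counts over a local text dump of the proceedings. The file name is hypothetical (the counts above come from the official search interface):

```python
import re

# Count keyword occurrences in a local text dump of the proceedings.
# "interspeech2019_proceedings.txt" is a hypothetical file name.
with open("interspeech2019_proceedings.txt", encoding="utf-8") as f:
    text = f.read().lower()

for term in ["adversarial", "gan", "non-parallel", "wavenet", "music", "audiovisual", "flow"]:
    matches = re.findall(r"\b" + re.escape(term) + r"\b", text)
    print(term, len(matches), "matches")
```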
Many agreed that the main topic of this year’s Interspeech was GANs. Here are some articles that I enjoyed:
- Towards generalized speech enhancement with generative adversarial networks. They generalize SEGAN to also remove additive noise, clipping, chunk elimination, or frequency-band removal. To this end, they propose i) the addition of an adversarial acoustic regression loss that promotes a richer feature extraction at the discriminator, and ii) to make use of a two-step adversarial training schedule.
- Augmented CycleGANs for continuous scale normal-to-Lombard speaking style conversion. Lombard speech is a speaking style associated with an increased vocal effort that humans naturally use to improve intelligibility in the presence of noise. They propose using augmented cycle-consistent adversarial networks (Augmented CycleGANs) for a “smooth” conversion between normal and Lombard speaking styles. Their method is based on a parametric approach that uses the Pulse Model in Log domain (PML) vocoder.
- UNetGAN: a robust speech enhancement approach in time domain for extremely low signal-to-noise ratio condition. They extend the UNet architecture with an adversarial loss, and they also propose to use dilated convolutions in the bottleneck of the UNet (see the sketch after this list). They outperform: SEGAN, cGAN, Bidirectional LSTM using phase-sensitive spectrum, and Wave-U-Net.
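As a side note, here is a minimal sketch of what a dilated-convolution bottleneck can look like. It illustrates the general idea only; the channel count, kernel size and number of layers are made up and not the authors' exact UNetGAN configuration:

```python
import torch
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """Stack of 1-D dilated convolutions with exponentially growing dilation."""
    def __init__(self, channels=128, kernel_size=3, n_layers=4):
        super().__init__()
        layers = []
        for i in range(n_layers):
            d = 2 ** i  # dilation: 1, 2, 4, 8 -> rapidly growing receptive field
            layers += [
                nn.Conv1d(channels, channels, kernel_size,
                          dilation=d, padding=d * (kernel_size - 1) // 2),
                nn.PReLU(),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):            # x: (batch, channels, time)
        return x + self.net(x)       # residual connection keeps the time resolution
```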
I really enjoyed the survey talks. At the beginning of some sessions, a survey talk introduced the topic of the session, which was very handy for outsiders to the field (like me). This is a very good idea for multi-track conferences like Interspeech or ICASSP.
The Zero-Resource Speech Challenges
- Unsupervised end-to-end learning of discrete linguistic units for voice conversion. They discover discrete subword units from speech without using any labels by employing Multilabel-Binary Vectors (MBV). The decoder can synthesize speech with the same content as the input to the encoder — but with different speaker characteristics, which achieves voice conversion. They improve the quality of voice conversion using adversarial training. In the ZeroSpeech 2019 Challenge, they achieved very low bitrates.
- Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. They apply discrete latent-variable neural networks to unlabelled speech and use the discovered units for speech synthesis. They decouple acoustic unit discovery from speaker modelling by conditioning the decoder on the training speaker identity. For discretisation, categorical variational autoencoders (CatVAEs), vector-quantised VAEs (VQ-VAEs) and straight-through estimation are compared (see the sketch after this list). Their best model uses convolutional encoding, VQ-VAE discretisation, deconvolutional decoding and an FFTNet vocoder.
- SparseSpeech: unsupervised acoustic unit discovery with memory-augmented sequence autoencoders. They propose a sparse sequence autoencoder model for unsupervised acoustic unit discovery, based on bidirectional LSTM encoders/decoders with a sparsity-inducing bottleneck. The sparsity layer is based on memory-augmented neural networks, with a differentiable embedding memory bank. Forcing the network to favour highly sparse memory addressing yields symbolic-like representations of speech that are very compact.
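To make the VQ-VAE discretisation step mentioned above more concrete, here is a minimal sketch of nearest-neighbour quantisation with a straight-through estimator. Tensor shapes and the codebook are illustrative, not taken from the paper:

```python
import torch

def vector_quantize(z, codebook):
    """Map each encoder output frame to its nearest codebook entry.

    z: (batch, time, dim) continuous encoder outputs.
    codebook: (K, dim) learnable embedding matrix.
    """
    flat = z.reshape(-1, z.size(-1))       # (batch * time, dim)
    dists = torch.cdist(flat, codebook)    # distances to all K codes
    idx = dists.argmin(dim=-1)             # discrete unit per frame
    q = codebook[idx].view_as(z)           # quantised outputs
    # Straight-through estimator: gradients flow to z as if quantisation were identity.
    q_st = z + (q - z).detach()
    return q_st, idx.view(z.shape[:-1])
```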
And, of course, the best student paper awards were amazing 🙂
- Language modeling with deep Transformers. They explore deep autoregressive Transformer models for language modeling in speech recognition. First, they revisit Transformer model configurations for language modeling and show that they perform competitively. Second, they show that deep Transformer language models do not require positional encodings (a minimal sketch follows this list).
- Adversarially trained end-to-end Korean singing voice synthesis system. Their system is conditioned on lyrics and melody. It consists of two main modules: a mel-synthesis network that generates a mel-spectrogram from the input, and a super-resolution network that upsamples the generated mel-spectrogram into a linear spectrogram. In the mel-synthesis network, phonetic enhancement masking is applied, which enables a more accurate phonetic control of the singing voice. In addition, they show that locally conditioning with text and pitch, and conditional adversarial training, are crucial.
- Evaluating near-end listening enhancement algorithms in realistic environments. They present a realistic test platform, featuring two representative everyday scenarios in which speech playback may occur (with noise and reverberation): a domestic space (living room) and a public space (cafeteria). The generated stimuli are evaluated by measuring keyword accuracy rates in a listening test. They use the new platform to compare state-of-the-art algorithms.
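A minimal sketch of the first paper's setting: an autoregressive Transformer language model with no positional encodings at all, relying only on the causal attention mask. It assumes PyTorch, and the model sizes are made up:

```python
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    """Autoregressive Transformer language model without positional encodings."""
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                     # tokens: (batch, time)
        T = tokens.size(1)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.full((T, T), float("-inf"),
                                     device=tokens.device), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.out(h)                         # next-token logits
```

The point of the sketch is what is missing: no positional embedding is added to the token embeddings, which is the configuration the paper reports to work well.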
During the last 3-4 years, the field of singing voice synthesis has experienced a lot of progress with neural-based models:
- Unsupervised singing voice conversion. Training is performed without any form of supervision — no lyrics or any kind of phonetic features, no notes, and no matching samples between singers. The proposed network employs a CNN encoder for all singers, a WaveNet decoder, and a classifier that enforces the latent representation to be singer-agnostic (a minimal sketch of this adversarial constraint follows this list). Each singer is represented by one embedding vector, which the decoder is conditioned on.
- Adversarially trained end-to-end Korean singing voice synthesis system. Summarised above; see the best student papers list.
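Regarding the singer-agnostic latent in the first paper: one common way to implement such an adversarial constraint is a gradient-reversal layer in front of the singer classifier. The sketch below is an illustrative stand-in; the paper's exact confusion objective may be formulated differently:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, flipped (and scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def singer_confusion_logits(latent, classifier, lam=1.0):
    # The classifier learns to predict the singer from the latent, while the
    # reversed gradient pushes the encoder to erase singer information.
    return classifier(GradReverse.apply(latent, lam))
```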
On speech enhancement:
- Progressive speech enhancement with residual connections. This approach is progressive in the sense that at every layer the spectrogram is “progressively denoised”. This progressive scheme is based on residual networks with a constant number of channels. Following this strategy, they are able to improve their results.
- Speech enhancement with variance constrained autoencoders. They propose using the Variance Constrained Autoencoder (VCAE) for speech enhancement. In other words, the variance of the VAE is “fixed”. Their model uses a simpler neural network structure than competing solutions like SEGAN or SE-Wavenet, and they show that VCAE outperforms SEGAN and SE-Wavenet — especially at higher SNRs.
- Coarse-to-fine optimization for speech enhancement. They point out that deep neural networks trained with a cosine similarity loss might not be able to predict enhanced speech accurately. Their coarse-to-fine strategy optimizes the cosine similarity loss at different granularities (see the sketch after this list). Moreover, they also study their proposed scheme for generative adversarial networks (GANs) and propose the dynamic perceptual loss. They claim to achieve state-of-the-art results.
- Masking estimation with phase restoration of clean speech for monaural speech enhancement. They present two time-frequency masks to simultaneously enhance the real and imaginary parts of the speech spectrum, and use them as the training target for a DNN model.
- A non-causal FFTNet architecture for speech enhancement. They suggest a new parallel, non-causal and shallow waveform-domain architecture for speech enhancement based on FFTNet. The suggested network has a considerably reduced number of model parameters: 32% fewer compared to WaveNet and 87% fewer compared to SEGAN. SE-FFTNet outperforms WaveNet in terms of enhanced signal quality, while it provides equally good performance as SEGAN.
- A scalable noisy speech dataset and online subjective test framework: To better facilitate deep learning research in speech enhancement, they present a noisy speech dataset that can scale to arbitrary sizes depending on the number of speakers, noise types, and SNR levels desired. They also provide an open-source evaluation methodology to evaluate the results subjectively at scale using crowdsourcing. This subjective MOS evaluation is the first large-scale evaluation of speech enhancement algorithms.
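A minimal sketch of the coarse-to-fine idea from the paper above: evaluate a cosine-similarity loss over waveform segments of decreasing length. The segment sizes are made up, and the paper's exact granularities and weighting may differ:

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_cosine_loss(estimate, target, segment_sizes=(4096, 1024, 256)):
    """Cosine-similarity loss evaluated at several temporal granularities.

    estimate, target: (batch, samples) waveforms of the same length.
    """
    loss = 0.0
    for size in segment_sizes:
        n = estimate.size(1) // size * size          # drop the ragged tail
        e = estimate[:, :n].reshape(-1, size)        # (batch * n_segments, size)
        t = target[:, :n].reshape(-1, size)
        loss = loss + (1.0 - F.cosine_similarity(e, t, dim=-1)).mean()
    return loss
```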
On speech source separation:
- WHAM!: extending speech separation to noisy environments. They push the field towards more realistic and challenging scenarios. To that end, they created the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset. They also benchmark various speech separation architectures and objective functions to evaluate their robustness to noise. While separation performance decreases as a result of noise, they still observe substantial gains relative to the noisy signals for most approaches.
- End-to-end monaural speech separation with multi-scale dynamic weighted gated dilated convolutional pyramid networks. Since this is a FurcaNext paper, see the FurcaNext summary below.
- Deep attention gated dilated temporal convolutional networks with intra-parallel convolutional modules for end-to-end monaural speech separation. Since this is a FurcaNext paper, see the FurcaNext summary below.
- FurcaNext. They claim to have improved ConvTasNet by 3dB by simply making the “separator-network” more powerful. Their results are incredible, and I’m eager to see other speech researchers reproducing their results.
- A comprehensive study of speech separation: spectrogram vs waveform separation. Given the success of TasNet in the waveform-domain, they incorporate effective components of TasNet into a frequency-domain separation method. Their experimental results show that spectrogram-based separation can achieve competitive performance when compared to waveform-based methods — with a better network design.
- Evaluating audiovisual source separation in the context of video conferencing. Starting from a recently designed deep neural network, they assess its ability and robustness to separate the visible speakers’ speech from other interfering speeches or signals. For example, they test it for different configurations of video recordings where the speaker’s face may not be fully visible.
- Probabilistic permutation invariant training for speech separation. The recently proposed Permutation Invariant Training (PIT) addresses the permutation-of-sources problem by determining the output-label assignment which minimizes the separation error. In this study, they show that a major drawback of this technique is the overconfident choice of the output-label assignment, especially in the initial steps of training. To solve this problem, they propose Probabilistic PIT (Prob-PIT) — that can be easily implemented by replacing the minimum function of PIT with a soft-minimum function. Their results show that Prob-PIT outperforms PIT in terms of SDR and SIR.
- End-to-end SpeakerBeam for single channel target speech recognition: speaker extraction with the identity of the speaker. Speech separation suffers from a global permutation ambiguity issue. SpeakerBeam aims at extracting only a target speaker from a mixture based on his/her speech characteristics, thus avoiding the global permutation problem. They combine SpeakerBeam and an end-to-end ASR system to allow end-to-end training of a target ASR system in cocktail party scenarios.
- Recursive speech separation for unknown number of speakers. They propose a method for multi-speaker speech separation with an unknown number of speakers, using a single model that recursively/sequentially separates every speaker (a minimal sketch of this recursion follows this list). To make the separation model recursively applicable, they propose the one-and-rest permutation invariant training.
- Practical applicability of deep neural networks for overlapping speaker separation. They (empirically) study the applicability in realistic scenarios of two deep learning-based solutions: deep clustering and deep attractor networks. First, they investigate if these methods are applicable to a broad range of languages. And secondly, they investigate how these methods deal with realistic background noise and propose some modifications to better cope with these disturbances.
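A minimal sketch of the recursive one-and-rest idea from the recursive separation paper above. Here `model` is assumed to return a (speaker, residual) pair, and the residual-energy stopping rule is a simplification of the paper's criterion:

```python
import torch

def recursive_separation(model, mixture, max_speakers=4, energy_threshold=1e-3):
    """Recursively peel off one speaker at a time from a mixture."""
    sources, residual = [], mixture
    for _ in range(max_speakers):
        one, rest = model(residual)            # separate "one" speaker and "the rest"
        sources.append(one)
        if rest.pow(2).mean() < energy_threshold:   # nothing left to separate
            break
        residual = rest                        # recurse on the remaining mixture
    return sources
```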
There was a session on privacy in speech and audio interfaces. Here are a couple of papers that I found interesting:
- Privacy-preserving adversarial representation learning in ASR: reality or illusion? They focus on the protection of speaker identity, and study the extent to which users can be recognized based on the encoded representation of their speech as obtained by a deep encoder-decoder architecture trained for ASR.
- The GDPR & Speech Data: reflections of legal and technology communities, first steps towards a common understanding. Privacy preservation and the protection of speech data are in high demand, not only as a result of recent regulation such as the General Data Protection Regulation (GDPR) in the EU. While there has been a period in which to prepare for its implementation, its implications for speech data are poorly understood. This work pings both the legal and technology communities, with the aim of initiating a discussion around the topic.
And, of course, there were interesting contributions with respect to the never-ending discussion about data (e.g., augmentation and quick annotation):
- SpecAugment: a simple data augmentation method for automatic speech recognition. They present a data augmentation method for speech recognition. SpecAugment is directly applied to the feature inputs of a neural network (i.e., spectrograms) and it is based on warping the features, masking blocks of frequency channels, and masking blocks of time steps (a minimal masking sketch follows this list). They apply SpecAugment to Listen, Attend and Spell networks for end-to-end speech recognition tasks and achieve state-of-the-art performance on several datasets.
- How to annotate 100 hours in 45 minutes. They show evidence that a semi-supervised, human-in-the-loop framework can be useful for browsing and annotating large quantities of audio quickly. They demonstrate that a 100-hour long subset of the Fearless Steps corpus can be annotated for speech activity in less than 45 minutes, a fraction of the time it would take with traditional annotation methods, without a loss in performance.
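A minimal sketch of SpecAugment's two masking operations on a (frequency, time) spectrogram. The mask widths are illustrative defaults rather than the paper's exact policy, and time warping is omitted for brevity:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, freq_width=8, n_time_masks=2, time_width=40):
    """Zero out random frequency bands and time-step blocks of a spectrogram."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    rng = np.random.default_rng()
    for _ in range(n_freq_masks):
        f = rng.integers(0, freq_width + 1)          # mask height
        f0 = rng.integers(0, max(1, n_freq - f))     # mask start bin
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, time_width + 1)          # mask width
        t0 = rng.integers(0, max(1, n_time - t))     # mask start frame
        spec[:, t0:t0 + t] = 0.0
    return spec
```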
And a couple of papers using the recently proposed SincNet front-end (a minimal sketch of a sinc band-pass filter follows this list):
- On learning interpretable CNNs with parametric modulated kernel-based filters. They have investigated the replacement of the SincNet filters with triangular, gammatone and Gaussian filters — resulting in higher model flexibility and a reduction of the phone error rate.
- Learning problem-agnostic speech representations from multiple self-supervised tasks. They propose a self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. They show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry on relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.
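For reference, a minimal sketch of the parametric band-pass filter that a SincNet-style front-end learns, built as the difference of two windowed sinc low-pass filters. The kernel size and window choice are illustrative:

```python
import numpy as np

def sinc_bandpass(f1, f2, kernel_size=251, sample_rate=16000):
    """Band-pass FIR kernel with learnable band edges f1 < f2 (in Hz)."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    t = n / sample_rate
    lowpass = lambda fc: 2 * fc / sample_rate * np.sinc(2 * fc * t)
    h = (lowpass(f2) - lowpass(f1)) * np.hamming(kernel_size)  # window to reduce ripple
    return h

# Example: a telephone-band filter.
h = sinc_bandpass(300.0, 3400.0)
```

In SincNet only f1 and f2 are learned per filter, which is what the papers above modify by swapping the sinc shape for triangular, gammatone or Gaussian responses.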
During the speech coding session, two interesting papers were presented:
- A real-time wideband neural vocoder at 1.6kb/s using LPCNet. They present a low-bitrate neural vocoder based on the LPCNet model that can run on general-purpose hardware. They demonstrate that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at a low bitrate.
- Speech audio super resolution for speech recognition. They introduce an end-to-end deep learning based system for speech bandwidth extension for use in a downstream automatic speech recognition (ASR) system. Specifically, they propose a conditional generative adversarial network enriched with ASR-specific loss functions, designed to upsample the speech audio while maintaining good ASR performance.