Interspeech 2019: my highlights

This was my first Interspeech, and I was interested in understanding the field through the eyes of a “speech researcher”, instead of looking at it from the music/audio perspective that is my field of expertise. After attending Interspeech, I came to appreciate the community's sensitivity to languages and how diverse the community is. The best part of the conference? One of the longest slides in the world was in town.

We (Francesc and I) were at Interspeech to present the adaptation of our Wavenet for Speech Denoising to music source separation:

  • End-to-end Music Source Separation: is it possible in the waveform domain? While waveform-based speech source separation methods (like ConvTasNet) are state-of-the-art, this is not the case for music source separation, because separating musical sources is harder (musicians play in a synchronized way). In short, in this article we depict the state of the art in end-to-end music source separation; a minimal sketch of the waveform-domain setup follows below.
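For readers new to this setup, “end-to-end in the waveform domain” simply means that the model maps the mixture waveform directly to the source waveforms, using a learned 1-D convolutional front-end instead of a spectrogram. The sketch below is a deliberately tiny, generic separator in PyTorch; all layer sizes and names are mine, and it is neither our Wavenet adaptation nor Conv-TasNet:

```python
import torch
import torch.nn as nn

class TinyWaveformSeparator(nn.Module):
    """Illustrative waveform-to-waveform separator (not the model from the paper).

    Encoder: 1-D conv front-end replacing the STFT.
    Separator: estimates one mask per source in the learned basis.
    Decoder: transposed conv maps masked features back to waveforms.
    """

    def __init__(self, n_sources=4, n_filters=128, kernel_size=16, stride=8):
        super().__init__()
        self.n_sources = n_sources
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
        self.separator = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.ReLU(),
            nn.Conv1d(n_filters, n_sources * n_filters, 1), nn.Sigmoid(),
        )
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride)

    def forward(self, mixture):                              # mixture: (batch, 1, time)
        feats = torch.relu(self.encoder(mixture))            # (B, F, T')
        masks = self.separator(feats)                        # (B, S*F, T')
        masks = masks.view(-1, self.n_sources, feats.size(1), feats.size(2))
        sources = [self.decoder(masks[:, s] * feats) for s in range(self.n_sources)]
        return torch.stack(sources, dim=1)                   # (B, S, 1, time)

# Example: separate a 1-second mixture at 16 kHz into 4 stems.
est = TinyWaveformSeparator()(torch.randn(2, 1, 16000))
print(est.shape)  # torch.Size([2, 4, 1, 16000])
```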

The proceedings are already online. There, one can search for terms such as: Adversarial (32 matches), GAN (12 matches), non-parallel (8 matches), wavenet (7 matches), music (6 matches), audiovisual (5 matches) or flow (0 matches).

Many agreed that the main topic of this year’s Interspeech was GANs. Here are some articles that I enjoyed:

I really enjoyed the survey talks. At the beginning of some sessions, a survey talk introduced the topic of the session, which was very handy for outsiders to the field (like me). This is a very good idea for multi-track conferences like Interspeech or ICASSP.

The Zero-Resource Speech Challenges were a very nice initiative, too. Two “unsupervised” tasks were addressed: Automatic Speech Recognition (ASR), and Text-To-Speech without Text (TTS without T). Most of the presented methods were based on “acoustic unit discovery”: for ASR, clustering similar “acoustic units” together might be useful for recognition, and for TTS, the proposed systems were able to synthesize speech back from those “discovered” acoustic units. A (very) short high-level summary could be: most ASR systems were based on transfer-learning-like approaches, and most of the TTS systems were based on VQ-VAE-style ideas. Here are some articles that I found interesting:

  • Unsupervised end-to-end learning of discrete linguistic units for voice conversion. They discover discrete subword units from speech without using any labels by employing Multilabel-Binary Vectors (MBV). The decoder can synthesize speech with the same content as the input to the encoder — but with different speaker characteristics, which achieves voice conversion. They improve the quality of voice conversion using adversarial training. In the ZeroSpeech 2019 Challenge, they achieved very low bitrates.
  • Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. They apply discrete latent-variable neural networks to unlabelled speech and use the discovered units for speech synthesis. They decouple acoustic unit discovery from speaker modelling by conditioning the decoder on the training speaker identity. For discretisation, categorical variational autoencoders (CatVAEs), vector-quantised VAEs (VQ-VAEs) and straight-through estimation are compared (see the quantisation sketch after this list). Their best model uses convolutional encoding, VQ-VAE discretisation, deconvolutional decoding and an FFTNet vocoder.
  • SparseSpeech: unsupervised acoustic unit discovery with memory-augmented sequence autoencoders. They propose a sparse sequence autoencoder model for unsupervised acoustic unit discovery, based on bidirectional LSTM encoders/decoders with a sparsity-inducing bottleneck. The sparsity layer is based on memory-augmented neural networks, with a differentiable embedding memory bank. Forcing the network to favour highly sparse memory addressing yields symbolic-like representations of speech that are very compact.
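To make the “VQ-VAE-style ideas” mentioned above concrete: the encoder output is snapped to the nearest entry of a learned codebook, the resulting code indices act as the discovered acoustic units, and a straight-through estimator lets gradients flow through the discrete step. A minimal sketch of just that quantisation step, with made-up sizes (this is the generic VQ-VAE mechanism, not any specific submission's code):

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook):
    """Snap each encoder frame to its nearest codebook entry.

    z_e:      (batch, time, dim) continuous encoder outputs.
    codebook: (n_codes, dim) learned embedding table.
    Returns the quantized frames and the discrete unit indices.
    """
    # Squared distances between every frame and every code.
    dists = (z_e.unsqueeze(-2) - codebook).pow(2).sum(-1)   # (batch, time, n_codes)
    indices = dists.argmin(dim=-1)                          # discrete "acoustic units"
    z_q = F.embedding(indices, codebook)                    # (batch, time, dim)
    # Straight-through estimator: gradients bypass the argmin.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices

# Example: 50 frames of 64-dim features, a 256-entry codebook.
codebook = torch.randn(256, 64, requires_grad=True)
z_q, units = vector_quantize(torch.randn(1, 50, 64), codebook)
print(units[0, :10])  # the first ten discovered unit IDs
```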

And, of course, the best student paper awards were amazing 🙂

  • Language modeling with deep transformers. They explore deep autoregressive Transformer models for language modeling in speech recognition. They revisit Transformer model configurations for language modeling and show that they can perform competently. Second, they show that deep Transformer language models do not require positional encodings (a toy illustration follows after this list).
  • Adversarially trained end-to-end Korean singing voice synthesis system. Their system is conditioned on lyrics and melody. It consists of two main modules: a mel-synthesis network that generates a mel-spectrogram from the input, and a super-resolution network that upsamples the generated mel-spectrogram into a linear spectrogram. In the mel-synthesis network, phonetic enhancement masking is applied, which enables more accurate phonetic control of the singing voice. In addition, they show that local conditioning on text and pitch, and conditional adversarial training, are crucial.
  • Evaluating near end listening enhancement algorithms in realistic environments. They present a realistic test platform, featuring two representative everyday scenarios in which speech playback may occur (with noise and reverberation): a domestic space (living room) and a public space (cafeteria). The generated stimuli are evaluated by measuring keyword accuracy rates in a listening test. They use the new platform to compare state-of-the-art algorithms.
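The “no positional encodings” finding from the language modeling paper is easy to picture: with a causal attention mask, an autoregressive Transformer can infer order from the growing context, so the toy model below simply omits the usual positional embedding. All hyper-parameters are invented and this is not the authors' configuration:

```python
import torch
import torch.nn as nn

class NoPosTransformerLM(nn.Module):
    """Toy autoregressive Transformer LM *without* positional encodings."""

    def __init__(self, vocab_size=1000, d_model=256, n_heads=4, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # token embeddings only, no positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                            # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        # Causal (strictly upper-triangular -inf) mask: each position attends to the past only.
        causal = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.out(h)                                # next-token logits

logits = NoPosTransformerLM()(torch.randint(0, 1000, (2, 32)))
print(logits.shape)  # torch.Size([2, 32, 1000])
```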

During the last 3-4 years, the field of singing voice synthesis has experienced a lot of progress with neural-based models:

  • Unsupervised singing voice conversion. Training is performed without any form of supervision: no lyrics or any kind of phonetic features, no notes, and no matching samples between singers. The proposed network employs a single CNN encoder for all singers, a WaveNet decoder, and a classifier that enforces the latent representation to be singer-agnostic (a sketch of this adversarial setup follows after this list). Each singer is represented by one embedding vector, on which the decoder is conditioned.
  • Adversarially trained end-to-end Korean singing voice synthesis system. Summarised above; see the best student paper list.
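The singer-agnostic constraint in the voice conversion paper boils down to a classifier that tries to recognise the singer from the latent code while the encoder is trained to fool it, with the singer identity reinjected into the decoder through an embedding. A rough sketch of that training signal (module names and sizes are mine, purely illustrative, and the decoder is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative components: a shared encoder, a singer classifier on the latent,
# and per-singer embeddings that condition the (omitted) WaveNet-style decoder.
encoder = nn.Sequential(nn.Conv1d(80, 64, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten())   # (B, 64) latent
singer_clf = nn.Linear(64, 10)                                    # 10 training singers
singer_emb = nn.Embedding(10, 64)                                 # decoder conditioning

mel = torch.randn(8, 80, 200)            # batch of input excerpts
singer_id = torch.randint(0, 10, (8,))

z = encoder(mel)

# 1) The classifier learns to recognise the singer from the latent code.
clf_loss = F.cross_entropy(singer_clf(z.detach()), singer_id)

# 2) The encoder is pushed toward singer-agnostic codes by maximising the
#    classifier's confusion (here: minimising the negative cross-entropy).
adv_loss = -F.cross_entropy(singer_clf(z), singer_id)

# The decoder would reconstruct audio from z plus the singer embedding, so
# singer identity re-enters only through that embedding vector.
conditioning = z + singer_emb(singer_id)
print(clf_loss.item(), adv_loss.item(), conditioning.shape)
```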

On speech enhancement:

  • Progressive speech enhancement with residual connections. This approach is progressive in the sense that the spectrogram is denoised a bit further at every layer. This progressive scheme is based on residual networks with a constant number of channels. Following this strategy, they report improved results.
  • Speech enhancement with variance constrained autoencoders. They propose using the Variance Constrained Autoencoder (VCAE) for speech enhancement; in other words, the variance of the VAE is “fixed”. Their model uses a simpler neural network structure than competing solutions like SEGAN or SE-Wavenet, and they show that the VCAE outperforms both, especially at higher SNRs.
  • Coarse-to-fine optimization for speech enhancement. They observe that deep neural networks trained with a cosine similarity loss may fail to accurately predict the enhanced speech. Their coarse-to-fine strategy optimizes the cosine similarity loss at different granularities. Moreover, they also study their proposed scheme for generative adversarial networks (GANs) and propose a dynamic perceptual loss. They claim to achieve state-of-the-art results.
  • Masking estimation with phase restoration of clean speech for monaural speech enhancement. They present two time-frequency masks that simultaneously enhance the real and imaginary parts of the speech spectrum, and use them as the training target for a DNN model (see the sketch after this list).
  • A non-causal FFTNet architecture for speech enhancement. They suggest a new parallel, non-causal and shallow waveform-domain architecture for speech enhancement based on FFTNet. The suggested network has considerably fewer parameters: 32% fewer than WaveNet and 87% fewer than SEGAN. SE-FFTNet outperforms WaveNet in terms of enhanced signal quality, while performing on par with SEGAN.
  • A scalable noisy speech dataset and online subjective test framework: To better facilitate deep learning research in speech enhancement, they present a noisy speech dataset that can scale to arbitrary sizes depending on the number of speakers, noise types, and SNR levels desired. They also provide an open-source evaluation methodology to evaluate the results subjectively at scale using crowdsourcing. This subjective MOS evaluation is the first large scale evaluation of speech enhancement algorithms.
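To picture what “enhancing the real and imaginary parts” means in the masking paper above: a network predicts one mask for each component of the noisy STFT, and the masked complex spectrum is inverted back to a waveform. In the minimal sketch below, random tensors stand in for the DNN's mask predictions:

```python
import torch

noisy = torch.randn(16000)                         # 1 s of noisy speech at 16 kHz
window = torch.hann_window(512)
spec = torch.stft(noisy, n_fft=512, hop_length=128, window=window,
                  return_complex=True)             # (freq, frames) complex STFT

# In the paper a DNN predicts these two masks; random tensors stand in here.
mask_real = torch.rand_like(spec.real)
mask_imag = torch.rand_like(spec.imag)

# Enhance real and imaginary parts separately, then rebuild the complex spectrum.
enhanced = torch.complex(spec.real * mask_real, spec.imag * mask_imag)
clean_est = torch.istft(enhanced, n_fft=512, hop_length=128, window=window,
                        length=noisy.numel())
print(clean_est.shape)  # torch.Size([16000])
```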

On speech source separation:

There was a session on privacy in speech and audio interfaces. Here are a couple of papers that I found interesting:

And, of course, there were interesting contributions with respect to the never-ending discussion about data (e.g., augmentation and quick annotation):

  • SpecAugment: a simple data augmentation method for automatic speech recognition. They present a data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., spectrograms) and is based on warping the features, masking blocks of frequency channels, and masking blocks of time steps (a sketch follows after this list). They apply SpecAugment to Listen, Attend and Spell networks for end-to-end speech recognition tasks and achieve state-of-the-art performance on several datasets.
  • How to annotate 100 hours in 45 minutes. They show evidence that a semi-supervised, human-in-the-loop framework can be useful for browsing and annotating large quantities of audio quickly. They show that a 100-hour subset of the Fearless Steps corpus can be annotated for speech activity in less than 45 minutes, a fraction of the time it would take with traditional annotation methods, without a loss in performance.
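Since SpecAugment boils down to a few cheap operations on the input spectrogram, it is easy to sketch. The version below implements only the two masking steps (time warping is omitted), with invented mask sizes:

```python
import torch

def spec_augment(spec, n_freq_masks=2, freq_width=15, n_time_masks=2, time_width=40):
    """Apply SpecAugment-style frequency and time masking to a (freq, time) spectrogram.

    Time warping from the original paper is omitted in this sketch.
    """
    spec = spec.clone()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        width = torch.randint(0, freq_width + 1, (1,)).item()
        start = torch.randint(0, max(1, n_freq - width), (1,)).item()
        spec[start:start + width, :] = 0.0       # mask a block of frequency channels
    for _ in range(n_time_masks):
        width = torch.randint(0, time_width + 1, (1,)).item()
        start = torch.randint(0, max(1, n_time - width), (1,)).item()
        spec[:, start:start + width] = 0.0       # mask a block of time steps
    return spec

augmented = spec_augment(torch.randn(80, 300))    # 80 mel bins, 300 frames
print(augmented.shape)  # torch.Size([80, 300])
```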

And a couple of papers using the recently proposed SincNet front-end:
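For context, SincNet replaces the first convolutional layer of a waveform model with band-pass filters built from parametrised sinc functions, so only each filter's low and high cutoff frequencies are learned. Below is a stripped-down, fixed (non-learnable) construction of such a filterbank; it is an illustration of the idea, not the reference implementation:

```python
import torch
import torch.nn.functional as F

def sinc_bandpass_filters(low_hz, high_hz, kernel_size=251, sample_rate=16000):
    """Build band-pass FIR filters as differences of two sinc low-pass filters.

    low_hz, high_hz: 1-D tensors of per-filter cutoff frequencies; in SincNet these
    are the learnable parameters, everything else stays fixed.
    """
    t = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
    t = t / sample_rate                                      # time axis in seconds
    window = torch.hamming_window(kernel_size, periodic=False)

    def sinc_lowpass(cutoff_hz):
        # 2*f*sinc(2*f*t) is an ideal low-pass impulse response with cutoff f.
        return 2 * cutoff_hz.unsqueeze(1) * torch.sinc(2 * cutoff_hz.unsqueeze(1) * t)

    filters = sinc_lowpass(high_hz) - sinc_lowpass(low_hz)   # (n_filters, kernel_size)
    return filters * window

# Example: 4 band-pass filters used as a conv front-end on raw audio.
low = torch.tensor([50.0, 300.0, 1000.0, 3000.0])
high = torch.tensor([300.0, 1000.0, 3000.0, 7000.0])
bank = sinc_bandpass_filters(low, high)                      # (4, 251)
wave = torch.randn(1, 1, 16000)
features = F.conv1d(wave, bank.unsqueeze(1), stride=10)
print(features.shape)  # torch.Size([1, 4, 1575])
```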

During the speech coding session, two interesting papers were presented:

  • A real-time wideband neural vocoder at 1.6 kb/s using LPCNet. They present a low-bitrate neural vocoder based on the LPCNet model that can run on general-purpose hardware (a sketch of the linear-prediction idea behind LPCNet follows after this list). They demonstrate that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP, and that uncompressed LPCNet can exceed the quality of a waveform codec operating at a low bitrate.
  • Speech audio super resolution for speech recognition. They introduce an end-to-end deep learning based system for speech bandwidth extension for use in a downstream automatic speech recognition (ASR) system. Specifically, they propose a conditional generative adversarial network enriched with ASR-specific loss functions designed to upsample the speech audio while maintaining good ASR performance.
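The “LPC” in LPCNet above is classical linear prediction: each sample is predicted as a weighted sum of the previous samples, so the neural network only has to model the small residual (excitation), which is what keeps the vocoder cheap. A small numpy sketch of generic linear prediction (not LPCNet's code; the normal equations are solved with a dense solver instead of Levinson-Durbin, purely for readability):

```python
import numpy as np

def lpc_coefficients(frame, order=16):
    """Estimate LPC coefficients by solving the autocorrelation normal equations."""
    autocorr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz system R a = r; Levinson-Durbin solves it cheaply, a dense solver
    # is used here for clarity.
    R = np.array([[autocorr[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, autocorr[1:order + 1])

def lpc_residual(frame, coeffs):
    """Predict each sample from the previous samples and return the residual."""
    order = len(coeffs)
    pred = np.zeros_like(frame)
    for n in range(order, len(frame)):
        pred[n] = np.dot(coeffs, frame[n - order:n][::-1])   # x[n-1] ... x[n-order]
    return frame - pred       # the excitation that a neural vocoder like LPCNet models

rng = np.random.default_rng(0)
frame = rng.standard_normal(320)          # one 20 ms frame at 16 kHz
residual = lpc_residual(frame, lpc_coefficients(frame))
print(np.abs(residual).mean())
```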