ICASSP2019: my highlights

This is the first ICASSP where I feel that the conference has become a place where influential machine learning papers are presented. I’m happy to see that most of our community is not just applying ‘LSTMs to a new dataset’, but is proposing novel and inspiring machine learning methods. Let’s see what happened in Brighton (UK)!

This year I attended ICASSP to present two papers:

  • Training neural audio classifiers with few data. We study whether (i) a naive regularization of the solution space, (ii) prototypical networks, (iii) transfer learning, or (iv) their combination, can help deep learning models better leverage a small number of training examples. In general, transfer learning is the best option, but prototypical networks might be useful in some cases.
  • Randomly weighted CNNs for (music) audio classification. We study how non-trained (randomly weighted) CNNs perform as feature extractors for (music) audio classification tasks. They work surprisingly well, and this methodology serves to run a meta-evaluation of the CNN front-ends for audio.
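
To make the prototypical networks idea from the first paper a bit more concrete, here is a minimal sketch of the classification step only. It assumes the embeddings have already been computed by some network (the 2-D vectors and class names below are made up for illustration), and it is not the code from our paper: each class prototype is the mean of its few support embeddings, and a query is classified with a softmax over negative squared distances to the prototypes.

```python
import math

def prototypes(support):
    """Average the support embeddings of each class: {label: [vectors]} -> {label: mean}."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(query, protos):
    """Softmax over negative squared distances from the query to each prototype."""
    logits = {label: -squared_distance(query, p) for label, p in protos.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical embeddings: two labelled examples per class.
support = {
    "dog_bark": [[0.9, 0.1], [1.1, -0.1]],
    "siren": [[-1.0, 1.0], [-0.8, 1.2]],
}
protos = prototypes(support)
probs = classify([0.8, 0.0], protos)
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

With only a couple of examples per class, the prototype is the only thing that has to be estimated per class, which is part of why this family of methods is attractive in low-data regimes.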

Researchers are exploring different deep learning losses for audio, for example domain-knowledge-inspired costs, cycle-consistency losses, or GAN-based (adversarial) losses.

  • End-to-End Feedback Loss in Speech Chain Framework via Straight-Through Estimator. The speech chain mechanism integrates automatic speech recognition (ASR) and text-to-speech synthesis (TTS) modules into a single cycle during training. It allows ASR and TTS to assist each other when they receive unpaired data: each module infers the missing half of the pair, and the model is optimized with the reconstruction error.
  • Cycle-consistency training for end-to-end speech recognition. They present a method to train end-to-end ASR models using unpaired data. Cycle-consistency-based approaches compose a reverse operation with a given transformation (e.g., TTS with ASR) to build a loss that only requires unsupervised data, speech in this example. In short, they train a Text-To-Encoder model and define a loss based on the encoder reconstruction error.
  • Perceptually-motivated Environment-specific Speech Enhancement. This paper introduces a deep learning approach to enhance speech recordings made in a specific environment. A single neural network learns to ameliorate several types of recording artefacts, including noise, reverberation, and non-linear equalization. The method relies on a new perceptual loss function that combines adversarial loss with spectrogram features.
  • A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality. They propose a perceptual metric for speech quality evaluation, which is suitable as a loss function for training deep learning methods. This metric is computed on a per-frame basis in the power spectral domain.
  • STFT spectral loss for training a neural speech waveform model. They propose a new loss for end-to-end models that is based on the STFT. Interestingly, both amplitude and phase spectra are used to calculate the proposed loss. They also mathematically show that training the waveform model with the proposed loss can be interpreted as maximum likelihood training.
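
As a rough illustration of what an STFT-based loss with both an amplitude and a phase term can look like, here is a small sketch. The exact terms are my own simplification and not necessarily the formulation of the paper: the amplitude term is a squared error between magnitudes, the phase term is 1 - cos of the phase difference, and the frame length, hop size and `phase_weight` are arbitrary assumptions.

```python
import cmath
import math

def stft(signal, frame_len=8, hop=4):
    """Hann-windowed DFT of each frame; returns one list of complex bins per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)  # keep non-negative frequencies only
        ]
        frames.append(spectrum)
    return frames

def stft_loss(predicted, target, phase_weight=1.0):
    """Mean over all bins of an amplitude term plus a weighted phase term."""
    amplitude_term, phase_term, count = 0.0, 0.0, 0
    for spec_p, spec_t in zip(stft(predicted), stft(target)):
        for p, t in zip(spec_p, spec_t):
            amplitude_term += (abs(p) - abs(t)) ** 2
            phase_term += 1.0 - math.cos(cmath.phase(p) - cmath.phase(t))
            count += 1
    return (amplitude_term + phase_weight * phase_term) / count

# A sine wave and its polarity-inverted copy: same magnitudes, opposite phase.
target = [math.sin(2 * math.pi * 2 * n / 16) for n in range(32)]
negated = [-x for x in target]
print(stft_loss(target, target))                     # identical signals
print(stft_loss(negated, target, phase_weight=0.0))  # amplitude-only loss
print(stft_loss(negated, target))                    # amplitude + phase loss
```

The polarity-inverted signal is invisible to the amplitude-only loss but is penalised once the phase term is switched on, which is the kind of difference a waveform model trained on magnitudes alone cannot see.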

Besides the above works on cycle-consistency losses, here is an additional list of interesting works on semi-supervised (and unsupervised) learning:

Many ‘post-processing’ Wavenet-like architectures do not explicitly predict a posterior probability distribution with a softmax. Instead, they just regress the output.

  • Learning Bandwidth Expansion Using Perceptually-Motivated Loss. They introduce a perceptually motivated approach to bandwidth expansion for speech. Their method pairs a new 3-way split variant of the FFTNet neural vocoder structure with a perceptual loss function, combining objectives from both the time and frequency domains.
  • Deep Learning for Tube Amplifier Emulation. Analog audio effects and synthesizers often owe their distinct sound to circuit nonlinearities. Faithfully modelling such an aspect of the original sound in virtual analogue software can prove challenging. They employ a feedforward variant of Wavenet to carry out a regression on audio waveform samples from input to output of a tube amplifier.
  • Modeling of nonlinear audio effects with end-to-end deep neural networks. Although they do not use a Wavenet-like architecture, they also regress the output with very good results. They investigate deep learning architectures for audio effects processing to find a general purpose end-to-end deep neural network to model nonlinear audio effects.
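
To spell out the difference between predicting a softmax posterior and regressing the output, here is a minimal sketch of the two kinds of training targets and losses. It assumes 8-bit mu-law quantization for the softmax case (the setup of the original WaveNet) and an L1 loss for the regression case; the papers listed above may well use other losses, so treat the function names and choices as illustrative only.

```python
import math

MU = 255  # 8-bit mu-law companding: 256 output classes

def mu_law_class(sample):
    """Map a sample in [-1, 1] to a class index in 0..255 (the softmax target)."""
    compressed = math.copysign(math.log1p(MU * abs(sample)) / math.log1p(MU), sample)
    return int(round((compressed + 1) / 2 * MU))

def cross_entropy(logits, target_class):
    """Softmax cross-entropy for one sample: the model outputs 256 logits."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(value - top) for value in logits))
    return log_norm - logits[target_class]

def l1_loss(predicted, target):
    """Regression alternative: the model outputs one real value per sample."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

# Softmax route: quantize the target sample, score a (here uniform) distribution.
uniform_logits = [0.0] * (MU + 1)
print(mu_law_class(1.0), cross_entropy(uniform_logits, mu_law_class(1.0)))

# Regression route: compare real-valued waveforms directly.
print(l1_loss([0.5, -0.5], [0.25, -0.25]))
```

The regression route avoids the quantization step and the 256-way output layer, at the cost of no longer modelling a full distribution over the next sample.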

The speech enhancement session was very interesting. One important research question was addressed there: is it better to feed deep neural networks with spectrograms, complex spectrograms, or waveforms?

U-net architectures are widely used for source separation and speech denoising problems. ICASSP researchers were sharing their insights about that architecture.

It was also nice to see some advances in multi-modal audio processing:

Like many others, I was mind-blown during the TTS session. Here are my highlights:

  • LPCNet: Improving Neural Speech Synthesis Through Linear Prediction. They propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis. They demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size.
  • Neural source-filter-based waveform model for statistical parametric speech synthesis. A very interesting one! They propose a non-autoregressive source-filter waveform model that can be directly trained using a spectrum-based training criterion and stochastic gradient descent. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform.
  • Robust and fine-grained prosody control of end-to-end speech synthesis. They propose prosody embeddings for emotional and expressive speech synthesis networks. The proposed method allows for temporal structures in the embedding networks, thus enabling fine-grained control of the speaking style of the synthesized speech. These temporal structures can be designed either on the speech side or the text side, leading to different control resolutions in time.
  • Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization. To leverage crowd-sourced data to train multi-speaker TTS models, it is essential to learn disentangled representations that can independently control the speaker identity and background noise in generated signals. To this end, they propose three components: (i) formulating a conditional generative model with factorized latent variables, (ii) using data augmentation to add noise that is not correlated with speaker identity and whose label is known during training, and (iii) using adversarial factorization to improve disentanglement.
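
For the neural source-filter model above, the "source module" part is easy to sketch: a sine wave whose instantaneous frequency follows the F0 contour, with noise where there is no pitch. The snippet below is only my own toy version under stated assumptions (per-sample F0 values, F0 = 0 meaning unvoiced, and made-up `amplitude` and `noise_std` defaults); the neural filter module that turns this excitation into speech is not shown.

```python
import math
import random

def sine_excitation(f0_per_sample, sample_rate=16000, amplitude=0.1,
                    noise_std=0.003, seed=0):
    """Sine-based excitation: accumulate phase from F0, fall back to noise if unvoiced."""
    rng = random.Random(seed)
    phase = 0.0
    excitation = []
    for f0 in f0_per_sample:
        noise = rng.gauss(0.0, noise_std)
        if f0 > 0:
            # Voiced: advance the phase by one sample at the current pitch.
            phase += 2.0 * math.pi * f0 / sample_rate
            excitation.append(amplitude * math.sin(phase) + noise)
        else:
            # Unvoiced: no periodic component, noise only.
            excitation.append(noise)
    return excitation

# Hypothetical F0 track: 40 voiced samples at 100 Hz followed by 10 unvoiced ones.
f0_track = [100.0] * 40 + [0.0] * 10
clean = sine_excitation(f0_track, noise_std=0.0)
noisy = sine_excitation(f0_track)
print(clean[39], clean[45], len(noisy))
```

Because the excitation is computed in one pass from the acoustic features, nothing here depends on previously generated waveform samples, which is what makes a non-autoregressive model possible.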

There were some interesting articles on audio and music synthesis:

  • A Vocoder Based Method For Singing Voice Extraction. The core idea behind this article is to bypass the source separation problem with synthesis. Want to do singing voice separation? Estimate the singing voice vocoder parameters directly from the music mixture. An elegant and pragmatic approach that works.
  • Neural Music Synthesis for Flexible Timbre Control. Their paper describes a neural music synthesis model with flexible timbre controls, which consists of a recurrent neural network conditioned on a learned instrument embedding followed by a WaveNet vocoder.
  • Data Efficient Voice Cloning for Neural Singing Synthesis. Making deep learning work with little training data can enable many applications. In their case, they explore the use case of singing voice cloning. They first leverage data from many speakers to create a multispeaker model, and then use small amounts of target data to adapt the model to new unseen voices.
  • Acoustic Scene Generation with Conditional SampleRNN. Acoustic scene generation (ASG) is a task to generate waveforms for acoustic scenes. They informally explained to us that it was harder to generate acoustic events than to generate acoustic scenes.

And some cool papers that do not fit into the categories I defined above: