This is the first ICASSP where I feel that the conference has become a place where influential machine learning papers are presented. I’m happy to see that most of our community is not only employing ‘LSTMs for a new dataset
This year I attended ICASSP to present two papers:
- Training neural audio classifiers with few data. We study whether (i) a naive regularization of the solution space, (ii) prototypical networks, (iii) transfer learning, or (iv) their combination, can help deep learning models to better leverage a small number of training examples. In general, transfer learning is the best option but prototypical networks might be useful in some cases.
- Randomly weighted CNNs for (music) audio classification. We study how non-trained (randomly weighted) CNNs perform as feature extractors for (music) audio classification tasks. They work surprisingly well, and this methodology serves to run a meta-evaluation of the CNN front-ends for audio.
Researchers are exploring different deep learning losses for audio. For example, domain-knowledge inspired costs, cycle-consistency losses or using GANs.
- End-to-End Feedback Loss in Speech Chain Framework via Straight-Through Estimator. The speech chain mechanism integrates automatic speech recognition (ASR) and text-to-speech synthesis (TTS) modules into a single cycle during training. It allows ASR and TTS to assist each other when they receive unpaired data: each module infers the missing pair, and the model is optimized with the reconstruction error.
- Cycle-consistency training for end-to-end speech recognition. They present a method to train end-to-end ASR models using unpaired data. Cycle-consistency-based approaches compose a reverse operation with a given transformation (e.g., TTS with ASR) to build a loss that only requires unsupervised data, speech in this example. In short, they train a Text-To-Encoder model and define a loss based on the encoder reconstruction error.
- Perceptually-motivated Environment-specific Speech Enhancement. This paper introduces a deep learning approach to enhance speech recordings made in a specific environment. A single neural network learns to ameliorate several types of recording artefacts, including noise, reverberation, and non-linear equalization. The method relies on a new perceptual loss function that combines adversarial loss with spectrogram features.
- A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality. They propose a perceptual metric for speech quality evaluation, which is suitable as a loss function for training deep learning methods. This metric is computed on a per-frame basis in the power spectral domain.
- STFT spectral loss for training a neural speech waveform model. They propose a new loss for end-to-end models that is based on the STFT. Interestingly, not only amplitude spectra (but also phase spectra) are used to calculate the proposed loss. They also mathematically show that training the waveform model with the proposed loss can be interpreted as maximum likelihood training.
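To make the idea of an STFT-domain loss with both amplitude and phase terms more concrete, here is a minimal NumPy sketch. It is only an illustration under my own assumptions: the frame length, hop size, log-amplitude distance and the `1 - cos` phase distance are choices made for this example, not necessarily the exact formulation of the paper, and the function names `stft` and `stft_spectral_loss` are hypothetical.

```python
import numpy as np


def stft(x, frame_len=256, hop=64):
    """Hann-windowed STFT; returns shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def stft_spectral_loss(y_hat, y, eps=1e-8):
    """Illustrative loss: log-amplitude error plus a phase-difference term."""
    Y_hat, Y = stft(y_hat), stft(y)
    # Amplitude term: squared error between log-amplitude spectra.
    amp_loss = np.mean((np.log(np.abs(Y_hat) + eps)
                        - np.log(np.abs(Y) + eps)) ** 2)
    # Phase term: 1 - cos(phase difference), zero when the phases agree.
    phase_loss = np.mean(1.0 - np.cos(np.angle(Y_hat) - np.angle(Y)))
    return amp_loss + phase_loss
```

A quick way to see why the phase term matters: a sign-flipped waveform has exactly the same amplitude spectra as the original, so an amplitude-only loss would score it as perfect, while the phase term in this sketch penalizes it.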
Besides the above works on cycle-consistency losses, here is an additional list of interesting works on semi-supervised (and unsupervised) learning:
- Conditional Teacher-Student Learning. Teacher-student (T/S) learning has been shown to be effective for a variety of problems such as domain adaptation and model compression. They propose a conditional T/S learning scheme in which a “smart” student model selectively chooses to learn from either the teacher model or the ground truth labels conditioned on whether the teacher can correctly predict the ground truth.
- Semi-Supervised Monaural Singing Voice Separation With a Masking Network Trained on Synthetic Mixtures. They study the problem of semi-supervised singing voice separation, in which the training data contains a set of samples of mixed music (singing and instrumental) and an unmatched set of instrumental music.
- Unsupervised training of a deep clustering model for multichannel blind source separation. They propose a training scheme to train neural network-based source separation algorithms from scratch when parallel clean data is unavailable. In particular, they show that an unsupervised spatial clustering algorithm is sufficient to guide the training of a
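The selection rule behind conditional teacher-student learning can be sketched in a few lines. This is a hedged, simplified reading of the description above: for each example, the student is trained against the teacher's soft posteriors if the teacher's top class matches the ground truth, and against the one-hot label otherwise. The function name `conditional_ts_loss` and the use of plain cross-entropy are assumptions for illustration.

```python
import numpy as np


def conditional_ts_loss(student_logits, teacher_probs, labels):
    """Cross-entropy against the teacher when it is right, else the label."""
    n_classes = student_logits.shape[1]
    # Log-softmax of the student outputs (shifted for numerical stability).
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    one_hot = np.eye(n_classes)[labels]
    # Condition: does the teacher's top class match the ground truth?
    teacher_correct = teacher_probs.argmax(axis=1) == labels
    # Per-example target: soft teacher posteriors or hard one-hot label.
    targets = np.where(teacher_correct[:, None], teacher_probs, one_hot)
    return -np.mean(np.sum(targets * log_probs, axis=1))
```

In this sketch the student never imitates a teacher prediction that is known to be wrong, which is the "smart" behaviour described in the paper summary.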
Many ‘post-processing’ Wavenet-like architectures do not explicitly predict a posterior probability distribution with a softmax. Instead, they just regress the output.
- Learning Bandwidth Expansion Using Perceptually-Motivated Loss. They introduce a perceptually motivated approach to bandwidth expansion for speech. Their method pairs a new 3-way split variant of the FFTNet neural vocoder structure with a perceptual loss function, combining objectives from both the time and frequency domains.
- Deep Learning for Tube Amplifier Emulation. Analog audio effects and synthesizers often owe their distinct sound to circuit nonlinearities. Faithfully modelling such an aspect of the original sound in virtual analogue software can prove challenging. They employ a feedforward variant of Wavenet to carry out a regression on audio waveform samples from input to output of a tube amplifier.
- Modeling of nonlinear audio effects with end-to-end deep neural networks. Although they do not use a Wavenet-like architecture, they also regress the output with very good results. They investigate deep learning architectures for audio effects processing to find a general purpose end-to-end deep neural network to model nonlinear audio effects.
The speech enhancement session was very interesting. One important research question was addressed there: is it better to feed deep neural networks with spectrograms, complex spectrograms, or waveforms?
- Real-time Speech Enhancement Using an Efficient Convolutional Recurrent Network for Dual-microphone Mobile Phones in Close-talk Scenarios. Their results sounded very convincing. They propose a novel deep learning based framework for real-time speech enhancement on dual-microphone mobile phones in a close-talk scenario. It incorporates a convolutional recurrent network (CRN) that is supposed to be computationally efficient.
- Supervised Speech Enhancement with Real Spectrum Approximation. Conventional speech enhancement methods usually ignore the phase – which is important to achieve high-quality speech renderings. To consider the phase, the complex number spectrum needs to be modeled. In their work, a pure real number spectrum is used as an alternative representation of the complex number spectrum, and a signal approximation method is used for speech enhancement.
- Complex Spectral Mapping with a Convolutional Recurrent Network for Monaural Speech Enhancement. Phase is important for perceptual quality in speech enhancement. However, it seems intractable to directly estimate the phase spectrogram through supervised learning – due to lack of clear structure in phase spectrograms. Complex Spectral Mapping aims to estimate the real and imaginary spectrograms of clean speech from those of noisy speech, which simultaneously enhances magnitude and phase responses of noisy speech.
- Deep Griffin-Lim Iteration. They present a novel phase reconstruction method by combining a signal-processing-based approach with deep learning. They propose an architecture which stacks sub-blocks including two Griffin-Lim inspired fixed layers and a DNN.
- Phasebook: Building Complex Masks via Discrete Representations for Source Separation. They propose to estimate phase using “phasebook”, a new type of layer based on a discrete representation of the phase difference between the mixture and the target. They also introduce “combook”, a similar type of layer that directly estimates a complex mask.
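A small oracle experiment helps to see why several of the works above model the complex spectrum instead of the magnitude alone. The sketch below is not taken from any of the papers: it compares an ideal magnitude mask, which keeps the noisy phase, with an ideal complex mask, using randomly generated stand-ins for the STFT coefficients. The function names are hypothetical.

```python
import numpy as np


def magnitude_mask_estimate(S, Y):
    """Oracle magnitude mask: corrects |Y| but keeps the noisy phase of Y."""
    mask = np.abs(S) / np.abs(Y)
    return mask * Y


def complex_mask_estimate(S, Y):
    """Oracle complex mask: corrects real and imaginary parts together."""
    mask = S / Y
    return mask * Y


rng = np.random.default_rng(0)
# Random stand-ins for complex STFT coefficients of clean speech and noise.
S = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
N = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
Y = S + N  # noisy mixture

mag_error = np.mean(np.abs(magnitude_mask_estimate(S, Y) - S) ** 2)
cplx_error = np.mean(np.abs(complex_mask_estimate(S, Y) - S) ** 2)
```

Even with a perfect magnitude mask, `mag_error` stays above zero because the phase is still that of the mixture, whereas the complex mask recovers `S` up to floating-point precision. Real systems have to estimate these masks, of course, so this only bounds what each representation can achieve.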
U-net architectures are widely used for source separation and speech denoising problems. ICASSP researchers were sharing their insights about that architecture.
- End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model. They presented a system showing that it is possible to tackle the (very challenging) task of lyrics alignment in an end-to-end fashion. It is based on a modified Wave-U-Net architecture, which predicts character probabilities directly from raw audio.
- End-to-End Sound Source Separation Conditioned On Instrument Labels. They study how to perform end-to-end music source separation with a variable number of sources using a Wave-U-Net-based model. They also propose a multiplicative conditioning with instrument labels that might be useful for audio-visual source separation and score-informed source separation.
- TCNN: Temporal Convolutional Neural Network for Real-time Speech Enhancement in the Time Domain. This work proposes a fully convolutional neural network (CNN) for real-time speech enhancement in the time domain (regresses the output). The proposed CNN is a U-net with an additional temporal convolutional module (TCM) inserted between the encoder and the decoder. They also employ dilated/causal convolutions in the context of U-net.
- Using Recurrences in Time and Frequency within U-net Architecture for Speech Enhancement. They investigate the receptive field of the U-net filters, and find that when designing fully-convolutional neural networks there is a trade-off between receptive field size, number of parameters and spatial resolution of features in deeper layers of the network. They propose to use a combination of many convolutional and recurrent layers that tackle this trade-off.
- Furcax: End-to-end Monaural Speech Separation Based on Deep Gated (De)convolutional Neural Networks with Adversarial Example Training. It consists of a deep gated (de)convolutional neural network that takes the mixed utterance of two speakers and maps it to two separated utterances. As a training objective, they employ the utterance level SDR in a permutation invariant training style, plus a generative adversarial loss. They said that the gated convolution helped significantly, but the adversarial loss delivered only 0.5 dB of improvement. They also pointed us to FurcaNeXt, which seems to work much better.
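The trade-off between receptive field and number of parameters, and the appeal of dilated convolutions in this context, can be checked with simple arithmetic. The helper below is a generic sketch for a stack of stride-1 convolutions; the kernel size and dilation patterns are made up for the example and are not the configurations used in the papers above.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked stride-1 dilated convolutions."""
    # Each layer widens the receptive field by (kernel_size - 1) * dilation.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Six layers with kernel size 3: same depth and same number of weights per
# channel pair, but very different temporal context.
plain = receptive_field(3, [1, 1, 1, 1, 1, 1])       # 13 samples
dilated = receptive_field(3, [1, 2, 4, 8, 16, 32])   # 127 samples
```

With undilated layers the context grows linearly with depth, so a large receptive field needs many layers (and parameters) or downsampling (and lower resolution). Exponentially growing dilations widen the context without either, which is one way to read the design choices discussed in this session.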
It was also nice to see some advances in multi-modal audio processing:
- Look, Listen and Learn More: Design Choices for Deep Audio Embeddings. They released OpenL3, an open-source deep audio embedding based on the self-supervised L3-Net, and claim that it outperforms VGGish and SoundNet (and the original L3-Net) on several sound recognition tasks.
- Perfect match: Improved cross-modal embeddings for audio-visual synchronisation. This paper proposes a new strategy for learning cross-modal embeddings for audio-to-video synchronization. To learn meaningful embeddings, their goal is to find the most relevant audio segment given a video clip. The method builds on the recent advances in learning representations from cross-modal self-supervision.
- Seeing through Sounds: Predicting Visual Semantic Segmentation Results from Multichannel Audio Signals. This paper is a bit scary: they infer an *image* segmentation from an audio recording. Basically, they propose to have a microphone that acts as a camera. Given that sounds provide us with vast amounts of information about surrounding objects and can even remind us of visual images of them, is it possible to implement this noteworthy human ability on machines?
Like many others, I was mind-blown during the TTS session. Here are my highlights:
- LPCNet: Improving Neural Speech Synthesis Through Linear Prediction. They propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis. They demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size.
- Neural source-filter-based waveform model for statistical parametric speech synthesis. A very interesting one! They propose a non-autoregressive source-filter waveform model that can be directly trained using a spectrum-based training criteria and stochastic gradient descent. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform.
- Robust and fine-grained prosody control of end-to-end speech synthesis. They propose prosody embeddings for emotional and expressive speech synthesis networks. The proposed method allows for temporal structures in the embedding networks, thus enabling fine-grained control of the speaking style of the synthesized speech. These temporal structures can be designed either on the speech side or the text side, leading to different control resolutions in time.
- Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization. To leverage crowd-sourced data to train multi-speaker TTS models it is essential to learn disentangled representations which can independently control the speaker identity and background noise in generated signals. To this end, they propose 3 components to address this problem by: (i) formulating a conditional generative model with factorized latent variables, (ii) using data augmentation to add noise that is not correlated with speaker identity and whose label is known during training, and (iii) using adversarial factorization to improve disentanglement.
There were some interesting articles on audio and music synthesis:
- A Vocoder Based Method For Singing Voice Extraction. The core idea behind this article is to bypass the source separation problem with synthesis. Want to do singing voice separation? Estimate the singing voice vocoder parameters directly from the music mixture. An elegant and pragmatic approach that works.
- Neural Music Synthesis for Flexible Timbre Control. Their paper describes a neural music synthesis model with flexible timbre controls, which consists of a recurrent neural network conditioned on a learned instrument embedding followed by a WaveNet vocoder.
- Data Efficient Voice Cloning for Neural Singing Synthesis. Making deep learning work with few training data can enable many applications. In their case, they explore the use case of singing voice cloning. They leverage data from many speakers to first create a multispeaker model, to later employ small amounts of target data to adapt the model to new unseen voices.
- Acoustic Scene Generation with Conditional SampleRNN. Acoustic scene generation (ASG) is a task to generate waveforms for acoustic scenes. They informally explained to us that it was harder to generate acoustic events than to generate acoustic scenes.
And some cool papers that do not fit into the categories I defined above:
- Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition. They propose a fully self-attentional network for CTC, and show that it can achieve competitive results for speech recognition. They found that 1-2 heads might be enough!
- High-quality speech coding with SampleRNN. They provide a speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of state-of-the-art classic wide-band codecs.
- Learning Sound Event Classifiers from Web Audio with Noisy Labels. I’m happy to see how the family of Freesound Datasets is growing! This dataset contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labelled data and a larger quantity of real-world noisy data from the web.
- Attention-based Wavenet Autoencoder for Universal Voice Conversion. Their method is based on a WaveNet autoencoder with a novel attention component that supports the modification of timing between the input and the output samples.
- Learning to Match Transient Sound Events Using Attentional Similarity for Few-shot Sound Recognition. They introduce a novel attentional similarity module for the problem of few-shot sound recognition. Given a few examples of an unseen sound event, a classifier must be quickly adapted to recognize the new sound event without much fine-tuning.
- WaveGlow: A Flow-based Generative Network for Speech Synthesis. WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression.
- SDR – half-baked or well done? Many papers have been relying on BSS_eval to evaluate their methods and compare them to previous works. They argue here that the signal-to-distortion ratio (SDR) implemented in BSS_eval has generally been improperly used and abused. They propose to use a slightly modified definition called scale-invariant SDR (SI-SDR). They present various examples of critical failure of the original SDR that SI-SDR overcomes.
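For reference, here is a minimal NumPy sketch of SI-SDR as I understand its definition: the reference is rescaled by the optimal projection coefficient before computing the ratio. It assumes single-channel, time-aligned signals and skips any mean removal. The `plain_sdr` function is only a naive unscaled ratio added for comparison; it is not the BSS_eval implementation.

```python
import numpy as np


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB for 1-D, time-aligned signals."""
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))


def plain_sdr(estimate, reference):
    """Naive unscaled ratio, for comparison only (not BSS_eval's SDR)."""
    error = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))
```

Multiplying the estimate by any non-zero gain leaves `si_sdr` unchanged, while `plain_sdr` changes with the gain even though the estimate sounds the same up to volume. That is the kind of sensitivity to rescaling that the scale-invariant definition is meant to remove.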