Takeaways from the Google Speech Summit 2018

After attending the Google Speech Summit 2018, I would venture to say that Google’s speech interests for the future are: (i) to keep improving their automatic speech recognition (w/ Listen, Attend and Spell, a seq2seq model) and speech synthesis (w/ Tacotron 2 + Wavenet/WaveRNN) systems so that a robust interface is available for their conversational agent; (ii) to keep simplifying pipelines – having fewer “separated” blocks in order to be end-to-end whenever possible; (iii) to study how to better control some aspects of their end-to-end models – for example, with style tokens they aim to control some Tacotron (synthesis) parameters; and (iv) to build the Google Assistant, a conversational agent that is receiving lots of effort and that I guess will be the basis of their next generation of products.

The following lines summarize (by topic) what I found relevant – and, ideally, describe some details that are not in the papers.

Wavenet

This summit was a great chance to ask the Wavenet inventors some questions, and it was fun to hear some stories about the model. For example, Heiga Zen explained that during the previous Google Speech Summit he got an email from Aaron (the first author of the Wavenet paper) saying that he was obtaining some “interesting” results by directly processing raw audio. They also explained that their first Wavenet investigations were tackling the unsupervised learning problem (without conditioning, the setup that produces the “babbling”). Under this setup, they needed a fairly big receptive field to make it work. However, after incorporating the conditioning, they realized that such a long receptive field is possibly not needed, since conditioning can provide some sense of (global) structure. Related to that, I asked why we are not seeing Wavenet papers that synthesize music from scores; it turns out that for music one generally needs larger models (maybe due to polyphony?) and a larger receptive field (temporal dependencies are longer in music than in speech), which makes the problem harder.
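To get a feeling for the receptive field sizes discussed above, here is a minimal sketch of the usual arithmetic for a stack of dilated convolutions. The kernel size, the dilation pattern and the 16 kHz sampling rate are my own assumptions for illustration, not the configuration of any Google model.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked dilated 1-D convolutions.

    Each layer with dilation d and kernel size k extends the
    receptive field by (k - 1) * d samples.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical stack: dilations 1, 2, 4, ..., 512, repeated 3 times.
dilations = [2 ** i for i in range(10)] * 3
rf_samples = receptive_field(kernel_size=2, dilations=dilations)

# Assumed sampling rate, only used to express the result in milliseconds.
sample_rate_hz = 16000
rf_ms = 1000.0 * rf_samples / sample_rate_hz

print(f"{rf_samples} samples, about {rf_ms:.0f} ms at {sample_rate_hz} Hz")
```

With these assumed numbers the model sees 3070 samples, a fraction of a second of audio. This hints at why music, with its longer temporal dependencies, would call for a much deeper (or more heavily dilated) stack.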

Besides the original Wavenet, the parallel Wavenet was also presented and discussed. Since the model is rather new, the audience was curious to know more about it – for example: does it make sense to use a non-causal architecture for the Wavenet student? How does the model enforce temporal coherence if the student outputs are predicted independently from (independently generated) noise samples? It was interesting to observe that for many questions there was no straightforward answer, because the speech community (including Google researchers) is in the process of building the knowledge and intuitions required to design efficient Wavenet-like models. In line with that, I think the FFTNet and WaveRNN works are nice “ablation studies” that can provide an intuition on which parts are essential – or, as another example, our Wavenet for speech denoising explores how to enforce temporal continuity with a highly parallelizable non-causal Wavenet. Exciting times to work on that!
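For readers less familiar with the causal vs. non-causal distinction raised above, here is a small sketch using plain NumPy convolutions. It is a toy, single-layer illustration with made-up weights, not the parallel Wavenet student: it only shows that a causal filter never looks at future samples, while a centered (non-causal) one does.

```python
import numpy as np


def causal_conv(x, w):
    """y[t] = sum_j w[j] * x[t - j]: uses only current and past samples."""
    return np.convolve(x, w)[: len(x)]


def noncausal_conv(x, w):
    """Centered filter: for a 3-tap kernel, y[t] also depends on x[t + 1]."""
    return np.convolve(x, w, mode="same")


x = np.arange(8, dtype=float)
w = np.array([0.5, 0.3, 0.2])  # arbitrary illustrative weights

# Perturb a "future" sample (index 5) and inspect the outputs up to index 4.
x_future = x.copy()
x_future[5] += 1.0

causal_unchanged = np.allclose(causal_conv(x, w)[:5], causal_conv(x_future, w)[:5])
noncausal_unchanged = np.allclose(
    noncausal_conv(x, w)[:5], noncausal_conv(x_future, w)[:5]
)

print("causal output unaffected by the future sample:", causal_unchanged)
print("non-causal output unaffected by the future sample:", noncausal_unchanged)
```

Causality is what allows sample-by-sample autoregressive generation; when the whole input is available at once (as in denoising), a non-causal filter can use context from both sides.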

Automatic speech recognition

They treat automatic speech recognition (ASR) as a sequence-to-sequence problem (audio-to-text) – but it was interesting to see that text-to-speech was also presented as a sequence-to-sequence problem, as was machine translation. Also regarding how the talks were introduced, it was nice to see that some speakers (including the ASR and text-to-speech ones) went through the “traditional” literature as a way to motivate their (deep learning based) works – which counterbalances the widespread narrative that many Google researchers express: “no need for domain knowledge, all I need is TPUs and data”; or, put in controversial words: “turn an engineering problem into a data collection problem”.

Google’s current bet for ASR is the Listen, Attend and Spell model. Interestingly enough, most architectures discussed during the ASR session were based on spectrograms + LSTMs. After asking about that, I got two answers: (1) why use waveforms if spectrograms seem to work well for discriminative tasks?, and (2) there is no need for CNN front-ends if you have enough training data.

Tacotron 2

Their main goal with the Tacotron series of models is to build a text-to-speech system that “sounds natural”. Although Tacotron models produce reasonably good results when synthesizing words and sentences, they have some prosodic issues when synthesizing long paragraphs. It was interesting to hear how some earlier Tacotron models produced speech samples where the intonation “was declining” – it was like listening to someone who became sadder and sadder with every word. They presented some recent work on “global style tokens”, meant to control the synthesis process to (eventually) improve intonation and prosody when synthesizing long paragraphs.

A friendly reminder for Wavenet fans: Tacotron 2 makes use of Wavenet to synthesize waveforms from mel-spectrograms. As a result of this add-on, the Wavenet model is explained in the Tacotron 2 paper – and they told us that “this paper is a good reference for the original Wavenet”. I guess we now have some more hints on how to implement Wavenet! Also related to Wavenet, they showed some examples comparing the Tacotron 2 and Wavenet text-to-speech models – and Tacotron 2 was able to produce more “natural sounding” speech, with better intonation/prosody.
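As a side note on the mel-spectrograms mentioned above: a mel-spectrogram places its frequency bands evenly on the mel scale rather than in Hz. The sketch below uses the common HTK-style conversion formula, which is an assumption on my side (other variants exist, and I am not claiming this is the exact one used for Tacotron 2). The number of bands and the frequency range are illustrative values as well.

```python
import numpy as np


def hz_to_mel(f_hz):
    """HTK-style Hz-to-mel conversion (one common variant, assumed here)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_band_centers(n_bands, f_min_hz, f_max_hz):
    """Frequencies (in Hz) that are evenly spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(f_min_hz), hz_to_mel(f_max_hz), n_bands)
    return mel_to_hz(mels)


# Illustrative values: 80 bands between 125 Hz and 7600 Hz.
centers = mel_band_centers(n_bands=80, f_min_hz=125.0, f_max_hz=7600.0)
print("first bands (Hz):", np.round(centers[:3], 1))
print("last bands (Hz): ", np.round(centers[-3:], 1))
```

The bands are narrow at low frequencies and get wider towards high frequencies, which gives a compact representation of speech for the vocoder (here, Wavenet) to be conditioned on.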

Simplifying the pipeline

Several speakers mentioned, during different talks, that Tacotron models require normalized text as input. Put in other words: the neural networks they tried for learning how to normalize the text end-to-end did not work. As mentioned in the introduction, Google researchers have a strong interest in simplifying their pipelines – end-to-end models are easier to deploy (and maintain) than systems having many “parts”. Therefore, not being able to have a fully end-to-end model for Tacotron might be highly unsatisfying for them. I am sure we will see a new paper introducing a Tacotron 3 that tackles how to normalize the text with neural networks!

WaveRNN was another hot topic during the summit – and note that it is closely related to the idea of simplifying the pipeline, since WaveRNN can achieve Wavenet-level results with an unpretentious RNN-based model. For example: due to the small size of the model, it could make sense to deploy WaveRNN models on cell phones, and to use parallel Wavenet for applications that permit sending queries to a server (which has the capacity to parallelize these computations).

Wavenet for speech coding

Felicia Lim presented their recent work on using Wavenet for low-rate speech coding. Despite their convincing results, she explained that they had some issues preserving the identity of the speakers – for that reason the paper suggests that the current coder would be more suitable for human-facing applications, e.g. conference calls. When I asked how the model (trained on speech) would generalize to music signals, she explained that they did not try, but that music is a challenging signal due to its polyphony. Finally, she advocated for new evaluation metrics. They found that some of the metrics they used did not correlate well with the actual performance of the model. She argued that this might be because these metrics were designed to compare models based on radically different principles than Wavenet – which operates directly on waveforms.

Low resource languages

Although recent advances in speech technology have been driven by data-hungry models, it was not until the second day that someone responsible for “some data” at Google came to give us a talk. Martin Jansche showed a graph depicting that just a few languages have millions of speakers, while most of the world’s languages have a much smaller number of speakers. As a result: if Google researchers are able to solve ASR and text-to-speech for these low resource languages, there is a huge business opportunity in bringing Google’s products to these new markets. A nice corollary from this talk is that it is not always reasonable to assume that infinite amounts of data are available to train our models.

He introduced the idea of using a “multilingual” dataset to overcome some of the aforementioned problems. For example: they tried using a “multilingual” dataset to synthesize speech for languages that were not in the training set as such. From these experiments, they found that their models were able to generate intelligible speech with a non-native “weird” accent. Interesting!

Conversational agents

Note that every single one of the previous topics can be regarded as a way to improve the Google Assistant, their conversational agent. ASR and speech synthesis are the interface, low-rate speech coding enables sending high-quality speech through limited networks, simpler pipelines allow for easier deployment, and being able to deploy these technologies for any (low resource) language is a way to make the Google Assistant accessible to everyone.

The last two talks were on conversational agents: the first one was an exploratory talk introducing some background – to start a discussion on what the ideal conversational agent should be like; and the second one showcased (with a demo) the current capabilities of the Google Assistant – so that we could grasp the challenges of the field.

Talk abstracts

Below, I attach the abstracts of the tech talks that were given – copied from the summit program:

Generative Text-to-Speech Synthesis, Heiga Zen, Research Scientist
Abstract: Recent progress in deep generative models and its application to text-to-speech (TTS) synthesis has made a breakthrough in the naturalness of artificially generated speech. This talk first details the generative model-based TTS synthesis approach from its probabilistic formulation to actual implementation including statistical parametric speech synthesis, then discusses the recent deep generative model-based approaches such as WaveNet, Char2Wav, and Tacotron from this perspective. Possible future research topics and topics are also discussed.

Recent Developments in Google ASR, Khe Chai Sim, Research Scientist
Abstract: In this talk, I will present the recent developments in Google’s ASR technologies, including advanced techniques towards end-to-end modeling such as Connectionist Temporal Classification (CTC), RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS). I will also describe our effort in improving the computational efficiency for the CTC and lattice-free MMI losses on GPU using a native TensorFlow implementation of the forward-backward algorithm. Finally, I will present a summer intern project towards understanding the RNN states for LAS using memory signature.

Taking WaveNet TTS into Production, Tom Walters, Research Scientist
Abstract: The WaveNet architecture has been hugely successful in generating high-quality speech for TTS. In this talk I’ll present the architecture of the original autoregressive WaveNet model, and then discuss subsequent research from DeepMind on parallel WaveNet and WaveRNN. These new generative models of raw audio retain the quality of the original WaveNet synthesis while improving the computational efficiency of sampling. This research has opened up real-world applications of WaveNet-inspired models. I’ll also explain a bit about DeepMind’s structure, and how the DeepMind for Google team helps to bring DeepMind research to production in Google.

WaveNet based low rate speech codec, Felicia Lim, Software Engineer
Abstract: Speech coding first found its major application in secure communications and later enabled low cost mobile and internet communications. With continuously decreasing cost of bandwidth in most applications, the trade-off between rate and quality has gravitated to higher rates to ensure good quality. Yet, in regions with poor infrastructure or network conditions, users can benefit from lower rates and improved quality. This talk will discuss how WaveNet can be leveraged to create state-of-the-art speech coding systems that provide a significant leap in performance.

Language Technology for the World: Fun & Fungibility, Martin Jansche, Software Engineer
Abstract: Imagine a world where people want to communicate with friends and family, find answers to daily challenges as experienced by them, or look for education or entertainment in their own language — yet for many reasons, none of this is possible or easy. This is our world, now, as experienced by most people. Addressing these challenges puts us on a road towards commoditization of speech and language technology. I’ll walk you through some of the early steps of my team.

Characterizing Conversational Search and Recommendation, Filip Radlinski, Research Scientist
Abstract: This talk considers conversational approaches to search and recommendation, presenting a theory and model of information interaction in a dialog setting. In particular, consider the question of what practical properties would be desirable for a conversational information retrieval or recommendation system, so that the system can allow users to answer a variety of information needs in a natural and efficient manner. I will describe a proposed set of properties that taken together could measure the extent to which a system is conversational, as well as a theoretical model of a conversational system that implements the properties.

Age of Assistance, Behshad Behzadi, Principal Engineer
Abstract: Advances in Artificial Intelligence are powering a quantum leap in the capabilities and usefulness of personal digital assistants. These intelligent assistants present a paradigm shift in human machine interaction: they understand context, have a personality, and can successfully converse in natural language: answering questions and performing actions across an unprecedented breadth of connected devices, and services. In his talk “Age of Assistance”, Behshad Behzadi provides an insider’s perspective on these exciting developments, as seen through the eyes of the engineering lead that built the Google Assistant.