Here is the (comprehensive) list of audio and speech papers presented this week at ICLR, together with some comments. At the end of the post, I also share my thoughts on the virtual aspect of the conference.
The post is structured as follows: I first paste a screenshot with the title/authors of each paper, followed by a list of notes/comments. Here we go!
- I like to think of this paper as a “Manifesto” that describes an important trend in the field: using strong inductive biases to develop deep learning architectures for music/audio.
- The cons? While inductive biases can help with generalisation, constraining the solution space can limit the expressivity of the model. In their specific case, they are both empowered and limited by the constraints of the sinusoidal+noise model.
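To build intuition for that sinusoidal+noise constraint, here is a minimal toy sketch of a harmonics-plus-noise synthesizer in NumPy. This is my own illustration, not the authors' implementation; the function name and parameters are made up:

```python
import numpy as np

def harmonic_plus_noise(f0, harmonic_amps, noise_std, sr=16000):
    """Render audio as a sum of harmonic sinusoids plus additive noise.

    f0: fundamental frequency per sample (Hz), shape (n,)
    harmonic_amps: amplitude per harmonic per sample, shape (n, k)
    noise_std: standard deviation of the noise component
    """
    n, k = harmonic_amps.shape
    # Instantaneous phase of the fundamental: integrate frequency over time.
    phase = 2 * np.pi * np.cumsum(f0) / sr
    # Each harmonic oscillates at an integer multiple of the fundamental.
    harmonics = np.sin(phase[:, None] * np.arange(1, k + 1)[None, :])
    harmonic_part = (harmonic_amps * harmonics).sum(axis=1)
    noise_part = noise_std * np.random.randn(n)
    return harmonic_part + noise_part
```

The expressivity trade-off is visible right in the signature: whatever the network predicts (f0, harmonic amplitudes, noise level), the output can only ever be a sum of harmonics plus noise.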
- This extends the Deep Image Prior idea to the audio domain.
- If you like this line of research, here are some additional pointers! Some are references that the authors missed, and others are recent work that I find interesting:
- They presented one of the first GANs to achieve state-of-the-art results in speech synthesis. Their results are close to WaveNet's – but without paying the computational cost (at generation time) of going auto-regressive.
- Their goal was to develop a BigGAN for audio.
- This work is contemporary with MelGAN, which achieves similar results with similar techniques.
- They learn discrete linguistic units in speech signals (using video as supervision) by incorporating vector quantization layers into neural models.
- Two key aspects: cross-modal learning, and two vector quantization layers!
- The two quantization layers allow for capturing phonetic units and word-level units, respectively.
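As a sketch of what a vector quantization layer does at inference time (snapping continuous encoder outputs to the nearest codebook entry), assuming a plain L2 nearest-neighbour lookup; `vector_quantize` and its signature are my own:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Snap each input vector to its nearest codebook entry (L2 distance).

    z: encoder outputs, shape (n, d)
    codebook: learned discrete units, shape (c, d)
    Returns (quantized vectors, integer codes).
    """
    # Pairwise squared distances between inputs and codebook entries.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = d2.argmin(axis=1)
    return codebook[codes], codes
```

Stacking two such layers (with different codebooks and time scales) is what lets the model capture phonetic units in one and word-level units in the other.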
- Inspired by recent advances in NLP (e.g., GPT or BERT), they propose a new ASR pipeline based on self-supervised learning: vq-wav2vec + RoBERTa + acoustic model + language model.
- They report a significant improvement, achieving state-of-the-art results on public benchmarks.
- Their goal is to obtain control parameters for neural speech synthesis engines (like Tacotron) in a semi-supervised fashion.
- They argue that 3 min of labelled data is enough.
- This is a self-supervised learning paper that looks at the problem from the multimodal perspective (speech to generate images).
- Many were concerned about the unethical uses of this technology.
- The inner product + softmask layer is cool.
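I am not reproducing their exact layer here, but assuming it works like a standard inner-product similarity followed by a softmax that acts as a soft mask, a toy NumPy sketch could look like this (`softmask` and its signature are hypothetical):

```python
import numpy as np

def softmask(queries, keys, values):
    """Soft mask from inner-product similarities.

    queries: (n, d), keys: (m, d), values: (m, d_v)
    Returns (mask, masked output).
    """
    scores = queries @ keys.T                      # inner products
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    mask = np.exp(scores)
    mask /= mask.sum(axis=1, keepdims=True)        # each row sums to 1
    return mask, mask @ values
```

The appeal is that the mask is differentiable and interpretable: each row tells you how strongly a query attends to each key.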
What was it like to attend a virtual conference?
It was not a bad experience. What I really liked:
- 5 min video summaries, good! Most of our papers are not scientific breakthroughs that require more than 5 minutes of our time… This allows for a fast ingestion of ideas.
- Going virtual allows for a “fairer” access to conferences. They are cheaper and more accessible to everyone. That’s good, as well.
What I did not like:
- You completely miss the social (and fun) aspect of attending an (international) conference. It’s fun to travel around the world, to try new food, to discover new music, etc. Because… even though we are computer scientists, it’s nice to hang out with nerdy people like you!
Hence, attending a virtual conference is more like watching as many MOOC videos as you can – so that you have the chance to ask the authors a question.
It does not feel like “attending a conference” but like “compulsively watching YouTube”.