My ICASSP 2018 highlights

This year’s ICASSP keywords were: generative adversarial networks (GANs), WaveNet, speech enhancement, source separation, industry, music transcription, cover song identification, sampleCNN, monophonic pitch tracking, and gated/dilated CNNs. This time, passionate scientific discussions happened in random sports bars in downtown Calgary – next to melting piles of dirty snow.

GANs were this year’s novelty. I remember that at last year’s ICASSP there were only two or three GAN papers; this year, however, many approaches were based on adversarial training. A quick (Ctrl+F) search through the proceedings reveals the following statistics, depicting in which audio fields (generative) adversarial training has been employed (a minimal script for automating such a count is sketched right after the list):

  • Speech enhancement: 5
  • Audio source separation: 4
  • Automatic speech recognition: 4
  • Feature learning: 4
  • Voice conversion: 2
  • Speech-based emotion recognition: 2
  • Speaker verification: 1
  • Data augmentation: 1
  • Audio bandwidth extension: 1
  • Text-to-speech: 1
  • Speech synthesis: 1
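In case it is useful, here is a minimal sketch of how such a keyword count could be automated, assuming one plain-text file per paper (the folder name and keyword list below are placeholders for illustration, not the actual proceedings layout):

```python
import glob
import re
from collections import Counter

# Hypothetical layout: one plain-text file per paper, e.g. extracted from the
# proceedings PDFs. Both the folder name and the keyword list are placeholders.
KEYWORDS = ["adversarial", "wavenet", "speech enhancement", "source separation"]

papers_per_keyword = Counter()
for path in glob.glob("proceedings_txt/*.txt"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read().lower()
    for keyword in KEYWORDS:
        # Count papers rather than occurrences: one hit per paper is enough.
        if re.search(re.escape(keyword), text):
            papers_per_keyword[keyword] += 1

for keyword, n_papers in papers_per_keyword.most_common():
    print(f"{keyword}: {n_papers} papers")
```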

Although the statistics I report focus on audio applications, 11 image processing papers also used (generative) adversarial training – which is not a surprise, given that most successful GAN stories come from the computer vision field.

Since I’m quite interested in the WaveNet model, I also searched (Ctrl+F) for WaveNet through the proceedings to get a quick idea of the fields in which this model has been used:

  • Speech synthesis: 5
  • Speech denoising/enhancement: 1
  • Voice conversion: 1
  • Studying quality/quantity of the training data: 1

Among these, one can find a WaveNet for speech denoising (our paper [32]), another for speech coding [2], and Tacotron 2 [4]. However, these statistics do not include some papers that are closely related to the WaveNet model, such as FFTNet (a simpler conceptualization of WaveNet [30]) or the deep-learning-based speech beamforming work (which makes use of a WaveNet for speech denoising [31]).
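For readers who have not looked into WaveNet, the sketch below shows its core ingredient – a stack of gated, dilated causal convolutions – written in PyTorch. This is only a toy illustration under my own simplifications (no skip connections, no conditioning, arbitrary channel count), not the models used in the papers above:

```python
import torch
import torch.nn as nn

class DilatedCausalBlock(nn.Module):
    """Toy gated, dilated causal convolution block, WaveNet-style."""
    def __init__(self, channels, dilation):
        super().__init__()
        # Left-pad so the convolution never looks into the future (causality).
        self.pad = (2 - 1) * dilation
        self.filter = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        h = nn.functional.pad(x, (self.pad, 0))
        z = torch.tanh(self.filter(h)) * torch.sigmoid(self.gate(h))  # gated activation
        return x + self.res(z)  # residual connection

# Stack with exponentially growing dilations: the receptive field doubles per layer.
layers = nn.Sequential(*[DilatedCausalBlock(32, 2 ** i) for i in range(8)])
x = torch.randn(1, 32, 16000)   # (batch, channels, samples)
y = layers(x)                   # same length as the input, but a large receptive field
print(y.shape)
```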

Besides automatic speech recognition, speech enhancement and source separation were two major topics at this year’s ICASSP – a trend one can also observe in the statistics above. Although most speech enhancement and source separation approaches were based on the traditional masking/Wiener-filtering pipeline, some new ideas were proposed, for example: i) directly processing the waveform to bypass potential limitations of the traditional masking/Wiener-filtering scheme [13,32], or ii) an iterative approach to single-channel source separation [15]. It was also interesting to observe a solid consensus around the idea that the traditional (objective) metrics used for source separation are obsolete. Some people are working on that, and the proposed solutions were mostly based on incorporating human evaluations into the development loop [24,25,26].
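For context, the traditional pipeline that most of these systems still follow looks roughly like the minimal sketch below (using librosa). Here an oracle-style ratio mask stands in for the neural mask estimator, so this is an illustration of the general idea, not any particular paper’s method:

```python
import numpy as np
import librosa

def separate(mixture, clean_source, n_fft=1024, hop=256):
    """Minimal masking pipeline: STFT -> mask -> iSTFT."""
    mix_spec = librosa.stft(mixture, n_fft=n_fft, hop_length=hop)
    src_spec = librosa.stft(clean_source, n_fft=n_fft, hop_length=hop)
    # Soft (Wiener-like) ratio mask in the magnitude domain; in practice this
    # mask would be predicted by a neural network from the mixture alone.
    mask = np.clip(np.abs(src_spec) / (np.abs(mix_spec) + 1e-8), 0.0, 1.0)
    # Apply the mask to the mixture and reuse the mixture phase.
    est_spec = mask * mix_spec
    return librosa.istft(est_spec, hop_length=hop, length=len(mixture))
```

The waveform-based approaches mentioned above [13,32] essentially replace this whole pipeline with a network that maps noisy samples directly to clean samples, which, among other things, avoids reusing the (noisy) mixture phase.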

Further, together with ICASSP’s growing interest in neural networks, the presence of industry at the conference has increased. One can also observe that the conference subtitle has changed from “the internet of signals” to “signal processing and artificial intelligence: changing the world“. Somehow, the organizers are trying to involve the signal processing community in the so-called “deep learning revolution“. Let’s hope that this move helps build bridges between the signal processing and machine learning literatures, instead of “decimating” the biggest (audio) signal processing community on earth. In line with that, Alex Acero’s (director @ Apple) plenary talk was a nice contribution towards building some interesting connections between these two communities.

Regarding music signal processing, here are my highlights: 1) music transcription is a hot topic right now; people are trying deep learning based methods, but their sigmoidal outputs are a challenge since one desires (coherent) binary outputs [22,28] – see the toy sketch after this paragraph; 2) deep neural networks are being used for cover song identification, with interesting results and a varied set of approaches [23,29]; 3) an improved version of the sampleCNN was proposed for multi-label classification of (music) audio [20]; and 4) CREPE (a convolutional representation for pitch estimation [18]) was presented as an alternative to the pYIN algorithm for monophonic pitch tracking – check their amazing online demo!
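To make point 1 a bit more concrete, here is a toy sketch (plain numpy, with a random posteriorgram standing in for a real network output) of why per-frame binarization of sigmoidal outputs tends to produce incoherent piano rolls, together with a naive smoothing heuristic; the papers above propose far more principled solutions:

```python
import numpy as np

# Toy posteriorgram: 88 pitches x 100 frames of sigmoid outputs in [0, 1].
rng = np.random.default_rng(0)
posteriorgram = rng.random((88, 100))

# Naive per-frame binarization: every pitch/frame is decided independently,
# so the resulting piano roll can flicker on and off between adjacent frames
# and produce musically incoherent note fragments.
piano_roll = (posteriorgram > 0.5).astype(np.int8)

# A slightly more "coherent" heuristic: smooth each pitch track over time
# before thresholding, so short spurious activations are suppressed.
kernel = np.ones(5) / 5.0
smoothed = np.vstack([np.convolve(row, kernel, mode="same") for row in posteriorgram])
piano_roll_smooth = (smoothed > 0.5).astype(np.int8)
```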

Warning! This post is biased towards my interests (deep audio tech). ICASSP is so huge that there is no way one can attend all the presentations. Feel free to suggest any addition to the list; I will be happy to update it with interesting papers I missed!

Below, a non-comprehensive list of papers that I enjoyed:

[1] HIGH-QUALITY NONPARALLEL VOICE CONVERSION BASED ON CYCLE-CONSISTENT ADVERSARIAL NETWORK
Fuming Fang, Junichi Yamagishi, Isao Echizen, Jaime Lorenzo-Trueba, National Institute of Informatics, Japan

[2] WAVENET BASED LOW RATE SPEECH CODING
W. Bastiaan Kleijn, Victoria University of Wellington, New Zealand; Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Google Inc., United States; Florian Stimberg, DeepMind, United Kingdom; Quan Wang, Google Inc., United States; Thomas C. Walters, DeepMind, United Kingdom

[3] AN INVESTIGATION OF NOISE SHAPING WITH PERCEPTUAL WEIGHTING FOR WAVENET-BASED SPEECH GENERATION
Kentaro Tachibana, National Institute of Information and Communications Technology, Japan; Tomoki Toda, Nagoya University, Japan; Yoshinori Shiga, Hisashi Kawai, National Institute of Information and Communications Technology, Japan

[4] NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS
Jonathan Shen, Ruoming Pang, Ron Weiss, Mike Schuster, Navdeep Jaitly, Google Inc., United States; Zongheng Yang, University of California, Berkeley, United States; Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu, Google Inc., United States

[5] GENERALISED DISCRIMINATIVE TRANSFORM VIA CURRICULUM LEARNING FOR SPEAKER RECOGNITION
Erik Marchi, Stephen Shum, Kyuyeon Hwang, Sachin Kajarekar, Siddharth Sigtia, Hywel Richards, Rob Haynes, Yoon Kim, John Bridle, Apple, Inc., United Kingdom

[6] BAYESIAN ANISOTROPIC GAUSSIAN MODEL FOR AUDIO SOURCE SEPARATION
Paul Magron, Tuomas Virtanen, Tampere University of Technology, Finland

[7] TRAINING SUPERVISED SPEECH SEPARATION SYSTEM TO IMPROVE STOI AND PESQ DIRECTLY
Hui Zhang, Xueliang Zhang, Guanglai Gao, Inner Mongolia University, China

[8] CLASSIFICATION VS. REGRESSION IN SUPERVISED LEARNING FOR SINGLE CHANNEL SPEAKER COUNT ESTIMATION
Fabian-Robert Stoeter, Soumitro Chakrabarty, Bernd Edler, Emanuel A. P. Habets, International Audio Laboratories Erlangen, Germany

[9] VOICE IMPERSONATION USING GENERATIVE ADVERSARIAL NETWORKS
Yang Gao, Rita Singh, Bhiksha Raj, Carnegie Mellon University, United States

[10] WHAT IS MY DOG TRYING TO TELL ME? THE AUTOMATIC RECOGNITION OF THE CONTEXT AND PERCEIVED EMOTION OF DOG BARKS
Simone Hantke, Nicholas Cummins, Bjoern Schuller, University of Augsburg, Germany

[11] ADVERSARIAL SEMI-SUPERVISED AUDIO SOURCE SEPARATION APPLIED TO SINGING VOICE EXTRACTION
Daniel Stoller, Queen Mary University of London, United Kingdom; Sebastian Ewert, Spotify, United Kingdom; Simon Dixon, Queen Mary University of London, United Kingdom

[12] GENERATIVE ADVERSARIAL SOURCE SEPARATION
Cem Subakan, Paris Smaragdis, University of Illinois at Urbana-Champaign, United States

[13] LANGUAGE AND NOISE TRANSFER IN SPEECH ENHANCEMENT GENERATIVE ADVERSARIAL NETWORK
Santiago Pascual, Universitat Politecnica de Catalunya, Spain; Maruchan Park, Chungnam National University, Korea (South); Joan Serra, Telefonica Research, Spain; Antonio Bonafonte, Universitat Politecnica de Catalunya, Spain; Kang-Hun Ahn, Chungnam National University, Korea (South)

[14] BEING LOW-RANK IN THE TIME-FREQUENCY PLANE
Valentin Emiya, Ronan Hamon, Aix Marseille Univ, CNRS, Centrale Marseille, LIS, France; Caroline Chaux, Aix Marseille Univ, CNRS, Centrale Marseille, I2M, France

[15] LISTENING TO EACH SPEAKER ONE BY ONE WITH RECURRENT SELECTIVE HEARING NETWORKS
Keisuke Kinoshita, NTT Corporation, Japan; Lukas Drude, Paderborn University, Germany; Marc Delcroix, Tomohiro Nakatani, NTT Corporation, Japan

[16] A LARGE-SCALE STUDY OF LANGUAGE MODELS FOR CHORD PREDICTION
Filip Korzeniowski, Johannes Kepler University Linz, Austria; David R. W. Sears, Texas Tech University, United States; Gerhard Widmer, Johannes Kepler University Linz, Austria

[17] RETRIEVAL OF SONG LYRICS FROM SUNG QUERIES
Anna M. Kruspe, Fraunhofer IDMT, Germany; Masataka Goto, National Institute of Advanced Industrial Science and Technology (AIST), Japan

[18] CREPE: A CONVOLUTIONAL REPRESENTATION FOR PITCH ESTIMATION
Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello, New York University, United States

[19] UNIFYING LOCAL AND GLOBAL METHODS FOR HARMONIC-PERCUSSIVE SOURCE SEPARATION
Christian Dittmar, Patricio López-Serrano, Meinard Müller, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

[20] SAMPLE-LEVEL CNN ARCHITECTURES FOR MUSIC AUTO-TAGGING USING RAW WAVEFORMS
Taejun Kim, University of Seoul, Korea (South); Jongpil Lee, Juhan Nam, KAIST, Korea (South)

[21] A HYBRID NEURAL NETWORK BASED ON THE DUPLEX MODEL OF PITCH PERCEPTION FOR SINGING MELODY EXTRACTION
Hsin Chou, Ming-Tso Chen, Tai-Shih Chi, National Chiao Tung University, Taiwan

[22] POLYPHONIC MUSIC SEQUENCE TRANSDUCTION WITH METER-CONSTRAINED LSTM NETWORKS
Adrien Ycart, Emmanouil Benetos, Queen Mary University of London, United Kingdom

[23] COVER SONG IDENTIFICATION USING SONG-TO-SONG CROSS-SIMILARITY MATRIX WITH CONVOLUTIONAL NEURAL NETWORK
Juheon Lee, Sungkyun Chang, Sang Keun Choe, Kyogu Lee, Seoul National University, Korea (South)

[24] BSS EVAL OR PEASS? PREDICTING THE PERCEPTION OF SINGING-VOICE SEPARATION
Dominic Ward, Hagen Wierstorf, Russell D. Mason, Emad M. Grais, Mark D. Plumbley, University of Surrey, United Kingdom

[25] THE DIMENSIONS OF PERCEPTUAL QUALITY OF SOUND SOURCE SEPARATION
Estefania Cano, Judith Liebetrau, Fraunhofer IDMT, Germany; Derry Fitzgerald, Cork Institute of Technology, Ireland; Karlheinz Brandenburg, Fraunhofer IDMT, Germany

[26] CROWDSOURCED PAIRWISE-COMPARISON FOR SOURCE SEPARATION EVALUATION
Mark Cartwright, New York University, United States; Bryan Pardo, Northwestern University, United States; Gautham Mysore, Adobe Research, United States

[27] ORTHOGONALITY-REGULARIZED MASKED NMF FOR LEARNING ON WEAKLY LABELED AUDIO DATA
Iwona Sobieraj, Lucas Rencker, Mark D. Plumbley, University of Surrey, United Kingdom

[28] TOWARDS COMPLETE POLYPHONIC MUSIC TRANSCRIPTION: INTEGRATING MULTI-PITCH DETECTION AND RHYTHM QUANTIZATION
Eita Nakamura, Kyoto University, Japan; Emmanouil Benetos, Queen Mary University of London, United Kingdom; Kazuyoshi Yoshii, Kyoto University, Japan; Simon Dixon, Queen Mary University of London, United Kingdom

[29] EFFECTIVE COVER SONG IDENTIFICATION BASED ON SKIPPING BIGRAMS
Xiaoshuo Xu, Xiaoou Chen, Deshun Yang, Peking University, China

[30] FFTNET: A REAL-TIME SPEAKER-DEPENDENT NEURAL VOCODER
Zeyu Jin, Adam Finkelstein, Princeton University, United States; Gautham Mysore, Jingwan Lu, Adobe Research, United States

[31] DEEP LEARNING BASED SPEECH BEAMFORMING
Kaizhi Qian, University of Illinois at Urbana-Champaign, United States; Yang Zhang, Shiyu Chang, IBM T.J. Watson Research Center, United States; Xuesong Yang, University of Illinois at Urbana-Champaign, United States; Dinei Florencio, Microsoft Research, United States; Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign, United States

[32] A WAVENET FOR SPEECH DENOISING
Dario Rethage, Jordi Pons, Xavier Serra, Music Technology Group, Universitat Pompeu Fabra, Spain