The signal processing community is very into machine learning. Although I am not sure of the implications of this fact, this intersection has already produced very interesting results – such as Smaragdis et al.’s work. Lots of papers related to deep learning were presented. Although in many cases people were naively applying DNNs or LSTMs to a new problem, there was also (of course) amazing work with inspiring ideas – I highlight some:
- Koizumi et al. propose using reinforcement learning for source enhancement – a work that introduces reinforcement learning to audio signal processing.
- Ewert et al. propose a dropout variant that uses information from weak labels to induce models to learn specific structures (see the first sketch after this list).
- Ting-Wei et al. propose making frame-level predictions with a fully convolutional model that uses Gaussian kernel filters (which they introduce), trained with only clip-level annotations in a weakly-supervised setup (see the second sketch after this list).
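To give a flavour of the structured dropout idea, here is a minimal sketch – my own assumptions for illustration, not the authors’ exact formulation – where whole groups of hidden units are dropped together and a weak label (e.g. which pitches appear in a score) can force inactive groups to zero:

```python
# Minimal sketch of "structured" dropout: drop whole groups of units instead of
# individual ones, and let a weak label force groups known to be inactive to zero.
# Group sizes, the masking rule and the missing 1/(1-p) rescaling are simplifications.
import torch

def structured_dropout(h, group_size, weak_label=None, p=0.5, training=True):
    """h: (batch, groups * group_size) activations.
    weak_label: optional (batch, groups) binary mask of groups allowed to be active."""
    b, d = h.shape
    groups = d // group_size
    h = h.view(b, groups, group_size)
    if training:
        keep = (torch.rand(b, groups, 1, device=h.device) > p).float()
    else:
        keep = torch.ones(b, groups, 1, device=h.device)
    if weak_label is not None:
        keep = keep * weak_label.view(b, groups, 1)  # inactive groups are always dropped
    return (h * keep).view(b, d)

h = torch.randn(8, 64)                       # 16 groups of 4 hidden units
weak = torch.randint(0, 2, (8, 16)).float()  # e.g. which pitches appear in the score
out = structured_dropout(h, group_size=4, weak_label=weak)
```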
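And here is a minimal sketch of the weakly-supervised recipe of the third bullet point – again under my own assumptions (made-up shapes, a fixed Gaussian filter instead of learned event-specific ones, max-pooling over time), not the authors’ exact model – where a fully convolutional network produces frame-level predictions but is trained with clip-level labels only:

```python
# Sketch: a fully convolutional net outputs frame-level class scores, which are
# smoothed with a Gaussian-shaped temporal filter and pooled to clip level, so
# that only clip-level labels are needed for training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeaklySupervisedTagger(nn.Module):
    def __init__(self, n_mels=64, n_classes=10, filter_len=9, sigma=2.0):
        super().__init__()
        # Fully convolutional front-end over (batch, mel, time) spectrograms.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, n_classes, kernel_size=1),  # frame-level logits
        )
        # Fixed Gaussian temporal filter (the paper learns event-specific ones).
        t = torch.arange(filter_len) - filter_len // 2
        g = torch.exp(-t.float() ** 2 / (2 * sigma ** 2))
        self.register_buffer("gauss", (g / g.sum()).view(1, 1, -1))
        self.filter_len = filter_len

    def forward(self, spec):
        frame_logits = self.conv(spec)            # (batch, classes, time)
        frame_probs = torch.sigmoid(frame_logits)
        # Smooth the frame-level activations with the Gaussian filter, per class.
        smoothed = F.conv1d(
            frame_probs.reshape(-1, 1, frame_probs.size(-1)),
            self.gauss, padding=self.filter_len // 2,
        ).reshape_as(frame_probs)
        clip_probs = smoothed.max(dim=-1).values  # (batch, classes) clip-level
        return frame_probs, clip_probs

model = WeaklySupervisedTagger()
spec = torch.randn(4, 64, 200)                     # batch of 4 log-mel spectrograms
clip_labels = torch.randint(0, 2, (4, 10)).float() # clip-level annotations only
_, clip_probs = model(spec)
loss = F.binary_cross_entropy(clip_probs, clip_labels)
loss.backward()
```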
It was also funny to see how IBM (keynote) and Microsoft (Xiong et al.) openly compete for the best-performing automatic speech recognition system. However, the take-away message from their presentations is: “Fuse the predictions of different deep architectures. You can achieve super-human performance if you have enough data and you keep the list of architectures long”. I wonder if they could approach it differently: what about directly using waveforms? Or embedding domain knowledge into the architectures so that these are speech-specific?
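As a toy illustration of that fusion recipe (real systems fuse far richer hypotheses, e.g. lattices; this only sketches simple posterior averaging, with made-up model names and weights):

```python
# Late fusion of class posteriors from several acoustic models by weighted averaging.
import numpy as np

def fuse_posteriors(posteriors, weights=None):
    """Weighted average of class posteriors from different models.

    posteriors: list of arrays, each of shape (frames, classes)
    weights:    optional per-model weights (e.g. tuned on a dev set)
    """
    stacked = np.stack(posteriors)                  # (models, frames, classes)
    if weights is None:
        weights = np.ones(len(posteriors)) / len(posteriors)
    fused = np.tensordot(weights, stacked, axes=1)  # (frames, classes)
    return fused.argmax(axis=-1)                    # fused frame-level decisions

# e.g. posteriors from a CNN, a BLSTM and a ResNet acoustic model (random here)
cnn, blstm, resnet = (np.random.dirichlet(np.ones(40), size=100) for _ in range(3))
decisions = fuse_posteriors([cnn, blstm, resnet], weights=[0.4, 0.4, 0.2])
```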
AudioSet, which aims to be audio’s ImageNet, was released during the conference. Now it is time to dig into it, and time will tell how this dataset affects our community. Researchers from the same lab (Hershey et al.) presented a CNN model based on computer-vision architectures for sound classification. They announced that they will: (i) release the trained model, and (ii) upload features extracted with this model for the whole of AudioSet.
I also noticed that approaches guiding the learning process of deep models with well-established signal processing ideas – such as ours (Pons et al.) or Ewert et al.’s – were well received by the community. ICASSP folks will build solid bridges between signal processing wisdom and machine learning.
Finally, I want to highlight the missing ones: very (very!) few papers used modern generative models – WaveNet (none) or GANs (only Chang et al. and Kaneko et al.) – and no conclusive work was presented using waveforms as the front-end.
The location was so cool: New Orleans rocks! Passionate research discussions happened on Frenchmen Street – surrounded by jazz.
Warning! This post is biased towards my interests (deep audio tech). ICASSP is so huge that there is no way one can attend all presentations. Feel free to suggest any addition to this list – I will be happy to update it with interesting papers I missed!
References:
Smaragdis et al. – A neural network alternative to non-negative audio models.
Koizumi et al. – DNN-based source enhancement self-optimized by reinforcement learning using sound quality measurements.
Ewert et al. – Structured dropout for weak label and multi-instance learning and its application to score-informed source separation.
Ting-Wei et al. – Weakly-supervised audio event detection using event-specific Gaussian filters and fully convolutional networks.
Xiong et al. – The Microsoft 2016 conversational speech recognition system.
Hershey et al. – CNN architectures for large-scale audio classification.
Pons et al. – Designing efficient architectures for modeling temporal features with convolutional neural networks.
Chang et al. – Learning representations of emotional speech with deep convolutional generative adversarial networks.
Kaneko et al. – Generative adversarial network-based postfilter for statistical parametric speech synthesis.