ISMIR 2017 highlights

This was my first ISMIR ever, and I am thrilled to be part of this amazing community. It was fun to put faces (and height, and weight) to the names I respect so much!

All the awarded papers were amazing, and these are definitely on my list of highlights:
  • Choi et al. – every time I re-read this paper I am more impressed by the effort they put into assessing the generalization capabilities of deep learning models. This work sets a high evaluation standard for those working on deep auto-tagging models!
  • Bittner et al. propose a fully-convolutional model for tracking f0 contours in polyphonic music. The article has a brilliant introduction drawing parallels between their proposed fully-convolutional architecture and previous traditional models – making clear that it is worth building bridges between deep learning work and the earlier signal processing literature.
  • Oramas et al. – deep learning makes it easy to combine information from many sources, such as audio, text, or images. They do so by combining representations extracted from audio spectrograms, word embeddings, and ImageNet-based features. Moreover, they released a new dataset, MuMu, with 147,295 songs belonging to 31,471 albums.
  • Jansson et al.'s work proposes a U-net model for singing voice separation. Adding skip connections between encoder and decoder layers at the same hierarchical level seems to be a good idea for reconstructing masked audio signals, since several papers have already reported good results with this setup (see the sketch right after this list).
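
To make the skip-connection idea concrete, here is a minimal Keras sketch of an encoder-decoder that predicts a soft mask for the input spectrogram – my own toy illustration, not Jansson et al.'s actual architecture; all shapes and filter counts are arbitrary assumptions.

```python
# A toy encoder-decoder for spectrogram masking with U-net-style skip connections.
# NOT Jansson et al.'s actual architecture: depth, filter counts and the
# (128, 128, 1) input size are arbitrary assumptions for illustration.
from tensorflow.keras import layers, Model

def tiny_unet(input_shape=(128, 128, 1)):
    spec = layers.Input(shape=input_shape)  # magnitude spectrogram excerpt

    # Encoder: strided convolutions halve the time-frequency resolution.
    e1 = layers.Conv2D(16, 5, strides=2, padding="same", activation="relu")(spec)
    e2 = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(e1)
    b = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(e2)  # bottleneck

    # Decoder: each upsampled level is concatenated with the encoder activation
    # at the same hierarchical level (the "skip" connections discussed above).
    d2 = layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu")(b)
    d2 = layers.Concatenate()([d2, e2])
    d1 = layers.Conv2DTranspose(16, 5, strides=2, padding="same", activation="relu")(d2)
    d1 = layers.Concatenate()([d1, e1])

    # Output a soft mask in [0, 1] that is applied to the input spectrogram.
    mask = layers.Conv2DTranspose(1, 5, strides=2, padding="same", activation="sigmoid")(d1)
    voice = layers.Multiply()([mask, spec])
    return Model(spec, voice)

model = tiny_unet()
model.compile(optimizer="adam", loss="mae")  # L1 loss on the separated spectrogram
```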

But there were many other inspiring papers:

  • McFee & Bello's work addresses the problem of large-vocabulary chord transcription by exploiting structural relationships between chord classes. I am still intrigued by their single 5×5 filter in the first layer, which is introduced as a harmonic saliency enhancer. I am eager to experiment with this idea!
  • Miron et al. propose a score-informed model for classical music source separation that is based on a deep convolutional auto-encoder. Interestingly, their model can be linked to Bittner et al.'s work (because a multi-channel input representation is used) and to Jansson et al.'s architecture (because a deep convolutional auto-encoder is also used for source separation).
  • Chen et al. further elaborate on the idea of using musically motivated architectures for music-audio classification – specifically, they confirm that using many filters in the first layer generally yields better results. In addition, they incorporate an LSTM layer on top of the CNN feature extractor to capture the long-term dependencies that are so important in music signals (a minimal sketch of this setup follows this list).
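
For reference, here is a minimal Keras sketch of such a CNN + LSTM setup – my own illustration, not Chen et al.'s exact model; the input size, filter counts and number of tags are arbitrary assumptions.

```python
# A toy CNN + LSTM tagger: a wide first convolutional layer, a second conv layer,
# and an LSTM that summarizes the resulting feature sequence over time.
# NOT Chen et al.'s exact model: input size, filter counts and the number of
# output tags are arbitrary assumptions for illustration.
from tensorflow.keras import layers, Model

n_mels, n_frames, n_tags = 96, 128, 50

spec = layers.Input(shape=(n_mels, n_frames, 1))  # log-mel spectrogram excerpt

# Wide first layer: many filters to capture diverse time-frequency patterns.
x = layers.Conv2D(128, (5, 5), padding="same", activation="relu")(spec)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(128, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)

# Fold the frequency axis into the channel axis so that time becomes a sequence.
x = layers.Permute((2, 1, 3))(x)  # (time, freq, channels)
x = layers.Reshape((n_frames // 4, (n_mels // 4) * 128))(x)

# LSTM on top of the CNN features to model long-term temporal dependencies.
x = layers.LSTM(128)(x)
tags = layers.Dense(n_tags, activation="sigmoid")(x)

model = Model(spec, tags)
model.compile(optimizer="adam", loss="binary_crossentropy")  # multi-label tagging
```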

I also want to highlight the amazing work researchers in automatic drum transcription are doing to improve on this task. As far as I know, they have been collaborating to release a new dataset and a review article – their collaborative attitude is inspiring! During the late-breaking session they presented the MDB Drums dataset for automatic drum transcription, and during the conference they presented their respective papers:
  • Vogl et al. presented an approach based on convolutional recurrent neural networks.
  • Southall et al. presented a method based on soft attention mechanisms and convolutional neural networks.
  • Wu et al. proposed to leverage unlabeled music data with a student-teacher learning approach (sketched below).
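
For readers unfamiliar with the student-teacher idea, here is a minimal Keras sketch – not Wu et al.'s exact recipe; the data, feature sizes and drum classes are placeholders.

```python
# A toy student-teacher setup, NOT Wu et al.'s exact recipe: a teacher trained on
# a small labelled set produces soft drum-activation targets for unlabelled data,
# and a student is then trained on those soft targets. All data, feature sizes
# and the three drum classes (e.g. kick/snare/hi-hat) are placeholders.
import numpy as np
from tensorflow.keras import layers, Model

def small_net(n_in=128, n_out=3):
    inp = layers.Input(shape=(n_in,))
    h = layers.Dense(64, activation="relu")(inp)
    out = layers.Dense(n_out, activation="sigmoid")(h)
    return Model(inp, out)

teacher, student = small_net(), small_net()
teacher.compile(optimizer="adam", loss="binary_crossentropy")
student.compile(optimizer="adam", loss="binary_crossentropy")

# 1) Train the teacher on the (small) labelled set.
X_lab = np.random.rand(100, 128).astype("float32")           # placeholder features
y_lab = np.random.randint(0, 2, (100, 3)).astype("float32")  # placeholder annotations
teacher.fit(X_lab, y_lab, epochs=5, verbose=0)

# 2) Let the teacher annotate a large unlabelled set with soft targets.
X_unlab = np.random.rand(1000, 128).astype("float32")
y_soft = teacher.predict(X_unlab, verbose=0)

# 3) Train the student on the teacher's soft targets (optionally mixed with labelled data).
student.fit(X_unlab, y_soft, epochs=5, verbose=0)
```
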
Besides the previously mentioned MuMu and MDB Drums datasets, two more datasets were presented:

It is important to note that the audio content of these two datasets is distributed under Creative Commons licenses, which facilitates data sharing and reproducible research.

There is much ongoing work in automatic symbolic music composition. It is interesting to see how these works are moving from RNN-based to CNN-based models. For example, see MidiNet (a deep convolutional GAN that can be conditioned on a melody or chords) or COCONET (a deep convolutional model trained to reconstruct partial scores – inspired by inpainting models in computer vision).
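
To make the inpainting analogy concrete, here is a minimal Keras sketch of such a masked-reconstruction setup – my own toy illustration, not the actual COCONET model; shapes and data are placeholders.

```python
# A toy inpainting-style training setup, NOT the actual COCONET model: random parts
# of a binary piano-roll are masked out and a small convolutional net is trained
# to reconstruct the missing notes. Shapes and the random data are placeholders.
import numpy as np
from tensorflow.keras import layers, Model

n_pitches, n_steps = 88, 64
roll_in = layers.Input(shape=(n_pitches, n_steps, 2))  # channel 0: masked roll, channel 1: mask

x = layers.Conv2D(32, 3, padding="same", activation="relu")(roll_in)
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
out = layers.Conv2D(1, 1, activation="sigmoid")(x)      # note probability for each cell

model = Model(roll_in, out)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Build one placeholder batch: hide ~50% of the cells and ask the net to fill them in.
rolls = (np.random.rand(16, n_pitches, n_steps, 1) > 0.9).astype("float32")  # fake scores
mask = (np.random.rand(16, n_pitches, n_steps, 1) > 0.5).astype("float32")   # 1 = observed
model.fit(np.concatenate([rolls * mask, mask], axis=-1), rolls, epochs=1, verbose=0)
```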

I also want to highlight a work presented in the late-breaking session that surprised me a lot: a musaicing method based on NMF2D – an extension of NMF where the time-frequency kernels can convolve both in time and in frequency. Interestingly, this model has been out there since 2006, but none of us was aware of it!

To conclude: I’m sure I am missing relevant papers from the late-breaking/demo session, since I was busy discussing how several end-to-end learning approaches for music audio tagging behave when lots of training data (1M songs) is available! 🙂

Warning! This post is biased towards my interests (deep audio tech). Feel free to suggest any addition to this list; I will be happy to update it with interesting papers I missed!