This year’s ISMIR was in Delft, the Netherlands. It seems like the community is starting to realise that the technologies it develops can have an impact on society – because they are starting to work! During the first days of the conference, many conversations focused on how to make that impact a positive one. On the technology side, we saw (i) many people using musical domain knowledge to disentangle/structure/learn useful neural representations for music applications, and (ii) many attention-based neural architectures.

This year I was at ISMIR to:
- present our tutorial on “Waveform-based music processing with deep learning”, with Jongpil Lee and Sander Dieleman.
- present musicnn, our open-source, deep-learning-based music tagger.
- represent Dolby Laboratories at the Industry Meetup.
To start, I would like to highlight that the proceedings come in a very nice format that makes them accessible to everyone. The organisers compiled the papers on a website, together with a short summary of each paper and a link to the code. I encourage everyone to take a look at the proceedings website, because it is a very nice way to navigate the scientific program of the conference.
[Disclaimer: the summaries attached below are copy-pasted from the proceedings]
This is my top-10 list of ISMIR papers, with twelve entries:
- Zero-shot Learning for Audio-based Music Classification and Tagging
- ISMIR summary: “Investigated the paradigm of zero-shot learning applied to music domain. Organized 2 side information setups for music classification task. Proposed a data split scheme and associated evaluation settings for the multi-label zero-shot learning.” (I sketch the core idea right after this list.)
- Deep Unsupervised Drum Transcription
- ISMIR summary: “DrummerNet is a drum transcriber trained in an unsupervised fashion. DrummerNet learns to transcribe by learning to reconstruct the audio with the transcription estimate. Unsupervised learning + a large dataset allow DrummerNet to be less-biased.” (A toy version of this trick is also sketched after the list.)
- Multi-Task Learning of Tempo and Beat: Learning One to Improve the Other
- ISMIR summary: “Multi-task learning helps to improve beat tracking accuracy if additional tempo information is used.”
- Adversarial Learning for Improved Onsets and Frames Music Transcription
- ISMIR summary: “Piano roll prediction in music transcription can be improved by appending an additional loss incurred by an adversarial discriminator.”
- Learning Complex Basis Functions for Invariant Representations of Audio
- ISMIR summary: “The “Complex Autoencoder” learns features invariant to transposition and time-shift of audio in CQT representation. The features are competitive in a repeated section discovery, and in an audio-to-score alignment task.”
- Best paper award!
- A Holistic Approach to Polyphonic Music Transcription with Neural Networks
- ISMIR summary: “A neural network architecture is trained in an end-to-end manner to transcribe music scores in humdrum format from polyphonic audio files.”
- Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders
- ISMIR summary: “We disentangle pitch and timbre of musical instrument sounds by learning separate interpretable latent spaces using Gaussian mixture variational autoencoders. The model is verified by controllable sound synthesis and many-to-many timbre transfer.”
- Learning a Joint Embedding Space of Monophonic and Mixed Music Signals for Singing Voice
- ISMIR summary: “The paper introduces a new method of obtaining a consistent singing voice representation from both monophonic and mixed music signals. Also, it presents a simple music mashup pipeline to create a large synthetic singer dataset.”
- Towards Interpretable Polyphonic Transcription with Invertible Neural Networks
- ISMIR summary: “Invertible Neural Networks enable direct interpretability of the latent space.”
- VirtuosoNet: A Hierarchical RNN-based System for Modeling Expressive Piano Performance
- ISMIR summary: “We present an RNN-based model that reads MusicXML and generates human-like performance MIDI. The model employs a hierarchical approach by using attention network and an independent measure-level estimation module. We share our code and dataset.”
- Approachable Music Composition with Machine Learning at Scale
- ISMIR summary: “We show behind the scenes how the Bach Doodle works, the design, how we sped up the machine learning model Coconet to run in the browser. We are also releasing a dataset of 21.6 million melody and harmonization pairs, along with user ratings.”
- SUPRA: Digitizing the Stanford University Piano Roll Archive
- ISMIR summary: “This paper describes the digitization process of SUPRA, an online database of historical piano roll recordings, which has resulted in an initial dataset of 478 performances of pianists from the early twentieth century transcribed to MIDI format.”
- Best paper award.
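Two of the papers above describe tricks that are worth making concrete. First, the zero-shot tagging recipe: map the audio embedding into a word-embedding space, and then rank any tag – seen or unseen – by similarity to its word vector. Here is a minimal sketch of that general idea; every name, dimension and vector below is a stand-in of mine, not the authors’ code.

```python
import numpy as np

# Minimal zero-shot tagging sketch: project an audio embedding into a
# word-embedding space and rank tags (seen or unseen) by cosine similarity.
# All names, dimensions and vectors are hypothetical stand-ins.

rng = np.random.default_rng(0)
AUDIO_DIM, WORD_DIM = 128, 300

# In the real setup this projection is learned; here it is a random stand-in.
W = rng.normal(size=(WORD_DIM, AUDIO_DIM))

# Word vectors for tags; imagine "sitar" never appeared in the training labels.
tag_vectors = {
    "rock": rng.normal(size=WORD_DIM),
    "jazz": rng.normal(size=WORD_DIM),
    "sitar": rng.normal(size=WORD_DIM),  # unseen tag, reachable via its word vector
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rank_tags(audio_embedding):
    z = W @ audio_embedding  # audio -> word-embedding space
    return sorted(tag_vectors, key=lambda t: -cosine(z, tag_vectors[t]))

print(rank_tags(rng.normal(size=AUDIO_DIM)))
```

At training time the projection would be fitted so that tracks land close to the word vectors of their tags; at test time nothing changes when new tags show up, which is the whole point of the zero-shot setup.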
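Second, DrummerNet’s analysis-by-synthesis trick: the transcriber never sees labels; its activations drive a fixed synthesizer, and reconstructing the input audio is the only training signal. Below is a toy version under my own assumptions (the real model is much deeper and uses a fancier reconstruction loss):

```python
import torch
import torch.nn.functional as F

# Toy analysis-by-synthesis loss in the spirit of DrummerNet: the network
# predicts per-drum activations, a *fixed* synthesizer convolves them with
# drum one-shots, and reconstruction error is the only training signal.
# The architecture and all shapes below are made-up stand-ins.

N_DRUMS, KIT_LEN, AUDIO_LEN = 3, 256, 4096

# Stand-in "transcriber"; the paper uses a much deeper network.
transcriber = torch.nn.Conv1d(1, N_DRUMS, kernel_size=1024, padding=512)
drum_oneshots = torch.randn(1, N_DRUMS, KIT_LEN)  # fixed templates, not learned

audio = torch.randn(8, 1, AUDIO_LEN)           # unlabeled drum recordings
activations = torch.relu(transcriber(audio))   # the transcription estimate
recon = F.conv1d(activations, drum_oneshots, padding=KIT_LEN // 2)

loss = F.l1_loss(recon[..., :AUDIO_LEN], audio)  # unsupervised reconstruction loss
loss.backward()
```

Because the synthesizer is fixed, the only way for the network to reconstruct the audio well is to place activations where the actual drum hits are – which is precisely a transcription.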
It seems like the ISMIR community has wholeheartedly embraced attention-based models:
- Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval
- ISMIR summary: “The amount of temporal context given to a CNN is adapted by an additional soft-attention network, enabling the network to react to local and global tempo deviations in the input audio spectrogram.”
- Harmony Transformer: Incorporating Chord Segmentation into Harmony Recognition
- ISMIR summary: “Incorporating chord segmentation into chord recognition using the Transformer model achieves improved performance over prior art.”
- Best paper award.
- A Bi-Directional Transformer for Musical Chord Recognition
- ISMIR summary: “We propose a bi-directional Transformer model based on the self-attention mechanism for chord recognition. Through an attention map analysis, we visualize how attention was performed and conclude that the model can effectively capture long-term dependency.”
- An Attention Mechanism for Musical Instrument Recognition
- ISMIR summary: “Instrument recognition in multi-instrument recordings is formulated as a multi-instance multi-label classification problem. We train a model on the weakly labeled OpenMIC dataset using an attention mechanism to aggregate predictions over time.” (See the attention-pooling sketch after this list.)
- LakhNES: Improving Multi-instrumental Music Generation with Cross-domain Pre-training
- ISMIR summary: “We use transfer learning to improve multi-instrumental music generation by first pre-training a Transformer on a large heterogeneous music dataset (Lakh MIDI) and subsequently fine tuning it on a domain of interest (NES-MDB).”
- Lakh MIDI + NES-MDB = LakhNES
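As promised above, here is what the attention aggregation in the instrument-recognition paper boils down to: a learned weighted average of per-frame predictions, which is handy when you only have clip-level (weak) labels. A minimal sketch with hypothetical layer sizes, not the authors’ exact model:

```python
import torch

# Minimal attention pooling for weakly-labeled (clip-level) instrument tagging:
# per-frame predictions are aggregated with learned attention weights over time.
# Layer sizes and names are hypothetical, not the paper's exact model.

FRAMES, FEATS, N_INSTRUMENTS = 100, 128, 20

frame_features = torch.randn(4, FRAMES, FEATS)       # a batch of 4 clips
classifier = torch.nn.Linear(FEATS, N_INSTRUMENTS)   # per-frame instrument logits
attention = torch.nn.Linear(FEATS, N_INSTRUMENTS)    # per-frame attention logits

frame_probs = torch.sigmoid(classifier(frame_features))    # (4, FRAMES, N_INSTRUMENTS)
weights = torch.softmax(attention(frame_features), dim=1)  # normalised over time
clip_probs = (weights * frame_probs).sum(dim=1)            # learned weighted average
print(clip_probs.shape)  # torch.Size([4, 20])
```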
Interestingly, cover-song identification is a hot topic again:
- Cover Detection Using Dominant Melody Embeddings
- ISMIR summary: “We propose a cover detection method based on vector embedding extraction out of audio dominant melody. This architecture improves state-of-the-art accuracy on large datasets, and scales to query collections of thousands of tracks in a few seconds.” (The query-time retrieval step is sketched after this list.)
- Da-TACOS: A Dataset for Cover Song Identification and Understanding
- ISMIR summary: “This work aims to understand the links among cover songs with computational approaches and to improve reproducibility of Cover Song Identification task by providing a benchmark dataset and frameworks for comparative algorithm evaluation.”
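What both cover-song papers share is that every track ends up as an embedding vector, so query-time identification reduces to nearest-neighbour search. The generic retrieval step looks roughly like this (random stand-in vectors; the embedding models are where the papers differ):

```python
import numpy as np

# Once every track is an embedding, query-time cover detection reduces to
# nearest-neighbour search. This is only the generic retrieval step; the
# interesting part (the embedding model) is what the papers are about.

rng = np.random.default_rng(0)
DIM = 256

collection = rng.normal(size=(10_000, DIM))                      # reference embeddings
collection /= np.linalg.norm(collection, axis=1, keepdims=True)  # L2-normalise once

def query_covers(query_embedding, top_k=5):
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = collection @ q              # cosine similarity via dot products
    return np.argsort(-scores)[:top_k]   # indices of the most likely covers

print(query_covers(rng.normal(size=DIM)))
```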
Finally, some interesting works on music source separation that do not assume a model per source:
- Conditioned-U-Net: Introducing a Control Mechanism in the U-Net for Multiple Source Separations
- ISMIR summary: “In this paper, we apply conditioning learning to source separation and introduce a control mechanism to the standard U-Net architecture. The control mechanism allows multiple instrument separations with just one model without losing performance.” (A sketch of this control mechanism follows below.)
- Audio Query-based Music Source Separation
- ISMIR summary: “An audio-query based source separation method that is capable of separating the music source regardless of the number and/or kind of target signals. Various useful scenarios are suggested, such as zero-shot separation, latent interpolation, etc.”
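To close, a sketch of the control mechanism mentioned in the Conditioned-U-Net entry. As far as I understand, it is FiLM-style conditioning: the source you want to separate is encoded as a one-hot vector and mapped to per-channel scales and shifts that modulate the U-Net’s feature maps. A minimal sketch under my own shape assumptions, not the authors’ implementation:

```python
import torch

# FiLM-style conditioning, the kind of control mechanism the Conditioned-U-Net
# adds to a standard U-Net: a one-hot source selector is mapped to per-channel
# scales (gamma) and shifts (beta) that modulate intermediate feature maps.
# All shapes below are made up for illustration.

N_SOURCES, CHANNELS = 4, 64

film = torch.nn.Linear(N_SOURCES, 2 * CHANNELS)  # condition -> (gamma, beta)

def apply_film(features, condition_onehot):
    gamma, beta = film(condition_onehot).chunk(2, dim=-1)
    # Broadcast over the (freq, time) dims of a (batch, channels, freq, time) map.
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

features = torch.randn(2, CHANNELS, 128, 128)  # an intermediate U-Net feature map
condition = torch.nn.functional.one_hot(torch.tensor([0, 2]), N_SOURCES).float()
print(apply_film(features, condition).shape)  # torch.Size([2, 64, 128, 128])
```

The appeal of this design is that one set of U-Net weights serves all sources: switching the one-hot condition switches which instrument the network separates.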