This was my second ISMIR, and I am super excited to be part of this amazing, diverse, and very inclusive community. It was fun to keep putting faces (and height, and weight) to the names I respect so much! This ISMIR has been very special for me, because I was returning to the city where I kicked off my academic career (5 years ago I was starting a research internship @ IRCAM!), and we won the best student paper award!
All awarded papers were amazing (of course! 🙂 ):
- Learning to Listen, Read, and Follow: Score Following as a Reinforcement Learning Game by Dorfer et al. They address the score following task with deep reinforcement learning. It is important to note that this approach learns directly from sheet music images (pixels) and spectrograms! It’s nice to see novel ideas that work so well. See their video demo and code.
- End-to-end Learning for Music Audio Tagging at Scale by Pons et al. We show that when enough training data is available: waveform models (sampleCNN) > spectrogram models (musically motivated CNN). But spectrogram models > waveform models when no sizable data are available! Finally: our musically motivated CNNs achieve state-of-the-art results for the MTT & MSD datasets, see our demo and code.
- Bridging Audio Analysis, Perception and Synthesis with Perceptually-regularized Variational Timbre Spaces by Esling et al. They propose to regularize variational auto-encoders with the timbre spaces defined in earlier music perception studies. It is very interesting to see people building bridges between earlier work on music perception and current deep learning techniques. See their demo and code!
But there were many other inspiring papers:
- A Single-step Approach to Musical Tempo Estimation Using a Convolutional Neural Network by Schreiber & Müller. They took the multi-filter CNN modules we proposed for learning temporal cues from spectrograms to the next level. They also describe interesting parallels between their CNN for (global & local) tempo estimation and earlier traditional approaches. Very interesting!
- Onsets and Frames: Dual-Objective Piano Transcription by Hawthorne et al., who propose to jointly estimate in which frames a note is active and when an onset occurs. The power of this model lies in a very simple (post-processing) trick: only accepting the “frame-notes where an onset is active”, which removes the annoying spurious notes that many transcription systems estimate.
- Zero-Mean Convolutions for Level-Invariant Singing Voice Detection by Schlüter & Lehner. They found that their singing voice detection model was using the energy of the signal as a proxy to detect singing voice. Hence, the model was tricking the audience like the famous horse “Clever Hans”. In order to solve that issue, they propose to use CNN filters constrained to be zero-mean in the first layer.
- Representation Learning of Music Using Artist Labels by Park et al. Their goal is to learn (in a supervised fashion) transferable music representations from labels that are easy to collect. They found a solution via learning from artist labels: either explicitly (directly predicting the artist), or implicitly (with metric learning using Siamese networks).
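Two of the tricks above are simple enough to sketch in a few lines of plain Python. This is only a toy illustration under my own assumptions, not the authors' code: all function names are made up for this post, the onset gating works on binary decisions for a single pitch and reads “only accepting the frame-notes where an onset is active” as “a note may only start on a frame where the onset detector fires”, and the zero-mean filter is a 1-D stand-in (applied to log-magnitudes) for the 2-D first-layer CNN filters used by Schlüter & Lehner.

```python
import math


def gate_frames_with_onsets(frames, onsets):
    """Keep frame activations for one pitch only if the note started on an onset.

    frames, onsets: equal-length lists of 0/1 decisions, one per time frame.
    """
    gated = []
    note_on = False
    for frame, onset in zip(frames, onsets):
        if not frame:
            note_on = False      # the frame detector says the note has ended
        elif onset:
            note_on = True       # a note may only start where an onset fires
        gated.append(1 if note_on else 0)
    return gated


def zero_mean(kernel):
    """Subtract the mean so that the filter coefficients sum to zero."""
    mean = sum(kernel) / len(kernel)
    return [k - mean for k in kernel]


def correlate_valid(signal, kernel):
    """Plain 1-D cross-correlation without padding."""
    n = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(n))
        for i in range(len(signal) - n + 1)
    ]


# Frame activity that never coincides with an onset is dropped.
frames = [1, 1, 0, 1, 1, 1, 0, 1]
onsets = [0, 0, 0, 1, 0, 0, 0, 0]
print(gate_frames_with_onsets(frames, onsets))

# A 10x louder input only adds a constant in the log domain,
# and a filter whose coefficients sum to zero ignores constants.
magnitudes = [1.0, 2.0, 4.0, 3.0, 1.5]
quiet = [math.log(m) for m in magnitudes]
loud = [math.log(10.0 * m) for m in magnitudes]
kernel = zero_mean([0.2, 0.5, -0.1])
print(correlate_valid(quiet, kernel))
print(correlate_valid(loud, kernel))
```

The last two printed lists should match up to floating-point error, which is the whole point: with zero-mean filters on a log-magnitude input, the first layer cannot use the overall level of the signal as a shortcut.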
Source separation has historically been considered the Holy Grail among music technologists, and it is now being revisited from the deep learning perspective. According to the papers presented at ISMIR, U-net architectures seem to perform very well. For example, Park et al. presented a paper based on a U-net-like structure: Music Source Separation Using Stacked Hourglass Networks. And Stoller et al. introduced the Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation, a model capable of doing source separation directly in the waveform domain. I personally think that Wave-U-Net opens up a very interesting research direction (also explored by others) that would, in the long run, allow us to get rid of the (so annoying) Wiener filtering step that most spectrogram-based source separation algorithms rely on.
Finally, I want to highlight the work of Lattner and his collaborators, who made a tremendous effort (3 papers!) to disseminate the results of their Predictive Model for Music based on Learned Interval Representations. In short, they propose to use a recurrent gated auto-encoder, which is a recurrent neural network that operates on interval representations of musical sequences (these representations are learned in an unsupervised fashion by a gated auto-encoder). Their main goal is to learn transposition-invariant features, and they show that these can work for audio & MIDI signals (in Learning Transposition-invariant Interval Features from Symbolic Music and Audio), and for audio-to-score alignment (in Audio-to-Score Alignment using Transposition-invariant Features).
Unfortunately, for the sake of brevity, I left out many interesting papers. Feel free to complete this list by leaving a comment below! Otherwise, here is a link to the whole scientific program.
But.. no comments about the new format?!
Oh, yes! I could not skip writing some words related to this year’s ISMIR format: 4′ talk + poster session for everyone.
As a presenter: I really enjoyed this year’s format, and I would love to repeat it.
- It’s easier to prepare a 4′ talk than a 15′-20′ talk.
- It’s easier (and less frustrating) to give an overview of your work, rather than introducing stupid details that are only relevant to the authors of the paper and a few researchers in the audience.
- It’s nice to present the main take-aways of your work to the broad audience attending your talk..
- ..while also being able to discuss the details with (and get feedback from!) those brilliant minds who really care about your work and will come to your poster.
+ bonus track: it’s great that the talks were recorded. This makes ISMIR more accessible, which could eventually increase the impact of our work.
As an attendee: Redundancy is good! I liked the format, but we can possibly improve it.
- It’s easier to follow a short talk than a longer one.
- However, 18 short talks in a row is too much. Maybe introducing a short break during the orals could make the session easier to digest?
- Having a first oral round helps introduce the papers, and facilitates the discussion during the poster session.
- In case you miss the orals (you shouldn’t do that!), you won’t miss anything important from the conference.
- Recording the orals is useful! You can double check presentations for clarification, even during the conference!
- Streaming the orals makes a difference! You can listen to the orals from your hotel room before going to the posters, so cool.
Finally, this new format was also useful for giving more visibility to some works that were rarely selected for oral sessions, like datasets. Here is a (non-comprehensive) list of the new datasets that were presented at ISMIR:
- Audio-Aligned Jazz Harmony Dataset for Automatic Chord Transcription and Corpus-based Research by Eremenko et al.
- OpenMIC-2018: An open data-set for multiple instrument recognition by Humphrey et al.
- The NES Music Database: A multi-instrumental dataset with expressive performance attributes by Donahue et al.
- DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm by Meseguer-Brocal et al.
- A Crowdsourced Experiment for Tempo Estimation of Electronic Dance Music by Schreiber & Müller.