- Choi et al. – every time I re-read this paper I am more impressed by the effort they put into assessing the generalization capabilities of deep learning models. This work sets a high evaluation standard for those working on deep auto-tagging models!
- Bittner et al. propose a fully-convolutional model for tracking f0 contours in polyphonic music. The article has a brilliant introduction drawing parallels between the proposed fully-convolutional architecture and previous traditional models – making clear that it is worth building bridges between deep learning work and the earlier signal processing literature.
- Oramas et al. – deep learning makes it easy to combine information from many sources, such as audio, text, or images. They do so by combining representations extracted from audio spectrograms, word embeddings, and ImageNet-based features (a minimal fusion sketch follows this list). Moreover, they released a new dataset: MuMu, with 147,295 songs belonging to 31,471 albums.
- Jansson et al.'s work proposes a U-net model for singing voice separation. Adding skip connections between encoder and decoder layers at the same hierarchical level seems to be a good idea for reconstructing masked audio signals, since several papers have already reported good results with this setup (see the toy model after this list).
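Since Oramas et al.'s multimodal setup is easy to picture in code, here is a minimal late-fusion sketch in PyTorch: precomputed audio, text, and image embeddings are concatenated and fed to a small classifier head. All dimensions, the two-layer head, and the class count are placeholders of mine, not the paper's actual architecture.

```python
# Minimal late-fusion sketch: concatenate per-modality embeddings and classify.
# Every size below is a placeholder, not taken from Oramas et al.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, audio_dim=256, text_dim=300, image_dim=2048, n_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + text_dim + image_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, audio_emb, text_emb, image_emb):
        # Fuse modalities by simple concatenation along the feature axis.
        fused = torch.cat([audio_emb, text_emb, image_emb], dim=1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(8, 256), torch.randn(8, 300), torch.randn(8, 2048))
```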
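And here is a toy encoder-decoder illustrating the U-net idea in Jansson et al.'s paper: activations from each encoder level are concatenated with the decoder level of matching resolution, and the network predicts a soft mask over the input spectrogram. It sketches the pattern only; depths, channel counts, and layer choices are my own simplifications, not their exact architecture.

```python
# Toy U-net-style masker: one skip connection between matching encoder and
# decoder levels, predicting a soft mask over the input spectrogram.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # Decoder input: upsampled enc2 (32 ch) concatenated with enc1 (16 ch).
        self.dec1 = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU())
        self.mask = nn.Sequential(nn.Conv2d(16, 1, 1), nn.Sigmoid())

    def forward(self, x):                    # x: (batch, 1, freq, time)
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return x * self.mask(d1)             # masked spectrogram estimate

net = TinyUNet()
vocals_est = net(torch.rand(4, 1, 64, 128))  # dummy magnitude spectrograms
```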
But there were many other inspiring papers:
- McFee & Bello's work addresses the problem of large-vocabulary chord transcription by exploiting structural relationships between chord classes. I am still intrigued by the single 5×5 filter in their first layer, which is introduced as a harmonic saliency enhancer – I am eager to experiment with this idea (a small sketch of it follows this list)!
- Miron et al. propose a score-informed model for classical music source separation that is based on a deep convolutional auto-encoder. Interestingly, their model can be linked to Bittner et al.'s work (because a multi-channel input representation is used) and to Jansson et al.'s architecture (because a deep convolutional auto-encoder is also used for source separation).
- Chen et al. further elaborate on the idea of using musically motivated architectures for music-audio classification – specifically, they confirm that using many filters in the first layer generally yields better results. In addition, they incorporate an LSTM layer on top of the CNN feature extractor to capture the long-term dependencies that are so important in music signals (a minimal CRNN sketch follows this list).
- Vogl et al. presented a drum transcription approach based on convolutional recurrent neural networks.
- Southall et al. presented a drum transcription method based on soft attention mechanisms and convolutional neural networks.
- Wu et al. proposed to leverage unlabeled music data for drum transcription with a student-teacher learning approach (a bare-bones sketch of this paradigm follows this list).
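To make McFee & Bello's 5×5 filter idea concrete, here is how such a first layer could look. This is only my reading of the idea: a single learned 5×5 convolution applied to the input time-frequency representation before the rest of the network; the input shape is a dummy of my own.

```python
# A single learned 5x5 convolution as a first layer, sketching the
# "harmonic saliency enhancer" idea; input dimensions are dummies.
import torch
import torch.nn as nn

saliency = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5, padding=2)
tf_input = torch.rand(1, 1, 192, 216)  # (batch, channel, freq, time)
enhanced = saliency(tf_input)          # same shape, saliency-filtered
```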
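The CNN-plus-LSTM pattern that Chen et al. build on can also be sketched in a few lines. This is the generic convolutional-recurrent recipe, not their exact model; all layer sizes are illustrative.

```python
# Minimal CRNN: a conv front-end extracts local time-frequency features,
# which are flattened per frame and passed to an LSTM that models
# longer-term temporal structure. All sizes are illustrative.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, n_mels=96, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # pool frequency, keep time resolution
        )
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 2), hidden_size=64,
                            batch_first=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, x):                       # x: (batch, 1, n_mels, time)
        h = self.cnn(x)                         # (batch, 32, n_mels//2, time)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)  # frames as a sequence
        h, _ = self.lstm(h)
        return self.out(h[:, -1])               # classify from the last frame

model = TinyCRNN()
logits = model(torch.rand(2, 1, 96, 128))
```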
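Finally, a bare-bones sketch of the student-teacher paradigm behind Wu et al.'s proposal, as I understand the general recipe (not their exact setup): a pretrained teacher produces soft pseudo-labels for unlabeled audio, and the student is trained to match them.

```python
# Student-teacher on unlabeled data: the teacher pseudo-labels, the student
# fits those soft targets. Both models here are trivial stand-ins.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(128, 10))   # stand-in pretrained model
student = nn.Sequential(nn.Linear(128, 10))   # model being trained
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

unlabeled = torch.rand(256, 128)              # dummy unlabeled features
with torch.no_grad():
    soft_targets = teacher(unlabeled).softmax(dim=1)  # teacher pseudo-labels

for _ in range(10):                           # train student on soft targets
    log_probs = student(unlabeled).log_softmax(dim=1)
    loss = nn.functional.kl_div(log_probs, soft_targets, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
```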
Two new open datasets were also presented:
- FMA dataset – with 106,574 music audio tracks arranged in a hierarchical taxonomy of 161 genres.
- Freesound Datasets – a platform for the creation of open audio datasets, with an initial dataset of 23,519 sounds organized according to the AudioSet ontology.
It is important to note that the audio content of these two datasets is distributed under Creative Commons licenses – which facilitates data sharing and reproducible research.
Warning! This post is biased towards my interests (deep audio tech). Feel free to suggest any addition to this list; I will be happy to update it with interesting papers I missed!