Abstract. The focus of this work is to study how to efficiently tailor Convolutional Neural Networks (CNNs) towards learning timbre representations from log-mel magnitude spectrograms. We first review the trends in designing CNN architectures. Through this literature overview we discuss the crucial points to consider for efficiently learning timbre representations using CNNs. From this discussion we propose a design strategy meant to capture the relevant time-frequency contexts for learning timbre, which permits using domain knowledge for designing architectures. In addition, one of our main goals is to design efficient CNN architectures, which reduces the risk of these models over-fitting, since the number of parameters of the CNNs is minimized. Several architectures based on the design principles we propose are successfully assessed for different research tasks related to timbre: singing voice phoneme classification, musical instrument recognition and music auto-tagging.
The signal processing community is very into machine learning. Although I am not sure of the implications of this fact, this intersection has already produced very interesting results – such as Smaragdis et al.’s work. Lots of papers related to deep learning were presented. Although in many cases people were naively applying DNNs or LSTMs to a new problem, there was also (of course) amazing work with inspiring ideas – I highlight some:
- Koizumi et al. propose using reinforcement learning for source separation. This work introduces how to use reinforcement learning for audio signal processing.
- Ewert et al. propose using a variant of dropout that can be used to induce models to learn specific structures by using information from weak labels.
- Ting-Wei et al. propose doing frame-level predictions with a fully convolutional model that also uses Gaussian kernel filters (first introduced by them), trained with clip-level annotations in a weakly-supervised learning setup.
I was invited to give a talk at the Deep Learning for Speech and Language Winter Seminar @ UPC, Barcelona. Since UPC is the university where I did my undergraduate studies, it was a great pleasure to give an introductory talk about how our community is using deep learning for approaching music technology problems.
Overall, the talk was centered on reviewing the state-of-the-art (1988-2016) in deep learning for music data processing in order to spark some discussion about current trends. Several key papers were chronologically listed and briefly described: pioneer papers using MLPs, RNNs, LSTMs and CNNs for music data processing; and pioneer papers using symbolic data, spectrograms and waveforms – among others.
This journal article summarizes the most relevant results we found throughout my master's thesis research – namely, the results related to popular western music. However, in the thesis we also describe the first attempt at remixing orchestral music to improve the classical music experience of cochlear implant (CI) users. Although the results for orchestral music are not conclusive, they provide useful intuition for designing future experiments and might be valuable for researchers who are interested in that topic.
Abstract – Many researchers use convolutional neural networks with small rectangular filters for music (spectrogram) classification. First, we discuss why there is no reason to use this filter setup by default and second, we point out that more efficient architectures could be implemented if the characteristics of the music features are considered during the design process. Specifically, we propose a novel design strategy that might promote more expressive and intuitive deep learning architectures by efficiently exploiting the representational capacity of the first layer – using different filter shapes adapted to fit musical concepts within the first layer. The proposed architectures are assessed by measuring their accuracy in predicting the classes of the Ballroom dataset. We also make available the code used (together with the audio data) so that this research is fully reproducible.
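To give a rough feel for the idea of mixing filter shapes in the first layer, here is a minimal sketch in plain Python that only counts parameters and output sizes. Everything in it is an assumption made for illustration: the input size (40 frequency bins by 80 frames), the two filter shapes, the number of filters and the helper names (`conv_params`, `valid_output_shape`, `summarize`) are hypothetical and are not taken from the paper or its released code.

```python
def conv_params(filter_height, filter_width, in_channels, num_filters):
    """Weights plus one bias per filter for a 2D convolutional layer."""
    return (filter_height * filter_width * in_channels + 1) * num_filters


def valid_output_shape(input_shape, filter_shape):
    """Feature-map size for a stride-1 convolution without padding."""
    freq_bins, frames = input_shape
    filter_height, filter_width = filter_shape
    return (freq_bins - filter_height + 1, frames - filter_width + 1)


# Assumed spectrogram patch: 40 frequency bins x 80 time frames, 1 channel.
INPUT_SHAPE = (40, 80)

# Hypothetical first layer with two branches of differently shaped filters:
# wide filters spanning many frames (temporal cues) and
# tall filters spanning many bins (frequency cues).
first_layer = {
    "temporal": {"shape": (1, 60), "num_filters": 16},
    "frequency": {"shape": (32, 1), "num_filters": 16},
}


def summarize(layer, input_shape):
    """Return parameter count and output shape for every branch."""
    summary = {}
    for name, branch in layer.items():
        height, width = branch["shape"]
        summary[name] = {
            "params": conv_params(height, width, 1, branch["num_filters"]),
            "output_shape": valid_output_shape(input_shape, branch["shape"]),
        }
    return summary


summary = summarize(first_layer, INPUT_SHAPE)
total_params = sum(branch["params"] for branch in summary.values())

# For comparison: one bank of 32 filters covering the same 32 x 60 context.
single_large_bank = conv_params(32, 60, 1, 32)

for name, info in summary.items():
    print(name, info)
print("two-branch layer:", total_params, "parameters")
print("single 32x60 bank:", single_large_bank, "parameters")
```

Under these assumed numbers, the two-branch layer covers long temporal and wide frequency contexts with far fewer parameters than a single bank of large rectangular filters spanning the same context, which is the kind of trade-off the abstract alludes to.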