Abstract – Many researchers use convolutional neural networks with small rectangular filters for music (spectrogram) classification. First, we discuss why there is no reason to use this filter setup by default; second, we point out that more efficient architectures could be implemented if the characteristics of the music features are considered during the design process. Specifically, we propose a novel design strategy that might promote more expressive and intuitive deep learning architectures by efficiently exploiting the representational capacity of the first layer, using different filter shapes adapted to fit musical concepts. The proposed architectures are assessed by measuring their accuracy in predicting the classes of the Ballroom dataset. We also make the code available (together with the audio data) so that this research is fully reproducible.
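To make the idea of first-layer filter shapes more concrete, here is a minimal sketch in plain Python. It compares how much of a spectrogram a single filter covers and how many weights it needs. Everything in it is an assumption for illustration: the spectrogram size (40 frequency bins by 80 time frames), the three filter shapes, and the helper names `FilterShape`, `conv_output_shape` and `weights_per_filter` are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterShape:
    """A 2D first-layer filter over a spectrogram (hypothetical helper)."""
    name: str
    freq_bins: int    # extent along the frequency axis
    time_frames: int  # extent along the time axis


def conv_output_shape(input_shape, filt):
    """Output size of a stride-1 convolution without padding."""
    n_freq, n_time = input_shape
    if filt.freq_bins > n_freq or filt.time_frames > n_time:
        raise ValueError("filter is larger than the input spectrogram")
    return (n_freq - filt.freq_bins + 1, n_time - filt.time_frames + 1)


def weights_per_filter(filt):
    """Number of learnable weights in one filter (bias not counted)."""
    return filt.freq_bins * filt.time_frames


# Assumed input: 40 frequency bins x 80 time frames (illustrative only).
SPECTROGRAM_SHAPE = (40, 80)

# Illustrative shapes, not the ones evaluated in this work.
FILTERS = [
    FilterShape("small rectangular", 3, 3),  # common default setup
    FilterShape("temporal", 1, 60),          # spans time only
    FilterShape("frequency", 32, 1),         # spans frequency only
]

if __name__ == "__main__":
    for filt in FILTERS:
        out = conv_output_shape(SPECTROGRAM_SHAPE, filt)
        print(f"{filt.name}: {filt.freq_bins}x{filt.time_frames} filter, "
              f"{weights_per_filter(filt)} weights, output map {out}")
```

Under these assumptions, a 1x60 temporal filter covers a long time context with 60 weights, and a 32x1 frequency filter covers most of the spectrum with 32 weights. A 3x3 filter covers only a small time-frequency patch in the first layer, so reaching a comparable context requires stacking more layers.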
Given that several relevant researchers in our field were in Barcelona to serve on the jury of Ajay’s and Sankalp’s PhD thesis defenses, the MTG hosted a very interesting seminar. Among other topics, the potential impact of deep learning in our field was discussed, and almost everyone agreed that end-to-end learning approaches do not seem to be successful because no large-scale annotated music collections are available for research benchmarking. Indeed, most successful deep learning approaches use deep models as mere feature extractors or as hierarchical classifiers built on top of hand-crafted features.