Abstract. The focus of this work is to study how to efficiently tailor Convolutional Neural Networks (CNNs) towards learning timbre representations from log-mel magnitude spectrograms. We first review the trends in designing CNN architectures. Through this literature overview we discuss the crucial points to consider for efficiently learning timbre representations with CNNs. From this discussion we propose a design strategy meant to capture the relevant time-frequency contexts for learning timbre, which permits using domain knowledge when designing architectures. In addition, one of our main goals is to design efficient CNN architectures, which reduces the risk of over-fitting since the number of parameters in these models is minimized. Several architectures based on the proposed design principles are successfully assessed for different research tasks related to timbre: singing voice phoneme classification, musical instrument recognition, and music auto-tagging.
Abstract – Many researchers use convolutional neural networks with small rectangular filters for music (spectrogram) classification. First, we discuss why there is no reason to use this filter setup by default; second, we point out that more efficient architectures can be designed if the characteristics of the musical features are considered during the design process. Specifically, we propose a novel design strategy that may promote more expressive and intuitive deep learning architectures by efficiently exploiting the representational capacity of the first layer: using different filter shapes adapted to fit musical concepts within the first layer. The proposed architectures are assessed by measuring their accuracy in predicting the classes of the Ballroom dataset. We also make the code (together with the audio data) available so that this research is fully reproducible.
These (preliminary) results indicate that CNN designs for music informatics research (MIR) can be further optimized by considering the characteristics of the music audio data. By designing musically motivated CNNs, a much more interpretable and efficient model can be obtained. These results are the logical culmination of the previously presented discussion, which we recommend reading first.
We aim to study how deep learning techniques can learn generalizable musical concepts. To do so, we discuss which musical concepts can be fitted under the constraint of a specific CNN filter shape.
Several architectures can be combined to construct deep learning models: feed-forward neural networks, RNNs, or CNNs. However, since the goal of our work is to understand which (musical) features deep learning models are learning, CNNs seemed an intuitive choice, given that it is common to feed CNNs with spectrograms. Spectrograms have a meaning in time and in frequency and therefore, the resulting CNN filters will have interpretable dimensions (at least in the first layer): time and frequency. This basic observation motivates the following discussion.
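To make this observation concrete, the following is a minimal sketch (assuming PyTorch; all layer names, filter counts, and kernel sizes are illustrative, not the architectures evaluated here) of a first convolutional layer whose filter shapes are chosen to match the interpretable axes of a spectrogram: filters wide in time can span temporal or rhythmic contexts, while filters tall in frequency can span spectral envelopes relevant to timbre.

```python
import torch
import torch.nn as nn

class MusicallyMotivatedFirstLayer(nn.Module):
    """Hypothetical first layer with two musically motivated filter shapes.

    The input spectrogram has interpretable axes (frequency x time), so the
    kernel sizes below are readable as time-frequency contexts: (1, 60) is
    wide in time (temporal cues), (32, 1) is tall in frequency (timbral cues).
    """
    def __init__(self):
        super().__init__()
        # input shape: (batch, 1, n_mel_bands, n_frames)
        self.temporal = nn.Conv2d(1, 8, kernel_size=(1, 60))  # wide in time
        self.timbral = nn.Conv2d(1, 8, kernel_size=(32, 1))   # tall in frequency

    def forward(self, x):
        # Both branches see the same spectrogram; their outputs could later
        # be pooled and concatenated into a single feature representation.
        return self.temporal(x), self.timbral(x)

layer = MusicallyMotivatedFirstLayer()
spec = torch.randn(1, 1, 96, 188)  # e.g. 96 mel bands x 188 frames
t_out, f_out = layer(spec)
print(t_out.shape)  # torch.Size([1, 8, 96, 129])
print(f_out.shape)  # torch.Size([1, 8, 65, 188])
```

Note how each output dimension remains interpretable: the temporal branch preserves all 96 frequency bands while summarizing 60-frame contexts, and the timbral branch preserves all 188 frames while summarizing 32-band contexts.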