Towards adapting CNNs for music spectrograms: first attempt

These preliminary results suggest that CNN designs for music informatics research (MIR) can be further optimized by considering the characteristics of music audio data. By designing musically motivated CNNs, a much more interpretable and efficient model can be obtained. These results are the logical culmination of the previously presented discussion, which we recommend reading first.


Figure 1. Time-frequency architecture.

Throughout this post we assume the spectrogram dimensions to be M-by-N, the filter dimensions to be m-by-n and the feature map dimensions to be M’-by-N’, with M, m and M’ denoting numbers of frequency bins and N, n and N’ numbers of time frames.

1. Musically motivated CNNs

Two architectures are introduced: Black-box and Time-Frequency.

  • Black-box architecture is based on previous work using CNNs for music classification. In that setup – which obtained the best results for the MIREX 2015 music/speech classification task – a single convolutional layer with 15 12-by-8 filters and a 2-by-1 max-pool layer was used, followed by a feed-forward layer of 200 units connected to an output softmax layer. We adapted this approach to improve accuracy, slightly changing the setup to 32 12-by-8 filters and a 4-by-1 max-pool layer (Figure 2). After considerable effort, an accuracy ceiling around 87% was observed, and none of the tested architectural modifications improved accuracy. Therefore, we take the Black-box results to be the best we can get with deep learning for this dataset.

Architectures similar to Black-box are common in the literature. We call it black-box because there is no musically motivated reason for such architectural choices (i.e., filter shape or max-pool shape).

The architectural choices that led to the Black-box architecture were based on intuition and brute-force optimization of the parameters – a.k.a. trial and error. It is therefore still unclear how to navigate the space of network parameters, and it is hard to discover the adequate combination of parameters for a particular task, which leads to architectures that are difficult to interpret. Given this, we aim to rationalize the design process by proposing musically motivated architectures. Based on the previously presented discussion and inspired by the idea of using musically motivated CNNs, we propose the Time-Frequency architecture:

  • Time-Frequency architecture is composed of the Time and Frequency architectures detailed below. By means of late fusion, the Time and Frequency architectures join forces so that the model can learn complementary features (time and frequency cues) from the data:
    • Time architecture (upper branch in Figure 1): is specifically designed to learn temporal cues. It consists of a convolutional layer of 32 temporal filters (1-by-60, as defined here) followed by a 40-by-1 max-pool layer connected to an output layer. Note that the frequency interpretation of the M’ dimension of the subsequent feature map still holds because the convolution operation is done bin-wise (m=1). The max-pool layer spans the entire frequency axis of the feature map (M = M’ = 40) while covering only one feature map frame (N’=1): the summarization across frequencies propagates only temporal content, and the single-frame span preserves the feature map's frame resolution. In that way, the Time architecture learns only temporal cues.
    • Frequency architecture (lower branch in Figure 1): is designed to learn frequency features. It consists of a convolutional layer of 32 frequency filters (32-by-1, as defined here) followed by a 1-by-80 max-pool layer connected to an output layer. The max-pool layer (with a feature map of N = N’ = 80) operates as in the Time architecture, but in this case the summarization is done in time. Other researchers have motivated this architectural choice by arguing in favor of the time invariance of this operation. Note that the extreme case of a Frequency architecture would be to input a single frame to the network; however, we expect the statistics provided by the max-pool layer to help the network better solve the task at hand.

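The pooling semantics described above can be sketched with a toy numpy example. The feature map values below are random placeholders (no actual convolution is performed); only the shapes and the pooling axes follow the text:

```python
import numpy as np

# Toy feature maps with the shapes implied by a 40x80 mel spectrogram input
# (layout: channels, M' frequency bins, N' frames):
# Time branch: 32 temporal filters (1-by-60) -> feature map of 40 x 21
time_fmap = np.random.rand(32, 40, 21)
# Frequency branch: 32 frequency filters (32-by-1) -> feature map of 9 x 80
freq_fmap = np.random.rand(32, 9, 80)

# 40-by-1 max-pool: summarize across ALL frequency bins, keep every frame,
# so only temporal content is propagated.
time_pooled = time_fmap.max(axis=1)   # shape (32, 21)

# 1-by-80 max-pool: summarize across ALL frames (time invariance),
# keeping the frequency resolution.
freq_pooled = freq_fmap.max(axis=2)   # shape (32, 9)

# Late fusion: concatenate both summaries before the shared layers.
fused = np.concatenate([time_pooled.ravel(), freq_pooled.ravel()])
print(fused.shape)  # (960,) = 32*21 + 32*9
```

Note how each branch keeps exactly one axis at full resolution: the Time branch keeps the 21 frames, the Frequency branch keeps the 9 frequency positions.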
The Time filter shapes were optimized to achieve the best possible accuracy considering only the Time architecture (upper branch in Figure 1) – setting the output layer to be a softmax directly, and removing the feed-forward layer and the Frequency architecture. The Frequency filter shapes were optimized similarly. Therefore, the Time and Frequency filter shapes were optimized separately.

Figure 2. Black-box architecture.

It is important to remark that, because the architectural choices were musically motivated, the features learnt by the model are now easier to interpret.

2. Experiment setup

The goal of the following experiments is to assess whether musically motivated architectures can achieve competitive results compared to black-box deep learning methods.

Experiments use the Ballroom dataset, which consists of 698 tracks of approximately 30 seconds each, divided into 8 music genres: cha-cha-cha, jive, quickstep, rumba, samba, tango, viennese-waltz and slow-waltz. Two main shortcomings are regularly raised against this dataset: (i) its small size and (ii) the fact that its classes are highly correlated with tempo. Precisely these shortcomings motivate our study. Deep learning algorithms rely on the assumption that large amounts of training data are available to fit the large number of parameters of a network, and this assumption does not hold for most MIR datasets. We want to study whether it is feasible to constrain the solution space by means of musically motivated architectural choices – so that a smaller number of parameters has to be trained from a small dataset. The Ballroom dataset provides an excellent opportunity for studying this, due to its reduced size and because its classes are highly correlated with tempo. We want to take advantage of this musically relevant prior knowledge to propose a musically motivated architecture capable of encoding relevant temporal cues (such as tempo). The Time architecture is designed precisely for learning such temporal cues from data, and one therefore expects it to perform well. On the other hand, one would expect the Frequency architecture to perform much worse, since temporal cues cannot be encoded with this architecture.

Also note that the Black-box architecture has an advantage with respect to the Time and Frequency architectures, because only the former can learn time and frequency cues at the same time. Therefore, only the Black-box and Time-Frequency architectures are directly comparable, because both models can benefit from: exploiting time and frequency cues at the convolutional layer, a late fusion with a feed-forward layer (200 units and 50% dropout) and an output softmax layer.

[go to section 3 if you are not a motivated researcher]

The audio is fed to the network as fixed-length mel spectrogram samples, N=80 frames wide. Throughout this work we use 40-band mel spectrograms derived from an STFT spectrogram computed with a Blackman-Harris window of 2048 samples (50% overlap) at 44.1 kHz. Phases are discarded. A dynamic range compression is applied to the input spectrograms element-wise in the form of log(1+C·x), where C=10,000 is a constant controlling the amount of compression. The resulting spectrograms are normalized so that the dataset spectrograms (as a whole) have zero mean and unit variance. Note that this normalization is not attribute-wise, as that would perturb the relative information encoded between spectrogram bins/frames.
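The compression and normalization steps can be sketched in a few lines of numpy (the function name is ours; the mel spectrogram computation itself is assumed done beforehand, e.g. with a standard audio library):

```python
import numpy as np

C = 10_000  # compression constant from the text


def compress_and_normalize(spectrograms):
    """Apply log(1 + C*x) element-wise, then normalize using the
    mean/std of the WHOLE dataset (not attribute-wise, which would
    perturb the relative information between bins/frames)."""
    logspec = np.log1p(C * spectrograms)
    return (logspec - logspec.mean()) / logspec.std()


# Toy batch of 5 mel spectrogram patches, 40 bands x 80 frames each.
x = np.random.rand(5, 40, 80)
y = compress_and_normalize(x)
print(y.mean(), y.std())  # ~0.0 and ~1.0 over the whole batch
```

A single scalar mean/std pair is computed over the full array, so relative differences between bins and frames survive the normalization.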

The activation functions of the hidden layers are rectified linear units (ReLUs), with a final 8-way softmax where each output unit corresponds to a Ballroom class. 50% dropout is applied to the feed-forward layers. The output unit with the highest activation is selected as the model’s class prediction. Each network is trained using minibatch gradient descent with minibatches of 10 samples, minimizing the categorical cross-entropy between predictions and targets. Training starts from random initialization with an initial learning rate of 0.01. A learning schedule is programmed: the learning rate is halved every time the training loss gets stuck, until there is no more improvement. The best model on the validation set is kept for testing.
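The "halve the learning rate when the training loss gets stuck" schedule can be sketched as follows; the `patience` and `min_delta` thresholds are our own illustrative assumptions, not values stated in the text:

```python
class PlateauHalver:
    """Halve the learning rate when training loss stops improving.

    patience/min_delta are hypothetical values for illustration."""

    def __init__(self, lr=0.01, patience=3, min_delta=1e-4):
        self.lr = lr
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, train_loss):
        if train_loss < self.best - self.min_delta:
            # Loss improved: remember it and reset the counter.
            self.best = train_loss
            self.wait = 0
        else:
            # No improvement: halve the rate after `patience` stuck epochs.
            self.wait += 1
            if self.wait >= self.patience:
                self.lr /= 2.0
                self.wait = 0
        return self.lr


sched = PlateauHalver(lr=0.01)
for loss in [1.0, 0.9, 0.9, 0.9, 0.9]:  # improves twice, then plateaus
    lr = sched.step(loss)
print(lr)  # 0.005: halved once after three stuck epochs
```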

Accuracies are computed using 10-fold cross-validation with a randomly generated train-validation-test split of 80%-10%-10%. Since the input spectrograms are shorter than the full song spectrogram, several estimations can be made for each song. A simple majority vote decides the estimated class.
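The song-level aggregation amounts to a one-liner over the per-patch predictions (function name ours):

```python
from collections import Counter

def majority_vote(patch_predictions):
    """Aggregate per-patch class predictions into one song-level label."""
    return Counter(patch_predictions).most_common(1)[0][0]

# A 30 s track yields several 80-frame patches, hence several predictions:
print(majority_vote(["jive", "jive", "samba", "jive"]))  # jive
```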

3. Discussion: efficient deep learning for music signals

Results comparing Black-box (brute-force optimized architecture) with Time-Frequency, Time and Frequency (musically motivated architectures) are presented in Table 1.

| architecture   | #params   | accuracy: mean ± std (10-fold validation) |
|----------------|-----------|-------------------------------------------|
| Black-box      | 3,275,312 | 87.25 ± 3.39 %                            |
| Time-Frequency | 196,816   | 86.54 ± 4.29 %                            |
| Time           | 7,336     | 81.79 ± 4.72 %                            |
| Frequency      | 3,368     | 59.59 ± 5.82 %                            |

Table 1.

The best accuracy ever reported for the Ballroom dataset was achieved by Marchand and Peeters (2014): 93.12%. Therefore, our deep learning results are inferior to the state of the art, 6 percentage points below. Two tentative explanations for that result: either (i) not enough data is provided to the learning algorithm to solve the task or, simply, (ii) deep learning methods are not the best technique for predicting the Ballroom classes.

The first remarkable result comes into sight when comparing the accuracy of the Black-box architecture with that of the Time-Frequency architecture. Interestingly, the musically motivated architecture achieves results equivalent to those of the Black-box architecture. This result is especially relevant because it means that musically motivated architectures can be used instead of black-box architectures, which would allow going beyond the general black-box paradigm in deep learning. Because the architectural choices in Time-Frequency were musically motivated, the learnt features are now more understandable and easier to interpret.

A second relevant result concerns the number of parameters required for both models (Black-box and Time-Frequency) to achieve an equivalent result. The number of parameters of a network can be seen as a proxy for the learning capacity of an architecture. One can observe that the Time-Frequency architecture, with approximately 16× less capacity to learn than the Black-box architecture, was still capable of achieving equivalent performance. This denotes that the musically motivated architecture (Time-Frequency) provides a more efficient framework in which to learn music audio features – with less risk of overfitting, due to the reduced number of parameters to be learnt from small data.
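The parameter counts in Table 1 can be reproduced by hand from the layer shapes given in the text, assuming 'valid' convolutions on the 40x80 input and one bias per filter/unit:

```python
def conv_params(n_filters, m, n):
    # one m-by-n weight matrix plus one bias per filter
    return n_filters * (m * n + 1)

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

# Black-box: 32 filters of 12x8 on 40x80 -> 29x73 map, 4x1 max-pool -> 7x73,
# then a 200-unit feed-forward layer and an 8-way softmax.
bb_flat = 32 * (29 // 4) * 73
black_box = conv_params(32, 12, 8) + dense_params(bb_flat, 200) + dense_params(200, 8)

# Time branch: 32 filters of 1x60 -> 40x21 map, 40x1 max-pool -> 32*21 features.
time_conv, time_feats = conv_params(32, 1, 60), 32 * 21
# Frequency branch: 32 filters of 32x1 -> 9x80 map, 1x80 max-pool -> 32*9 features.
freq_conv, freq_feats = conv_params(32, 32, 1), 32 * 9

time_frequency = (time_conv + freq_conv
                  + dense_params(time_feats + freq_feats, 200)
                  + dense_params(200, 8))
time_only = time_conv + dense_params(time_feats, 8)   # softmax directly
freq_only = freq_conv + dense_params(freq_feats, 8)   # softmax directly

print(black_box, time_frequency, time_only, freq_only)
# -> 3275312 196816 7336 3368, matching Table 1
```

Almost all of the Black-box budget sits in the dense layer after flattening (16,352 × 200 weights), which is exactly what the musically motivated pooling shrinks.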

Third, note that the results achieved by the Frequency architecture are far from random (random accuracy, predicting the most likely class: 15.9%) – denoting that frequency cues are also relevant for predicting the Ballroom classes. Also note that the frequency filter shape (m = 32 < 40 = M) allows the filter to convolve in frequency, which can be interpreted as a pitch shift according to the previously presented discussion.

A fourth interesting result is the remarkable 81.79% accuracy achieved by the Time architecture, a very cheap CNN model of only 7,336 parameters. Even though it is far from state-of-the-art accuracies, the fact that such a cheap model reached such a high accuracy denotes that the proposed architecture is a very efficient design for representing temporal features. Finally, we also want to discuss the filter shapes of the Time architecture from a more musical perspective: can these filters learn tempo? Note that tempo can be encoded in temporal filters by means of tempo patterns, where periodic energy peaks (representing beats) can characterize a specific tempo in a filter. Currently, the temporal filters are set to be 1-by-60 (the length of the filter is 60 frames, representing approx. 1.4 sec). However, note that it is challenging, even for a human, to discriminate tempos with such short audio excerpts – of less than 2 seconds. Apart from that, the tempo of the Ballroom dataset songs ranges from 60 to 224 BPM. Therefore, with the filter length set to 60 frames, only 1 beat can be accommodated by the filter for the slowest tempo in the dataset (60 BPM), and the filter can learn up to about 5 beats for the fastest tempo in the dataset (224 BPM) – which denotes that this architecture might have difficulties learning slow tempo patterns. Despite these difficulties, it is clear that temporal filters are capable of learning relevant temporal features from data. Therefore, and given that the Ballroom classes are highly correlated with tempo, it might be worth taking a closer look at the weights learnt by the temporal filters to investigate whether they are learning tempo patterns – since the temporal filters have the capacity to learn (at least) fast tempo patterns.
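The beat-count arithmetic above follows directly from the STFT parameters (2048-sample window at 50% overlap implies a hop of 1024 samples):

```python
SR = 44100          # sample rate (Hz)
HOP = 1024          # 2048-sample Blackman-Harris window, 50% overlap
FILTER_FRAMES = 60  # length of the 1-by-60 temporal filters

filter_seconds = FILTER_FRAMES * HOP / SR        # ~1.39 s of audio per filter
beats_in_filter = lambda bpm: bpm / 60.0 * filter_seconds

print(round(filter_seconds, 2))       # 1.39
print(round(beats_in_filter(60), 1))  # 1.4 -> barely one full beat period
print(round(beats_in_filter(224), 1)) # 5.2 -> about 5 beats fit the filter
```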

4. References

  • Ugo Marchand & Geoffroy Peeters (2014). “The Modulation Scale Spectrum and its Application to Rhythm-Content Description”. In DAFx (pp. 167-172).

5. Additional material

Scientific publication for reference:

Reproduce our results, use our code: