CNN filter shapes discussion for music spectrograms

We aim to study how deep learning techniques can learn generalizable musical concepts. To do so, we discuss which musical concepts can be fitted under the constraint of a specific CNN filter shape.

Several architectures can be combined to construct deep learning models: feed-forward neural networks, RNNs or CNNs. However, since the goal of our work is to understand which (musical) features deep learning models are learning, CNNs seemed an intuitive choice, given that it is common to feed CNNs with spectrograms. Spectrograms have a meaning in time and in frequency and, therefore, the resulting CNN filters will have interpretable dimensions (at least in the first layer): time and frequency. This basic observation motivates the following discussion.


Figure 1. Discussed filter shapes. From left to right: squared/rectangular filter, temporal filter and frequency filter.

1. CNN filter shapes discussion

Due to the success of CNNs in computer vision, the literature of that field has significantly influenced the music informatics research (MIR) community. In the image processing literature, small squared CNN filters (e.g. 3×3 or 7×7) are common. As a result, MIR researchers tend to use similar filter shapes. However, note that filter dimensions in image processing have a spatial meaning, while the filter dimensions for audio spectrograms correspond to time and frequency. Therefore, wider filters may be capable of learning longer temporal dependencies in the audio domain, while taller filters may be capable of learning timbral features that are more spread out in frequency.

To encourage researchers to be conscious of the potential impact of choosing one filter shape or another, three examples and a use case are discussed below. Throughout this post we assume the spectrogram dimensions to be M-by-N, the filter dimensions to be m-by-n and the feature map dimensions to be M’-by-N’. M, m and M’ stand for the number of frequency bins, and N, n and N’ for the number of time frames:

  • Squared/rectangular filters (m-by-n filters) are capable of learning time and frequency features at the same time. This kind of filter is one of the most used in the music technology literature. Such filters can learn different musical aspects depending on how m and n are set.
    For example, a bass or a kick could be well modeled with a small filter (m << M and n << N, representing a sub-band for a short time) because these instruments are sufficiently characterized by the lower bands of the spectrum, and the temporal evolution of a bass note or a kick is not very long. An interesting interpretation of such small filters is that they can be considered pitch invariant to some extent: the convolution happens along both time and frequency, and convolving along frequency amounts to shifting the filter in pitch. However, such pitch invariance would not hold for instruments with a large pitch range, since the timbre of an instrument changes according to its pitch.
    Depending on the input spectrogram representation (e.g. CQT, MEL or STFT), CNNs might be capable of learning more robust pitch invariant features. CQT is especially suited for achieving pitch invariant features, since the relative positions of the harmonics remain constant regardless of the f0, which makes the timbre signature vary less across the possible pitches of a harmonic instrument. This contrasts with the timbre representation achieved with STFT, which is f0 dependent. CQT can be thought of as a mapping of the STFT through a series of logarithmically spaced averages, spaced in a similar way as octaves are distributed in frequency. This log-based transform achieves constant inter-harmonic spacings, which might help CNNs learn pitch invariant representations. Finally, note that MEL spectrograms might permit learning features that are more pitch invariant than with STFT, because MEL spectrograms are based on a log-based perceptual scale of pitches. However, in theory, MEL spectrograms are not as good as CQT in this respect because they are not grounded in the same motivation: they are designed to map human music perception.
    As another example, cymbals or snare drums (which are broad in frequency with a fixed decay time) could be suitably modeled by setting m = M and n << N. Note that a bass or a kick could also be modeled with this filter. However: (i) the pitch invariance interpretation will not hold, because the filter dimensions (m = M) do not allow the filter to convolve along frequency and, therefore, pitch will be encoded together with timbre (meaning that, in order to characterize the timbre for the whole pitch range of an instrument, a filter per note is needed), which leads to a less efficient representation; and (ii) most of the weights would be set to zero, wasting part of the representational power of the CNN filter, because most of the relevant information is concentrated in the lower bands of the spectrum.
    As a final example, squared/rectangular filters might be capable of modeling music motives as well. A music motive is a succession of (close) notes that occur synchronized with a characteristic rhythmic pattern. Therefore, music motives fit under the constraint of being band-limited information (m < M) that lasts a fixed period of time (n < N).
  • Temporal filters (1-by-n): with the frequency dimension m set to 1, such filters are not capable of learning frequency features but specialize in modeling the temporal dependencies, learned from the training data, that are relevant for the task. Note that, even though the filters themselves do not learn frequency features, upper layers may be capable of exploiting the frequency relations present in the resulting feature map: the frequency interpretation of the M’ dimension of the feature map still holds because the convolution is done bin-wise (m=1). From the musical perspective, one expects these temporal filters to learn relevant rhythmic/tempo patterns within the analyzed bin.
  • Frequency filters (m-by-1): with the time dimension n set to 1, such filters are not capable of learning temporal features but specialize in modeling the frequency features, learned from the training data, that are relevant for the task. As with the temporal filters, upper layers can still find some temporal dependencies in the resulting feature map, since the temporal interpretation of the N’ dimension still holds because the convolution is done frame-wise (n=1). From the musical perspective, one expects these frequency filters to learn timbre or equalization setups, for example. Moreover, note the resemblance of the frequency filters to the NMF bases that are widely (and successfully) used in MIR. As a final remark, note that the pitch invariance discussion introduced for the m-by-n filters also applies to frequency filters.
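
The relation between filter shape and feature map dimensions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: stride-1 convolution without padding (so that M’ = M - m + 1 and N’ = N - n + 1), random values instead of a real spectrogram, and hypothetical sizes (M = 40, N = 100) and helper names that do not come from any particular library.

```python
import random


def conv2d_valid(spec, filt):
    """Stride-1, no-padding 2-D cross-correlation (the operation a CNN layer applies)."""
    M, N = len(spec), len(spec[0])   # frequency bins, time frames
    m, n = len(filt), len(filt[0])   # filter height (frequency), width (time)
    return [
        [
            sum(spec[i + a][j + b] * filt[a][b] for a in range(m) for b in range(n))
            for j in range(N - n + 1)
        ]
        for i in range(M - m + 1)
    ]


def random_matrix(rows, cols, rng):
    """Stand-in for a spectrogram or a filter: uniform random values."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
M, N = 40, 100  # hypothetical spectrogram: 40 frequency bins, 100 time frames
spec = random_matrix(M, N, rng)

# Hypothetical filter sizes, one per case discussed in the list above.
filter_shapes = {
    "squared (3-by-3)": (3, 3),
    "full-band (M-by-5)": (M, 5),
    "temporal (1-by-20)": (1, 20),
    "frequency (30-by-1)": (30, 1),
}

feature_map_shapes = {}
for name, (m, n) in filter_shapes.items():
    fmap = conv2d_valid(spec, random_matrix(m, n, rng))
    feature_map_shapes[name] = (len(fmap), len(fmap[0]))  # (M', N')
    print(f"{name}: feature map is {len(fmap)}-by-{len(fmap[0])}")
```

Under these assumptions, the full-band filter (m = M) yields M’ = 1, which is why it cannot shift along frequency, whereas the temporal filter keeps all M frequency bins and the frequency filter keeps all N time frames.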

To conclude this section, we discuss the results posted by Keunwoo Choi as a case study. They use a 5-layer CNN of squared 3-by-3 filters for genre classification. After auralising and visualizing the network filters, they provide an interpretation of the learned CNN filters in every layer:

  • Layer 1: onsets.
  • Layer 2: onsets, bass, harmonics, melody.
  • Layer 3: onsets, melody, kick, percussion.
  • Layer 4: harmonic structures, notes, vertical lines, long horizontal lines.
  • Layer 5: textures, harmo-rhythmic patterns structures.

Note that Keunwoo Choi's observations agree with the discussion presented above. As a result of using small squared 3-by-3 filters, the lower layers of the deep CNN learn musical concepts that fit under the constraint of being represented in a sub-band for a short time. Moreover, note that deeper layers in the network learn horizontal and vertical lines, which suggests that temporal and frequency filters are plausibly useful in CNNs for MIR.

As observed in this example, the model needed deep representations (stacked CNN layers) to be able to represent large time-frequency contexts, since it is difficult for the first layers to cover long time dependencies or wide frequency signatures with such small squared filters. This highlights the potential of temporal and frequency filters: by using these filters in the first layer(s), the depth of the network can be devoted to learning other features rather than vertical and horizontal lines.
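
To make the argument about depth concrete, the time-frequency context seen by a stack of layers can be estimated with the usual receptive field recursion. The sketch below is an assumption-laden illustration, not a description of the network in the case study: the helper name is hypothetical, the first stack assumes stride-1 3-by-3 convolutions with no pooling, and the second stack adds hypothetical 2-by-2 pooling layers only to show how pooling changes the picture.

```python
def receptive_field(layers):
    """Receptive field along one axis for a stack of conv/pool layers.

    `layers` lists (kernel_size, stride) pairs, starting from the input side.
    """
    field, jump = 1, 1  # context seen so far, and input step per output step
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Assumption: five stride-1 3-by-3 convolutions and nothing else.
five_convs = [(3, 1)] * 5
print(receptive_field(five_convs))  # 11 bins or frames per axis

# Hypothetical variant: each convolution followed by 2-by-2 pooling (stride 2).
five_convs_pooled = [(3, 1), (2, 2)] * 5
print(receptive_field(five_convs_pooled))  # 94

# A single temporal filter of width 11 covers the same 11 frames in one layer.
print(receptive_field([(11, 1)]))  # 11
```

Under the no-pooling assumption, five stacked 3-by-3 layers are needed before a unit sees 11 consecutive frames, while a single 1-by-11 temporal filter sees them in the first layer.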

To conclude this text, we want to remark that these interpretations do not hold only for music: similar reasoning could be applied to speech or to any audio-related deep learning task.

The next post proposes and assesses some musically motivated architectures that build on the discussion presented here.

Scientific publication for reference: