We aim to study how deep learning techniques can learn generalizable musical concepts. For doing so, we discuss which musical concepts can be fitted under the constraint of an specific CNN filter shape.
Several architectures can be combined to construct deep learning models: feed-forward neural networks, RNNs or CNNs. However, since the goal of our work is to understand which (musical) features deep learning models are learning, CNNs seemed an intuitive choice regarding that it is common to feed CNNs with spectrograms. Spectrograms have a meaning in time and in frequency and therefore, the resulting CNN filters will have interpretable dimensions (at least) in the first layer: time and frequency. This basic observation, motivates the following discussion.