Rationalizing the design of deep learning models for music signals

A brief review of the state of the art in music informatics research (MIR) and deep learning reveals that deep learning models have achieved competitive results in a relatively short amount of time: most relevant papers were published during the last 5 years. Many researchers have successfully used deep learning for several tasks: onset detection, genre classification, chord estimation, auto-tagging and source separation. Some researchers even declare that it is time for a paradigm shift, from hand-crafted features and shallow classifiers to deep processing models. Indeed, introducing machine learning for global modeling (i.e., classification) already brought a significant advance in the state of the art, and no one doubts that. Now some researchers think that another advance could be made by using data-driven feature extractors instead of hand-crafted features, meaning that they propose to fully replace the current pipeline with machine learning. However, deep learning for MIR is still in its early days. Current systems are based on solutions proposed in the computer vision, natural language and speech research fields. It is therefore time to understand these solutions and adapt them to the music case.

For example, convolutional neural networks (CNNs) are widely used in computer vision. CNNs exploit spatially local correlations by enforcing local connections between neurons of adjacent layers. These local connections define filters that characterize the local stationarities of the data. After years of research, the computer vision community has reached the consensus that relevant local stationarities can be modeled by small square filters. With small square filters, the first layers can model edges or basic shapes, which deeper layers can combine to represent textures or objects.
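
As a minimal sketch of what a small square filter computes, the NumPy snippet below slides a hand-set 3x3 filter over a toy image containing a vertical edge. The helper `correlate2d_valid`, the toy image and the filter weights are illustrative assumptions; in an actual CNN the weights would be learned from data.

```python
import numpy as np

def correlate2d_valid(x, kernel):
    """Slide `kernel` over `x` without padding; each output is a local dot product."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 6x6 image: dark left half, bright right half (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-set 3x3 vertical-edge filter (assumption: chosen by hand, not learned).
edge_filter = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

response = correlate2d_valid(image, edge_filter)
print(response)
```

The response is non-zero only where the 3x3 window straddles the edge, which is the sense in which such a filter characterizes a local pattern.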

Currently, many music technology researchers (influenced by the success of deep learning in computer vision) have adopted small square filters for their work with music spectrograms. In a way, these researchers are assuming that audio events can be recognized by looking at spectrograms, which are image-like time-frequency representations of audio. However, how do we know that the relevant local stationarities in music spectrograms can be modeled by small square filters? Music technology researchers could instead study which local stationarities are relevant in music and propose models that fit them well, which would probably lead to more successful and understandable deep learning architectures.
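
To make the question concrete, the following sketch compares three filter shapes with the same number of weights on a stand-in spectrogram (random values, 40 frequency bins by 100 frames). The shapes, sizes and names are hypothetical choices for illustration only, not a recommendation taken from the literature.

```python
import numpy as np

def correlate2d_valid(x, kernel):
    """Slide `kernel` over `x` without padding; each output is a local dot product."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

# Stand-in spectrogram (assumption): rows are frequency bins, columns are time frames.
rng = np.random.default_rng(0)
spectrogram = rng.random((40, 100))

# Three hypothetical filter shapes as (frequency bins, time frames), 9 weights each.
filter_shapes = {
    "square": (3, 3),     # a little frequency context and a little time context
    "temporal": (1, 9),   # one frequency bin observed over 9 frames
    "frequency": (9, 1),  # 9 frequency bins observed in a single frame
}

feature_map_shapes = {}
for name, (n_bins, n_frames) in filter_shapes.items():
    kernel = np.ones((n_bins, n_frames)) / (n_bins * n_frames)  # simple averaging weights
    feature_map = correlate2d_valid(spectrogram, kernel)
    feature_map_shapes[name] = feature_map.shape
    print(f"{name}: {n_bins * n_frames} weights, covers {n_bins} bins x {n_frames} frames, "
          f"feature map {feature_map.shape}")
```

With nine weights each, the square filter sees 3 bins by 3 frames, the temporal filter sees 9 frames of a single bin, and the frequency filter sees 9 bins of a single frame, so the filter shape determines which kind of musical context a unit can respond to.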

The following two observations highlight why deep learning can be advantageous for modeling music. They also point to how some current deep learning technologies could be adapted so that the underlying construction of music is taken into account when designing architectures:

  1. Deep learning technologies might fit the nature of music well, since music is hierarchical in frequency (note, chord; note, motive, structure) and in time (onset, rhythm; onset, tempo). Deep learning can accommodate this hierarchical representation of concepts, since its architectures are inherently hierarchical due to their depth.
  2. Relationships between musical events in time are important for human music perception. Using recurrent neural networks (RNNs) and/or CNNs, a network can analyze such temporal context. RNNs can model long-term dependencies (music structure or recurrent harmonies) and CNNs can model the local context (an instrument’s timbre or musical units). Note, however, that RNNs can also model short-term dependencies, so through architectural choices researchers can tailor the network towards learning musical aspects in manifold ways.
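
The difference in temporal context between the two families can be illustrated with a toy experiment: perturb the first frame of a silent input sequence and check which outputs change. This is a minimal sketch under stated assumptions: a scalar recurrent unit with arbitrary, hand-picked weights (`w_in`, `w_rec`) and a 3-tap averaging filter, neither of which is learned or taken from any cited system.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """1-D filter without padding: each output only sees len(kernel) input frames."""
    n_out = len(x) - len(kernel) + 1
    return np.array([np.dot(x[t:t + len(kernel)], kernel) for t in range(n_out)])

def simple_rnn(x, w_in=0.5, w_rec=0.9):
    """Scalar recurrent unit: the state at frame t depends on all frames up to t."""
    h = 0.0
    states = []
    for x_t in x:
        h = np.tanh(w_in * x_t + w_rec * h)
        states.append(h)
    return np.array(states)

# Silent 20-frame input, and a copy with a single event in the first frame.
x = np.zeros(20)
x_perturbed = x.copy()
x_perturbed[0] = 1.0

kernel = np.ones(3) / 3.0

# Indices of the outputs that differ between the two inputs.
conv_changed = np.flatnonzero(conv1d_valid(x, kernel) != conv1d_valid(x_perturbed, kernel))
rnn_changed = np.flatnonzero(simple_rnn(x) != simple_rnn(x_perturbed))

print("convolution outputs affected:", conv_changed.tolist())
print("recurrent states affected:", len(rnn_changed), "of", len(x))
```

In this toy setting only the convolution output whose window contains the first frame changes, while every recurrent state changes, with an influence that decays over time. Whether a trained RNN actually exploits such long-range influence for music is a separate question.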

Theoretically, RNNs can model long-term musical dependencies. However, it is still not clear how different musical concepts (such as rhythm, tempo or structure) could be efficiently modeled by RNNs. It might also be interesting to study how CNNs and RNNs model different time dependencies when coupled together for the music case. Further research is therefore needed to understand how different deep learning approaches can fit music audio, since we still do not fully grasp what the networks are learning.

Dieleman et al. made some progress by showing that “higher level features are defined in terms of lower-level features” for music: they found that the first convolutional layer in their deep learning music recommendation system had filters specialized in low-level musical concepts (vibrato, vocal thirds, pitches, chords), whereas the filters of the third convolutional layer were specialized in higher-level musical concepts (Christian rock, Chinese pop, 8-bit). This matches similar results from the image processing research community, where lower layers learn shapes that are combined in higher layers to represent objects. Dieleman et al. also proposed a deep learning algorithm that preserves musically significant timescales (beats-bars-themes) within the design of the architecture, which “leads to an increase in accuracy” for music classification tasks and gives an intuition of what the network may be learning, showing that musically motivated architectures may be beneficial for MIR. Moreover, Choi et al. proposed a method called auralisation, an extension of the CNN visualization method that makes it possible to interpret, by listening, what each CNN filter has learned. Finally, Schluter et al. and Phan et al. also tried to gain some understanding of what their CNNs have learned by visually inspecting the feature maps, convolutional filters and/or activations.

Despite these efforts to work out what the networks are learning, it is still not clear which architectures best fit music audio. It is hard to find an adequate combination of parameters for a particular task, which leads to architectures that are difficult to interpret. Given this, it might be interesting to rationalize the design process by exploring deep learning architectures specifically designed to fit music audio, which will probably lead to more successful and understandable deep learning architectures.

Some of our previous work goes in that direction. Specifically, we propose using musically motivated CNNs: CNN architectures designed in light of the conclusions of a discussion of CNN filter shapes for music spectrograms.