Given that several relevant researchers in our field were in Barcelona to serve on the juries of Ajay‘s and Sankalp‘s PhD thesis defenses, the MTG hosted a very interesting seminar. Among other topics, the potential impact of deep learning on our field was discussed, and almost everyone agreed that end-to-end learning approaches do not seem to succeed because no large-scale (annotated) music collections are available for research benchmarking. And indeed, most successful deep learning approaches use those models as mere feature extractors or as hierarchical classifiers built on top of hand-crafted features.
Figure 1: Graphic representation of the trade-off between pure data-driven approaches and pure knowledge-based approaches.
Then… how is the lack of data affecting our (deep learning) research?
[Throughout this post we discuss the use case of audio classification; deep generative models are not discussed here.]
- Researchers input hand-crafted features to deep networks. By doing so, one aims to feed the network a higher-level representation of the signal, simplifying the process of learning a feature extractor + classifier from a small amount of data. However, throughout this process one can miss some information – e.g. the phase, if the magnitude spectrogram is used as input. Publications doing so: Hershey et al. and Choi et al.
- Researchers stack non-deep-learning classifiers/models on top of deep learning feature extractors. Several reasons exist for this pipeline: i) deep learning models are powerful feature extractors, but they are data-demanding; and ii) it seems easier to introduce musical knowledge with non-deep-learning models (which are less data-demanding, more understandable and still perform very well). Therefore, this pipeline minimizes the effect of not having large amounts of data while still exploiting the potential of deep learning. Some publications doing so: Torralba et al., Durand et al., Korzeniowski et al. and Böck et al.
- Researchers are tailoring networks towards learning solutions closer to what humans perceive from music. Several ideas fit under this umbrella: musically/perceptually inspired weight-initialization schemes (similar to the well-established initialization paradigm in NMF research), musically inspired architectures, sparsity constraints and perceptually inspired cost functions. These approaches aim to (initially) place the model close to a solution having the generalization power of the known priors of music and perception. A couple of publications doing so: Cakir et al. and Pons et al.
- Researchers are making a significant effort to provide open audio databases for benchmarking our systems. For example, the COSMIR, AcousticBrainz and Freesound projects are gathering large-scale music and audio collections.
- Researchers are artificially augmenting their annotated data in order to overcome the current situation, where only small annotated datasets are available. For example, see: Salamon et al.
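To make the first bullet concrete, here is a minimal numpy-only sketch of feeding a magnitude spectrogram to a network instead of raw audio. Everything in it (frame/hop sizes, the toy sine-wave "recording") is an illustrative assumption, not taken from the cited papers; the key point is the line where `np.abs` discards the phase.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=512, hop=256):
    """Frame the signal, window it, and keep only the STFT magnitude.
    Taking np.abs discards the phase -- the information loss mentioned above."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    stft = np.fft.rfft(np.array(frames), axis=1)   # complex: magnitude + phase
    return np.abs(stft)                            # magnitude only

# A 1-second toy "recording": a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)

spec = magnitude_spectrogram(audio)
print(spec.shape)  # (61, 257): the 2-D (time x frequency) input fed to the network
```

The resulting 2-D matrix is what a convolutional network would consume in place of the 16 000 raw samples – a higher-level, but lossy, representation.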
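The second bullet's pipeline – a frozen deep feature extractor followed by a less data-demanding classifier – can be sketched as follows. Both pieces are hypothetical stand-ins: a fixed random projection plays the role of a pre-trained network, and a nearest-centroid rule plays the role of the non-deep-learning classifier; the cited works use real pre-trained networks and stronger models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained deep feature extractor: a frozen random
# projection (with a ReLU) mapping 1000-dim inputs to 16-dim embeddings.
W = rng.standard_normal((1000, 16))
def extract_features(x):
    return np.maximum(x @ W, 0.0)

# A non-deep-learning classifier on top: nearest class centroid.
def fit_centroids(feats, labels):
    return {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(centroids, feats):
    classes = sorted(centroids)
    dists = np.stack([np.linalg.norm(feats - centroids[c], axis=1)
                      for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

# Toy data: two "classes" of inputs with different statistics.
x0 = rng.standard_normal((20, 1000)) + 2.0
x1 = rng.standard_normal((20, 1000)) - 2.0
X = np.vstack([x0, x1])
y = np.array([0] * 20 + [1] * 20)

feats = extract_features(X)          # deep part: frozen, needs no labels
centroids = fit_centroids(feats, y)  # shallow part: trained on few examples
acc = (predict(centroids, feats) == y).mean()
print(acc)
```

Only the shallow stage is fit to the (small) labelled set, which is exactly why this pipeline tolerates data scarcity.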
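For the third bullet, one concrete instance of a musically inspired architecture is choosing convolutional filter shapes that match musical dimensions of a spectrogram, in the spirit of Pons et al.: tall filters span frequency (timbre), wide filters span time (rhythm/tempo). The sketch below, with assumed sizes and a naive convolution, only illustrates the idea.

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2-D correlation -- enough for a shape illustration."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

spec = np.random.rand(96, 128)       # toy spectrogram: (freq bands, time frames)

# Musically motivated filter shapes (sizes are illustrative assumptions):
timbral = np.random.randn(32, 1)     # tall in frequency: sensitive to timbre
temporal = np.random.randn(1, 16)    # wide in time: sensitive to rhythm/tempo

print(conv2d_valid(spec, timbral).shape)   # (65, 128)
print(conv2d_valid(spec, temporal).shape)  # (96, 113)
```

Fixing filter shapes this way restricts the solution space with a musical prior instead of leaving the network to discover time/frequency structure from scarce data.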
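Finally, the last bullet's augmentation idea can be sketched with a few label-preserving deformations of one audio example. These are toy transformations (a crude interpolation-based stretch, random gain, additive noise), not the actual pipeline of Salamon et al., which uses pitch-preserving stretches and pitch shifts.

```python
import numpy as np

rng = np.random.default_rng(0)

def time_stretch(x, rate):
    """Crude time stretch via linear-interpolation resampling: a toy stand-in
    for the pitch-preserving stretches used in real augmentation pipelines."""
    idx = np.arange(0, len(x) - 1, rate)
    return np.interp(idx, np.arange(len(x)), x)

def augment(x):
    """Return several label-preserving variants of one training example."""
    return [
        time_stretch(x, rng.uniform(0.8, 1.2)),    # tempo perturbation
        x * rng.uniform(0.5, 1.5),                 # random gain
        x + 0.005 * rng.standard_normal(len(x)),   # additive noise
    ]

audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
variants = augment(audio)
print(len(variants))  # 3 extra training examples derived from 1
```

Each variant keeps the original annotation, so a small labelled set is multiplied for free at training time.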
The previous list gives a simplistic overview of the ongoing work that can affect the evolution of deep learning for audio signals, and summarizes some of the current research trends in deep learning for music informatics research (MIR) – note that all selected papers are from 2016. Until a large corpus of annotated audio becomes available to researchers (bullet 4), there seems to be a consensus on restricting the solution space of deep learning models (in manifold ways, see bullets 1, 2 and 3) so that these models have more chances to succeed. Additionally, most researchers make use of data augmentation techniques to boost their results by artificially generating training examples (see bullet 5).
But there exists a delicate trade-off between restricting the solution space and allowing any possible solution. At one extreme (left side of Figure 1) are the non-constrained models that explore the solution space by any means (i.e. deep learning): brute-force optimization of hyper-parameters and architectures – a purely data-driven approach that learns from raw data. Part of the deep learning game is to let the architecture freely discover features, which leads to very successful models. However, a common criticism of deep learning relates to the difficulty of understanding the underlying relationships that the neural networks are learning: they behave like black boxes. Having interpretable models is especially relevant for the MIR field, since it has already been pointed out that machine learning algorithms are learning how to “reproduce the ground truth” rather than learning musical concepts. Without any constraint, there is the risk of reaching a solution that is difficult to interpret and, moreover, very prone to over-fitting – given that only a small sample of data is available for training.

At the other extreme (right side of Figure 1) are some of the current MIR systems: hand-crafted features and knowledge-based approaches. Restricting the solution space of deep networks too much might be inappropriate, because the learning algorithm will not be able to explore the solution space sufficiently. If a classifier/model is not expressive enough, or if an architecture, an initialization or an input representation limits the feature-learning process too much, the resulting model is likely to have limited performance. Hence, by restricting the solution space we incur the risk of not (fully) exploiting the power of deep learning. Therefore, a compromise between expressiveness and restriction is needed, since the available data is scarce.
We need to constrain the solution space in a way that allows interpreting the results while still guaranteeing that deep learning models have enough freedom to explore the solution space themselves. If the exploration of the solution space is severely limited by human skills (as may currently be happening with hand-crafted features and with knowledge-based classifiers), we are wasting the opportunity that deep learning offers to train highly expressive models.
Interestingly, the deep learning methodology allows training models that sit between the two paradigms: data-driven approaches that allow knowledge-based refinements, see Figure 1. Deep learning permits end-to-end learning (which minimizes the assumptions of the model) while still allowing knowledge-based decisions about the input, the architecture, the cost function and the initialization. Moreover, end-to-end learning approaches allow a unified pipeline where the feature extractor and the classifier are optimized together, which might be an advantage compared with the current paradigm: shallow, deep or knowledge-based classifiers stacked on top of feature extractors of a different nature. For this reason we think that minimally restricted end-to-end learning approaches – with some musically/perceptually based restrictions (bullet 3) together with some minimal pre-processing (bullet 1, e.g. with spectrograms as input), while taking advantage of data augmentation techniques – are very interesting, because they minimize the assumptions of the model while also minimizing the need for training data. Without restricting the solution space too much, we aim to achieve more understandable, expressive models that are more likely to generalize in the current context where a large corpus of annotated audio is not available.