Why do spectrogram-based VGGs rock?

Me: VGGs suck because they are computationally inefficient, and because they are a naive adoption of a computer vision architecture.

Random person on the Internet: Jordi, you might be wrong. People use VGGs a lot!

No further introduction is required; this series of posts is about exactly that. I want to share my honest thoughts on this discussion, and to reflect on the role of computer vision deep learning architectures in the audio field.

In a previous post I explained what’s wrong with VGGs. In other words: I listed some of the reasons why they suck. Now, it’s time to explain why they rock! Why do deep learning practitioners use VGGs if people (like me) find clear evidence that spectrogram-based VGGs suck? What’s good about these models?

In these posts, I’m centering the discussion around the VGG model: a computer vision architecture that is widely used by audio researchers. In short, VGGs are based on a deep stack of very small (typically 3×3) filters combined with max pooling (see the simplified figure above).
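To make this concrete, here is a minimal sketch (in PyTorch, with illustrative layer sizes rather than the original VGG configuration) of what such a stack of small filters and max pooling looks like when fed a one-channel spectrogram:

```python
import torch
import torch.nn as nn

# Minimal VGG-style stack for a 1-channel spectrogram input.
# Layer sizes are illustrative, not the original VGG configuration.
vgg_block = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),   # small 3x3 filters
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # halve frequency and time
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

# A batch of 8 spectrograms: (batch, channels, frequency bins, time frames)
x = torch.randn(8, 1, 96, 128)
print(vgg_block(x).shape)  # torch.Size([8, 64, 24, 32])
```

Note how each max pooling halves both the frequency and time axes, so deeper layers see an increasingly larger context of the spectrogram.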

If they suck, why do people keep using VGG models?

The keys to my answer are the flexibility of the model, and the momentum coming from the computer vision community.

VGG’s flexibility

Audio CNNs can be designed with or without domain knowledge in mind (for further info, see this article). And, without any doubt, spectrogram-based VGGs use no audio domain knowledge in their design. What’s good about that?

By not considering any domain knowledge during their design, one minimizes the assumptions the model makes about the problem. This might be beneficial, for example, if one is not certain how to approach the task.

Remember that part of the deep learning game is to let architectures freely discover features, which leads to very successful models. If we specifically design a model to efficiently learn timbral or temporal features, we run the risk of restricting the solution space too much.

Instead, VGGs are designed to make minimal assumptions about the nature of the signal or problem, so that any structure can be learned by hierarchically combining small-context representations. Consequently, VGGs end up being super-flexible function approximators (as opposed to regularized models). That’s why people use VGGs: in some cases this flexibility can be useful! The sketch below contrasts the two design philosophies.
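As a rough illustration (again in PyTorch, with made-up filter sizes that are only meant to convey the idea), compare filters whose shapes encode audio domain knowledge with the domain-agnostic VGG ones:

```python
import torch.nn as nn

# Domain-informed design: filter shapes encode assumptions about spectrograms.
# Input layout: (batch, 1, frequency bins, time frames); sizes are illustrative.
timbral_conv = nn.Conv2d(1, 32, kernel_size=(48, 1))   # tall in frequency: timbre
temporal_conv = nn.Conv2d(1, 32, kernel_size=(1, 64))  # wide in time: rhythm/tempo

# Domain-agnostic (VGG-style) design: tiny square filters, no assumptions.
# Any long-range structure has to emerge by stacking many such layers.
vgg_conv = nn.Conv2d(1, 32, kernel_size=(3, 3), padding=1)
```

The first two constrain (and regularize) what can be learned; the 3×3 filter leaves everything to the optimizer.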

Momentum from the computer vision community

Sadly, in many cases, people simplify as follows: AI → deep learning → computer vision. One can find clear evidence of that in AI scientific venues, where most of the empirical results are obtained by tackling computer vision problems with deep neural networks.

Given that the deep learning scene is clearly dominated by computer vision researchers, it seems reasonable that many exciting models, very clear tutorials, and software tools are developed with that field in mind.

In particular, computer vision tutorials are heavily impacting our field. For any deep learning audio practitioner, it looks easier (and safer!) to follow one of these amazing computer vision tutorials online than to implement a poorly documented audio architecture. As a result, many people end up with a computer vision model that works with “audio images”!

A direct consequence of this strong momentum coming from the computer vision field is that many people regard VGGs as the “standard CNN”, when they are just an arbitrary design fitting the specific needs of the computer vision community.

Why do these work? It could simply be because deep neural networks are very strong function approximators. Or because the choice of architecture matters less when enough training data is available. Accordingly, it’s likely that you’ll only gain “a 5%” by swapping the VGG for your favorite audio architecture.

Although your model would be smaller, more interpretable, and might even work better… is it worth the effort? Maybe not, because we are lazy. We are not willing to spend our time for just “a 5%”. We can probably live with an architecture that makes no sense. Because, after all, VGGs only “watch spectrograms”. Changing that won’t help bring peace to the world.