Me: VGGs suck because they are computationally inefficient, and because they are a naive adoption of a computer vision architecture.
Random person on Internet: Jordi, you might be wrong. People use VGGs a lot!

No further introduction is required: this series of posts is about that discussion. I want to share my honest thoughts on it and reflect on the role of computer vision deep learning architectures in the audio field.
In a previous post I argued why spectrogram-based VGGs suck; the series is organized as follows:
- Post I: Why do spectrogram-based VGGs suck?
- Post II: Why do spectrogram-based VGGs rock? [this post]
- Post III: What’s up with waveform-based VGGs?
In these posts, I’m centering the discussion around the VGG model — which is a computer vision architecture that is widely used by audio researchers. In short, VGGs are based on a deep stack of very small filters combined with max pooling (see the simplified figure above).
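To make that concrete, here is a minimal sketch of what a spectrogram-based VGG-style network looks like: stacks of small 3x3 filters interleaved with max pooling. This is written in PyTorch with made-up layer sizes, as an illustration of the idea rather than the exact model from the figure.

```python
# A minimal, illustrative VGG-style CNN for spectrograms (not the exact model from any paper):
# stacks of small 3x3 convolutions followed by max pooling.
import torch
import torch.nn as nn

class MiniVGG(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: two 3x3 convolutions, then halve the time/frequency resolution.
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 2: same pattern, more channels.
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_mel_bands, n_time_frames), a spectrogram treated as a 1-channel image.
        h = self.features(x)
        h = h.mean(dim=(2, 3))  # global average pooling over frequency and time
        return self.classifier(h)

logits = MiniVGG()(torch.randn(4, 1, 96, 128))  # e.g., 96 mel bands x 128 frames
```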
If they suck, why do people keep using VGG models?
The keys
VGG’s flexibility
Audio CNNs can be designed with or without audio domain knowledge in mind (for further info, see this article). And, without any doubt, spectrogram-based VGGs use no audio domain knowledge in their design. What’s good about that?
By not considering any domain knowledge during the design, one minimizes the assumptions the model makes about the problem. This can be beneficial, for example, if one is not certain how to approach the task.
Remember that part of the deep learning game is to let the architecture freely discover features, which leads to very successful models. If we specifically design a model to efficiently learn timbral or temporal features, we run the risk of restricting the solution space too much.
Instead, VGGs are designed to make minimal assumptions about the nature of the signal or the problem, so that any structure can be learned by hierarchically combining small-context representations. Consequently, VGGs are flexible enough to fit whatever structure the data exposes.
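To see what “hierarchically combining small-context representations” means in practice, here is a tiny receptive-field calculation: each 3x3 convolution only adds a little context, but stacking them with pooling quickly covers a large time-frequency patch. The layer list below is an assumption for illustration, not a specific published model.

```python
# Illustrative sketch: how stacking small filters grows the receptive field.
# Layer spec: (kernel_size, stride); the values are made up for illustration.
layers = [(3, 1), (3, 1), (2, 2),   # two 3x3 convs + 2x2 max pooling
          (3, 1), (3, 1), (2, 2)]   # same block repeated once more

receptive_field, jump = 1, 1  # receptive field (in input pixels) and cumulative stride
for kernel, stride in layers:
    receptive_field += (kernel - 1) * jump
    jump *= stride
    print(f"after ({kernel}x{kernel}, stride {stride}): receptive field = {receptive_field}")
```

Running it shows the context growing from 3x3 to 16x16 input bins after only two blocks, without ever hard-coding what that context should represent.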

Momentum from the computer vision community
Sadly, in many cases, people simplify as follows: AI → deep learning → computer vision. One can find clear evidence of that in AI scientific venues, where most of the empirical results are gathered by tackling computer vision problems with deep neural networks.
Given that the deep learning scene is clearly dominated by computer vision researchers, it seems reasonable that many exciting models, very clear tutorials, and software tools are developed with computer vision in mind.
In particular, computer vision tutorials are heavily impacting our field. For any deep learning audio practitioner, it looks easier (and safer!) to simply follow one of these excellent computer vision tutorials online rather than implementing a poorly documented audio architecture. As a result, many people end up with a computer vision model that works with “audio images”!
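For illustration, this is roughly what the “audio images” recipe looks like. It is a hedged sketch: the file name is hypothetical, and the mel-spectrogram settings, channel replication, and resizing are common choices rather than a prescribed pipeline.

```python
# Sketch of the "audio images" workflow: compute a (mel-)spectrogram and feed it
# to an off-the-shelf computer vision VGG from torchvision.
import torch
import torchaudio
import torchvision

waveform, sample_rate = torchaudio.load("example.wav")               # hypothetical input file
spec = torchaudio.transforms.MelSpectrogram(sample_rate)(waveform)   # (channels, mel, time)
spec = torch.log1p(spec)                                             # log-compress magnitudes

# Vision models expect 3-channel, fixed-size "images": replicate channels and resize.
image = spec[:1].repeat(3, 1, 1).unsqueeze(0)                        # (1, 3, mel, time)
image = torch.nn.functional.interpolate(image, size=(224, 224))

vgg = torchvision.models.vgg16(weights=None)                         # or ImageNet-pretrained weights
logits = vgg(image)                                                  # 1000-way ImageNet logits
```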
A direct consequence of this strong momentum coming from the computer vision field is that many people consider VGGs to be the “standard CNN” baseline, also when working with audio.
Why do these models work? It could simply be because deep neural networks are very strong function approximators. Or because the choice of architecture matters less when enough training data is available. Accordingly, it’s likely that you’ll only gain “a 5%” by swapping the VGG for your favorite audio architecture.
Although your model would be smaller, more interpretable, and would possibly work better… is it worth the effort? Maybe not, because we are lazy. We are not willing to spend our time re-implementing and tuning a custom audio architecture when an off-the-shelf VGG already gets the job done.
