Why do spectrogram-based VGGs suck?

Me: VGGs suck because they are computationally inefficient and because they are a naive adoption of a computer vision architecture.

Random person on Internet: Jordi, you might be wrong. People use VGGs a lot!


No further introduction is needed; this series of posts is about exactly that. I want to share my honest thoughts on this discussion, and to reflect on what role computer vision deep learning architectures should play in the audio field.

In these posts, I center the discussion on the VGG model, a computer vision architecture that is widely used by audio researchers. In short, VGGs are based on a deep stack of very small filters combined with max pooling (see the simplified figure above).

It is worth noting that this discussion does not only affect music and audio deep learning researchers. It also challenges the speech community, which seems to care only about reducing word error rate (WER) using ensembles of computer vision architectures.

Why don’t I like spectrogram-based VGGs?

I have always found it very unsatisfactory to use computer vision architectures for machine listening problems.

What makes me feel particularly uncomfortable is the following assumption: spectrograms are images. However, the axes of an image have a spatial meaning, while the axes of a spectrogram stand for time and frequency.

For the sake of clarity, shall we start changing our CVs to say that we are computer vision researchers working with spectrograms?

That said, note that computer vision architectures are designed with the nature of their problem in mind: several edges can be combined to form a shape, and several shapes can be combined to build a nose or an eye, which can be further combined to draw a face. The VGG model design is based on this principle, and that’s why it hierarchically stacks very small filters: these can capture edges in the first layer, which can then be hierarchically combined to draw a face. But is the music/audio game about combining shapes? I’m not sure about that.

Natural language processing researchers have also successfully integrated domain knowledge into their designs. For example, it is common to use as input a sequence of k-dimensional word vectors (the i-th vector corresponding to the i-th word in the sentence). Given this input, and with the aim of learning n-grams, CNN filters are commonly designed to span n words. I’m not aware of any VGG-net for processing this kind of input, because it basically doesn’t make sense.

Not to mention the recent capsules proposed by Hinton et al., which are meant to capture the orientation and relative spatial relationships between objects in an image. In that way, they can construct latent representations that are more robust to different viewing angles, an idea clearly inspired by the way the human visual system works.

Why do computer vision and natural language researchers have their own architectures, while the audio community has almost none? Is there a research opportunity that we are missing? Or is this a dead end, and that’s why people are not publishing many audio-specific architectures?

It is not only about the conceptualization of the model. We can also run into computational costs that may not be necessary.

Considering the above reasons, it makes sense that computer vision architectures use stacks of very small CNN filters. But with such small filters, seeing a reasonable context (e.g., the whole image) requires a rather deep model, with many layers that each process several feature maps.

Provided that VRAM in GPUs is limited and each feature map representation in VGGs takes a fairly large amount of space, it does not seem a bad idea to try to be as memory-efficient as possible. VRAM in GPUs is our little (and expensive) treasure!

Now let’s think about the audio case, where a relevant cue is timbre (expressed along the vertical axis of a spectrogram). How can one capture timbral traces with small CNN filters? Well, by going as deep as necessary. Remember that each layer can only expand the context a little (with a small filter and a small max-pool), so to “see” the whole vertical axis of a spectrogram one has to stack several layers.
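To make the "going as deep as necessary" point concrete, here is a minimal sketch of the standard receptive-field recurrence along the frequency axis. The block layout (one 3x3 convolution followed by one 2x2 max-pool per block) and the 96 frequency bins are assumptions for illustration, not values taken from any particular model; real VGGs often stack more than one convolution per block.

```python
def receptive_field(layers):
    """Receptive field (in input bins) along one axis after a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the view by (kernel - 1) steps
        jump *= stride             # striding/pooling makes later steps coarser
    return rf


def blocks_to_cover(n_bins, conv_kernel=3, pool=2):
    """Count conv + max-pool blocks needed until the receptive field spans n_bins."""
    layers = []
    blocks = 0
    while receptive_field(layers) < n_bins:
        layers += [(conv_kernel, 1), (pool, pool)]
        blocks += 1
    return blocks


N_FREQ_BINS = 96  # assumed spectrogram height (hypothetical)

print("small-filter blocks needed:", blocks_to_cover(N_FREQ_BINS))
print("one vertical filter covers:", receptive_field([(N_FREQ_BINS, 1)]))
```

Under these assumptions, five blocks only reach 94 bins, so it takes six blocks before a single output unit can see all 96 bins, whereas one filter of height 96 sees them in a single layer.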

However, a single CNN layer with vertical filters can already capture timbral traces (expressed along a relatively large vertical receptive field) without paying the memory cost of going deep. This works because vertical filters directly capture what’s important (timbre), so one does not need to store the outputs of several layers (large feature maps) to reach that context. But not only that: the model with vertical filters requires far fewer computations than the VGG model, because it runs just a single layer.
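A back-of-the-envelope count illustrates the difference. The sketch below is only an estimate under assumed sizes: a 96x128 (frequency x time) single-channel spectrogram, a VGG-like stack of five blocks with 32 filters each, and a single layer of 32 vertical filters spanning the whole frequency axis. All of these numbers and function names are hypothetical, and the count ignores biases, pooling operations, and any layers after the convolutions.

```python
def conv_layer_cost(in_h, in_w, in_ch, out_ch, k_h, k_w, same_padding):
    """Return (out_h, out_w, activations, multiply_accumulates) for one conv layer."""
    out_h = in_h if same_padding else in_h - k_h + 1
    out_w = in_w if same_padding else in_w - k_w + 1
    activations = out_h * out_w * out_ch       # values that must be stored
    macs = activations * in_ch * k_h * k_w     # one dot product per output value
    return out_h, out_w, activations, macs


def vgg_like_cost(n_freq, n_time, n_blocks, n_filters):
    """Total conv activations and MACs for n_blocks of (3x3 conv, 2x2 max-pool)."""
    h, w, ch = n_freq, n_time, 1
    total_act, total_macs = 0, 0
    for _ in range(n_blocks):
        h, w, act, macs = conv_layer_cost(h, w, ch, n_filters, 3, 3, True)
        total_act += act
        total_macs += macs
        h, w, ch = h // 2, w // 2, n_filters   # 2x2 max-pool halves both axes
    return total_act, total_macs


def vertical_cost(n_freq, n_time, n_filters, filter_height):
    """Activations and MACs for one conv layer with (filter_height x 1) filters."""
    _, _, act, macs = conv_layer_cost(
        n_freq, n_time, 1, n_filters, filter_height, 1, False
    )
    return act, macs


vgg_act, vgg_macs = vgg_like_cost(n_freq=96, n_time=128, n_blocks=5, n_filters=32)
ver_act, ver_macs = vertical_cost(n_freq=96, n_time=128, n_filters=32, filter_height=96)

print("VGG-like : activations =", vgg_act, " MACs =", vgg_macs)
print("vertical : activations =", ver_act, " MACs =", ver_macs)
```

With these particular sizes, the single vertical layer stores and computes roughly two orders of magnitude less than the stack. The exact ratio depends entirely on the assumed input size, filter counts, and filter height.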

Note, then, that a very simple signal observation informed a design that makes our deep learning models far more efficient (in both time and space complexity). Now, this saved compute power can be used to build more expressive models!

Finally, it is important to remark that recent studies show that these single-layer CNNs with vertical filters can work as well as (if not better than!) VGGs. Besides, these vertical filters can be designed to be pitch-invariant, which greatly improves the model’s performance.
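The post does not spell out how pitch invariance is obtained, so the following is just one possible sketch: a vertical filter shorter than the frequency axis is slid along frequency, and only the maximum response is kept (a max-pool over the whole vertical axis). It assumes a log-frequency axis, where a change in pitch is roughly a vertical shift of the same pattern. The toy kernel and columns are made up for illustration.

```python
def vertical_filter_response(column, kernel):
    """Slide `kernel` along one spectrogram column (a list of frequency bins)
    and keep the maximum response over all vertical positions."""
    n_positions = len(column) - len(kernel) + 1
    responses = [
        sum(k * column[start + i] for i, k in enumerate(kernel))
        for start in range(n_positions)
    ]
    return max(responses)  # max-pool over frequency discards *where* it matched


# Toy "timbre" template: two partials that are two bins apart (hypothetical).
kernel = [1.0, 0.0, 1.0]

low_note = [0, 1, 0, 1, 0, 0, 0, 0]   # pattern near the bottom of the axis
high_note = [0, 0, 0, 0, 1, 0, 1, 0]  # same pattern shifted up in frequency

print(vertical_filter_response(low_note, kernel))
print(vertical_filter_response(high_note, kernel))
```

Both calls return the same value, because the filter matches the pattern wherever it sits along the frequency axis and the max discards the position.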

If more efficient and simpler audio models exist, why do people keep using VGGs?! The answer to this question is in the following post: Why do spectrogram-based VGGs rock?