What’s up with waveform-based VGGs?

In this series of posts I have written a couple of articles discussing the pros & cons of spectrogram-based VGG architectures, to reflect on the role that computer vision deep learning architectures play in the audio field. Now it’s time to discuss what’s up with waveform-based VGGs!

In these posts I’m centering the discussion around the VGG model, a computer vision architecture that is widely used by audio researchers. In short, VGGs are composed of a deep stack of very small filters combined with max pooling.

The main difference between spectrogram- and waveform-based VGGs is that the former perform 2D convolutions (across time and frequency), while the latter perform 1D convolutions (across time). Another difference is that waveform-based models do not discard the phase: they use the raw signal as it is. Whether this is an advantage or not still has to be determined for many tasks!
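To make this difference concrete, here is a minimal PyTorch sketch (the tensor shapes and number of filters are placeholders I picked for illustration, not values from any particular paper): a spectrogram model convolves small 2D patches over a time-frequency image, while a waveform model convolves small 1D windows over the raw samples, phase included.

    import torch
    import torch.nn as nn

    # Spectrogram branch: input is (batch, channel, frequency bins, time frames).
    spec = torch.randn(1, 1, 96, 128)                     # e.g., 96 mel bands x 128 frames
    conv2d = nn.Conv2d(1, 32, kernel_size=3, padding=1)   # small 3x3 filters, VGG-style
    spec_features = conv2d(spec)                          # -> (1, 32, 96, 128)

    # Waveform branch: input is (batch, channel, samples); the phase is kept.
    wave = torch.randn(1, 1, 16000)                       # e.g., 1 second of audio at 16 kHz
    conv1d = nn.Conv1d(1, 32, kernel_size=3, padding=1)   # small 1D filters across time only
    wave_features = conv1d(wave)                          # -> (1, 32, 16000)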

In previous posts I have explained how to use domain knowledge to improve the efficiency and performance of spectrogram-based models, and I also pointed out that people use spectrogram-based VGGs because these are very flexible: basically, they are not constrained by any domain knowledge. In other words, I exposed the never-ending discussion on whether or not to use domain knowledge when designing data-driven models.

Interestingly, waveform-based deep learning researchers are also diving into this discussion. Some have found very promising results when using VGGs, while others have found interesting results when using domain knowledge. However, the literature is far from conclusive. Possibly because these works are relatively new, there are no independent meta-studies comparing these architectures across several datasets.

Why can waveform-based VGGs rock?

It is important to remark that waveforms are high-dimensional and very variable. That’s why, historically, the audio community did not succeed in building systems that operate directly on the raw waveform.

Precisely because waveforms are unintuitive and hard to approach, it possibly makes sense to tackle this problem without utilizing any domain knowledge. If it’s hard to think how to properly approach the task, why not learn it all from data? Accordingly, it could make sense to use VGG-like models to this end, since these are not constrained by any design strategy relying on domain knowledge, and are therefore highly expressive, with a huge capacity to learn from data.

Besides, VGG models are constructed by stacking CNN layers with small filters. As a consequence of using small filters, the possibility of learning the same representation at different phases is significantly reduced. In addition, the interleaved max-pooling layers further reinforce phase invariance.

As we have seen, it does not seem a bad idea to use waveform-based VGGs. Unlike in the spectrogram case, when dealing with waveform-based models we have no clear intuitions on how to build our models. As a consequence, people started designing waveform-based VGG-like models that do not rely on any domain expertise for their design. Instead, they rely on a set of very small filters that can be hierarchically combined to learn any useful structure, as sketched below. Some of these architectures are Wavenet, the sampleCNN, a squeeze-and-excitation extension of it, or a ResNet adaptation for audio.
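As a rough sketch in the spirit of the sampleCNN family (with simplified layer sizes of my own choosing, not the exact published configuration): a strided first layer followed by repeated blocks of tiny filters and max pooling, so that useful structure is learned hierarchically from raw samples.

    import torch
    import torch.nn as nn

    def small_filter_block(ch_in, ch_out):
        # Tiny filters plus max pooling: reduces sensitivity to phase shifts
        # while growing the receptive field layer by layer.
        return nn.Sequential(
            nn.Conv1d(ch_in, ch_out, kernel_size=3, padding=1),
            nn.BatchNorm1d(ch_out),
            nn.ReLU(),
            nn.MaxPool1d(3),
        )

    waveform_vgg = nn.Sequential(
        nn.Conv1d(1, 128, kernel_size=3, stride=3),    # frame-level first layer
        *[small_filter_block(128, 128) for _ in range(5)],
        nn.AdaptiveAvgPool1d(1),                       # summarize the time axis
        nn.Flatten(),
        nn.Linear(128, 50),                            # e.g., 50 output tags
    )

    x = torch.randn(8, 1, 16000)     # a batch of 1-second clips at 16 kHz
    print(waveform_vgg(x).shape)     # -> torch.Size([8, 50])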

Let’s use domain knowledge to design waveform front-ends!

Although some researchers think that using no domain knowledge is the way to go for waveform-based models, others think the contrary.

All waveform-based models designed with domain knowledge in mind depart from the same observation: end-to-end neural networks learn frequency-selective filters in their first layers. If these have to learn time-frequency decompositions anyway, what if we already tailor the network towards learning that? Maybe, in that way, one can achieve better results than when using VGG-like models.

A first attempt towards that was to use filters that are as long as the window length of an STFT (e.g., a filter length of 512 with a stride of 256). If this setup works nicely for decomposing signals into sinusoidal bases with the STFT, maybe it can also facilitate learning frequency-selective filters in a CNN!
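In code, such an STFT-inspired front-end is just one strided convolution with long filters. A minimal sketch (the 512/256 values are the ones from the example above; the number of filters is a placeholder of mine):

    import torch
    import torch.nn as nn

    # One convolutional layer mimicking the STFT analysis setup:
    # filters as long as an STFT window (512 samples), hopping every 256 samples.
    stft_like_frontend = nn.Conv1d(
        in_channels=1,
        out_channels=128,    # number of learnable frequency-selective filters (arbitrary)
        kernel_size=512,     # window length
        stride=256,          # hop size
    )

    wave = torch.randn(1, 1, 16000)      # 1 second of audio at 16 kHz
    frames = stft_like_frontend(wave)    # -> (1, 128, 61), a learned spectrogram-like map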

Later, a multiscale CNN front-end was proposed, composed of the concatenated feature maps of CNNs having different filter sizes (e.g., filter lengths of 512, 256 and 128 with a stride of 64). They found that these different filters naturally learn the frequencies they can most efficiently represent, with large and small filters learning low and high frequencies, respectively. This contrasts with STFT-inspired CNNs, which try to cover the entire frequency spectrum with a single filter size.
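A minimal sketch of this multiscale idea (again, the number of filters per branch is a placeholder of mine): three parallel 1D convolutions with different filter lengths but the same stride, whose feature maps are concatenated along the channel axis.

    import torch
    import torch.nn as nn

    class MultiScaleFrontend(nn.Module):
        def __init__(self, n_filters=32):
            super().__init__()
            # Long filters tend to capture low frequencies, short ones high frequencies.
            self.branches = nn.ModuleList([
                nn.Conv1d(1, n_filters, kernel_size=k, stride=64, padding=k // 2)
                for k in (512, 256, 128)
            ])

        def forward(self, x):
            # Same stride in every branch, so the feature maps can be concatenated
            # along the channel dimension (cropped to the shortest time length).
            maps = [branch(x) for branch in self.branches]
            t = min(m.shape[-1] for m in maps)
            return torch.cat([m[..., :t] for m in maps], dim=1)

    wave = torch.randn(1, 1, 16000)
    features = MultiScaleFrontend()(wave)    # -> (1, 96, 251)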

More recently, a waveform front-end built on parametrized sinc functions (which implement band-pass filters) was proposed: SincNet. With just two learnable parameters per filter in its first layer, SincNet can outperform an STFT-inspired CNN for waveforms, and even a spectrogram-based CNN!
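The gist of SincNet’s first layer, in a heavily simplified sketch of my own (the actual implementation adds windowing, minimum-band constraints and careful initialization): each filter is a band-pass defined only by a learnable low cutoff and bandwidth, so a bank of 128 filters of length 251 needs 2 x 128 = 256 first-layer parameters instead of the 128 x 251 = 32,128 weights of a standard convolution.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SincLikeConv(nn.Module):
        """Heavily simplified sinc-based band-pass layer (not the official SincNet code)."""
        def __init__(self, n_filters=128, kernel_size=251, sample_rate=16000):
            super().__init__()
            self.kernel_size, self.sample_rate = kernel_size, sample_rate
            # Only 2 learnable parameters per filter: low cutoff and bandwidth (in Hz).
            self.low_hz = nn.Parameter(torch.linspace(30, 7000, n_filters))
            self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))

        def forward(self, x):
            t = torch.arange(-(self.kernel_size // 2), self.kernel_size // 2 + 1,
                             device=x.device, dtype=x.dtype) / self.sample_rate
            low = torch.abs(self.low_hz).unsqueeze(1)                # (F, 1)
            high = low + torch.abs(self.band_hz).unsqueeze(1)        # (F, 1)
            # A band-pass filter is the difference of two ideal low-pass (sinc) filters.
            lowpass = lambda f: 2 * f * torch.special.sinc(2 * f * t)
            filters = (lowpass(high) - lowpass(low)).unsqueeze(1)    # (F, 1, kernel_size)
            return F.conv1d(x, filters, stride=16)

    wave = torch.randn(1, 1, 16000)
    out = SincLikeConv()(wave)    # -> (1, 128, 985), with only 256 learnable filter weights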

So… what?

Although some anecdotal meta-studies exist, given that we are in the early days of this research area, it’s hard to tell which architectures are going to prevail in the long run. For the moment, some influential ideas have been presented, and now it’s time for the community to experiment with them.

While the highly expressive waveform-based VGG models might shine when training data are abundant, domain-knowledge-based models might have better chances when data are scarce, simply because the number of model parameters can be dramatically reduced, as with SincNet. Time will tell!