Learning the logarithmic compression of the mel spectrogram

Currently, successful neural network audio classifiers use log-mel spectrograms as input. Given a mel-spectrogram matrix X, the logarithmic compression is computed as follows:

f(X) = log(α·X + β).

Common pairs of (α,β) are (1, eps) or (10000,1). In this post we investigate the possibility of learning (α,β). To this end, we study two log-mel spectrogram variants:

Log-learn: The logarithmic compression of the mel spectrogram X is optimized via SGD together with the rest of the model's parameters. We use exponential and softplus gates to control the pace of α and β, respectively. We set the initial pre-gate values to 7 and 1, which results in initial post-gate values of α = 1096.63 and β = 1.31, respectively.
Log-EPS: As a baseline, we use a log-mel spectrogram that does not learn the logarithmic compression; (α,β) are fixed to (1, eps). Note that eps stands for “machine epsilon”, a very small number.
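
To make the two variants concrete, here is a minimal sketch in plain Python of the gating and the compression described above. The function names (softplus, gated_alpha_beta, log_compress) are hypothetical and are not taken from our released code, and eps is assumed to be the float64 machine epsilon. In the actual log-learn model the pre-gate values would be trainable parameters updated by SGD; here they are plain numbers so that the initial values can be checked by hand.

```python
import math
import sys

# Assumption: "eps" is the float64 machine epsilon.
EPS = sys.float_info.epsilon


def softplus(x):
    """Softplus gate: log(1 + exp(x)), always positive."""
    return math.log1p(math.exp(x))


def gated_alpha_beta(pre_alpha, pre_beta):
    """Map pre-gate values to (alpha, beta).

    alpha goes through an exponential gate and beta through a softplus
    gate, so both stay positive whatever the pre-gate values are.
    """
    return math.exp(pre_alpha), softplus(pre_beta)


def log_compress(mel, alpha, beta):
    """Apply f(X) = log(alpha * X + beta) element-wise to a 2D list."""
    return [[math.log(alpha * value + beta) for value in row] for row in mel]


# Log-learn initialization: pre-gate values of 7 and 1.
alpha0, beta0 = gated_alpha_beta(7.0, 1.0)
print(round(alpha0, 2), round(beta0, 2))  # 1096.63 1.31

# Log-EPS baseline: (alpha, beta) fixed to (1, eps).
toy_mel = [[0.0, 0.5], [1.0, 2.0]]  # toy non-negative "mel" energies
baseline = log_compress(toy_mel, 1.0, EPS)
```

Because both gates only output positive numbers, the argument of the logarithm stays positive for any non-negative mel spectrogram.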

TL;DR: We are publishing a negative result: log-learn did not improve our results! 🙂

Datasets

Following common practice, inputs are set to be log-mel spectrogram patches of 128 bins × 3 seconds (128 frames). We consider the following datasets:

Acoustic event recognition: UrbanSound8K dataset (US8K), featuring 8,732 urban sounds divided into 10 classes and 10 folds (with roughly 1000 instances per class).
Acoustic scene classification: TUT dataset (ASC-TUT), featuring 4,680 audio segments for training and 1,620 for evaluation, of 10 s each, divided into 15 classes (with 312 instances per class).

Finally, this study was framed within the context of a work investigating which neural network-based strategies perform best when only a few training examples are available. To do so, we simulate classification scenarios with only n randomly selected training audio clips per class, n ∈ {1, 2, 5, 10, 20, 50, 100}. Since the results of a repeated experiment may vary depending on which clips are selected, we run each experiment m times per fold of data, and report average accuracy scores across runs and folds. Specifically: m = 20 when n ∈ {1, 2}, m = 10 when n ∈ {5, 10}, and m = 5 when n ∈ {20, 50, 100}.
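
The sampling protocol above can be sketched as follows. This is only an illustration under assumptions: the helper names (runs_per_fold, sample_n_per_class) and the toy labels are hypothetical, and the real experiments of course sample audio files from US8K and ASC-TUT rather than from a list of strings.

```python
import random
from collections import defaultdict


def runs_per_fold(n):
    """Number of repetitions m per fold for a given n, as listed above."""
    if n in (1, 2):
        return 20
    if n in (5, 10):
        return 10
    if n in (20, 50, 100):
        return 5
    raise ValueError(f"unsupported n: {n}")


def sample_n_per_class(labels, n, rng):
    """Return indices of n randomly selected training examples per class.

    random.Random.sample raises ValueError if a class has fewer than
    n examples.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], n))
    return chosen


# Toy training set with two classes and five examples each.
toy_labels = ["dog"] * 5 + ["siren"] * 5
rng = random.Random(0)  # fixed seed so that a run can be reproduced

n = 2
for run in range(runs_per_fold(n)):
    subset = sample_n_per_class(toy_labels, n, rng)
    # A model would be trained on `subset` and evaluated here; the
    # reported score is the accuracy averaged across runs and folds.
```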

Models

SB-CNN consists of 3 CNN layers with filters of 5×5, interleaved with max-pool layers. The resulting feature map is connected to a softmax output via a dense layer of 64 units.
VGG is based on a deep stack of small 3×3 CNN filters (in our case 5 layers, each having only 32 filters), combined with max-pool layers (in our case of 2×2). We further employ a final dense layer with a softmax activation that adapts the feature map size to the number of output classes.
TIMBRE consists of a single CNN layer with vertical filters of 108 bins × 7 frames. A softmax output is computed from the maximum values present in each CNN feature map and, therefore, the model has as many filters as output classes. TIMBRE is possibly the smallest CNN one can imagine for an audio classification task, given that it only has a single ‘timbral’ filter per class.
Prototypical networks are based on learning a latent metric space in which classification can be performed by computing (Euclidean) distances to prototype representations of each class. Prototypes are mean vectors of the embedded support data belonging to a class. In our case, the embedding is defined by a VGG. Prototypical networks produce a distribution over classes based on a softmax over distances to the prototypes in the embedding space. See the references below to learn more about prototypical networks.
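
For readers new to prototypical networks, the classification step can be sketched in a few lines. This is a toy illustration and not our implementation: the embeddings below are made-up 2D vectors rather than the output of a VGG, the function names are hypothetical, and we assume a softmax over negative Euclidean distances so that closer prototypes receive higher probability (implementations often use the squared distance instead).

```python
import math


def mean_vector(vectors):
    """Prototype of a class: element-wise mean of its embedded examples."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_distribution(query, prototypes):
    """Softmax over negative distances from the query to each prototype."""
    logits = {c: -euclidean(query, p) for c, p in prototypes.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(value - top) for c, value in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Made-up embedded support data for two classes.
support = {
    "dog": [[0.0, 0.0], [2.0, 0.0]],
    "siren": [[10.0, 10.0], [12.0, 10.0]],
}
prototypes = {c: mean_vector(vs) for c, vs in support.items()}

# An embedded query close to the "dog" prototype.
probs = class_distribution([1.0, 1.0], prototypes)
predicted = max(probs, key=probs.get)
```

In this toy example the query lies much closer to the "dog" prototype, so "dog" receives almost all of the probability mass.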

Results

Tables 1 and 3 compare the results obtained by several models when varying the mel spectrogram compression: log-learn vs. log-EPS. To clearly illustrate the performance gains obtained by log-learn, Tables 2 and 4 list the accuracy differences between the log-learn and log-EPS variants.

Tables 1 and 2 reveal that log-learn and log-EPS results are almost identical for US8K. Although log-learn seems to help improve the results of the SB-CNN and TIMBRE architectures, it can lead to worse results for prototypical networks and VGG. For that reason, we conclude that log-learn and log-EPS are almost equivalent for US8K.

However, for the ASC-TUT dataset, log-learn results are much worse than log-EPS ones. Tables 3 and 4 show that log-learn only improves the results of SB-CNN models when trained with little data (1≤n≤10), while for the rest of the models the performance decreases substantially.

Accordingly, we conclude that learning the logarithmic compression of the mel spectrogram does not improve our results.

Now that you know it doesn’t really work, you don’t need to try it yourself! 🙂

References

Further details of this experiment can be found in Appendix A.3 of our paper (which you can cite to refer to these results):

Jordi Pons, Joan Serrà & Xavier Serra (October, 2018). Training neural audio classifiers with few data. [arXiv, code]

SB-CNN is from Salamon and Bello, see: Deep convolutional neural networks and data augmentation for environmental sound classification in IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.

VGG is a computer vision architecture that was used in:

Hershey, et al.: CNN architectures for large-scale audio classification in ICASSP 2017.
Choi, et al.: Transfer learning for music classification and regression tasks in ISMIR 2017.

For further information about the TIMBRE model, see Pons et al.: Timbre Analysis of Music Audio Signals with Convolutional Neural Networks in EUSIPCO 2017.

And prototypical networks were proposed by Snell, et al.: Prototypical networks for few-shot learning, in NIPS 2017.
