Currently, successful neural network audio classifiers use log-mel spectrograms as input. Given a mel spectrogram matrix X, the logarithmic compression is computed as follows:
f(X) = log(α·X + β).
Common (α, β) pairs are (1, eps) and (10000, 1). In this post we investigate the possibility of learning (α, β). To this end, we study two log-mel spectrogram variants:
- Loglearn: the logarithmic compression of the mel spectrogram X is optimized via SGD together with the rest of the model parameters. We use exponential and softplus gates to control the pace of α and β, respectively. We set the initial pre-gate values to 7 and 1, which results in out-of-gate initial values for α and β of 1096.63 and 1.31, respectively.
- LogEPS: as a baseline, we use a log-mel spectrogram that does not learn the logarithmic compression; (α, β) are set to (1, eps). Note that eps stands for “machine epsilon”, a very small number.
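The two variants above can be sketched in a few lines of NumPy (a minimal sketch of our reading of the method, not the paper’s actual implementation; the function names are ours):

```python
import numpy as np

def softplus(x):
    # numerically stable softplus gate
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

def loglearn(X, pre_alpha=7.0, pre_beta=1.0):
    """Loglearn compression: SGD would update the pre-gate values."""
    alpha = np.exp(pre_alpha)   # exponential gate: exp(7) ≈ 1096.63
    beta = softplus(pre_beta)   # softplus gate: softplus(1) ≈ 1.31
    return np.log(alpha * X + beta)

def log_eps(X):
    """LogEPS baseline: fixed (alpha, beta) = (1, machine epsilon)."""
    return np.log(X + np.finfo(np.float32).eps)

X = np.random.rand(128, 128)   # a 128-bin × 128-frame mel spectrogram patch
compressed = loglearn(X)
baseline = log_eps(X)
```

The gates keep α strictly positive and β positive and smooth, so SGD can move the pre-gate values freely without ever producing an invalid argument to the logarithm.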
TL;DR: we are publishing a negative result: loglearn did not improve our results! 🙂
Datasets
Following common practice, inputs are set to be log-mel spectrogram patches of 128 bins × 3 seconds (128 frames). We consider the following datasets:
- Acoustic event recognition: the UrbanSound8K dataset (US8K), featuring 8,732 urban sounds divided into 10 classes and 10 folds (with roughly 1,000 instances per class).
- Acoustic scene classification: the TUT dataset (ASC-TUT), featuring 4,680 audio segments for training and 1,620 for evaluation, each 10 s long, divided into 15 classes (with 312 instances per class).
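For reference, slicing a log-mel spectrogram into the fixed-size 128 × 128 patches described above can be done as follows (a minimal sketch; `make_patches` is a hypothetical helper, not from the paper’s code):

```python
import numpy as np

def make_patches(logmel, patch_frames=128):
    """Cut a (n_bins, n_frames) log-mel spectrogram into
    non-overlapping patches of `patch_frames` frames each.
    Any leftover frames at the end are discarded."""
    n_patches = logmel.shape[1] // patch_frames
    return np.stack([logmel[:, i * patch_frames:(i + 1) * patch_frames]
                     for i in range(n_patches)])

logmel = np.random.rand(128, 400)  # spectrogram of a longer clip
patches = make_patches(logmel)     # shape: (3, 128, 128)
```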
Models
- SB-CNN consists of 3 CNN layers with 5×5 filters, interleaved with max-pool layers. The resulting feature map is connected to a softmax output via a dense layer of 64 units.
- VGG is based on a deep stack of small 3×3 CNN filters (in our case 5 layers, each with only 32 filters), combined with max-pool layers (in our case 2×2). A final dense layer with a softmax activation adapts the feature map size to the number of output classes.
- TIMBRE consists of a single CNN layer with vertical filters of 108 bins × 7 frames. A softmax output is computed from the maximum value of each CNN feature map; therefore, the model has as many filters as output classes. TIMBRE is possibly the smallest CNN one can imagine for an audio classification task, given that it has only a single ‘timbral’ filter per class.
- Prototypical networks learn a latent metric space in which classification is performed by computing (Euclidean) distances to prototype representations of each class. Prototypes are the mean vectors of the embedded support data belonging to each class. In our case, the embedding is defined by a VGG. Prototypical networks produce a distribution over classes via a softmax over the distances to the prototypes in the embedding space. See the references below to learn more about prototypical networks.
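The prototypical classification step can be sketched in NumPy as follows (a minimal sketch under our reading of the method; the VGG embedding network is omitted and the function names are ours):

```python
import numpy as np

def class_prototypes(embeddings, labels, n_classes):
    # prototype = mean embedding of the support examples of each class
    return np.stack([embeddings[labels == c].mean(axis=0)
                     for c in range(n_classes)])

def proto_probs(query, protos):
    # distribution over classes: softmax over the negative squared
    # Euclidean distances to the prototypes
    d2 = ((protos - query[None, :]) ** 2).sum(axis=1)
    logits = -d2
    e = np.exp(logits - logits.max())
    return e / e.sum()

# toy example: 4 support points in a 2-D embedding space, 2 classes
emb = np.array([[0., 0.], [0., 2.], [4., 0.], [4., 2.]])
lab = np.array([0, 0, 1, 1])
protos = class_prototypes(emb, lab, 2)            # [[0, 1], [4, 1]]
probs = proto_probs(np.array([0.5, 1.0]), protos)  # class 0 is closest
```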
Results
Tables 1 and 3 compare the results obtained by several models when varying the mel spectrogram compression: loglearn vs. logEPS. To clearly illustrate the performance gains obtained by loglearn, Tables 2 and 4 list the accuracy differences between the loglearn and logEPS variants.
Tables 1 and 2 reveal that loglearn and logEPS results are almost identical for US8K. Although loglearn seems to help improve the results of the SB-CNN and TIMBRE architectures, it yields worse results for prototypical networks and VGG. For that reason, we conclude that loglearn and logEPS perform almost equivalently on US8K.
However, for the ASC-TUT dataset, loglearn results are much worse than logEPS ones.
Tables 3 and 4 show that loglearn only improves the results of SB-CNN models trained with little data (1 ≤ n ≤ 10); for the rest of the models, performance decreases substantially.
Accordingly, we conclude that learning the logarithmic compression of the mel spectrogram does not improve our results.
Now that you know it doesn’t really work… you don’t need to try it yourself! 🙂
References
Further details of this experiment can be found in Appendix A.3 of our paper (which you can cite to refer to these results):
Jordi Pons, Joan Serrà & Xavier Serra (October 2018). Training neural audio classifiers with few data. [arXiv, code]
SB-CNN is from Salamon and Bello; see: Deep convolutional neural networks and data augmentation for environmental sound classification, IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.
VGG is a computer vision architecture that was used in:

Hershey et al.: CNN architectures for large-scale audio classification, in ICASSP 2017.

Choi et al.: Transfer learning for music classification and regression tasks, in ISMIR 2017.
For further information about the TIMBRE model, see Pons et al.: Timbre analysis of music audio signals with convolutional neural networks, in EUSIPCO 2017.
Prototypical networks were proposed by Snell et al.: Prototypical networks for few-shot learning, in NIPS 2017.