A gentle introduction to Extreme Learning Machines for audio

Extreme Learning Machines (ELMs) are very controversial and very fast machine learning models that perform very well. Of course, very is in italics because how much weight that word carries depends on your background or application field. Still, this sentence gives an idea of what ELMs can deliver – and why they might be interesting for an audio community that rarely uses them.

Extreme Learning Machines

In short, ELMs are classification/regression models (like SVMs, for example) based on a single-hidden-layer feed-forward neural network with random weights. They work as follows: first, ELMs randomly project the input into a latent space; then, they learn to predict the output via a least-squares fit. More formally, one aims to predict:

Ŷ = W2 σ(W1 X)

where W1 is the (randomly weighted) matrix of input-to-hidden-layer weights, σ is the non-linearity, W2 is the matrix of hidden-to-output-layer weights, Y stands for the output vector and X represents the input. The training algorithm works as follows: i) set W1 with random values, and ii) estimate W2 via a least-squares fit:

W2 = Y σ(W1 X)+

where + denotes the Moore-Penrose pseudoinverse. Since no iterative process is required to learn the weights, the ELM authors argue that training is much faster than stochastic gradient descent.

Basically, one aims to predict something useful (Ŷ) provided that an input X is given – that could be: raw audio, MFCC features or any other (audio) representation.
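The two training steps above fit in a few lines of NumPy. Here is a minimal sketch (my own toy code, not from any ELM library; function names and the tanh non-linearity are my choices), using the same shapes as the formulas: Ŷ = W2 σ(W1 X), with W2 obtained from the pseudoinverse of the hidden activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, Y, n_hidden=50):
    """Train an ELM. X: (n_features, n_samples), Y: (n_outputs, n_samples)."""
    W1 = rng.standard_normal((n_hidden, X.shape[0]))  # step i) random input weights
    H = np.tanh(W1 @ X)                               # sigma: here, tanh
    W2 = Y @ np.linalg.pinv(H)                        # step ii) W2 = Y * H^+
    return W1, W2

def elm_predict(W1, W2, X):
    return W2 @ np.tanh(W1 @ X)                       # Yhat = W2 * sigma(W1 * X)

# Toy regression: learn y = sin(x) from 200 samples
X = np.linspace(-3, 3, 200).reshape(1, -1)
Y = np.sin(X)
W1, W2 = elm_fit(X, Y, n_hidden=50)
Yhat = elm_predict(W1, W2, X)
mse = float(np.mean((Y - Yhat) ** 2))
print(mse)  # a small training error
```

Note that the only "learning" happens in the single pseudoinverse call: there is no loop over epochs, which is where the speed claim comes from.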

ELMs for audio applications

ELMs have been successfully used in many application domains, among them: 3D-shape identification, image recognition, text classification, time-series analysis, remote sensing, and control and robotics – see this review [1] for more information.

However, despite the literature showing that ELMs are a promising approach for audio, most ELM reviews/surveys do not showcase audio applications. After a non-exhaustive investigation (email me if your work is not in this list!), we identified two main areas where ELMs have been successfully applied: i) speech emotion recognition [2, 3], and ii) music audio classification [4, 5, 6]. And, interestingly, these studies report that ELMs achieve results on par with SVMs.

In the speech emotion recognition works [2, 3], X is a feature vector composed of, e.g., MFCCs, pitch-based features, or openSMILE features – complemented with their delta features across time frames. In the music audio classification works, X is also a feature vector composed of, e.g., ZCR, RMS, loudness features, MFCCs, or pitch and rhythm descriptors. In short, previous audio works basically used ELMs as a classifier built on top of a (hand-crafted) feature vector.
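This classifier-on-features setup can be sketched as follows. Since the papers' exact pipelines are not reproduced here, synthetic Gaussian blobs stand in for the per-clip audio feature vectors (MFCC statistics, ZCR, RMS, etc.), and the class labels are one-hot encoded so that the usual least-squares fit works for classification:

```python
import numpy as np

rng = np.random.default_rng(1)

n_classes, n_features, n_per_class = 3, 20, 100
# Synthetic stand-ins for hand-crafted audio feature vectors (one row per clip)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(n_per_class, n_features))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)
Y = np.eye(n_classes)[y]              # one-hot targets for the least-squares fit

n_hidden = 200
# Small input-weight scale keeps tanh out of saturation (a common practical tweak)
W1 = 0.1 * rng.standard_normal((n_features, n_hidden))
H = np.tanh(X @ W1)                   # random projection of the feature vectors
W2 = np.linalg.pinv(H) @ Y            # output weights via pseudoinverse
pred = np.argmax(np.tanh(X @ W1) @ W2, axis=1)
acc = (pred == y).mean()
print(acc)                            # high training accuracy on this easy toy data
```

At test time, one would simply apply the same W1 projection and W2 readout to unseen feature vectors and take the argmax over classes.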

Frequently asked questions

Q1: Any additional links to further understand ELMs?

R1: Yes! First, see Hu Yuhuang’s wiki for a nice introduction (with many useful additional links). Then, read G.-B. Huang’s website to get an overall picture of the ELM field. And finally, I would recommend reading some scientific papers (like [1] or [9]), but be aware that these generic ELM papers can be technical – basically, they go through the “ELM theories”.

Q2: What’s the difference between Extreme Learning Machines and Echo State Networks?

R2: Echo State Networks differ from ELMs in that their random projections use sparse recurrent connections – see [7, 8] for audio applications of Echo State Networks.
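To make the contrast concrete, here is a toy sketch of my own (not from [7, 8]) of the Echo State Network's random projection: a sparse recurrent reservoir whose state at time t depends on the whole input history, whereas the ELM projection in the previous sections is a memoryless feed-forward map. As in ELMs, only a linear readout would then be fit by least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_res, T = 1, 50, 100

W_in = rng.standard_normal((n_res, n_in))
W_res = rng.standard_normal((n_res, n_res))
W_res[rng.random((n_res, n_res)) > 0.1] = 0.0        # keep ~10% of connections (sparse)
W_res *= 0.9 / max(abs(np.linalg.eigvals(W_res)))    # spectral radius < 1, for stability

x = rng.standard_normal((T, n_in))                   # some input sequence
h = np.zeros(n_res)
states = []
for t in range(T):
    # Recurrent update: the state carries memory of past inputs (ELMs have none)
    h = np.tanh(W_in @ x[t] + W_res @ h)
    states.append(h.copy())
states = np.array(states)
print(states.shape)  # (100, 50): one reservoir state per time step
```

The spectral-radius rescaling is what gives the reservoir its fading memory ("echo state" property); the ELM has no analogous constraint because it has no recurrence.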

Q3: Are you aware of any python ELM implementation?

R3: The implementation I use to experiment with ELMs is github.com/zygmuntz/Python-ELM/ (which adds ReLU non-linearities), a fork of github.com/dclambert/Python-ELM/ (which has no ReLUs).

Q4: How should I cite ELMs?

R4: Good question! See the “ELMs scandal” to understand how hard it is to figure out who originally invented ELMs (or the same concept under a different name). Because of that, it was not clear to me how to properly cite ELMs. I finally decided to cite these works together [9, 10, 11]. Do you agree? Feel free to email me if you know a better option!

References

[1] Gao Huang, et al. “Trends in extreme learning machines: A review.” Neural Networks 61 (2015): 32-48.

[2] Kun Han, et al. “Speech emotion recognition using deep neural network and extreme learning machine.” Fifteenth Annual Conference of the International Speech Communication Association. 2014. [PDF]

[3] Heysem Kaya and Albert Ali Salah. “Combining modality-specific extreme learning machines for emotion recognition in the wild”. Journal on Multimodal User Interfaces, 10(2):139–149, 2016.

[4] Suisin Khoo, et al. “Automatic han chinese folk song classification using extreme learning machines”. Australasian Joint Conference on Artificial Intelligence, pages 49–60. Springer, 2012.

[5] Qi-Jun Benedict Loh and Sabu Emmanuel. “ELM for the classification of music genres”. International Conference on Control, Automation, Robotics and Vision, pages 1–6. IEEE, 2006. [PDF]

[6] Simone Scardapane, et al. “Music classification using extreme learning machines”. International Symposium on Image and Signal Processing and Analysis (ISPA), pages 377–381. IEEE, 2013.

[7] Georg Holzmann. “Reservoir computing: A powerful black-box framework for nonlinear audio processing”. International Conference on Digital Audio Effect (DAFx), 2009.

[8] Simone Scardapane and Aurelio Uncini. “Semisupervised echo state networks for audio classification”. Cognitive Computation, 9(1):125–135, 2017.

[9] Guang-Bin Huang, et al. “Extreme learning machine: theory and applications”. Neurocomputing, 70(1-3):489–501, 2006.

[10] Yoh-Han Pao, et al. “Learning and generalization characteristics of the random vector functional-link net”. Neurocomputing, 6(2):163–180, 1994.

[11] Wouter F Schmidt, et al. “Feedforward neural networks with random weights”. International Conference on Pattern Recognition, pages 1–4. IEEE, 1992.