Deep learning architectures for audio classification: a personal (re)view

One can divide deep learning models into two parts: front-end and back-end – see Figure 1. The front-end is the part of the model that interacts with the input signal in order to map it into a latent-space, and the back-end predicts the output given the representation obtained by the front-end.

Figure 1 – Deep learning pipeline.

In the following, we discuss the different front- and back-ends we identified in the audio classification literature. Continue reading