End-to-end learning for music audio tagging at scale

More information about our study on GitHub and on arXiv

Two songs from the test set were randomly selected. Tags predicted by the GBT + features model (traditional method set as baseline) and by the proposed deep learning model (with a musically-motivated spectrogram front-end) are compared to human annotated tags.

J.S. Bach - Aria (Vergnügte Ruh, Beliebte Seelenlust)

Top10: Human-labels

female vocals, triple meter, acoustic, classical music, baroque period, lead vocals, string ensemble, major, compositional dominance of: lead vocals and melody.

Top10: GBT + features

acoustic, triple meter, string ensemble, classical music, baroque period, classic period, string solo, major, compositional dominance of: melody and form.

Top10: Deep learning

acoustic, string ensemble, classical music, period baroque, major, compositional dominance of: the arrangement, form, performance, rhythm and lead vocals.


'baroque period' and 'classic period' are mutually exlusive tags predicted with high confidence for the GBT + features model.

'triple meter' is not in the deep learning list.

Kendrick Lamar - Complexion (A Zulu Love)

Top10: Human-labels

English, male vocals, rap, East Coast, breathy vocal, joyful lyrics and compositional dominance of: lyrics, melody, rhythm, accompanying vocals.

Top10: GBT + features

East Coast, West Coast, rap, hardcore, lead vocals, funk, old school, drums, electronic drums, party.