It’s been amazing to meet my international friends and colleagues in person again. It was also nice to see PhD students experiencing research and conferences firsthand (no beers allowed) 🙂 I’m sure this meeting will foster future collaborations and new friendships, pushing the field of music/audio deep learning research forward!
This year I was there to present:
- Full-band General Audio Synthesis with Score-based Diffusion by Santi Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons and Joan Serrà.
- Adversarial Permutation Invariant Training for Universal Sound Separation by Emilian Postolache, Jordi Pons, Santi Pascual and Joan Serrà.

Next, I will outline the key trends that we identified while attending ICASSP. Thanks Santi and Chunghsin for your feedback!
Multimodal interfaces
The integration of natural language interfaces has become a prominent trend in the field. One example is the transition from audio tagging to audio captioning, where the focus has shifted towards generating descriptive text for audio content. Another example is the move from tag-based to text-based audio generation. This trend has been clearly visible in the broader machine learning community for a while, and it is now also evident at conferences such as ICASSP. Notably, it extends beyond natural language prompting to the use of images as prompts, although fewer examples were observed in that direction.
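To make the interface shift concrete, here is a deliberately toy sketch contrasting tag-based and text-based conditioning. Everything in it (module names, the bag-of-characters "text encoder", the sizes) is my own illustration and not code from any of the papers or systems mentioned here; real systems would plug in a pretrained text encoder instead.

```python
# Toy sketch: tag-conditioned vs. text-conditioned audio generation.
# All modules and sizes are hypothetical, purely for illustration.
import torch
import torch.nn as nn

NUM_TAGS, COND_DIM, AUDIO_LEN = 50, 256, 16000

class TagConditioner(nn.Module):
    """Old-style conditioning: a fixed tag vocabulary mapped to an embedding."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_TAGS, COND_DIM)
    def forward(self, tag_id):
        return self.embed(tag_id)

class TextConditioner(nn.Module):
    """New-style conditioning: a free-text prompt encoded into the same space.
    A bag-of-characters stands in for a pretrained text encoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(256, COND_DIM)
    def forward(self, prompt: str):
        counts = torch.zeros(256)
        for ch in prompt.encode("utf-8"):
            counts[ch] += 1.0
        return self.proj(counts)

class ToyGenerator(nn.Module):
    """Stand-in for any conditional audio generator (GAN, diffusion, AR, ...)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(COND_DIM, AUDIO_LEN)
    def forward(self, cond):
        return self.net(cond)

gen = ToyGenerator()
audio_from_tag = gen(TagConditioner()(torch.tensor(3)))                # e.g. tag #3 = "dog bark"
audio_from_text = gen(TextConditioner()("a dog barking twice in the rain"))
print(audio_from_tag.shape, audio_from_text.shape)
```

The point of the contrast: the tag interface is limited to a closed vocabulary, while the text interface lets users describe content compositionally, which is exactly what makes captioning and text-based generation attractive.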
Scaling = struggle
Many researchers are currently facing difficulties when it comes to scaling their models, either up or down. Those aiming to scale up and use large, powerful models often run into GPU resource limitations. A commonly observed workaround is a two-step approach similar to Tacotron: first generate a compact intermediate representation, then decode it to waveform with a second model (see the sketch below). On the other hand, researchers seeking to scale down their models to run on real-time devices also face significant obstacles. It was interesting to witness at ICASSP, a signal processing venue, how many of the proposed solutions for scaling down models are rooted in signal processing techniques.
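For readers less familiar with the Tacotron-like recipe, here is a minimal sketch under toy assumptions (module names and sizes are mine, not from any particular paper): a first model works at the resolution of a compact intermediate representation, and a second model upsamples it to waveform. Working at the compressed intermediate resolution is what keeps the GPU budget manageable.

```python
# Toy two-stage pipeline: conditioning -> intermediate representation -> waveform.
# Sizes and modules are illustrative only.
import torch
import torch.nn as nn

COND_DIM, N_MELS, FRAMES, HOP = 256, 80, 100, 256

class Stage1AcousticModel(nn.Module):
    """Conditioning -> low-resolution intermediate representation (mel-like)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(COND_DIM, N_MELS * FRAMES)
    def forward(self, cond):
        return self.net(cond).view(-1, N_MELS, FRAMES)

class Stage2Vocoder(nn.Module):
    """Intermediate representation -> waveform (HOP samples per frame)."""
    def __init__(self):
        super().__init__()
        self.up = nn.ConvTranspose1d(N_MELS, 1, kernel_size=HOP, stride=HOP)
    def forward(self, mel):
        return self.up(mel).squeeze(1)

cond = torch.randn(1, COND_DIM)
mel = Stage1AcousticModel()(cond)   # (1, 80, 100): cheap to model
wave = Stage2Vocoder()(mel)         # (1, 25600): full-resolution audio
print(mel.shape, wave.shape)
```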
Diffusion probabilistic models
It comes as no surprise, but their inclusion is essential for a comprehensive overview. It is clear (not only from ICASSP but from the recent machine learning literature) that diffusion probabilistic models have proven to be highly effective generative models. At ICASSP, researchers proposed diffusion probabilistic models for separation, enhancement, bandwidth extension, vocoding and audio synthesis. That said, I’m really looking forward to a GAN comeback!
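If you have not looked at these models yet, the core training objective is surprisingly simple. Below is a minimal noise-prediction sketch: the schedule, shapes and toy denoiser are my own illustration and not the setup of any specific ICASSP paper (real systems use U-Net- or transformer-style denoisers and operate on waveforms, spectrograms or latents).

```python
# Minimal DDPM-style training step: corrupt clean audio, predict the added noise.
# Toy denoiser and schedule, purely illustrative.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(16000 + 1, 256), nn.ReLU(), nn.Linear(256, 16000))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(x0):
    """x0: (batch, 16000) clean 'waveforms'."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                              # random diffusion step
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].unsqueeze(1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise             # forward (noising) process
    inp = torch.cat([x_t, t.float().unsqueeze(1) / T], dim=1)  # crude timestep conditioning
    loss = ((denoiser(inp) - noise) ** 2).mean()               # predict the noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(training_step(torch.randn(4, 16000)))
```

The same objective carries over to the conditional tasks listed above (separation, enhancement, vocoding, …); what changes is mostly what the denoiser is conditioned on.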
Self-supervised learning
The first in-person ICASSP since the pandemic underscored the lasting significance of self-supervised learning. Self-supervised learning is here to stay! I highly recommend that interested researchers look into the Self-supervision in Audio, Speech and Beyond Workshop, which showcased the latest advancements in self-supervision for speech (with limited focus on audio or music, though). This workshop was possibly the standout session of all of ICASSP; kudos to the organizers for their exceptional work!
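As a quick refresher, one common self-supervised recipe for audio is masked prediction: hide random frames of a feature sequence and train an encoder to reconstruct them from the surrounding context, no labels required. The sketch below is a toy illustration of that idea under my own assumptions, not code from any workshop paper.

```python
# Toy masked-prediction self-supervision on unlabeled audio features.
# All modules, names and sizes are illustrative.
import torch
import torch.nn as nn

DIM, FRAMES, MASK_P = 80, 200, 0.3

class ContextEncoder(nn.Module):
    """Bidirectional GRU so masked frames are reconstructed from context."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(DIM, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(256, DIM)
    def forward(self, x):
        h, _ = self.rnn(x)
        return self.head(h)

encoder = ContextEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def ssl_step(feats):
    """feats: (batch, FRAMES, DIM) unlabeled features, e.g. log-mel frames."""
    mask = torch.rand(feats.shape[:2]) < MASK_P       # which frames to hide
    corrupted = feats.clone()
    corrupted[mask] = 0.0                             # zero out masked frames
    pred = encoder(corrupted)
    loss = ((pred[mask] - feats[mask]) ** 2).mean()   # reconstruct only masked frames
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(ssl_step(torch.randn(8, FRAMES, DIM)))
```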
Selected innovative papers
These papers were a breath of fresh air with their innovative ideas – sorry for my source separation, music and audio bias 🙂 Presented in no particular order:
- Blind estimation of audio processing graph
- Hyperbolic audio source separation
- Heterogeneous graph learning for acoustic event classification
- AudioSlots: A slot-centric generative model for audio separation
What did I miss?
Despite the current era of multi-modal data and large datasets, I was surprised to observe that the ICASSP community still heavily focuses on speech-related and audio-only challenges. For this conference I was anticipating a broader emphasis on general audio problems encompassing music, audio, and speech. However, while this feels like a natural evolution of the field, it was not the case. I found that numerous researchers continue to prioritize chasing state-of-the-art performance on small, speech-only datasets – potentially overlooking the significance of working on problem formulation and dataset creation for new deep learning tasks that might be more relevant. Note that, as individuals, our auditory experience encompasses more than just speech. Our ears are attuned to a diverse range of sounds and music, and we can leverage visual and spatial cues to enhance our perception.
It was also weird that the recent advancements in music/audio generation (AudioLM, MusicLM, AudioLDM, MusicGen, Dance Diffusion) were not part of the proceedings. Maybe next time we should organise more satellite workshops (like the Self-supervision in Audio, Speech and Beyond Workshop) around ICASSP?