WASPAA 2023 paper: “CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models”

This paper is the result of Herman’s internship at Dolby. In this work, we explore the feasibility of training a text-to-audio synthesis model without any text-audio pairs, using only unlabeled videos and a pretrained language-vision model.

Check it out on arXiv and listen to our demo!