WASPAA 2023 paper: "CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models"

By Jordi Pons in Conferences, Deep learning. July 1, 2023

This paper is the result of Herman's internship at Dolby. In this work, we explore the feasibility of training a text-to-audio synthesis model without paired text-audio data. Check it out on arXiv and listen to our demo!