MMAudio generates synchronized audio given video and/or text inputs. Our key innovation is multimodal joint training, which allows training on a wide range of audio-visual and audio-text datasets. Moreover, a synchronization module aligns the generated audio with the video frames.