Abstract
We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual des......