Abstract
We present a simple approach that turns a ViT encoder into an efficient video model that works seamlessly with both image and video inputs. B......