Abstract
Multi-head self-attention (MSA) endows vision Transformers (ViTs) with the ability to model long-range interactions between tokens. However, recent......
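To make the mechanism referred to above concrete, the following is a minimal sketch of a multi-head self-attention layer in PyTorch. The dimensions, names (embed_dim, num_heads), and token counts are illustrative assumptions and do not reflect the specific configuration studied in this work.

```python
# Minimal multi-head self-attention sketch (illustrative assumptions only;
# not the configuration or implementation used in this paper).
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int = 192, num_heads: int = 3):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        # A single linear layer produces queries, keys, and values jointly.
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, embed_dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # pairwise token similarities
        attn = attn.softmax(dim=-1)                    # every token attends to all tokens (long-range)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: 197 tokens (e.g. 196 image patches plus a class token) of dimension 192.
tokens = torch.randn(2, 197, 192)
print(MultiHeadSelfAttention()(tokens).shape)  # torch.Size([2, 197, 192])
```

Because the attention weights are computed over all token pairs, each output token can aggregate information from any position in the image, which is the long-range interaction property the abstract refers to.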