arxiv:2306.00989

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Published on Jun 1, 2023
Abstract

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.
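For readers who want to try the released models, below is a minimal inference sketch in PyTorch. The torch.hub entry point name ("hiera_base_224") and checkpoint tag ("mae_in1k_ft_in1k") follow the repository's naming scheme but are assumptions here, not confirmed API; check the README at https://github.com/facebookresearch/hiera for the exact identifiers.

```python
# Minimal inference sketch for a released Hiera checkpoint.
# The hub entry point and checkpoint tag below are assumed names;
# verify them against the facebookresearch/hiera README.
import torch
from PIL import Image
from torchvision import transforms

# Load an ImageNet-1k fine-tuned Hiera-Base model via torch.hub.
model = torch.hub.load(
    "facebookresearch/hiera",
    "hiera_base_224",              # assumed entry-point name
    pretrained=True,
    checkpoint="mae_in1k_ft_in1k", # assumed checkpoint tag
)
model.eval()

# Standard ImageNet preprocessing at the model's 224x224 input size.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg")).unsqueeze(0)  # (1, 3, 224, 224)
with torch.no_grad():
    logits = model(img)
print(logits.argmax(dim=-1))  # predicted ImageNet-1k class index
```

Because Hiera drops the vision-specific components of other hierarchical transformers, the loaded model is a plain multi-stage transformer; the same call pattern should apply to the larger variants by swapping the model name.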

