Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers
The growing popularity of Vision Transformers as the go-to models for image
classification has led to an explosion of architectural modifications claiming
to be more efficient than the original ViT. However, the wide diversity of
experimental conditions prevents a fair comparison of these models based
solely on their reported results. To close this comparability gap, we
conduct a comprehensive analysis of more than 30 models to evaluate the
efficiency of vision transformers and related architectures, considering
various performance metrics. Our benchmark provides a comparable baseline
across the landscape of efficiency-oriented transformers, unveiling a plethora
of surprising insights. For example, we discover that ViT is still Pareto
optimal across multiple efficiency metrics, despite the existence of several
alternative approaches claiming to be more efficient. Results also indicate
that hybrid attention-CNN models fare particularly well in terms of low
inference memory and parameter count, and that scaling up the model size is
preferable to scaling up the image size. Furthermore, we uncover a strong
positive correlation between the number of FLOPs and the training memory
consumption, which enables estimating the required VRAM from theoretical
measurements alone.
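As a rough illustration of how such a correlation can be used in practice, the sketch below fits a simple linear model mapping FLOPs to peak training VRAM on a few measured pairs and then predicts the memory footprint of an unseen model. The sample values, the linear form, and the helper name estimate_vram are illustrative assumptions, not numbers or code from the paper.

    # Sketch (assumption-based): estimate training VRAM from FLOPs via a linear fit,
    # exploiting the strong positive correlation reported above.
    # The sample points below are made-up placeholders, not measured values.
    import numpy as np

    # Hypothetical measurements: (GFLOPs per forward pass, peak training VRAM in GiB)
    flops_gflops = np.array([4.6, 17.6, 55.5, 8.7, 31.2])
    vram_gib     = np.array([5.1, 14.8, 41.0, 8.3, 24.9])

    # Least-squares linear fit: vram ~= a * flops + b
    a, b = np.polyfit(flops_gflops, vram_gib, deg=1)

    def estimate_vram(gflops: float) -> float:
        """Predict peak training VRAM (GiB) from theoretical GFLOPs alone."""
        return a * gflops + b

    print(f"Estimated VRAM for a 25-GFLOP model: {estimate_vram(25.0):.1f} GiB")

In this sketch, a single scalar feature (FLOPs) suffices because of the assumed linear relationship; in practice one would fit the regression on the benchmark's own measurements.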
Thanks to our holistic evaluation, this study offers valuable insights for
practitioners and researchers, facilitating informed decisions when selecting
models for specific applications. We publicly release our code and data at
https://github.com/tobna/WhatTransformerToFavor