ConvNeXt also makes several layer design choices to be more memory-efficient and improve performance, so it competes favorably with Transformers!
### Encoder[[cv-encoder]]
The Vision Transformer (ViT) opened the door to computer vision tasks without convolutions. ViT uses a standard Transformer encoder, but its main breakthrough was how it treats an image: it splits the image into fixed-size patches and uses them to create embeddings, just like a sentence is split into tokens.
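To make the patch-embedding idea concrete, here is a minimal sketch in PyTorch. It is illustrative only, not the exact implementation of any library; the image size (224×224), patch size (16×16), and embedding dimension (768) are assumptions matching the common ViT-Base configuration.

```py
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Hypothetical helper for illustration: turns an image into a sequence
    # of patch embeddings, the way a tokenizer turns a sentence into tokens.
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution splits the image into non-overlapping patches
        # and linearly projects each patch in a single step.
        self.proj = nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
        )

    def forward(self, pixel_values):
        # (batch, 3, 224, 224) -> (batch, 768, 14, 14)
        x = self.proj(pixel_values)
        # Flatten the 14x14 grid into a sequence of 196 patch "tokens":
        # (batch, 768, 14, 14) -> (batch, 196, 768)
        return x.flatten(2).transpose(1, 2)

embeddings = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(embeddings.shape)  # torch.Size([1, 196, 768])
```

The resulting sequence of patch embeddings can then be fed to a standard Transformer encoder, exactly as a sequence of token embeddings would be.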