Ahmadzei's picture
update 1
57bdca5
raw
history blame contribute delete
483 Bytes
ConvNeXt also makes several layer design choices to be more memory-efficient and improve performance, so it competes favorably with Transformers!
Encoder[[cv-encoder]]
The Vision Transformer (ViT) opened the door to computer vision tasks without convolutions. ViT uses a standard Transformer encoder, but its main breakthrough was how it treated an image. It splits an image into fixed-size patches and uses them to create an embedding, just like how a sentence is split into tokens.