OWL-ViT builds on top of CLIP by using it as its backbone for zero-shot object detection. After pretraining, an object detection head is added to make a set prediction over the (class, bounding box) pairs. Encoder-decoder[[mm-encoder-decoder]] Optical character recognition (OCR) is a long-standing text recognition task that typically involves several components to understand the image and generate the text. TrOCR simplifies the process using an end-to-end Transformer.