OWL-ViT uses CLIP as its backbone for zero-shot object detection. After pretraining, an object detection head is added to make a set prediction over the (class, bounding box) pairs.
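To make the zero-shot setup concrete, here is a minimal sketch of querying an OWL-ViT checkpoint with free-text labels through the `zero-shot-object-detection` pipeline. The checkpoint name `google/owlvit-base-patch32`, the default threshold, and the helper names `detect` and `keep_confident` are assumptions chosen for illustration, not something the text above prescribes.

```python
from typing import Dict, List


def keep_confident(detections: List[Dict], threshold: float = 0.1) -> List[Dict]:
    """Keep only the (label, box) predictions whose score reaches the threshold."""
    return [d for d in detections if d["score"] >= threshold]


def detect(image, candidate_labels: List[str], threshold: float = 0.1) -> List[Dict]:
    """Run zero-shot detection on one image (PIL image, local path, or URL).

    The checkpoint below is an assumption; another OWL-ViT checkpoint
    from the Hub could be substituted.
    """
    # Imported lazily so keep_confident stays usable without transformers installed.
    from transformers import pipeline

    detector = pipeline(
        "zero-shot-object-detection", model="google/owlvit-base-patch32"
    )
    # Each prediction is a dict with "score", "label", and "box"
    # (a dict holding xmin, ymin, xmax, ymax in pixels).
    detections = detector(image, candidate_labels=candidate_labels)
    return keep_confident(detections, threshold)
```

Calling `detect("street.jpg", ["a bicycle", "a traffic light"])` (with a hypothetical local image) would return the predicted pairs for those text queries. Because the classes are given as text at inference time, no detector retraining is needed to add a new label.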
Encoder-decoder[[mm-encoder-decoder]]
Optical character recognition (OCR) is a long-standing text recognition task that typically involves several components to understand the image and generate the text. TrOCR simplifies the process using an end-to-end Transformer.
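The end-to-end flow can be sketched as follows: the image goes into the encoder and the decoder generates the text directly, with no separate detection or segmentation stage. This is a minimal sketch assuming the `microsoft/trocr-base-printed` checkpoint and a PIL image of a single text line; `recognize_text` and `normalize_whitespace` are hypothetical helper names.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace in the decoded string (illustrative post-processing)."""
    return " ".join(text.split())


def recognize_text(image, checkpoint: str = "microsoft/trocr-base-printed") -> str:
    """Transcribe a single-line text image (a PIL image) with a TrOCR checkpoint.

    The default checkpoint is an assumption; other TrOCR checkpoints
    from the Hub could be substituted.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

    # The processor resizes and normalizes the image for the vision encoder.
    pixel_values = processor(
        images=image.convert("RGB"), return_tensors="pt"
    ).pixel_values
    # The text decoder generates token ids autoregressively from the encoded image.
    generated_ids = model.generate(pixel_values)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return normalize_whitespace(text)
```

Usage would look like `recognize_text(Image.open("line.png"))`, where `line.png` is a hypothetical cropped image of one text line.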