# KoCLIP

KoCLIP is a Korean port of OpenAI's CLIP.

## Models

We trained two models, `koclip-base` and `koclip-large`. Both use RoBERTa-large as the language model. The decision to use a somewhat large language model was motivated by the intuition that annotated Korean datasets are rare; a well-trained, performant LM would be key to a good multimodal pipeline given limited data.

| KoCLIP         | LM                   | ViT                            |
|----------------|----------------------|--------------------------------|
| `koclip-base`  | `klue/roberta-large` | `openai/clip-vit-base-patch32` |
| `koclip-large` | `klue/roberta-large` | `google/vit-large-patch16-224` |

## Data

KoCLIP was fine-tuned using 82,783 images from the [MSCOCO](https://cocodataset.org/#home) 2014 image captioning dataset. Korean translations of image captions were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence), an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the aforementioned dataset.

While we also considered alternative multilingual image captioning datasets, notably the Wikipedia-based Image Text Dataset (WIT), we found non-trivial discrepancies in the way captions were curated in WIT and MSCOCO, and eventually decided to train the model on the relatively cleaner captions of MSCOCO instead of introducing more noise.

## Demo

We present three demos, each of which illustrates a different use case of KoCLIP.
 
* *Image to Text*: This is essentially a zero-shot image classification task. Given an input image, the model finds the most likely caption among the text labels provided.
* *Text to Image*: This is essentially an image retrieval task. Given a text, the model looks up a database of pre-computed image embeddings to retrieve the image that best matches the given text.
* *Text to Patch*: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance to the text query.
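
All three demos come down to ranking candidates against a query in a shared embedding space. The sketch below illustrates that idea with toy vectors only. It assumes cosine similarity as the scoring function and an even grid as the partitioning scheme; neither is confirmed by this write-up, and `rank_by_similarity` and `split_into_patches` are hypothetical helper names, not part of KoCLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def split_into_patches(height, width, rows, cols):
    """Return (top, left, bottom, right) boxes tiling an image into a rows x cols grid."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            top, bottom = r * height // rows, (r + 1) * height // rows
            left, right = c * width // cols, (c + 1) * width // cols
            boxes.append((top, left, bottom, right))
    return boxes


# Toy 3-dimensional embeddings standing in for model outputs (made-up values).
image_embedding = [0.9, 0.1, 0.0]
label_embeddings = [
    [0.0, 1.0, 0.0],  # label 0
    [1.0, 0.2, 0.0],  # label 1
    [0.0, 0.0, 1.0],  # label 2
]

# Image to Text: rank the label embeddings against one image embedding.
ranking = rank_by_similarity(image_embedding, label_embeddings)

# Text to Patch: produce the crop boxes whose embeddings would then be ranked.
boxes = split_into_patches(height=4, width=6, rows=2, cols=3)
```

Text to Image uses the same ranking with the roles swapped: the text embedding is the query and the pre-computed image embeddings are the candidates.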

## Findings

In this section, we detail some interesting findings we made throughout the project.

### Prompting

We found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Specifically, instead of feeding the model a single word or short phrase, wrapping the query in a template such as

```
이것은 {} 이다 (EN: This is {}.)
```

noticeably helped the model. We hypothesize that this is due to the nature of captions in the MSCOCO dataset, which are most often full sentences, albeit sometimes short in length.
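
As a minimal sketch of this kind of prompting, the snippet below wraps bare labels in the sentence template before they would be passed to the text encoder. It assumes the placeholder in the template marks where the label is substituted; `TEMPLATE` and `build_prompts` are hypothetical names used only for illustration.

```python
# Hypothetical template; "{}" marks where each label is substituted.
TEMPLATE = "이것은 {} 이다"


def build_prompts(labels, template=TEMPLATE):
    """Wrap each bare label in the sentence template before text encoding."""
    return [template.format(label) for label in labels]


# "고양이" is "cat" and "자동차" is "car".
prompts = build_prompts(["고양이", "자동차"])
```

The resulting sentences, rather than the bare labels, would then serve as the candidate captions for zero-shot classification.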

### Multilinguality

Although KoCLIP was trained exclusively on a Korean dataset, we found that English queries also work surprisingly well for simple words (e.g. "dog", "car"). This could be due to one of two reasons, or a combination thereof:

* *ViT Pretraining*: The ViT backbone for `koclip-base`, `openai/clip-vit-base-patch32`, was already pretrained on an English dataset. Hence, it is possible that its embeddings still lie in a latent space where vector arithmetic can be performed with English text embeddings. One reason against this hypothesis is that `koclip-large` also demonstrates similar multilingual behavior.

* *LM Knowledge Bleed*: `klue/roberta-large` was trained on a large corpus of Korean text in a self-supervised fashion. One might reasonably suspect that English words were included in parts of the corpus, especially given the high frequency of English word transliterations in contemporary conversational Korean. This might also explain why English queries work for both `koclip-base` and `koclip-large`. One reason against this hypothesis is that the authors of KLUE explicitly state in their paper that one criterion for text selection was that "the corpus must be written in contemporary Korean."

At the end of the day, we still found it intriguing that a model that was fine-tuned exclusively on Korean managed to produce semantic embeddings that worked well with ViT.

## Future Work

Due to time and resource constraints, we have yet to compare KoCLIP to other open-source baselines, such as [M-CLIP](https://huggingface.co/M-CLIP). We hope to benchmark KoCLIP on various metrics and evaluation datasets to further measure its performance and reliability. In addition, given that prompt engineering is somewhat of a mystery and an active area of ongoing research, we hope to explore more scientific approaches to this topic.

## References

```bibtex
@misc{park2021klue,
      title={KLUE: Korean Language Understanding Evaluation}, 
      author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
      year={2021},
      eprint={2105.09680},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

```bibtex
@misc{radford2021learning,
      title={Learning Transferable Visual Models From Natural Language Supervision}, 
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

```bibtex
@misc{lin2015microsoft,
      title={Microsoft COCO: Common Objects in Context}, 
      author={Tsung-Yi Lin and Michael Maire and Serge Belongie and Lubomir Bourdev and Ross Girshick and James Hays and Pietro Perona and Deva Ramanan and C. Lawrence Zitnick and Piotr Dollár},
      year={2015},
      eprint={1405.0312},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

```bibtex
@misc{srinivasan2021wit,
      title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning}, 
      author={Krishna Srinivasan and Karthik Raman and Jiecao Chen and Michael Bendersky and Marc Najork},
      year={2021},
      eprint={2103.01913},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

---

We thank the teams at Hugging Face and Google for arranging this wonderful opportunity. It has been a busy yet enormously rewarding week for all of us. We hope you enjoy the demo!