arXiv:2407.07315

CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging

Published on Jul 10, 2024
Abstract

Existing vision-text contrastive learning models enhance representation transferability and support zero-shot prediction by matching paired image and caption embeddings while pushing unrelated pairs apart. However, astronomical image-label datasets are significantly smaller than the general-domain image-text datasets available on the internet. We introduce CosmoCLIP, an astronomical image-text contrastive learning framework fine-tuned from the pre-trained CLIP model using SpaceNet images and BLIP-based captions. SpaceNet, obtained via FLARE, comprises ~13k optimally distributed images, while BLIP acts as a rich knowledge extractor. The rich semantics derived from these SpaceNet images and BLIP descriptions, when learned contrastively, enable CosmoCLIP to achieve superior generalization across various in-domain and out-of-domain tasks. Our results demonstrate that CosmoCLIP is a straightforward yet powerful framework, significantly outperforming CLIP in zero-shot classification and image-text retrieval tasks.
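
The recipe the abstract describes (BLIP generates captions for the images, and CLIP is fine-tuned on the resulting pairs with a symmetric contrastive objective) can be sketched in a few lines. The snippet below is a minimal illustration assuming the Hugging Face transformers API; the base checkpoints, learning rate, and generation length are placeholder assumptions, not the authors' released code or hyperparameters.

```python
import torch
import torch.nn.functional as F
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)

# BLIP as the "knowledge extractor": generate a caption per image.
# Checkpoint is a generic base model, assumed for illustration only.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image):
    inputs = blip_processor(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)  # length is a placeholder choice
    return blip_processor.decode(out[0], skip_special_tokens=True)

# Fine-tune pre-trained CLIP on the image-caption pairs.
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(clip.parameters(), lr=1e-6)  # placeholder lr

def train_step(images):
    captions = [caption(im) for im in images]
    batch = clip_processor(text=captions, images=images,
                           return_tensors="pt", padding=True, truncation=True)
    logits = clip(**batch).logits_per_image    # (B, B) image-text similarities
    targets = torch.arange(logits.size(0))     # matched pairs lie on the diagonal
    # Symmetric contrastive loss: pull matched image-caption embeddings
    # together, push mismatched pairs apart.
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The symmetric cross-entropy over the similarity matrix is the standard CLIP/InfoNCE objective: each image treats its own caption as the positive class among the batch, and vice versa.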

