We are very proud to introduce jinaai/jina-clip-v1, aka "jina-embeddings-multimodal".
The OpenAI CLIP model openai/clip-vit-base-patch32 aligns the text and image modalities well, so users can perform cross-modal text-image retrieval or image classification on top of it. However, due to its training data and recipe, it cannot:
1. model longer text inputs (77-token constraint);
2. align text-to-text representations (the CLIP text tower is weak for text search).
In our latest publication, Jina CLIP: Your CLIP Model Is Also Your Text Retriever (2405.20204), we propose a multi-task, multi-objective training scheme. The resulting CLIP model shows:
1. Stronger cross-modal performance than the OpenAI model: 2% and 6% improvements in cross-modal retrieval recall@5.
2. The text tower of JinaCLIP is a strong text encoder, matching the performance of jinaai/jina-embeddings-v2-base-en: a 165% improvement on MTEB[BEIR] recall@5.
3. The image tower of JinaCLIP also shows strong performance in image-to-image search (CBIR): a 12% recall improvement on the CIFAR-100 test set.
If you are working on MuRAG (multimodal retrieval-augmented generation), try it out!
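A minimal usage sketch, assuming the transformers AutoModel path with trust_remote_code and the encode_text / encode_image convenience methods described on the model card (verify the exact API and image input formats there):

```python
import numpy as np
from transformers import AutoModel

# Load Jina CLIP v1; trust_remote_code pulls in the custom encoding helpers
# shipped with the model repository (assumption: see the model card).
model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

# Text embeddings: the text tower doubles as a dense text retriever.
text_embeddings = model.encode_text(["A blue cat", "A red cat"])

# Image embeddings: image URLs or local paths (hypothetical example URL).
image_embeddings = model.encode_image(["https://example.com/blue_cat.jpg"])

def cos_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim(text_embeddings[0], image_embeddings[0]))  # cross-modal: text vs. image
print(cos_sim(text_embeddings[0], text_embeddings[1]))   # text-to-text retrieval
```

The same pair of encoders covers all three use cases above: text-image retrieval, text-text retrieval, and image-image (CBIR) search.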