license: apache-2.0
The text embedding suite trained by [Jina AI](https://github.com/jina-ai), [Finetuner team](https://github.com/jina-ai/finetuner).
Intended Usage & Model Info
jina-embedding-s-en-v1 is a language model that has been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.
The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
With a compact size of just 35 million parameters, the model enables lightning-fast inference while still delivering impressive performance. Additionally, we provide the following options:
- jina-embedding-b-en-v1: 110 million parameters.
- jina-embedding-l-en-v1: 800 million parameters.
- jina-embedding-xl-en-v1: 3 billion parameters.
- jina-embedding-xxl-en-v1: 11 billion parameters.
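The retrieval and reranking use cases mentioned above usually come down to sorting candidate documents by how similar their embeddings are to a query embedding. The following is a minimal sketch of that idea, not the model's own pipeline: the helper names `normalize` and `rank_documents` are hypothetical, and the 3-dimensional vectors are made-up placeholders rather than real model output.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_documents(
    query_emb: Sequence[float],
    doc_embs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    q = normalize(query_emb)
    scored = []
    for idx, doc in enumerate(doc_embs):
        d = normalize(doc)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings (assumption: real embeddings have many more dimensions).
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.1, 0.0],  # nearly parallel to the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]
ranking = rank_documents(query, docs)
for idx, score in ranking:
    print(f"doc {idx}: {score:.3f}")
```

For large collections an approximate nearest-neighbor index is the more common choice; the exhaustive loop here is only meant to show the scoring step.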
Data & Parameters
More information will be released together with the technical report.
Metrics
We compared the model against all-minilm-l6-v2 and all-mpnet-base-v2 from sbert and text-embedding-ada-002 from OpenAI:
Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact | Parameters | Context length |
---|---|---|---|---|---|---|---|---|---|---|---|
all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.79 | 0.876 | 0.473 | 0.876 | 0.645 | 33m | 256 |
all-mpnet-base-v2 | 0.726 | 0.835 | 0.78 | 0.857 | 0.8 | 0.906 | 0.513 | 0.875 | 0.656 | 110m | 256 |
text-embedding-ada-002 | 0.698 | 0.833 | 0.761 | 0.861 | 0.86 | 0.903 | 0.685 | 0.876 | 0.726 | Unknown | 8024 |
jina-embedding-small | 0.738 | 0.781 | 0.732 | 0.833 | 0.785 | 0.859 | 0.471 | 0.852 | 0.567 | 35m | 512 |
For more tasks and metrics, please check out the MTEB benchmark.
Usage
!pip install finetuner[text]
import finetuner
model = finetuner.get_model('jinaai/jina-embedding-s-en-v1')
embeddings = model.encode(['sentence 1', 'sentence 2'])
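
Once embeddings are available, semantic textual similarity between two sentences is commonly scored with cosine similarity. The snippet below is a sketch under the assumption that each embedding can be treated as a plain sequence of floats; `cosine_similarity` is a hypothetical helper written for illustration, and the vectors are placeholders, not values produced by the model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for two sentence embeddings (assumption).
emb_1 = [0.1, 0.3, 0.5]
emb_2 = [0.2, 0.6, 1.0]     # same direction as emb_1, twice the length
emb_3 = [0.5, -0.1, -0.04]  # orthogonal to emb_1

sim_same_direction = cosine_similarity(emb_1, emb_2)
sim_orthogonal = cosine_similarity(emb_1, emb_3)
```

Scores close to 1 indicate vectors pointing in nearly the same direction, and scores near 0 indicate unrelated directions; vector length does not affect the result.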