Commit f58ec29 (parent b0753ae) by zpn: Update README.md (+63 −6)
## Usage

**Important**: the text prompt *must* include a *task instruction prefix* that tells the model which task is being performed.

For example, if you are implementing a RAG application, embed your documents as `search_document: <text here>` and embed your user queries as `search_query: <text here>`.
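The prefix convention can be wrapped in a small helper so that an unknown prefix fails loudly instead of silently degrading embedding quality. This is an illustrative sketch, not part of any library; `add_prefix` is a hypothetical name:

```python
# The four task instruction prefixes supported by nomic-embed-text.
VALID_PREFIXES = {"search_query", "search_document", "classification", "clustering"}

def add_prefix(texts, prefix):
    """Prepend a task instruction prefix to each text, rejecting unknown prefixes."""
    if prefix not in VALID_PREFIXES:
        raise ValueError(f"unknown prefix {prefix!r}; expected one of {sorted(VALID_PREFIXES)}")
    return [f"{prefix}: {text}" for text in texts]

documents = add_prefix(["<article about US Presidents>"], "search_document")
queries = add_prefix(["who is the first president of the united states?"], "search_query")
print(documents[0])  # search_document: <article about US Presidents>
print(queries[0])    # search_query: who is the first president of the united states?
```

The prefixed strings are what you would then pass to `model.encode`.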
## Task instruction prefixes

### `search_document`

#### Purpose: embed texts as documents from a dataset

This prefix is used for embedding texts as documents, for example as documents for a RAG index.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
sentences = ['search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten']
embeddings = model.encode(sentences)
print(embeddings)
```
### `search_query`

#### Purpose: embed texts as questions to answer

This prefix is used for embedding texts as questions that documents from a dataset could resolve, for example as queries to be answered by a RAG application.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
sentences = ['search_query: Who is Laurens van Der Maaten?']
embeddings = model.encode(sentences)
print(embeddings)
```

### `clustering`

#### Purpose: embed texts to group them into clusters

This prefix is used for embedding texts in order to group them into clusters, discover common topics, or remove semantic duplicates.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
sentences = ['clustering: the quick brown fox']
embeddings = model.encode(sentences)
print(embeddings)
```

### `classification`

#### Purpose: embed texts to classify them

This prefix is used for embedding texts into vectors that will be used as features for a classification model.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
sentences = ['classification: the quick brown fox']
embeddings = model.encode(sentences)
print(embeddings)
```

### Sentence Transformers