zpn MaxNomic committed on
Commit ec7a86b
1 Parent(s): a6571e5

Update README.md (#23)


- Update README.md (a06d57ec039caf0bbe404cc1a482bd37db3c4597)


Co-authored-by: Max Cembalest <MaxNomic@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +37 -38
README.md CHANGED
@@ -2612,7 +2612,7 @@ language:
  
  `nomic-embed-text-v1` is an 8192 context length text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small performance on short and long context tasks.
  
-
+ # Performance Benchmarks
  
  | Name | SeqLen | MTEB | LoCo | Jina Long Context | Open Weights | Open Training Code | Open Data |
  | :-------------------------------:| :----- | :-------- | :------: | :---------------: | :-----------: | :----------------: | :---------- |
@@ -2624,43 +2624,6 @@ language:
  
  **Exciting Update!**: `nomic-embed-text-v1` is now multimodal! [nomic-embed-vision-v1](https://huggingface.co/nomic-ai/nomic-embed-vision-v1) is aligned to the embedding space of `nomic-embed-text-v1`, meaning any text embedding is multimodal!
  
- ## Hosted Inference API
-
- The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
-
- Generating embeddings with the `nomic` Python client is as easy as:
-
- ```python
- from nomic import embed
-
- output = embed.text(
-     texts=['Nomic Embedding API', '#keepAIOpen'],
-     model='nomic-embed-text-v1',
-     task_type='search_document'
- )
-
- print(output)
- ```
-
- For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text).
-
- ## Data Visualization
- Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
-
-
- [![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
-
- ## Training Details
-
- We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
- the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summaries of news articles.
-
- In the second finetuning stage, higher-quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
-
- For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).
-
- The training data is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
-
  ## Usage
  
  **Important**: the text prompt *must* include a *task instruction prefix*, instructing the model which task is being performed.
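
The usage note above requires a *task instruction prefix* on every prompt. As a minimal sketch, assuming the model is loaded through `sentence-transformers` (this model generally needs `trust_remote_code=True`), the prefix is just a literal string such as `search_query:` or `search_document:` prepended to each input, mirroring the `task_type` values used with the Nomic API:

```python
# Illustrative sketch of the task-prefix convention; the library choice and the
# trust_remote_code flag are assumptions based on common usage of this model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Prefix each text with the task it will be used for: documents to be indexed
# get "search_document:", retrieval queries get "search_query:".
documents = ["search_document: Nomic Embed is an 8192 context length text encoder."]
queries = ["search_query: What is the context length of Nomic Embed?"]

doc_embeddings = model.encode(documents)
query_embeddings = model.encode(queries)
print(doc_embeddings.shape, query_embeddings.shape)
```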
@@ -2794,6 +2757,42 @@ const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
  console.log(embeddings);
  ```
  
+ ## Nomic API
+
+ The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
+
+ Generating embeddings with the `nomic` Python client is as easy as:
+
+ ```python
+ from nomic import embed
+
+ output = embed.text(
+     texts=['Nomic Embedding API', '#keepAIOpen'],
+     model='nomic-embed-text-v1',
+     task_type='search_document'
+ )
+
+ print(output)
+ ```
+
+ For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text).
+
+
+ ## Training
+ Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
+
+ [![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
+
+ We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
+ the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summaries of news articles.
+
+ In the second finetuning stage, higher-quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
+
+ For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).
+
+ The training data is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
+
+
  # Join the Nomic Community
  
  - Nomic: [https://nomic.ai](https://nomic.ai)
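
The relocated Training section describes a first unsupervised contrastive stage over weakly related text pairs and a second finetuning stage with curated data and hard-example mining; the actual implementation lives in the `contrastors` repository linked above. The snippet below is only a rough sketch of an in-batch-negatives contrastive objective of that general kind; the function name, temperature, and toy tensors are illustrative assumptions, not code from `contrastors`.

```python
# Schematic in-batch-negatives contrastive loss: each query's positive is its
# paired document, and the other documents in the batch act as negatives.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature   # (batch, batch) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors standing in for encoder outputs.
queries = torch.randn(8, 768)
documents = torch.randn(8, 768)
print(in_batch_contrastive_loss(queries, documents))
```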