burgerbee/txtai-en-wikipedia

Sentence Similarity


burgerbee committed 24 days ago

Commit a586bda, 1 parent: bb5d986

Update README.md

Files changed (1)

README.md +3 -3

@@ -8,7 +8,7 @@ library_name: txtai
 tags:
 - sentence-similarity
 datasets:
-- burgerbee/wikipedia-en-20240320
+- burgerbee/wikipedia-en-20241020
 ---
 # Wikipedia txtai embeddings index
 This is a [txtai](https://github.com/neuml/txtai) embeddings index (5GB embeddings + 25GB documents) for the [english edition of Wikipedia](https://en.wikipedia.org/).
@@ -16,7 +16,7 @@ This is a [txtai](https://github.com/neuml/txtai) embeddings index (5GB embeddin
 Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors.
 An embeddings index generated by txtai is a fully encapsulated index format. It dosen't require a database server.
-This index is built from the [Wikipedia march 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-en-20240320).
+This index is built from the [Wikipedia october 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-en-20241020).
 The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
 to only match commonly visited pages.
@@ -54,4 +54,4 @@ https://dumps.wikimedia.org/enwiki/
 https://dumps.wikimedia.org/other/pageview_complete/
-https://huggingface.co/datasets/burgerbee/wikipedia-en-20240320
+https://huggingface.co/datasets/burgerbee/wikipedia-en-20241020
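
Reviewer note, not part of the commit: the README text in this diff says the `percentile` field can restrict matches to commonly visited pages. A minimal sketch of such a query follows. It assumes the index loads from the Hugging Face Hub under the repository name shown at the top of this page, that `percentile` is stored on a 0 to 1 scale, and that the installed txtai version supports bind parameters in `search`. The helper names `build_query` and `search_wikipedia` are hypothetical and do not come from the README.

```python
def build_query(min_percentile: float = 0.99) -> str:
    """Build a txtai SQL query that keeps only frequently viewed pages.

    Assumption: `percentile` is a fraction in [0, 1], so 0.99 means
    "top 1% of pages by page views".
    """
    if not 0.0 <= min_percentile <= 1.0:
        raise ValueError("min_percentile must be between 0 and 1")
    # :query is a bind parameter, filled in at search time
    return (
        "SELECT id, text, score, percentile FROM txtai "
        f"WHERE similar(:query) AND percentile >= {min_percentile}"
    )


def search_wikipedia(query: str, min_percentile: float = 0.99, limit: int = 3):
    """Load the index from the Hub and run a percentile-filtered search.

    Loading downloads the full index (about 30GB per the README), so the
    txtai import is deferred until this function is actually called.
    """
    from txtai.embeddings import Embeddings

    embeddings = Embeddings()
    # Assumed container name, taken from the repository header of this page
    embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-en-wikipedia")
    return embeddings.search(
        build_query(min_percentile),
        limit=limit,
        parameters={"query": query},
    )


if __name__ == "__main__":
    # Only prints the generated SQL; call search_wikipedia() to hit the index
    print(build_query(0.99))
```

Passing the search text as a bind parameter avoids quoting problems when the text contains apostrophes. Lowering `min_percentile` widens the results to less frequently visited pages.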