Initial Model & Readme

Files changed (7)

README.md CHANGED

@@ -1,3 +1,39 @@
 ---
 license: apache-2.0
+language: "en"
+tags:
+- bag-of-words
+- dense-passage-retrieval
+- knowledge-distillation
+datasets:
+- ms_marco
 ---
+# Uni-ColBERTer (Dim: 1) for Passage Retrieval
+If you want to know more about our (Uni-)ColBERTer architecture, check out our paper: https://arxiv.org/abs/2203.13088 🎉
+For more information, source code, and a minimal usage example, please visit: https://github.com/sebastian-hofstaetter/colberter
+## Limitations & Bias
+- The model is trained only on English text.
+- The model inherits social biases from both DistilBERT and MSMARCO.
+- The model is trained only on relatively short MSMARCO passages (about 60 words on average), so it may struggle with longer text.
+## Citation
+If you use our model checkpoint, please cite our work as:
+```
+@article{Hofstaetter2022_colberter,
+ author = {Sebastian Hofst{\"a}tter and Omar Khattab and Sophia Althammer and Mete Sertkan and Allan Hanbury},
+ title = {Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction},
+ publisher = {arXiv},
+ url = {https://arxiv.org/abs/2203.13088},
+ doi = {10.48550/ARXIV.2203.13088},
+ year = {2022},
+}
+```

config.json ADDED

+{
+  "aggregate_unique_ids": true,
+  "architectures": [
+    "ColBERTer"
+  ],
+  "bert_model": "distilbert-base-uncased",
+  "compress_to_exact_mini_mode": true,
+  "compression_dim": 32,
+  "dual_loss": true,
+  "model_type": "ColBERT",
+  "retrieval_compression_dim": 128,
+  "return_vecs": false,
+  "second_compress_dim": 1,
+  "torch_dtype": "float32",
+  "trainable": true,
+  "transformers_version": "4.12.0",
+  "use_contextualized_stopwords": true
+}
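
As a quick sanity check on the configuration above, the following minimal sketch parses the JSON with the Python standard library and collects the dimension settings. The field names are taken from the file itself. The helper `load_config` is hypothetical, and reading `second_compress_dim` as the "Dim: 1" from the model card title is an assumption rather than something this commit states.

```python
import json

# Copy of the config.json added in this commit.
CONFIG_JSON = """
{
  "aggregate_unique_ids": true,
  "architectures": ["ColBERTer"],
  "bert_model": "distilbert-base-uncased",
  "compress_to_exact_mini_mode": true,
  "compression_dim": 32,
  "dual_loss": true,
  "model_type": "ColBERT",
  "retrieval_compression_dim": 128,
  "return_vecs": false,
  "second_compress_dim": 1,
  "torch_dtype": "float32",
  "trainable": true,
  "transformers_version": "4.12.0",
  "use_contextualized_stopwords": true
}
"""


def load_config(text: str) -> dict:
    """Hypothetical helper: parse the config and check the keys this sketch reads."""
    config = json.loads(text)
    required = (
        "bert_model",
        "compression_dim",
        "retrieval_compression_dim",
        "second_compress_dim",
    )
    missing = [key for key in required if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return config


config = load_config(CONFIG_JSON)

# Collect every dimension setting by its "_dim" suffix.
dims = {key: value for key, value in config.items() if key.endswith("_dim")}

print("backbone:", config["bert_model"])
print("dimensions:", dims)
```

For actual model loading, the repository linked in the README is the place to look; this sketch only inspects the JSON.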

pytorch_model.bin ADDED

+version https://git-lfs.github.com/spec/v1
+oid sha256:9509bd5319affe32c3ec101af8960edd527be036a21aab0344f3e1a1a684ef33
+size 265984167
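
The three lines above are a Git LFS pointer, not the weights themselves. Once the real file has been downloaded, its size and SHA-256 digest can be compared against the pointer. The sketch below shows one way to do that with the standard library; the function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local file path is whatever the download produced.

```python
import hashlib
from pathlib import Path

# Pointer content as added in this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:9509bd5319affe32c3ec101af8960edd527be036a21aab0344f3e1a1a684ef33
size 265984167
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the size and digest named by the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in chunks so a large checkpoint does not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("pytorch_model.bin", pointer)` on a downloaded copy would then report whether the file matches the pointer.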

special_tokens_map.json ADDED

@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

@@ -0,0 +1 @@
+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "distilbert-base-uncased", "tokenizer_class": "DistilBertTokenizer"}
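
Because the special tokens appear in both special_tokens_map.json and tokenizer_config.json, a reviewer may want to confirm that the two files agree. The following minimal sketch does that with the standard library only. The function `mismatched_special_tokens` is a hypothetical helper, not part of any tokenizer library.

```python
import json

# Contents of the two tokenizer files added in this commit.
SPECIAL_TOKENS_MAP = json.loads("""
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]",
 "cls_token": "[CLS]", "mask_token": "[MASK]"}
""")

TOKENIZER_CONFIG = json.loads("""
{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]",
 "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]",
 "tokenize_chinese_chars": true, "strip_accents": null,
 "model_max_length": 512, "special_tokens_map_file": null,
 "name_or_path": "distilbert-base-uncased",
 "tokenizer_class": "DistilBertTokenizer"}
""")


def mismatched_special_tokens(special_map: dict, tokenizer_config: dict) -> dict:
    """Return {name: (map_value, config_value)} for every token that differs."""
    return {
        name: (token, tokenizer_config.get(name))
        for name, token in special_map.items()
        if tokenizer_config.get(name) != token
    }


mismatches = mismatched_special_tokens(SPECIAL_TOKENS_MAP, TOKENIZER_CONFIG)
print(mismatches or "special tokens agree")
print("max length:", TOKENIZER_CONFIG["model_max_length"])
```

An empty result means every token in the map has the same value in the tokenizer config.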

vocab.txt ADDED

The diff for this file is too large to render. See raw diff