davidmezzetti committed
Commit 243add7
1 Parent(s): 7f2706c

Initial version

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md ADDED
@@ -0,0 +1,167 @@
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language: en
license: apache-2.0
---

# PubMedBERT Embeddings Matryoshka

This is a version of [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings) with [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) applied. This enables dynamic embedding sizes of `64`, `128`, `256`, `384`, `512`, and the full size of `768`. Note that while this method saves space, the same computational resources are used regardless of the embedding size.

Sentence Transformers 2.4 added support for Matryoshka embeddings. Read more in [this blog post](https://huggingface.co/blog/matryoshka).

## Usage (txtai)

This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).

```python
import txtai

# New embeddings with requested number of dimensions
embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-matryoshka", content=True, dimensions=256)
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```
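
The `documents()` call above is a placeholder for any iterable of content. A minimal sketch, assuming a small in-memory list of `(id, text, tags)` tuples; the sample sentences are illustrative only:

```python
import txtai

# Hypothetical sample data standing in for documents()
data = [
    (0, "Aspirin reduces the risk of cardiovascular events", None),
    (1, "Metformin is a first-line treatment for type 2 diabetes", None),
]

# Index with 256-dimensional vectors
embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-matryoshka", content=True, dimensions=256)
embeddings.index(data)

# With content=True, results include the stored text alongside the score
print(embeddings.search("diabetes medication", 1))
```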

## Usage (Sentence-Transformers)

Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).

```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("neuml/pubmedbert-base-embeddings-matryoshka")
embeddings = model.encode(sentences)

# Requested matryoshka dimensions
dimensions = 256

print(embeddings[:, :dimensions])
```
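
As a usage note, cosine similarity can be computed directly on the truncated vectors; because cosine similarity normalizes internally, no re-normalization is needed for this comparison. A short sketch using `sentence_transformers.util.cos_sim` (the sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("neuml/pubmedbert-base-embeddings-matryoshka")

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)

# Compare similarity scores at full size and at truncated sizes
for dimensions in (768, 256, 64):
    truncated = embeddings[:, :dimensions]
    print(dimensions, cos_sim(truncated[0], truncated[1]).item())
```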

## Usage (Hugging Face Transformers)

The model can also be used directly with Transformers.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def meanpooling(output, mask):
    embeddings = output[0] # First element of model_output contains all token embeddings
    mask = mask.unsqueeze(-1).expand(embeddings.size()).float()
    return torch.sum(embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("neuml/pubmedbert-base-embeddings-matryoshka")
model = AutoModel.from_pretrained("neuml/pubmedbert-base-embeddings-matryoshka")

# Tokenize sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    output = model(**inputs)

# Perform pooling. In this case, mean pooling.
embeddings = meanpooling(output, inputs['attention_mask'])

# Requested matryoshka dimensions
dimensions = 256

print("Sentence embeddings:")
print(embeddings[:, :dimensions])
```
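
If the truncated vectors will be stored for dot-product search, it is common to L2-normalize them first. A minimal sketch of that step; the random tensor below is only a stand-in for the pooled `embeddings` computed above:

```python
import torch
import torch.nn.functional as F

# Stand-in for the pooled embeddings from the block above (batch of 2, 768 dimensions)
embeddings = torch.randn(2, 768)
dimensions = 256

# Truncate to the requested Matryoshka dimensions, then L2-normalize
truncated = F.normalize(embeddings[:, :dimensions], p=2, dim=1)

# Cosine similarity is now a simple matrix product
print(truncated @ truncated.T)
```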

## Evaluation Results

Performance of this model compared to the top base models on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) is shown below. A popular smaller model was also evaluated along with the most downloaded PubMed similarity model on the Hugging Face Hub.

The following datasets were used to evaluate model performance.

- [PubMed QA](https://huggingface.co/datasets/pubmed_qa)
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/zxvix/pubmed_subset_new)
  - Split: test, Pair: (title, text)
- [PubMed Summary](https://huggingface.co/datasets/scientific_papers)
  - Subset: pubmed, Split: validation, Pair: (article, abstract)

Evaluation results from the original model are shown below for reference. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.

| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
| ----------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.86 | 94.07 | 93.44 |
| [bge-base-en-v1.5](https://hf.co/BAAI/bge-large-en-v1.5) | 91.02 | 95.60 | 94.49 | 93.70 |
| [gte-base](https://hf.co/thenlper/gte-base) | 92.97 | 96.83 | 96.24 | 95.35 |
| [**pubmedbert-base-embeddings**](https://hf.co/neuml/pubmedbert-base-embeddings) | **93.27** | **97.07** | **96.58** | **95.64** |
| [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO) | 90.86 | 93.33 | 93.54 | 92.58 |

See the table below for evaluation results per dimension for `pubmedbert-base-embeddings-matryoshka`.

| Dimensions | PubMed QA | PubMed Subset | PubMed Summary | Average |
| --------------------| --------- | ------------- | -------------- | --------- |
| Dimensions = 64 | 92.16 | 95.85 | 95.67 | 94.56 |
| Dimensions = 128 | 92.80 | 96.44 | 96.22 | 95.15 |
| Dimensions = 256 | 93.11 | 96.68 | 96.53 | 95.44 |
| Dimensions = 384 | 93.42 | 96.79 | 96.61 | 95.61 |
| Dimensions = 512 | 93.37 | 96.87 | 96.61 | 95.62 |
| **Dimensions = 768** | **93.53** | **96.95** | **96.70** | **95.73** |

At the full 768 dimensions, this model performs slightly better overall than the original model.

The bigger takeaway is how competitive it remains at lower dimensions. For example, `Dimensions = 256` outperforms every other model in the first table except the original `pubmedbert-base-embeddings`. Even `Dimensions = 64` outperforms `all-MiniLM-L6-v2` and `bge-base-en-v1.5`.

## Training

The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 20191 with parameters:
```
{'batch_size': 24, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.MatryoshkaLoss.MatryoshkaLoss` with parameters:
```
{'loss': 'MultipleNegativesRankingLoss', 'matryoshka_dims': [768, 512, 384, 256, 128, 64], 'matryoshka_weights': [1, 1, 1, 1, 1, 1]}
```

Parameters of the fit() method:
```
{
    "epochs": 1,
    "evaluation_steps": 500,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
```
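
For reference, a training setup with these parameters might look roughly like the sketch below using sentence-transformers 2.4. This is not the exact training script: the `train_examples` pairs are placeholders, and starting from `neuml/pubmedbert-base-embeddings` as the base model is an assumption.

```python
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Placeholder training pairs; the real training data is not included here
train_examples = [
    InputExample(texts=["what is aspirin used for?", "Aspirin is used to reduce pain, fever and inflammation."]),
    InputExample(texts=["metformin mechanism", "Metformin decreases hepatic glucose production."]),
]

# Assumed base model
model = SentenceTransformer("neuml/pubmedbert-base-embeddings")

dataloader = DataLoader(train_examples, shuffle=True, batch_size=24)

# MultipleNegativesRankingLoss wrapped with MatryoshkaLoss over the nested dimensions
base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 384, 256, 128, 64])

model.fit(
    train_objectives=[(dataloader, train_loss)],
    epochs=1,
    warmup_steps=10000,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```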

## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
config.json ADDED
@@ -0,0 +1,25 @@
{
  "_name_or_path": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.36.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,9 @@
{
  "__version__": {
    "sentence_transformers": "2.4.0",
    "transformers": "4.36.2",
    "pytorch": "2.1.1+cu121"
  },
  "prompts": {},
  "default_prompt_name": null
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:17ad4503287c3e240a24609eaccff9b0514ed942495a78496f2966d03145d1b6
size 437951328
modules.json ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
similarity_evaluation_results.csv ADDED
@@ -0,0 +1,2 @@
epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
-1,-1,0.9611268628744398,0.8651325788568655,0.9412334131276019,0.8650209269988058,0.9408144524772969,0.8651457143657781,0.9561560772829465,0.8651094963898324
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff