aps6992 committed on
Commit
d89a62b
·
1 Parent(s): 5fbf5d0

Update README.md

  1. README.md +156 -6
README.md CHANGED
@@ -30,14 +30,164 @@ model = AutoAdapterModel.from_pretrained("allenai/specter2_aug2023refresh_base")
30
  adapter_name = model.load_adapter("allenai/specter2_aug2023refresh_classification", source="hf", set_active=True)
31
  ```
32
 
33
- ## Architecture & Training
34
 
35
- <!-- Add some description here -->
 
36
 
37
- ## Evaluation results
 
 
 
38
 
39
- <!-- Add some description here -->
 
40
 
41
- ## Citation
42
 
43
- <!-- Add some description here -->
 
30
  adapter_name = model.load_adapter("allenai/specter2_aug2023refresh_classification", source="hf", set_active=True)
31
  ```
32
 
33
+ **\*\*\*\*\*\*Update\*\*\*\*\*\***
34
 
35
+ This update introduces a new set of SPECTER 2.0 models with the base transformer encoder pre-trained on an extended citation dataset containing more recent papers.
36
+ For benchmarking purposes, please use the existing SPECTER 2.0 models without the **aug2023refresh** suffix, such as [allenai/specter2_base](https://huggingface.co/allenai/specter2_base).
37
 
38
+ # SPECTER 2.0 (Base)
39
+ SPECTER 2.0 is the successor to [SPECTER](https://huggingface.co/allenai/specter) and can generate task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).
40
+ This is the base model to be used along with the adapters.
41
+ Given the title and abstract of a scientific paper, or a short textual query, the model generates embeddings that can be used in downstream applications.
42
 
43
+ **Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**
44
+
45
+ **To get the best performance on a downstream task type, please load the associated adapter with the base model, as in the example below.**
46
+
47
+ # Model Details
48
+
49
+ ## Model Description
50
+
51
+ SPECTER 2.0 has been trained on over 6M triplets of scientific paper citations, which are available [here](https://huggingface.co/datasets/allenai/scirepeval/viewer/cite_prediction_new/evaluation).
52
+ After that, it is further trained with additional task-format-specific adapter modules on all the [SciRepEval](https://huggingface.co/datasets/allenai/scirepeval) training tasks.
53
+
54
+ Task Formats trained on:
55
+ - Classification
56
+ - Regression
57
+ - Proximity
58
+ - Adhoc Search
59
+
60
+
61
+ It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.
62
+
63
+
64
+
65
+ - **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
66
+ - **Shared by:** Allen AI
67
+ - **Model type:** bert-base-uncased + adapters
68
+ - **License:** Apache 2.0
69
+ - **Finetuned from model:** [allenai/scibert_scivocab_uncased](https://huggingface.co/allenai/scibert_scivocab_uncased)
70
+
71
+ ## Model Sources
72
+
73
+ <!-- Provide the basic links for the model. -->
74
+
75
+ - **Repository:** [https://github.com/allenai/SPECTER2_0](https://github.com/allenai/SPECTER2_0)
76
+ - **Paper:** [https://api.semanticscholar.org/CorpusID:254018137](https://api.semanticscholar.org/CorpusID:254018137)
77
+ - **Demo:** [Usage](https://github.com/allenai/SPECTER2_0/blob/main/README.md)
78
+
79
+ # Uses
80
+
81
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
82
+
83
+ ## Direct Use
84
+
85
+ |Model|Name and HF link|Description|
86
+ |--|--|--|
87
+ |Retrieval*|[allenai/specter2_aug2023refresh_proximity](https://huggingface.co/allenai/specter2_aug2023refresh)|Encode papers as queries and candidates, e.g. Link Prediction, Nearest Neighbor Search|
88
+ |Adhoc Query|[allenai/specter2_aug2023refresh_adhoc_query](https://huggingface.co/allenai/specter2_aug2023refresh_adhoc_query)|Encode short raw text queries for search tasks. (Candidate papers can be encoded with proximity)|
89
+ |Classification|[allenai/specter2_aug2023refresh_classification](https://huggingface.co/allenai/specter2_aug2023refresh_classification)|Encode papers to feed into linear classifiers as features|
90
+ |Regression|[allenai/specter2_aug2023refresh_regression](https://huggingface.co/allenai/specter2_aug2023refresh_regression)|Encode papers to feed into linear regressors as features|
91
+
92
+ *The retrieval model should suffice for downstream task types not mentioned above.
93
+
94
+ ```python
95
+ from transformers import AutoTokenizer, AutoModel
96
+
97
+ # load model and tokenizer
98
+ tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_aug2023refresh_base')
99
+
100
+ #load base model
101
+ model = AutoModel.from_pretrained('allenai/specter2_aug2023refresh_base')
102
+
103
+ #load the adapter(s) as per the required task, provide an identifier for the adapter in load_as argument and activate it
104
+ model.load_adapter("allenai/specter2_aug2023refresh_classification", source="hf", load_as="specter2_classification", set_active=True)
105
+
106
+ papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
107
+ {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
108
+
109
+ # concatenate title and abstract
110
+ text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
111
+ # preprocess the input
112
+ inputs = tokenizer(text_batch, padding=True, truncation=True,
113
+ return_tensors="pt", return_token_type_ids=False, max_length=512)
114
+ output = model(**inputs)
115
+ # take the first token ([CLS]) of each sequence as its embedding
116
+ embeddings = output.last_hidden_state[:, 0, :]
117
+ ```
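The retrieval row in the table above mentions nearest neighbor search. As a rough sketch of that step, the snippet below ranks candidates by cosine similarity. The three-dimensional vectors are toy stand-ins for real model embeddings, and the helper names `cosine_similarity` and `rank_candidates` are hypothetical, not part of any SPECTER 2.0 API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return (candidate index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings of a query paper and three candidates.
query_embedding = [1.0, 0.0, 1.0]
candidate_embeddings = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.9], [0.5, 0.5, 0.5]]

ranking = rank_candidates(query_embedding, candidate_embeddings)
print([index for index, _ in ranking])
```

With real embeddings, each row of the `embeddings` tensor would be converted to a list (or the same computation done with tensor operations) before ranking.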
118
+
119
+ ## Downstream Use
120
+
121
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
122
+
123
+ For evaluation and downstream usage, please refer to [https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md](https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md).
124
+
125
+ # Training Details
126
+
127
+ ## Training Data
128
+
129
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
130
+
131
+ The base model is trained on citation links between papers, and the adapters are trained on 8 large-scale tasks across the four formats.
132
+ All the data is part of the SciRepEval benchmark and is available [here](https://huggingface.co/datasets/allenai/scirepeval).
133
+
134
+ The citation links are triplets of the form
135
+
136
+ ```json
137
+ {"query": {"title": ..., "abstract": ...}, "pos": {"title": ..., "abstract": ...}, "neg": {"title": ..., "abstract": ...}}
138
+ ```
139
+
140
+ consisting of a query paper, a positive (a paper cited by the query) and a negative. The negative can come from the same field of study as the query, from a different field, or be a citation of a citation.
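To illustrate how such triplets can drive training, here is a minimal sketch of a triplet margin loss, the kind of objective described in the SPECTER paper. This card does not state the loss used for SPECTER 2.0, so treat the L2 distance, the margin of 1.0, and the function names as assumptions for illustration only.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(query, pos, neg, margin=1.0):
    """Zero once the positive is closer to the query than the negative by at least `margin`."""
    return max(0.0, l2_distance(query, pos) - l2_distance(query, neg) + margin)


# Toy 2-d vectors standing in for embeddings of the query, positive and negative papers.
query = [0.0, 0.0]
positive = [0.0, 1.0]
easy_negative = [3.0, 4.0]   # far from the query, so it contributes no loss
hard_negative = [0.0, 1.5]   # nearly as close as the positive, so the loss is non-zero

print(triplet_margin_loss(query, positive, easy_negative))
print(triplet_margin_loss(query, positive, hard_negative))
```

Negatives that sit close to the query, such as a citation of a citation, are the ones that keep the loss non-zero and therefore shape the embedding space most.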
141
+
142
+ ## Training Procedure
143
+
144
+ Please refer to the [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677).
145
+
146
+
147
+ ### Training Hyperparameters
148
+
149
+
150
+ The model is trained in two stages using [SciRepEval](https://github.com/allenai/scirepeval/blob/main/training/TRAINING.md):
151
+ - Base Model: First, a base model is trained on the above citation triplets.
152
+ ``` batch size = 1024, max input length = 512, learning rate = 2e-5, epochs = 2, warmup steps = 10%, fp16```
153
+ - Adapters: Thereafter, task-format-specific adapters are trained on the SciRepEval training tasks, with 600K triplets sampled from the above added to the training data.
154
+ ``` batch size = 256, max input length = 512, learning rate = 1e-4, epochs = 6, warmup = 1000 steps, fp16```
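As a worked example of what "warmup steps = 10%" could mean for the base model stage, the sketch below derives step counts from the listed hyperparameters. It assumes exactly 6,000,000 triplets (the card only says "over 6M"), one optimizer step per batch with no gradient accumulation, and a hypothetical helper `schedule_steps`; the real training setup may differ.

```python
import math


def schedule_steps(num_examples, batch_size, epochs, warmup_percent):
    """Return (total optimizer steps, warmup steps) for a simple schedule."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # Integer arithmetic avoids floating point rounding surprises.
    warmup_steps = total_steps * warmup_percent // 100
    return total_steps, warmup_steps


# Assumed dataset size of 6,000,000 triplets; batch size, epochs and warmup from the list above.
total_steps, warmup_steps = schedule_steps(6_000_000, 1024, 2, 10)
print(total_steps, warmup_steps)
```

Under these assumptions the base stage runs for roughly 11.7K steps, of which about 1.2K are warmup.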
155
+
156
+
157
+ # Evaluation
158
+
159
+ We evaluate the model on [SciRepEval](https://github.com/allenai/scirepeval), a large-scale evaluation benchmark for scientific embedding tasks that includes SciDocs as a subset.
160
+ We also evaluate and establish a new SoTA on [MDCR](https://github.com/zoranmedic/mdcr), a large-scale citation recommendation benchmark.
161
+
162
+ |Model|SciRepEval In-Train|SciRepEval Out-of-Train|SciRepEval Avg|MDCR(MAP, Recall@5)|
163
+ |--|--|--|--|--|
164
+ |[BM-25](https://api.semanticscholar.org/CorpusID:252199740)|n/a|n/a|n/a|(33.7, 28.5)|
165
+ |[SPECTER](https://huggingface.co/allenai/specter)|54.7|57.4|68.0|(30.6, 25.5)|
166
+ |[SciNCL](https://huggingface.co/malteos/scincl)|55.6|57.8|69.0|(32.6, 27.3)|
167
+ |[SciRepEval-Adapters](https://huggingface.co/models?search=scirepeval)|61.9|59.0|70.9|(35.3, 29.6)|
168
+ |[SPECTER 2.0-Adapters](https://huggingface.co/models?search=allenai/specter-2)|**62.3**|**59.2**|**71.2**|**(38.4, 33.0)**|
169
+
170
+ Please cite the following works if you end up using SPECTER 2.0:
171
+
172
+ [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677):
173
+
174
+ ```bibtex
175
+ @inproceedings{specter2020cohan,
176
+ title={{SPECTER: Document-level Representation Learning using Citation-informed Transformers}},
177
+ author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
178
+ booktitle={ACL},
179
+ year={2020}
180
+ }
181
+ ```
182
+ [SciRepEval paper](https://api.semanticscholar.org/CorpusID:254018137)
183
+ ```bibtex
184
+ @article{Singh2022SciRepEvalAM,
185
+ title={SciRepEval: A Multi-Format Benchmark for Scientific Document Representations},
186
+ author={Amanpreet Singh and Mike D'Arcy and Arman Cohan and Doug Downey and Sergey Feldman},
187
+ journal={ArXiv},
188
+ year={2022},
189
+ volume={abs/2211.13308}
190
+ }
191
+ ```
192
 
 
193