---
license: apache-2.0
datasets:
- allenai/mslr2022
language:
- en
pipeline_tag: summarization
---
|
|
|
# PubMedBERT for biomedical extractive summarization
|
|
|
## Description

Work done for my [Bachelor's thesis](https://amslaurea.unibo.it/id/eprint/29686).

[PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) fine-tuned
on [MS^2](https://github.com/allenai/mslr-shared-task) for extractive summarization.\
The model architecture is similar to [BERTSum](https://github.com/nlpyang/BertSum).\
The training code is available at [biomed-ext-summ](https://github.com/NotXia/biomed-ext-summ).
|
|
|
## Usage

```python
from transformers import AutoTokenizer, pipeline

summarizer = pipeline(
    "summarization",
    model="NotXia/pubmedbert-bio-ext-summ",
    tokenizer=AutoTokenizer.from_pretrained("NotXia/pubmedbert-bio-ext-summ"),
    trust_remote_code=True,
    device=0
)

sentences = ["sent1.", "sent2.", "sent3?"]
summarizer({"sentences": sentences}, strategy="count", strategy_args=2)
>>> (['sent1.', 'sent2.'], [0, 1])
```
|
|
|
### Strategies

Strategies to summarize the document:
- `length`: the summary has at most a given length (`strategy_args` is the maximum length).
- `count`: the summary contains a given number of sentences (`strategy_args` is the number of sentences).
- `ratio`: the summary length is proportional to the document length (`strategy_args` is the ratio, in [0, 1]).
- `threshold`: the summary contains only the sentences whose score exceeds a given value (`strategy_args` is the minimum score).
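For illustration, the selection step behind these strategies can be sketched in plain Python. This is a hypothetical sketch using made-up per-sentence scores and an illustrative `select_sentences` helper, not the model's actual remote code, which may differ in details (e.g. how `length` is measured):

```python
def select_sentences(sentences, scores, strategy, strategy_args):
    """Pick summary sentences from per-sentence scores (illustrative sketch).

    Returns the selected sentences (in document order) and their indices,
    mirroring the (sentences, indices) output shape of the pipeline.
    """
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    if strategy == "count":
        # Keep the top-k sentences.
        chosen = ranked[:strategy_args]
    elif strategy == "ratio":
        # Keep a fraction of the document's sentences.
        chosen = ranked[:max(1, int(len(sentences) * strategy_args))]
    elif strategy == "threshold":
        # Keep every sentence scoring above the cutoff.
        chosen = [i for i in ranked if scores[i] > strategy_args]
    elif strategy == "length":
        # Greedily add top-scoring sentences while the character budget allows.
        chosen, total = [], 0
        for i in ranked:
            if total + len(sentences[i]) <= strategy_args:
                chosen.append(i)
                total += len(sentences[i])
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    chosen = sorted(chosen)  # restore original document order
    return [sentences[i] for i in chosen], chosen


sents = ["sent1.", "sent2.", "sent3?"]
print(select_sentences(sents, [0.9, 0.7, 0.2], "count", 2))
# (['sent1.', 'sent2.'], [0, 1])
```

With `strategy="threshold"` the same scores and a cutoff of 0.5 would keep only the first two sentences, since only they score above the cutoff.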