BlueBert-Base, Uncased, PubMed

Model description

A BERT model pre-trained on PubMed abstracts.

Intended uses & limitations

How to use

Please see https://github.com/ncbi-nlp/bluebert
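For a quick start outside the repository above, the following is a minimal sketch of loading the checkpoint with the Hugging Face `transformers` library. It assumes that `transformers` and PyTorch are installed and that the checkpoint is published on the Hub under the id `bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12`. The helper names `load_bluebert` and `embed` are hypothetical and are not part of the BlueBERT code.

```python
# Hub id assumed for this checkpoint; adjust it if the model lives elsewhere.
MODEL_ID = "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12"


def load_bluebert(model_id: str = MODEL_ID):
    """Return a (tokenizer, model) pair for the given Hub model id."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


def embed(text: str, tokenizer, model):
    """Return the final-layer hidden states for one piece of text."""
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state
```

Calling `load_bluebert()` downloads the weights on first use. Text passed to `embed` would presumably work best when it has been preprocessed in the same way as the pre-training corpus.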

Training data

We provide preprocessed PubMed texts that were used to pre-train the BlueBERT models. The corpus contains ~4000M words extracted from the PubMed ASCII code version.

Pre-trained model: https://huggingface.co/bert-base-uncased

Training procedure

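The text is lowercased, line breaks and non-ASCII characters are replaced with spaces, and the result is tokenized with the NLTK Treebank tokenizer. The code snippet below shows the details.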

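import re
from nltk.tokenize import TreebankWordTokenizer

# `value` holds the raw text: lowercase it, then replace line breaks and non-ASCII runs with spaces.
value = value.lower()
value = re.sub(r'[\r\n]+', ' ', value)
value = re.sub(r'[^\x00-\x7F]+', ' ', value)

# Tokenize with the NLTK Treebank tokenizer, rejoin, and reattach possessive 's.
tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)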
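As a rough illustration of what these steps do to a string, the sketch below wraps them in a function. The name `preprocess` and its `tokenize` parameter are hypothetical and not part of the BlueBERT code. The default `str.split` is only a stand-in so that the example runs without NLTK; passing `TreebankWordTokenizer().tokenize` instead should follow the snippet above more closely.

```python
import re
from typing import Callable, List


def preprocess(value: str, tokenize: Callable[[str], List[str]] = str.split) -> str:
    """Apply the normalization steps from the snippet above to one string."""
    value = value.lower()
    # Replace line breaks and runs of non-ASCII characters with a single space.
    value = re.sub(r'[\r\n]+', ' ', value)
    value = re.sub(r'[^\x00-\x7F]+', ' ', value)

    # Tokenize (whitespace split by default, an assumption) and rejoin with spaces.
    sentence = ' '.join(tokenize(value))
    # Reattach a possessive 's that was separated from its noun.
    return re.sub(r"\s's\b", "'s", sentence)


print(preprocess("The \u03b2-catenin\r\nLevel"))  # the -catenin level
print(preprocess("Patient 's Age"))               # patient's age
```

In the first call the Greek letter is non-ASCII and is replaced by a space, which the whitespace split then collapses.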

BibTeX entry and citation info

@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}