BlueBERT-Base, Uncased, PubMed

Model description

A BERT model pre-trained on PubMed abstracts.

Intended uses & limitations

How to use

Please see https://github.com/ncbi-nlp/bluebert
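Because the checkpoint is hosted on the Hugging Face Hub, one possible way to load it is through the transformers library. The following is a minimal sketch, not the authors' documented procedure. It assumes that transformers and PyTorch are installed and that the identifier bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12 (the name under which this page is published) resolves on the Hub. The helper name load_bluebert is hypothetical.

```python
# Assumption: this Hub identifier is taken from the model page itself.
MODEL_ID = "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12"


def load_bluebert(model_id: str = MODEL_ID):
    """Return a (tokenizer, model) pair for the given Hub identifier.

    Hypothetical helper for illustration; it downloads weights on first call.
    """
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


# Example usage (requires network access, so it is left commented out):
# tokenizer, model = load_bluebert()
# inputs = tokenizer("the patient was treated with aspirin .", return_tensors="pt")
# outputs = model(**inputs)
# print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)
```

Input text should probably be preprocessed the same way as the pre-training corpus (see Training procedure) before it is passed to the tokenizer.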

Training data

We provide preprocessed PubMed texts that were used to pre-train the BlueBERT models. The corpus contains ~4000M words extracted from the PubMed ASCII code version.

Pre-trained model: https://huggingface.co/bert-base-uncased

Training procedure

lowercasing the text
removing special characters outside the ASCII range \x00-\x7F
tokenizing the text using the NLTK Treebank tokenizer

Below is a code snippet for more details.

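import re
from nltk.tokenize import TreebankWordTokenizer

# value holds one raw text string; lowercase it first
value = value.lower()
# collapse line breaks into single spaces
value = re.sub(r'[\r\n]+', ' ', value)
# replace every run of non-ASCII characters with a space
value = re.sub(r'[^\x00-\x7F]+', ' ', value)

tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
# reattach the possessive 's that the tokenizer splits off
sentence = re.sub(r"\s's\b", "'s", sentence)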
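To apply the same steps to new text, they can be wrapped in a function. The sketch below is an illustration under stated assumptions: the function name preprocess and its tokenize parameter are hypothetical, and the default tokenizer is plain whitespace splitting so that the example runs with the standard library only. To match the procedure above, pass TreebankWordTokenizer().tokenize from NLTK instead.

```python
import re
from typing import Callable, List


def preprocess(value: str, tokenize: Callable[[str], List[str]] = str.split) -> str:
    """Lowercase, strip non-ASCII characters, tokenize, and rejoin the text.

    `tokenize` defaults to whitespace splitting (an assumption made for this
    sketch); the original procedure uses the NLTK Treebank tokenizer.
    """
    value = value.lower()
    # Collapse line breaks into single spaces.
    value = re.sub(r'[\r\n]+', ' ', value)
    # Replace every run of non-ASCII characters with a space.
    value = re.sub(r'[^\x00-\x7F]+', ' ', value)
    sentence = ' '.join(tokenize(value))
    # Reattach a possessive 's that was split off as its own token.
    return re.sub(r"\s's\b", "'s", sentence)


print(preprocess("Caf\u00e9\r\nPatient 's BRCA1"))  # prints: caf patient's brca1
```

Injecting the tokenizer keeps the cleaning logic testable on its own; with NLTK installed, preprocess(text, TreebankWordTokenizer().tokenize) should follow the snippet above step for step.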

BibTeX entry and citation info

@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}
