BlueBert-Base, Uncased, PubMed
Model description
A BERT model pre-trained on PubMed abstracts.
Intended uses & limitations
How to use
Please see https://github.com/ncbi-nlp/bluebert
Training data
We provide preprocessed PubMed texts that were used to pre-train the BlueBERT models. The corpus contains ~4000M words extracted from the PubMed ASCII code version.
Pre-trained model: https://huggingface.co/bert-base-uncased
Training procedure
- lowercasing the text
- removing non-ASCII characters (anything outside the range \x00-\x7F)
- tokenizing the text using the NLTK Treebank tokenizer
Below is a code snippet for more details.
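import re
from nltk.tokenize import TreebankWordTokenizer

value = value.lower()  # lowercase the text
value = re.sub(r'[\r\n]+', ' ', value)  # replace line breaks with a single space
value = re.sub(r'[^\x00-\x7F]+', ' ', value)  # replace runs of non-ASCII characters with a space
tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)  # reattach the possessive 's split off by the tokenizer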
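Text passed to the model at fine-tuning or inference time will usually behave best when it has been through the same preprocessing. The following is a minimal sketch that wraps the steps above in a reusable function. The function name `preprocess` and the sample sentence are illustrative assumptions, not part of the BlueBERT code base, and the sketch assumes the `nltk` package is installed.

```python
import re

from nltk.tokenize import TreebankWordTokenizer

# The Treebank tokenizer is rule-based, so one instance can be reused.
_TOKENIZER = TreebankWordTokenizer()


def preprocess(text: str) -> str:
    """Hypothetical helper: apply the preprocessing steps to one piece of text."""
    text = text.lower()                          # lowercase
    text = re.sub(r"[\r\n]+", " ", text)         # line breaks -> single space
    text = re.sub(r"[^\x00-\x7F]+", " ", text)   # non-ASCII runs -> single space
    tokens = _TOKENIZER.tokenize(text)           # Treebank-style tokens
    sentence = " ".join(tokens)
    # The tokenizer splits "patient's" into "patient" and "'s"; rejoin them.
    return re.sub(r"\s's\b", "'s", sentence)


if __name__ == "__main__":
    # Illustrative input containing a line break, a non-ASCII character,
    # and a possessive.
    print(preprocess("Naïve T-cells\nwere isolated from the patient's blood."))
```

Note that non-ASCII characters are replaced by spaces rather than transliterated, so a word such as "café" comes out as "caf".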
BibTeX entry and citation info
@InProceedings{peng2019transfer,
author = {Yifan Peng and Shankai Yan and Zhiyong Lu},
title = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
year = {2019},
pages = {58--65},
}