MediAlbertina
The first publicly available medical language model trained with real European Portuguese data.
MediAlbertina is a family of DeBERTaV2-based encoders from the BERT family, obtained by continuing the pre-training of PORTULAN's Albertina models on Electronic Medical Records (EMRs) shared by Portugal's largest public hospital.
Like its predecessors, MediAlbertina models are distributed under the MIT license.
Model Description
MediAlbertina PT-PT 900M NER was created through fine-tuning of MediAlbertina PT-PT 900M on real European Portuguese EMRs that have been hand-annotated for the following entities:
- Diagnostico (D): All types of diseases and conditions following the ICD-10-CM guidelines.
- Sintoma (S): Any complaints or evidence from healthcare professionals indicating that a patient is experiencing a medical condition.
- Medicamento (M): Something that is administered to the patient (through any route), including drugs, specific food/drink, vitamins, or blood for transfusion.
- Dosagem (DO): Dosage and frequency of medication administration.
- ProcedimentoMedico (PM): Anything healthcare professionals do related to patients, including exams, moving patients, administering something, or even surgeries.
- SinalVital (SV): Quantifiable indicators in a patient that can be measured, always associated with a specific result. Examples include cholesterol levels, diuresis, weight, or glycaemia.
- Resultado (R): Results can be associated with Medical Procedures and Vital Signs. It can be a numerical value if something was measured (e.g., the value associated with blood pressure) or a descriptor to indicate the result (e.g., positive/negative, functional).
- Progresso (P): Describes the progress of the patient’s condition. Typically, it includes verbs like improving, evolving, or regressing, and mentions of the patient’s stability.
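The entity types above map onto a BIO tagging scheme, which is how the per-label results for this model are reported (`B-` marks the first token of an entity, `I-` the tokens that continue it). The sketch below builds such a label set from the entity names and the abbreviations used in the results table. The ordering and the `O` label are assumptions for illustration, and the label names stored in the model's own configuration may differ.

```python
# Entity names from the list above, keyed by the abbreviations used in the
# results table. The dictionary order is an illustrative assumption.
ENTITY_TYPES = {
    "D": "Diagnostico",
    "S": "Sintoma",
    "PM": "ProcedimentoMedico",
    "SV": "SinalVital",
    "R": "Resultado",
    "M": "Medicamento",
    "DO": "Dosagem",
    "P": "Progresso",
}


def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for every abbreviation X."""
    labels = ["O"]  # assumed label for tokens outside any entity
    for abbreviation in entity_types:
        labels.append(f"B-{abbreviation}")
        labels.append(f"I-{abbreviation}")
    return labels


BIO_LABELS = build_bio_labels(ENTITY_TYPES)
LABEL2ID = {label: index for index, label in enumerate(BIO_LABELS)}

print(len(BIO_LABELS))
print(BIO_LABELS[:5])
```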
MediAlbertina PT-PT 900M NER achieved higher F1 scores than the same adaptation applied to a non-medical Portuguese language model on 13 of the 16 labels in the table below, demonstrating the effectiveness of this domain adaptation and its potential for medical AI in Portugal.
Model (F1 score) | B-D | I-D | B-S | I-S | B-PM | I-PM | B-SV | I-SV | B-R | I-R | B-M | I-M | B-DO | I-DO | B-P | I-P |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
albertina-900m-portuguese-ptpt-encoder | 0.721 | 0.786 | 0.734 | 0.775 | 0.737 | 0.805 | 0.859 | 0.811 | 0.803 | 0.816 | 0.913 | 0.871 | 0.853 | 0.895 | 0.769 | 0.785 |
medialbertina_pt-pt_900m | 0.799 | 0.832 | 0.754 | 0.782 | 0.786 | 0.813 | 0.916 | 0.788 | 0.821 | 0.83 | 0.926 | 0.895 | 0.85 | 0.885 | 0.779 | 0.807 |
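To summarize the table in a single number, one option is an unweighted mean of the 16 per-label F1 scores. The sketch below computes it from the values copied from the table. This macro-style average is an illustration only, not a figure reported by the authors, and it ignores how often each label occurs in the data.

```python
# Per-label F1 scores copied from the table above, in column order.
LABELS = ["B-D", "I-D", "B-S", "I-S", "B-PM", "I-PM", "B-SV", "I-SV",
          "B-R", "I-R", "B-M", "I-M", "B-DO", "I-DO", "B-P", "I-P"]
ALBERTINA = [0.721, 0.786, 0.734, 0.775, 0.737, 0.805, 0.859, 0.811,
             0.803, 0.816, 0.913, 0.871, 0.853, 0.895, 0.769, 0.785]
MEDIALBERTINA = [0.799, 0.832, 0.754, 0.782, 0.786, 0.813, 0.916, 0.788,
                 0.821, 0.83, 0.926, 0.895, 0.85, 0.885, 0.779, 0.807]


def unweighted_mean(scores):
    """Plain arithmetic mean; every label counts equally."""
    return sum(scores) / len(scores)


baseline_mean = unweighted_mean(ALBERTINA)
medical_mean = unweighted_mean(MEDIALBERTINA)

# Labels on which the non-medical baseline scores higher than MediAlbertina.
baseline_better = [
    label
    for label, base, medical in zip(LABELS, ALBERTINA, MEDIALBERTINA)
    if base > medical
]

print(f"albertina mean F1:     {baseline_mean:.3f}")
print(f"medialbertina mean F1: {medical_mean:.3f}")
print(f"baseline higher on:    {baseline_better}")
```

On these values the medical model has the higher mean (about 0.829 versus 0.808), while the baseline keeps a small lead on I-SV, B-DO, and I-DO.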
Data
MediAlbertina PT-PT 900M NER was fine-tuned on about 10k hand-annotated medical entities from about 4k fully anonymized medical sentences from Portugal's largest public hospital. This data was acquired under the framework of the FCT project DSAIPA/AI/0122/2020 AIMHealth-Mobile Applications Based on Artificial Intelligence.
How to use
from transformers import pipeline
# Load the fine-tuned NER model; the aggregation strategy merges sub-word tokens into entity spans
ner_pipeline = pipeline('ner', model='portugueseNLP/medialbertina_pt-pt_900m_NER', aggregation_strategy='average')
sentence = 'Durante o procedimento endoscópico, foram encontrados pólipos no cólon do paciente.'
entities = ner_pipeline(sentence)
# Print each predicted entity type followed by the matching text span
for entity in entities:
    print(f"{entity['entity_group']} - {sentence[entity['start']:entity['end']]}")
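With an aggregation strategy set, the pipeline returns a list of dictionaries, one per entity, with keys such as `entity_group`, `score`, `word`, `start`, and `end`. A common next step is to keep only confident predictions and group the matched text by entity type. The sketch below does this with a hypothetical helper, `group_entities`, and a hand-written sample list that only mimics the output format: it is not real model output, and the label names, scores, and the 0.5 threshold are assumptions.

```python
from collections import defaultdict


def group_entities(sentence, entities, min_score=0.5):
    """Group entity text spans by type, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            span = sentence[entity["start"]:entity["end"]]
            grouped[entity["entity_group"]].append(span)
    return dict(grouped)


def make_sample(sentence, label, text, score):
    """Build a fake entity dict in the pipeline's output shape (illustration only)."""
    start = sentence.index(text)
    return {"entity_group": label, "score": score, "word": text,
            "start": start, "end": start + len(text)}


sentence = "Durante o procedimento endoscópico, foram encontrados pólipos no cólon do paciente."

# Hand-written stand-in for ner_pipeline(sentence); NOT real model predictions.
sample_entities = [
    make_sample(sentence, "ProcedimentoMedico", "procedimento endoscópico", 0.98),
    make_sample(sentence, "Diagnostico", "pólipos", 0.91),
    make_sample(sentence, "Diagnostico", "cólon", 0.30),  # below the threshold
]

grouped = group_entities(sentence, sample_entities)
for label, spans in grouped.items():
    print(f"{label}: {spans}")
```

With the real model, pass the result of `ner_pipeline(sentence)` in place of `sample_entities`.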
Citation
MediAlbertina is developed by a joint team from ISCTE-IUL, Portugal, and Select Data, CA, USA. For a fully detailed description, see the corresponding publication:
@article{MediAlbertina_PT-PT,
title={MediAlbertina: An European Portuguese medical language model},
author={Miguel Nunes and João Boné and João Ferreira
and Pedro Chaves and Luís Elvas},
year={2024},
journal={CBM},
volume={182},
url={https://doi.org/10.1016/j.compbiomed.2024.109233}
}
Please use the above canonical reference when using or citing this model.
Acknowledgements
This work was financially supported by Project Blockchain.PT – Decentralize Portugal with Blockchain Agenda (Project no 51), WP2, Call no 02/C05-i01.01/2022, funded by the Portuguese Recovery and Resilience Program (PRR), the Portuguese Republic, and the European Union (EU) under the framework of the Next Generation EU Program.