Korean Medical DPR (Dense Passage Retrieval)
1. Intro
A Bi-Encoder retrieval model that can be used in the medical domain.
To handle medical records written in mixed Korean and English, it uses SapBERT-KO-EN as its base model.
Questions are encoded with the Question Encoder, and passages with the Context Encoder.
- Question Encoder : https://huggingface.co/snumin44/medical-biencoder-ko-bert-question
(※ This model was trained on the 초거대 AI 헬스케어 질의응답 데이터 (ultra-large AI healthcare question-answering data) from AI Hub.)
2. Model
(1) Self Alignment Pretraining (SAP)
Korean medical records are written in a mix of Korean and English, so a model that can also recognize English terms is needed.
Using Multi-Similarity Loss, the model was trained so that terms sharing the same concept code have high mutual similarity (a quick similarity check follows the links below).
e.g.) C3843080 || 고혈압 질환 (Korean term for hypertensive disease)
      C3843080 || Hypertension
      C3843080 || High Blood Pressure
      C3843080 || HTN
      C3843080 || HBP
- SapBERT-KO-EN : https://huggingface.co/snumin44/sap-bert-ko-en
- Github : https://github.com/snumin44/SapBERT-KO-EN
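As a quick sanity check (not part of the original card; using the CLS vector as the term embedding is an assumption for illustration), you can embed a Korean and an English alias of the same concept with SapBERT-KO-EN and confirm they land close together:

```python
# Hedged sketch: compare two aliases of one concept code with SapBERT-KO-EN.
# The CLS-token embedding is an assumption for illustration.
import numpy as np
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('snumin44/sap-bert-ko-en')
tokenizer = AutoTokenizer.from_pretrained('snumin44/sap-bert-ko-en')

terms = ['고혈압', 'Hypertension']  # Korean and English aliases of one concept
features = tokenizer(terms, padding=True, return_tensors='pt')
embeddings = model(**features, return_dict=True).last_hidden_state[:, 0, :].detach().numpy()

a, b = embeddings
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # expected: high
```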
(2) Dense Passage Retrieval (DPR)
To use SapBERT-KO-EN as a retrieval model, additional fine-tuning is required.
It was fine-tuned in the DPR fashion: a Bi-Encoder computes the similarity between questions and passages.
As in the example below, the original dataset was augmented with Korean-English code-switched samples, and both were used for training (a sketch of the augmentation closes this subsection).
e.g.) Korean disease name: 고혈압
      English disease name: Hypertension
      Question (original): 아버지가 고혈압인데 그게 뭔지 모르겠어. 고혈압이 뭔지 설명 좀 해줘.
      Question (augmented): 아버지가 Hypertension인데 그게 뭔지 모르겠어. Hypertension이 뭔지 설명 좀 해줘.
      (Both versions ask: "My father has hypertension, but I don't know what that is. Please explain what hypertension is.")
- Github : https://github.com/snumin44/DPR-KO
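The sketch below illustrates that augmentation step. It is a simplification that assumes each question comes with a known Korean/English disease-name pair; the actual pipeline lives in the DPR-KO repository linked above.

```python
# Hypothetical helper: produce a code-switched variant of a question by
# swapping the Korean disease name for its English alias. Illustration only;
# see the DPR-KO repository for the real augmentation pipeline.
def augment_query(query: str, ko_name: str, en_name: str) -> str:
    return query.replace(ko_name, en_name)

original = '아버지가 고혈압인데 그게 뭔지 모르겠어. 고혈압이 뭔지 설명 좀 해줘.'
print(augment_query(original, '고혈압', 'Hypertension'))
# -> 아버지가 Hypertension인데 그게 뭔지 모르겠어. Hypertension이 뭔지 설명 좀 해줘.
```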
3. Training
(1) Self Alignment Pretraining (SAP)
The base model and hyperparameters used to train SapBERT-KO-EN are as follows.
KOSTOM (Korean Standard Terminology of Medicine), a medical terminology dictionary covering both Korean and English terms, was used as the training data.
- Model : klue/bert-base
- Dataset : KOSTOM
- Epochs : 1
- Batch Size : 64
- Max Length : 64
- Dropout : 0.1
- Pooler : 'cls'
- Eval Step : 100
- Threshold : 0.8
- Scale Positive Sample : 1
- Scale Negative Sample : 60
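If Threshold and the two Scale values correspond to the base, alpha, and beta parameters of Multi-Similarity Loss (a plausible mapping, not something the card confirms), the objective could be set up with pytorch-metric-learning roughly like this:

```python
# Hedged sketch: the listed SAP hyperparameters wired into Multi-Similarity
# Loss via pytorch-metric-learning. Mapping Threshold -> base and the Scale
# values -> alpha/beta is an assumption for illustration.
import torch
from pytorch_metric_learning import losses

loss_func = losses.MultiSimilarityLoss(
    alpha=1,   # Scale Positive Sample
    beta=60,   # Scale Negative Sample
    base=0.8,  # Threshold
)

# Terms sharing a concept code get the same label, so every pair of aliases
# counts as a positive and terms from other codes act as negatives.
embeddings = torch.randn(6, 768)           # stand-in for encoder outputs
labels = torch.tensor([0, 0, 0, 0, 0, 1])  # five aliases of one code + one other term
loss = loss_func(embeddings, labels)
```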
(2) Dense Passage Retrieval (DPR)
The base model and hyperparameters used for fine-tuning are as follows.
- Model : SapBERT-KO-EN(klue/bert-base)
- Dataset : 초거대 AI 헬스케어 질의응답 데이터 (ultra-large AI healthcare question-answering data, AI Hub)
- Epochs : 10
- Batch Size : 64
- Dropout : 0.1
- Pooler : 'cls'
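For reference, DPR trains with in-batch negatives: the gold passage of every other question in the batch serves as a negative. A minimal sketch of that objective (simplified; the actual training code is in the DPR-KO repository):

```python
# Simplified sketch of DPR's in-batch negative objective. Row i of p_emb is
# the gold passage for question i; all other rows serve as negatives.
import torch
import torch.nn.functional as F

def in_batch_nll(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    scores = q_emb @ p_emb.T               # (batch, batch) similarity matrix
    targets = torch.arange(q_emb.size(0))  # positives sit on the diagonal
    return F.cross_entropy(scores, targets)

# Batch size 64, matching the hyperparameters above (768-dim BERT embeddings).
loss = in_batch_nll(torch.randn(64, 768), torch.randn(64, 768))
```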
4. Example
This model encodes questions; it must be used together with the Context model.
You can see that a question and a passage about the same disease receive a high similarity score.
(※ The sample passages in the code below are medical notes generated with ChatGPT.)
(※ Given the nature of the training data, the model works better on text that is more curated than these samples.)
```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

# Question Model
q_model_path = 'snumin44/medical-biencoder-ko-bert-question'
q_model = AutoModel.from_pretrained(q_model_path)
q_tokenizer = AutoTokenizer.from_pretrained(q_model_path)

# Context Model
c_model_path = 'snumin44/medical-biencoder-ko-bert-context'
c_model = AutoModel.from_pretrained(c_model_path)
c_tokenizer = AutoTokenizer.from_pretrained(c_model_path)

query = 'high blood pressure 처방 사례'  # "high blood pressure prescription cases"

targets = [
    # Hypertension diagnosis and amlodipine prescription note
    """고혈압 진단.
    환자 상담 및 생활습관 교정 권고. 저염식, 규칙적인 운동, 금연, 금주 지시.
    환자 재방문. 혈압: 150/95mmHg. 약물치료 시작. Amlodipine 5mg 1일 1회 처방.""",

    # Endoscopy note for a bleeding gastric ulcer
    """응급실 도착 후 위 내시경 진행.
    소견: Gastric ulcer에서 Forrest IIb 관찰됨. 출혈은 소량의 삼출성 출혈 형태.
    처치: 에피네프린 주사로 출혈 감소 확인. Hemoclip 2개로 출혈 부위 클리핑하여 지혈 완료.""",

    # Imaging note: fatty liver, gallstones, renal cyst
    """혈중 높은 지방 수치 및 지방간 소견.
    다발성 gallstones 확인. 증상 없을 경우 경과 관찰 권장.
    우측 renal cyst, 양성 가능성 높으며 추가적인 처치 불필요 함."""
]

query_feature = q_tokenizer(query, return_tensors='pt')
query_outputs = q_model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

for idx, target in enumerate(targets):
    target_feature = c_tokenizer(target, return_tensors='pt')
    target_outputs = c_model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```
Output:
```
Similarity between query and target 0: 0.2674
Similarity between query and target 1: 0.0416
Similarity between query and target 2: 0.0476
```
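Beyond a handful of passages, you would normally precompute the context embeddings once and query them through a vector index. Continuing from the snippet above, a minimal sketch with FAISS (using faiss-cpu here is an assumption; any vector index works):

```python
# Hypothetical retrieval setup: index the context embeddings with FAISS and
# fetch the top passages for the query. Requires `pip install faiss-cpu`.
import faiss

ctx_matrix = np.stack([
    c_model(**c_tokenizer(t, return_tensors='pt'), return_dict=True)
    .pooler_output.detach().numpy().squeeze()
    for t in targets
]).astype('float32')
faiss.normalize_L2(ctx_matrix)                 # inner product == cosine after this

index = faiss.IndexFlatIP(ctx_matrix.shape[1])
index.add(ctx_matrix)

q = query_embeddings.astype('float32')[None, :].copy()
faiss.normalize_L2(q)
scores, ids = index.search(q, 2)               # top-2 passages for the query
print(ids[0], scores[0])
```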
Citing
@inproceedings{liu2021self,
title={Self-Alignment Pretraining for Biomedical Entity Representations},
author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
pages={4228--4238},
month = jun,
year={2021}
}
@inproceedings{karpukhin2020dense,
title={Dense Passage Retrieval for Open-Domain Question Answering},
author={Karpukhin, Vladimir and O{\u{g}}uz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2020}
}