---
library_name: peft
base_model: meta-llama/Llama-2-7b-hf
license: apache-2.0
language:
- en
---
# PathoIE-Llama-2-7B

<img src="https://cdn-uploads.huggingface.co/production/uploads/646704281dd5854d4de2cdda/k4lGzYe3Tp7EOgO_uOvgN.webp" width="500" />


## Training

Check out our GitHub repository: https://github.com/HIRC-SNUBH/Curation_LLM_PathoReport.git

- PEFT 0.4.0

## Inference

Since the model was trained on instructions formatted with the ChatML template, the tokenizer must be modified as shown below.

``` python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from tokenizers import AddedToken
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-hf',
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,   # Optional, if you have insufficient VRAM, lower the precision.
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
tokenizer.add_special_tokens(dict(
    eos_token=AddedToken("<|im_end|>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
    unk_token=AddedToken("<unk>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
    bos_token=AddedToken("<s>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
    pad_token=AddedToken("</s>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True),
))
tokenizer.add_tokens([AddedToken("<|im_start|>", single_word=False, lstrip=True, rstrip=True, normalized=False)], special_tokens=True)
tokenizer.additional_special_tokens = ['<unk>', '<s>', '</s>', '<|im_end|>', '<|im_start|>']

# Resize embeddings on the base model before attaching the adapter
base_model.resize_token_embeddings(len(tokenizer))
base_model.config.eos_token_id = tokenizer.eos_token_id

# Load PEFT
model = PeftModel.from_pretrained(base_model, 'Lowenzahn/PathoIE-Llama-2-7B')
model = model.merge_and_unload()
model = model.eval()

# Inference
prompts = ["Machine learning is"]
inputs = tokenizer(prompts, return_tensors="pt").to(model.device)
# Greedy decoding; top_p and temperature are ignored when do_sample=False
gen_kwargs = {"max_new_tokens": 1024, "top_p": 0.8, "temperature": 0.0, "do_sample": False, "repetition_penalty": 1.0}
output = model.generate(inputs['input_ids'], **gen_kwargs)
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)
```
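
For the extraction task itself, the prompt has to follow the ChatML layout shown in the next section. The helper below is a minimal sketch of how such a prompt can be assembled; `build_chatml_prompt` is an illustrative name, not part of the released code:

``` python
def build_chatml_prompt(system: str, user: str) -> str:
    """Wrap system and user messages in the ChatML layout this model expects.

    The trailing '<|im_start|> pathologist' turn cues the model to answer
    in the pathologist role, as in the prompt example below.
    """
    return (
        f"<|im_start|> system\n{system}\n<|im_end|>\n"
        f"<|im_start|> user\n{user}\n<|im_end|>\n"
        "<|im_start|> pathologist\n"
    )
```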


## Prompt example

The pathology report below is a fictitious example.

```
<|im_start|> system
You are a pathologist who specialized in lung cancer.
Your task is extracting informations requested by the user from the lung cancer pathology report and formatting extracted informations into JSON.
The information to be extracted is clearly specified in the report, so one must avoid from inferring information that is not present.
Remember, you MUST answer in JSON only. Avoid any additional explanations.
<|im_end|>
<|im_start|> user
Extract the following informations (value-set) from the report I provide.
If the required information to extract each value in the value-set is not present in the pathology report, consider it as 'not submitted'.
<value-set>
- MORPHOLOGY_DIAGNOSIS
- SUBTYPE_DOMINANT
- MAX_SIZE_OF_TUMOR(invasive component only)
- MAX_SIZE_OF_TUMOR(including CIS=AIS)
- INVASION_TO_VISCERAL_PLEURAL
- MAIN_BRONCHUS
- INVASION_TO_CHEST_WALL
- INVASION_TO_PARIETAL_PLEURA
- INVASION_TO_PERICARDIUM
- INVASION_TO_PHRENIC_NERVE
- TUMOR_SIZE_CNT
- LUNG_TO_LUNG_METASTASIS
- INTRAPULMONARY_METASTASIS
- SATELLITE_TUMOR_LOCATION
- SEPARATE_TUMOR_LOCATION
- INVASION_TO_MEDIASTINUM
- INVASION_TO_DIAPHRAGM
- INVASION_TO_HEART
- INVASION_TO_RECURRENT_LARYNGEAL_NERVE
- INVASION_TO_TRACHEA
- INVASION_TO_ESOPHAGUS
- INVASION_TO_SPINE
- METASTATIC_RIGHT_UPPER_LOBE
- METASTATIC_RIGHT_MIDDLE_LOBE
- METASTATIC_RIGHT_LOWER_LOBE
- METASTATIC_LEFT_UPPER_LOBE
- METASTATIC_LEFT_LOWER_LOBE
- INVASION_TO_AORTA
- INVASION_TO_SVC
- INVASION_TO_IVC
- INVASION_TO_PULMONARY_ARTERY
- INVASION_TO_PULMONARY_VEIN
- INVASION_TO_CARINA
- PRIMARY_CANCER_LOCATION_RIGHT_UPPER_LOBE
- PRIMARY_CANCER_LOCATION_RIGHT_MIDDLE_LOBE
- PRIMARY_CANCER_LOCATION_RIGHT_LOWER_LOBE
- PRIMARY_CANCER_LOCATION_LEFT_UPPER_LOBE
- PRIMARY_CANCER_LOCATION_LEFT_LOWER_LOBE
- RELATED_TO_ATELECTASIS_OR_OBSTRUCTIVE_PNEUMONITIS
- PRIMARY_SITE_LATERALITY
- LYMPH_METASTASIS_SITES
- NUMER_OF_LYMPH_NODE_META_CASES
---
<report>
[A] Lung, left lower lobe, lobectomy
1. ADENOSQUAMOUS CARCINOMA [by 2015 WHO classification]
- other subtype: acinar (50%), lepidic (30%), solid (20%)
    1) Pre-operative / Previous treatment: not done
    2) Histologic grade: moderately differentiated
    3) Size of tumor:
        a. Invasive component only: 3.5 x 2.5 x 1.3 cm, 2.4 x 2.3 x 1.1 cm
        b. Including CIS component: 3.9 x 2.6 x 1.3 cm, 3.8 x 3.1 x 1.2 cm
    4) Extent of invasion
        a. Invasion to visceral pleura: PRESENT (P2)
        b. Invasion to superior vena cava: present
    5) Main bronchus: not submitted
    6) Necrosis: absent
    7) Resection margin: free from carcinoma (safety margin: 1.1 cm)
    8) Lymph node: metastasis in 2 out of 10 regional lymph nodes
        (peribronchial lymph node: 1/3, LN#5,6 :0/1, LN#7:0/3, LN#12: 1/2)
<|im_end|>
<|im_start|> pathologist
```
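
Because the model is instructed to answer in JSON only, the reply can be sliced off after the prompt and parsed directly. A minimal sketch under that assumption, where `system_msg` and `user_msg` are placeholders for the messages above and the output keys are assumed to match the value-set names:

``` python
import json

prompt = build_chatml_prompt(system_msg, user_msg)  # helper sketched earlier
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(inputs["input_ids"], max_new_tokens=1024, do_sample=False)

# Keep only the newly generated tokens, i.e. the pathologist turn
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
extracted = json.loads(answer)  # raises ValueError if the reply is not valid JSON
print(extracted.get("PRIMARY_SITE_LATERALITY"))
```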

## Developed by
- **_ezCaretech AI Team_**
- **_Office of eHealth Research and Business, [SNUBH](https://www.snubh.org/dh/en/)_**


## Citation
```
@article{cho2024ie,
  title={Extracting lung cancer staging descriptors from pathology reports: a generative language model approach},
  author={Cho, Hyeongmin and Yoo, Sooyoung and Kim, Borham and Jang, Sowon and Sunwoo, Leonard and Kim, Sanghwan and Lee, Donghyoung and Kim, Seok and Nam, Sejin and Chung, Jin-Haeng},
  journal={Journal of Biomedical Informatics},
  volume={157},
  year={2024},
  publisher={Elsevier},
  issn={1532-0464},
  doi={10.1016/j.jbi.2024.104720},
  url={https://doi.org/10.1016/j.jbi.2024.104720}
}
```