stefan-insilico committed on
Commit
ac88694
1 Parent(s): f189f6c

Update README.md

Files changed (1)
  1. README.md +161 -25
README.md CHANGED
@@ -2,11 +2,18 @@
2
  license: cc-by-nc-4.0
3
  ---
4
 
5
- ---
6
- license: cc-by-nc-4.0
7
- ---
 
 
 
 
 
 
8
 
9
- Step 1.
 
10
 
11
  ```python
12
  # Load model and tokenizer
@@ -16,43 +23,172 @@ tokenizer = AutoTokenizer.from_pretrained("insilicomedicine/precious3-gpt", trus
16
  model = AutoModel.from_pretrained("insilicomedicine/precious3-gpt", trust_remote_code=True)
17
  ```
18
 
 
 
 
 
 
 
 
 
 
 
19
 
20
- Step 2.
21
 
22
  ```python
 
 
 
 
 
 
 
 
 
 
 
 
23
 
24
- # Select device
25
- if torch.cuda.is_available():
26
-     device = f"cuda:0"
27
- else:
28
-     device = "cpu"
29
- print(device)
30
  ```
31
 
 
 
 
 
32
 
33
- Step 3.
34
  ```python
 
 
 
35
 
36
- # Load unique compounds from precious3-gpt
37
- import pandas as pd
38
- all_entities_with_type = pd.read_csv('p3_entities_with_type.csv')
39
- p3_compounds = [i.strip() for i in all_entities_with_type[all_entities_with_type.type=='compound'].entity.values]
 
 
 
 
 
 
 
 
40
  ```
 
 
 
 
41
 
 
42
 
43
- Step 4.
 
 
 
 
 
 
44
 
45
- ```python
46
- # Example input prompt
47
 
48
- diff2compound = """[BOS]<compound2diff2compound><tissue>liver </tissue><age></age><cell></cell><efo></efo><datatype></datatype><drug></drug><dose></dose><time></time><case></case><control></control><dataset_type>expression </dataset_type><gender>m </gender><species>human </species><up>MMP3 SLA PMS1 S100A8 AKR1C2 ADD3 INVS INSIG2 KCTD12 TAF1A MBNL2 HERC2 COG6 IFNK SLC35A1 TBL2 SGMS1 CLHC1 EDEM3 GMCL1 ST6GALNAC4 MTMR1 RPUSD3 ATG4C HOXC6 GOLPH3L RAD50 GLCE WRAP73 NBR1 GJA9 RIMS3 DNAAF2 MRPL34 TRMT61B CETN3 HMG20A GPRIN3 DHRS3 METTL21A HEATR3 MMD FOCAD RHOT1 EMG1 CDC26 FRMD4B INTS8 KLHL8 ANKRD39 NKIRAS1 LIAS FARSA PREPL ZBTB48 VAV3 OXNAD1 METTL23 GPR84 QSER1 SLC16A6 NDFIP2 TUBGCP4 HEATR5A XPO1 ORC5 SLC38A9 COG5 SLC4A7 CRLS1 MCEE LMBRD2 ZMYND19 LARS2 NR2F6 CHCHD4 ACTR6 PTPN14 CDK19 SLC25A12 GMPR2 NUDCD2 ASB3 GDE1 MRPS26 DHRS7B FUT8 PAFAH2 ECE2 POLR3K NUP88 FAM98A BAG4 SATB1 GTF2H2 FASTKD1 PIK3R4 SPICE1 MTFR1 EML4 </up><down>HCAR3 CCNA1 GCH1 MARCKS TYMS C11orf96 APOBEC3B HS3ST2 XIRP1 DGKI ATP2B1 GSG1 SERPINE2 LIMS3 TUBB2A HMGCS1 C12orf75 FCGR1A FCGR1B HEG1 ITGA4 CDC42EP3 RAB27B FKBP5 FAM72A ARNT2 ASS1 PHACTR1 KLF4 ZC3H12D IL22RA2 CCNE2 FEM1C UHRF1 THAP2 GSTO2 CCNA2 PMAIP1 CYP51A1 FOSB BCAT1 CD109 NREP SLC7A5 C4orf46 B3GNT5 CPEB2 NCR3LG1 SCD MSX1 DTL RBM3 PIK3R3 TESC SLFN5 CREB5 TMEM64 USP53 TLE3 SFPQ PHLDA2 CIRBP CACNA2D4 SLC30A1 IL32 SHCBP1 OGFRL1 MAFF ATP1B1 FAS IRF1 LY86 IL6R DHCR7 TMEM217 SPIN4 SRGAP2 GLS C3orf80 ZNF367 OTUD1 SPSB1 RNASE3 GMEB1 KBTBD8 VWF RBM12 CTSO FOXF1 ARHGAP21 HES4 ACAT2 CDCA7 PSPC1 S100P NSMAF CTNNAL1 DPYSL2 MT1G MT1E </down>"""
49
  ```
50
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
- Step 5.
53
- ```python
54
 
55
- generated_compounds = generate_compounds(prompt_config=diff2compound, tokenizer=tokenizer,
56
- model=model, p3_compounds=p3_compounds, device=device, top_k=50)
57
- ````
58
 
 
2
  license: cc-by-nc-4.0
3
  ---
4
 
5
+ ## Precious3-GPT
6
+
7
+ A multi-omics multi-species language model.
8
+
9
+ - **Developer**: [Insilico Medicine](https://insilico.com/precious)
10
+ - **License**: cc-by-nc-4.0
11
+ - **Model size**: 88.3 million parameters
12
+ - **Domain**: Biomedical
13
+ - **Base architecture**: [MPT](https://huggingface.co/mosaicml/mpt-7b)
14
 
15
+ ## Quickstart
16
+ Precious3-GPT can be loaded and run as a standard causal language model through the transformers interface:
17
 
18
  ```python
19
  # Load model and tokenizer
 
23
  model = AutoModel.from_pretrained("insilicomedicine/precious3-gpt", trust_remote_code=True)
24
  ```
25
 
26
+ However, to make the full functionality of the Precious3-GPT model convenient to use, we provide a handler.
27
+
28
+ ### Run the model using the Precious3-GPT handler, step by step
29
+
30
+
31
+ **Step 1 - download the Precious3-GPT [handler.py](https://huggingface.co/insilicomedicine/precious3-gpt/blob/main/handler.py)**
32
+ ```python
33
+ from handler import EndpointHandler
34
+ precious3gpt_handler = EndpointHandler()
35
+ ```
36
 
37
+ **Step 2 - create input for the handler**
38
 
39
  ```python
40
+ import json
41
+ with open('./generation-configs/meta2diff.json', 'r') as f:
42
+     config_data = json.load(f)
43
+
44
+ # prepare request configuration
45
+ request_config = {"inputs": config_data, "mode": "meta2diff", "parameters": {
46
+ "temperature": 0.8,
47
+ "top_p": 0.2,
48
+ "top_k": 3550,
49
+ "n_next_tokens": 50,
50
+ "random_seed": 137
51
+ }}
52
 
 
 
 
 
 
 
53
  ```
54
 
55
+ **How Precious3-GPT will see the given request**
56
+ ```text
57
+ [BOS]<age_group2diff2age_group><disease2diff2disease><compound2diff2compound><tissue>lung </tissue><age_individ></age_individ><cell></cell><efo>EFO_0000768 </efo><datatype>expression </datatype><drug>curcumin </drug><dose></dose><time></time><case>70.0-80.0 80.0-90.0 </case><control></control><dataset_type></dataset_type><gender>m </gender><species>human </species>
58
+ ```
59
 
60
+ **Step 3 - run Precious3-GPT**
61
  ```python
62
+ output = precious3gpt_handler(request_config)
63
+ ```
64
+
65
 
66
+ **Handler output structure**
67
+ ```json
68
+ {
69
+ "output": {
70
+ "up": List,
71
+ "down": List
72
+ },
73
+ "mode": String, // Generation mode that was selected
74
+ "message": "Done!", // or Error
75
+ "input": String // Input prompt that was passed to the model
76
+
77
+ }
78
  ```
79
+ Note: If the selected ```mode``` generates compounds, the output also contains ```compounds: List```.
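
The structure above can be consumed with a small helper. This is a minimal sketch, assuming that `message` equals `"Done!"` on success and that `up` and `down` are lists of gene lists, as in the examples in this README. The name `extract_signature` and the mock result are hypothetical and not part of `handler.py`.

```python
def extract_signature(result):
    """Return flat up/down gene lists from a handler-style result dict."""
    # Assumption: any message other than "Done!" signals an error.
    if result.get("message") != "Done!":
        raise RuntimeError(f"Generation failed: {result.get('message')}")
    output = result["output"]
    signature = {}
    for key in ("up", "down"):
        # Each entry is assumed to be a list of gene lists; flatten them.
        signature[key] = [gene for genes in output.get(key, []) for gene in genes]
    return signature


# Hand-written mock result that mimics the documented structure.
mock_result = {
    "output": {"up": [["IER3", "APOC2"]], "down": [["TBL1Y", "TDP1"]]},
    "mode": "meta2diff",
    "message": "Done!",
    "input": "",  # the real handler echoes the prompt here
}

signature = extract_signature(mock_result)
print(signature["up"], signature["down"])
```

With a real handler, `mock_result` would be replaced by the value returned from `precious3gpt_handler(request_config)`.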
80
+
81
+ ---
82
+ ## Precious3-GPT request configuration
83
 
84
+ ### Generation Modes (`mode` in config)
85
 
86
+ Choose the appropriate mode based on your requirements:
87
+
88
+ 1. **meta2diff**: Generate a signature (lists of up- and down-regulated genes) from meta-data such as tissue, compound, gender, etc.
89
+ 2. **diff2compound**: Predict compounds based on a signature.
90
+ 3. **meta2diff2compound**: Generate signatures from meta-data, then predict compounds based on the generated signatures.
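
A request can be assembled with a small wrapper that rejects unknown modes before the handler is called. This is a sketch only: `make_request`, `VALID_MODES` and `DEFAULT_PARAMETERS` are hypothetical names, and the default sampling values are simply copied from the Step 2 example above.

```python
# The three generation modes listed above.
VALID_MODES = {"meta2diff", "diff2compound", "meta2diff2compound"}

# Sampling parameters copied from the Step 2 example (an assumption, not required defaults).
DEFAULT_PARAMETERS = {
    "temperature": 0.8,
    "top_p": 0.2,
    "top_k": 3550,
    "n_next_tokens": 50,
    "random_seed": 137,
}


def make_request(inputs, mode, **overrides):
    """Build a request dict in the {"inputs", "mode", "parameters"} layout."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(VALID_MODES)}")
    # Keyword arguments override the copied defaults.
    parameters = {**DEFAULT_PARAMETERS, **overrides}
    return {"inputs": inputs, "mode": mode, "parameters": parameters}


# Placeholder inputs; fill them in as shown in the Examples section.
request = make_request({}, "meta2diff", random_seed=1)
print(request["mode"], request["parameters"])
```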
91
+
92
+ ---
93
 
 
 
94
 
95
+ ### Instruction (`inputs.instruction` in config)
96
+
97
+ 1. disease2diff2disease - generate a signature for a disease / predict a disease based on a given signature
98
+ 2. compound2diff2compound - generate a signature for a compound / predict a compound based on a given signature
99
+ 3. age_group2diff2age_group - generate a signature for an age group / predict an age group based on a given signature
100
+
101
+
102
+ ### Other meta-data (`inputs.` in config)
103
+
104
+ 1. Age (```age```): in years for human, in days for macaque and mouse
105
+ 2.
106
+ The full list of available values for each meta-data item can be found in ```p3_entities_with_type.csv```.
107
+
108
+
109
+
110
+ ## Examples
111
+
112
+ In the following examples, all possible configuration fields are specified. You can leave some meta-data fields in the ```inputs``` section as an empty string (```""```) or an empty list (```[]```).
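
One way to avoid retyping every field is to start from an all-empty template and override only what is needed. The sketch below is an assumption-laden illustration: the field names are taken from the example configurations in this README, the choice of `""` versus `[]` per field follows those examples, and `make_inputs` is a hypothetical helper that does not exist in `handler.py`.

```python
import copy

# Field names as they appear in the example configurations of this README.
EMPTY_INPUTS = {
    "instruction": [], "tissue": [], "age": "", "cell": "", "efo": "",
    "datatype": "", "drug": "", "dose": "", "time": "", "case": "",
    "control": "", "dataset_type": "", "gender": "", "species": "",
    "up": [], "down": [],
}


def make_inputs(**fields):
    """Return a full `inputs` dict with the given fields filled in."""
    unknown = set(fields) - set(EMPTY_INPUTS)
    if unknown:
        raise KeyError(f"Unknown meta-data fields: {sorted(unknown)}")
    # Deep copy so the shared template (and its lists) is never mutated.
    inputs = copy.deepcopy(EMPTY_INPUTS)
    inputs.update(fields)
    return inputs


inputs = make_inputs(
    instruction=["disease2diff2disease"],
    tissue=["lung"],
    gender="m",
    species="human",
)
print(inputs)
```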
113
+
114
+ _**Example 1**_
115
+
116
+ To generate a signature from specific meta-data, you can use the following configuration. Note that the ```up``` and ```down``` fields are empty lists because the model is asked to generate them.
117
+ Here we ask the model to generate a signature for a male human in the 70-90 age group, in lung tissue, with disease EFO_0000768.
118
+
119
+ ```json
120
+ {
121
+ "inputs": {
122
+ "instruction": ["age_group2diff2age_group", "disease2diff2disease", "compound2diff2compound"],
123
+ "tissue": ["lung"],
124
+ "age": "",
125
+ "cell": "",
126
+ "efo": "EFO_0000768",
127
+ "datatype": "", "drug": "", "dose": "", "time": "", "case": ["70.0-80.0", "80.0-90.0"], "control": "", "dataset_type": "expression", "gender": "m", "species": "human", "up": [], "down": []
128
+ },
129
+ "mode": "meta2diff",
130
+ "parameters": {
131
+ "temperature": 0.8, "top_p": 0.2, "top_k": 3550, "n_next_tokens": 50, "random_seed": 137
132
+ }
133
+ }
134
  ```
135
 
136
+ Here is the output:
137
+ ```json
138
+ {
139
+ "output": {
140
+ "up": [["PTGDR2", "CABYR", "MGAM", "TMED9", "SHOX2", "MAT1A", "MUC5AC", "GASK1B", "CYP1A2", "RP11-266K4.9", ...]], // generated list of up-regulated genes
141
+ "down": [["MB", "OR10V1", "OR51H1", "GOLGA6L10", "OR6M1", "CDX4", "OR4C45", "SPRR2A", "SPDYE9", "GBX2", "ATP4B", ...]] // generated list of down-regulated genes
142
+ },
143
+ "mode": "meta2diff", // generation mode we specified
144
+ "message": "Done!",
145
+ "input": "[BOS]<age_group2diff2age_group><disease2diff2disease><compound2diff2compound><tissue>lung </tissue><cell></cell><efo>EFO_0000768 </efo><datatype></datatype><drug></drug><dose></dose><time></time><case>70.0-80.0 80.0-90.0 </case><control></control><dataset_type>expression </dataset_type><gender>m </gender><species>human </species>", // actual input prompt for the model
146
+ "random_seed": 137
147
+ }
148
+ ```
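
The `input` string above suggests how the configuration is flattened into a prompt: `[BOS]`, one tag per instruction, then one tag pair per meta-data field, with each value followed by a space. The sketch below reproduces that layout for illustration only. It is inferred from this example output, not taken from `handler.py`; `build_prompt` and `FIELD_ORDER` are hypothetical names, and the `age` field is skipped because it does not appear in the `input` string shown here.

```python
# Field order as observed in the "input" string of the example output.
FIELD_ORDER = [
    "tissue", "cell", "efo", "datatype", "drug", "dose", "time",
    "case", "control", "dataset_type", "gender", "species",
]


def build_prompt(inputs):
    """Flatten an `inputs` dict into a tagged prompt string (illustrative only)."""
    parts = ["[BOS]"]
    parts += [f"<{name}>" for name in inputs["instruction"]]
    for field in FIELD_ORDER:
        value = inputs.get(field, "")
        # Accept either a single string or a list of strings.
        values = value if isinstance(value, list) else [value]
        # Every non-empty value is followed by a single space.
        body = "".join(f"{v} " for v in values if v != "")
        parts.append(f"<{field}>{body}</{field}>")
    return "".join(parts)


example_inputs = {
    "instruction": ["age_group2diff2age_group", "disease2diff2disease", "compound2diff2compound"],
    "tissue": ["lung"], "age": "", "cell": "", "efo": "EFO_0000768",
    "datatype": "", "drug": "", "dose": "", "time": "",
    "case": ["70.0-80.0", "80.0-90.0"], "control": "",
    "dataset_type": "expression", "gender": "m", "species": "human",
    "up": [], "down": [],
}

prompt = build_prompt(example_inputs)
print(prompt)
```

Comparing the printed string with the `input` field returned by the handler is a quick way to check that a configuration was interpreted as intended.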
149
 
 
 
150
 
151
+ _**Example 2**_
152
+
153
+ Now let's generate a signature for a healthy male human in the 40-50 age group, in whole blood.
154
+ Note that here we use the ```disease2diff2disease``` instruction but expect a signature for a healthy human, which is why ```efo``` is set to an empty string (```""```).
155
+ For this example we can also add one more instruction: ```"instruction": ["disease2diff2disease", "age_group2diff2age_group"]```.
156
+
157
+ ```json
158
+ {
159
+ "inputs": {
160
+ "instruction": ["disease2diff2disease", "age_group2diff2age_group"],
161
+ "tissue": ["whole blood"],
162
+ "age": "",
163
+ "cell": "",
164
+ "efo": "",
165
+ "datatype": "", "drug": "", "dose": "", "time": "", "case": "40.0-50.0", "control": "", "dataset_type": "expression", "gender": "m", "species": "human", "up": [],
166
+ "down": []
167
+ },
168
+ "mode": "meta2diff",
169
+ "parameters": {
170
+ "temperature": 0.8,
171
+ "top_p": 0.2,
172
+ "top_k": 3550,
173
+ "n_next_tokens": 50,
174
+ "random_seed": 137
175
+ }
176
+ }
177
+
178
+ ```
179
+
180
+ Here is the output:
181
+ ```json
182
+ {
183
+ "output": {
184
+ "up": [["IER3", "APOC2", "EDNRB", "JAKMIP2", "BACE2", ... ]],
185
+ "down": [["TBL1Y", "TDP1", "PLPP4", "CPEB1", "ITPR3", ... ]]
186
+ },
187
+ "mode": "meta2diff",
188
+ "message": "Done!",
189
+ "input": "[BOS]<disease2diff2disease><age_group2diff2age_group><tissue>whole blood </tissue><cell></cell><efo></efo><datatype></datatype><drug></drug><dose></dose><time></time><case>40.0-50.0 </case><control></control><dataset_type>expression </dataset_type><gender>m </gender><species>human </species>",
190
+ "random_seed": 137
191
+ }
192
+ ```
193
+
194