@@ -2,11 +2,18 @@
 license: cc-by-nc-4.0
 ---
----
-license: cc-by-nc-4.0
----
-Step 1.
 ```python
 # Load model and tokenizer
@@ -16,43 +23,172 @@ tokenizer = AutoTokenizer.from_pretrained("insilicomedicine/precious3-gpt", trus
 model = AutoModel.from_pretrained("insilicomedicine/precious3-gpt", trust_remote_code=True)
 ```
-Step 2.
 ```python
-# Select device
-if torch.cuda.is_available():
-    device = f"cuda:0"
-else:
-    device = "cpu"
-print(device)
 ```
-Step 3.
 ```python
-# Load unique compounds from precious3-gpt
-import pandas as pd
-all_entities_with_type = pd.read_csv('p3_entities_with_type.csv')
-p3_compounds = [i.strip() for i in all_entities_with_type[all_entities_with_type.type=='compound'].entity.values]
 ```
-Step 4.
-```python
-# Example input prompt
-diff2compound = """[BOS]<compound2diff2compound><tissue>liver </tissue><age></age><cell></cell><efo></efo><datatype></datatype><drug></drug><dose></dose><time></time><case></case><control></control><dataset_type>expression </dataset_type><gender>m </gender><species>human </species><up>MMP3 SLA PMS1 S100A8 AKR1C2 ADD3 INVS INSIG2 KCTD12 TAF1A MBNL2 HERC2 COG6 IFNK SLC35A1 TBL2 SGMS1 CLHC1 EDEM3 GMCL1 ST6GALNAC4 MTMR1 RPUSD3 ATG4C HOXC6 GOLPH3L RAD50 GLCE WRAP73 NBR1 GJA9 RIMS3 DNAAF2 MRPL34 TRMT61B CETN3 HMG20A GPRIN3 DHRS3 METTL21A HEATR3 MMD FOCAD RHOT1 EMG1 CDC26 FRMD4B INTS8 KLHL8 ANKRD39 NKIRAS1 LIAS FARSA PREPL ZBTB48 VAV3 OXNAD1 METTL23 GPR84 QSER1 SLC16A6 NDFIP2 TUBGCP4 HEATR5A XPO1 ORC5 SLC38A9 COG5 SLC4A7 CRLS1 MCEE LMBRD2 ZMYND19 LARS2 NR2F6 CHCHD4 ACTR6 PTPN14 CDK19 SLC25A12 GMPR2 NUDCD2 ASB3 GDE1 MRPS26 DHRS7B FUT8 PAFAH2 ECE2 POLR3K NUP88 FAM98A BAG4 SATB1 GTF2H2 FASTKD1 PIK3R4 SPICE1 MTFR1 EML4 </up><down>HCAR3 CCNA1 GCH1 MARCKS TYMS C11orf96 APOBEC3B HS3ST2 XIRP1 DGKI ATP2B1 GSG1 SERPINE2 LIMS3 TUBB2A HMGCS1 C12orf75 FCGR1A FCGR1B HEG1 ITGA4 CDC42EP3 RAB27B FKBP5 FAM72A ARNT2 ASS1 PHACTR1 KLF4 ZC3H12D IL22RA2 CCNE2 FEM1C UHRF1 THAP2 GSTO2 CCNA2 PMAIP1 CYP51A1 FOSB BCAT1 CD109 NREP SLC7A5 C4orf46 B3GNT5 CPEB2 NCR3LG1 SCD MSX1 DTL RBM3 PIK3R3 TESC SLFN5 CREB5 TMEM64 USP53 TLE3 SFPQ PHLDA2 CIRBP CACNA2D4 SLC30A1 IL32 SHCBP1 OGFRL1 MAFF ATP1B1 FAS IRF1 LY86 IL6R DHCR7 TMEM217 SPIN4 SRGAP2 GLS C3orf80 ZNF367 OTUD1 SPSB1 RNASE3 GMEB1 KBTBD8 VWF RBM12 CTSO FOXF1 ARHGAP21 HES4 ACAT2 CDCA7 PSPC1 S100P NSMAF CTNNAL1 DPYSL2 MT1G MT1E </down>"""
 ```
-Step 5.
-```python
-generated_compounds = generate_compounds(prompt_config=diff2compound, tokenizer=tokenizer,
-                                         model=model, p3_compounds=p3_compounds, device=device, top_k=50)
-````

 license: cc-by-nc-4.0
 ---
+## Precious3-GPT
+A multi-omics multi-species language model.
+- **Developer**: [Insilico Medicine](https://insilico.com/precious)
+- **License**: cc-by-nc-4.0
+- **Model size**: 88.3 million parameters
+- **Domain**: Biomedical
+- **Base architecture**: [MPT](https://huggingface.co/mosaicml/mpt-7b)
+## Quickstart
+Precious3-GPT can be loaded and run as a standard causal language model through the transformers interface:
 ```python
 # Load model and tokenizer
 model = AutoModel.from_pretrained("insilicomedicine/precious3-gpt", trust_remote_code=True)
 ```
+However, to make the full functionality of Precious3-GPT easier to use, we provide a handler.
+### Run the model step by step using the Precious3-GPT handler
+**Step 1 - download the Precious3-GPT [handler.py](https://huggingface.co/insilicomedicine/precious3-gpt/blob/main/handler.py)**
+```python
+from handler import EndpointHandler
+precious3gpt_handler = EndpointHandler()
+```
+**Step 2 - create input for the handler**
 ```python
+import json
+with open('./generation-configs/meta2diff.json', 'r') as f:
+    config_data = json.load(f)
+# prepare request configuration
+request_config = {"inputs": config_data, "mode": "meta2diff", "parameters": {
+    "temperature": 0.8,
+    "top_p": 0.2,
+    "top_k": 3550,
+    "n_next_tokens": 50,
+    "random_seed": 137
+}}
 ```
+**How Precious3-GPT will see the given request**
+```text
+[BOS]<age_group2diff2age_group><disease2diff2disease><compound2diff2compound><tissue>lung </tissue><age_individ></age_individ><cell></cell><efo>EFO_0000768 </efo><datatype>expression </datatype><drug>curcumin </drug><dose></dose><time></time><case>70.0-80.0 80.0-90.0 </case><control></control><dataset_type></dataset_type><gender>m </gender><species>human </species>
+```
+**Step 3 - run Precious3-GPT**
 ```python
+output = precious3gpt_handler(request_config)
+```
+**Handler output structure**
+```json
+{
+    "output": {
+        "up": List,
+        "down": List
+    },
+    "mode": String, // generation mode that was selected
+    "message": "Done!",  // or an error message
+    "input": String // input prompt that was passed to the model
+}
 ```
+Note: if the selected ```mode``` generates compounds, the output contains ```compounds: List```.
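A minimal sketch of reading the handler output, assuming it is returned as a Python dict with the structure shown above. The helper names `flatten` and `extract_signature` and the sample dict are illustrative and not part of `handler.py`; the handling of nested lists is an assumption about how `up` and `down` may be returned.

```python
from typing import Any, Dict, List


def flatten(genes: List[Any]) -> List[str]:
    """Flatten one level of nesting: [["A", "B"]] -> ["A", "B"]."""
    flat: List[str] = []
    for item in genes:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


def extract_signature(result: Dict[str, Any]) -> Dict[str, List[str]]:
    """Return the generated lists, raising if the handler did not report "Done!"."""
    if result.get("message") != "Done!":
        raise ValueError(f"Generation failed: {result.get('message')}")
    output = result["output"]
    # "compounds" is only expected for compound-generating modes.
    return {key: flatten(output[key]) for key in ("up", "down", "compounds") if key in output}


# Hypothetical result shaped like the documented structure (not real model output).
sample_result = {
    "output": {"up": [["IER3", "APOC2"]], "down": [["TBL1Y", "TDP1"]]},
    "mode": "meta2diff",
    "message": "Done!",
    "input": "[BOS]...",
}
signature = extract_signature(sample_result)
print(signature["up"], signature["down"])
```

Checking `message` before touching `output` keeps a failed request from surfacing later as a confusing `KeyError`.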
+---
+## Precious3-GPT request configuration
+### Generation Modes (`mode` in config)
+Choose the appropriate mode based on your requirements:
+1. **meta2diff**: Generate a signature (lists of up- and down-regulated genes) given meta-data such as tissue, compound, gender, etc.
+2. **diff2compound**: Predict compounds based on signature.
+3. **meta2diff2compound**: Generate signatures given meta-data and then predict compounds based on generated signatures.
+---
+### Instruction (`inputs.instruction` in config)
+1. disease2diff2disease - generate signature for disease / predict disease based on given signature
+2. compound2diff2compound - generate signature for compound / predict compound based on given signature
+3. age_group2diff2age_group - generate signature for age group / predict age group based on signature
+### Other meta-data (`inputs.` in config)
+1. Age (```age```): in years for human, in days for macaque and mouse
+2.
+The full list of available values for each meta-data item can be found in ```p3_entities_with_type.csv```.
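A minimal sketch of listing the available values per meta-data type, assuming the file has `entity` and `type` columns (as an earlier version of this README used them). The function name `entities_by_type` and the sample rows are made up for illustration; only the `compound` type is confirmed by that earlier version, and `tissue` as a type value is an assumption.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO


def entities_by_type(csv_file: TextIO) -> Dict[str, List[str]]:
    """Group the `entity` column by the `type` column, stripping whitespace."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["type"].strip()].append(row["entity"].strip())
    return dict(grouped)


# Made-up rows standing in for the real file. With the real file, pass
# open("p3_entities_with_type.csv", newline="") instead of the StringIO.
sample_csv = io.StringIO("entity,type\ncurcumin ,compound\nlung,tissue\n")
available = entities_by_type(sample_csv)
print(available)
```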
+## Examples
+In the following examples all possible configuration fields are specified. You can leave meta-data fields in the ```inputs``` section as an empty string (```""```) or an empty list (```[]```).
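A minimal sketch of building a request with empty defaults for every field you do not set. The field names are copied from the example configurations in this section; the helper `build_request` and the constant `EMPTY_INPUTS` are hypothetical and not part of the handler.

```python
import copy
import json
from typing import Any, Dict

# Field names and empty values as they appear in the example configurations.
EMPTY_INPUTS: Dict[str, Any] = {
    "instruction": [], "tissue": [], "age": "", "cell": "", "efo": "",
    "datatype": "", "drug": "", "dose": "", "time": "", "case": "",
    "control": "", "dataset_type": "", "gender": "", "species": "",
    "up": [], "down": [],
}


def build_request(mode: str, parameters: Dict[str, Any], **fields: Any) -> Dict[str, Any]:
    """Fill unspecified meta-data fields with empty values and wrap them in a request."""
    unknown = set(fields) - set(EMPTY_INPUTS)
    if unknown:
        raise KeyError(f"Unknown input fields: {sorted(unknown)}")
    inputs = copy.deepcopy(EMPTY_INPUTS)  # avoid sharing the default lists
    inputs.update(fields)
    return {"inputs": inputs, "mode": mode, "parameters": dict(parameters)}


request = build_request(
    "meta2diff",
    {"temperature": 0.8, "top_p": 0.2, "top_k": 3550, "n_next_tokens": 50, "random_seed": 137},
    instruction=["disease2diff2disease"],
    tissue=["whole blood"],
    dataset_type="expression",
    gender="m",
    species="human",
)
print(json.dumps(request, indent=4))
```

Rejecting unknown field names catches typos such as `tisue` before the request reaches the model.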
+_**Example 1**_
+To generate a signature for specific meta-data, use the following configuration. Note that the ```up``` and ```down``` fields are empty lists because the model generates them.
+Here we ask the model to generate a signature for a male human in the 70-90 age group, in lung tissue, with disease EFO_0000768.
+```json
+{
+    "inputs": {
+        "instruction": ["age_group2diff2age_group", "disease2diff2disease", "compound2diff2compound"],
+        "tissue": ["lung"],
+        "age": "",
+        "cell": "",
+        "efo": "EFO_0000768",
+        "datatype": "", "drug": "", "dose": "", "time": "", "case": ["70.0-80.0", "80.0-90.0"], "control": "", "dataset_type": "expression", "gender": "m", "species": "human", "up": [], "down": []
+    },
+    "mode": "meta2diff",
+    "parameters": {
+        "temperature": 0.8, "top_p": 0.2, "top_k": 3550, "n_next_tokens": 50, "random_seed": 137
+    }
+}
 ```
+Here is the output:
+```json
+{
+  "output": {
+    "up": [["PTGDR2", "CABYR", "MGAM", "TMED9", "SHOX2", "MAT1A", "MUC5AC", "GASK1B", "CYP1A2", "RP11-266K4.9", ...]], // generated list of up-regulated genes
+    "down": [["MB", "OR10V1", "OR51H1", "GOLGA6L10", "OR6M1", "CDX4", "OR4C45", "SPRR2A", "SPDYE9", "GBX2", "ATP4B", ...]] // generated list of down-regulated genes
+  },
+  "mode": "meta2diff", // generation mode we specified
+  "message": "Done!",
+  "input": "[BOS]<age_group2diff2age_group><disease2diff2disease><compound2diff2compound><tissue>lung </tissue><cell></cell><efo>EFO_0000768 </efo><datatype></datatype><drug></drug><dose></dose><time></time><case>70.0-80.0 80.0-90.0 </case><control></control><dataset_type>expression </dataset_type><gender>m </gender><species>human </species>", // actual input prompt for the model
+  "random_seed": 137
+}
+```
+_**Example 2**_
+Now let's generate a signature for a healthy male human in the 40-50 age group, in whole blood tissue.
+Note that we use the ```disease2diff2disease``` instruction but want a signature for a healthy human, so we set ```efo``` to an empty string (```""```).
+Optionally, one more instruction can be added, as in the configuration below: ```"instruction": ["disease2diff2disease", "age_group2diff2age_group"]```.
+```json
+{
+    "inputs": {
+        "instruction": ["disease2diff2disease", "age_group2diff2age_group"],
+        "tissue": ["whole blood"],
+        "age": "",
+        "cell": "",
+        "efo": "",
+        "datatype": "", "drug": "", "dose": "", "time": "", "case": "40.0-50.0", "control": "", "dataset_type": "expression", "gender": "m", "species": "human", "up": [],
+        "down": []
+    },
+    "mode": "meta2diff",
+    "parameters": {
+        "temperature": 0.8,
+        "top_p": 0.2,
+        "top_k": 3550,
+        "n_next_tokens": 50,
+        "random_seed": 137
+    }
+}
+```
+Here is the output:
+```json
+{
+  "output": {
+    "up": [["IER3", "APOC2", "EDNRB", "JAKMIP2", "BACE2", ... ]],
+    "down": [["TBL1Y", "TDP1", "PLPP4", "CPEB1", "ITPR3", ... ]]
+  },
+  "mode": "meta2diff",
+  "message": "Done!",
+  "input": "[BOS]<disease2diff2disease><age_group2diff2age_group><tissue>whole blood </tissue><cell></cell><efo></efo><datatype></datatype><drug></drug><dose></dose><time></time><case>40.0-50.0 </case><control></control><dataset_type>expression </dataset_type><gender>m </gender><species>human </species>",
+  "random_seed": 137
+}
+```
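The `input` strings in the outputs above suggest how a configuration is turned into a prompt. The sketch below approximates that mapping; it is not the code in `handler.py`. The function `render_prompt`, the helper `as_values`, and the tag order in `FIELD_ORDER` are assumptions inferred from the Example 2 prompt, and the real handler may name or order tags differently (for example for age).

```python
from typing import Any, Dict, List, Sequence

# Tag order inferred from the `input` string of Example 2 (assumption).
FIELD_ORDER = ("tissue", "cell", "efo", "datatype", "drug", "dose", "time",
               "case", "control", "dataset_type", "gender", "species")


def as_values(value: Any) -> List[str]:
    """Normalize a field to a list of strings; empty string or None gives []."""
    if value is None or value == "":
        return []
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [str(value)]


def render_prompt(inputs: Dict[str, Any], field_order: Sequence[str] = FIELD_ORDER) -> str:
    """Approximate the prompt: [BOS], instruction tags, then one tag pair per field."""
    parts = ["[BOS]"]
    parts += [f"<{name}>" for name in inputs.get("instruction", [])]
    for name in field_order:
        # Each value is followed by a single space, as in the example prompts.
        body = "".join(f"{v} " for v in as_values(inputs.get(name, "")))
        parts.append(f"<{name}>{body}</{name}>")
    return "".join(parts)


example_inputs = {
    "instruction": ["disease2diff2disease", "age_group2diff2age_group"],
    "tissue": ["whole blood"], "case": "40.0-50.0",
    "dataset_type": "expression", "gender": "m", "species": "human",
}
prompt = render_prompt(example_inputs)
print(prompt)
```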