This repo contains a fully fine-tuned LLaMA-7b, trained on the 𧬠protein-oriented instructions from the π§ͺ Mol-Instructions dataset.
Instructions for running it can be found at https://github.com/zjunlp/Mol-Instructions.
Please refer to our paper for more details.
𧬠Tasks
Protein design
- Design a protein sequence that exhibits the desired activity and specificity.
- The presence of Mg(2+) is necessary for the protein to function in the desired environment.
- The AMP, (6S)-NADPHX binding site should be located in a region of the protein that is accessible to the ligand.
- The designed protein should have ATP binding, NADPHX epimerase activity, metal ion binding, ADP-dependent NAD(P)H-hydrate dehydratase activity to facilitate nicotinamide nucleotide metabolic process.
- For general function, the protein need meet that Catalyzes the epimerization of the S- and R-forms of NAD(P)HX, a damaged form of NAD(P)H that is a result of enzymatic or heat-dependent hydration
MSNELVLSREQVRRVDQRAIEAYGVPGIVLMENAGRGAAEIIRAACPSAQRVLIACGPGNNGGDGFVIARHLANAGWMVELLLACPADRITGDAQGNHEIIRRMNLPCAVMADARDLEAANDRFATADVIVDALLGTGASGPPREPIASLIRAINEAHRRVSAQPAPSVFAVDIPSGLDCDTGEAANPTVRADHTITFVARKIGFRNPAARDLLGRVHVVDIGAPRAAIQDALTGKSG
Catalytic activity prediction
Please evaluate the following protein sequence and provide an explanation of the enzyme's catalytic activity, including the chemical reaction it facilitates: MDKVAVAGFLPEELCASLSLSPSFRGNQIFQWIGKGVDSFDAMTNLSAELRASLAEKAILRSTRVSDVLKADDGTVKLQIQTEDDLAVETVLLTDKAARKTACVSCQAGCAMGCAFCKTGTLGLARNLSAAEIVEQFLYLEKHAGALDNIVFMGMGEPLLNLDALRKAIAVLTDKRGRNLSSRRITVSTVGIVSGIYDLANNGPDVRLAVSLTTADETLRRELMPASLTNPLSDLRQAISYYIEKTGKRVTLEAVLLSGKNTSEKNADSLIAFAKGLDVHVNLIPWNPVEGLSFVTPDPEETAQFVSRLEKGGLNVTLRMHRGKSISGACGQLGKTNPYA
Based on the provided protein sequence, the enzyme appears to facilitate the chemical reaction: adenosine(37) in tRNA + 2 reduced [2Fe-2S]-[ferredoxin] + 2 S- adenosyl-L-methionine = 2-methyladenosine(37) in tRNA + 5'- deoxyadenosine + L-methionine + 2 oxidized [2Fe-2S]-[ferredoxin] + S- adenosyl-L-homocysteine.
Protein function prediction
Analyze the following amino acid sequence, and determine the function of the resulting protein, its subcellular localization, and any biological processes it may be part of: MNGTVNASAPSKMSEVAVERLSNDKALKVIFVLGGPGSGKGTQCAKIAKHFGFTHLSVGDLLRAEINSGSKNGTMIESMINEGKIVRSEVTIKLLQRAMHESGNDKFLIDGFPRNEENRAAFENLEKIEPEFVLFFDCPMEEMERRILNRNQGRDDDKMETIRKRFKVFIESTLPVIEFYNLKGKLYKIDACKPADEVFEDVKAIFSRFRAKEDSSQQTNICTAKRFELVMCLIKRLFREIKRMWSSFFCKAL
The protein characterized by the amino acid sequence demonstrates ATP binding, cytidylate kinase activity, uridylate kinase activity and is implicated in the 'de novo' pyrimidine nucleobase biosynthetic process, phosphorylation, pyrimidine nucleotide biosynthetic process. Its subcellular localization is primarily within the cytoplasm, nucleus.
Functional description generation
Examine the given protein sequence and share a brief overview of its attributes: MKIVLASNNQGKLAELKAMLAPLGVQLLRQAELGIPEAAEPFRTFVENALAKARHASALSGLPALADDAGLCVEAFGGLPGVDTAFYAVQFGYAKGDANNVKALLAQMAGITDRRAALVSTLVAVRSAEDPEPLIACGRVAGEVALEPMGSNGFGFDPVMFIPEFGQTFAQLPVEVKNANSHRGKATRQMMALMRERWIT
A concise description of the protein with the specified amino acid sequence includes: Pyrophosphatase that catalyzes the hydrolysis of nucleoside triphosphates to their monophosphate derivatives, with a high preference for the non-canonical purine nucleotides XTP (xanthosine triphosphate), dITP (deoxyinosine triphosphate) and ITP. Seems to function as a house-cleaning enzyme that removes non-canonical purine nucleotides from the nucleotide pool, thus preventing their incorporation into DNA/RNA and avoiding chromosomal lesions.
Domain/Motif prediction
Given this protein sequence, can you identify any common protein motifs or domains that it may contain? MANTKYIFITGGVVSSLGKGIAAASIGALLESRGLSVSLIKVDPYINVDPGTMSPFQHGEVFVTEDGTETDLDLGHYERFVRFKASKKNNFTAGKVYETVIRNERKGNYLGGTVQVIPHITNEIKKRIKKGGQNKDIAIVEVGGTVGDIESQPFVEALRQMALELPNSSWAFVHLTLVPFINASGELKTKPTQHSVKELRSLGISPDVLVCRSEQELPKDEKNKIALFCSVPAKSVISMHDVDTVYSIPILLNKQKVDDTILKKLNLKIKKPNLNDWKRVVKAKLLPEKEVNVSFVGKYTELKDSYKSINEALEHAGIQNKAKVNINFVEAEQITSQNVRKVLKKSDAILVPGGFGERGIEGMILACKYARENNVPYLGICLGMQIAIIEYARNVLKLKSANSTEFDSSTKFPVIGLITEWSDISGKKEKRTKNSDLGGTMRLGGQVCKLKKKSNSYKMYKKSEIIERHRHRYEVNPNYKDKMIEQGLDVVGTSIDGKLVEMIELPSHKWFLACQFHPEFTSNPRDGHPIFNSYIKSTITK
Our predictive analysis of the given protein sequence reveals possible domains or motifs. These include: Glutamine amidotransferase, CTP synthase N-terminal domains.
π Demo
As illustrated in our repository, we provide an example to perform generation.
For model fine-tuned on protein-oriented instructions, you can conveniently recover the model weights we trained through the following command.
Please download llama-7b-hf to obtain the pre-training weights of LLaMA-7B, refine the --base_model
to point towards the location where the model weights are saved.
Then replace $DIFF_WEIGHT_PATH
with the path of our provided diff weights, and replace $RECOVER_WEIGHT_PATH
with the desired path to save the recovered weights. If the directory of recovered weights lacks required files (e.g., tokenizer configuration files), you can copy from $DIFF_WEIGHT_PATH
.
python weight_diff.py recover \
--path_raw $BASE_MODEL_PATH \
--path_diff $DIFF_WEIGHT_PATH \
--path_tuned $RECOVER_WEIGHT_PATH
After that, you can execute the following command to generate outputs with the fine-tuned LLaMA model.
>> python generate.py \
--CLI True \
--protein True \
--base_model $RECOVER_WEIGHT_PATH \
π¨ Limitations
The current state of the model, obtained via instruction tuning, is a preliminary demonstration. Its capacity to handle real-world, production-grade tasks remains limited.
π References
If you use our repository, please cite the following related paper:@inproceedings{fang2023mol,
author = {Yin Fang and
Xiaozhuan Liang and
Ningyu Zhang and
Kangwei Liu and
Rui Huang and
Zhuo Chen and
Xiaohui Fan and
Huajun Chen},
title = {Mol-Instructions: {A} Large-Scale Biomolecular Instruction Dataset
for Large Language Models},
booktitle = {{ICLR}},
publisher = {OpenReview.net},
year = {2024},
url = {https://openreview.net/pdf?id=Tlsdsb6l9n}
}
π«±π»βπ«² Acknowledgements
We appreciate LLaMA, Huggingface Transformers Llama, Alpaca, Alpaca-LoRA, Chatbot Service and many other related works for their open-source contributions.
- Downloads last month
- 103