|
--- |
|
license: apache-2.0 |
|
pipeline_tag: text-generation |
|
tags: |
|
- biology |
|
--- |
|
|
|
# InstructPLM
|
InstructPLM is a state-of-the-art protein design model built on [ProGen2](https://www.cell.com/cell-systems/abstract/S2405-4712(23)00272-7)

and [ProteinMPNN](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9997061/),

and trained on the [CATH 4.2](https://www.cathdb.info/) dataset.

It designs protein sequences that accurately conform to specified backbone structures.
|
|
|
<p align="center"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/62a8397d839eeb3ef16a7566/1NRk65EImgBAFgvh8HJrA.png" alt="drawing" width="200"/> |
|
</p> |
|
|
|
Please visit our [repository](https://github.com/Eikor/InstructPLM) and [paper](https://biorxiv.org/cgi/content/short/2024.04.17.589642v1) for more information, and cite our work as follows:
|
|
|
```bibtex
|
@article{Qiu2024.04.17.589642,
|
author = {Jiezhong Qiu and Junde Xu and Jie Hu and Hanqun Cao and Liya Hou and Zijun Gao and Xinyi Zhou and Anni Li and Xiujuan Li and Bin Cui and Fei Yang and Shuang Peng and Ning Sun and Fangyu Wang and Aimin Pan and Jie Tang and Jieping Ye and Junyang Lin and Jin Tang and Xingxu Huang and Pheng Ann Heng and Guangyong Chen}, |
|
title = {InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions}, |
|
elocation-id = {2024.04.17.589642}, |
|
year = {2024}, |
|
doi = {10.1101/2024.04.17.589642}, |
|
publisher = {Cold Spring Harbor Laboratory}, |
|
abstract = {Large language models are renowned for their efficacy in capturing intricate patterns, including co-evolutionary relationships, and underlying protein languages. However, current methodologies often fall short in illustrating the emergence of genomic insertions, duplications, and insertion/deletions (indels), which account for approximately 14\% of human pathogenic mutations. Given that structure dictates function, mutated proteins with similar structures are more likely to persist throughout biological evolution. Motivated by this, we leverage cross-modality alignment and instruct fine-tuning techniques inspired by large language models to align a generative protein language model with protein structure instructions. Specifically, we present a method for generating variable-length and diverse proteins to explore and simulate the complex evolution of life, thereby expanding the repertoire of options for protein engineering. Our proposed protein LM-based approach, InstructPLM, demonstrates significant performance enhancements both in silico and in vitro. On native protein backbones, it achieves a perplexity of 2.68 and a sequence recovery rate of 57.51, surpassing ProteinMPNN by 39.2\% and 25.1\%, respectively. Furthermore, we validate the efficacy of our model by redesigning PETase and L-MDH. For PETase, all fifteen designed variable-length PETase exhibit depolymerization activity, with eleven surpassing the activity levels of the wild type. Regarding L-MDH, an enzyme lacking an experimentally determined structure, InstructPLM is able to design functional enzymes with an AF2-predicted structure. Code and model weights of InstructPLM are publicly available.Competing Interest StatementThe authors have declared no competing interest.}, |
|
URL = {https://www.biorxiv.org/content/early/2024/04/20/2024.04.17.589642}, |
|
eprint = {https://www.biorxiv.org/content/early/2024/04/20/2024.04.17.589642.full.pdf}, |
|
journal = {bioRxiv} |
|
} |
|
``` |
|
|
|
|