WarmMolGenTwo
A target-specific molecule generator that is warm started (i.e. initialized) from pretrained biochemical language models and trained on interacting protein-compound pairs, treating targeted molecular generation as a translation task between protein and molecular languages. It was introduced in the paper "Exploiting pretrained biochemical language models for targeted drug design", published in Bioinformatics (Oxford University Press), and first released in this repository.
WarmMolGenTwo is a Transformer-based encoder-decoder model initialized with Protein RoBERTa and ChemBERTaLM checkpoints and then trained on interacting protein-compound pairs filtered from BindingDB. The model takes a protein sequence as input and outputs a SMILES sequence.
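Conceptually, the warm-starting step can be reproduced with the Transformers `EncoderDecoderModel.from_encoder_decoder_pretrained` helper, which stitches a pretrained encoder and decoder together and adds randomly initialized cross-attention layers before fine-tuning. The sketch below is illustrative only; the checkpoint paths are placeholders, not the exact checkpoints used for this model.

from transformers import EncoderDecoderModel

# Placeholder checkpoint names (hypothetical); substitute the actual
# Protein RoBERTa encoder and ChemBERTaLM decoder checkpoints.
encoder_ckpt = "path/to/protein-roberta"
decoder_ckpt = "path/to/chemberta-lm"

# Warm start an encoder-decoder from the two pretrained language models;
# cross-attention weights are newly initialized and learned during
# fine-tuning on interacting protein-compound pairs.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(encoder_ckpt, decoder_ckpt)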
How to use
from transformers import EncoderDecoderModel, RobertaTokenizer
protein_tokenizer = RobertaTokenizer.from_pretrained("gokceuludogan/WarmMolGenTwo")
mol_tokenizer = RobertaTokenizer.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")
model = EncoderDecoderModel.from_pretrained("gokceuludogan/WarmMolGenTwo")
inputs = protein_tokenizer("MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTG", return_tensors="pt")
outputs = model.generate(**inputs, decoder_start_token_id=mol_tokenizer.bos_token_id,
                         eos_token_id=mol_tokenizer.eos_token_id, pad_token_id=mol_tokenizer.eos_token_id,
                         max_length=128, num_return_sequences=5, do_sample=True, top_p=0.95)
mol_tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Sample output
['CCOC(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)NCCC[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)Cc1ccc(O)cc1)C(C)C',
'CCC(C)[C@H](NC(=O)Cn1nc(-c2cccc3ccccc23)c2cnccc2c1=O)C(O)=O',
'CC(C)[C@H](NC(=O)[C@H](CC(O)=O)NC(=O)[C@@H]1C[C@H]1c1ccccc1)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)OC(C)(C)C',
'CC[C@@H](C)[C@H](NC(=O)\\C=C\\C(C)\\C=C/C=C(/C)\\C=C(/C)\\C)C(=O)N[C@@H](CC(O)=O)C(=O)N[C@@H](CC(O)=O)C(=O)N[C@@H](Cc1cc(O)c(O)c(O)c1)C(O)=O',
'CN1C[C@H](Cn2cnc3cc(O)ccc23)Oc2ccc(cc12)C(F)(F)F']
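Since sampling can occasionally produce malformed SMILES, a common follow-up (not part of the model card above, shown here as a minimal sketch assuming RDKit is installed) is to filter the generated strings for chemical validity:

from rdkit import Chem

generated = mol_tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Chem.MolFromSmiles returns None for strings that cannot be parsed into a molecule.
valid = [smi for smi in generated if Chem.MolFromSmiles(smi) is not None]
print(f"{len(valid)}/{len(generated)} generated SMILES are valid")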
Citation
@article{10.1093/bioinformatics/btac482,
author = {Uludoğan, Gökçe and Ozkirimli, Elif and Ulgen, Kutlu O. and Karalı, Nilgün Lütfiye and Özgür, Arzucan},
title = "{Exploiting Pretrained Biochemical Language Models for Targeted Drug Design}",
journal = {Bioinformatics},
year = {2022},
doi = {10.1093/bioinformatics/btac482},
url = {https://doi.org/10.1093/bioinformatics/btac482}
}