tags:
  - chemistry
widget:
  - text: <LIGAND>
    example_title: Generate molecule

BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning


BindGPT is a new framework for building drug discovery models that leverages compute-efficient pretraining, supervised finetuning, prompting, reinforcement learning, and tool use of LMs. This allows BindGPT to build a single pretrained model that exhibits state-of-the-art performance in 3D Molecule Generation, 3D Conformer Generation, and Pocket-Conditioned 3D Molecule Generation by posing them as downstream tasks for that model, whereas previous methods build task-specialized models without task transfer abilities. In addition, thanks to fast transformer inference technology, BindGPT is two orders of magnitude (100 times) faster at generation than previous methods.

This page provides the version of BindGPT finetuned on the GEOM-DRUGS dataset. The model was pretrained on the Uni-Mol dataset and then finetuned on GEOM-DRUGS. The finetuned model is capable of zero-shot molecule generation and conformer generation within the distribution of the GEOM-DRUGS dataset. We also expose pretrained and finetuned models:

Unconditional generation

The code below provides a minimal standalone example of sampling molecules from the model. It depends only on transformers, tokenizers, rdkit, and pytorch, and it is not meant to reproduce the sampling speed reported in the paper (e.g., it does not use flash-attention, mixed precision, or large-batch sampling). To reproduce the sampling speed, please use the code from our repository: (code coming soon)

# Download the model from Hugging Face:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("insilicomedicine/bindgpt_finetuned")
model = AutoModelForCausalLM.from_pretrained("insilicomedicine/bindgpt_finetuned").cuda()

# Generate 10 tokenized molecules without condition
NUM_SAMPLES = 10

start_tokens = tokenizer("<LIGAND>", return_tensors="pt")
outputs = model.generate(
    # remove EOS token to continue generation
    input_ids=start_tokens['input_ids'][:, :-1].cuda(),
    attention_mask=start_tokens['attention_mask'][:, :-1].cuda(),
    do_sample=True, max_length=400, num_return_sequences=NUM_SAMPLES
)


# parse results
import re
from rdkit import Chem
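def parse_molecule(s):
    """Convert a decoded sample into an RDKit molecule with one conformer.

    Returns None when the sample is malformed.
    """
    try:
        assert '<LIGAND>' in s and '<XYZ>' in s
        # unpacking raises ValueError if a marker occurs more than once
        _, smiles, xyz = re.split(r'<LIGAND>|<XYZ>', s)
        smiles = re.sub(r'\s', '', smiles)
        conf = Chem.Conformer()
        mol = Chem.MolFromSmiles(smiles)
        assert mol is not None
        # float() raises ValueError on a non-numeric token
        coords = list(map(float, xyz.split(' ')[2:]))
        # expect one (x, y, z) triple per atom
        assert len(coords) == (3 * mol.GetNumAtoms())
        for j in range(mol.GetNumAtoms()):
            conf.SetAtomPosition(j, [coords[3*j],coords[3*j+1],coords[3*j+2]])
        mol.AddConformer(conf)
        return mol
    except (AssertionError, ValueError):
        return None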

string_molecules = tokenizer.batch_decode(outputs, skip_special_tokens=True)
molecules = [parse_molecule(mol) for mol in string_molecules]
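Because parse_molecule returns None for samples that fail to parse, the resulting list usually needs filtering before further use. The helper below is a minimal sketch of that step; the name summarize_parsing and the stand-in values are hypothetical and not part of the BindGPT code.

```python
def summarize_parsing(parsed):
    """Split parser results into valid entries and a validity rate.

    `parsed` is any list whose failed entries are None, such as the
    list obtained by mapping parse_molecule over the decoded strings.
    """
    valid = [m for m in parsed if m is not None]
    # Guard against an empty input to avoid division by zero.
    rate = len(valid) / len(parsed) if parsed else 0.0
    return valid, rate


# Stand-in values for illustration only; in practice pass `molecules`.
valid, rate = summarize_parsing(["mol_a", None, "mol_b", None])
print(f"{len(valid)} valid samples, validity rate {rate:.2f}")
```

The rate reported this way only reflects whether a sample could be parsed, not whether the generated geometry is chemically reasonable.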

Conformer generation

The code below provides a minimal standalone example of sampling conformers for a given molecule from the model. It depends only on transformers, tokenizers, rdkit, and pytorch, and it is not meant to reproduce the sampling speed reported in the paper (e.g., it does not use flash-attention, mixed precision, or large-batch sampling). To reproduce the sampling speed, please use the code from our repository: https://github.com/insilicomedicine/bindgpt

smiles = [
    'O=c1n(CCO)c2ccccc2n1CCO',
    'Cc1ccc(C#N)cc1S(=O)(=O)NCc1ccnc(OC(C)(C)C)c1',
    'COC(=O)Cc1csc(NC(=O)Cc2coc3cc(C)ccc23)n1',
]

# pad on the left so that all prompts are right-aligned
tokenizer.padding_side = 'left'
# Do not forget to add the <XYZ> token
# after the SMILES, otherwise the model might
# want to continue generating the molecule :)
prompts = tokenizer(
    ["<LIGAND>" + s + '<XYZ>' for s in smiles], return_tensors="pt",
    truncation=True, padding=True,
)

# Generate 1 conformer per molecule
outputs = model.generate(
    # remove EOS token to continue generation
    input_ids=prompts['input_ids'][:, :-1].cuda(),
    attention_mask=prompts['attention_mask'][:, :-1].cuda(),
    do_sample=True, max_length=400, 
    # you can combine this type of conditional generation
    # with multi-sample generation.
    # to sample many conformers per molecule, uncomment this
    # num_return_sequences=10
)

# parse results
string_molecules = tokenizer.batch_decode(outputs, skip_special_tokens=True)
molecules = [parse_molecule(mol) for mol in string_molecules]
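When num_return_sequences is enabled, the parsed list is flat, so the conformers have to be matched back to their input SMILES. The sketch below assumes that generate returns the samples for each prompt contiguously and in prompt order, which is our understanding of how transformers lays out its output; group_by_prompt is a hypothetical helper name, and the string values stand in for parsed molecules.

```python
def group_by_prompt(items, prompts, num_return_sequences=1):
    """Pair each prompt with its consecutive chunk of generated items.

    Assumes `items` holds num_return_sequences entries per prompt,
    stored contiguously and in the same order as `prompts`.
    """
    k = num_return_sequences
    if len(items) != len(prompts) * k:
        raise ValueError("expected num_return_sequences items per prompt")
    # A list of pairs (not a dict) keeps duplicate prompts separate.
    return [(p, items[i * k:(i + 1) * k]) for i, p in enumerate(prompts)]


# Stand-in values for illustration only; in practice pass `molecules`
# and `smiles` together with the num_return_sequences used in generate.
grouped = group_by_prompt(["a1", "a2", "b1", "b2"], ["CCO", "CCN"], 2)
for prompt, conformers in grouped:
    print(prompt, conformers)
```

Entries that failed to parse stay in their group as None, so the position of each conformer within its chunk is preserved.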

Usage and License

Please note that all model weights are exclusively licensed for research purposes. The accompanying dataset is licensed under CC BY 4.0, which permits solely non-commercial usage. We emphatically urge all users to adhere to the highest ethical standards when using our models, including maintaining fairness, transparency, and responsibility in their research. Any usage that may lead to harm or pose a detriment to society is strictly forbidden.

References

If you use our repository, please cite the following related paper:

@article{zholus2024bindgpt,
  author    = {Artem Zholus and Maksim Kuznetsov and Roman Schutski and Rim Shayakhmetov and Daniil Polykovskiy and Sarath Chandar and Alex Zhavoronkov},
  title     = {BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning},
  journal   = {arXiv},
  year      = {2024},
}