---
tags:
- chemistry
- medical
widget:
- text: <LIGAND>
  example_title: Generate molecule
---
# BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning

![alt text](./images/scheme.svg "Main image")

> BindGPT is a new framework for building drug discovery models that leverages compute-efficient pretraining, supervised finetuning, prompting, reinforcement learning, and tool use of LMs. This allows BindGPT to build a single pretrained model that exhibits state-of-the-art performance in 3D Molecule Generation, 3D Conformer Generation, and Pocket-Conditioned 3D Molecule Generation, posing them as downstream tasks for a pretrained model, whereas previous methods build task-specialized models without task transfer abilities. At the same time, thanks to fast transformer inference technology, BindGPT is two orders of magnitude (100 times) faster at generation than previous methods.
- **Website:** https://bindgpt.github.io
- **Repository:** https://github.com/insilicomedicine/bindgpt
- **Paper:** https://arxiv.org/abs/2406.03686


**This page provides the version of BindGPT finetuned on the GEOM-DRUGS dataset.** 
The model was pretrained on the [Uni-Mol](https://github.com/deepmodeling/Uni-Mol) dataset and finetuned on GEOM-DRUGS. The finetuned model is capable of zero-shot molecule generation and conformer generation within
the distribution of the GEOM-DRUGS dataset.
We also provide the pretrained model and a model finetuned with reinforcement learning:

- For the pretrained model, visit [huggingface.co/insilicomedicine/BindGPT](https://huggingface.co/insilicomedicine/BindGPT)
- For the model finetuned with Reinforcement Learning on CrossDocked, visit [huggingface.co/insilicomedicine/BindGPT-RL](https://huggingface.co/insilicomedicine/BindGPT-RL)


## Unconditional generation

The code below provides a minimal standalone example of 
sampling molecules from the model. It depends only on 
`transformers`, `tokenizers`, `rdkit`, and `pytorch`,
and it is not meant to reproduce the sampling speed reported
in the paper (e.g. it does not use flash-attention, mixed precision, 
or large batch sampling). 
To reproduce the sampling speed, please use the code from our repository:
https://github.com/insilicomedicine/bindgpt

```python
# Download the model from Hugging Face:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("artemZholus/BindGPT")
model = AutoModelForCausalLM.from_pretrained("artemZholus/BindGPT").cuda()

# Generate 10 tokenized molecules without condition
NUM_SAMPLES = 10

start_tokens = tokenizer("<LIGAND>", return_tensors="pt")
outputs = model.generate(
    # remove EOS token to continue generation
    input_ids=start_tokens['input_ids'][:, :-1].cuda(),
    attention_mask=start_tokens['attention_mask'][:, :-1].cuda(),
    do_sample=True, max_length=400, num_return_sequences=NUM_SAMPLES
)


# parse results
import re
from rdkit import Chem
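def parse_molecule(s):
    # Returns an RDKit Mol with one 3D conformer, or None if the sample is malformed
    try:
        assert '<LIGAND>' in s and '<XYZ>' in s
        # text between <LIGAND> and <XYZ> is the SMILES, the rest is the coordinates
        _, smiles, xyz = re.split(r'<LIGAND>|<XYZ>', s)
        smiles = re.sub(r'\s', '', smiles)
        conf = Chem.Conformer()
        mol = Chem.MolFromSmiles(smiles)
        assert mol is not None
        # skip the first two space-separated fields, then read x, y, z per atom
        coords = list(map(float, xyz.split(' ')[2:]))
        assert len(coords) == (3 * mol.GetNumAtoms())
        for j in range(mol.GetNumAtoms()):
            conf.SetAtomPosition(j, [coords[3*j],coords[3*j+1],coords[3*j+2]])
        mol.AddConformer(conf)
        return mol
    # ValueError covers non-numeric coordinates and repeated <LIGAND>/<XYZ> tags
    except (AssertionError, ValueError):
        return None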

string_molecules = tokenizer.batch_decode(outputs, skip_special_tokens=True)
molecules = [parse_molecule(mol) for mol in string_molecules]
```
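Since `parse_molecule` returns `None` for samples it cannot parse, it is usually worth filtering the results before further use. The snippet below is a minimal sketch of that step; `keep_valid` is a hypothetical helper (not part of the BindGPT repository), and plain strings stand in for the RDKit molecules so that it runs without any dependencies.

```python
from typing import List, Optional, Tuple, TypeVar

T = TypeVar("T")


def keep_valid(parsed: List[Optional[T]]) -> Tuple[List[T], float]:
    """Drop failed parses (None) and return the survivors with the success rate.

    Hypothetical helper for illustration; not part of the BindGPT code base.
    """
    valid = [m for m in parsed if m is not None]
    # Guard against an empty input list to avoid dividing by zero
    rate = len(valid) / len(parsed) if parsed else 0.0
    return valid, rate


# Stand-in for the `molecules` list; strings replace RDKit Mol objects here
example = ["mol_a", None, "mol_b", None]
valid, rate = keep_valid(example)
print(f"{len(valid)} of {len(example)} samples parsed ({rate:.0%})")
```

With the real output, `keep_valid(molecules)` would be called in the same way.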

## Conformer generation

The code below provides a minimal standalone example of 
sampling conformers for given molecules. It depends only on 
`transformers`, `tokenizers`, `rdkit`, and `pytorch`,
and it is not meant to reproduce the sampling speed reported
in the paper (e.g. it does not use flash-attention, mixed precision, 
or large batch sampling). 
To reproduce the sampling speed, please use the code from our repository:
https://github.com/insilicomedicine/bindgpt

```python
smiles = [
    'O=c1n(CCO)c2ccccc2n1CCO',
    'Cc1ccc(C#N)cc1S(=O)(=O)NCc1ccnc(OC(C)(C)C)c1',
    'COC(=O)Cc1csc(NC(=O)Cc2coc3cc(C)ccc23)n1',
]

# pad on the left so that all prompts are right-aligned
tokenizer.padding_side = 'left'
# Do not forget to add the <XYZ> token 
# after the smiles, otherwise the model might 
# want to continue generating the molecule :)
prompts = tokenizer(
    ["<LIGAND>" + s + '<XYZ>' for s in smiles], return_tensors="pt",
    truncation=True, padding=True,
)

# Generate 1 conformer per molecule
outputs = model.generate(
    # remove EOS token to continue generation
    input_ids=prompts['input_ids'][:, :-1].cuda(),
    attention_mask=prompts['attention_mask'][:, :-1].cuda(),
    do_sample=True, max_length=400, 
    # you can combine this type of conditional generation
    # with multi-sample generation.
    # to sample many conformers per molecule, uncomment this
    # num_return_sequences=10
)

# parse results
string_molecules = tokenizer.batch_decode(outputs, skip_special_tokens=True)
molecules = [parse_molecule(mol) for mol in string_molecules]
```
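When `num_return_sequences` is set, the flat list of parsed results has to be mapped back to the input SMILES. The sketch below assumes that `generate` returns the samples for each prompt consecutively (all samples for the first prompt, then all for the second, and so on), which is how `transformers` expands the batch as far as we are aware. `group_by_prompt` is a hypothetical helper, and strings stand in for parsed molecules.

```python
from typing import List, TypeVar

T = TypeVar("T")


def group_by_prompt(items: List[T], num_return_sequences: int) -> List[List[T]]:
    """Split a flat list of samples into one sub-list per prompt.

    Assumes samples for the same prompt are adjacent in `items`.
    Hypothetical helper for illustration; not part of the BindGPT code base.
    """
    k = num_return_sequences
    if k <= 0 or len(items) % k != 0:
        raise ValueError("len(items) must be a positive multiple of num_return_sequences")
    return [items[i:i + k] for i in range(0, len(items), k)]


# Stand-in data: 3 prompts, 2 sampled conformers each
flat = ["a1", "a2", "b1", "b2", "c1", "c2"]
grouped = group_by_prompt(flat, num_return_sequences=2)
print(grouped)
```

With the real output, `zip(smiles, group_by_prompt(molecules, 10))` would pair each input SMILES with its sampled conformers.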

## Usage and License

Please note that all model weights are exclusively licensed for research purposes. The accompanying dataset is licensed under CC BY 4.0, which permits solely non-commercial usage.
We emphatically urge all users to adhere to the highest ethical standards when using our models, including maintaining fairness, transparency, and responsibility in their research. Any usage that may lead to harm or pose a detriment to society is strictly forbidden.


## References
If you use our repository, please cite the following related paper:

```
@article{zholus2024bindgpt,
  author    = {Artem Zholus and Maksim Kuznetsov and Roman Schutski and Rim Shayakhmetov and Daniil Polykovskiy and Sarath Chandar and Alex Zhavoronkov},
  title     = {BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning},
  journal   = {arXiv},
  year      = {2024},
}
```