---
library_name: transformers
tags:
- chemistry
- biology
- cheminformatics
- materials science
license: mit
language:
- en
metrics:
- mse
- r_squared
base_model:
- seyonec/ChemBERTa-zinc-base-v1
---

# ChemSolubilityBERTa
## Model Description
ChemSolubilityBERTa is a prototype model that predicts the aqueous solubility of chemical compounds from their SMILES representations. It is based on ChemBERTa, a BERT-like transformer architecture pre-trained on 77M SMILES strings for molecular property prediction. We fine-tuned ChemBERTa on ESOL (Estimated SOLubility), a water solubility dataset of 1,128 samples. Given a SMILES string as input, the model outputs a log solubility value (log mol/L).
You can read the full paper [here](./01_ChemSolubilityBERTa.pdf).

## Fine-Tuning Details
- Pretrained model: `seyonec/ChemBERTa-zinc-base-v1`
- Dataset: ESOL (delaney-processed)
- Task: Aqueous solubility prediction (log mol/L)
- Number of training epochs: 3
- Batch size: 16
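
The card's metadata lists MSE and R² as evaluation metrics. As a rough illustration of what those numbers measure, the sketch below computes both for measured versus predicted log-solubility values in plain Python. The function names and the sample values are made up for this example and are not taken from the paper or the ESOL dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical log-solubility values (log mol/L), for illustration only.
measured = [-0.8, -3.3, -2.1, -7.9]
predicted = [-0.5, -3.0, -2.5, -7.0]

print(f"MSE: {mean_squared_error(measured, predicted):.3f}")
print(f"R^2: {r_squared(measured, predicted):.3f}")
```

A lower MSE and an R² closer to 1 both indicate predictions that track the measured values more closely; an R² of 0 corresponds to always predicting the mean of the measured values.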

## How to Use
You can use the model to predict solubility for any molecule represented by a SMILES string:

```python
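import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "username" is a placeholder; replace it with the model's Hub namespace.
model_id = "username/ChemSolubilityBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

smiles_string = "CCO"  # SMILES for ethanol
inputs = tokenizer(smiles_string, return_tensors="pt")

# Gradients are not needed for inference.
with torch.no_grad():
    outputs = model(**inputs)

# The single output logit is the predicted log solubility.
solubility = outputs.logits.item()
print(f"Predicted log solubility (log mol/L): {solubility:.3f}")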
```
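
The value printed above is a logarithm, so it can be useful to convert it to a concentration. The sketch below does that under two assumptions that the card does not state explicitly: that the logarithm is base 10 (the usual convention for ESOL), and that the caller supplies the molar mass. The helper names and the example numbers are hypothetical.

```python
def log_s_to_molar(log_s: float) -> float:
    """Convert log solubility (assumed log10 of mol/L) to mol/L."""
    return 10.0 ** log_s


def log_s_to_grams_per_litre(log_s: float, molar_mass: float) -> float:
    """Convert log solubility to g/L, given the molar mass in g/mol."""
    if molar_mass <= 0:
        raise ValueError("molar_mass must be positive")
    return log_s_to_molar(log_s) * molar_mass


# Hypothetical prediction and molar mass, for illustration only.
predicted_log_s = -2.0
molar_mass = 100.0  # g/mol

print(f"{log_s_to_molar(predicted_log_s):.4f} mol/L")
print(f"{log_s_to_grams_per_litre(predicted_log_s, molar_mass):.4f} g/L")
```

Because the scale is logarithmic, a difference of one unit in the model output corresponds to a tenfold difference in molar solubility.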
## Citation and Usage

If you use ChemSolubilityBERTa in your research or projects, please cite the following:

```bibtex
@misc{ChemSolubilityBERTa,
  author = {Farooq Khan},
  title = {ChemSolubilityBERTa: A Transformer-Based Model for Predicting Aqueous Solubility from SMILES},
  year = {2024},
  url = {https://huggingface.co/khanfs/ChemSolubilityBERTa}
}
```

## License
This model is licensed under the [MIT License](https://opensource.org/licenses/MIT).