---
library_name: transformers
tags:
- chemistry
- biology
- cheminformatics
- materials science
license: mit
language:
- en
metrics:
- mse
- r_squared
base_model:
- seyonec/ChemBERTa-zinc-base-v1
---

# ChemSolubilityBERTa

## Model Description

ChemSolubilityBERTa is a prototype model that predicts the aqueous solubility of chemical compounds from their SMILES representations. It is based on ChemBERTa, a BERT-like transformer architecture pre-trained on 77M SMILES strings for molecular property prediction. We fine-tuned ChemBERTa on ESOL (Estimated SOLubility), a water solubility dataset of 1,128 samples. Given a SMILES string as input, the model outputs a log solubility value (log mol/L).

You can read the full paper [here](./01_ChemSolubilityBERTa.pdf).

## Fine-Tuning Details

- Pretrained model: `seyonec/ChemBERTa-zinc-base-v1`
- Dataset: ESOL (delaney-processed)
- Task: Aqueous solubility prediction (log mol/L)
- Number of training epochs: 3
- Batch size: 16
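
The metadata of this card lists MSE and R² as evaluation metrics. As a rough illustration of how such scores could be computed for a regression model like this one, here is a minimal pure-Python sketch. The function names and the sample values are hypothetical; they are not taken from the paper or from the actual evaluation code.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between reference and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all reference values are identical")
    return 1.0 - ss_res / ss_tot


# Made-up log-solubility values (log mol/L), for illustration only.
reference = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]

print(f"MSE: {mean_squared_error(reference, predicted):.3f}")
print(f"R^2: {r_squared(reference, predicted):.3f}")
```

In practice, `reference` would hold the measured values from a held-out portion of ESOL and `predicted` the corresponding model outputs.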

## How to Use

You can use the model to predict solubility for any molecule represented by a SMILES string:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Repository ID taken from the citation URL in this card
model_id = "khanfs/ChemSolubilityBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

smiles_string = "CCO"  # Example: ethanol
inputs = tokenizer(smiles_string, return_tensors="pt")

# Gradients are not needed for prediction
with torch.no_grad():
    outputs = model(**inputs)

# The single regression output is the predicted log solubility
solubility = outputs.logits.item()
print(f"Predicted log solubility (log mol/L): {solubility:.3f}")
```
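
Because the prediction is a logarithm, it can help to convert it back to a concentration. The sketch below assumes the logarithm is base 10, as in the ESOL dataset, and that the molecular weight is supplied by the caller. The helper names and the example values are hypothetical and are not model outputs.

```python
def log_s_to_mol_per_l(log_s):
    """Convert log10 solubility (log mol/L) to mol/L."""
    return 10.0 ** log_s


def log_s_to_g_per_l(log_s, molecular_weight):
    """Convert log10 solubility to g/L, given a molecular weight in g/mol."""
    if molecular_weight <= 0:
        raise ValueError("molecular_weight must be positive")
    return log_s_to_mol_per_l(log_s) * molecular_weight


# Illustrative input only: a made-up log solubility of -2.0 for a
# hypothetical compound with a molecular weight of 150 g/mol.
example_log_s = -2.0
example_mw = 150.0

print(f"{log_s_to_mol_per_l(example_log_s):.4f} mol/L")
print(f"{log_s_to_g_per_l(example_log_s, example_mw):.2f} g/L")
```

To apply this to a real prediction, pass the `solubility` value from the snippet above as `log_s`, together with the molecular weight of the compound.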

## Citation and Usage

If you use ChemSolubilityBERTa in your research or projects, please cite the following:

```bibtex
@misc{ChemSolubilityBERTa,
  author = {Farooq Khan},
  title  = {ChemSolubilityBERTa: A Transformer-Based Model for Predicting Aqueous Solubility from SMILES},
  year   = {2024},
  url    = {https://huggingface.co/khanfs/ChemSolubilityBERTa}
}
```

## License

This model is licensed under the [MIT License](https://opensource.org/licenses/MIT).