A T5-small model pre-trained on the PseudoMD-1M dataset.
PseudoMD-1M is the first artificially-real dataset for cross-modal molecule discovery, consisting of 1,020,139 pseudo molecule-description pairs. Every molecule is represented by its canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description in PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens. Five examples are provided in Appendix A of the paper.
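For readers unfamiliar with the PUG View API mentioned above, the snippet below is a minimal sketch of how a PubChem compound record can be retrieved; it is not the authors' data pipeline, and the CID used (2244, aspirin) is purely illustrative.

```python
# Minimal sketch (not the authors' pipeline): fetching a compound record
# from PubChem's PUG View API. CID 2244 (aspirin) is only an illustration.
import requests

PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"

def fetch_pug_view_record(cid: int) -> dict:
    """Return the full PUG View record for a PubChem CID as a dict."""
    response = requests.get(PUG_VIEW.format(cid=cid), timeout=30)
    response.raise_for_status()
    return response.json()

record = fetch_pug_view_record(2244)
print(record["Record"]["RecordTitle"])  # e.g. "Aspirin"
```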
## Pre-training details

| Parameter | Value |
|---|---|
| Corpus size | 1,020,139 pairs |
| Training steps | 100,000 |
| Learning rate | 1e-3 |
| Batch size | 128 |
| Warm-up steps | 1,000 |
| Weight decay | 0.1 |
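The hyperparameters above map naturally onto Hugging Face `Seq2SeqTrainingArguments`. The sketch below shows one such mapping; it is not the authors' released training script, and the output directory and any unlisted settings (optimizer, scheduler, precision) are assumptions.

```python
# Hypothetical mapping of the table above onto Hugging Face training
# arguments; a sketch, not the authors' released pre-training script.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="ada-t5-small-pretrain",  # assumed name
    max_steps=100_000,                   # training steps
    learning_rate=1e-3,                  # learning rate
    per_device_train_batch_size=128,     # batch size (assuming a single device)
    warmup_steps=1_000,                  # warm-up steps
    weight_decay=0.1,                    # weight decay
)
```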
## Example Usage

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the pre-trained checkpoint and its tokenizer from the Hugging Face Hub.
model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-small")
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-small", model_max_length=512)
```
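Continuing from the block above, the following is a hedged generation sketch: the input format (a plain description with no task prefix) is an assumption, since the model card does not document a prompt template, and the input text is purely illustrative.

```python
# Sketch of text-to-molecule generation; the prompt format is an assumption.
description = "The molecule is a monocarboxylic acid..."  # illustrative input
inputs = tokenizer(description, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```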
## Citation

```bibtex
@article{chen2023artificially,
  title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
  author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Chen, Jianyu and Zhao, Sendong and Qin, Bing},
  journal={arXiv preprint arXiv:2309.05203},
  year={2023}
}
```