A T5-base model pre-trained on the PseudoMD-1M dataset.
PseudoMD-1M is the first artificially-real dataset for cross-modal molecule discovery, consisting of 1,020,139 pseudo molecule-description pairs. Each molecule is represented by its canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description in PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens. Five examples are provided in Appendix A of the paper.
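For illustration, an entry in the dataset pairs a canonical SMILES string with a natural-language description, roughly as sketched below. This example is hypothetical and not drawn from PseudoMD-1M; see Appendix A of the paper for real samples.

```python
# Hypothetical molecule-description pair in the PseudoMD-1M style
# (illustrative only; actual dataset entries may differ in detail).
pair = {
    "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",  # canonical SMILES for aspirin
    "description": (
        "The molecule is a benzoic acid derivative in which the hydroxy "
        "group of salicylic acid has been acetylated. ..."
    ),
}
```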
## Pre-training details
| Parameter      | Value     |
|----------------|-----------|
| Corpus Size    | 1,020,139 |
| Training Steps | 100,000   |
| Learning Rate  | 1e-3      |
| Batch Size     | 128       |
| Warm-up Steps  | 1,000     |
| Weight Decay   | 0.1       |
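As a sketch, these hyperparameters could be expressed with Hugging Face `Seq2SeqTrainingArguments` as shown below. The original training script is not part of this card, so the output path and the assumption that the batch size maps to a single device are hypothetical.

```python
# Sketch of the reported pre-training hyperparameters as
# Seq2SeqTrainingArguments (assumed mapping; not the authors' script).
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./ada-t5-base-pretrain",   # hypothetical output path
    max_steps=100_000,                     # Training Steps
    learning_rate=1e-3,                    # Learning Rate
    per_device_train_batch_size=128,       # Batch Size (single device assumed)
    warmup_steps=1_000,                    # Warm-up Steps
    weight_decay=0.1,                      # Weight Decay
)
```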
## Example Usage
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-base", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-base")
```
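Since this checkpoint is a pre-trained base that is typically fine-tuned before downstream use, the snippet below is only a minimal generation sketch; the input text and decoding settings are assumptions, not a prompt format prescribed by the paper.

```python
# Minimal generation sketch (assumed input format; fine-tuning is
# normally required before the output is meaningful).
description = "The molecule is a monocarboxylic acid that ..."
inputs = tokenizer(description, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```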
## Citation
```bibtex
@article{chen2023artificially,
  title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
  author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Chen, Jianyu and Zhao, Sendong and Qin, Bing},
  journal={arXiv preprint arXiv:2309.05203},
  year={2023}
}
```