---
license: apache-2.0
pipeline_tag: text-generation
tags:
- chemistry
language:
- en
- zh
---
# ChemLLM-7B-Chat: LLM for Chemistry and Molecule Science
> [!IMPORTANT]
> We recommend using the newer versions of ChemLLM:
> [AI4Chem/ChemLLM-7B-Chat-1.5-DPO](https://huggingface.co/AI4Chem/ChemLLM-7B-Chat-1.5-DPO) or [AI4Chem/ChemLLM-7B-Chat-1.5-SFT](https://huggingface.co/AI4Chem/ChemLLM-7B-Chat-1.5-SFT)

ChemLLM-7B-Chat, the first open-source large language model for chemistry and molecule science, built on InternLM-2 with ❤
[![Paper page](https://huggingface.co/datasets/huggingface/badges/resolve/main/paper-page-sm.svg)](https://huggingface.co/papers/2402.06852)
<center><img src='https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/wdFV6p3rTBCtskbeuVwNJ.png'></center>
## News
- ChemLLM-1.5 released! Two versions are available: [AI4Chem/ChemLLM-7B-Chat-1.5-DPO](https://huggingface.co/AI4Chem/ChemLLM-7B-Chat-1.5-DPO) and [AI4Chem/ChemLLM-7B-Chat-1.5-SFT](https://huggingface.co/AI4Chem/ChemLLM-7B-Chat-1.5-SFT). [2024-4-2]
- ChemLLM-1.5 updated! Try it on the [Demo Site](https://chemllm.org/#/chat) or via the [API Reference](https://api.chemllm.org/docs). [2024-3-23]
- ChemLLM was featured by Hugging Face on the [“Daily Papers” page](https://huggingface.co/papers/2402.06852). [2024-2-13]
- ChemLLM arXiv preprint released: [ChemLLM: A Chemical Large Language Model](https://arxiv.org/abs/2402.06852). [2024-2-10]
- News report from [Shanghai AI Lab](https://mp.weixin.qq.com/s/u-i7lQxJzrytipek4a87fw). [2024-1-26]
- ChemLLM-7B-Chat ver 1.0 released: https://chemllm.org/ [2024-1-18]
- ChemLLM-7B-Chat ver 1.0 open-sourced. [2024-1-17]
- Chepybara ver 0.2 online demo released: https://chemllm.org/ [2023-12-9]
## Usage
Try the [online demo](https://chemllm.org/) instantly, or...
Install `transformers`,
```
pip install transformers
```
Load `ChemLLM-7B-Chat` and run,
```
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name_or_id = "AI4Chem/ChemLLM-7B-Chat"

# Load the weights in half precision and place them automatically across available devices.
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id, trust_remote_code=True)

prompt = "What is the molecule of Ibuprofen?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generation_config = GenerationConfig(
    do_sample=True,
    top_k=1,
    temperature=0.9,
    max_new_tokens=500,
    repetition_penalty=1.5,
    pad_token_id=tokenizer.eos_token_id,
)

outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
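For interactive use you can also print tokens as they are produced. A minimal sketch using the `TextStreamer` helper from `transformers`, reusing the `model`, `tokenizer`, and `generation_config` defined above (the example prompt is arbitrary):
```
from transformers import TextStreamer

# Stream decoded text to stdout as it is generated, skipping the echoed
# prompt and any special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

inputs = tokenizer("What is the boiling point of ethanol?", return_tensors="pt").to(model.device)
model.generate(**inputs, generation_config=generation_config, streamer=streamer)
```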
## System Prompt Best Practice
You can use the same dialogue templates and system prompt as [Agent Chepybara](https://chemllm.org/) to get better responses in local inference.
### Dialogue Templates
For queries in ShareGPT format like,
```
{"instruction": "...", "prompt": "...", "answer": "...", "history": [[q1, a1], [q2, a2]]}
```
You can convert it into the InternLM2 dialogue format like this,
```
def InternLM2_format(instruction, prompt, answer, history):
    prefix_template = [
        "<|im_start|>system\n",
        "{}",
        "<|im_end|>\n"
    ]
    # Note the comma after the third element: without it Python concatenates
    # the adjacent string literals and the template indices below break.
    prompt_template = [
        "<|im_start|>user\n",
        "{}",
        "<|im_end|>\n",
        "<|im_start|>assistant\n",
        "{}",
        "<|im_end|>\n"
    ]
    system = f'{prefix_template[0]}{prefix_template[1].format(instruction)}{prefix_template[2]}'
    history = "".join([f'{prompt_template[0]}{prompt_template[1].format(qa[0])}{prompt_template[2]}{prompt_template[3]}{prompt_template[4].format(qa[1])}{prompt_template[5]}' for qa in history])
    # Leave the final assistant turn open so the model continues from there.
    prompt = f'{prompt_template[0]}{prompt_template[1].format(prompt)}{prompt_template[2]}{prompt_template[3]}'
    return f"{system}{history}{prompt}"
```
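For illustration, here is what the function produces for a hypothetical record with one turn of history (the instruction, questions, and answers below are made up; `answer` is unused when building an inference prompt, so `None` is passed):
```
query = InternLM2_format(
    "You are Chepybara, a chemistry assistant.",  # hypothetical system instruction
    "What is the SMILES of aspirin?",             # current user prompt
    None,                                         # answer is unused for inference
    [["What is the formula of water?", "H2O."]],  # one prior Q/A turn
)
print(query)
# <|im_start|>system
# You are Chepybara, a chemistry assistant.<|im_end|>
# <|im_start|>user
# What is the formula of water?<|im_end|>
# <|im_start|>assistant
# H2O.<|im_end|>
# <|im_start|>user
# What is the SMILES of aspirin?<|im_end|>
# <|im_start|>assistant
```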
And here is a good example of a system prompt,
```
- Chepybara is a conversational language model developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be Professional, Sophisticated, and Chemical-centric.
- For uncertain notions and data, Chepybara always labels them as theoretical predictions and notifies users accordingly.
- Chepybara can accept SMILES (Simplified Molecular Input Line Entry System) strings, prefers to output IUPAC names (International Union of Pure and Applied Chemistry nomenclature of organic chemistry), and depicts reactions as SMARTS (SMILES arbitrary target specification) strings. Self-Referencing Embedded Strings (SELFIES) are also accepted.
- Chepybara always solves problems and thinks in a step-by-step fashion; its output begins with *Let's think step by step*.
```
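Putting the pieces together, a minimal end-to-end sketch, reusing `model`, `tokenizer`, `generation_config`, and `InternLM2_format` from the snippets above (the system prompt is abbreviated here and the question is arbitrary):
```
# Paste the full Chepybara system prompt in place of the abbreviated string.
system_prompt = "Chepybara is a conversational language model developed by Shanghai AI Laboratory. ..."

text = InternLM2_format(system_prompt, "What is the SMILES of caffeine?", None, [])
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, generation_config=generation_config)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```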
## Results
### MMLU Highlights
| dataset | ChatGLM3-6B | Qwen-7B | LLaMA-2-7B | Mistral-7B | InternLM2-7B-Chat | ChemLLM-7B-Chat |
| ---------------------- | ----------- | ------- | ---------- | ---------- | ----------------- | ----------------- |
| college chemistry | 43.0 | 39.0 | 27.0 | 40.0 | 43.0 | 47.0 |
| college mathematics | 28.0 | 33.0 | 33.0 | 30.0 | 36.0 | 41.0 |
| college physics | 32.4 | 35.3 | 25.5 | 34.3 | 41.2 | 48.0 |
| formal logic | 35.7 | 43.7 | 24.6 | 40.5 | 34.9 | 47.6 |
| moral scenarios | 26.4 | 35.0 | 24.1 | 39.9 | 38.6 | 44.3 |
| humanities average | 62.7 | 62.5 | 51.7 | 64.5 | 66.5 | 68.6 |
| stem average | 46.5 | 45.8 | 39.0 | 47.8 | 52.2 | 52.6 |
| social science average | 68.2 | 65.8 | 55.5 | 68.1 | 69.7 | 71.9 |
| other average | 60.5 | 60.3 | 51.3 | 62.4 | 63.2 | 65.2 |
| mmlu | 58.0 | 57.1 | 48.2 | 59.2 | 61.7 | 63.2 |
*(Evaluated with OpenCompass)*
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/dvqKoPi0il6vrnGcSZp9p.png)
### Chemical Benchmark
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/qFl2h0fTXYTjQsDZXjSx8.png)
*(Scores judged by GPT-4-Turbo)*
### Professional Translation
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/kVDK3H8a0802HWYHtlHYP.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/ERbod2Elccw-k_6tEYZjO.png)
You can try it [online](https://chemllm.org/).
## Cite this work
```
@misc{zhang2024chemllm,
title={ChemLLM: A Chemical Large Language Model},
author={Di Zhang and Wei Liu and Qian Tan and Jingdan Chen and Hang Yan and Yuliang Yan and Jiatong Li and Weiran Huang and Xiangyu Yue and Dongzhan Zhou and Shufei Zhang and Mao Su and Hansen Zhong and Yuqiang Li and Wanli Ouyang},
year={2024},
eprint={2402.06852},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
## Disclaimer
The LLM may generate incorrect answers; please proofread its output and use it at your own risk.
## Open Source License
The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, or other questions and collaborations, please contact <support@chemllm.org>.
## Demo
[Agent Chepybara](https://chemllm.org/)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/vsA5MJVP7-XmBp6uFs3tV.png)
## Contact
[AI4Physics Science, Shanghai AI Lab](mailto:support@chemllm.org)