xp1992slz committed on
Commit
7a2f15e
1 Parent(s): 47dfacd

update text

Files changed (1)
  1. README.md +142 -142
README.md CHANGED
@@ -1,142 +1,142 @@
1
- ---
2
- license: llama3
3
- language:
4
- - en
5
- pipeline_tag: text-generation
6
- tags:
7
- - nvidia
8
- - chatqa-2
9
- - chatqa
10
- - llama-3
11
- - pytorch
12
- ---
13
-
14
-
15
- ## Model Details
16
- We introduce Llama3-ChatQA-2, which bridges the gap between open-source LLMs and leading proprietary models (e.g., GPT-4-Turbo) in long-context understanding and retrieval-augmented generation (RAG) capabilities. Llama3-ChatQA-2 is developed using an improved training recipe from [ChatQA-1.5 paper](https://arxiv.org/pdf/2401.10225), and it is built on top of [Llama-3 base model](https://huggingface.co/meta-llama/Meta-Llama-3-70B). Specifically, we continued training of Llama-3 base models to extend the context window from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model’s instruction-following, RAG performance, and long-context understanding capabilities. Llama3-ChatQA-2 has two variants: Llama3-ChatQA-2-8B and Llama3-ChatQA-2-70B. Both models were originally trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), we converted the checkpoints to Hugging Face format. **For more information about ChatQA 2, check the [website](https://chatqa2-project.github.io/)!**
17
-
18
- ## Other Resources
19
- [Llama3-ChatQA-2-70B](https://huggingface.co/nvidia/Llama3-ChatQA-2-70B)   [Evaluation Data](https://huggingface.co/nvidia/Llama3-ChatQA-2-70B/tree/main/data)   [Training Data](https://huggingface.co/datasets/nvidia/ChatQA2-Long-SFT-data)   [Website](https://chatqa2-project.github.io/)   [Paper](https://arxiv.org/abs/2407.14482)
20
-
21
- ## Overview of Benchmark Results
22
- <!-- Results in [ChatRAG Bench](https://huggingface.co/datasets/nvidia/ChatRAG-Bench) are as follows: -->
23
- We evaluate ChatQA 2 on short-context RAG benchmark (ChatRAG) (within 4K tokens), long context tasks from SCROLLS and LongBench (within 32K tokens), and ultra-long context tasks from In- finiteBench (beyond 100K tokens). Results are shown below.
24
-
25
-
26
- ![Example Image](overview.png)
27
- <!-- | | ChatQA-2-70B | GPT-4-Turbo-2024-04-09 | Qwen2-72B-Instruct | Llama3.1-70B-Instruct |
28
- | -- |:--:|:--:|:--:|:--:|
29
- | Ultra-long (4k) | 41.04 | 33.16 | 39.77 | 39.81 |
30
- | Long (32k) | 48.15 | 51.93 | 49.94 | 49.92 |
31
- | Short (4k) | 56.30 | 54.72 | 54.06 | 52.12 | -->
32
-
33
- Note that ChatQA-2 is built based on Llama-3 base model.
34
-
35
-
36
- ## Prompt Format
37
- **We highly recommend that you use the prompt format we provide, as follows:**
38
- ### when context is available
39
- <pre>
40
- System: {System}
41
-
42
- {Context}
43
-
44
- User: {Question}
45
-
46
- Assistant: {Response}
47
-
48
- User: {Question}
49
-
50
- Assistant:
51
- </pre>
52
-
53
- ### when context is not available
54
- <pre>
55
- System: {System}
56
-
57
- User: {Question}
58
-
59
- Assistant: {Response}
60
-
61
- User: {Question}
62
-
63
- Assistant:
64
- </pre>
65
- **The content of the system's turn (i.e., {System}) for both scenarios is as follows:**
66
- <pre>
67
- This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.
68
- </pre>
69
- **Note that our ChatQA-2 models are optimized for the capability with context, e.g., over documents or retrieved context.**
70
-
71
- ## How to use
72
-
73
- ### take the whole document as context
74
- This can be applied to the scenario where the whole document can be fitted into the model, so that there is no need to run retrieval over the document.
75
- ```python
76
- from transformers import AutoTokenizer, AutoModelForCausalLM
77
- import torch
78
-
79
- model_id = "nvidia/Llama3-ChatQA-2-8B"
80
-
81
- tokenizer = AutoTokenizer.from_pretrained(model_id)
82
- model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
83
-
84
- messages = [
85
- {"role": "user", "content": "what is the percentage change of the net income from Q4 FY23 to Q4 FY24?"}
86
- ]
87
-
88
- document = """NVIDIA (NASDAQ: NVDA) today reported revenue for the fourth quarter ended January 28, 2024, of $22.1 billion, up 22% from the previous quarter and up 265% from a year ago.\nFor the quarter, GAAP earnings per diluted share was $4.93, up 33% from the previous quarter and up 765% from a year ago. Non-GAAP earnings per diluted share was $5.16, up 28% from the previous quarter and up 486% from a year ago.\nQ4 Fiscal 2024 Summary\nGAAP\n| $ in millions, except earnings per share | Q4 FY24 | Q3 FY24 | Q4 FY23 | Q/Q | Y/Y |\n| Revenue | $22,103 | $18,120 | $6,051 | Up 22% | Up 265% |\n| Gross margin | 76.0% | 74.0% | 63.3% | Up 2.0 pts | Up 12.7 pts |\n| Operating expenses | $3,176 | $2,983 | $2,576 | Up 6% | Up 23% |\n| Operating income | $13,615 | $10,417 | $1,257 | Up 31% | Up 983% |\n| Net income | $12,285 | $9,243 | $1,414 | Up 33% | Up 769% |\n| Diluted earnings per share | $4.93 | $3.71 | $0.57 | Up 33% | Up 765% |"""
89
-
90
- def get_formatted_input(messages, context):
91
- system = "System: This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context."
92
- instruction = "Please give a full and complete answer for the question."
93
-
94
- for item in messages:
95
- if item['role'] == "user":
96
- ## only apply this instruction for the first user turn
97
- item['content'] = instruction + " " + item['content']
98
- break
99
-
100
- conversation = '\n\n'.join(["User: " + item["content"] if item["role"] == "user" else "Assistant: " + item["content"] for item in messages]) + "\n\nAssistant:"
101
- formatted_input = system + "\n\n" + context + "\n\n" + conversation
102
-
103
- return formatted_input
104
-
105
- formatted_input = get_formatted_input(messages, document)
106
- tokenized_prompt = tokenizer(tokenizer.bos_token + formatted_input, return_tensors="pt").to(model.device)
107
-
108
- terminators = [
109
- tokenizer.eos_token_id,
110
- tokenizer.convert_tokens_to_ids("<|eot_id|>")
111
- ]
112
-
113
- outputs = model.generate(input_ids=tokenized_prompt.input_ids, attention_mask=tokenized_prompt.attention_mask, max_new_tokens=128, eos_token_id=terminators)
114
-
115
- response = outputs[0][tokenized_prompt.input_ids.shape[-1]:]
116
- print(tokenizer.decode(response, skip_special_tokens=True))
117
- ```
118
-
119
- ## Command to run generation
120
- ```
121
- python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset ${dataset_name} --start-idx 0 --end-idx ${num_samples} --max-tokens ${max_tokens} --sample-input-file ${dataset_path}
122
- ```
123
-
124
- see all_command.sh for all detailed configuration.
125
-
126
- ## Correspondence to
127
- Peng Xu (pengx@nvidia.com), Wei Ping (wping@nvidia.com)
128
-
129
- ## Citation
130
- <pre>
131
- @article{xu2024chatqa,
132
- title={ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities},
133
- author={Xu, Peng and Ping, Wei and Wu, Xianchao and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan},
134
- journal={arXiv preprint arXiv:2407.14482},
135
- year={2024}
136
- }
137
- </pre>
138
-
139
-
140
- ## License
141
- The Model is released under Non-Commercial License and the use of this model is also governed by the [META LLAMA 3 COMMUNITY LICENSE AGREEMENT](https://llama.meta.com/llama3/license/)
142
-
 
1
+ ---
2
+ license: llama3
3
+ language:
4
+ - en
5
+ pipeline_tag: text-generation
6
+ tags:
7
+ - nvidia
8
+ - chatqa-2
9
+ - chatqa
10
+ - llama-3
11
+ - pytorch
12
+ ---
13
+
14
+
15
+ ## Model Details
16
+ We introduce Llama3-ChatQA-2, a suite of 128K long-context models that bridges the gap between open-source LLMs and leading proprietary models (e.g., GPT-4-Turbo) in long-context understanding and retrieval-augmented generation (RAG) capabilities. Llama3-ChatQA-2 is developed using an improved training recipe from the [ChatQA-1.5 paper](https://arxiv.org/pdf/2401.10225), and it is built on top of the [Llama-3 base model](https://huggingface.co/meta-llama/Meta-Llama-3-70B). Specifically, we continued training the Llama-3 base models to extend the context window from 8K to 128K tokens, and then applied a three-stage instruction tuning process to enhance the model’s instruction-following, RAG performance, and long-context understanding capabilities. Llama3-ChatQA-2 has two variants: Llama3-ChatQA-2-8B and Llama3-ChatQA-2-70B. Both models were originally trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM); we converted the checkpoints to Hugging Face format. **For more information about ChatQA 2, check the [website](https://chatqa2-project.github.io/)!**
17
+
18
+ ## Other Resources
19
+ [Llama3-ChatQA-2-70B](https://huggingface.co/nvidia/Llama3-ChatQA-2-70B) &ensp; [Evaluation Data](https://huggingface.co/nvidia/Llama3-ChatQA-2-70B/tree/main/data) &ensp; [Training Data](https://huggingface.co/datasets/nvidia/ChatQA2-Long-SFT-data) &ensp; [Website](https://chatqa2-project.github.io/) &ensp; [Paper](https://arxiv.org/abs/2407.14482)
20
+
21
+ ## Overview of Benchmark Results
22
+ <!-- Results in [ChatRAG Bench](https://huggingface.co/datasets/nvidia/ChatRAG-Bench) are as follows: -->
23
+ We evaluate ChatQA 2 on the short-context RAG benchmark ChatRAG (within 4K tokens), long-context tasks from SCROLLS and LongBench (within 32K tokens), and ultra-long-context tasks from InfiniteBench (beyond 100K tokens). Results are shown below.
24
+
25
+
26
+ ![Overview of benchmark results](overview.png)
27
+ <!-- | | ChatQA-2-70B | GPT-4-Turbo-2024-04-09 | Qwen2-72B-Instruct | Llama3.1-70B-Instruct |
28
+ | -- |:--:|:--:|:--:|:--:|
29
+ | Ultra-long (100k) | 41.04 | 33.16 | 39.77 | 39.81 |
30
+ | Long (32k) | 48.15 | 51.93 | 49.94 | 49.92 |
31
+ | Short (4k) | 56.30 | 54.72 | 54.06 | 52.12 | -->
32
+
33
+ Note that ChatQA-2 is built on the Llama-3 base model.
34
+
35
+
36
+ ## Prompt Format
37
+ **We highly recommend that you use the prompt format we provide, as follows:**
38
+ ### when context is available
39
+ <pre>
40
+ System: {System}
41
+
42
+ {Context}
43
+
44
+ User: {Question}
45
+
46
+ Assistant: {Response}
47
+
48
+ User: {Question}
49
+
50
+ Assistant:
51
+ </pre>
52
+
53
+ ### when context is not available
54
+ <pre>
55
+ System: {System}
56
+
57
+ User: {Question}
58
+
59
+ Assistant: {Response}
60
+
61
+ User: {Question}
62
+
63
+ Assistant:
64
+ </pre>
65
+ **The content of the system's turn (i.e., {System}) for both scenarios is as follows:**
66
+ <pre>
67
+ This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.
68
+ </pre>
69
+ **Note that our ChatQA-2 models are optimized for answering with context, e.g., over documents or retrieved context.**
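The two prompt layouts above differ only in whether a `{Context}` block follows the system turn. A minimal sketch of assembling either layout as a plain string is shown here; `build_prompt` and the sample turns are hypothetical names and data introduced for illustration, not part of the model's tooling.

```python
from typing import Dict, List, Optional

# System turn text as given in the prompt format above.
SYSTEM = (
    "This is a chat between a user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions based on the context. The assistant should also indicate when "
    "the answer cannot be found in the context."
)


def build_prompt(messages: List[Dict[str, str]], context: Optional[str] = None) -> str:
    """Render turns into the System / {Context} / User / Assistant layout.

    `build_prompt` is a hypothetical helper name, not a library function.
    """
    parts = ["System: " + SYSTEM]
    if context:  # the context block is simply omitted when unavailable
        parts.append(context)
    for turn in messages:
        speaker = "User" if turn["role"] == "user" else "Assistant"
        parts.append(f"{speaker}: {turn['content']}")
    parts.append("Assistant:")  # left open for the model to complete
    return "\n\n".join(parts)


# Illustrative multi-turn history (made-up turns, values taken from the README table).
history = [
    {"role": "user", "content": "What was Q4 FY24 revenue?"},
    {"role": "assistant", "content": "$22,103 million."},
    {"role": "user", "content": "And net income?"},
]
prompt = build_prompt(history, context="| Revenue | $22,103 |\n| Net income | $12,285 |")
print(prompt)
```

Unlike the README's own helper, this sketch does not modify `messages` in place, so it can be called repeatedly on the same history.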
70
+
71
+ ## How to use
72
+
73
+ ### take the whole document as context
74
+ This applies when the whole document fits in the model's context window, so there is no need to run retrieval over it.
75
+ ```python
76
+ from transformers import AutoTokenizer, AutoModelForCausalLM
77
+ import torch
78
+
79
+ model_id = "nvidia/Llama3-ChatQA-2-8B"
80
+
81
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
82
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
83
+
84
+ messages = [
85
+ {"role": "user", "content": "what is the percentage change of the net income from Q4 FY23 to Q4 FY24?"}
86
+ ]
87
+
88
+ document = """NVIDIA (NASDAQ: NVDA) today reported revenue for the fourth quarter ended January 28, 2024, of $22.1 billion, up 22% from the previous quarter and up 265% from a year ago.\nFor the quarter, GAAP earnings per diluted share was $4.93, up 33% from the previous quarter and up 765% from a year ago. Non-GAAP earnings per diluted share was $5.16, up 28% from the previous quarter and up 486% from a year ago.\nQ4 Fiscal 2024 Summary\nGAAP\n| $ in millions, except earnings per share | Q4 FY24 | Q3 FY24 | Q4 FY23 | Q/Q | Y/Y |\n| Revenue | $22,103 | $18,120 | $6,051 | Up 22% | Up 265% |\n| Gross margin | 76.0% | 74.0% | 63.3% | Up 2.0 pts | Up 12.7 pts |\n| Operating expenses | $3,176 | $2,983 | $2,576 | Up 6% | Up 23% |\n| Operating income | $13,615 | $10,417 | $1,257 | Up 31% | Up 983% |\n| Net income | $12,285 | $9,243 | $1,414 | Up 33% | Up 769% |\n| Diluted earnings per share | $4.93 | $3.71 | $0.57 | Up 33% | Up 765% |"""
89
+
90
+ def get_formatted_input(messages, context):
91
+     system = "System: This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context."
92
+     instruction = "Please give a full and complete answer for the question."
93
+
94
+     for item in messages:
95
+         if item['role'] == "user":
96
+             # only apply this instruction for the first user turn
97
+             item['content'] = instruction + " " + item['content']
98
+             break
99
+
100
+     conversation = '\n\n'.join(["User: " + item["content"] if item["role"] == "user" else "Assistant: " + item["content"] for item in messages]) + "\n\nAssistant:"
101
+     formatted_input = system + "\n\n" + context + "\n\n" + conversation
102
+
103
+     return formatted_input
104
+
105
+ formatted_input = get_formatted_input(messages, document)
106
+ tokenized_prompt = tokenizer(tokenizer.bos_token + formatted_input, return_tensors="pt").to(model.device)
107
+
108
+ terminators = [
109
+ tokenizer.eos_token_id,
110
+ tokenizer.convert_tokens_to_ids("<|eot_id|>")
111
+ ]
112
+
113
+ outputs = model.generate(input_ids=tokenized_prompt.input_ids, attention_mask=tokenized_prompt.attention_mask, max_new_tokens=128, eos_token_id=terminators)
114
+
115
+ response = outputs[0][tokenized_prompt.input_ids.shape[-1]:]
116
+ print(tokenizer.decode(response, skip_special_tokens=True))
117
+ ```
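The answer to the example question can be checked by hand against the table in `document`, which lists net income of $12,285 million for Q4 FY24 and $1,414 million for Q4 FY23. The sketch below only redoes that arithmetic; `pct_change` is a hypothetical helper name, and the model's actual wording of the answer may differ.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from `old` to `new`."""
    return (new - old) / old * 100.0


q4_fy23_net_income = 1414.0   # $ millions, from the table in `document`
q4_fy24_net_income = 12285.0  # $ millions, from the table in `document`

change = pct_change(q4_fy23_net_income, q4_fy24_net_income)
print(f"{change:.1f}%")  # about 768.8%, consistent with "Up 769%" in the table
```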
118
+
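Whether a whole document can be passed as context depends on the prompt length plus the generation budget staying within the 128K-token window mentioned under Model Details. The following is a rough sketch of such a check, assuming "128K" means 131,072 tokens; `fits_in_context` and `rough_count` are hypothetical names, and the whitespace counter is only a stand-in for a real tokenizer.

```python
from typing import Callable

CONTEXT_WINDOW = 128 * 1024  # assumption: "128K" read as 131,072 tokens


def fits_in_context(
    prompt: str,
    count_tokens: Callable[[str], int],
    max_new_tokens: int = 128,
    window: int = CONTEXT_WINDOW,
) -> bool:
    """Return True if the prompt plus the generation budget fits the window."""
    return count_tokens(prompt) + max_new_tokens <= window


# Stand-in counter for illustration only: whitespace words are NOT tokens.
# With the tokenizer loaded in the example above, one could instead pass
#   lambda s: len(tokenizer(s).input_ids)
def rough_count(text: str) -> int:
    return len(text.split())


print(fits_in_context("a short document", rough_count))   # True
print(fits_in_context("word " * 200_000, rough_count))    # False
```

If the check fails, the document would need to be shortened or split and retrieved over rather than passed whole.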
119
+ ## Command to run generation
120
+ ```
121
+ python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset ${dataset_name} --start-idx 0 --end-idx ${num_samples} --max-tokens ${max_tokens} --sample-input-file ${dataset_path}
122
+ ```
123
+
124
+ See `all_command.sh` for the detailed configurations.
125
+
126
+ ## Correspondence to
127
+ Peng Xu (pengx@nvidia.com), Wei Ping (wping@nvidia.com)
128
+
129
+ ## Citation
130
+ <pre>
131
+ @article{xu2024chatqa,
132
+ title={ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities},
133
+ author={Xu, Peng and Ping, Wei and Wu, Xianchao and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan},
134
+ journal={arXiv preprint arXiv:2407.14482},
135
+ year={2024}
136
+ }
137
+ </pre>
138
+
139
+
140
+ ## License
141
+ The model is released under a Non-Commercial License, and its use is also governed by the [META LLAMA 3 COMMUNITY LICENSE AGREEMENT](https://llama.meta.com/llama3/license/).
142
+