chenwuml committed on
Commit 1294845
1 Parent(s): 27c011d

Update README.md

Files changed (1)
  1. README.md +10 -6
README.md CHANGED
@@ -110,23 +110,27 @@ This example demonstrates `MegaBeam-Mistral-7B-512k`'s long context capability b
 ## Serve MegaBeam-Mistral-7B-512k on EC2 instances ##
 On an AWS `g5.48xlarge` instance, install vLLM as per [vLLM docs](https://vllm.readthedocs.io/en/latest/).
 ```shell
-pip install vllm==0.5.1
+pip install vllm==0.6.2
 ```
 
 ### Start the server
 ```shell
-VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 python3 -m vllm.entrypoints.openai.api_server \
-    --model aws-prototyping/MegaBeam-Mistral-7B-512k \
-    --tensor-parallel-size 8 \
-    --revision g5-48x
+export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
+export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
+python3 -m vllm.entrypoints.openai.api_server \
+    --model aws-prototyping/MegaBeam-Mistral-7B-512k \
+    --max-model-len 288800 \
+    --tensor-parallel-size 8 \
+    --enable-prefix-caching
 ```
 **Important Note** - In the repo revision `g5-48x`, `config.json` has been updated to set `max_position_embeddings` to 288,800, fitting the model's KV cache on a single `g5.48xlarge` instance with 8 A10 GPUs (24GB RAM per GPU).
 
-On an instance with larger GPU RAM (e.g. `p4d.24xlarge`), simply remove the `revision` argument in order to support the full sequence length of 524,288 tokens:
+On an instance with larger GPU RAM (e.g. `p4d.24xlarge`), simply remove the `--max-model-len` argument in order to support the full sequence length of 524,288 tokens:
 ```shell
 VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 python3 -m vllm.entrypoints.openai.api_server \
    --model aws-prototyping/MegaBeam-Mistral-7B-512k \
    --tensor-parallel-size 8 \
+   --enable-prefix-caching
 ```
 
 ### Run the client
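The "Run the client" section itself is outside this hunk. As a minimal sketch of how to exercise the server started above (assuming vLLM's default listen address `0.0.0.0:8000`, requests sent from the same host, and that the model's chat template is picked up automatically), something like the following should work:

```shell
# List the served model to confirm the server is up.
# Host and port are assumptions: the api_server above uses vLLM's default 0.0.0.0:8000.
curl http://localhost:8000/v1/models

# Send a request through the OpenAI-compatible chat completions endpoint.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "aws-prototyping/MegaBeam-Mistral-7B-512k",
          "messages": [{"role": "user", "content": "Give me a one-sentence summary of vLLM."}],
          "max_tokens": 128
        }'
```

Any OpenAI-compatible client (e.g. the `openai` Python package with its `base_url` pointed at the server) can be used the same way; the `--max-model-len` / `max_position_embeddings` setting only caps the serving context length and does not change the request format.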