Update README.md
README.md CHANGED
@@ -110,23 +110,27 @@ This example demonstrates `MegaBeam-Mistral-7B-512k`'s long context capability b
 ## Serve MegaBeam-Mistral-7B-512k on EC2 instances ##
 On an AWS `g5.48xlarge` instance, install vLLM as per [vLLM docs](https://vllm.readthedocs.io/en/latest/).
 ```shell
-pip install vllm==0.
+pip install vllm==0.6.2
 ```
 
 ### Start the server
 ```shell
-VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
-
-
-
+export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
+export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
+python3 -m vllm.entrypoints.openai.api_server \
+--model aws-prototyping/MegaBeam-Mistral-7B-512k \
+--max-model-len 288800 \
+--tensor-parallel-size 8 \
+--enable-prefix-caching
 ```
 **Important Note** - In the repo revision `g5-48x`, `config.json` has been updated to set `max_position_embeddings` to 288,800, fitting the model's KV cache on a single `g5.48xlarge` instance with 8 A10 GPUs (24GB RAM per GPU).
 
-On an instance with larger GPU RAM (e.g. `p4d.24xlarge`), simply remove the `
+On an instance with larger GPU RAM (e.g. `p4d.24xlarge`), simply remove the `MAX_MODEL_LEN` argument in order to support the full sequence length of 524,288 tokens:
 ```shell
 VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 python3 -m vllm.entrypoints.openai.api_server \
 --model aws-prototyping/MegaBeam-Mistral-7B-512k \
 --tensor-parallel-size 8 \
+--enable-prefix-caching
 ```
 
 ### Run the client
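
Once the server from the updated command is running, it exposes vLLM's OpenAI-compatible REST API. A minimal sanity check, assuming the default port 8000 and a locally reachable server, is to list the served models:

```shell
# Expect a JSON payload whose "data" entries include aws-prototyping/MegaBeam-Mistral-7B-512k
curl http://localhost:8000/v1/models
```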
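The `### Run the client` section that follows this hunk can then use any OpenAI-compatible client. As a rough sketch that is not taken from the README itself, a single request via `curl` might look like the following; the prompt and `max_tokens` value are placeholders:

```shell
# Illustrative request only; substitute your own (potentially very long) prompt.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "aws-prototyping/MegaBeam-Mistral-7B-512k",
        "messages": [
          {"role": "user", "content": "Summarize the following meeting transcript: ..."}
        ],
        "max_tokens": 256
      }'
```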