Update README.md
README.md CHANGED
@@ -110,23 +110,27 @@ This example demonstrates `MegaBeam-Mistral-7B-512k`'s long context capability b
 ## Serve MegaBeam-Mistral-7B-512k on EC2 instances ##
 On an AWS `g5.48xlarge` instance, install vLLM as per [vLLM docs](https://vllm.readthedocs.io/en/latest/).
 ```shell
-pip install vllm==0.
+pip install vllm==0.6.2
 ```
 
 ### Start the server
 ```shell
-VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
-
-
-
+export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
+export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
+python3 -m vllm.entrypoints.openai.api_server \
+--model aws-prototyping/MegaBeam-Mistral-7B-512k \
+--max-model-len 288800 \
+--tensor-parallel-size 8 \
+--enable-prefix-caching
 ```
 **Important Note** - In the repo revision `g5-48x`, `config.json` has been updated to set `max_position_embeddings` to 288,800, fitting the model's KV cache on a single `g5.48xlarge` instance with 8 A10 GPUs (24GB RAM per GPU).
 
-On an instance with larger GPU RAM (e.g. `p4d.24xlarge`), simply remove the `
+On an instance with larger GPU RAM (e.g. `p4d.24xlarge`), simply remove the `MAX_MODEL_LEN` argument in order to support the full sequence length of 524,288 tokens:
 ```shell
 VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 python3 -m vllm.entrypoints.openai.api_server \
 --model aws-prototyping/MegaBeam-Mistral-7B-512k \
 --tensor-parallel-size 8 \
+--enable-prefix-caching
 ```
 
 ### Run the client
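
Once the server from the updated command is running, it exposes vLLM's OpenAI-compatible REST API. A minimal sanity check, assuming the default port 8000 and a locally reachable server, is to list the served models:

```shell
# Expect a JSON payload whose "data" entries include aws-prototyping/MegaBeam-Mistral-7B-512k
curl http://localhost:8000/v1/models
```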
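The `### Run the client` section that follows this hunk can then use any OpenAI-compatible client. As a rough sketch that is not taken from the README itself, a single request via `curl` might look like the following; the prompt and `max_tokens` value are placeholders:

```shell
# Illustrative request only; substitute your own (potentially very long) prompt.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "aws-prototyping/MegaBeam-Mistral-7B-512k",
        "messages": [
          {"role": "user", "content": "Summarize the following meeting transcript: ..."}
        ],
        "max_tokens": 256
      }'
```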