Update README.md
README.md CHANGED
@@ -123,7 +123,6 @@ python3 -m vllm.entrypoints.openai.api_server \
     --tensor-parallel-size 8 \
     --enable-prefix-caching
 ```
-**Important Note** - In the repo revision `g5-48x`, `config.json` has been updated to set `max_position_embeddings` to 288,800, fitting the model's KV cache on a single `g5.48xlarge` instance with 8 A10 GPUs (24GB RAM per GPU).
 
 On an instance with larger GPU RAM (e.g. `p4d.24xlarge`), simply remove the `MAX_MODEL_LEN` argument in order to support the full sequence length of 524,288 tokens:
 ```shell
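For context, the hunk above only shows the tail of the launch command. Below is a minimal sketch of the two variants the README describes; the model path is a placeholder, and mapping the README's `MAX_MODEL_LEN` argument onto vLLM's `--max-model-len` flag is an assumption, since the full command is truncated in this hunk.

```shell
# Sketch only: <model-repo> is a placeholder, and the --max-model-len /
# --revision usage is an assumption; the full command is not shown in
# this hunk.

# g5.48xlarge (8x A10, 24GB each): cap the context length so the KV
# cache fits, matching the max_position_embeddings override in the
# g5-48x revision's config.json.
python3 -m vllm.entrypoints.openai.api_server \
    --model <model-repo> \
    --revision g5-48x \
    --max-model-len 288800 \
    --tensor-parallel-size 8 \
    --enable-prefix-caching

# p4d.24xlarge or similar (larger GPU RAM): drop the cap to serve the
# full 524,288-token sequence length.
python3 -m vllm.entrypoints.openai.api_server \
    --model <model-repo> \
    --tensor-parallel-size 8 \
    --enable-prefix-caching
```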