Infinity support: Short max_length=2048 for more optimized deployment
#20
by michaelfeil - opened
@Alibaba Team, thanks so much for supporting this model.
Here is how to run it with https://github.com/michaelfeil/infinity:
Run via Docker:
```bash
docker run --gpus all -p 7997:7997 michaelf34/infinity:0.0.68-trt-onnx v2 --model-id Alibaba-NLP/gte-Qwen2-1.5B-instruct --revision "refs/pr/20" --dtype bfloat16 --batch-size 8 --device cuda --engine torch --port 7997 --no-bettertransformer
```
Run via CLI:
```bash
pip install infinity_emb flash-attn
infinity_emb v2 --model-id Alibaba-NLP/gte-Qwen2-1.5B-instruct --revision "refs/pr/20" --dtype bfloat16 --batch-size 8 --device cuda --engine torch --port 7997 --no-bettertransformer
```
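Once the server is up, you can smoke-test it against infinity's OpenAI-compatible embeddings endpoint. A minimal sketch, assuming the server started by either command above is listening on localhost:7997 and the input string is just a placeholder query:

```bash
# POST a single text to the /embeddings route and print the JSON response,
# which contains the embedding vector under data[0].embedding.
curl -X POST http://localhost:7997/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "Alibaba-NLP/gte-Qwen2-1.5B-instruct", "input": ["what is the capital of China?"]}'
```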
DO NOT MERGE!
michaelfeil changed pull request title from "Infinity support: Short max_length for more optimized deployment" to "Infinity support: Short max_length=2048 for more optimized deployment"
michaelfeil changed pull request status to closed
michaelfeil changed pull request status to open
thenlper changed pull request status to merged
Please do not merge this PR, as mentioned above.
I'll open a new PR to undo this -> https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct/discussions/22