[Bug] On Win11 with a 4090 GPU, deploying internlm-chat-20b-4bit via Docker prints "WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer." and hangs without responding when a question is asked.
The deployment steps are as follows:
1. Create the Dockerfile:
FROM nvcr.io/nvidia/pytorch:22.12-py3
WORKDIR /workspace/
ENV PYTHONPATH /workspace/
RUN pip uninstall -y opencv-python
RUN pip install opencv-python==4.8.0.74 -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install 'lmdeploy>=0.0.9' -i https://pypi.tuna.tsinghua.edu.cn/simple
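Note that the version spec needs quoting, otherwise the shell behind RUN treats >= as output redirection. A variant that sidesteps quoting entirely (assuming 0.0.9 is the version under test) is to pin it exactly:

RUN pip install lmdeploy==0.0.9 -i https://pypi.tuna.tsinghua.edu.cn/simple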
2. Download the model:
git clone https://huggingface.co/lmdeploy/turbomind-internlm-chat-20b-w4
Rename the directory turbomind-internlm-chat-20b-w4 to workspace,
or:
git clone https://huggingface.co/internlm/internlm-chat-20b-4bit
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name internlm-chat-20b \
    --model-path ./internlm-chat-20b-4bit \
    --model-format awq \
    --group-size 128
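Given the tokenizer warning that shows up later, it may be worth checking what the converted workspace actually contains. A quick sketch, assuming lmdeploy's default output layout with a triton_models/tokenizer directory:

ls ./workspace/triton_models/tokenizer/
# if tokenizer.json is absent here, lmdeploy falls back to a slower
# tokenizer initialization, which is what the WARNING below refers to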
3. Build and run the Docker container:
docker build -f docker/Dockerfile -t internlm .
docker run -it -d --gpus all --ipc=host -p 7891:8000 --name=internlm --ulimit memlock=-1 --ulimit stack=67108864 -v D:\ai\internlm:/workspace internlm
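For reference, two quick checks against the running container can rule out basic setup problems, i.e. that the container actually sees the 4090 and that lmdeploy is importable (the __version__ attribute is assumed to exist, as in recent releases):

docker exec -it internlm nvidia-smi
docker exec -it internlm python3 -c "import lmdeploy; print(lmdeploy.__version__)"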
4. Start lmdeploy inside the container:
python3 -m lmdeploy.serve.gradio.app ./workspace 0.0.0.0 8000
5. Startup output:
root@ec0690884cd4:/workspace# python3 -m lmdeploy.serve.gradio.app ./workspace 0.0.0.0 8000
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
[TM][INFO] Barrier(1)
[TM][INFO] Barrier(1)
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [LlamaCacheManager] max_entry_count = 48
[TM][INFO] [LlamaCacheManager] chunk_size = 1
[TM][INFO] [LlamaCacheManager][allocate]
[TM][INFO] [LlamaCacheManager][allocate] malloc 1
[TM][INFO] [LlamaCacheManager][allocate] count = 1
[TM][INFO] [LlamaCacheManager][allocate] free = 1
[TM][INFO] [internalThreadEntry] 0
[TM][INFO] Barrier(1)
[TM][INFO] Barrier(1)
(the Barrier(1) line repeats ~29 more times)
server is gonna mount on: http://0.0.0.0:8000
Running on local URL: http://0.0.0.0:8000
6. Open the LMDeploy Playground in a browser:
http://127.0.0.1:7891/
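For reference, a port-mapping problem can be ruled out first by fetching the page from the Windows host (Gradio serves HTML at the root path):

curl http://127.0.0.1:7891/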
7. Output after submitting any question:
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [verifyRequests] Skipping invalid infer request for id 1721701, code = 1
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] [synchronize] batch_size = 0
[TM][INFO] [LlamaCacheManager][create] 1721701
[TM][INFO] [LlamaCacheManager][allocate]
[TM][INFO] [LlamaCacheManager][allocate] free = 0
[TM][INFO] [init] infer_request_count = 1
[TM][INFO] [init] batch_size = 1
[TM][INFO] [init] session_len = 2056
[TM][INFO] [init] max_input_length = 14
[TM][INFO] [init] max_context_len = 14
[TM][INFO] [init] slot sequence_id history_len input_len context_len tmp_input_len token_ids.size cache_len
[TM][INFO] [init] 0 1721701 0 14 14 14 0 0
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 13, max_input_len = 13, max_context_len = 13
[TM][INFO] context decoding start
8. Final state:
The web page never responds and appears hung. GPU memory usage is 16 GB at this point.
Does running the lmdeploy/turbomind/chat.py script work?
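A minimal invocation of that script, assuming the converted model sits in ./workspace inside the container:

python3 -m lmdeploy.turbomind.chat ./workspace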