AWQ model in text-generation-webui
#1 by sdranju - opened
Hello,
It seems text-generation-webui does not support AWQ quantized models. Do you have any idea for a workaround?
Regards.
As discussed in the README, vLLM only supports Llama AWQ models at this time.
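For anyone looking for the concrete workaround with a supported Llama model, something like the following should work with vLLM's offline Python API. This is a minimal sketch; the repo name is just an example of one of the AWQ uploads.

from vllm import LLM, SamplingParams

# Minimal sketch: load a Llama AWQ repo with vLLM's offline inference API.
# The model name below is used for illustration only.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="float16",
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain AWQ quantization in one sentence."], sampling)
print(outputs[0].outputs[0].text)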
@TheBloke However, they apparently don't support AWQ Mistral yet:
ValueError: Quantization is not supported for <class 'vllm.model_executor.models.mistral.MistralForCausalLM'>.
They should do; version 0.2 was just pushed a few hours ago with Mistral support listed. I updated my README a minute ago to say it now works.
Are you running 0.2?
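A quick way to check which version is actually being imported (assuming the installed vllm package exposes __version__, as recent releases do):

import vllm

# Print the installed vLLM version; should show 0.2.0 if the new release
# with Mistral support is what's actually on the path.
print(vllm.__version__)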
I am running version 0.2.0, or more accurately I'm running from source (main), which I can see is tagged at 0.2.0.
Successfully built vllm
Installing collected packages: vllm
Successfully installed vllm-0.2.0
root@21cb0f50ccdf:/vllm# cd ..
root@21cb0f50ccdf:/# python -m vllm.entrypoints.api_server --model TheBloke/Mistral-7B-v0.1-AWQ --quantization awq --dtype float16
WARNING 09-29 16:49:26 config.py:341] Casting torch.bfloat16 to torch.float16.
INFO 09-29 16:49:26 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Mistral-7B-v0.1-AWQ', tokenizer='TheBloke/Mistral-7B-v0.1-AWQ', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/api_server.py", line 74, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
engine = cls(engine_args.worker_use_ray,
File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 270, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 306, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 108, in __init__
self._init_workers(distributed_init_method)
File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 140, in _init_workers
self._run_workers(
File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 692, in _run_workers
output = executor(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 68, in init_model
self.model = get_model(self.model_config)
File "/usr/local/lib/python3.8/dist-packages/vllm/model_executor/model_loader.py", line 67, in get_model
raise ValueError(
ValueError: Quantization is not supported for <class 'vllm.model_executor.models.mistral.MistralForCausalLM'>.
Oh, damn. I guess they just added unquantised support.
I'll remove mention of it from the README again!
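For reference, once a supported AWQ model does load, the demo server started with python -m vllm.entrypoints.api_server can be queried like this. A minimal sketch, assuming the default localhost:8000 and the /generate route of vllm.entrypoints.api_server:

import requests

# Query vLLM's demo api_server; assumes the default host/port and the
# /generate route, which returns JSON with a "text" field.
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "San Francisco is a",
        "max_tokens": 64,
        "temperature": 0.8,
    },
)
print(response.json()["text"])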