falcon-7b-instruct-sharded does not work with huggingface/text-generation-inference
I have two NVIDIA GeForce RTX 3080 GPUs on my system, with 20 GB of GPU memory in total. I have successfully run other LLMs that require up to 10 GB of GPU memory.
However, I cannot run the falcon-7b-instruct-sharded model with Hugging Face's text-generation-inference.
Running the text-generation-inference command with the officially recommended parameters:
docker run --gpus all --shm-size 1g -p 8080:80 -v /root/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id vilsonrodrigues/falcon-7b-instruct-sharded --num-shard 2
It reports the error "sharded is not supported for this model".
I've also tried removing all shard-related parameters and running the text-generation-inference command on a single GPU:
docker run --gpus 1 -p 8080:80 -v /root/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id vilsonrodrigues/falcon-7b-instruct-sharded
It reports the error:
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 8.73 GiB
Requested : 157.53 MiB
Device limit : 9.78 GiB
Free (according to CUDA): 17.31 MiB
Please check the following logs:
# docker run --gpus all --shm-size 1g -p 8080:80 -v /root/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id vilsonrodrigues/falcon-7b-instruct-sharded --num-shard 2
2023-07-09T15:25:17.922667Z INFO text_generation_launcher: Args { model_id: "vilsonrodrigues/falcon-7b-instruct-sharded", revision: None, sharded: None, num_shard: Some(2), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "10dd21d407e5", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-09T15:25:17.922719Z INFO text_generation_launcher: Sharding model on 2 processes
2023-07-09T15:25:17.922871Z INFO text_generation_launcher: Starting download process.
2023-07-09T15:25:21.330343Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-09T15:25:22.026919Z INFO text_generation_launcher: Successfully downloaded weights.
2023-07-09T15:25:22.027240Z INFO text_generation_launcher: Starting shard 0
2023-07-09T15:25:22.027310Z INFO text_generation_launcher: Starting shard 1
2023-07-09T15:25:25.942328Z WARN shard-manager: text_generation_launcher: We're not using custom kernels.
rank=1
2023-07-09T15:25:25.970715Z WARN shard-manager: text_generation_launcher: We're not using custom kernels.
rank=0
2023-07-09T15:25:26.454690Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 240, in get_model
raise NotImplementedError("sharded is not supported for this model")
NotImplementedError: sharded is not supported for this model
rank=0
2023-07-09T15:25:26.839306Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 240, in get_model
raise NotImplementedError("sharded is not supported for this model")
NotImplementedError: sharded is not supported for this model
rank=1
2023-07-09T15:25:27.331120Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-09T15:25:27.331151Z ERROR text_generation_launcher: Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 240, in get_model
raise NotImplementedError("sharded is not supported for this model")
NotImplementedError: sharded is not supported for this model
2023-07-09T15:25:27.331177Z INFO text_generation_launcher: Shutting down shards
2023-07-09T15:25:27.760486Z INFO text_generation_launcher: Shard 1 terminated
# docker run --gpus 1 -p 8080:80 -v /root/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id vilsonrodrigues/falcon-7b-instruct-sharded
2023-07-09T15:31:46.866685Z INFO text_generation_launcher: Args { model_id: "vilsonrodrigues/falcon-7b-instruct-sharded", revision: None, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "651ed0250429", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-09T15:31:46.866936Z INFO text_generation_launcher: Starting download process.
2023-07-09T15:31:50.637241Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-09T15:31:51.270900Z INFO text_generation_launcher: Successfully downloaded weights.
2023-07-09T15:31:51.271200Z INFO text_generation_launcher: Starting shard 0
2023-07-09T15:31:54.552445Z WARN shard-manager: text_generation_launcher: We're not using custom kernels.
rank=0
2023-07-09T15:32:00.971897Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 253, in get_model
return FlashRWSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 56, in __init__
model = FlashRWForCausalLM(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 628, in __init__
self.transformer = FlashRWModel(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 558, in __init__
[
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 559, in <listcomp>
FlashRWLayer(layer_id, config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 411, in __init__
self.mlp = FlashMLP(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 363, in __init__
self.dense_h_to_4h = TensorParallelColumnLinear.load(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 234, in load
return cls.load_multi(config, [prefix], weights, bias, dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 238, in load_multi
weight = weights.get_multi_weights_col(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 127, in get_multi_weights_col
w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 127, in <listcomp>
w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 98, in get_sharded
tensor = tensor.to(device=self.device)
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 8.73 GiB
Requested : 157.53 MiB
Device limit : 9.78 GiB
Free (according to CUDA): 17.31 MiB
PyTorch limit (set by user-supplied memory fraction)
: 17179869184.00 GiB
rank=0
2023-07-09T15:32:01.279858Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
Error: ShardCannotStart
2023-07-09T15:32:02.979796Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-09T15:32:02.979844Z ERROR text_generation_launcher: You are using a model of type RefinedWebModel to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 253, in get_model
return FlashRWSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 56, in __init__
model = FlashRWForCausalLM(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 628, in __init__
self.transformer = FlashRWModel(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 558, in __init__
[
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 559, in <listcomp>
FlashRWLayer(layer_id, config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 411, in __init__
self.mlp = FlashMLP(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 363, in __init__
self.dense_h_to_4h = TensorParallelColumnLinear.load(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 234, in load
return cls.load_multi(config, [prefix], weights, bias, dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 238, in load_multi
weight = weights.get_multi_weights_col(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 127, in get_multi_weights_col
w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 127, in <listcomp>
w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 98, in get_sharded
tensor = tensor.to(device=self.device)
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 8.73 GiB
Requested : 157.53 MiB
Device limit : 9.78 GiB
Free (according to CUDA): 17.31 MiB
PyTorch limit (set by user-supplied memory fraction)
: 17179869184.00 GiB
2023-07-09T15:32:02.979915Z INFO text_generation_launcher: Shutting down shards
# nvidia-smi
Sun Jul 9 23:19:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 30% 39C P0 87W / 320W | 0MiB / 10240MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:A1:00.0 Off | N/A |
| 30% 38C P0 N/A / 320W | 0MiB / 10240MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Probably the 3080 doesn't support Falcon-7B; the V100 does not support it either.
#update
Since Falcon-7B does not allow sharding, you will need a GPU with at least 16 GB: the 7B parameters alone take roughly 14 GB in float16, which cannot fit on a single 10 GB card. An alternative would be to apply quantization (GPTQ or bitsandbytes).
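For example, a bitsandbytes-quantized launch should look roughly like the command below. I have not verified this exact invocation on your hardware; it simply adds the --quantize flag to your original command (gptq would additionally require GPTQ-converted weights):
docker run --gpus all --shm-size 1g -p 8080:80 -v /root/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id vilsonrodrigues/falcon-7b-instruct-sharded --quantize bitsandbytes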
Thank you for your swift response.
I suspect this might be an issue with text-generation-inference: on my system, your model works with Hugging Face Transformers.
For further discussion, I've posted an issue at https://github.com/huggingface/text-generation-inference/issues/574.
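For reference, the working Transformers path on my machine looks roughly like the sketch below (the prompt, dtype, and generation settings here are illustrative, not my exact script). With device_map="auto", accelerate splits the ~14 GB of half-precision weights across both 10 GB cards, which is why this succeeds where the single-GPU TGI launch runs out of memory:

# Requires: pip install transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vilsonrodrigues/falcon-7b-instruct-sharded"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights, ~14 GB total
    device_map="auto",           # accelerate spreads layers over GPU 0 and GPU 1
    trust_remote_code=True,      # Falcon shipped custom modeling code at the time
)

# Inputs go to the first GPU; accelerate moves activations between devices.
inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))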