falcon-7b-instruct-sharded does not work with huggingface/text-generation-inference

#2
by tanshuai - opened

I have two NVIDIA GeForce RTX 3080 GPUs on my system, with 20 GB of GPU memory in total. This setup has successfully run other LLM models that require up to 10 GB of GPU memory.

However, I cannot run the falcon-7b-instruct-sharded model with Hugging Face's text-generation-inference.

Running the text-generation-inference command with the officially recommended parameters:

docker run --gpus all --shm-size 1g -p 8080:80 -v /root/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id vilsonrodrigues/falcon-7b-instruct-sharded --num-shard 2

It reports error "sharded is not supported for this model".

I've also tried removing all shard-related parameters and running the text-generation-inference command with only a single GPU:

docker run --gpus 1 -p 8080:80 -v /root/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id vilsonrodrigues/falcon-7b-instruct-sharded

It reports error "torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 8.73 GiB
Requested : 157.53 MiB
Device limit : 9.78 GiB
Free (according to CUDA): 19.31 MiB"
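
If I am reading these numbers correctly, the out-of-memory result is not surprising on its own: roughly 7B parameters at 2 bytes each in fp16 is about 14 GB for the weights alone, well above the 9.78 GiB limit of a single 3080. That is exactly why I expected the sharded, two-GPU run to work.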

Please check the following logs:

# docker run --gpus all  --shm-size 1g -p 8080:80 -v /root/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id vilsonrodrigues/falcon-7b-instruct-sharded --num-shard 2
2023-07-09T15:25:17.922667Z  INFO text_generation_launcher: Args { model_id: "vilsonrodrigues/falcon-7b-instruct-sharded", revision: None, sharded: None, num_shard: Some(2), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "10dd21d407e5", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-09T15:25:17.922719Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-07-09T15:25:17.922871Z  INFO text_generation_launcher: Starting download process.
2023-07-09T15:25:21.330343Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-07-09T15:25:22.026919Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-09T15:25:22.027240Z  INFO text_generation_launcher: Starting shard 0
2023-07-09T15:25:22.027310Z  INFO text_generation_launcher: Starting shard 1
2023-07-09T15:25:25.942328Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels.
 rank=1
2023-07-09T15:25:25.970715Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels.
 rank=0
2023-07-09T15:25:26.454690Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 240, in get_model
    raise NotImplementedError("sharded is not supported for this model")
NotImplementedError: sharded is not supported for this model
 rank=0
2023-07-09T15:25:26.839306Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 240, in get_model
    raise NotImplementedError("sharded is not supported for this model")
NotImplementedError: sharded is not supported for this model
 rank=1
2023-07-09T15:25:27.331120Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-09T15:25:27.331151Z ERROR text_generation_launcher: Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 240, in get_model
    raise NotImplementedError("sharded is not supported for this model")

NotImplementedError: sharded is not supported for this model


2023-07-09T15:25:27.331177Z  INFO text_generation_launcher: Shutting down shards
2023-07-09T15:25:27.760486Z  INFO text_generation_launcher: Shard 1 terminated
# docker run --gpus 1 -p 8080:80 -v /root/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id vilsonrodrigues/falcon-7b-instruct-sharded
2023-07-09T15:31:46.866685Z  INFO text_generation_launcher: Args { model_id: "vilsonrodrigues/falcon-7b-instruct-sharded", revision: None, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "651ed0250429", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-09T15:31:46.866936Z  INFO text_generation_launcher: Starting download process.
2023-07-09T15:31:50.637241Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-07-09T15:31:51.270900Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-09T15:31:51.271200Z  INFO text_generation_launcher: Starting shard 0
2023-07-09T15:31:54.552445Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels.
 rank=0
2023-07-09T15:32:00.971897Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 253, in get_model
    return FlashRWSharded(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 56, in __init__
    model = FlashRWForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 628, in __init__
    self.transformer = FlashRWModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 558, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 559, in <listcomp>
    FlashRWLayer(layer_id, config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 411, in __init__
    self.mlp = FlashMLP(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 363, in __init__
    self.dense_h_to_4h = TensorParallelColumnLinear.load(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 234, in load
    return cls.load_multi(config, [prefix], weights, bias, dim=0)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 238, in load_multi
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 127, in get_multi_weights_col
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 127, in <listcomp>
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 98, in get_sharded
    tensor = tensor.to(device=self.device)
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated     : 8.73 GiB
Requested               : 157.53 MiB
Device limit            : 9.78 GiB
Free (according to CUDA): 17.31 MiB
PyTorch limit (set by user-supplied memory fraction)
                        : 17179869184.00 GiB
 rank=0
2023-07-09T15:32:01.279858Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
Error: ShardCannotStart
2023-07-09T15:32:02.979796Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-09T15:32:02.979844Z ERROR text_generation_launcher: You are using a model of type RefinedWebModel to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 253, in get_model
    return FlashRWSharded(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 56, in __init__
    model = FlashRWForCausalLM(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 628, in __init__
    self.transformer = FlashRWModel(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 558, in __init__
    [

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 559, in <listcomp>
    FlashRWLayer(layer_id, config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 411, in __init__
    self.mlp = FlashMLP(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 363, in __init__
    self.dense_h_to_4h = TensorParallelColumnLinear.load(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 234, in load
    return cls.load_multi(config, [prefix], weights, bias, dim=0)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 238, in load_multi
    weight = weights.get_multi_weights_col(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 127, in get_multi_weights_col
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 127, in <listcomp>
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 98, in get_sharded
    tensor = tensor.to(device=self.device)

torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated     : 8.73 GiB
Requested               : 157.53 MiB
Device limit            : 9.78 GiB
Free (according to CUDA): 17.31 MiB
PyTorch limit (set by user-supplied memory fraction)
                        : 17179869184.00 GiB


2023-07-09T15:32:02.979915Z  INFO text_generation_launcher: Shutting down shards
# nvidia-smi
Sun Jul  9 23:19:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   39C    P0    87W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:A1:00.0 Off |                  N/A |
| 30%   38C    P0    N/A / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Probably the 3080 doesn't support Falcon-7B. The V100 is also not supported.

#update

As Falcon-7B does not allow sharding, you will need a GPU with at least 16 GB. An alternative would be to apply quantization (GPTQ or bitsandbytes).
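
For example, something like the following may fit on a single 10 GB card (untested sketch; it only adds the --quantize option that already appears as "quantize: None" in your Args log, and 8-bit bitsandbytes roughly halves the fp16 weight footprint):

docker run --gpus all --shm-size 1g -p 8080:80 -v /root/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id vilsonrodrigues/falcon-7b-instruct-sharded --quantize bitsandbytes

Note that --num-shard 2 is left out, since sharding is what triggers the NotImplementedError.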

Thank you for your swift response.

I suspect this might be an issue with text-generation-inference itself. On my system, your model works with Hugging Face Transformers.

For further discussion, I've posted an issue at https://github.com/huggingface/text-generation-inference/issues/574.
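
For reference, a Transformers setup along these lines runs on my machine (minimal sketch, not my exact script; it assumes accelerate is installed so that device_map="auto" can split the bf16 weights across both 10 GB cards):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "vilsonrodrigues/falcon-7b-instruct-sharded"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 is supported on Ampere cards like the RTX 3080
    device_map="auto",           # accelerate places the layers across both GPUs
    trust_remote_code=True,      # needed for the Falcon/RefinedWeb custom modeling code
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Explain what model sharding is.", max_new_tokens=40)[0]["generated_text"])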

tanshuai changed discussion status to closed
