How to use bloom-176B to generate or evaluate on multiple GPUs?

#62
by xuyifan - opened

I have a machine with 8×A100 GPUs, and I would like to use bloom-176B to generate text and run evaluation on different datasets. I use the code from the bigscience-workshop/bigscience repo, bigscience/evaluation/generation/generate.py, but that code only runs the model on a single GPU.
Apparently a single A100 is not enough (the bf16 weights alone are about 176B parameters × 2 bytes ≈ 352 GB, far more than one 80 GB card):
RuntimeError: CUDA out of memory. Tried to allocate 1.53 GiB (GPU 0; 79.17 GiB total capacity; 77.14 GiB already allocated; 1.20 GiB free; 77.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How can I use bloom-176B to generate or evaluate on multiple GPUs? Should I change the code in generate.py, or use different code altogether?

I have tried using accelerate:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import Accelerator

def generate_from_text(model, text, tokenizer, max_length=200, greedy=False, top_k=0):
    input_ids = tokenizer.encode(text, return_tensors='pt').to("cuda:0")
    max_length = input_ids.size(-1) + max_length

    greedy_output = model.generate(
        input_ids,
        max_length=max_length,
        do_sample=not greedy,
        top_k=None if greedy else top_k,
    )
    return tokenizer.decode(greedy_output[0], skip_special_tokens=True)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
print("load tokenizer")
accelerator = Accelerator()
device = accelerator.device
model = AutoModelForCausalLM.from_pretrained(
    arg.checkpoints,  # path to the local BLOOM checkpoint
    device_map="auto",
    torch_dtype=torch.bfloat16,
    offload_folder="./offload",
)
model = accelerator.prepare(model)  # this is the line that fails below
print("load model")

text = 'hello'
output = generate_from_text(model, text, tokenizer, max_length=500, greedy=True, top_k=1)
print(output)

and it shows:

load tokenizer
Traceback (most recent call last):
  File "/mnt/yrfs/xuyifan/bloom/bigscience-sorteval/eva.py", line 28, in <module>
    model = accelerator.prepare(model)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 545, in prepare
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 545, in <genexpr>
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 441, in _prepare_one
    return self.prepare_model(obj)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 565, in prepare_model
    model = model.to(self.device)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 1.53 GiB (GPU 0; 79.17 GiB total capacity; 77.14 GiB already allocated; 1.20 GiB free; 77.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

BigScience Workshop org

Hi @xuyifan! Sorry for the late reply. Why is it that you need Accelerator? I don't think you need it: with device_map="auto", from_pretrained already shards the model across your GPUs, and accelerator.prepare(model) then calls model.to(self.device), which tries to pull the whole model onto a single GPU; that is exactly the OOM in your traceback.
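
For reference, a minimal sketch without Accelerator (assuming transformers and accelerate are installed and the 8 GPUs have enough combined memory; moving the inputs to "cuda:0" assumes the embedding layer lands on the first GPU, which is the usual device_map="auto" layout):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

# device_map="auto" splits the checkpoint across all visible GPUs;
# no accelerator.prepare() call is needed on top of it
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",  # or your local checkpoint path
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

inputs = tokenizer("hello", return_tensors="pt").to("cuda:0")
output = model.generate(inputs.input_ids, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Hidden states are moved between GPUs automatically during generate(), so the rest of your generation/evaluation code should work unchanged.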
