
What is the best way to run Bloom-176B locally at interactive speeds?

#156
by aeva - opened

Hello. I'm interested in running BLOOM on a local machine to experiment with interactive uses. I gather from reading various posts here that the model really must be resident in RAM (or VRAM) to avoid being bottlenecked on repeated disk I/O, and that bottleneck rules out interactive use on most consumer hardware because most people don't have ~400 GB of RAM (much less VRAM).

Suppose I were to build a machine with enough RAM for the model to stay resident for the lifetime of the program: what kind of CPU would be necessary to hit performance in the ballpark of a token per second or better? Or is this use case only suitable for expensive cloud deployment, and would I be better off exploring a lighter but less capable model (and if so, which one)?

BigScience Workshop org

Hi @aeva
You might be interested in https://huggingface.co/bigscience/bloom/discussions/152
I believe you'll be able to run it locally if it works on a Google Colab instance. cc @borzunov

@ybelkada I doubt Petals is appropriate for my use case since it's distributed whereas I'm specifically aiming to perform text generation tasks quickly on a single machine without requiring a network connection.

I did some rough math, and the thing I was hoping to do is probably not going to be viable with normal hardware, much less at a reasonable cost. In the event that someone finds my analysis interesting or useful (or that I made an egregious error someone kindly points out), this is what I worked out:

The only motherboard I found on Newegg that supports enough RAM for the whole model to be resident only supports DDR4-2133, which, if my numbers are right, gives a theoretical bandwidth of 15.89 GB/s. The build cost for just the motherboard and the RAM is $3,464 so far.

If my reading of https://towardsdatascience.com/run-bloom-the-largest-open-access-ai-model-on-your-desktop-computer-f48e1e2a9a32 last night was correct, that gives a pessimal throughput of 21 seconds per token, assuming no other bottlenecks and middling CPU cache performance.
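
For anyone who wants to sanity-check that arithmetic, here's the same estimate as a short Python snippet; the weight size and bandwidth figures are just the assumptions above, not measurements:

```python
# Back-of-the-envelope version of the estimate above (assumptions, not measurements):
# - BLOOM-176B weights in bf16: roughly 176e9 params * 2 bytes (~352 GB)
# - DDR4-2133: 2133 MT/s * 8 bytes per transfer (the ~15.9 GiB/s figure above)
# - generating one token streams every weight through the CPU once

weight_bytes = 176e9 * 2
mem_bandwidth = 2133e6 * 8

print(f"~{weight_bytes / mem_bandwidth:.0f} s/token")  # ~21 s/token
```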

I don't know enough about how BLOOM works to estimate what CPU cache performance might look like, and I haven't found perf reports on that, so I'm not going to assume that putting in a nicer processor would make those numbers meaningfully better.

It looks extremely improbable that BLOOM can run at interactive speeds on today's high-end consumer hardware in any configuration (based on the above for a CPU-centric build, and on other threads for a GPU-centric build), so individuals with interests like mine are probably better served by paying for cloud time (or by finding a more accessible model with similar capabilities, if any exist).

I'm quite new to all this but, from my understanding of what you've posted, the primary driver of performance (seconds per token) is memory bandwidth, correct? If so, might you be able to get better performance by, instead of focusing on RAM, looking into running multiple NVMe SSDs on a low-cost, consumer-grade, PCIe 5.0 motherboard? For example, the system built in this video has a claimed real-world bandwidth of 28 GB per second: https://www.youtube.com/watch?v=3voNJPuLydw A quick search of Newegg shows Intel-based motherboards with PCIe 5.0 support and 4+ M.2 slots starting at around $175. I don't know whether a board at that low end of the market would actually provide the full PCIe 5.0 bandwidth, but even significantly higher-end consumer motherboards should still be far less expensive than the one you mention having found.

Edit: On deeper inspection, it looks like the video in question is using some commercial SSD hardware that is, undoubtedly, expensive and likely doesn't produce anywhere near that bandwidth for anything but pure sequential reads.
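
To put rough numbers on the idea anyway, here's a sketch of what aggregating several drives would buy, assuming (probably too optimistically) perfect RAID-0 scaling and purely sequential reads; the drive speed and weight size are assumptions taken from this thread, not benchmarks:

```python
# Hypothetical best case for N consumer NVMe drives in RAID 0 streaming the
# full set of weights once per generated token.

weight_bytes = 176e9 * 2      # BLOOM-176B in bf16, ~352 GB
per_drive = 7.5e9             # ~7.5 GB/s for a current top-end consumer NVMe drive

for n_drives in (1, 2, 4, 6):
    bandwidth = n_drives * per_drive
    print(f"{n_drives} drive(s): {bandwidth / 1e9:5.1f} GB/s -> "
          f"{weight_bytes / bandwidth:5.1f} s/token best case")
```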

@sblaszak Even though the fastest available NVMe drives top out at about 7,500 MB/s right now, you raise an interesting point about PCIe bandwidth and the availability of motherboards supporting multiple NVMe slots.

According to Wikipedia, PCIe 5.0's bandwidth is 63 GB/s (and later revisions promise very healthy gains there too). I haven't thought this through, but if it's possible to meaningfully parallelize the I/O across multiple high-end NVMe drives, then maybe this could be made to work with a much smaller amount of high-end RAM. From my notes from yesterday, 44.7 GB/s is probably the best possible RAM bandwidth on normal hardware right now, which I think is what we'd have to beat for disk I/O to stop being the main bottleneck. I don't know offhand whether that 44.7 GB/s is per slot or the bus total, but assuming this can be set up so that streaming the new data into RAM doesn't saturate the bus, a double-buffering scheme for BLOOM blocks could then hit about 7 seconds per token. This is hand-waving a lot, obviously, and I don't know what kind of CPU would be needed.
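
To make the hand-waving slightly more concrete, here's a minimal sketch of that double-buffering idea, assuming the weights were split into one file per transformer block; load_block, run_block, and the file layout are hypothetical placeholders, not real BLOOM code:

```python
from concurrent.futures import ThreadPoolExecutor

N_BLOCKS = 70  # BLOOM-176B has 70 transformer blocks

def load_block(i):
    """Read block i's weights from disk (hypothetical one-file-per-block layout)."""
    with open(f"block_{i:02d}.bin", "rb") as f:
        return f.read()

def run_block(weights, hidden_state):
    """Placeholder for applying one transformer block; real code would do the matmuls."""
    return hidden_state

def generate_one_token(hidden_state):
    # One background worker streams block i+1 from disk while the CPU runs block i.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_block, 0)              # prime the pipeline
        for i in range(N_BLOCKS):
            weights = pending.result()                    # block i (wait if still loading)
            if i + 1 < N_BLOCKS:
                pending = pool.submit(load_block, i + 1)  # prefetch block i+1
            hidden_state = run_block(weights, hidden_state)
    return hidden_state
```

Whether the overlap actually helps would depend on the compute per block keeping pace with each multi-gigabyte read, which is exactly the bandwidth question above.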

Would RAID-0ing six NVMe drives be able to hit the same bandwidth as the RAM?

As I mentioned in my edit, one of the first things I thought of after my initial post is that most of the performance gain from RAID 0 will be for sequential reads. Is that a common use case for BLOOM? If it uses a lot of random reads, I would think the actual bandwidth of such a setup would drop off significantly...
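
For what it's worth, the reads might end up mostly large and sequential: assuming each transformer block's weights sit in one contiguous chunk on disk (a layout assumption on my part, not something verified against the published checkpoint shards), the sizes would be roughly:

```python
weight_bytes = 176e9 * 2   # ~352 GB in bf16
n_blocks = 70              # BLOOM-176B transformer blocks

print(f"~{weight_bytes / n_blocks / 1e9:.1f} GB per block read")  # ~5 GB
```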
