System requirement?

#27
by nguyengoc - opened

What are the system requirements to run this model, and how can I find them?

From reading the config, this is a float16 model. Using the Model Memory Estimator (https://huggingface.co/spaces/hf-accelerate/model-memory-usage), I get the following estimates for WizardCoder 30B (as an LLM):

| dtype | Largest Layer or Residual Group | Total Size | Training using Adam |
|---|---|---|---|
| float32 | 2.59 GB | 125.48 GB | 501.92 GB |
| int8 | 664.02 MB | 31.37 GB | 125.48 GB |
| float16/bfloat16 | 1.3 GB | 62.74 GB | 250.96 GB |
| int4 | 332.01 MB | 15.68 GB | 62.74 GB |
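As a sanity check, the float16 "Total Size" figure matches simple parameter arithmetic: weight memory is roughly parameter count times bytes per element. A minimal sketch (the parameter count below is back-solved from the 62.74 GB float16 figure, not taken from the model card):

```python
# Rough weight-memory estimate: parameters x bytes per element.
# Real inference usage adds KV cache and activation overhead on top.
BYTES_PER_DTYPE = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Size of the weights alone, in GB (1 GB = 2**30 bytes)."""
    return n_params * BYTES_PER_DTYPE[dtype] / 2**30

# Back out the parameter count from the float16 row of the table above.
n_params = 62.74 * 2**30 / 2  # ~33.7B parameters

for dtype in BYTES_PER_DTYPE:
    print(f"{dtype}: {weight_memory_gb(n_params, dtype):.2f} GB")
```

This reproduces the estimator's per-dtype totals (up to rounding) and makes it easy to see why int8 halves the footprint again relative to float16.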

So, if you pull this down, you'll need about 63 GB of RAM to run it in float16. I would love to quantize it to int8 so it could fit on a 4090 or an A6000, but I don't know how right now.
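One common route for int8 is bitsandbytes 8-bit loading through `transformers`. A sketch under assumptions: the checkpoint is a standard Hugging Face causal-LM repo, bitsandbytes is installed, and the model id below is a placeholder for the repo this thread is attached to:

```python
# Sketch: 8-bit loading with bitsandbytes via transformers.
# MODEL_ID is a placeholder -- substitute the actual repo id.
MODEL_ID = "WizardLM/WizardCoder-15B-V1.0"  # placeholder, not from this thread

LOAD_KWARGS = dict(
    load_in_8bit=True,   # bitsandbytes int8 quantization at load time
    device_map="auto",   # shard layers across available GPUs and CPU
)

def load_8bit():
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **LOAD_KWARGS)
    return tokenizer, model
```

Note that at ~31 GB, int8 weights would still not fit a single 24 GB 4090, though they are within reach of a 48 GB A6000; `device_map="auto"` lets accelerate spill the remainder to CPU RAM on smaller cards.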

I am able to run it on an M1 Max with 64 GB. Not super fast, but it works:

```
llama_print_timings:      sample time =  1804.24 ms /   729 runs   (    2.47 ms per token,   404.05 tokens per second)
llama_print_timings: prompt eval time =  3652.04 ms /   144 tokens (   25.36 ms per token,    39.43 tokens per second)
llama_print_timings:        eval time = 94289.78 ms /   728 runs   (  129.52 ms per token,     7.72 tokens per second)
llama_print_timings:       total time = 100932.23 ms
Output generated in 101.16 seconds (7.20 tokens/s, 728 tokens, context 144, seed 1690939106)
Llama.generate: prefix-match hit
```

```
llama_print_timings:        load time =  3652.09 ms
llama_print_timings:      sample time =  2548.89 ms /  1024 runs   (    2.49 ms per token,   401.74 tokens per second)
llama_print_timings: prompt eval time = 13158.02 ms /   751 tokens (   17.52 ms per token,    57.08 tokens per second)
llama_print_timings:        eval time = 141916.85 ms /  1023 runs   (  138.73 ms per token,     7.21 tokens per second)
llama_print_timings:       total time = 159473.00 ms
Output generated in 159.71 seconds (6.41 tokens/s, 1024 tokens, context 886, seed 1686911609)
Llama.generate: prefix-match hit
```

```
llama_print_timings:        load time =  3652.09 ms
llama_print_timings:      sample time =   694.30 ms /   276 runs   (    2.52 ms per token,   397.52 tokens per second)
llama_print_timings: prompt eval time = 19746.02 ms /  1023 tokens (   19.30 ms per token,    51.81 tokens per second)
llama_print_timings:        eval time = 43975.35 ms /   275 runs   (  159.91 ms per token,     6.25 tokens per second)
llama_print_timings:       total time = 64842.96 ms
Output generated in 65.07 seconds (4.23 tokens/s, 275 tokens, context 1909, seed 828516400)
```
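For anyone reading the llama.cpp timings above: the headline tokens/s on each "Output generated" line is just generated tokens divided by total wall time, which is why throughput falls from 7.20 to 4.23 tok/s as the context grows. Reproducing the arithmetic from the three runs:

```python
# (tokens generated, total seconds) taken from the three runs above.
runs = [
    (728, 101.16),
    (1024, 159.71),
    (275, 65.07),
]
for tokens, seconds in runs:
    print(f"{tokens} tokens in {seconds:.2f}s -> {tokens / seconds:.2f} tok/s")
```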
