2 abc or not 2 abc

#2
by mradermacher - opened

@nicoboss now looking into the IQ4_XS.

We can always re-run everything if the rpc mode is improved. In fact, maybe the idea is so exciting that rgerganov would look into it if asked.

Homing in on IQ4_XS is going to be very tight, as being just a few GB off is going to be a problem.

llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloaded 24/316 layers to GPU
llm_load_tensors: CPU buffer size = 523472.97 MiB
llm_load_tensors: CUDA0 buffer size = 19645.50 MiB
llm_load_tensors: CUDA1 buffer size = 19645.50 MiB

compute_imatrix: 130.55 seconds per pass - ETA 11 hours 23.22 minutes

| 0% 37C P0 125W / 450W | 22153MiB / 24564MiB | 70% Default |
| 0% 35C P0 69W / 450W | 20531MiB / 24564MiB | 0% Default |
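As a sanity check on that ETA line, the arithmetic works out (a rough sketch; the ~314 remaining passes are inferred from the ETA, not taken from the log):

```python
# Rough sanity check of the compute_imatrix ETA above.
sec_per_pass = 130.55
passes_remaining = 314          # hypothetical count inferred from the ETA
eta_hours = sec_per_pass * passes_remaining / 3600
print(f"{eta_hours:.2f} h")     # roughly 11.4 hours, matching the log
```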

Judging from actual memory usage, we might even get another 30GB or more in there.

And then it just took 5 hours. Ok, it's not done yet, but it will be done in less than 5h30m.

@nicoboss now looking into the IQ4_XS.

Awesome. Thanks a lot!

We can always re-run everything if the rpc mode is improved. In fact, maybe the idea is so exciting that rgerganov would look into it if asked.

I will experiment with RPC some more. Please keep BigLlama-3.1-1T-Instruct.Q6_K.gguf for a few days unless you need the storage for something more important.

Judging from actual memory usage, we might even get another 30GB or more in there.

You only used the two RTX 4090 GPUs, so technically you could get an additional 18 GB of GPU memory by also using the RTX 3080 + 2070s. But IQ4_XS will be good enough for now. It's better than what you used for your older large models, which you never ended up requantizing as far as I'm aware.

And then it just took 5 hours. Ok, it's not done yet, but it will be done in less than 5h30m.

Great. I see it completed successfully and is now working on the BigLlama 1T quant task. They will be great to stress test my new internet gateway, with which I have not experienced any internet issues so far.

unless you need the storage for something more important.

Well, in fact, once bigllama is quanted, I will empty out all /*pool's (they only contain the source gguf).

Also, since the big models have really dried up at the moment,

you could get another additional 18 GB of GPU memory by also using RTX 3080 + 2070s

No, because the CUDA kernel isn't built for the 3080, and probably not for the 2070 either:

ggml_cuda_compute_forward: MUL failed
CUDA error: no kernel image is available for execution on the device

That is probably due to me forcing mmq for quality reasons (a lot of models overflow in f16 but work when mmq is forced), but I haven't verified that yet.

But IQ4_XS will be good enough for now.

Yeah, and eyeballing your graphs, IQ4_XS isn't as bad as we thought, and neither are Q3* (all non-imatrix).

They will be great to stress test my new internet gateway

I am really optimistic that it was the gateway, maybe an overheating problem. It has uploaded quite a bit so far without a hitch, more than with the old gateway at the end.

So, lots of tweaking, watching, and waiting for the fallout of past mistweaks to clear out a bit (the grey blocks of ready imatrix jobs in the middle of the quant queues shouldn't be there), but I really like the algorithm. During the day, it does imatrix at full speed; during the evening, basically nothing; and during the night, it pretty much trickles an imatrix through from time to time, based on demand. Quants are mostly idle on nico during the night, and definitely during the evening. Since we had lots of small models today, nico was able to just keep up with the rest (and generated some imatrices in advance to be used at night), but even if it can't keep up generating imatrices, at night it will slow down, because nico1 is also the biggest imatrix consumer and they will mostly be done on demand only.

As a side effect, we also don't have the issue anymore that the imatrix queue order differs from the quant queue order, causing imatrices to be calculated that nobody is waiting for and vice versa.
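To picture the ordering idea, here is a minimal sketch (all names hypothetical, not the actual scheduler code): pending imatrix jobs are sorted by their model's position in the quant queue, so nothing gets computed that nobody is waiting for.

```python
def order_imatrix_jobs(quant_queue, needs_imatrix):
    """Order pending imatrix jobs by their model's position in the quant
    queue, so we never compute an imatrix nothing is currently waiting
    for. Models not (yet) in the quant queue go last."""
    pos = {model: i for i, model in enumerate(quant_queue)}
    return sorted(needs_imatrix, key=lambda m: pos.get(m, len(quant_queue)))

print(order_imatrix_jobs(["a", "b", "c"], ["c", "a"]))  # ['a', 'c']
```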

I really like it.

Something completely different - I was told (I think by slaren) that the imatrix code is essentially unmaintained, and ikawrakow said he is no longer contributing to llama.cpp (https://github.com/ggerganov/llama.cpp/discussions/5063#discussioncomment-10996711) and instead implements improvements in his own fork.

Any idea what is going on there?

And something else entirely different: since I was repeatedly asked about the "imatrix Q8_0" quants, I went to verify that they don't exist. Naive grepping suggests imatrix data is used:

size_t quantize_q8_0(const float * restrict src, void * restrict dst, int64_t nrow, int64_t n_per_row, const float * quant_weights) {

alas, the next line:

(void)quant_weights; // not used

So, nothing new here, but at least I now have a better basis than "somebody told me".
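For anyone wondering why quant_weights can safely be ignored there: Q8_0 is plain round-to-nearest with a per-block scale derived only from the block's max magnitude, so there is nothing for per-weight importance data to influence. A simplified sketch (an illustration, not the actual llama.cpp code, which works on blocks of 32 values and stores the scale as fp16):

```python
def quantize_block_q8_0(x, quant_weights=None):
    """Round-to-nearest Q8_0-style quantization of one block.
    The scale comes from max |x| alone; quant_weights is unused,
    mirroring the (void)quant_weights in llama.cpp."""
    amax = max(abs(v) for v in x)
    d = amax / 127.0                      # per-block scale
    inv = 1.0 / d if d else 0.0
    return d, [round(v * inv) for v in x]

d, q = quantize_block_q8_0([-16.0, 7.0, 15.0])
print(q)  # [-127, 56, 119]
```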

BTW, if you ever get finished with the quant measurement, the next big project might be to put imatrix data on a deterministic basis and improve the imatrix data we use.

:^)

just fyi, the "huggingface-cli upload stuck in endless read call" happened on another node (leia), so it's definitely some kind of huggingface/hf-cli problem.

btw., the tess model had another interesting upload error:

NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f2f4b970850>: Failed to resolve 'huggingface.co' ([Errno -3] Temporary failure in name resolution)"

wtf., intermittent dns problems? that's a new one :)

Today I collected really interesting measurements regarding hardware usage during imatrix and quantization tasks. Here are the results:

1 GHz = 90 Watt
2 GHz = 110 Watt
3 GHz = 140 Watt
4 GHz = 210 Watt
4.67 GHz = 340 Watt

If I set the limit to 5 GHz, the CPU reaches its 350-watt BIOS power limit during peaks and clocks down to 4.67 GHz due to being power limited.
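Plugging those numbers into a quick efficiency calculation (GHz per watt as a crude throughput-per-watt proxy, assuming throughput scales roughly with clock, which is optimistic at the top end):

```python
# Frequency (GHz) -> measured package power (W), from the list above.
measurements = {1.0: 90, 2.0: 110, 3.0: 140, 4.0: 210, 4.67: 340}

# GHz per watt as a crude throughput-per-watt proxy.
efficiency = {f: f / w for f, w in measurements.items()}
best = max(efficiency, key=efficiency.get)
print(best)  # 3 GHz is the sweet spot in this data
```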

Tasks running during the test:

142+Quyen-Pro-Max-v0.1                            run/imatrix 11/80 11.79s/c 52.3/62.5m(62) [266/318] 9.8944
nico1     750  175  I BigWeave-v14-90b                             run/imatrix 21/24,IQ3_XS [768/921]
nico1     750  134  I openbuddy-deepseek-67b-v15.3-4k              run/imatrix 8/24,Q6_K [29/858]
Total PCIE Bandwidth (GB/s) Total PCIE Rd Bandwidth (GB/s) Total PCIE Wr Bandwidth (GB/s) Total PCIE Bandwidth Local (GB/s) Total PCIE Bandwidth Remote (GB/s) Total PCIE Rd Bandwidth Local (GB/s) Total PCIE Wr Bandwidth Local (GB/s) Total PCIE Rd Bandwidth Remote (GB/s) Total PCIE Wr Bandwidth Remote (GB/s) Quad 0 PCIE Rd Bandwidth Local (GB/s) Quad 0 PCIE Wr Bandwidth Local (GB/s) Quad 0 PCIE Rd Bandwidth Remote (GB/s) Quad 0 PCIE Wr Bandwidth Remote (GB/s) Quad 1 PCIE Rd Bandwidth Local (GB/s) Quad 1 PCIE Wr Bandwidth Local (GB/s) Quad 1 PCIE Rd Bandwidth Remote (GB/s) Quad 1 PCIE Wr Bandwidth Remote (GB/s) Quad 2 PCIE Rd Bandwidth Local (GB/s) Quad 2 PCIE Wr Bandwidth Local (GB/s) Quad 2 PCIE Rd Bandwidth Remote (GB/s) Quad 2 PCIE Wr Bandwidth Remote (GB/s) Quad 3 PCIE Rd Bandwidth Local (GB/s) Quad 3 PCIE Wr Bandwidth Local (GB/s) Quad 3 PCIE Rd Bandwidth Remote (GB/s) Quad 3 PCIE Wr Bandwidth Remote (GB/s)
16.93 15.21 1.72 16.93 0.00 15.21 1.72 0.00 0.00 0.00 0.11 0.00 0.00 14.96 1.42 0.00 0.00 0.24 0.09 0.00 0.00 0.00 0.11 0.00 0.00
16.98 15.22 1.76 16.98 0.00 15.22 1.76 0.00 0.00 0.00 0.10 0.00 0.00 14.97 1.46 0.00 0.00 0.25 0.10 0.00 0.00 0.00 0.10 0.00 0.00
15.45 13.68 1.77 15.45 0.00 13.68 1.77 0.00 0.00 0.00 0.19 0.00 0.00 13.34 1.30 0.00 0.00 0.33 0.10 0.00 0.00 0.00 0.19 0.00 0.00
15.17 13.59 1.58 15.17 0.00 13.59 1.58 0.00 0.00 0.00 0.16 0.00 0.00 12.86 1.16 0.00 0.00 0.71 0.09 0.00 0.00 0.01 0.16 0.00 0.00
14.83 13.04 1.79 14.83 0.00 13.04 1.79 0.00 0.00 0.00 0.21 0.00 0.00 12.68 1.28 0.00 0.00 0.36 0.09 0.00 0.00 0.00 0.21 0.00 0.00
10.09 7.70 2.39 10.09 0.00 7.70 2.39 0.00 0.00 0.38 0.17 0.00 0.00 6.69 1.95 0.00 0.00 0.25 0.09 0.00 0.00 0.38 0.17 0.00 0.00
12.61 10.80 1.81 12.61 0.00 10.80 1.81 0.00 0.00 0.00 0.09 0.00 0.00 10.53 1.56 0.00 0.00 0.27 0.06 0.00 0.00 0.00 0.09 0.00 0.00
14.61 12.94 1.68 14.61 0.00 12.94 1.68 0.00 0.00 0.00 0.16 0.00 0.00 12.64 1.28 0.00 0.00 0.29 0.09 0.00 0.00 0.00 0.16 0.00 0.00
15.26 13.53 1.73 15.26 0.00 13.53 1.73 0.00 0.00 0.01 0.22 0.00 0.00 13.17 1.18 0.00 0.00 0.35 0.10 0.00 0.00 0.01 0.22 0.00 0.00
14.59 12.94 1.65 14.59 0.00 12.94 1.65 0.00 0.00 0.00 0.14 0.00 0.00 12.62 1.28 0.00 0.00 0.31 0.10 0.00 0.00 0.00 0.13 0.00 0.00
Packed 512-bit FP Ops Retired (%) Packed 256-bit FP Ops Retired (%) Packed 128-bit FP Ops Retired (%) Scalar/MMX/x87 FP Ops Retired (%)
0.77 0.43 41.38 57.42
0.15 0.15 41.78 57.92
0.61 0.17 29.09 70.14
0.31 0.36 27.99 71.34
0.04 0.24 56.66 43.06
0.26 0.16 41.53 58.04
1.35 0.21 32.83 65.61
1.61 0.22 32.49 65.68
0.68 0.18 32.74 66.40
0.68 0.20 25.76 73.35
L1 DC Miss (pti) L2 Data Read Miss (pti) L1 IC Miss (pti) L2 Code Read Miss (pti)
1.08 0.23 0.32 0.03
2.30 0.42 1.81 0.17
2.09 0.61 0.27 0.03
1.66 0.22 0.15 0.01
1.10 0.07 0.17 0.00
1.75 0.24 0.73 0.06
1.63 0.25 1.33 0.12
2.07 0.68 0.41 0.05
2.05 0.23 0.86 0.04
1.93 0.41 1.56 0.14
Local Inbound Read Data Bytes(GB/s) Local Outbound Write Data Bytes (GB/s) Remote Inbound Read Data Bytes(GB/s) Remote Outbound Write Data Bytes (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 1 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 2 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 3 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 4 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 5 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 6 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 7 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 1 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 2 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 3 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 4 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 5 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 6 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 7 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 1 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 2 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 3 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 4 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 5 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 6 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 7 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 1 
(GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 1 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 2 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 3 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 4 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 5 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 6 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 7 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 1 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 2 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 3 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 4 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 5 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 6 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 7 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 1 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 2 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 3 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 4 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 5 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 6 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 7 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 1 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 2 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU 
Moderator (CCM) 3 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 4 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 5 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 6 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 7 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 1 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 2 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 3 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 4 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 5 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 6 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 7 at Interface 1 (GB/s)
23.91 15.63 0.00 0.03 2.53 2.36 14.13 4.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.02 0.51 10.90 3.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20.43 16.07 0.00 0.02 1.47 1.57 0.65 16.73 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.28 0.16 15.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12.60 7.37 0.00 0.02 0.78 1.42 1.27 9.13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.14 0.45 0.53 6.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
22.40 16.48 0.00 0.02 1.46 2.49 2.23 16.22 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.53 0.53 0.97 14.45 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18.70 14.81 0.00 0.02 0.69 1.25 0.56 16.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.07 0.17 0.05 14.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
21.43 15.46 0.00 0.03 0.99 1.80 1.83 16.80 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.15 0.41 0.61 14.29 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20.83 16.16 0.00 0.03 0.78 2.01 1.04 17.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 0.41 0.37 15.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
19.60 15.28 0.00 0.02 1.27 1.61 0.61 16.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.39 0.32 0.13 14.45 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18.92 15.19 0.00 0.02 0.83 1.43 0.76 15.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.14 0.39 0.12 14.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20.25 15.74 0.00 0.02 1.28 3.35 1.60 14.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.31 2.31 0.49 12.63 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
All DC Fills (pti) DC Fills From Same CCX (pti) DC Fills From different CCX in same node (pti) DC Fills From Local Memory (pti) DC Fills From Remote CCX Cache (pti) DC Fills From Remote Memory (pti) Remote DRAM Reads %
0.89 0.85 0.00 0.04 0.00 0.00 0.00
1.10 1.05 0.00 0.05 0.00 0.00 0.00
0.87 0.83 0.00 0.04 0.00 0.00 0.00
0.54 0.53 0.00 0.01 0.00 0.00 0.00
0.85 0.82 0.00 0.02 0.00 0.00 0.00
0.58 0.57 0.00 0.01 0.00 0.00 0.00
1.08 1.01 0.00 0.07 0.00 0.00 0.00
0.91 0.88 0.01 0.02 0.00 0.00 0.00
1.31 1.23 0.01 0.07 0.00 0.00 0.00
5.91 5.42 0.00 0.48 0.00 0.00 0.00
Total Upstream DMA Read Write Data Bytes (GB/s) Local Upstream DMA Read Data Bytes (GB/s) Local Upstream DMA Write Data Bytes (GB/s) Remote Upstream DMA Read Data Bytes (GB/s) Remote Upstream DMA Write Data Bytes (GB/s)
15.65 14.15 1.50 0.00 0.00
14.61 13.12 1.49 0.00 0.00
12.10 10.23 1.87 0.00 0.00
9.92 7.82 2.10 0.00 0.00
16.14 14.71 1.43 0.00 0.00
14.91 13.41 1.50 0.00 0.00
14.95 13.45 1.50 0.00 0.00
17.33 15.64 1.70 0.00 0.00
17.06 15.44 1.63 0.00 0.00
15.32 13.91 1.41 0.00 0.00
Retired SSE/AVX Flops(GFLOPs) FP Dispatch Faults (pti)
4.61 0.00
4.21 0.00
4.70 0.00
4.94 0.00
4.46 0.00
5.68 0.00
4.55 0.00
4.40 0.00
5.54 0.00
6.00 0.00
HwPf DC Fills From DRAM or IO connected in remote node (pti) HwPf DC Fills From CCX Cache in remote node (pti) HwPf DC Fills From DRAM or IO connected in local node (pti) HwPf DC Fills From Cache of another CCX in local node (pti) HwPf DC Fills From L3 or different L2 in same CCX (pti) HwPf DC Fills From L2 (pti)
0.00 0.00 0.00 0.00 0.01 0.19
0.00 0.00 0.05 0.00 0.02 0.24
0.00 0.00 0.02 0.00 0.01 0.22
0.00 0.00 0.04 0.00 0.02 0.24
0.00 0.00 0.01 0.00 0.02 0.25
0.00 0.00 0.02 0.00 0.04 0.28
0.00 0.00 0.13 0.00 0.02 0.53
0.00 0.00 0.01 0.00 0.01 0.19
0.00 0.00 0.01 0.00 0.02 0.22
0.00 0.00 0.23 0.01 0.05 1.99
Utilization (%) System time (%) User time (%) System instructions (%) User instructions (%) Eff Freq (MHz) IPC (Sys + User) IPC (Sys) IPC (User) CPI (Sys + User) CPI (Sys) CPI (User) Giga Instructions Per Sec Locked Instructions (pti) Retired Branches (pti) Retired Branches Mispredicted (pti)
99.95 0.50 99.14 0.14 99.86 4661.73 1.56 0.43 1.57 0.64 2.33 0.64 7.25 0.00 45.81 2.01
99.95 1.96 97.68 0.80 99.20 4668.11 1.53 0.62 1.55 0.65 1.61 0.65 7.10 0.00 45.21 2.37
99.95 0.49 99.16 0.13 99.87 4665.48 1.57 0.42 1.58 0.64 2.38 0.63 7.29 0.00 38.26 1.61
99.95 0.48 99.17 0.13 99.87 4674.00 1.51 0.42 1.51 0.66 2.39 0.66 7.02 0.00 37.79 2.07
99.95 0.99 98.65 0.31 99.69 4661.84 1.58 0.49 1.59 0.63 2.03 0.63 7.34 0.00 39.13 1.53
99.95 7.26 92.35 2.38 97.62 4660.15 1.40 0.46 1.47 0.71 2.18 0.68 6.50 0.00 41.32 3.13
99.95 2.82 96.78 0.84 99.16 4681.43 1.47 0.43 1.50 0.68 2.30 0.67 6.87 0.00 37.85 2.39
99.95 0.52 99.12 0.13 99.87 4663.53 1.57 0.41 1.57 0.64 2.46 0.64 7.27 0.00 54.70 2.61
99.95 0.47 99.17 0.13 99.87 4674.25 1.53 0.42 1.53 0.65 2.39 0.65 7.12 0.00 39.25 1.31
93.41 1.39 98.23 0.38 99.62 4690.36 1.53 0.41 1.55 0.65 2.41 0.65 6.69 0.00 41.38 1.02
IC Fetch Miss Ratio Op Cache Fetch Miss Ratio IC Access (pti) IC Miss (pti) DC Access (pti)
0.06 0.01 2.49 0.15 227.32
0.07 0.07 19.45 1.44 274.00
0.04 0.04 12.51 0.46 259.45
0.05 0.02 3.54 0.17 254.83
0.09 0.01 2.08 0.18 257.49
0.05 0.02 3.97 0.21 279.37
0.04 0.02 3.86 0.17 240.55
0.04 0.02 4.23 0.16 245.10
0.03 0.03 5.35 0.19 251.73
0.08 0.02 4.45 0.37 257.30
L2 Access (pti) L2 Access from IC Miss (pti) L2 Access from DC Miss (pti) L2 Access from L2 HWPF (pti) L2 Miss (pti) L2 Miss from IC Miss (pti) L2 Miss from DC Miss (pti) L2 Miss from L2 HWPF (pti) L2 Hit (pti) L2 Hit from IC Miss (pti) L2 Hit from DC Miss (pti) L2 Hit from L2 HWPF (pti)
1.32 0.10 0.60 0.40 0.11 0.01 0.04 0.06 0.96 0.08 0.54 0.34
0.52 0.03 0.40 0.09 0.04 0.00 0.02 0.02 0.47 0.03 0.37 0.07
2.15 0.08 1.44 0.53 0.20 0.02 0.08 0.11 1.84 0.07 1.34 0.43
1.99 0.28 1.36 0.20 0.12 0.01 0.05 0.06 1.75 0.25 1.36 0.14
1.62 0.08 1.41 0.15 0.09 0.01 0.04 0.04 1.50 0.07 1.32 0.11
2.04 0.17 1.66 0.21 0.12 0.01 0.05 0.06 1.91 0.16 1.60 0.15
2.83 0.31 1.47 0.83 0.21 0.03 0.06 0.12 2.19 0.27 1.21 0.71
1.64 0.06 1.20 0.34 0.08 0.01 0.03 0.04 1.55 0.06 1.18 0.30
0.86 0.05 0.66 0.13 0.07 0.00 0.03 0.04 0.75 0.04 0.62 0.09
2.12 0.13 1.93 0.79 0.20 0.04 0.07 0.09 2.36 0.08 1.59 0.70
L3 Access L3 Miss L3 Miss % Ave L3 Miss Latency (ns)
79581629.00 26869900.00 33.76 108.63
74689852.00 29239352.00 39.15 113.88
67081825.00 22431193.00 33.44 106.52
52306516.00 16520234.00 31.58 111.40
45881135.00 9610550.00 20.95 104.17
63687583.00 26049615.00 40.90 124.44
57509142.00 14470472.00 25.16 103.89
71741584.00 17767547.00 24.77 102.42
61719580.00 19476650.00 31.56 100.81
62135911.00 27658654.00 44.51 118.97
Total Mem Bw (GB/s) Local DRAM Read Data Bytes(GB/s) Local DRAM Write Data Bytes(GB/s) Remote DRAM Read Data Bytes (GB/s) Remote DRAM Write Data Bytes (GB/s) Total Mem RdBw (GB/s) Total Mem WrBw (GB/s)
54.44 36.59 17.85 0.00 0.00 36.59 17.85
54.07 36.88 17.19 0.00 0.00 36.88 17.19
52.21 35.50 16.71 0.00 0.00 35.50 16.71
52.94 35.66 17.28 0.00 0.00 35.66 17.28
53.34 36.29 17.05 0.00 0.00 36.29 17.05
49.20 32.95 16.25 0.00 0.00 32.95 16.25
37.36 25.07 12.29 0.00 0.00 25.07 12.29
56.59 38.58 18.00 0.00 0.00 38.58 18.00
62.99 43.26 19.72 0.00 0.00 43.26 19.72
53.37 35.92 17.44 0.00 0.00 35.92 17.44
Total PCIE Bandwidth (GB/s) Total PCIE Rd Bandwidth (GB/s) Total PCIE Wr Bandwidth (GB/s) Total PCIE Bandwidth Local (GB/s) Total PCIE Bandwidth Remote (GB/s) Total PCIE Rd Bandwidth Local (GB/s) Total PCIE Wr Bandwidth Local (GB/s) Total PCIE Rd Bandwidth Remote (GB/s) Total PCIE Wr Bandwidth Remote (GB/s) Quad 0 PCIE Rd Bandwidth Local (GB/s) Quad 0 PCIE Wr Bandwidth Local (GB/s) Quad 0 PCIE Rd Bandwidth Remote (GB/s) Quad 0 PCIE Wr Bandwidth Remote (GB/s) Quad 1 PCIE Rd Bandwidth Local (GB/s) Quad 1 PCIE Wr Bandwidth Local (GB/s) Quad 1 PCIE Rd Bandwidth Remote (GB/s) Quad 1 PCIE Wr Bandwidth Remote (GB/s) Quad 2 PCIE Rd Bandwidth Local (GB/s) Quad 2 PCIE Wr Bandwidth Local (GB/s) Quad 2 PCIE Rd Bandwidth Remote (GB/s) Quad 2 PCIE Wr Bandwidth Remote (GB/s) Quad 3 PCIE Rd Bandwidth Local (GB/s) Quad 3 PCIE Wr Bandwidth Local (GB/s) Quad 3 PCIE Rd Bandwidth Remote (GB/s) Quad 3 PCIE Wr Bandwidth Remote (GB/s)
15.06 13.70 1.36 15.06 0.00 13.70 1.36 0.00 0.00 0.00 0.03 0.00 0.00 13.06 1.30 0.00 0.00 0.64 0.00 0.00 0.00 0.00 0.03 0.00 0.00
14.75 13.38 1.37 14.75 0.00 13.38 1.37 0.00 0.00 0.00 0.04 0.00 0.00 13.27 1.30 0.00 0.00 0.11 0.00 0.00 0.00 0.00 0.04 0.00 0.00
14.89 13.60 1.29 14.89 0.00 13.60 1.29 0.00 0.00 0.00 0.02 0.00 0.00 13.51 1.25 0.00 0.00 0.09 0.00 0.00 0.00 0.00 0.02 0.00 0.00
14.92 13.54 1.38 14.92 0.00 13.54 1.38 0.00 0.00 0.00 0.07 0.00 0.00 13.41 1.26 0.00 0.00 0.12 0.00 0.00 0.00 0.00 0.06 0.00 0.00
14.46 13.14 1.32 14.46 0.00 13.14 1.32 0.00 0.00 0.00 0.01 0.00 0.00 13.08 1.30 0.00 0.00 0.06 0.00 0.00 0.00 0.00 0.01 0.00 0.00
14.51 13.20 1.30 14.51 0.00 13.20 1.30 0.00 0.00 0.00 0.01 0.00 0.00 13.15 1.28 0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.01 0.00 0.00
15.01 13.69 1.33 15.01 0.00 13.69 1.33 0.00 0.00 0.00 0.03 0.00 0.00 13.57 1.26 0.00 0.00 0.11 0.00 0.00 0.00 0.00 0.03 0.00 0.00
14.80 13.48 1.32 14.80 0.00 13.48 1.32 0.00 0.00 0.00 0.03 0.00 0.00 13.36 1.26 0.00 0.00 0.12 0.00 0.00 0.00 0.00 0.03 0.00 0.00
6.96 4.49 2.47 6.96 0.00 4.49 2.47 0.00 0.00 0.00 0.05 0.00 0.00 4.35 2.37 0.00 0.00 0.13 0.00 0.00 0.00 0.00 0.05 0.00 0.00
14.53 13.22 1.31 14.53 0.00 13.22 1.31 0.00 0.00 0.00 0.03 0.00 0.00 13.09 1.24 0.00 0.00 0.13 0.00 0.00 0.00 0.00 0.03 0.00 0.00
Total_Dispatch_Slots SMT_Disp_contention Frontend_Bound Bad_Speculation Backend_Bound Retiring Frontend_Bound.Latency Frontend_Bound.BW Bad_Speculation.Mispredicts Bad_Speculation.Pipeline_Restarts Backend_Bound.Memory Backend_Bound.CPU Retiring.Fastpath Retiring.Microcode
83375943738.00 40.51 4.31 4.85 17.42 31.61 3.54 0.77 4.84 0.02 2.18 15.23 31.60 0.01
83724495948.00 42.84 2.49 2.83 18.25 32.80 2.07 0.43 2.81 0.02 1.35 16.90 32.79 0.01
83593257192.00 42.13 3.47 4.26 18.34 30.59 2.81 0.66 4.22 0.03 1.64 16.70 30.58 0.01
77319727164.00 42.22 2.61 2.87 19.14 32.18 2.11 0.50 2.85 0.03 1.43 17.71 32.17 0.01
83661001026.00 42.16 3.38 4.33 18.15 30.79 2.74 0.64 4.30 0.03 1.58 16.56 30.78 0.01
83528696586.00 42.56 2.18 2.52 19.02 32.82 1.77 0.42 2.49 0.02 1.17 17.86 32.81 0.01
83669848248.00 38.92 6.94 3.70 19.44 29.80 5.36 1.59 3.68 0.02 3.51 15.94 29.57 0.23
83371451310.00 42.09 3.34 3.71 17.11 32.51 2.77 0.56 3.70 0.01 2.03 15.08 32.50 0.01
83517669888.00 42.20 2.88 3.61 17.84 32.38 2.34 0.54 3.59 0.03 1.43 16.42 32.37 0.01
83398053606.00 42.64 2.30 2.56 17.76 33.95 1.90 0.39 2.55 0.02 1.39 16.37 33.94 0.01
L1 ITLB Miss (pti) L2 ITLB Miss (pti) L1 DTLB Miss (pti) L2 DTLB Miss (pti) All TLBs Flushed (pti)
0.26 0.06 0.57 0.05 0.00
0.00 0.00 0.17 0.01 0.00
0.11 0.04 0.40 0.05 0.00
0.00 0.00 0.21 0.01 0.00
0.00 0.00 0.17 0.01 0.00
0.00 0.00 0.09 0.00 0.00
0.04 0.01 0.19 0.02 0.00
0.00 0.00 0.23 0.01 0.00
0.00 0.00 0.26 0.01 0.00
0.01 0.00 0.23 0.01 0.00
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:C1:00.0 Off |                  Off |
|  0%   33C    P0            110W /  450W |   20837MiB /  24564MiB |     68%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   4018019      C   ...s/llama.cpp/build/bin/llama-imatrix        474MiB |
+-----------------------------------------------------------------------------------------+

With all layers offloaded to GPU:

142+KoSOLAR-v0.2-gugutypus-10.7B                  run/imatrix 50/48 1.52s/c 2.5/9.0m(9) [93/355] 12.9785
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:C1:00.0 Off |                  Off |
|  0%   31C    P0             94W /  450W |   20967MiB /  24564MiB |     22%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2515801      C   ...s/llama.cpp/build/bin/llama-imatrix        476MiB |
+-----------------------------------------------------------------------------------------+

Nice that our communication breakdown has ended - I suspect you overlooked my question about ikawrakow's work?

I am not sure what your conclusion w.r.t. power usage is, but I have been clocking the efficiency cores on my home server at 2.3 instead of 4.5 GHz or so for a long time - half the speed, but only a third of the power usage. Power usage grows roughly quadratically with frequency, and both AMD and Intel chips clock way outside the most efficient range. Even a moderate decrease from 4.5 to 4 or 3.5 GHz decreases power usage a lot more than I lose in throughput.

For nvidia, clocking down the compute units but not the memory might not affect computation speed much, but could reduce power usage a lot. Maybe there is a point in reducing the frequency of both at certain times?
