2 abc or not 2 abc

#2
by mradermacher - opened

@nicoboss now looking into the IQ4_XS.

We can always re-run everything if the rpc mode is improved. In fact, maybe the idea is so exciting that rgerganov would look into it if asked.

Homing in on IQ4_XS is going to be very tight, as being just a few GB off is going to be a problem.

llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloaded 24/316 layers to GPU
llm_load_tensors: CPU buffer size = 523472.97 MiB
llm_load_tensors: CUDA0 buffer size = 19645.50 MiB
llm_load_tensors: CUDA1 buffer size = 19645.50 MiB

compute_imatrix: 130.55 seconds per pass - ETA 11 hours 23.22 minutes

| 0% 37C P0 125W / 450W | 22153MiB / 24564MiB | 70% Default |
| 0% 35C P0 69W / 450W | 20531MiB / 24564MiB | 0% Default |
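As a sanity check on that ETA line, the arithmetic works out (a rough sketch; the ~314 remaining passes are inferred from the ETA, not taken from the log):

```python
# Rough sanity check of the compute_imatrix ETA above.
sec_per_pass = 130.55
passes_remaining = 314          # hypothetical count inferred from the ETA
eta_hours = sec_per_pass * passes_remaining / 3600
print(f"{eta_hours:.2f} h")     # roughly 11.4 hours, matching the log
```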

Judging from actual memory usage, we might even get another 30GB or more in there.

And then it just took 5 hours. Ok, it's not done yet, but it will be done in less than 5h30m.

@nicoboss now looking into the IQ4_XS.

Awesome. Thanks a lot!

We can always re-run everything if the rpc mode is improved. In fact, maybe the idea is so exciting that rgerganov would look into it if asked.

I will experiment with RPC some more. Please keep BigLlama-3.1-1T-Instruct.Q6_K.gguf for a few days unless you need the storage for something more important.

Judging from actual memory usage, we might even get another 30GB or more in there.

You only used the two RTX 4090 GPUs, so technically you could get an additional 18 GB of GPU memory by also using the RTX 3080 + 2070s. But IQ4_XS will be good enough for now. It's better than what you used for your older large models, which you never ended up requantizing as far as I'm aware.

And then it just took 5 hours. Ok, it's not done yet, but it will be done in less than 5h30m.

Great. I see it completed successfully and is now working on the BigLlama 1T quant task. They will be great to stress test my new internet gateway, with which I have not experienced any internet issues so far.

unless you need the storage for something more important.

Well, in fact, once bigllama is quanted, I will empty out all /*pool's (they only contain the source gguf).

Also, since the big models have really dried up at the moment,

you could get another additional 18 GB of GPU memory by also using RTX 3080 + 2070s

No, because the CUDA kernel isn't built for the 3080, and probably not for the 2070 either:

ggml_cuda_compute_forward: MUL failed
CUDA error: no kernel image is available for execution on the device

That is probably due to me forcing mmq for quality reasons (a lot of models overflow in f16 but work when mmq is forced), but I haven't verified that yet.

But IQ4_XS will be good enough for now.

Yeah, and eyeballing your graphs, IQ4_XS isn't as bad as we thought, and neither are Q3* (all non-imatrix).

They will be great to stress test my new internet gateway

I am really optimistic that it was the gateway, maybe an overheating problem. It has uploaded quite a bit so far without a hitch, more than with the old gateway at the end.

So, lots of tweaking, watching, and waiting for the fallout of past mistweaks to clear out a bit (the grey blocks of ready imatrix jobs in the middle of the quant queues shouldn't be there), but I really like the algorithm. During the day, it does imatrix at full speed; during the evening, basically nothing; and during the night, it pretty much trickles an imatrix through from time to time, based on demand. Quants are mostly idle on nico during the night, and definitely during the evening. Since we had lots of small models today, nico was able to just keep up with the rest (and generated some imatrices in advance to be used at night), but even if it can't keep up generating imatrices, at night it will slow down, because nico1 is also the biggest imatrix consumer and they will mostly be done on demand only.

As a side effect, we also don't have the issue anymore that the imatrix queue order differs from the quant queue order, causing imatrices to be calculated that nobody is waiting for and vice versa.
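To picture the ordering idea, here is a minimal sketch (all names hypothetical, not the actual scheduler code): pending imatrix jobs are sorted by their model's position in the quant queue, so nothing gets computed that nobody is waiting for.

```python
def order_imatrix_jobs(quant_queue, needs_imatrix):
    """Order pending imatrix jobs by their model's position in the quant
    queue, so we never compute an imatrix nothing is currently waiting
    for. Models not (yet) in the quant queue go last."""
    pos = {model: i for i, model in enumerate(quant_queue)}
    return sorted(needs_imatrix, key=lambda m: pos.get(m, len(quant_queue)))

print(order_imatrix_jobs(["a", "b", "c"], ["c", "a"]))  # ['a', 'c']
```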

I really like it.

Something completely different - I was told (I think by slaren) that the imatrix code is essentially unmaintained, and ikawrakow said he is no longer contributing to llama.cpp (https://github.com/ggerganov/llama.cpp/discussions/5063#discussioncomment-10996711) and instead implements improvements in his own fork.

Any idea what is going on there?

And something else entirely different: since I was repeatedly asked about the "imatrix Q8_0" quants, I went to verify that they don't exist. Naive grepping suggests imatrix data is used:

size_t quantize_q8_0(const float * restrict src, void * restrict dst, int64_t nrow, int64_t n_per_row, const float * quant_weights) {

alas, the next line:

(void)quant_weights; // not used

So, nothing new here, but at least I now have a better basis than "somebody told me".
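For anyone wondering why quant_weights can safely be ignored there: Q8_0 is plain round-to-nearest with a per-block scale derived only from the block's max magnitude, so there is nothing for per-weight importance data to influence. A simplified sketch (an illustration, not the actual llama.cpp code, which works on blocks of 32 values and stores the scale as fp16):

```python
def quantize_block_q8_0(x, quant_weights=None):
    """Round-to-nearest Q8_0-style quantization of one block.
    The scale comes from max |x| alone; quant_weights is unused,
    mirroring the (void)quant_weights in llama.cpp."""
    amax = max(abs(v) for v in x)
    d = amax / 127.0                      # per-block scale
    inv = 1.0 / d if d else 0.0
    return d, [round(v * inv) for v in x]

d, q = quantize_block_q8_0([-16.0, 7.0, 15.0])
print(q)  # [-127, 56, 119]
```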

BTW, if you ever get finished with the quant measurement, the next big project might be to put imatrix data on a deterministic basis and improve the imatrix data we use.

:^)

just fyi, the "huggingface-cli upload stuck in endless read call" happened on another node (leia), so it's definitely some kind of huggingface/hf-cli problem.

btw., the tess model had another interesting upload error:

NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f2f4b970850>: Failed to resolve 'huggingface.co' ([Errno -3] Temporary failure in name resolution)"

wtf., intermittent dns problems? that's a new one :)

Today I collected really interesting measurements regarding hardware usage during imatrix and quantization tasks. Here are the results:

1 GHz = 90 Watt
2 GHz = 110 Watt
3 GHz = 140 Watt
4 GHz = 210 Watt
4.67 GHz = 340 Watt

If I set the limit to 5 GHz, the CPU reaches its 350-watt BIOS power limit during peaks and clocks down to 4.67 GHz due to being power limited.
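Plugging those numbers into a quick efficiency calculation (GHz per watt as a crude throughput-per-watt proxy, assuming throughput scales roughly with clock, which is optimistic at the top end):

```python
# Frequency (GHz) -> measured package power (W), from the list above.
measurements = {1.0: 90, 2.0: 110, 3.0: 140, 4.0: 210, 4.67: 340}

# GHz per watt as a crude throughput-per-watt proxy.
efficiency = {f: f / w for f, w in measurements.items()}
best = max(efficiency, key=efficiency.get)
print(best)  # 3 GHz is the sweet spot in this data
```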

Tasks running during the test:

142+Quyen-Pro-Max-v0.1                            run/imatrix 11/80 11.79s/c 52.3/62.5m(62) [266/318] 9.8944
nico1     750  175  I BigWeave-v14-90b                             run/imatrix 21/24,IQ3_XS [768/921]
nico1     750  134  I openbuddy-deepseek-67b-v15.3-4k              run/imatrix 8/24,Q6_K [29/858]
Total PCIE Bandwidth (GB/s) Total PCIE Rd Bandwidth (GB/s) Total PCIE Wr Bandwidth (GB/s) Total PCIE Bandwidth Local (GB/s) Total PCIE Bandwidth Remote (GB/s) Total PCIE Rd Bandwidth Local (GB/s) Total PCIE Wr Bandwidth Local (GB/s) Total PCIE Rd Bandwidth Remote (GB/s) Total PCIE Wr Bandwidth Remote (GB/s) Quad 0 PCIE Rd Bandwidth Local (GB/s) Quad 0 PCIE Wr Bandwidth Local (GB/s) Quad 0 PCIE Rd Bandwidth Remote (GB/s) Quad 0 PCIE Wr Bandwidth Remote (GB/s) Quad 1 PCIE Rd Bandwidth Local (GB/s) Quad 1 PCIE Wr Bandwidth Local (GB/s) Quad 1 PCIE Rd Bandwidth Remote (GB/s) Quad 1 PCIE Wr Bandwidth Remote (GB/s) Quad 2 PCIE Rd Bandwidth Local (GB/s) Quad 2 PCIE Wr Bandwidth Local (GB/s) Quad 2 PCIE Rd Bandwidth Remote (GB/s) Quad 2 PCIE Wr Bandwidth Remote (GB/s) Quad 3 PCIE Rd Bandwidth Local (GB/s) Quad 3 PCIE Wr Bandwidth Local (GB/s) Quad 3 PCIE Rd Bandwidth Remote (GB/s) Quad 3 PCIE Wr Bandwidth Remote (GB/s)
16.93 15.21 1.72 16.93 0.00 15.21 1.72 0.00 0.00 0.00 0.11 0.00 0.00 14.96 1.42 0.00 0.00 0.24 0.09 0.00 0.00 0.00 0.11 0.00 0.00
16.98 15.22 1.76 16.98 0.00 15.22 1.76 0.00 0.00 0.00 0.10 0.00 0.00 14.97 1.46 0.00 0.00 0.25 0.10 0.00 0.00 0.00 0.10 0.00 0.00
15.45 13.68 1.77 15.45 0.00 13.68 1.77 0.00 0.00 0.00 0.19 0.00 0.00 13.34 1.30 0.00 0.00 0.33 0.10 0.00 0.00 0.00 0.19 0.00 0.00
15.17 13.59 1.58 15.17 0.00 13.59 1.58 0.00 0.00 0.00 0.16 0.00 0.00 12.86 1.16 0.00 0.00 0.71 0.09 0.00 0.00 0.01 0.16 0.00 0.00
14.83 13.04 1.79 14.83 0.00 13.04 1.79 0.00 0.00 0.00 0.21 0.00 0.00 12.68 1.28 0.00 0.00 0.36 0.09 0.00 0.00 0.00 0.21 0.00 0.00
10.09 7.70 2.39 10.09 0.00 7.70 2.39 0.00 0.00 0.38 0.17 0.00 0.00 6.69 1.95 0.00 0.00 0.25 0.09 0.00 0.00 0.38 0.17 0.00 0.00
12.61 10.80 1.81 12.61 0.00 10.80 1.81 0.00 0.00 0.00 0.09 0.00 0.00 10.53 1.56 0.00 0.00 0.27 0.06 0.00 0.00 0.00 0.09 0.00 0.00
14.61 12.94 1.68 14.61 0.00 12.94 1.68 0.00 0.00 0.00 0.16 0.00 0.00 12.64 1.28 0.00 0.00 0.29 0.09 0.00 0.00 0.00 0.16 0.00 0.00
15.26 13.53 1.73 15.26 0.00 13.53 1.73 0.00 0.00 0.01 0.22 0.00 0.00 13.17 1.18 0.00 0.00 0.35 0.10 0.00 0.00 0.01 0.22 0.00 0.00
14.59 12.94 1.65 14.59 0.00 12.94 1.65 0.00 0.00 0.00 0.14 0.00 0.00 12.62 1.28 0.00 0.00 0.31 0.10 0.00 0.00 0.00 0.13 0.00 0.00
Packed 512-bit FP Ops Retired (%) Packed 256-bit FP Ops Retired (%) Packed 128-bit FP Ops Retired (%) Scalar/MMX/x87 FP Ops Retired (%)
0.77 0.43 41.38 57.42
0.15 0.15 41.78 57.92
0.61 0.17 29.09 70.14
0.31 0.36 27.99 71.34
0.04 0.24 56.66 43.06
0.26 0.16 41.53 58.04
1.35 0.21 32.83 65.61
1.61 0.22 32.49 65.68
0.68 0.18 32.74 66.40
0.68 0.20 25.76 73.35
L1 DC Miss (pti) L2 Data Read Miss (pti) L1 IC Miss (pti) L2 Code Read Miss (pti)
1.08 0.23 0.32 0.03
2.30 0.42 1.81 0.17
2.09 0.61 0.27 0.03
1.66 0.22 0.15 0.01
1.10 0.07 0.17 0.00
1.75 0.24 0.73 0.06
1.63 0.25 1.33 0.12
2.07 0.68 0.41 0.05
2.05 0.23 0.86 0.04
1.93 0.41 1.56 0.14
Local Inbound Read Data Bytes(GB/s) Local Outbound Write Data Bytes (GB/s) Remote Inbound Read Data Bytes(GB/s) Remote Outbound Write Data Bytes (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 1 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 2 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 3 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 4 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 5 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 6 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 7 at Interface 0 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 1 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 2 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 3 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 4 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 5 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 6 at Interface 1 (GB/s) Local Socket Inbound Data to CPU Moderator (CCM) 7 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 1 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 2 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 3 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 4 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 5 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 6 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 7 at Interface 0 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 1 
(GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 1 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 2 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 3 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 4 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 5 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 6 at Interface 1 (GB/s) Local Socket Outbound Data from CPU Moderator (CCM) 7 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 1 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 2 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 3 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 4 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 5 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 6 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 7 at Interface 0 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 1 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 2 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 3 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 4 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 5 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 6 at Interface 1 (GB/s) Remote Socket Inbound Data to CPU Moderator (CCM) 7 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 1 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 2 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU 
Moderator (CCM) 3 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 4 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 5 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 6 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 7 at Interface 0 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 1 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 2 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 3 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 4 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 5 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 6 at Interface 1 (GB/s) Remote Socket Outbound Data from CPU Moderator (CCM) 7 at Interface 1 (GB/s)
23.91 15.63 0.00 0.03 2.53 2.36 14.13 4.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.02 0.51 10.90 3.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20.43 16.07 0.00 0.02 1.47 1.57 0.65 16.73 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.28 0.16 15.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12.60 7.37 0.00 0.02 0.78 1.42 1.27 9.13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.14 0.45 0.53 6.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
22.40 16.48 0.00 0.02 1.46 2.49 2.23 16.22 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.53 0.53 0.97 14.45 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18.70 14.81 0.00 0.02 0.69 1.25 0.56 16.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.07 0.17 0.05 14.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
21.43 15.46 0.00 0.03 0.99 1.80 1.83 16.80 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.15 0.41 0.61 14.29 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20.83 16.16 0.00 0.03 0.78 2.01 1.04 17.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 0.41 0.37 15.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
19.60 15.28 0.00 0.02 1.27 1.61 0.61 16.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.39 0.32 0.13 14.45 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18.92 15.19 0.00 0.02 0.83 1.43 0.76 15.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.14 0.39 0.12 14.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20.25 15.74 0.00 0.02 1.28 3.35 1.60 14.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.31 2.31 0.49 12.63 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
All DC Fills (pti) DC Fills From Same CCX (pti) DC Fills From different CCX in same node (pti) DC Fills From Local Memory (pti) DC Fills From Remote CCX Cache (pti) DC Fills From Remote Memory (pti) Remote DRAM Reads %
0.89 0.85 0.00 0.04 0.00 0.00 0.00
1.10 1.05 0.00 0.05 0.00 0.00 0.00
0.87 0.83 0.00 0.04 0.00 0.00 0.00
0.54 0.53 0.00 0.01 0.00 0.00 0.00
0.85 0.82 0.00 0.02 0.00 0.00 0.00
0.58 0.57 0.00 0.01 0.00 0.00 0.00
1.08 1.01 0.00 0.07 0.00 0.00 0.00
0.91 0.88 0.01 0.02 0.00 0.00 0.00
1.31 1.23 0.01 0.07 0.00 0.00 0.00
5.91 5.42 0.00 0.48 0.00 0.00 0.00
Total Upstream DMA Read Write Data Bytes (GB/s) Local Upstream DMA Read Data Bytes (GB/s) Local Upstream DMA Write Data Bytes (GB/s) Remote Upstream DMA Read Data Bytes (GB/s) Remote Upstream DMA Write Data Bytes (GB/s)
15.65 14.15 1.50 0.00 0.00
14.61 13.12 1.49 0.00 0.00
12.10 10.23 1.87 0.00 0.00
9.92 7.82 2.10 0.00 0.00
16.14 14.71 1.43 0.00 0.00
14.91 13.41 1.50 0.00 0.00
14.95 13.45 1.50 0.00 0.00
17.33 15.64 1.70 0.00 0.00
17.06 15.44 1.63 0.00 0.00
15.32 13.91 1.41 0.00 0.00
Retired SSE/AVX Flops(GFLOPs) FP Dispatch Faults (pti)
4.61 0.00
4.21 0.00
4.70 0.00
4.94 0.00
4.46 0.00
5.68 0.00
4.55 0.00
4.40 0.00
5.54 0.00
6.00 0.00
HwPf DC Fills From DRAM or IO connected in remote node (pti) HwPf DC Fills From CCX Cache in remote node (pti) HwPf DC Fills From DRAM or IO connected in local node (pti) HwPf DC Fills From Cache of another CCX in local node (pti) HwPf DC Fills From L3 or different L2 in same CCX (pti) HwPf DC Fills From L2 (pti)
0.00 0.00 0.00 0.00 0.01 0.19
0.00 0.00 0.05 0.00 0.02 0.24
0.00 0.00 0.02 0.00 0.01 0.22
0.00 0.00 0.04 0.00 0.02 0.24
0.00 0.00 0.01 0.00 0.02 0.25
0.00 0.00 0.02 0.00 0.04 0.28
0.00 0.00 0.13 0.00 0.02 0.53
0.00 0.00 0.01 0.00 0.01 0.19
0.00 0.00 0.01 0.00 0.02 0.22
0.00 0.00 0.23 0.01 0.05 1.99
Utilization (%) System time (%) User time (%) System instructions (%) User instructions (%) Eff Freq (MHz) IPC (Sys + User) IPC (Sys) IPC (User) CPI (Sys + User) CPI (Sys) CPI (User) Giga Instructions Per Sec Locked Instructions (pti) Retired Branches (pti) Retired Branches Mispredicted (pti)
99.95 0.50 99.14 0.14 99.86 4661.73 1.56 0.43 1.57 0.64 2.33 0.64 7.25 0.00 45.81 2.01
99.95 1.96 97.68 0.80 99.20 4668.11 1.53 0.62 1.55 0.65 1.61 0.65 7.10 0.00 45.21 2.37
99.95 0.49 99.16 0.13 99.87 4665.48 1.57 0.42 1.58 0.64 2.38 0.63 7.29 0.00 38.26 1.61
99.95 0.48 99.17 0.13 99.87 4674.00 1.51 0.42 1.51 0.66 2.39 0.66 7.02 0.00 37.79 2.07
99.95 0.99 98.65 0.31 99.69 4661.84 1.58 0.49 1.59 0.63 2.03 0.63 7.34 0.00 39.13 1.53
99.95 7.26 92.35 2.38 97.62 4660.15 1.40 0.46 1.47 0.71 2.18 0.68 6.50 0.00 41.32 3.13
99.95 2.82 96.78 0.84 99.16 4681.43 1.47 0.43 1.50 0.68 2.30 0.67 6.87 0.00 37.85 2.39
99.95 0.52 99.12 0.13 99.87 4663.53 1.57 0.41 1.57 0.64 2.46 0.64 7.27 0.00 54.70 2.61
99.95 0.47 99.17 0.13 99.87 4674.25 1.53 0.42 1.53 0.65 2.39 0.65 7.12 0.00 39.25 1.31
93.41 1.39 98.23 0.38 99.62 4690.36 1.53 0.41 1.55 0.65 2.41 0.65 6.69 0.00 41.38 1.02
IC Fetch Miss Ratio Op Cache Fetch Miss Ratio IC Access (pti) IC Miss (pti) DC Access (pti)
0.06 0.01 2.49 0.15 227.32
0.07 0.07 19.45 1.44 274.00
0.04 0.04 12.51 0.46 259.45
0.05 0.02 3.54 0.17 254.83
0.09 0.01 2.08 0.18 257.49
0.05 0.02 3.97 0.21 279.37
0.04 0.02 3.86 0.17 240.55
0.04 0.02 4.23 0.16 245.10
0.03 0.03 5.35 0.19 251.73
0.08 0.02 4.45 0.37 257.30
L2 Access (pti) L2 Access from IC Miss (pti) L2 Access from DC Miss (pti) L2 Access from L2 HWPF (pti) L2 Miss (pti) L2 Miss from IC Miss (pti) L2 Miss from DC Miss (pti) L2 Miss from L2 HWPF (pti) L2 Hit (pti) L2 Hit from IC Miss (pti) L2 Hit from DC Miss (pti) L2 Hit from L2 HWPF (pti)
1.32 0.10 0.60 0.40 0.11 0.01 0.04 0.06 0.96 0.08 0.54 0.34
0.52 0.03 0.40 0.09 0.04 0.00 0.02 0.02 0.47 0.03 0.37 0.07
2.15 0.08 1.44 0.53 0.20 0.02 0.08 0.11 1.84 0.07 1.34 0.43
1.99 0.28 1.36 0.20 0.12 0.01 0.05 0.06 1.75 0.25 1.36 0.14
1.62 0.08 1.41 0.15 0.09 0.01 0.04 0.04 1.50 0.07 1.32 0.11
2.04 0.17 1.66 0.21 0.12 0.01 0.05 0.06 1.91 0.16 1.60 0.15
2.83 0.31 1.47 0.83 0.21 0.03 0.06 0.12 2.19 0.27 1.21 0.71
1.64 0.06 1.20 0.34 0.08 0.01 0.03 0.04 1.55 0.06 1.18 0.30
0.86 0.05 0.66 0.13 0.07 0.00 0.03 0.04 0.75 0.04 0.62 0.09
2.12 0.13 1.93 0.79 0.20 0.04 0.07 0.09 2.36 0.08 1.59 0.70
L3 Access L3 Miss L3 Miss % Ave L3 Miss Latency (ns)
79581629.00 26869900.00 33.76 108.63
74689852.00 29239352.00 39.15 113.88
67081825.00 22431193.00 33.44 106.52
52306516.00 16520234.00 31.58 111.40
45881135.00 9610550.00 20.95 104.17
63687583.00 26049615.00 40.90 124.44
57509142.00 14470472.00 25.16 103.89
71741584.00 17767547.00 24.77 102.42
61719580.00 19476650.00 31.56 100.81
62135911.00 27658654.00 44.51 118.97
Total Mem Bw (GB/s) Local DRAM Read Data Bytes(GB/s) Local DRAM Write Data Bytes(GB/s) Remote DRAM Read Data Bytes (GB/s) Remote DRAM Write Data Bytes (GB/s) Total Mem RdBw (GB/s) Total Mem WrBw (GB/s)
54.44 36.59 17.85 0.00 0.00 36.59 17.85
54.07 36.88 17.19 0.00 0.00 36.88 17.19
52.21 35.50 16.71 0.00 0.00 35.50 16.71
52.94 35.66 17.28 0.00 0.00 35.66 17.28
53.34 36.29 17.05 0.00 0.00 36.29 17.05
49.20 32.95 16.25 0.00 0.00 32.95 16.25
37.36 25.07 12.29 0.00 0.00 25.07 12.29
56.59 38.58 18.00 0.00 0.00 38.58 18.00
62.99 43.26 19.72 0.00 0.00 43.26 19.72
53.37 35.92 17.44 0.00 0.00 35.92 17.44
Total PCIE Bandwidth (GB/s) Total PCIE Rd Bandwidth (GB/s) Total PCIE Wr Bandwidth (GB/s) Total PCIE Bandwidth Local (GB/s) Total PCIE Bandwidth Remote (GB/s) Total PCIE Rd Bandwidth Local (GB/s) Total PCIE Wr Bandwidth Local (GB/s) Total PCIE Rd Bandwidth Remote (GB/s) Total PCIE Wr Bandwidth Remote (GB/s) Quad 0 PCIE Rd Bandwidth Local (GB/s) Quad 0 PCIE Wr Bandwidth Local (GB/s) Quad 0 PCIE Rd Bandwidth Remote (GB/s) Quad 0 PCIE Wr Bandwidth Remote (GB/s) Quad 1 PCIE Rd Bandwidth Local (GB/s) Quad 1 PCIE Wr Bandwidth Local (GB/s) Quad 1 PCIE Rd Bandwidth Remote (GB/s) Quad 1 PCIE Wr Bandwidth Remote (GB/s) Quad 2 PCIE Rd Bandwidth Local (GB/s) Quad 2 PCIE Wr Bandwidth Local (GB/s) Quad 2 PCIE Rd Bandwidth Remote (GB/s) Quad 2 PCIE Wr Bandwidth Remote (GB/s) Quad 3 PCIE Rd Bandwidth Local (GB/s) Quad 3 PCIE Wr Bandwidth Local (GB/s) Quad 3 PCIE Rd Bandwidth Remote (GB/s) Quad 3 PCIE Wr Bandwidth Remote (GB/s)
15.06 13.70 1.36 15.06 0.00 13.70 1.36 0.00 0.00 0.00 0.03 0.00 0.00 13.06 1.30 0.00 0.00 0.64 0.00 0.00 0.00 0.00 0.03 0.00 0.00
14.75 13.38 1.37 14.75 0.00 13.38 1.37 0.00 0.00 0.00 0.04 0.00 0.00 13.27 1.30 0.00 0.00 0.11 0.00 0.00 0.00 0.00 0.04 0.00 0.00
14.89 13.60 1.29 14.89 0.00 13.60 1.29 0.00 0.00 0.00 0.02 0.00 0.00 13.51 1.25 0.00 0.00 0.09 0.00 0.00 0.00 0.00 0.02 0.00 0.00
14.92 13.54 1.38 14.92 0.00 13.54 1.38 0.00 0.00 0.00 0.07 0.00 0.00 13.41 1.26 0.00 0.00 0.12 0.00 0.00 0.00 0.00 0.06 0.00 0.00
14.46 13.14 1.32 14.46 0.00 13.14 1.32 0.00 0.00 0.00 0.01 0.00 0.00 13.08 1.30 0.00 0.00 0.06 0.00 0.00 0.00 0.00 0.01 0.00 0.00
14.51 13.20 1.30 14.51 0.00 13.20 1.30 0.00 0.00 0.00 0.01 0.00 0.00 13.15 1.28 0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.01 0.00 0.00
15.01 13.69 1.33 15.01 0.00 13.69 1.33 0.00 0.00 0.00 0.03 0.00 0.00 13.57 1.26 0.00 0.00 0.11 0.00 0.00 0.00 0.00 0.03 0.00 0.00
14.80 13.48 1.32 14.80 0.00 13.48 1.32 0.00 0.00 0.00 0.03 0.00 0.00 13.36 1.26 0.00 0.00 0.12 0.00 0.00 0.00 0.00 0.03 0.00 0.00
6.96 4.49 2.47 6.96 0.00 4.49 2.47 0.00 0.00 0.00 0.05 0.00 0.00 4.35 2.37 0.00 0.00 0.13 0.00 0.00 0.00 0.00 0.05 0.00 0.00
14.53 13.22 1.31 14.53 0.00 13.22 1.31 0.00 0.00 0.00 0.03 0.00 0.00 13.09 1.24 0.00 0.00 0.13 0.00 0.00 0.00 0.00 0.03 0.00 0.00
Total_Dispatch_Slots SMT_Disp_contention Frontend_Bound Bad_Speculation Backend_Bound Retiring Frontend_Bound.Latency Frontend_Bound.BW Bad_Speculation.Mispredicts Bad_Speculation.Pipeline_Restarts Backend_Bound.Memory Backend_Bound.CPU Retiring.Fastpath Retiring.Microcode
83375943738.00 40.51 4.31 4.85 17.42 31.61 3.54 0.77 4.84 0.02 2.18 15.23 31.60 0.01
83724495948.00 42.84 2.49 2.83 18.25 32.80 2.07 0.43 2.81 0.02 1.35 16.90 32.79 0.01
83593257192.00 42.13 3.47 4.26 18.34 30.59 2.81 0.66 4.22 0.03 1.64 16.70 30.58 0.01
77319727164.00 42.22 2.61 2.87 19.14 32.18 2.11 0.50 2.85 0.03 1.43 17.71 32.17 0.01
83661001026.00 42.16 3.38 4.33 18.15 30.79 2.74 0.64 4.30 0.03 1.58 16.56 30.78 0.01
83528696586.00 42.56 2.18 2.52 19.02 32.82 1.77 0.42 2.49 0.02 1.17 17.86 32.81 0.01
83669848248.00 38.92 6.94 3.70 19.44 29.80 5.36 1.59 3.68 0.02 3.51 15.94 29.57 0.23
83371451310.00 42.09 3.34 3.71 17.11 32.51 2.77 0.56 3.70 0.01 2.03 15.08 32.50 0.01
83517669888.00 42.20 2.88 3.61 17.84 32.38 2.34 0.54 3.59 0.03 1.43 16.42 32.37 0.01
83398053606.00 42.64 2.30 2.56 17.76 33.95 1.90 0.39 2.55 0.02 1.39 16.37 33.94 0.01
L1 ITLB Miss (pti) L2 ITLB Miss (pti) L1 DTLB Miss (pti) L2 DTLB Miss (pti) All TLBs Flushed (pti)
0.26 0.06 0.57 0.05 0.00
0.00 0.00 0.17 0.01 0.00
0.11 0.04 0.40 0.05 0.00
0.00 0.00 0.21 0.01 0.00
0.00 0.00 0.17 0.01 0.00
0.00 0.00 0.09 0.00 0.00
0.04 0.01 0.19 0.02 0.00
0.00 0.00 0.23 0.01 0.00
0.00 0.00 0.26 0.01 0.00
0.01 0.00 0.23 0.01 0.00
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:C1:00.0 Off |                  Off |
|  0%   33C    P0            110W /  450W |   20837MiB /  24564MiB |     68%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   4018019      C   ...s/llama.cpp/build/bin/llama-imatrix        474MiB |
+-----------------------------------------------------------------------------------------+

With all layers offloaded to GPU:

142+KoSOLAR-v0.2-gugutypus-10.7B                  run/imatrix 50/48 1.52s/c 2.5/9.0m(9) [93/355] 12.9785
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:C1:00.0 Off |                  Off |
|  0%   31C    P0             94W /  450W |   20967MiB /  24564MiB |     22%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2515801      C   ...s/llama.cpp/build/bin/llama-imatrix        476MiB |
+-----------------------------------------------------------------------------------------+

Nice that our communication breakdown has ended - I suspect you overlooked my question about ikawrakow's work?

I am not sure what your conclusion w.r.t. power usage is, but I have been clocking the efficiency cores on my home server at 2.3 instead of 4.5 GHz or so for a long time - half the speed, but only a third of the power usage. Power usage grows roughly quadratically with frequency, and both AMD and Intel chips clock way outside the most efficient range. Even a moderate decrease from 4.5 to 4 or 3.5 GHz decreases power usage a lot more than I lose in throughput.

For nvidia, clocking down the compute units but not the memory might not affect computation speed much, but could reduce power usage a lot. Maybe there is a point in reducing the frequency of both at certain times?
