Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggerganov/llama.cpp#5962
In the meantime, use the largest quantization that fully fits in your GPU's memory. If you can comfortably fit Q4_K_S, try a model with more parameters instead.
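
To get a rough idea of whether a given quantization of a larger model would still fit, the weight size can be estimated from the parameter count and the bits per weight. This is only a back-of-the-envelope sketch: the parameter counts below are assumptions (about 7.24e9 is consistent with the 13.49 GiB f16 file at 16 bits per weight, and 13e9 is a hypothetical larger model), the bits-per-weight values are the ones measured for this 7B model and may differ slightly for other architectures, and context/KV-cache memory is not included.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimated_weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    return n_params * bits_per_weight / 8 / GIB


# Assumed parameter counts, not taken from this thread.
PARAMS_7B = 7.24e9
PARAMS_13B = 13.0e9  # hypothetical larger model

# Bits per weight for Q4_K_S (4.57) and IQ3_XS (3.32) as listed in the table below.
print(f"7B  @ Q4_K_S: {estimated_weights_gib(PARAMS_7B, 4.57):.2f} GiB")
print(f"13B @ IQ3_XS: {estimated_weights_gib(PARAMS_13B, 3.32):.2f} GiB")
```

Comparing the two estimates against your free GPU memory gives a first hint as to whether a bigger model at a lower bit width is an option at all.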
See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix
- Last updated 2024-02-27 (add IQ4_XS).
- imatrix from wiki.train, 200*512 tokens.
- KL-divergence measured on wiki.test.
Quantization | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
---|---|---|---|---|---|
IQ1_S | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
IQ2_XS | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
IQ2_S | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
IQ2_M | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
Q2_K_S | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
Q2_K | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
IQ3_XS | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
Q3_K_S | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
IQ3_S | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
IQ3_M | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
Q3_K_M | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
Q3_K_L | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
IQ4_XS | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079 |
IQ4_NL | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
Q4_K_S | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
Q4_K_M | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
Q5_K_S | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
Q5_K_M | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
Q6_K | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
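
For readers unfamiliar with the columns above, here is a minimal pure-Python sketch of how such statistics could be computed from per-token logits of the base model and a quantized model. It is not the code used to produce the table: the function names are hypothetical, the KL direction (base relative to quantized) and the interpolated quantile are assumptions, and real runs work on full vocabularies rather than toy lists.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the base distribution, q the quantized one."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def quantile(values, fraction):
    """Quantile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    pos = fraction * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def compare(base_logits, quant_logits, targets, eps=1e-12):
    """Summarize how far the quantized model's predictions drift from the base."""
    kls, differ = [], 0
    nll_base = nll_quant = 0.0
    for b, q, t in zip(base_logits, quant_logits, targets):
        p_b, p_q = softmax(b), softmax(q)
        kls.append(kl_divergence(p_b, p_q))
        # Count positions where the most likely token changes.
        differ += p_b.index(max(p_b)) != p_q.index(max(p_q))
        # Negative log-likelihood of the actual next token under each model.
        nll_base -= math.log(max(p_b[t], eps))
        nll_quant -= math.log(max(p_q[t], eps))
    n = len(targets)
    return {
        "kl_median": quantile(kls, 0.5),
        "kl_q99": quantile(kls, 0.99),
        "top_token_differs": differ / n,
        # PPL = exp(mean NLL), so ln(PPL(Q)/PPL(base)) is the difference of mean NLLs.
        "ln_ppl_ratio": (nll_quant - nll_base) / n,
    }
```

A slightly negative `ln_ppl_ratio`, as in the Q6_K row, just means the quantized model happened to score marginally better on the test text than the base model.
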
- Last updated 2024-03-15 (bench #6083).
Quantization | GiB | pp512 -ngl 99 | tg128 -ngl 99 | pp512 -ngl 0 | tg128 -ngl 0 | pp512 -ngl 0 #6083 |
---|---|---|---|---|---|---|
IQ1_S | 1.50 | 709.29 | 74.85 | 324.35 | 15.66 | 585.61 |
IQ2_XS | 2.05 | 704.52 | 58.44 | 316.10 | 15.11 | 557.68 |
IQ3_XS | 2.79 | 682.72 | 45.79 | 300.61 | 10.49 | 527.83 |
IQ4_XS | 3.64 | 712.96 | 64.17 | 292.36 | 11.06 | 495.92 |
Q4_0 | 3.83 | 870.44 | 63.42 | 310.94 | 10.44 | 554.56 |
Q5_K | 4.78 | 691.40 | 46.52 | 273.83 | 8.54 | 453.58 |
Q6_K | 5.53 | 661.98 | 47.57 | 261.16 | 7.34 | 415.22 |
Q8_0 | 7.17 | 881.95 | 39.74 | 270.70 | 5.74 | 440.44 |
f16 | 13.49 | | | 211.12 | 3.06 | 303.60 |
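
The "largest that fully fits" rule of thumb can be sketched with the file sizes from the GiB column above. This is only an illustration: `largest_fitting` and `reserve_gib` are hypothetical names, and the 1 GiB default headroom for context and other buffers is an assumption you should adjust for your own setup.

```python
# File sizes in GiB, copied from the GiB column of the benchmark table.
SIZES_GIB = {
    "IQ1_S": 1.50, "IQ2_XS": 2.05, "IQ3_XS": 2.79, "IQ4_XS": 3.64,
    "Q4_0": 3.83, "Q5_K": 4.78, "Q6_K": 5.53, "Q8_0": 7.17, "f16": 13.49,
}


def largest_fitting(vram_gib, reserve_gib=1.0, sizes=SIZES_GIB):
    """Return the name of the largest file that fits in the budget, or None."""
    budget = vram_gib - reserve_gib  # assumed headroom for context, buffers, etc.
    fitting = {name: size for name, size in sizes.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(largest_fitting(8.0))
```

Note that this ranks purely by file size; it does not account for speed or for quality differences between types of similar size.
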
I see this chart is for Mistral 7b. Would there be a meaningful difference in the same chart done with a larger model, perhaps a 70b? It's my understanding that performance at low BPW scales up with parameter count, so I'd be curious to see how the graph changes.