...
https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main
https://github.com/AUTOMATIC1111/stable-diffusion-webui
Intel tools
Intel oneAPI
oneMKL (oneAPI Math Kernel Library), oneDNN (oneAPI Deep Neural Network Library)
...
| Model | sec to load the model | layers to GPU | prompt eval rate | eval rate |
|---|---|---|---|---|
| deepseek-r1:70b | 54.25 (2.5x slower) | 81/81 | 0.89 tokens/s | 1.62 tokens/s |
| llama3.3:70b | 53.34 (2.5x slower) | 81/81 | 1.52 tokens/s | 1.44 tokens/s |
| qwen3:32b | 28.04 (2.8x slower) | 65/65 | 3.76 tokens/s | 2.93 tokens/s |
| phi3:14b | 19.09 (5.4x slower) | 41/41 | 10.48 tokens/s | 7.70 tokens/s |
| deepseek-v2:16b | 14.56 (3.6x slower) | 28/28 | 4.96 tokens/s | 11.26 tokens/s |
| openchat:7b | 6.53 (2.6x slower) | 33/33 | 29.24 tokens/s | 16.35 tokens/s |
| llama4:scout | N/A | N/A | N/A | N/A |
| gemma3:27b | N/A | N/A | N/A | N/A |
| mistral-small3.1:24b | N/A | N/A | N/A | N/A |
CPU vs GPU
| Model | started in (seconds) | params | size | prompt eval rate | eval rate |
|---|---|---|---|---|---|
| deepseek-r1:70b | 21.34 | 70B | 42 GB | 2.20 tokens/s | 1.24 tokens/s |
| llama3.3:70b | 21.34 | 70B | 42 GB | 2.39 tokens/s | 1.23 tokens/s |
| qwen3:32b | 10.04 | 32B | 20 GB | 5.63 tokens/s | 2.54 tokens/s |
| gemma3:27b | 1.76 | 27B | 17 GB | 6.66 tokens/s | 3.03 tokens/s |
| mistral-small3.1:24b | 3.26 | 24B | 15 GB | 7.72 tokens/s | 3.60 tokens/s |
| llama4:scout | 13.55 | 17B | 67 GB | 11.47 tokens/s | 4.76 tokens/s |
| deepseek-v2:16b | 4.02 | 16B | 8.9 GB | 58.75 tokens/s | 24.50 tokens/s |
| phi3:14b | 3.52 | 14B | 7.9 GB | 15.12 tokens/s | 6.05 tokens/s |
| openchat:7b | 2.51 | 7B | 4.1 GB | 30.37 tokens/s | 11.19 tokens/s |
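The "(Nx slower)" factors in the GPU table appear to be the GPU load time divided by the corresponding CPU time from the table above; a quick arithmetic check (all numbers copied from the two tables):

```python
# (GPU load time, CPU "started in" time) in seconds, per model, from the tables above.
times = {
    "deepseek-r1:70b": (54.25, 21.34),
    "llama3.3:70b":    (53.34, 21.34),
    "qwen3:32b":       (28.04, 10.04),
    "phi3:14b":        (19.09, 3.52),
    "deepseek-v2:16b": (14.56, 4.02),
    "openchat:7b":     (6.53, 2.51),
}

# Each ratio reproduces the "(Nx slower)" figure in the GPU table.
for model, (gpu, cpu) in times.items():
    print(f"{model}: {gpu / cpu:.1f}x slower")
```

For every model with data in both tables the ratio matches the quoted factor (e.g. 54.25 / 21.34 ≈ 2.5).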
ollama CPU
install
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
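The per-model rates in the tables above come from the timing block that `ollama run <model> --verbose` prints after each response. A small parser sketch for pulling the tokens/s figures out of that block (the sample text below is illustrative, assuming the usual field names):

```python
import re

# Illustrative sample of the stats `ollama run <model> --verbose` prints
# after a response (numbers here are made up for the example).
sample = """\
total duration:       12.3s
load duration:        2.51s
prompt eval count:    26 token(s)
prompt eval rate:     30.37 tokens/s
eval count:           183 token(s)
eval rate:            11.19 tokens/s
"""

def parse_rates(text: str) -> dict:
    """Extract the 'prompt eval rate' and 'eval rate' figures (tokens/s)."""
    rates = {}
    for name in ("prompt eval rate", "eval rate"):
        # Anchor at line start so "eval rate" does not match "prompt eval rate".
        m = re.search(rf"^{name}:\s+([\d.]+) tokens/s", text, re.MULTILINE)
        if m:
            rates[name] = float(m.group(1))
    return rates

print(parse_rates(sample))
```

This is how the "prompt eval rate" and "eval rate" columns above can be filled in per model without copying numbers by hand.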
...