...
| Code Block |
|---|
| source llm_env/bin/activate |
| #pip install open-webui==0.2.5 |
| pip install open-webui # 0.6.10 |
| open-webui serve |
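The activation line assumes the virtual environment already exists. A minimal sketch of the full setup (the llm_env name comes from the command above; creating it with python3 -m venv is an assumption):

| Code Block |
|---|
| python3 -m venv llm_env |
| source llm_env/bin/activate |
| pip install open-webui |
| open-webui serve |

By default, open-webui serve listens on port 8080 and connects to a local Ollama instance at http://localhost:11434.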
GPU backend model performance
| Model | Load time (s, vs CPU) | Layers on GPU | Prompt eval rate | Eval rate |
|---|---|---|---|---|
| deepseek-r1:70b (DeepSeek R1 Distill Llama 70B) | 54.25 (2.5x slower) | 81/81 | 0.89 tokens/s | 1.62 tokens/s |
| llama3.3:70b | 53.34 (2.5x slower) | 81/81 | 1.52 tokens/s | 1.44 tokens/s |
| qwen3:32b | 28.04 (2.8x slower) | 65/65 | 3.76 tokens/s | 2.93 tokens/s |
| phi3:14b | 19.09 (5.4x slower) | 41/41 | 10.48 tokens/s | 7.70 tokens/s |
| deepseek-v2:16b | 14.56 (3.6x slower) | 28/28 | 4.96 tokens/s | 11.26 tokens/s |
| openchat:7b | 6.53 (2.6x slower) | 33/33 | 29.24 tokens/s | 16.35 tokens/s |
| llama4:scout | N/A | N/A | N/A | N/A |
| gemma3:27b | N/A | N/A | N/A | N/A |
| mistral-small3.1:24b | N/A | N/A | N/A | N/A |
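The prompt eval rate and eval rate columns match the statistics that ollama run --verbose prints after each response; the layer split is reported in the Ollama server log. A minimal sketch for reproducing a row, assuming a Linux systemd install for the log check:

| Code Block |
|---|
| ollama run qwen3:32b --verbose |
| # server log shows the split, e.g. "offloaded 65/65 layers to GPU" |
| journalctl -u ollama \| grep offloaded |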
CPU vs GPU
| Model | Load time (s) | Params | Size | Prompt eval rate | Eval rate |
|---|---|---|---|---|---|
| deepseek-r1:70b | 21.34 | 70B | 42 GB | 2.20 tokens/s | 1.24 tokens/s |
| llama3.3:70b | 21.34 | 70B | 42 GB | 2.39 tokens/s | 1.23 tokens/s |
| qwen3:32b | 10.04 | 32B | 20 GB | 5.63 tokens/s | 2.54 tokens/s |
| gemma3:27b | 1.76 | 27B | 17 GB | 6.66 tokens/s | 3.03 tokens/s |
| mistral-small3.1:24b | 3.26 | 24B | 15 GB | 7.72 tokens/s | 3.60 tokens/s |
| llama4:scout | 13.55 | 17B | 67 GB | 11.47 tokens/s | 4.76 tokens/s |
| deepseek-v2:16b | 4.02 | 16B | 8.9 GB | 58.75 tokens/s | 24.50 tokens/s |
| phi3:14b | 3.52 | 14B | 7.9 GB | 15.12 tokens/s | 6.05 tokens/s |
| openchat:7b | 2.51 | 7B | 4.1 GB | 30.37 tokens/s | 11.19 tokens/s |
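To compare both backends on the same machine, one way to force CPU-only inference is to keep the model off the GPU entirely; hiding the CUDA devices (assuming an NVIDIA backend) or offloading zero layers are two sketches of this, not the only options:

| Code Block |
|---|
| # hide the GPU from the server process (assumes a CUDA backend) |
| CUDA_VISIBLE_DEVICES="" ollama serve |
| # or, inside an ollama run session, offload zero layers |
| /set parameter num_gpu 0 |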
ollama CPU
install
| Code Block |
|---|
curl -fsSL https://ollama.com/install.sh \| sh |
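A quick sanity check after the install; openchat:7b is used here only because it is the smallest model in the tables above:

| Code Block |
|---|
| ollama --version |
| ollama pull openchat:7b |
| ollama run openchat:7b "Hello" --verbose |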
...


