...

Code Block
source llm_env/bin/activate
# pip install open-webui==0.2.5    # previously pinned version
pip install open-webui              # installs 0.6.10 at the time of writing
open-webui serve
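
For reference, a minimal sketch of the full setup, assuming Python 3.11+ is available as python3 and that Ollama is already serving on its default port; the venv name llm_env matches the activate line above:

Code Block
# create and activate the virtualenv used above
python3 -m venv llm_env
source llm_env/bin/activate
pip install --upgrade pip
pip install open-webui

# point Open WebUI at the local Ollama instance (Ollama's default port is 11434)
export OLLAMA_BASE_URL=http://127.0.0.1:11434
open-webui serve    # the UI listens on port 8080 by default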


GPU backend model performance

Load times in parentheses are relative to the corresponding "started in" values in the CPU vs GPU table below (e.g. 54.25 s ≈ 2.5 × 21.34 s). N/A marks models that were not benchmarked on this backend; blank cells are measurements that were not recorded.

| Model | sec to load the model | layers to GPU | prompt eval rate | eval rate |
|---|---|---|---|---|
| deepseek-r1:70b (DeepSeek R1 Distill Llama 70B) | 54.25 (2.5x slower) | 81/81 | 0.89 tokens/s | 1.62 tokens/s |
| llama3.3:70b | 53.34 (2.5x slower) | 81/81 | 1.52 tokens/s | 1.44 tokens/s |
| qwen3:32b | 28.04 (2.8x slower) | 65/65 | 3.76 tokens/s | 2.93 tokens/s |
| phi3:14b | 19.09 (5.4x slower) | 41/41 | | |
| deepseek-v2:16b | 14.56 (3.6x slower) | 28/28 | 4.96 tokens/s | 11.26 tokens/s |
| openchat:7b | 6.53 (2.6x slower) | 33/33 | 29.24 tokens/s | 16.35 tokens/s |
| gemma3:27b | N/A | N/A | 10.48 tokens/s | 7.70 tokens/s |
| llama4:scout | N/A | N/A | N/A | N/A |
| mistral-small3.1:24b | N/A | N/A | N/A | N/A |
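
The rates above correspond to the statistics Ollama prints when a prompt is run with the --verbose flag; the layer counts come from the server log. A typical collection run, with the model and prompt as examples:

Code Block
# time the model load and print generation statistics
time ollama run qwen3:32b "Explain quicksort in one paragraph." --verbose
# --verbose appends timings such as:
#   prompt eval rate:     3.76 tokens/s
#   eval rate:            2.93 tokens/s
# the "layers to GPU" figure appears in the ollama server log, e.g.:
#   llm_load_tensors: offloaded 65/65 layers to GPU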

CPU vs GPU

| Model | started in (seconds) | params | size | prompt eval rate | eval rate |
|---|---|---|---|---|---|
| deepseek-r1:70b | 21.34 | 70B | 42 GB | 2.20 tokens/s | 1.24 tokens/s |
| llama3.3:70b | 21.34 | 70B | 42 GB | 2.39 tokens/s | 1.23 tokens/s |
| qwen3:32b | 10.04 | 32B | 20 GB | 5.63 tokens/s | 2.54 tokens/s |
| gemma3:27b | 1.76 | 27B | 17 GB | 6.66 tokens/s | 3.03 tokens/s |
| mistral-small3.1:24b | 3.26 | 24B | 15 GB | 7.72 tokens/s | 3.60 tokens/s |
| llama4:scout | 13.55 | 17B | 67 GB | 11.47 tokens/s | 4.76 tokens/s |
| deepseek-v2:16b | 4.02 | 16B | 8.9 GB | 58.75 tokens/s | 24.50 tokens/s |
| phi3:14b | 3.52 | 14B | 7.9 GB | 15.12 tokens/s | 6.05 tokens/s |
| openchat:7b | 2.51 | 7B | 4.1 GB | 30.37 tokens/s | 11.19 tokens/s |
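
Whether a loaded model actually landed on the GPU or fell back to the CPU can be checked with ollama ps; the PROCESSOR column is the relevant one (the output below is illustrative):

Code Block
ollama ps
# NAME                   ID             SIZE     PROCESSOR    UNTIL
# mistral-small3.1:24b   b871a0d4...    15 GB    100% GPU     4 minutes from now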

ollama CPU

install

Code Block
curl -fsSL https://ollama.com/install.sh | sh
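
After installation, a quick smoke test, assuming the install script has registered and started the systemd service (otherwise start ollama serve manually first); openchat:7b is used here only because it is the smallest model in the tables above:

Code Block
# ollama serve &    # only needed if the systemd service is not running
ollama pull openchat:7b
ollama run openchat:7b "Say hello." --verbose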

...