CLIP (Contrastive Language-Image Pretraining) predicts the most relevant text snippet for a given image.
It is the ML (machine learning) model that Immich uses for text search over media (photo and video) files.
Immich supports many models, which can be found at https://huggingface.co/immich-app and then configured in the settings.
Different models take different amounts of time to index media. Some models are multilingual, some are not.
The default CLIP model is ViT-B-32__openai (not multilingual).
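
To make the idea of text search over media more concrete, here is a minimal sketch of how CLIP-style search generally works: the query text and every image are turned into vectors, and images are ranked by cosine similarity to the query. This is an illustration only, not Immich's actual code; the file names and the tiny 3-dimensional vectors are made up (real models produce vectors with hundreds of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_assets(query_embedding, asset_embeddings):
    """Return (name, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in asset_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings: in a real setup the image encoder of the
# chosen CLIP model would produce these during indexing.
assets = {
    "apple_fruit.jpg": [0.9, 0.1, 0.0],
    "apple_logo.png": [0.6, 0.0, 0.8],
    "river_fog.jpg": [0.0, 1.0, 0.1],
}

# Hypothetical embedding of the query text "apple" from the text encoder.
query = [1.0, 0.0, 0.1]

ranking = rank_assets(query, assets)
for name, similarity in ranking:
    print(f"{name}: {similarity:.3f}")
```

Because the text encoder differs between models, the same query can rank the same library very differently, which is what the comparison in this post is about.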
Time to process
The first test was run on a library of 320,000 photos and 17,000 videos, 5.4 TB in total.
The latest test processed a library of 331,000 photos and 22,800 videos, 6 TB in total (the same library with some new photos added).

nllb-clip-large-siglip__mrl takes even more time than ViT-B-16-SigLIP-512__webli to process media:
it takes about 20 days
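
As a rough feel for what "about 20 days" means per file, here is a back-of-the-envelope calculation. It assumes the 20 days refer to the latest library size (331,000 photos and 22,800 videos) and that indexing ran continuously; neither is stated exactly, so treat the numbers as an estimate only.

```python
# Assumption: the ~20 days apply to the latest library size.
photos = 331_000
videos = 22_800
days = 20

assets = photos + videos
assets_per_day = assets / days
seconds_per_asset = days * 24 * 3600 / assets

print(f"assets: {assets}")
print(f"assets per day: {assets_per_day:.0f}")
print(f"seconds per asset: {seconds_per_asset:.1f}")
```

Under these assumptions that is roughly 17,700 files per day, or a bit under 5 seconds per file.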

on i7-1355U

Executed on 2 devices with the same library, using OpenVINO (this time). The results are the same for English text, with some differences in Ukrainian.
on Intel Core i7 1355U

on Intel N150

Core i7-1355U + OpenVINO

Intel N150 + OpenVINO

Core i7-1355U + OpenVINO

Published test results for the models are to be found at https://immich.app/docs/features/searching/#clip-models
Since those models are tested mostly on predefined sets, I tested the search quality on my personal media folder with some semi-random queries in both English and Ukrainian.
This means some percentage of the "quality" marks deviates based on whether the results "feel" good, not just on whether they are formally correct.
For example, a search for "apple" that returns the Apple Computers logo is formally 100% correct, but results with at least 50% apple fruit feel like more correct output.
| Word/Phrase | XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k | ViT-SO400M-16-SigLIP2-256__webli | ViT-SO400M-16- | ViT-SO400M-14- | ViT-H-14-378-quickgelu__dfn5b | ViT-B-16-SigLIP-i18n-256__webli | nllb-clip-large-siglip__mrl | nllb-clip-large-siglip__v1 |
|---|---|---|---|---|---|---|---|---|
| Total score | 65% | 72% | 74% | 73% | 56% | 59% | 56% | 49% |
| English Score (/18) | 85% | 94% | 98% | 91% | 95% | 71% | 71% | 63% |
| anxiety | 0 | 100 | 100 | 95 | 100 | 100 | 0 | 0 |
| apple | 98 | 95 | 95 | 95 | 95 | 20 | 80 | 75 |
| apple tree | 90 | 95 | 95 | 95 | 95 | 80 | 80 | 80 |
| arguing people | 95 | 95 | 95 | 75 | 100 | 95 | 90 | 90 |
| car | 95 | 80 | 95 | 75 | 75 | 60 | 80 | 80 |
| car on a parking lot | 95 | 90 | 95 | 85 | 100 | 70 | 100 | 100 |
| city fog | 95 | 95 | 100 | 100 | 100 | 75 | 100 | 80 |
| exchange rate | 93 | 95 | 95 | 90 | 80 | 70 | 80 | 70 |
| feel sad | 95 | 95 | 90 | 80 | 95 | 5 | 0 | 4 |
| fog | 95 | 70 | 100 | 70 | 95 | 70 | 70 | 70 |
| green fence | 100 | 100 | 100 | 100 | 98 | 100 | 50 | 50 |
| honda civic | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 |
| peach | 65 | 100 | 100 | 100 | 95 | 95 | 55 | 50 |
| person feel anxiety | 98 | 97 | 98 | 96 | 98 | 95 | 95 | 65 |
| Range Rover | 35 | 95 | 100 | 90 | 90 | 35 | 50 | 60 |
| river fog | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 90 |
| serenity | 85 | 100 | 100 | 100 | 95 | 65 | 75 | 0 |
| TV | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 85 |
| Ukrainian score (/31) | 45% | 51% | 51% | 56% | 18% | 48% | 40% | 35% |
| відчувати радість | 50 | 0 | 0 | 0 | 0 | 0 | 25 | 28 |
| відчувати сум | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| заправка wog | 0 | 100 | 95 | 100 | 0 | 95 | 0 | 0 |
| заправка кло | 50 | 65 | 95 | 90 | 0 | 50 | 0 | 0 |
| зелений забор | 85 | 85 | 95 | 90 | 5 | 85 | 95 | 80 |
| курс обміну валют | 95 | 65 | 85 | 65 | 0 | 65 | 65 | 65 |
| люди в лісі | 100 | 100 | 100 | 100 | 85 | 100 | 5 | 5 |
| люди які сперечаються | 0 | 65 | 5 | 85 | 0 | 0 | 90 | 90 |
| людина що відчувають розчарування | 0 | 0 | 0 | 0 | 0 | 30 | 80 | 5 |
| мати сумніви | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| машина на стоянці | 100 | 100 | 95 | 100 | 60 | 45 | 3 | 3 |
| машина | 100 | 65 | 95 | 100 | 20 | 100 | 3 | 3 |
| персик | 95 | 100 | 5 | 95 | 70 | 0 | 55 | 30 |
| почуття суму | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| радість | 5 | 0 | 0 | 0 | 0 | 70 | 0 | 0 |
| Рейндж Ровер | 5 | 90 | 70 | 90 | 0 | 10 | 50 | 80 |
| Розчарування | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| спокій | 0 | 0 | 0 | 0 | 50 | 3 | 0 | 0 |
| сум | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| сумніватись | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| сумувати | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| тв | 90 | 80 | 97 | 95 | 80 | 90 | 95 | 80 |
| туман на річці | 95 | 95 | 95 | 95 | 0 | 95 | 95 | 95 |
| туман у місті | 100 | 95 | 95 | 95 | 0 | 100 | 50 | 40 |
| туман | 0 | 95 | 95 | 95 | 0 | 95 | 40 | 40 |
| Хонда Сівік | 100 | 90 | 90 | 100 | 0 | 100 | 100 | 90 |
| Хонда Цивік | 100 | 97 | 97 | 100 | 0 | 100 | 100 | 90 |
| Хонда Цівік | 100 | 97 | 97 | 100 | 0 | 100 | 100 | 90 |
| ціна на бензин | 80 | 40 | 92 | 50 | 80 | 65 | 100 | 90 |
| яблуко | 0 | 0 | 8 | 5 | 60 | 0 | 80 | 85 |
| яблуня | 70 | 70 | 70 | 80 | 85 | 90 | 20 | 0 |
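
The summary rows in the table above appear to be simple averages of the per-query marks: the English score over the 18 English queries, the Ukrainian score over the 31 Ukrainian queries, and the total as the mean of those two language scores with the fractional part dropped. This is inferred from the numbers rather than stated anywhere, so treat it as an assumption. A small sketch that reproduces the first column (XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k) this way:

```python
# Marks for the first model column, copied from the table rows in order.
english_marks = [0, 98, 90, 95, 95, 95, 95, 93, 95, 95,
                 100, 100, 65, 98, 35, 100, 85, 100]
ukrainian_marks = [50, 0, 0, 50, 85, 95, 100, 0, 0, 0,
                   100, 100, 95, 0, 5, 5, 0, 0, 0, 0,
                   0, 90, 95, 100, 0, 100, 100, 100, 80, 0, 70]

def mean(marks):
    """Arithmetic mean of a list of 0-100 marks."""
    return sum(marks) / len(marks)

english_score = mean(english_marks)
ukrainian_score = mean(ukrainian_marks)

# Assumption: the total weights both languages equally,
# regardless of how many queries each language has.
total_score = (english_score + ukrainian_score) / 2

print(f"English: {int(english_score)}%")      # English: 85%
print(f"Ukrainian: {int(ukrainian_score)}%")  # Ukrainian: 45%
print(f"Total: {int(total_score)}%")          # Total: 65%
```

Averaging the two language scores, instead of all 49 queries together, keeps the larger Ukrainian query set from dominating the total.
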

| Model | Search Result |
|---|---|
| XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k | |
| XLM-Roberta-Large-Vit-B-16Plus | |
| ViT-B-16-SigLIP-512__webli | |
| nllb-clip-large-siglip__mrl | |
| nllb-clip-large-siglip__v1 (half year later) | |
| ViT-SO400M-16-SigLIP2-256__webli | |
| ViT-H-14-378-quickgelu__dfn5b | |
| ViT-SO400M-14-SigLIP2-378__webli | |
| ViT-SO400M-16-SigLIP2-384__webli | |
| ViT-B-16-SigLIP-i18n-256__webli | |

| Model | Search Result |
|---|---|
| XLM-Roberta-Large-Vit-B-16Plus | |
| ViT-B-16-SigLIP-512__webli | |
| nllb-clip-large-siglip__mrl | |
| nllb-clip-large-siglip__v1 (half year later) | |
| XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k | |
| ViT-H-14-378-quickgelu__dfn5b | |
| ViT-SO400M-16-SigLIP2-256__webli | |
| ViT-SO400M-14-SigLIP2-378__webli | |
| ViT-SO400M-16-SigLIP2-384__webli | |
| ViT-B-16-SigLIP-i18n-256__webli | |

| Model | Search Result |
|---|---|
| XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k | |
| XLM-Roberta-Large-Vit-B-16Plus | |
| ViT-B-16-SigLIP-512__webli | |
| nllb-clip-large-siglip__mrl | |
| nllb-clip-large-siglip__v1 (half year later) | |
| ViT-H-14-378-quickgelu__dfn5b | Random people photos (75% people and river) |
| ViT-SO400M-16-SigLIP2-256__webli | The first screen of results is random screenshots (you need to page down a couple of times to find the rivers); counted as a fail |
| ViT-SO400M-14-SigLIP2-378__webli | |
| ViT-SO400M-16-SigLIP2-384__webli | |
| ViT-B-16-SigLIP-i18n-256__webli | |