Intro
CLIP (Contrastive Language-Image Pretraining) predicts the most relevant text snippet for a given image.
It is the machine learning (ML) model that Immich uses for text search over media files (photos and videos).
Immich supports many models; they are listed at https://huggingface.co/immich-app and can then be configured in the settings.
Different models take different amounts of time to index media. Some models are multilingual, some are not.
The default CLIP model is ViT-B-32__openai (not multilingual).
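To make the idea of text search over media more concrete, here is a minimal sketch of how a CLIP-style search ranks images: the query text and every image are mapped to vectors in a shared space, and images are sorted by cosine similarity to the query. The three-dimensional vectors and file names below are made up for illustration. A real model produces much larger embeddings, and this is not Immich's actual implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_embedding, image_embeddings):
    """Return (name, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; a real CLIP model would compute these.
image_embeddings = {
    "orchard.jpg": [0.9, 0.1, 0.0],
    "laptop.jpg": [0.1, 0.9, 0.1],
    "river.jpg": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.3, 0.0]  # stands in for the text "apple tree"

ranking = rank_images(query_embedding, image_embeddings)
for name, similarity in ranking:
    print(f"{name}: {similarity:.2f}")
```

A multilingual model differs only in the text side: it is trained so that the same phrase in different languages lands near the same images.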
Time of media storage processing
Time to process
The first test ran on a library with 320,000 photos and 17,000 videos, 5.4 TB in total.
The latest test processed a library with 331,000 photos and 22,800 videos, 6 TB in total (the same library with some new photos added).
ViT-B-16-SigLIP-512__webli
nllb-clip-large-siglip__mrl takes even more time than ViT-B-16-SigLIP-512__webli to process media
nllb-clip-large-siglip__v1
takes about 20 days
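To put "about 20 days" into perspective, the figure can be converted to a per-asset rate. The sketch below assumes the run covered the first test library (320,000 photos and 17,000 videos). The text does not say which library size applies, so treat the result as a rough estimate only. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def seconds_per_asset(days, asset_count):
    """Average processing time per asset for a run of the given length."""
    return days * SECONDS_PER_DAY / asset_count

# Assumption: the first test library (photos + videos).
assets = 320_000 + 17_000
rate = seconds_per_asset(20, assets)

print(f"{rate:.1f} s per asset, {3600 / rate:.0f} assets per hour")
```

Under that assumption the model needs roughly 5 seconds per asset on average, with photos and videos counted alike.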
XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k
on i7-1355U
ViT-SO400M-16-SigLIP2-256__webli
Executed on two devices with the same library, using OpenVINO this time. The results are the same for English text, with some differences for Ukrainian.
on Intel Core i7 1355U
on Intel N150
ViT-SO400M-14-SigLIP2-378__webli
Core i7-1355 + OpenVINO
ViT-B-16-SigLIP-i18n-256__webli
Intel N150 + OpenVINO
ViT-H-14-378-quickgelu__dfn5b
Core i7-1355 + OpenVINO
Some official data
Official data can be found at https://immich.app/docs/features/searching/#clip-models
Quality of search by some random queries
Since these models are mostly evaluated on predefined datasets, I tested search quality on my personal media folder with some semi-random queries in both English and Ukrainian.
The "quality" marks therefore include some subjective deviation: they reflect whether the results "feel" good, not just whether they are formally correct.
For example, a search for "apple" that returns the Apple Computers logo is formally 100% correct, but a result with at least 50% apple fruit feels like a more correct output.
Summary
Word/Phrase | XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k | ViT-SO400M-16-SigLIP2-256__webli | ViT-SO400M-16- | ViT-SO400M-14- | ViT-H-14-378-quickgelu__dfn5b | ViT-B-16-SigLIP-i18n-256__webli | nllb-clip-large-siglip__mrl | nllb-clip-large-siglip__v1 |
---|---|---|---|---|---|---|---|---|
Total score | 65% | 72% | 74% | 73% | 56% | 59% | 56% | 49% |
English score (/18) | 85% | 94% | 98% | 91% | 95% | 71% | 71% | 63% |
anxiety | 0 | 100 | 100 | 95 | 100 | 100 | 0 | 0 |
apple | 98 | 95 | 95 | 95 | 95 | 20 | 80 | 75 |
apple tree | 90 | 95 | 95 | 95 | 95 | 80 | 80 | 80 |
arguing people | 95 | 95 | 95 | 75 | 100 | 95 | 90 | 90 |
car | 95 | 80 | 95 | 75 | 75 | 60 | 80 | 80 |
car on a parking lot | 95 | 90 | 95 | 85 | 100 | 70 | 100 | 100 |
city fog | 95 | 95 | 100 | 100 | 100 | 75 | 100 | 80 |
exchange rate | 93 | 95 | 95 | 90 | 80 | 70 | 80 | 70 |
feel sad | 95 | 95 | 90 | 80 | 95 | 5 | 0 | 4 |
fog | 95 | 70 | 100 | 70 | 95 | 70 | 70 | 70 |
green fence | 100 | 100 | 100 | 100 | 98 | 100 | 50 | 50 |
honda civic | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 |
peach | 65 | 100 | 100 | 100 | 95 | 95 | 55 | 50 |
person feel anxiety | 98 | 97 | 98 | 96 | 98 | 95 | 95 | 65 |
Range Rover | 35 | 95 | 100 | 90 | 90 | 35 | 50 | 60 |
river fog | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 90 |
serenity | 85 | 100 | 100 | 100 | 95 | 65 | 75 | 0 |
TV | 100 | 100 | 100 | 100 | 100 | 100 | 95 | 85 |
Ukrainian score (/31) | 45% | 51% | 51% | 56% | 18% | 48% | 40% | 35% |
відчувати радість | 50 | 0 | 0 | 0 | 0 | 0 | 25 | 28 |
відчувати сум | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
заправка wog | 0 | 100 | 95 | 100 | 0 | 95 | 0 | 0 |
заправка кло | 50 | 65 | 95 | 90 | 0 | 50 | 0 | 0 |
зелений забор | 85 | 85 | 95 | 90 | 5 | 85 | 95 | 80 |
курс обміну валют | 95 | 65 | 85 | 65 | 0 | 65 | 65 | 65 |
люди в лісі | 100 | 100 | 100 | 100 | 85 | 100 | 5 | 5 |
люди які сперечаються | 0 | 65 | 5 | 85 | 0 | 0 | 90 | 90 |
людина що відчувають розчарування | 0 | 0 | 0 | 0 | 0 | 30 | 80 | 5 |
мати сумніви | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
машина на стоянці | 100 | 100 | 95 | 100 | 60 | 45 | 3 | 3 |
машина | 100 | 65 | 95 | 100 | 20 | 100 | 3 | 3 |
персик | 95 | 100 | 5 | 95 | 70 | 0 | 55 | 30 |
почуття суму | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
радість | 5 | 0 | 0 | 0 | 0 | 70 | 0 | 0 |
Рейндж Ровер | 5 | 90 | 70 | 90 | 0 | 10 | 50 | 80 |
Розчарування | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
спокій | 0 | 0 | 0 | 0 | 50 | 3 | 0 | 0 |
сум | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
сумніватись | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
сумувати | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
тв | 90 | 80 | 97 | 95 | 80 | 90 | 95 | 80 |
туман на річці | 95 | 95 | 95 | 95 | 0 | 95 | 95 | 95 |
туман у місті | 100 | 95 | 95 | 95 | 0 | 100 | 50 | 40 |
туман | 0 | 95 | 95 | 95 | 0 | 95 | 40 | 40 |
Хонда Сівік | 100 | 90 | 90 | 100 | 0 | 100 | 100 | 90 |
Хонда Цивік | 100 | 97 | 97 | 100 | 0 | 100 | 100 | 90 |
Хонда Цівік | 100 | 97 | 97 | 100 | 0 | 100 | 100 | 90 |
ціна на бензин | 80 | 40 | 92 | 50 | 80 | 65 | 100 | 90 |
яблуко | 0 | 0 | 8 | 5 | 60 | 0 | 80 | 85 |
яблуня | 70 | 70 | 70 | 80 | 85 | 90 | 20 | 0 |
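The table does not state how the summary rows are derived. For the XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k column, the published values are reproduced by taking the arithmetic mean of the per-query marks, truncating it to a whole percent, and averaging the two language means for the total. The sketch below works under that assumption; the function name is hypothetical, and the marks are copied from that column.

```python
def mean_mark(marks):
    """Arithmetic mean of per-query marks on a 0-100 scale."""
    return sum(marks) / len(marks)

# Marks for XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k, in table order.
english_marks = [0, 98, 90, 95, 95, 95, 95, 93, 95,
                 95, 100, 100, 65, 98, 35, 100, 85, 100]
ukrainian_marks = [50, 0, 0, 50, 85, 95, 100, 0, 0, 0, 100,
                   100, 95, 0, 5, 5, 0, 0, 0, 0, 0, 90,
                   95, 100, 0, 100, 100, 100, 80, 0, 70]

english_mean = mean_mark(english_marks)
ukrainian_mean = mean_mark(ukrainian_marks)

# Assumption: percentages are truncated, not rounded.
english_score = int(english_mean)
ukrainian_score = int(ukrainian_mean)
total_score = int((english_mean + ukrainian_mean) / 2)

print(f"English {english_score}%, Ukrainian {ukrainian_score}%, total {total_score}%")
```

Because the total averages the two language scores rather than all 49 queries, English and Ukrainian carry equal weight even though the Ukrainian set has more queries.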
Examples of found images
Car (English)
Model | Search Result |
---|---|
XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k | |
XLM-Roberta-Large-Vit-B-16Plus | |
ViT-B-16-SigLIP-512__webli | |
nllb-clip-large-siglip__mrl | |
nllb-clip-large-siglip__v1 (half a year later) | |
ViT-SO400M-16-SigLIP2-256__webli | |
ViT-H-14-378-quickgelu__dfn5b | |
ViT-SO400M-14-SigLIP2-378__webli | |
ViT-SO400M-16-SigLIP2-384__webli | |
ViT-B-16-SigLIP-i18n-256__webli | |
Green Fence (English)
Model | Search Result |
---|---|
XLM-Roberta-Large-Vit-B-16Plus | |
ViT-B-16-SigLIP-512__webli | |
nllb-clip-large-siglip__mrl | |
nllb-clip-large-siglip__v1 (half a year later) | |
XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k | |
ViT-H-14-378-quickgelu__dfn5b | |
ViT-SO400M-16-SigLIP2-256__webli | |
ViT-SO400M-14-SigLIP2-378__webli | |
ViT-SO400M-16-SigLIP2-384__webli | |
ViT-B-16-SigLIP-i18n-256__webli | |
туман на річці ("River fog", Ukrainian)
Model | Search Result |
---|---|
XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k | |
XLM-Roberta-Large-Vit-B-16Plus | |
ViT-B-16-SigLIP-512__webli | |
nllb-clip-large-siglip__mrl | |
nllb-clip-large-siglip__v1 (half a year later) | |
ViT-H-14-378-quickgelu__dfn5b | Random photos of people (75% people and river) |
ViT-SO400M-16-SigLIP2-256__webli | The first screen of results is random screenshots (it takes a couple of page-downs to reach the rivers); I consider that a fail |
ViT-SO400M-14-SigLIP2-378__webli | |
ViT-SO400M-16-SigLIP2-384__webli | |
ViT-B-16-SigLIP-i18n-256__webli | |