Intro

CLIP (Contrastive Language-Image Pretraining) is a model trained to predict the most relevant text snippet for a given image.

In Immich, this machine learning (ML) model powers text search over media files (photos and videos).
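As a rough illustration of how a CLIP-style text search works, the sketch below ranks a few assets by cosine similarity between a query embedding and image embeddings. The three-dimensional vectors, the file names, and the `rank_assets` helper are made up for the example; real models produce embeddings with hundreds of dimensions, and this is not Immich's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_assets(query_vec, asset_vecs):
    """Return (name, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in asset_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical image embeddings (toy values, not produced by any model).
asset_vecs = {
    "river_fog.jpg": [0.9, 0.1, 0.0],
    "honda_civic.jpg": [0.0, 0.2, 0.9],
    "green_fence.jpg": [0.1, 0.9, 0.1],
}

# Hypothetical text embedding for a query such as "river fog".
query_vec = [0.8, 0.2, 0.1]

ranking = rank_assets(query_vec, asset_vecs)
for name, similarity in ranking:
    print(f"{name}: {similarity:.3f}")
```

The model choice matters because it determines both embeddings: a model that maps a Ukrainian query poorly into the shared space will rank unrelated images first even if the image embeddings are good.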

Immich supports many models; they are listed at https://huggingface.co/immich-app and can be selected in the settings.

Different models take different amounts of time to index media. Some models are multilingual, some are not.

The default CLIP model is ViT-B-32__openai (not multilingual).

Time of media storage processing

Time to process

The first test ran on a library of 320,000 photos and 17,000 videos, 5.4 TB in total.

The latest test processed a library of 331,000 photos and 22,800 videos, 6 TB in total (the same library with some new photos added).

ViT-B-16-SigLIP-512__webli

nllb-clip-large-siglip__mrl takes even more time than ViT-B-16-SigLIP-512__webli to process media.

nllb-clip-large-siglip__v1

Processing takes about 20 days.
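To put the 20-day figure in perspective, it can be converted into an average time per asset. The notes do not say which library size this run used, so the sketch below computes the figure for both libraries mentioned above; the `seconds_per_asset` helper is hypothetical, and counting each photo or video as one asset is an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_asset(asset_count, days):
    """Average wall-clock seconds spent per asset over the whole run."""
    return days * SECONDS_PER_DAY / asset_count


# Library sizes taken from the two test descriptions (photos + videos).
libraries = {
    "first test": 320_000 + 17_000,
    "latest test": 331_000 + 22_800,
}

estimates = {
    label: seconds_per_asset(count, days=20)
    for label, count in libraries.items()
}

for label, seconds in estimates.items():
    print(f"{label}: about {seconds:.1f} s per asset")
```

Either way the result is roughly five seconds per asset on average.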

XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k

On Intel Core i7-1355U


ViT-SO400M-16-SigLIP2-256__webli

Executed on two devices with the same library, this time using OpenVINO. Search results were the same for English queries, with some differences for Ukrainian ones.

On Intel Core i7-1355U

On Intel N150

ViT-SO400M-14-SigLIP2-378__webli

Intel Core i7-1355U + OpenVINO

ViT-B-16-SigLIP-i18n-256__webli

Intel N150 + OpenVINO


ViT-H-14-378-quickgelu__dfn5b

Intel Core i7-1355U + OpenVINO


Some official data


Official figures can be found at https://immich.app/docs/features/searching/#clip-models

Quality of search by some random queries

Since these models are mostly evaluated on predefined datasets, I tested search quality on my personal media folder with semi-random queries in both English and Ukrainian.

Part of each "quality" mark reflects whether the results "feel" good, not just whether they are formally correct.
For example, a search for "apple" that returns the Apple Computers logo is formally 100% correct, but output containing at least 50% apple fruit feels more correct.
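The notes do not spell out how each mark was derived. One hypothetical way to turn a manual review into a 0-100 mark is to judge each result on the first screen and take the share that feels relevant; the `relevance_score` helper and the sample judgements below are illustrative assumptions, not the procedure actually used.

```python
def relevance_score(judgements):
    """Percentage of inspected results judged relevant (True)."""
    if not judgements:
        return 0
    return round(100 * sum(judgements) / len(judgements))


# Hypothetical review of 20 results for the query "apple":
# 19 feel right (fruit), 1 does not (an unrelated screenshot).
first_screen = [True] * 19 + [False]

score = relevance_score(first_screen)
print(score)
```

Because the judgement is made by a person, the same result set can score differently depending on what the reviewer expected the query to return.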

Summary

Word/Phrase | XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k | ViT-SO400M-16-SigLIP2-256__webli | ViT-SO400M-16-SigLIP2-384__webli | ViT-SO400M-14-SigLIP2-378__webli | ViT-H-14-378-quickgelu__dfn5b | ViT-B-16-SigLIP-i18n-256__webli | nllb-clip-large-siglip__mrl | nllb-clip-large-siglip__v1
Total score | 65% | 72% | 74% | 73% | 56% | 59% | 56% | 49%
English score (/18) | 85% | 94% | 98% | 91% | 95% | 71% | 71% | 63%
anxiety01001009510010000
apple9895959595208075
apple tree9095959595808080
arguing people95959575100959090
car9580957575608080
car on a parking lot9590958510070100100
city fog95951001001007510080
exchange rate9395959080708070
feel sad9595908095504
fog95701007095707070
green fence100100100100981005050
honda civic10010010010010010010085
peach6510010010095955550
person feel anxiety9897989698959565
Range Rover35951009090355060
river fog1001001001001001008590
serenity851001001009565750
TV1001001001001001009585
Ukrainian score (/31) | 45% | 51% | 51% | 56% | 18% | 48% | 40% | 35%
відчувати радість50000002528
відчувати сум00000000
заправка wog01009510009500
заправка кло5065959005000
зелений забор858595905859580
курс обміну валют956585650656565
люди в лісі1001001001008510055
люди які сперечаються065585009090
людина що відчувають розчарування0000030805
мати сумніви00030000
машина на стоянці10010095100604533
машина10065951002010033
персик951005957005530
почуття суму00000000
радість500007000
Рейндж Ровер59070900105080
Розчарування00000000
спокій000050300
сум00000000
сумніватись00000000
сумувати00000000
тв9080979580909580
туман на річці959595950959595
туман у місті10095959501005040
туман09595950954040
Хонда Сівік1009090100010010090
Хонда Цивік1009797100010010090
Хонда Цівік1009797100010010090
ціна на бензин80409250806510090
яблуко00856008085
яблуня707070808590200
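The "Total score" row appears consistent with an unweighted mean of the English and Ukrainian scores, rather than a mean weighted by the number of queries (18 and 31). The check below uses the percentages from the table; the half-point gaps are presumably due to the listed percentages being rounded, which is an assumption.

```python
# Percentages copied from the summary table, in column order.
models = [
    "XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k",
    "ViT-SO400M-16-SigLIP2-256__webli",
    "ViT-SO400M-16-SigLIP2-384__webli",
    "ViT-SO400M-14-SigLIP2-378__webli",
    "ViT-H-14-378-quickgelu__dfn5b",
    "ViT-B-16-SigLIP-i18n-256__webli",
    "nllb-clip-large-siglip__mrl",
    "nllb-clip-large-siglip__v1",
]
english = [85, 94, 98, 91, 95, 71, 71, 63]
ukrainian = [45, 51, 51, 56, 18, 48, 40, 35]
listed_total = [65, 72, 74, 73, 56, 59, 56, 49]


def unweighted_mean(en, uk):
    """Both languages count equally, regardless of query count."""
    return (en + uk) / 2


def weighted_mean(en, uk, n_en=18, n_uk=31):
    """Alternative: weight each language by its number of queries."""
    return (en * n_en + uk * n_uk) / (n_en + n_uk)


for name, en, uk, total in zip(models, english, ukrainian, listed_total):
    print(f"{name}: listed {total}, unweighted {unweighted_mean(en, uk):.1f}, "
          f"weighted {weighted_mean(en, uk):.1f}")

# Largest gap between the listed total and the unweighted mean.
max_gap = max(
    abs(unweighted_mean(en, uk) - total)
    for en, uk, total in zip(english, ukrainian, listed_total)
)
```

Under this reading, a model that does very well in English but poorly in Ukrainian (such as ViT-H-14-378-quickgelu__dfn5b) is penalized heavily in the total, which matches the ranking in the table.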

Examples of images found

Car (English)

Model | Search Result
XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k

XLM-Roberta-Large-Vit-B-16Plus

ViT-B-16-SigLIP-512__webli

nllb-clip-large-siglip__mrl

nllb-clip-large-siglip__v1

(half a year later)

ViT-SO400M-16-SigLIP2-256__webli

ViT-H-14-378-quickgelu__dfn5b

ViT-SO400M-14-SigLIP2-378__webli

ViT-SO400M-16-SigLIP2-384__webli

ViT-B-16-SigLIP-i18n-256__webli

Green Fence (English)

Model | Search Result
XLM-Roberta-Large-Vit-B-16Plus

ViT-B-16-SigLIP-512__webli

nllb-clip-large-siglip__mrl

nllb-clip-large-siglip__v1

(half a year later)

XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k

ViT-H-14-378-quickgelu__dfn5b

ViT-SO400M-16-SigLIP2-256__webli

ViT-SO400M-14-SigLIP2-378__webli

ViT-SO400M-16-SigLIP2-384__webli

ViT-B-16-SigLIP-i18n-256__webli


туман на річці ("river fog" in Ukrainian)

Model | Search Result
XLM-Roberta-Base-ViT-B-32__laion5b_s13b_b90k

XLM-Roberta-Large-Vit-B-16Plus

ViT-B-16-SigLIP-512__webli

nllb-clip-large-siglip__mrl

nllb-clip-large-siglip__v1

(half a year later)

ViT-H-14-378-quickgelu__dfn5b

Random photos of people (75% people and river).

ViT-SO400M-16-SigLIP2-256__webli

The first screen of results is random screenshots (it takes a couple of page-downs to reach the rivers); I count that as a fail.


ViT-SO400M-14-SigLIP2-378__webli

ViT-SO400M-16-SigLIP2-384__webli

ViT-B-16-SigLIP-i18n-256__webli

