Based on Oleksandr Liutyi's 52-prompt, 12-category benchmark and his visual star ratings.
Rating criteria and the updated ranking below were produced by Claude Sonnet 4.6.
(Includes only models ranked as of 2026-03-20.)
The original test produces a useful per-category star grid. What it does not produce is a single answer to the practitioner's real question: which model should I run, and for what job? Category 11 (Abstract Art) and Category 3 (Interior) are so easy that even the weakest tested model scores ★★★ there — they add noise rather than signal when ranking. Category 5 (Text Rendering), by contrast, is a cliff edge that separates usable from broken for a large class of real workflows.
The criteria below are designed from first principles by asking: what cognitive and computational tasks do these 52 prompts actually demand? Six criteria emerge, grouped into three tiers of difficulty. Each is scored 0–10 and combined with evidence-based weights into a 0–100 composite. No criterion directly mirrors a wiki star column — each is derived from analysis of what the prompts require, mapped back to the category evidence.
What it is: the model's ability to follow long, multi-element, spatially precise prompts without dropping constraints. The most demanding prompts in the test ask for: a woman in a taxi window with specific dress, hair, lighting, reflection and film grain simultaneously (Cat 1B); a brutalist cliff pavilion with fog, path, silhouettes, and named architectural textures (Cat 2D); a 3-panel DC comic strip with named characters, specific gestures, and dialogue bubbles (Cat 6D); a Marvel Avengers 3×3 grid with correct characters at consistent scale (Cat 7D); and a solar system diagram with all 8 planets labeled in order (Cat 9C). These prompts are long on purpose — they test whether the model reads the whole thing or collapses toward the modal image it was trained on.
How scored: the four hardest categories for prompt fidelity (Architecture, Comics, Celebrities, Illustrations) are weighted at 60%, averaged with the remaining eight at 40%. This surfaces models that genuinely follow instructions versus those that produce plausible but wrong compositions.
Weight rationale: 20/100. Instruction fidelity is the foundation of production use. A model that cannot place the right character in the right pose with the right props is not controllable, regardless of how beautiful its outputs look.
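As a sketch, the 60/40 blend is a one-line weighted average (the function and variable names here are illustrative, not from the original test):

```python
def instruction_fidelity(hard_avg: float, rest_avg: float) -> float:
    """C1: mean of the four hardest categories (Architecture, Comics,
    Celebrities, Illustrations) at 60%, mean of the other eight at 40%."""
    return 0.6 * hard_avg + 0.4 * rest_avg

# e.g. hard categories averaging 7.5 and the remaining eight averaging 8.75:
print(round(instruction_fidelity(7.5, 8.75), 2))  # 8.0
```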
What it is: the ability to render legible, accurate embedded text inside generated images. The test has dedicated prompts: a chalk A-frame sign with the model's own name (Cat 5A), graffiti on a mirrored skyscraper with building number (Cat 5B), a full book page from Alice in Wonderland Chapter VII with ~300 words of running text (Cat 5C), and a 3-panel typographic meme with precise line-break constraints (Cat 5D). These prompts span the full difficulty spectrum from short decorative labels to dense paragraph text.
How scored: Category 5 star rating maps directly: ★★★★→10, ★★★→7.5, ★★→5, ★→2.5, ✗→0.
Weight rationale: 20/100, equal to Instruction Fidelity. Text rendering was not testable with pre-SD3 models at all. Now that it is reachable, it functions as a binary capability gate for applications like product mockup, signage, meme creation, UI wireframing, educational diagrams, and book cover design. A model that fails here is simply not deployable for those workflows.
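The star-to-score conversion above recurs throughout the criteria, so as a sketch it can be captured in a single lookup (names are illustrative):

```python
# Wiki star rating -> 0-10 criterion score, per the mapping stated above.
STAR_TO_SCORE = {
    "★★★★": 10.0,
    "★★★": 7.5,
    "★★": 5.0,
    "★": 2.5,
    "✗": 0.0,
}

def stars_to_score(stars: str) -> float:
    """Convert a wiki star rating to the 0-10 scale used by every criterion."""
    return STAR_TO_SCORE[stars]

print(stars_to_score("★★★"))  # 7.5
```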
What it is: whether the model produces structurally coherent multi-panel or multi-element compositions. The test stresses this via: the Spider-Man pointing meme (3 identical characters in correct spatial relationship, Cat 6C); the DC Comics 3-panel strip with consistent character appearance across panels and readable speech bubbles (Cat 6D); the annotated solar system diagram with planets in correct left-to-right order, labeled above each planet (Cat 9C); and the apartment floor plan with specific room count and connections (Cat 9D). The challenge is geometric coherence — not just aesthetic quality.
How scored: average of Category 6 (Comics), Category 9 (Illustrations), and Category 8 (Anime) star-to-10 conversions. Comics and Illustrations are the primary structural tests; Anime is included because the Naruto vs. Sasuke battle prompts require correct multi-character spatial staging.
Weight rationale: 15/100. Structural accuracy matters most in technical and editorial use. Because even strong models struggle with floor plans, scores here stay well below the ceiling, which makes the criterion a good differentiator at the top of the ranking.
What it is: quality of simulated photography — skin subsurface scattering, shallow depth of field, film grain, material texture (marble, ice, clay, knitting), and natural lighting. Evaluated through: the woman-in-taxi-window prompts (Cat 1), the underwater whale / foggy mountain / tropical waterfall (Cat 4), and the material close-ups including clay figures, phoenix knitting, marble cat, and ice sculpture (Cat 10).
How scored: average of Category 1, Category 4, and Category 10 star-to-10 conversions.
Weight rationale: 15/100. Photorealism is table stakes for most consumer-facing workflows — social media, e-commerce, editorial. However, it is weighted below Fidelity and Text because most tested models reach ★★★ here; it discriminates less sharply than the harder criteria.
What it is: the breadth and accuracy of the model's style vocabulary. The Art Styles category (Cat 12) is the purest test: eight variants of the same scene (woman on cliff at sunset) rendered in Pop Art, Shin-hanga, Futurism, ASCII Art, Cubism, Expressionism, Hokusai, and Salvador Dali. Authentic style mastery requires distinct visual grammars — not just approximate aesthetics. Supporting evidence comes from Category 8 (Anime — does it look like genuine hand-drawn lineart?) and Category 11 (Abstract Art — can it produce intentional non-representational composition?). Category 3 (Interior design) is included as a proxy for spatial aesthetic control.
How scored: Categories 11 and 12 weighted double (they are the purest style tests), averaged with Categories 3 and 8.
Weight rationale: 20/100. Style range defines the creative utility ceiling of a model. A model stuck in one visual idiom is limited; a model that genuinely switches gears between Hokusai woodblock and 1960s cel animation and Marvel comics is a creative platform.
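Spelled out, that double weighting is a six-way average in which Categories 11 and 12 are counted twice (a sketch; the per-category inputs are the star-to-10 conversions):

```python
def stylistic_range(cat3: float, cat8: float, cat11: float, cat12: float) -> float:
    """C5: Categories 11 and 12 counted twice, averaged with Categories 3 and 8."""
    return (2 * cat11 + 2 * cat12 + cat3 + cat8) / 6

# e.g. ★★★★ (10) in Cats 8, 11, 12 and ★★★ (7.5) in Cat 3:
print(round(stylistic_range(7.5, 10, 10, 10), 1))  # 9.6
```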
What it is: the quality-per-compute-minute ratio. A model that delivers 80% of a top model's quality in 5% of the time is extremely valuable in practice — for iteration, prototyping, batch generation, and deployment on consumer hardware. Computed as: (total quality score) / log(generation_time_minutes + 1.5), normalised so that the FLUX.2 Klein 9B (the fastest high-quality model) approaches the top of the scale.
Note: all generation times were measured on the same Intel Core Ultra 9 285H mini-PC used for the test. They reflect iGPU inference, not discrete-GPU inference. Relative comparisons are valid; absolute times on GPU hardware will be roughly 5–10× shorter.
Weight rationale: 10/100. Efficiency matters — but quality ceiling matters more. A slow model that produces better results may be the right choice for final renders; a fast model is indispensable for iteration. The lower weight reflects that speed is a deployment concern, not a capability concern.
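A sketch of the efficiency computation follows. The exact normalisation constant is not published, so `normalise` below embodies an assumption: linearly rescale so the best raw ratio maps to 10.

```python
import math

def efficiency_raw(quality: float, minutes: float) -> float:
    """Raw C6 ratio per the stated formula: quality / log(minutes + 1.5)."""
    return quality / math.log(minutes + 1.5)

def normalise(raw: dict[str, float]) -> dict[str, float]:
    """ASSUMPTION: rescale so the best raw ratio maps to 10.0."""
    top = max(raw.values())
    return {model: round(10 * score / top, 1) for model, score in raw.items()}

# A 17-second model beats a 24-minute model on C6 even at lower quality:
scores = normalise({
    "fast": efficiency_raw(60.0, 17 / 60),  # quality sum 60 at 17 s
    "slow": efficiency_raw(75.0, 24.0),     # quality sum 75 at 24 min
})
```

The log compression means doubling generation time costs far less than halving quality, which is why slow flagships still rank above fast weak models overall.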
Each criterion is scored 0–10. The weighted composite score is out of 100.
| ID | Criterion | Weight | Derived from |
|----|-----------|--------|--------------|
| C1 | Instruction Fidelity | ×20 | Architecture (Cat 2) + Comics (Cat 6) + Celebrities (Cat 7) + Illustrations (Cat 9) weighted 60%; remaining 8 categories 40% |
| C2 | Text & Typography | ×20 | Category 5 directly |
| C3 | Structural & Layout | ×15 | Average of Comics (Cat 6), Illustrations (Cat 9), Anime (Cat 8) |
| C4 | Photographic Realism | ×15 | Average of Photorealistic (Cat 1), Nature (Cat 4), Materials (Cat 10) |
| C5 | Stylistic Range | ×20 | (2 × Cat 11 + 2 × Cat 12 + Cat 3 + Cat 8) / 6 |
| C6 | Compute Efficiency | ×10 | Quality score / log(generation time), normalised |
Scores 0–10 per criterion. Composite out of 100 (weighted). Higher is better.
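As a sanity check, the composite arithmetic can be reproduced directly from the stated weights; plugging in Qwen Image 2512's criterion scores from the table recovers its 77.7 total:

```python
# Criterion weights (sum to 100), as defined in the criteria section above.
WEIGHTS = {"C1": 20, "C2": 20, "C3": 15, "C4": 15, "C5": 20, "C6": 10}

def composite(scores: dict[str, float]) -> float:
    """Weighted sum of 0-10 criterion scores, scaled to a 0-100 composite."""
    return round(sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) / 10, 1)

qwen_2512 = {"C1": 8.3, "C2": 7.5, "C3": 8.3, "C4": 8.3, "C5": 9.2, "C6": 2.8}
print(composite(qwen_2512))  # 77.7
```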
| Model | Size | Speed | C1 Fidelity ×20 | C2 Text ×20 | C3 Structure ×15 | C4 Photo ×15 | C5 Style ×20 | C6 Speed ×10 | Total /100 |
|---|---|---|---|---|---|---|---|---|---|
| Qwen Image 2512 | 20B | 13m | 8.3 | 7.5 | 8.3 | 8.3 | 9.2 | 2.8 | 77.7 |
| FLUX.2 Dev | 30B | 24m | 7.5 | 7.5 | 8.3 | 7.5 | 9.6 | 2.2 | 75.1 |
| Z Image Turbo | 6B | 1m | 7.5 | 7.5 | 7.5 | 7.5 | 7.5 | 6.8 | 74.3 |
| FLUX.2 Klein 9B | 9B | 34s | 6.5 | 7.5 | 5.8 | 6.7 | 9.2 | 8.7 | 73.9 |
| Qwen Image | 20B | 13m | 7.4 | 7.5 | 6.7 | 7.5 | 9.2 | 2.6 | 72.1 |
| LongCat Image | 6B | 28m | 7.0 | 7.5 | 6.7 | 7.5 | 9.2 | 2.0 | 70.7 |
| FLUX.2 Klein 4B | 4B | 17s | 5.9 | 5.0 | 5.8 | 5.8 | 9.2 | 10.0 | 67.6 |
| Hunyuan Image 2.1 | 17B | 29m | 6.5 | 7.5 | 5.8 | 7.5 | 8.3 | 1.9 | 66.5 |
| HiDream I1 Full | 17B | 14m | 7.0 | 5.0 | 6.7 | 6.7 | 8.3 | 2.3 | 63.0 |
| Cosmos Predict2 14B | 14B | 13m | 6.3 | 5.0 | 5.8 | 6.7 | 8.3 | 2.2 | 60.2 |
| Cosmos Predict2 2B | 2B | 3m | 6.0 | 5.0 | 5.8 | 5.8 | 7.5 | 3.7 | 58.1 |
| Kandinsky 5.0 | 6B | 11m | 5.6 | 5.0 | 5.0 | 6.7 | 7.1 | 2.1 | 55.1 |
| FLUX.2 Klein base 9B | 9B | 13m | 5.1 | 5.0 | 5.0 | 5.0 | 7.9 | 1.9 | 52.9 |
| Ovis Image 7B | 7B | 9m | 5.1 | 5.0 | 5.0 | 5.0 | 7.5 | 2.1 | 52.3 |
| SD 3.5 Large | 8B | 3m | 5.0 | 2.5 | 4.2 | 5.8 | 9.2 | 3.5 | 51.9 |
| FLUX.2 Klein base 4B | 4B | 6m | 4.1 | 5.0 | 3.3 | 4.2 | 5.0 | 1.9 | 41.4 |
| Stable Cascade | 1.5B | 1m | 3.6 | 0.0 | 3.3 | 3.3 | 6.7 | 3.7 | 34.2 |
Colour: green ≥85% of max · blue ≥70% · amber ≥55% · red <55%
| Rank | Model | Score | Tier | Defining strength / weakness |
|---|---|---|---|---|
| 🥇 1 | Qwen Image 2512 | 77.7 | S — Flagship | Best-in-class instruction fidelity and style range. Text rendering elevates it above peers. |
| 🥈 2 | FLUX.2 Dev | 75.1 | S — Flagship | Anime and art-style ceiling is uniquely high. Efficiency penalty from 24 min/image. |
| 🥉 3 | Z Image Turbo | 74.3 | A — High Performance | Excellent speed-quality balance. Quality plateau at 7.5 per criterion — reliable but not exceptional. |
| 4 | FLUX.2 Klein 9B | 73.9 | A — High Performance | Top efficiency tier. Strong style range at 34 sec/image. Text and structure mid-range. |
| 5 | Qwen Image | 72.1 | A — High Performance | Consistent across all criteria; marginally weaker fidelity and structural scores vs 2512. |
| 6 | LongCat Image | 70.7 | A — High Performance | Photorealism standout for 6B. Structural gaps (Illustrations ★★) hold it back from the top tier. |
| 7 | FLUX.2 Klein 4B | 67.6 | A — High Performance | Ultra-fast. Style range is a genuine strength. Text and structure too weak for serious work. |
| 8 | Hunyuan Image 2.1 | 66.5 | A — High Performance | Photorealism strength. Low comics/celebrity fidelity drags down its instruction score. |
| 9 | HiDream I1 Full | 63.0 | B — Solid | Solid style and structural scores. Photorealism weakness (★★) limits versatility. |
| 10 | Cosmos Predict2 14B | 60.2 | B — Solid | Reasonable fidelity. Text rendering gap (★★) is its main limiter. |
| 11 | Cosmos Predict2 2B | 58.1 | B — Solid | Punches above its weight for a 2B model. No standout capability. |
| 12 | Kandinsky 5.0 | 55.1 | B — Solid | Photorealism competitive; weak text and structural scores narrow its use cases. |
| 13 | FLUX.2 Klein base 9B | 52.9 | B — Solid | Base-checkpoint penalty is severe — the fine-tuned Klein 9B scores substantially higher. |
| 14 | Ovis Image 7B | 52.3 | B — Solid | No category above ★★★. Uniform mediocrity without a standout use case. |
| 15 | SD 3.5 Large | 51.9 | C — Limited | Text rendering failure (★) and celebrity weakness are critical gaps despite 8B scale. |
| 16 | FLUX.2 Klein base 4B | 41.4 | C — Limited | Demonstrates the base-checkpoint floor. Fine-tuning is not optional for this architecture. |
| 17 | Stable Cascade | 34.2 | C — Limited | Legacy architecture. Zero text rendering. Included as historical context only. |
Qwen Image 2512 earns the top position by being the only model in the test that scores above 7.5 on both Instruction Fidelity and Text & Typography simultaneously. Its ★★★★ in Illustrations (Cat 9) is the hardest category score to achieve — the solar system diagram and apartment floor plan prompts defeat nearly everything else. The ★★★★ in Photorealism, Abstract Art, and Art Styles confirm it is not sacrificing breadth for depth. The 13-minute generation time on the test hardware is a real cost, but the quality ceiling justifies it for final renders.
Best for: any workflow requiring reliable text in image, complex multi-element scenes, or cross-style generation. Not ideal for: rapid iteration where Z Image Turbo or FLUX.2 Klein 9B are faster alternatives.
FLUX.2 Dev holds the highest anime score (★★★★ Cat 8) and ties for top art styles (★★★★ Cat 12). Its Stylistic Range criterion score is among the highest in the field. The penalty is clear: 30B parameters at 24 minutes per image yields one of the worst efficiency scores in the cohort, and its ★★ Architecture (Cat 2) score shows instruction fidelity is not its strength. This is a model for a specific niche — expressive creative generation where style authenticity matters above all and time is not constrained.
Best for: anime illustration, stylistic artwork, creative exploration. Not ideal for: text-heavy prompts, technical diagrams, rapid batch generation.
The original Qwen Image trades a few points in Illustrations and instruction fidelity against the 2512 version, but has an identical hardware profile and near-identical behaviour across the remaining criteria. It is the most consistent all-rounder: no criterion below 5, no dramatic peaks or valleys. If Qwen Image 2512 is unavailable, this is the drop-in replacement.
Z Image Turbo is the efficiency argument for the field. At 6B parameters and roughly 70 seconds per image it achieves a perfectly flat 7.5 across every category-derived criterion — there is no weak spot and no exceptional peak. Its compute efficiency score is among the highest in the cohort. For workflows requiring volume, iteration, or real-time feedback on composition, this model represents an extraordinary quality floor.
FLUX.2 Klein 9B earns its position through the highest Compute Efficiency score in the cohort (34 seconds per image) while still reaching ★★★★ in Art Styles and ★★★ in Text Rendering. It is the fastest model in the test that can produce readable embedded text. The tradeoff is Instruction Fidelity — its Architecture (★★) and Illustrations (★★) scores reveal that complex compositional prompts stretch its capabilities.
LongCat scores ★★★★ in photorealistic portraits (Cat 1) at only 6B parameters, making it the most parameter-efficient photorealism option. The cost is structural accuracy: ★★ in Illustrations drops its C3 score significantly. A strong choice specifically for portrait and cinematic photography workflows.
HiDream's strengths lie in style range and structural layout (★★★ Comics and Illustrations). Its ★★ Photorealism is a genuine weakness that keeps it out of A-Tier. The 14-minute generation time at 17B is reasonable.
Hunyuan scores well on Photorealism (★★★ Cat 1, Cat 4) and Art Styles (★★★★ Cat 12), but its instruction fidelity suffers from weak Comics (★★) and Celebrity (★★) handling. The 29-minute generation time is among the slowest, compressing its efficiency score. Best positioned as a photographic and landscape generation tool.
Cosmos Predict2 14B is a capable mid-range model: unexpectedly strong in Abstract Art (★★★★), with reasonable Comics and Celebrity handling (★★★). The critical gap is Text Rendering (★★) — it cannot compete in embedded-text workflows. The 14B scale is well used relative to smaller peers.
Cosmos Predict2 2B is the most impressive result relative to model size. At 2B parameters and 3 minutes per image, it reaches ★★★ in Interior, Nature, Comics, and Anime — categories where much larger models score similarly. Its compute efficiency score is strong. For severely constrained deployments this is the starting point.
Kandinsky 5.0's relative strength is photorealism (★★★ Cat 1). Text rendering (★★) and weak structural scores (★★ Comics, ★★ Illustrations) make it uncompetitive in the harder criteria. Style range is capped at ★★★ across the board — no ★★★★ in any relevant category.
FLUX.2 Klein 4B is the fastest high-style model: ★★★★ in both Abstract Art and Art Styles at 17 seconds per image, and the highest Compute Efficiency score in the cohort. However, Text Rendering (★★) and Instruction Fidelity gaps (★★ Architecture, ★★ Illustrations) prevent it from being a general-purpose tool. Pure stylistic batch generation is its niche.
Despite its well-known brand, Stable Diffusion 3.5 Large has a critical failure in Text Rendering (★ → 2.5/10 on C2) and Celebrity handling (★). Its ★★★★ in Abstract Art and Art Styles suggest an architecture that has learned aesthetic generation well but cannot follow precise instructions or embed text. The 8B model at 3 minutes per image is efficient — but efficiency on a broken capability is irrelevant.
No criterion score above 6.7. No category in the wiki above ★★★. No obvious specialisation. In a crowded field with stronger options at similar or smaller model sizes, Ovis Image 7B has no clear recommendation.
These two results exist primarily to quantify the instruction-tuning delta. FLUX.2 Klein 9B (fine-tuned) scores 73.9 vs its base checkpoint at 52.9 — a gap driven by Celebrity (★★★ vs ★), Illustrations (★★ vs ★), and consistent fidelity degradation. The base checkpoints are not deployable as-is for prompt-following tasks.
A pre-SD3 architecture included as a historical anchor. Zero score on Text Rendering. The comparison shows how much the field has moved: Stable Cascade was considered capable at its release, yet it now sits more than 40 composite points below the top of the tested modern cohort.
Instruction Fidelity and Text & Typography each carry 20% of the composite weight, and together they produce a hard capability gate. Every model in C-Tier has at least one of these below 5/10. Every model in S-Tier has both above 7/10. Models with strong Style Range (C5) but weak C1 and C2 — like FLUX.2 Klein 4B and SD 3.5 Large — look impressive in gallery posts but fail in production pipelines.
Models that score high on C5 (Style Range) tend to also score higher on C1 (Instruction Fidelity) than their raw category data suggests. The hypothesis: a model that can switch between Hokusai and Dali and cel animation on demand has learned a richer internal representation of prompt semantics, which generalises to better instruction following even on non-style prompts.
FLUX.2 Klein 9B (34 seconds, 73.9) is faster than Z Image Turbo (roughly 70 seconds) and scores comparably. FLUX.2 Klein 4B (17 seconds, 67.6) delivers near-A-tier quality at sub-30-second speed. The trade-off between efficiency and quality is steep only at the extremes — the very fastest models (Cosmos 2B, Stable Cascade) or the very best (Qwen Image 2512, FLUX.2 Dev). In the broad middle, speed has been decoupled from quality.
Qwen Image 2512 (20B) outscores Hunyuan Image 2.1 (17B) and FLUX.2 Dev (30B) in Instruction Fidelity. Cosmos Predict2 2B outscores SD 3.5 Large (8B) overall. LongCat (6B) beats HiDream I1 Full (17B) in Photorealism. Architecture, training data, and post-training alignment matter significantly more than parameter count at these scales. Anyone selecting models on size alone will make systematic errors.
All tests run on a consumer-grade Intel 285H iGPU mini-PC. This produces generation times that look extreme (24 minutes for FLUX.2 Dev) but are honest about what real local deployment looks like without dedicated GPU hardware. The efficiency criterion is calibrated to this specific context — practitioners with GPU hardware should re-anchor the speed scores by dividing times by approximately 8.
Use the following to select a model for a specific use-case:
| If your priority is… | Recommended model(s) |
|---|---|
| Maximum quality, any speed | Qwen Image 2512 |
| Text inside generated images | Qwen Image 2512 → FLUX.2 Klein 9B → Qwen Image |
| Anime / hand-drawn illustration | FLUX.2 Dev → Qwen Image 2512 |
| Art style diversity (pop art, cubism…) | FLUX.2 Dev → Qwen Image 2512 → LongCat Image |
| Photographic realism (portraits, etc.) | LongCat Image → Qwen Image 2512 → Hunyuan 2.1 |
| High-volume batch (speed first) | FLUX.2 Klein 4B (17s) → FLUX.2 Klein 9B (34s) |
| Rapid prototyping / iteration | Z Image Turbo → FLUX.2 Klein 9B |
| Minimum VRAM / smallest model | Cosmos Predict2 2B |
| Complex multi-element scenes | Qwen Image 2512 → Qwen Image |
| Comics / panel layout | Qwen Image 2512 → Qwen Image → FLUX.2 Dev |
| Avoiding embedded text entirely | FLUX.2 Dev / LongCat / Hunyuan all viable |
Source: liutyi text2image test v1 — https://wiki.liutyi.info/display/AI/liutyi+text2image+test+v1
Scoring, criteria weighting, and model commentary are independent analysis. The original wiki star ratings provided the raw category-level evidence; all interpretation is the author's own.
Link to chat: https://claude.ai/share/e989b6b6-597b-4da2-8d14-da634820a676