...

285H Time: 8m 2.08s | total 484.49 pipeline 480.23 te 1.98 decode 1.81 gc 0.43 | GPU 31740 MB 25% | RAM 47.39 GB 39%

Info

https://github.com/meituan-longcat/LongCat-Image

...

Code Block
import torch
from diffusers import LongCatImagePipeline

if __name__ == '__main__':
    device = torch.device('cuda')

    pipe = LongCatImagePipeline.from_pretrained("meituan-longcat/LongCat-Image", torch_dtype=torch.bfloat16)
    # pipe.to(device, torch.bfloat16)  # Uncomment on high-VRAM devices for faster inference
    pipe.enable_model_cpu_offload()  # Offload submodules to CPU to save VRAM (~17 GB still needed); slower, but avoids OOM

    # Prompt (translated from Chinese): a young Asian woman in a yellow knit sweater with a white necklace,
    # hands resting on her knees, serene expression; rough brick wall in the background, warm afternoon
    # sunlight, medium-distance framing, soft light emphasizing her features and the texture of the jewelry;
    # simple composition with a calm, cozy mood.
    prompt = '一个年轻的亚裔女性,身穿黄色针织衫,搭配白色项链。她的双手放在膝盖上,表情恬静。背景是一堵粗糙的砖墙,午后的阳光温暖地洒在她身上,营造出一种宁静而温馨的氛围。镜头采用中距离视角,突出她的神态和服饰的细节。光线柔和地打在她的脸上,强调她的五官和饰品的质感,增加画面的层次感与亲和力。整个画面构图简洁,砖墙的纹理与阳光的光影效果相得益彰,突显出人物的优雅与从容。'
    
    image = pipe(
        prompt,
        height=768,
        width=1344,
        guidance_scale=4.0,
        num_inference_steps=50,
        num_images_per_prompt=1,
        generator=torch.Generator("cpu").manual_seed(43),
        enable_cfg_renorm=True,
        enable_prompt_rewrite=True
    ).images[0]

    image.save('./t2i_example.png')
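
If less than the ~17 GB of VRAM noted above is available, diffusers also provides sequential CPU offload, which keeps only the currently executing submodule on the accelerator. A minimal sketch reusing the same pipeline object; the memory behaviour described in the comments is general diffusers behaviour, not something measured in this report:

Code Block
# Alternative to enable_model_cpu_offload() for very low-VRAM machines:
# only the submodule currently executing stays on the accelerator, so
# peak VRAM drops further at the cost of noticeably slower inference.
# Use one of the two offload modes, not both.
pipe.enable_sequential_cpu_offload()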

Test 0 - Different seed variations

Prompt: photorealistic girl in bookshop choosing the book in romantic stories shelf. smiling

...

CFG 4, 50 steps. Columns: seeds 1620085323, 1931701040, 4075624134, 2736029172.
Rows: bookshop girl / hand and face / legs and shoes. (Image grid omitted.)
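
A grid like the one above can be produced by sweeping the seed while keeping every other argument fixed. This is only an illustrative sketch that reuses the `pipe` object and the resolution from the example code above; the actual script, output naming, and the extra flags from the example (enable_cfg_renorm, enable_prompt_rewrite) are not part of this report:

Code Block
prompt = 'photorealistic girl in bookshop choosing the book in romantic stories shelf. smiling'
seeds = [1620085323, 1931701040, 4075624134, 2736029172]

for seed in seeds:
    image = pipe(
        prompt,
        height=768,           # resolution assumed from the example above
        width=1344,
        guidance_scale=4.0,   # CFG 4, as in the grid header
        num_inference_steps=50,
        generator=torch.Generator("cpu").manual_seed(seed),
    ).images[0]
    image.save(f'./bookshop_seed_{seed}.png')  # illustrative file name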

Test 1 - Bookshop

Prompt: photorealistic girl in bookshop choosing the book in romantic stories shelf. smiling

...


Columns: 4 / 8 / 16 / 32 / 50 inference steps. Rows: CFG 1, 2, 3, 4, 5, 6, 8. (Image grid omitted.)
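
A CFG-versus-steps grid such as the one above can be generated with a nested sweep. The axis values below mirror the grid above; a fixed seed is assumed so that only guidance scale and step count vary, and the seed and file names are illustrative:

Code Block
prompt = 'photorealistic girl in bookshop choosing the book in romantic stories shelf. smiling'
step_counts = [4, 8, 16, 32, 50]
cfg_scales = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0]

for cfg in cfg_scales:
    for steps in step_counts:
        image = pipe(
            prompt,
            height=768,
            width=1344,
            guidance_scale=cfg,
            num_inference_steps=steps,
            generator=torch.Generator("cpu").manual_seed(43),  # illustrative fixed seed
        ).images[0]
        image.save(f'./bookshop_cfg{cfg:g}_steps{steps}.png')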

Test 2 - Face and hand

Prompt: Create a close-up photograph of a woman's face and hand, with her hand raised to her chin. She is wearing a white blazer and has a gold ring on her finger. Her nails are neatly manicured and her hair is pulled back into a low bun. She is smiling and has a radiant expression on her face. The background is a plain light gray color. The overall mood of the photo is elegant and sophisticated. The photo should have a soft, natural light and a slight warmth to it. The woman's hair is dark brown and pulled back into a low bun, with a few loose strands framing her face.

...


Columns: 8 / 16 / 20 / 32 / 50 inference steps. Rows: CFG 1, 2, 3, 4, 5, 6, 7. (Image grid omitted.)

Test 3 - Legs

Prompt: Generate a photo of a woman's legs, with her feet crossed and wearing white high-heeled shoes with ribbons tied around her ankles. The shoes should have a pointed toe and a stiletto heel. The woman's legs should be smooth and tanned, with a slight sheen to them. The background should be a light gray color. The photo should be taken from a low angle, looking up at the woman's legs. The ribbons should be tied in a bow shape around the ankles. The shoes should have a red sole. The woman's legs should be slightly bent at the knee.

...


Columns: 8 / 16 / 32 / 64 inference steps. Rows: CFG 2, 4, 6, 8, 10. (Image grid omitted.)

Test 4 - Empty prompts


1024x1024, 50 steps, empty prompt. One image per seed: 1–10, 21, 38, 42, 68, 2025. (Image grid omitted.)

Test 5 - Other Models cover


Test 6 - Art Prompts


System info

Code Block


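A minimal sketch of how the system information for a block like the one above can be collected with standard PyTorch/diffusers calls; this is illustrative only, not the script used for this report:

Code Block
import platform

import diffusers
import torch

print('python   :', platform.python_version())
print('platform :', platform.platform())
print('torch    :', torch.__version__)
print('diffusers:', diffusers.__version__)

# The module table below reports device xpu:0 (Intel GPU); guard the
# accelerator queries so the script also runs on CUDA-only or CPU machines.
if hasattr(torch, 'xpu') and torch.xpu.is_available():
    print('xpu device :', torch.xpu.get_device_name(0))
elif torch.cuda.is_available():
    print('cuda device:', torch.cuda.get_device_name(0))
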
Config

Code Block
{

}


Model info


Module | Class | Device | Dtype | Quant | Params | Modules | Config (shown below each row)
vae | AutoencoderKL | xpu:0 | torch.bfloat16 | None | 83,819,683 | 241

FrozenDict({'in_channels': 3, 'out_channels': 3, 'down_block_types': ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D'], 'up_block_types': ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D'], 'block_out_channels': [128, 256, 512, 512], 'layers_per_block': 2, 'act_fn': 'silu', 'latent_channels': 16, 'norm_num_groups': 32, 'sample_size': 1024, 'scaling_factor': 0.3611, 'shift_factor': 0.1159, 'latents_mean': None, 'latents_std': None, 'force_upcast': True, 'use_quant_conv': False, 'use_post_quant_conv': False, 'mid_block_add_attention': True, '_class_name': 'AutoencoderKL', '_diffusers_version': '0.30.0.dev0', '_name_or_path': '/mnt/models/Diffusers/models--meituan-longcat--LongCat-Image/snapshots/d2ea50b79a930074c37b9b97ce45e3b2ea8cf4d8/vae'})

text_encoder | Qwen2_5_VLForConditionalGeneration | xpu:0 | torch.bfloat16 | None | 8,292,166,656 | 763

Qwen2_5_VLConfig { "architectures": [ "Qwen2_5_VLForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 151643, "dtype": "bfloat16", "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 3584, "initializer_range": 0.02, "intermediate_size": 18944, "max_position_embeddings": 128000, "max_window_layers": 28, "model_type": "qwen2_5_vl", "num_attention_heads": 28, "num_hidden_layers": 28, "num_key_value_heads": 4, "rms_norm_eps": 1e-06, "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000.0, "sliding_window": 32768, "text_config": { "_name_or_path": "hunyuanvideo-community/HunyuanImage-2.1-Diffusers", "architectures": [ "Qwen2_5_VLForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 151643, "dtype": "bfloat16", "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 3584, "image_token_id": 151655, "initializer_range": 0.02, "intermediate_size": 18944, "layer_types": [ "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention", "full_attention" ], "max_position_embeddings": 128000, "max_window_layers": 28, "model_type": "qwen2_5_vl_text", "num_attention_heads": 28, "num_hidden_layers": 28, "num_key_value_heads": 4, "rms_norm_eps": 1e-06, "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000.0, "sliding_window": null, "use_cache": true, "use_sliding_window": false, "video_token_id": 151656, "vision_end_token_id": 151653, "vision_start_token_id": 151652, "vision_token_id": 151654, "vocab_size": 152064 }, "tie_word_embeddings": false, "transformers_version": "4.57.3", "use_cache": true, "use_sliding_window": false, "vision_config": { "depth": 32, "dtype": "bfloat16", "fullatt_block_indexes": [ 7, 15, 23, 31 ], "hidden_act": "silu", "hidden_size": 1280, "in_channels": 3, "in_chans": 3, "initializer_range": 0.02, "intermediate_size": 3420, "model_type": "qwen2_5_vl", "num_heads": 16, "out_hidden_size": 3584, "patch_size": 14, "spatial_merge_size": 2, "spatial_patch_size": 14, "temporal_patch_size": 2, "tokens_per_second": 2, "window_size": 112 }, "vision_token_id": 151654, "vocab_size": 152064 }

tokenizer | Qwen2Tokenizer | None | None | None | 0 | 0

None

transformer | LongCatImageTransformer2DModel | xpu:0 | torch.bfloat16 | None | 6,270,668,864 | 677

FrozenDict({'patch_size': 1, 'in_channels': 64, 'num_layers': 10, 'num_single_layers': 20, 'attention_head_dim': 128, 'num_attention_heads': 24, 'joint_attention_dim': 3584, 'pooled_projection_dim': 3584, 'axes_dims_rope': [16, 56, 56], '_use_default_values': ['axes_dims_rope'], '_class_name': 'LongCatImageTransformer2DModel', '_diffusers_version': '0.30.0.dev0', 'guidance_embeds': False, '_name_or_path': 'meituan-longcat/LongCat-Image'})

scheduler | FlowMatchEulerDiscreteScheduler | None | None | None | 0 | 0

FrozenDict({'num_train_timesteps': 1000, 'shift': 3.0, 'use_dynamic_shifting': True, 'base_shift': 0.5, 'max_shift': 1.15, 'base_image_seq_len': 256, 'max_image_seq_len': 4096, 'invert_sigmas': False, 'shift_terminal': None, 'use_karras_sigmas': False, 'use_exponential_sigmas': False, 'use_beta_sigmas': False, 'time_shift_type': 'exponential', 'stochastic_sampling': False, '_use_default_values': ['shift_terminal', 'stochastic_sampling', 'time_shift_type', 'use_exponential_sigmas', 'use_karras_sigmas', 'use_beta_sigmas', 'invert_sigmas'], '_class_name': 'FlowMatchEulerDiscreteScheduler', '_diffusers_version': '0.30.0.dev0'})

text_processor | Qwen2VLProcessor | None | None | None | 0 | 0

None
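
The per-module summary above (class, device, dtype, parameter and module counts) can be reproduced by iterating over the pipeline's components. A rough sketch, assuming the `pipe` object from the first code block; the output formatting is illustrative:

Code Block
import torch

for name, comp in pipe.components.items():
    if isinstance(comp, torch.nn.Module):
        params = sum(p.numel() for p in comp.parameters())
        modules = sum(1 for _ in comp.modules())
        p = next(comp.parameters())
        print(f'{name} | {type(comp).__name__} | {p.device} | {p.dtype} | {params} | {modules}')
    else:
        # tokenizer, scheduler, and processor carry no weights
        print(f'{name} | {type(comp).__name__} | None | None | 0 | 0')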

...