Info

https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers
Note:
If you are using the T2V-1.3B model, we recommend setting the parameter --sample_guide_scale 6.
The --sample_shift parameter can be adjusted within the range of 8 to 12 based on the performance.
Code Block
model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"
height = 704
width = 1280
num_frames = 121
num_inference_steps = 50
guidance_scale = 5.0

prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
#negative_prompt = "Vibrant colors, overexposed, static, blurry details, subtitles, style, artwork, painting, image, still, overall grayish, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, distorted limbs, fingers fused together, static image, cluttered background, three legs, many people in the background, walking backwards."
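
The settings above map onto a single diffusers call. What follows is a minimal sketch, assuming the standard `WanPipeline` class and `export_to_video` helper from diffusers and an Intel XPU device as listed under System info. The `build_call_kwargs()` and `generate()` helpers, the default seed, the output path and `fps=24` are illustrative choices, not taken from this page.

```python
def build_call_kwargs(prompt, negative_prompt):
    """Collect the generation settings listed above into pipeline kwargs."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "height": 704,
        "width": 1280,
        "num_frames": 121,
        "num_inference_steps": 50,
        "guidance_scale": 5.0,
    }


def generate(prompt, negative_prompt, seed=0, device="xpu", out_path="wan22_ti2v.mp4"):
    """Hypothetical wrapper: load the model, render one clip, save it as MP4."""
    # Heavy imports stay local so build_call_kwargs() works without them.
    import torch
    from diffusers import WanPipeline
    from diffusers.utils import export_to_video

    model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"
    pipe = WanPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe.to(device)

    generator = torch.Generator(device="cpu").manual_seed(seed)
    result = pipe(generator=generator, **build_call_kwargs(prompt, negative_prompt))
    export_to_video(result.frames[0], out_path, fps=24)  # fps is an assumption
    return out_path
```

Calling `generate(prompt, negative_prompt)` downloads the checkpoint on first use, so it is best tried on a machine that already has the model cached.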

Test 0 - Different seed variations and resolutions

Prompt: Create a close-up photograph of a woman's face and hand, with her hand raised to her chin. She is wearing a white blazer and has a gold ring on her finger. Her nails are neatly manicured and her hair is pulled back into a low bun. She is smiling and has a radiant expression on her face. The background is a plain light gray color. The overall mood of the photo is elegant and sophisticated. The photo should have a soft, natural light and a slight warmth to it. The woman's hair is dark brown and pulled back into a low bun, with a few loose strands framing her face.

Negative: Vibrant colors, overexposed, static, blurry details, subtitles, style, artwork, painting, image, still, overall grayish, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, distorted limbs, fingers fused together, static image, cluttered background, three legs, many people in the background, walking backwards.

Parameters: Steps: 50| Size: 704x704| Seed: 2736029172| CFG scale: 5| App: SD.Next| Version: 7644432| Pipeline: WanPipeline| Operations: txt2img| Model: Wan2.2-TI2V-5B-Diffusers

Time: 1m 36.82s | total 99.75 pipeline 96.79 te 1.37 vae 1.36 | GPU 29768 MB 24% | RAM 38.62 GB 31%
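
The Parameters line above is a flat `Key: value|` list, so it is easy to turn into a dictionary when comparing runs. The helper below is a small sketch; `parse_parameters()` is a hypothetical name and not part of SD.Next.

```python
def parse_parameters(line):
    """Split 'Key: value| Key: value' metadata into a dict of strings."""
    text = line.split("Parameters:", 1)[-1]
    fields = {}
    for part in text.split("|"):
        key, sep, value = part.partition(":")
        if sep:  # skip fragments that are not 'key: value'
            fields[key.strip()] = value.strip()
    return fields


params = parse_parameters(
    "Parameters: Steps: 50| Size: 704x704| Seed: 2736029172| CFG scale: 5| "
    "App: SD.Next| Version: 7644432| Pipeline: WanPipeline| "
    "Operations: txt2img| Model: Wan2.2-TI2V-5B-Diffusers"
)
width, height = (int(v) for v in params["Size"].split("x"))
print(params["Steps"], params["Seed"], width, height)
```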

CFG5, STEP20 | Seed: 1620085323 | Seed: 1931701040 | Seed: 4075624134 | Seed: 2736029172
bookshop girl



hand and face



legs and shoes




Test 1 - Bookshop

Prompt: photorealistic girl in bookshop choosing the book in romantic stories shelf. smiling

...

Execution: Time: 3m 54.34s | total 439.88 pipeline 233.25 preview 194.33 te 5.20 offload 5.07 vae 1.20 decode 0.80 post 0.27 gc



8 | 16 | 20 | 32

CFG4





CFG6





CFG8





CFG10
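
A grid like the one above can be enumerated programmatically. This sketch assumes the column header reads as step counts 8, 16, 20 and 32 (an interpretation of the flattened table) and uses a placeholder seed, since the page does not state the seed for this test. `sweep_jobs()` is a hypothetical helper.

```python
from itertools import product

CFG_SCALES = [4, 6, 8, 10]      # row labels of the grid above
STEP_COUNTS = [8, 16, 20, 32]   # assumed reading of the column header


def sweep_jobs(prompt, seed):
    """Yield one settings dict per grid cell, row by row."""
    for cfg, steps in product(CFG_SCALES, STEP_COUNTS):
        yield {
            "prompt": prompt,
            "seed": seed,
            "guidance_scale": cfg,
            "num_inference_steps": steps,
        }


jobs = list(sweep_jobs(
    "photorealistic girl in bookshop choosing the book in romantic stories shelf. smiling",
    seed=1234,  # placeholder, not taken from this page
))
print(len(jobs), jobs[0], jobs[-1])
```

Keeping the seed fixed across the grid means differences between cells come only from the CFG scale and the step count.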

Test 4 - Different seed variations and resolutions

Prompt: Create a close-up photograph of a woman's face and hand, with her hand raised to her chin. She is wearing a white blazer and has a gold ring on her finger. Her nails are neatly manicured and her hair is pulled back into a low bun. She is smiling and has a radiant expression on her face. The background is a plain light gray color. The overall mood of the photo is elegant and sophisticated. The photo should have a soft, natural light and a slight warmth to it. The woman's hair is dark brown and pulled back into a low bun, with a few loose strands framing her face.

Parameters: Steps: 20| Size: 768x768| Seed: 2736029172| CFG scale: 6| App: SD.Next| Version: 7644432| Pipeline: WanPipeline| Operations: txt2img| Model: Wan2.2-TI2V-5B-Diffusers

Time: 45.33s | total 48.41 pipeline 45.30 vae 1.66 te 1.36 | GPU 24466 MB 19% | RAM 34.07 GB 28%
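
The two Time lines reported for Test 0 and Test 4 can be compared on a per-step basis. The sketch below parses the per-stage timings and divides the pipeline time by the step count. This is only a rough figure, because the pipeline time may include work other than denoising. `parse_timing()` is a hypothetical helper.

```python
import re


def parse_timing(line):
    """Return {'stage': seconds} from the 'total ... pipeline ...' field."""
    segment = line.split("|")[1]  # second field holds the per-stage timings
    pairs = re.findall(r"([a-z]+) (\d+(?:\.\d+)?)", segment)
    return {name: float(value) for name, value in pairs}


runs = {
    "704x704, 50 steps": (
        "Time: 1m 36.82s | total 99.75 pipeline 96.79 te 1.37 vae 1.36 "
        "| GPU 29768 MB 24% | RAM 38.62 GB 31%",
        50,
    ),
    "768x768, 20 steps": (
        "Time: 45.33s | total 48.41 pipeline 45.30 vae 1.66 te 1.36 "
        "| GPU 24466 MB 19% | RAM 34.07 GB 28%",
        20,
    ),
}
per_step = {}
for label, (line, steps) in runs.items():
    per_step[label] = parse_timing(line)["pipeline"] / steps
    print(f"{label}: {per_step[label]:.2f} s/step")
```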

1280px
CFG6, STEP 20 | Seed: 1620085323 | Seed: 1931701040 | Seed: 4075624134 | Seed: 2736029172

512px

combined


768px


1024px








System info


Code Block
app: sdnext.git updated: 2025-12-06 hash: 764443213 url: https://github.com/liutyi/sdnext.git/tree/pytorch
arch: x86_64 cpu: x86_64 system: Linux release: 6.17.0-7-generic
python: 3.12.3 Torch 2.9.1+xpu
ram: free:114.91 used:8.17 total:123.07
device: Intel(R) Arc(TM) Graphics (1) ipex:  
xformers:  diffusers: 0.36.0.dev0 transformers: 4.57.1
active: xpu dtype: torch.bfloat16 vae: torch.bfloat16 unet: torch.bfloat16
base: Diffusers/Wan-AI/Wan2.2-TI2V-5B-Diffusers [b8fff7315c] refiner: none vae: none te: none unet: none
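
The system info reports `xpu` as the active backend. When reproducing the setup outside SD.Next, a quick way to confirm that PyTorch sees the Intel GPU is a probe like the following. It is a sketch only: `pick_device()` is a hypothetical helper, and it falls back to `cpu` when PyTorch is not installed.

```python
def pick_device():
    """Return 'xpu', 'cuda' or 'cpu' depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, nothing to probe
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_device())
```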


Config

Code Block
{
"sd_model_checkpoint": "Diffusers/Wan-AI/Wan2.1-T2V-1.3B-Diffusers [0fad780a53]",
  "diffusers_version": "9c13f8657986e68f5f05987912c54432fd28d86f",
  "sd_checkpoint_hash": null,
  "diffusers_offload_min_gpu_memory": 0.05,
  "diffusers_offload_max_gpu_memory": 0.95,
  "diffusers_vae_tiling": true,
  "diffusers_vae_tile_size": 512,
  "dynamic_attention_slice_rate": 1,
  "dynamic_attention_trigger_rate": 2,
  "samples_filename_pattern": "[seq]-[date]-[model_name]-[height]x[width]-STEP[steps]-CFG[cfg]-Seed[seed]"
}
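
The `samples_filename_pattern` entry uses bracketed tokens that are replaced per image. The snippet below illustrates the idea with a simple substitution. It is not SD.Next's implementation, and the `seq` and `date` values are made up for the example.

```python
import re


def expand_pattern(pattern, values):
    """Replace each [token] with values[token]; unknown tokens stay as-is."""
    def substitute(match):
        token = match.group(1)
        return str(values[token]) if token in values else match.group(0)
    return re.sub(r"\[(\w+)\]", substitute, pattern)


pattern = "[seq]-[date]-[model_name]-[height]x[width]-STEP[steps]-CFG[cfg]-Seed[seed]"
name = expand_pattern(pattern, {
    "seq": "00001",                 # illustrative value
    "date": "2025-01-01",           # illustrative value
    "model_name": "Wan2.2-TI2V-5B-Diffusers",
    "height": 704, "width": 704, "steps": 50, "cfg": 5, "seed": 2736029172,
})
print(name)
```

Encoding steps, CFG and seed in the file name is what makes the comparison grids on this page easy to assemble from the output folder.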


Model info

Module | Class | Device | Dtype | Quant | Params | Modules | Config
vae | AutoencoderKLWan | xpu:0 | torch.bfloat16 | None | 704688668 | 272

FrozenDict({'base_dim': 160, 'decoder_base_dim': 256, 'z_dim': 48, 'dim_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_scales': [], 'temperal_downsample': [False, True, True], 'dropout': 0.0, 'latents_mean': [-0.2289, -0.0052, -0.1323, -0.2339, -0.2799, 0.0174, 0.1838, 0.1557, -0.1382, 0.0542, 0.2813, 0.0891, 0.157, -0.0098, 0.0375, -0.1825, -0.2246, -0.1207, -0.0698, 0.5109, 0.2665, -0.2108, -0.2158, 0.2502, -0.2055, -0.0322, 0.1109, 0.1567, -0.0729, 0.0899, -0.2799, -0.123, -0.0313, -0.1649, 0.0117, 0.0723, -0.2839, -0.2083, -0.052, 0.3748, 0.0152, 0.1957, 0.1433, -0.2944, 0.3573, -0.0548, -0.1681, -0.0667], 'latents_std': [0.4765, 1.0364, 0.4514, 1.1677, 0.5313, 0.499, 0.4818, 0.5013, 0.8158, 1.0344, 0.5894, 1.0901, 0.6885, 0.6165, 0.8454, 0.4978, 0.5759, 0.3523, 0.7135, 0.6804, 0.5833, 1.4146, 0.8986, 0.5659, 0.7069, 0.5338, 0.4889, 0.4917, 0.4069, 0.4999, 0.6866, 0.4093, 0.5709, 0.6065, 0.6415, 0.4944, 0.5726, 1.2042, 0.5458, 1.6887, 0.3971, 1.06, 0.3943, 0.5537, 0.5444, 0.4089, 0.7468, 0.7744], 'is_residual': True, 'in_channels': 12, 'out_channels': 12, 'patch_size': 2, 'scale_factor_temporal': 4, 'scale_factor_spatial': 16, '_class_name': 'AutoencoderKLWan', '_diffusers_version': '0.35.0.dev0', 'clip_output': False, '_name_or_path': '/mnt/models/Diffusers/models--Wan-AI--Wan2.2-TI2V-5B-Diffusers/snapshots/b8fff7315c768468a5333511427288870b2e9635/vae'})

text_encoder | UMT5EncoderModel | xpu:0 | torch.bfloat16 | None | 5680910336 | 486

UMT5Config { "architectures": [ "UMT5EncoderModel" ], "classifier_dropout": 0.0, "d_ff": 10240, "d_kv": 64, "d_model": 4096, "decoder_start_token_id": 0, "dense_act_fn": "gelu_new", "dropout_rate": 0.1, "dtype": "bfloat16", "eos_token_id": 1, "feed_forward_proj": "gated-gelu", "initializer_factor": 1.0, "is_encoder_decoder": true, "is_gated_act": true, "layer_norm_epsilon": 1e-06, "model_type": "umt5", "num_decoder_layers": 24, "num_heads": 64, "num_layers": 24, "output_past": true, "pad_token_id": 0, "relative_attention_max_distance": 128, "relative_attention_num_buckets": 32, "scalable_attention": true, "tie_word_embeddings": false, "tokenizer_class": "T5Tokenizer", "transformers_version": "4.57.1", "use_cache": true, "vocab_size": 256384 }

tokenizer | T5TokenizerFast | None | None | None | 0 | 0

None

transformer | WanTransformer3DModel | xpu:0 | torch.bfloat16 | None | 4999787712 | 858

FrozenDict({'patch_size': [1, 2, 2], 'num_attention_heads': 24, 'attention_head_dim': 128, 'in_channels': 48, 'out_channels': 48, 'text_dim': 4096, 'freq_dim': 256, 'ffn_dim': 14336, 'num_layers': 30, 'cross_attn_norm': True, 'qk_norm': 'rms_norm_across_heads', 'eps': 1e-06, 'image_dim': None, 'added_kv_proj_dim': None, 'rope_max_seq_len': 1024, 'pos_embed_seq_len': None, '_use_default_values': ['pos_embed_seq_len'], '_class_name': 'WanTransformer3DModel', '_diffusers_version': '0.35.0.dev0', '_name_or_path': 'Wan-AI/Wan2.2-TI2V-5B-Diffusers'})
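
The VAE scale factors (`scale_factor_spatial` 16, `scale_factor_temporal` 4) and the transformer `patch_size` of [1, 2, 2] together determine how many tokens the transformer processes for a given output size. The sketch below works this out; both helper names are hypothetical, and treating a still image as a single frame is an assumption.

```python
def latent_shape(height, width, num_frames, spatial=16, temporal=4):
    """Latent (frames, height, width) for the VAE scale factors listed above."""
    return ((num_frames - 1) // temporal + 1, height // spatial, width // spatial)


def token_count(height, width, num_frames, patch=(1, 2, 2)):
    """Number of transformer tokens after patchifying the latent."""
    frames, lat_h, lat_w = latent_shape(height, width, num_frames)
    return (frames // patch[0]) * (lat_h // patch[1]) * (lat_w // patch[2])


print(latent_shape(704, 1280, 121))  # (31, 44, 80)
print(token_count(704, 1280, 121))   # 27280
print(token_count(704, 704, 1))      # 484, a single 704x704 frame
```

The jump from 484 tokens for one frame to 27280 for a 121-frame clip gives a sense of why video generation costs so much more than the still-image tests on this page.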

scheduler | UniPCMultistepScheduler | None | None | None | 0 | 0

FrozenDict({'num_train_timesteps': 1000, 'beta_start': 0.0001, 'beta_end': 0.02, 'beta_schedule': 'linear', 'trained_betas': None, 'solver_order': 2, 'prediction_type': 'flow_prediction', 'thresholding': False, 'dynamic_thresholding_ratio': 0.995, 'sample_max_value': 1.0, 'predict_x0': True, 'solver_type': 'bh2', 'lower_order_final': True, 'disable_corrector': [], 'solver_p': None, 'use_karras_sigmas': False, 'use_exponential_sigmas': False, 'use_beta_sigmas': False, 'use_flow_sigmas': True, 'flow_shift': 5.0, 'timestep_spacing': 'linspace', 'steps_offset': 0, 'final_sigmas_type': 'zero', 'rescale_betas_zero_snr': False, 'use_dynamic_shifting': False, 'time_shift_type': 'exponential', '_use_default_values': ['use_dynamic_shifting', 'time_shift_type'], '_class_name': 'UniPCMultistepScheduler', '_diffusers_version': '0.35.0.dev0'})
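
With `use_flow_sigmas` enabled, the `flow_shift` value in the scheduler config bends the noise schedule towards higher noise levels. The sketch below shows the shift formula commonly used for flow-matching schedules, which to my understanding is what the scheduler applies here. The sample sigmas and the shift values 1.0, 3.0 and 5.0 are illustrative.

```python
def shift_sigma(sigma, flow_shift):
    """Apply the flow-matching time shift to one sigma in [0, 1]."""
    return flow_shift * sigma / (1 + (flow_shift - 1) * sigma)


sigmas = [1.0, 0.75, 0.5, 0.25]
shifted = {}
for shift in (1.0, 3.0, 5.0):
    shifted[shift] = [round(shift_sigma(s, shift), 4) for s in sigmas]
    print(shift, shifted[shift])
```

A larger shift keeps sigma close to 1 for more of the schedule, so more of the sampling steps are spent at high noise, where the overall composition is decided.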

transformer_2 | NoneType | None | None | None | 0 | 0

None

boundary_ratio | NoneType | None | None | None | 0 | 0

None

expand_timesteps | bool | None | None | None | 0 | 0

None
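
The parameter counts in the table give a lower bound on memory use. The sketch below converts them to GiB at 2 bytes per bfloat16 parameter, using the counts reported above for the 5B checkpoint. It covers weights only and ignores activations, latents and allocator overhead, so the GPU figures in the Time lines will be higher.

```python
BYTES_PER_BF16 = 2

# Parameter counts as reported in the model info table for the 5B checkpoint.
PARAMS = {
    "transformer": 4_999_787_712,
    "text_encoder": 5_680_910_336,
    "vae": 704_688_668,
}


def weights_gib(num_params, bytes_per_param=BYTES_PER_BF16):
    """Size of the raw weights in GiB, ignoring activations and caches."""
    return num_params * bytes_per_param / 2**30


for name, count in PARAMS.items():
    print(f"{name}: {weights_gib(count):.2f} GiB")
print(f"total: {weights_gib(sum(PARAMS.values())):.2f} GiB")
```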