# Walkyrie 1.3B – Text-to-Image
Walkyrie is a Text-to-Image diffusion model derived from Wan2.1-T2V-1.3B.
The text encoder (UMT5) was pruned to ~1B parameters and the model was re-trained for image generation, converting the original Text-to-Video architecture into a high-quality Text-to-Image pipeline.
| Version | Release repo |
|---|---|
| Preview1.0 | kpsss34/Walkyrie-1.3B-v1.0 |
| anime style | Coming soon |
| Turbo | Coming soon |
## Updates
- May 11, 2026 – Re-trained to reduce plastic-looking color. For realism, lower the CFG to 2.5–3.0 and reduce inference to just 20 steps.
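The CFG value above is the standard classifier-free guidance scale. As a rough sketch of what it controls (a generic formulation, not necessarily the pipeline's exact internals), each denoising step blends the conditional and unconditional noise predictions:

```python
import torch

def cfg_combine(uncond, cond, guidance_scale):
    # Classifier-free guidance: push the conditional prediction away from
    # the unconditional one by `guidance_scale`. Higher values follow the
    # prompt more strictly but can over-saturate; 2.5-3.0 is the sweet
    # spot recommended above.
    return uncond + guidance_scale * (cond - uncond)

uncond = torch.zeros(4)  # stand-in for the unconditional noise prediction
cond = torch.ones(4)     # stand-in for the prompt-conditioned prediction
out = cfg_combine(uncond, cond, 3.0)
print(out)  # tensor([3., 3., 3., 3.])
```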
Try the custom nodes in ComfyUI for the best quality.
## Use with ComfyUI
```shell
git clone https://github.com/kpsss34/walkyrie.git
```
1. Download the merged model to `ComfyUI/models/checkpoints` (filename: `Walkyrie_bf16.safetensors` or `Walkyrie_fp8.safetensors`).
2. Make sure diffusers version 0.33.0 is installed (reinstall it if necessary).
3. Open ComfyUI and search for the Walkyrie node.
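Steps 1 and 2 might look like this in a shell, run from your ComfyUI root. The checkpoint's exact location on the Hub and the local paths are assumptions; verify them against the release repo.

```shell
# Fetch the merged checkpoint into ComfyUI's model directory
# (assumes the file is hosted in the release repo listed above).
huggingface-cli download kpsss34/Walkyrie-1.3B-v1.0 Walkyrie_bf16.safetensors \
  --local-dir models/checkpoints

# Pin diffusers to the version the custom nodes expect.
pip install diffusers==0.33.0
```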
## Install dependencies
```shell
pip install git+https://github.com/huggingface/diffusers.git transformers accelerate torch torchvision ftfy
git clone https://github.com/kpsss34/Walkyrie-1.3B.git
cd Walkyrie-1.3B
```
## Basic inference
```python
import torch
from pipeline_walkyrie import pipeline_walkyrie

device = "cuda" if torch.cuda.is_available() else "cpu"
model_dtype = torch.bfloat16
model_id = "kpsss34/Walkyrie-1.3B-v1.0"

pipe = pipeline_walkyrie.from_pretrained(
    model_id,
    torch_dtype=model_dtype,
)
pipe.enable_model_cpu_offload()  # or pipe.to(device) if you have enough VRAM

prompt = "a portrait of a young woman in a nightclub, cinematic film still, ultra wide aspect ratio, oval bokeh, soft highlight bloom, teal orange grading, film grain, moody lighting"
negative_prompt = ""

# Seeded generator for reproducible results
generator = torch.Generator(device=device).manual_seed(0)

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=20,
    guidance_scale=3.0,
    generator=generator,
    output_type="pil",
).frames[0]  # the pipeline keeps the video-style output shape; take the first frame

output.save("output.png")
```
## Memory-efficient inference (CPU offload)

```python
pipe.enable_model_cpu_offload()
```
## Model Details
| Property | Value |
|---|---|
| Base model | Wan2.1-T2V-1.3B |
| Task | Text-to-Image |
| Text Encoder | UMT5 (pruned to ~1B) |
| VAE | AutoencoderKLWan |
| Scheduler | FlowMatchEulerDiscreteScheduler |
| Precision | bfloat16 |
| Resolution | 1024×768, 768×1024 (recommended) |
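The recommended resolutions divide cleanly into the latent grid the diffusion transformer works on. A small sanity check, assuming the 8× spatial downsampling typical of Wan-family VAEs (an assumption, not stated in the table):

```python
# Latent-space size for a given output resolution, assuming an 8x
# spatial VAE downsampling factor (typical for Wan-family VAEs).
VAE_SCALE = 8

def latent_size(height, width, scale=VAE_SCALE):
    # Diffusion runs in latent space; each latent cell covers
    # scale x scale output pixels, so dimensions should divide evenly.
    assert height % scale == 0 and width % scale == 0, "use multiples of 8"
    return height // scale, width // scale

print(latent_size(1024, 768))  # (128, 96)
```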
## What's Different from Wan2.1
- Text encoder pruned – UMT5 reduced to ~1B parameters for faster inference and lower VRAM usage
- Re-trained for T2I – fine-tuned specifically for image generation instead of video
## Hardware Requirements
| VRAM | Setting |
|---|---|
| 16 GB+ | Full precision bfloat16 |
| 6–8 GB | `enable_model_cpu_offload()` |
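A back-of-the-envelope estimate of the weights-only footprint, using the parameter counts from this card (~1.3B diffusion model + ~1B pruned text encoder). Activations and the VAE add more on top, so treat these as lower bounds:

```python
def param_memory_gb(n_params, bytes_per_param):
    # Weights-only memory; activations, the VAE, and framework overhead
    # are not included.
    return n_params * bytes_per_param / 1024**3

total_params = 1.3e9 + 1.0e9  # DiT + pruned UMT5, per the model card
for name, bpp in [("bf16", 2), ("fp8", 1)]:
    print(f"{name}: ~{param_memory_gb(total_params, bpp):.1f} GB of weights")
```

This is why the fp8 checkpoint pairs well with the 6–8 GB tier when combined with CPU offload.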
## License
This model is released under the Apache 2.0 License.
Free to use for both research and commercial purposes.
## Citation
If you use this model in your work, please credit:
Walkyrie 1.3B – Text-to-Image model derived from Wan2.1-T2V-1.3B
https://huggingface.co/kpsss34/Walkyrie-1.3B-v1.0
https://github.com/kpsss34
https://huggingface.co/kpsss34