Qwen-Image-i2L: Training Strategies for Image-to-LoRA Generation
We're excited to announce the release of Qwen-Image-i2L, an "Image-to-LoRA" model.
Yes, you read that right—input an image, output a LoRA model's weights. End-to-end, direct generation.
In this post, we'll share how we designed and trained Qwen-Image-i2L, including the detours we took during our experiments, in the hope that it inspires further valuable research.
Technical Approach
Feasibility Analysis
Training a LoRA from one or several images on a GPU over several hours has become an essential skill for every Diffusion model enthusiast. The ModelScope community has long provided high-quality free LoRA training capabilities, fostering a thriving ecosystem of community LoRA models.
The "Image-to-LoRA" model is an incredibly ambitious idea: since image data can be trained into LoRA models, can we compress the hours-long LoRA training process into a single forward pass of a model?
- Theoretically, this is entirely feasible. The model's inputs and outputs are not merely strongly correlated; because a LoRA is trained directly from its images, the relationship is outright causal.
- Practically, it's also completely viable. LoRA model weights are tensors, tensors can propagate gradients, and gradients mean trainability—everything checks out.
Model Architecture Design
We encountered numerous challenges in model architecture design. According to OpenAI's Scaling Law, the greater the parameter count, the stronger the generalization capability. However, due to computational resource constraints, we needed to achieve the strongest possible generalization with limited parameters.
LoRA models consist of a series of tensors, where each dimension of each tensor serves a completely different purpose. If we serialize these tensors into embeddings (like most large language models) and process them with a transformer architecture, the model struggles to produce valuable outputs.
In our early experiments, we used a transformer architecture and experimented on a small dataset of tens of thousands of images, training with 8×A100 GPUs. The result? The model degraded into a generic LoRA—regardless of input image, the output LoRA effects were similar. Essentially, we trained a regular static LoRA using those tens of thousands of images.
We were uncertain whether this model structure simply needed more training resources to reach an "aha moment." Computational constraints forced us to switch to other model structures with faster convergence. The key insight was that each tensor in the LoRA weights should be processed by its own layers, since each tensor serves a completely different purpose. A simple, direct solution is a dedicated fully-connected layer per tensor, but this introduces another problem: parameter explosion. LoRA weight dimensions are very large, and even with a single fully-connected layer per tensor, the parameter count easily exceeds 100B.
We adopted a more lightweight approach: replacing each single fully-connected layer with a two-layer MLP and shrinking the intermediate dimension. This significantly reduced the parameter count, but it introduced a new problem: insufficient generalization. To compensate, we used stronger image encoders at the input end, starting with SigLIP2 and DINOv2 and later adding Qwen-VL.
Thus, we arrived at our final Image-to-LoRA model structure: multiple image encoders feeding one small two-layer MLP per LoRA tensor.
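To make the structure concrete, here is a minimal PyTorch sketch of the per-tensor head design. Everything in it (class names, hidden dimension, example tensor shapes) is illustrative rather than the actual Qwen-Image-i2L implementation.

```python
import torch
import torch.nn as nn


class LoRAHead(nn.Module):
    """Two-layer MLP that maps a pooled image feature to one LoRA tensor.

    A single linear layer of size d_img -> prod(tensor_shape) would explode
    the parameter count, so a small bottleneck dimension is used instead.
    """

    def __init__(self, d_img: int, tensor_shape: tuple, d_hidden: int = 256):
        super().__init__()
        self.tensor_shape = tensor_shape
        out_dim = tensor_shape[0] * tensor_shape[1]
        self.net = nn.Sequential(
            nn.Linear(d_img, d_hidden),
            nn.SiLU(),
            nn.Linear(d_hidden, out_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, d_img) -> (batch, *tensor_shape)
        return self.net(image_features).view(-1, *self.tensor_shape)


class ImageToLoRA(nn.Module):
    """One independent head per LoRA tensor, fed by the pooled encoder features."""

    def __init__(self, d_img: int, lora_shapes: dict):
        super().__init__()
        # nn.ModuleDict keys may not contain ".", so we escape them.
        self.heads = nn.ModuleDict({
            name.replace(".", "__"): LoRAHead(d_img, shape)
            for name, shape in lora_shapes.items()
        })

    def forward(self, image_features: torch.Tensor) -> dict:
        return {
            name.replace("__", "."): head(image_features)
            for name, head in self.heads.items()
        }


# Illustrative usage: in the real model, features from SigLIP2 / DINOv2 / Qwen-VL
# would be pooled and concatenated into `features`; the shapes here are made up.
lora_shapes = {
    "blocks.0.attn.to_q.lora_A": (16, 3072),
    "blocks.0.attn.to_q.lora_B": (3072, 16),
}
model = ImageToLoRA(d_img=2048, lora_shapes=lora_shapes)
features = torch.randn(1, 2048)
predicted_lora = model(features)
print({k: tuple(v.shape) for k, v in predicted_lora.items()})
```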
Training Approach
We trained based on the open-source DiffSynth-Studio framework (https://github.com/modelscope/DiffSynth-Studio). To reduce VRAM consumption and accelerate training, we adopted two-stage split training (see https://github.com/modelscope/DiffSynth-Studio/blob/main/docs/en/Training/Split_Training.md), precomputing the image encoder and text encoder outputs offline.
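The actual split-training code lives in DiffSynth-Studio; the sketch below only illustrates the general idea of precomputing encoder outputs once and training against the cached tensors. The function and class names (cache_image_features, CachedFeatureDataset) and the assumption of a pooled feature vector are ours, not DiffSynth-Studio's API.

```python
import os
import torch


@torch.no_grad()
def cache_image_features(image_paths, encoder, preprocess, cache_dir, device="cuda"):
    """Stage 1: run the frozen image encoder once and store pooled features on disk."""
    os.makedirs(cache_dir, exist_ok=True)
    encoder = encoder.to(device).eval()
    for path in image_paths:
        pixel_values = preprocess(path).unsqueeze(0).to(device)  # assumed (1, C, H, W)
        features = encoder(pixel_values)                         # assumed pooled (1, d_img)
        torch.save(features.cpu(), os.path.join(cache_dir, os.path.basename(path) + ".pt"))


class CachedFeatureDataset(torch.utils.data.Dataset):
    """Stage 2: train the lightweight heads against cached tensors; the encoders never load."""

    def __init__(self, cache_dir):
        self.files = sorted(
            os.path.join(cache_dir, f) for f in os.listdir(cache_dir) if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return torch.load(self.files[idx]).squeeze(0)
```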
Due to workload considerations, the training code for Qwen-Image-i2L hasn't been released yet. We'll organize and release the code later (we definitely will—after all, we're an open-source community team!).
Model Version Iterations
Qwen-Image-i2L-Style
We trained multiple model versions. The first version is Qwen-Image-i2L-Style, with 2.4B parameters excluding image encoders. This model used a patchwork dataset including:
- EliGen: https://www.modelscope.cn/datasets/DiffSynth-Studio/EliGenTrainSet
- Qwen-Image-Self-Generated-Dataset: https://www.modelscope.cn/datasets/DiffSynth-Studio/Qwen-Image-Self-Generated-Dataset
- JourneyDB: https://www.modelscope.cn/datasets/AI-ModelScope/JourneyDB
- LAION: https://laion.ai/
We randomly sampled 250K images from each dataset, totaling 1 million images, and trained for two weeks on 8×A100 GPUs. All datasets are open-source—you can try reproducing it if interested.
The LoRAs produced by this model retain very little of the input image's specific content. When you feed an image into the model and generate images with the resulting LoRA, you'll find the outputs only maintain semantic similarity to the input. In other words, given a cat image, the output LoRA can indeed generate a cat, but not the same cat. However, we discovered that the model has excellent style extraction capabilities: we can easily use it to produce style LoRAs.
For example, using these images as input:
Using the generated LoRA to produce images, all with random seed 0:
Wait—how do we handle multiple input images? It's simple: each image produces one LoRA, and we concatenate all LoRAs along the rank dimension.
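In code, this amounts to stacking the A matrices and B matrices of each per-image LoRA along their rank axes. The sketch below assumes the usual LoRA state-dict layout (A: rank × in_features, B: out_features × rank) and uses made-up key names; it is an illustration, not our exact merging code.

```python
import torch


def concat_loras(loras):
    """Merge several LoRAs into one by stacking along the rank dimension.

    With A: (rank, in_features) and B: (out_features, rank), concatenation gives
    B_cat @ A_cat = sum_i B_i @ A_i, i.e. the per-image weight updates are added.
    """
    merged = {}
    for key in loras[0]:
        if key.endswith("lora_A"):
            merged[key] = torch.cat([lora[key] for lora in loras], dim=0)  # stack rank rows
        elif key.endswith("lora_B"):
            merged[key] = torch.cat([lora[key] for lora in loras], dim=1)  # stack rank columns
        else:
            merged[key] = loras[0][key]
    return merged


# Example: merging two rank-16 LoRAs for a 3072 -> 3072 projection yields rank 32.
lora_1 = {"blocks.0.attn.to_q.lora_A": torch.randn(16, 3072),
          "blocks.0.attn.to_q.lora_B": torch.randn(3072, 16)}
lora_2 = {"blocks.0.attn.to_q.lora_A": torch.randn(16, 3072),
          "blocks.0.attn.to_q.lora_B": torch.randn(3072, 16)}
merged = concat_loras([lora_1, lora_2])
```

Because concatenation sums the individual weight updates, the per-image LoRAs may need to be downscaled (for example by 1/n for n images) to keep the overall update magnitude reasonable.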
The results look promising—we achieved a milestone!
Qwen-Image-i2L-Coarse
Next, we scaled up. We wanted stronger detail preservation, such as using a single cat image to generate a LoRA that produces images of the same cat performing different actions. Building on Qwen-Image-i2L-Style, we increased the parameter count to 7.9B, the training data to 20 million images, and the GPU cluster to 80×AMD MI308X.
The next version—Qwen-Image-i2L-Coarse—arrived. Good news: the cats are looking more similar.
Bad news: style preservation dropped sharply. The model's LoRAs tend to reproduce content from the input images, exhibiting a kind of "semantic invasion": when we produce a LoRA from anime-style cat images, everything it generates looks like a cat. Semantic invasion can be mitigated by adding more input images, but if many images are needed just to get a style LoRA, the model loses much of its value.
Qwen-Image-i2L-Fine
However, we believe the decline in style preservation is expected: generated content closely resembling the input images means the model has indeed learned better detail preservation. So we kept improving the model. After Qwen-Image-i2L-Coarse converged, we borrowed ideas from MoE-style architectures such as DeepSeekMoE.
LoRA models naturally support fusion: multiple LoRAs can be merged by concatenating along the rank dimension. Unlike a traditional MoE, every Image-to-LoRA expert participates in the computation, with each expert responsible for a different slice of the rank dimension. When training the next expert, we froze the Qwen-Image-i2L-Coarse parameters and trained only the new expert, as sketched below.
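Here is a minimal sketch of this rank-partitioned expert setup, assuming each expert maps image features to a dictionary of batched lora_A / lora_B tensors as in the earlier sketch; it is our illustration of the idea, not the actual training code.

```python
import torch
import torch.nn as nn


class RankPartitionedExperts(nn.Module):
    """All experts always run; each contributes its own slice of the final rank."""

    def __init__(self, coarse: nn.Module, fine: nn.Module):
        super().__init__()
        self.coarse = coarse.requires_grad_(False).eval()  # previously trained expert, frozen
        self.fine = fine                                   # new expert, trainable

    def forward(self, image_features):
        with torch.no_grad():
            coarse_lora = self.coarse(image_features)
        fine_lora = self.fine(image_features)
        merged = {}
        for key in coarse_lora:
            # Rank axis: (batch, rank, in) for lora_A, (batch, out, rank) for lora_B.
            dim = 1 if key.endswith("lora_A") else -1
            merged[key] = torch.cat([coarse_lora[key], fine_lora[key]], dim=dim)
        return merged
```

The optimizer is then constructed only over the new expert's parameters, so gradients never touch the frozen Coarse weights.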
Fortunately, the AMD MI308X GPU's 192GB VRAM lets us forget about VRAM shortages. During inference, DiffSynth-Studio's new Disk Offload capability allows low-VRAM GPUs to perform inference (see https://github.com/modelscope/DiffSynth-Studio/blob/main/docs/en/Pipeline_Usage/VRAM_management.md).
The next version—Qwen-Image-i2L-Fine—arrived. Good news: the cats look even more similar. Bad news: the cats got uglier.
No! God! Please! No!
Qwen-Image-i2L-Bias
We must salvage this! The main cause of the problem is that our training data distribution differs significantly from Qwen-Image's pretraining distribution, so we needed to realign our data distribution with Qwen-Image's.
DPO is a viable approach, but its data efficiency is too low. We decided to use differential training (see https://github.com/modelscope/DiffSynth-Studio/blob/main/docs/en/Training/Differential_LoRA.md), just as we did with ArtAug (https://modelscope.cn/models/DiffSynth-Studio/ArtAug-lora-FLUX.1dev-v1). We collected 100 images generated by Qwen-Image, used Qwen-Image-i2L-Coarse and Qwen-Image-i2L-Fine to produce LoRAs, then trained another LoRA to redirect generated images back toward Qwen-Image's own generated images. This new model is a static, regular LoRA we named Qwen-Image-i2L-Bias.
Thus, this patched-together MoE architecture of Coarse + Fine + Bias finally, if barely, meets our expectations. What can the model be used for? The LoRAs it generates still fall short of conventionally trained LoRAs, but we can use them as initialization weights for LoRA training.
Wait—the model's generated LoRA rank is fixed; how can it be used for LoRA training initialization? Simple: perform PCA matrix decomposition on the LoRA weights to reset the rank to any value. See https://github.com/modelscope/DiffSynth-Studio/blob/main/diffsynth/utils/lora/reset_rank.py
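For reference, a rank reset can be sketched with a truncated SVD of the full low-rank update; the linked reset_rank.py is the authoritative implementation, and this standalone version ignores any alpha scaling.

```python
import torch


def reset_lora_rank(lora_A, lora_B, new_rank):
    """Re-factor a LoRA update to an arbitrary rank via truncated SVD.

    lora_A: (old_rank, in_features), lora_B: (out_features, old_rank).
    The full update B @ A is decomposed and the top `new_rank` components are
    kept, so the new pair approximates the original update as closely as the
    new rank allows (exactly, when new_rank >= old_rank).
    """
    delta_w = lora_B @ lora_A                          # (out_features, in_features)
    u, s, vh = torch.linalg.svd(delta_w, full_matrices=False)
    u, s, vh = u[:, :new_rank], s[:new_rank], vh[:new_rank]
    new_B = u * s.sqrt()                               # (out_features, new_rank)
    new_A = s.sqrt().unsqueeze(1) * vh                 # (new_rank, in_features)
    return new_A, new_B


# Example: expand a rank-16 predicted LoRA to rank 32 before fine-tuning it further.
A, B = torch.randn(16, 3072), torch.randn(3072, 16)
new_A, new_B = reset_lora_rank(A, B, new_rank=32)
```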
Now let's try using it to initialize LoRA training weights.
Training data:
Sample generation during training:
| Training Steps | Random Initialization | Image-to-LoRA Initialization |
|---|---|---|
| 100 steps | ![]() | ![]() |
| 200 steps | ![]() | ![]() |
| 300 steps | ![]() | ![]() |
| 400 steps | ![]() | ![]() |
| 500 steps | ![]() | ![]() |
As we can see, when using the Image-to-LoRA model for LoRA training initialization, the model can already generate objects similar to the training data in the early stages of training.
Future Work
The "Image-to-LoRA" concept is incredibly ambitious. After going all-in, we've achieved decent model performance, but we clearly feel there's enormous room for improvement. Therefore, we'll continue exploring the potential of such models. As for what cooler things we'll do next—we'll keep that suspenseful. Stay tuned for future surprises!