Diffusion based virtual try on : DiMVTON

Date

2025

Abstract

Diffusion models have recently set new standards for realism in virtual try-on tasks, yet most existing systems depend on additional modules such as Reference Networks, complex image/text encoders, and heavy preprocessing pipelines. These extra components substantially increase the number of trainable parameters, GPU memory consumption, and overall computational cost. In this paper, we introduce DiMVTON, a highly efficient diffusion-based framework for virtual try-on that removes much of this complexity. Instead of relying on external conditioning networks, DiMVTON simply concatenates the person and garment inputs along the spatial dimension and feeds them directly into a streamlined denoising UNet. Our approach is driven by three main efficiency goals: (1) Compact architecture: DiMVTON uses only a VAE and a minimal UNet without cross-attention or external encoders, for a total model size of 894.29 million parameters. (2) Selective fine-tuning: comprehensive studies reveal that the UNet's self-attention layers are the critical elements for aligning garments onto individuals. Fine-tuning only these layers, further optimized through Low-Rank Adaptation (LoRA), enables strong performance with just 0.39 million trainable parameters (around 0.04% of the backbone). (3) Minimal inference overhead: unlike other diffusion-based models that require auxiliary information such as human parsing maps, pose annotations, or textual descriptions, DiMVTON needs only a person image, a garment reference, and a simple mask, cutting memory usage by over 90%. Despite being trained on a relatively small dataset of 13,000 samples, DiMVTON achieves competitive qualitative and quantitative results and generalizes well to real-world scenarios. Our findings suggest that high-quality virtual try-on is possible without complex architectures, provided that fine-tuning is applied strategically to key network components.
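The two mechanisms named in the abstract, spatial concatenation of the inputs and low-rank adaptation of selected layers, can be illustrated with a short shape-and-parameter-count sketch. This is not code from the thesis. The function names, the latent shape (4, 64, 48), the layer width 320, and the rank 4 are all hypothetical values chosen for illustration. Only the totals 894.29 million and 0.39 million come from the abstract.

```python
# Illustrative sketch only: helper names, shapes, width, and rank are assumptions,
# not values or code taken from the thesis.

def spatial_concat_shape(person_shape, garment_shape, axis):
    """Shape of two (C, H, W) tensors joined along a spatial axis (1 = H, 2 = W).

    Joining along a spatial axis keeps the channel count unchanged, in contrast
    to channel-wise concatenation.
    """
    if len(person_shape) != 3 or len(garment_shape) != 3:
        raise ValueError("expected (C, H, W) shapes")
    if axis not in (1, 2):
        raise ValueError("axis must be a spatial axis: 1 (height) or 2 (width)")
    for i in range(3):
        if i != axis and person_shape[i] != garment_shape[i]:
            raise ValueError("non-concatenated dimensions must match")
    out = list(person_shape)
    out[axis] = person_shape[axis] + garment_shape[axis]
    return tuple(out)


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns an update B @ A with B of shape (d_out, rank) and A of shape
    (rank, d_in), while the original weight stays frozen.
    """
    return rank * (d_in + d_out)


def trainable_percent(trainable, total):
    """Trainable parameters as a percentage of the total parameter count."""
    return 100.0 * trainable / total


# Hypothetical latent shapes: width doubles, channels and height are unchanged.
joined = spatial_concat_shape((4, 64, 48), (4, 64, 48), axis=2)

# Hypothetical 320 x 320 projection adapted at rank 4, versus full fine-tuning.
lora_params = lora_param_count(320, 320, rank=4)
full_params = 320 * 320

# Figures quoted in the abstract: 0.39M trainable out of 894.29M total.
percent = trainable_percent(0.39e6, 894.29e6)
```

With these assumed values, the joined latent has shape (4, 64, 96), and the rank-4 adapter trains 2,560 parameters in place of the 102,400 in the full projection. The percentage computed from the abstract's own figures is consistent with the stated "around 0.04%".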

Citation

De Zoysa, R.S.N. (2025). Diffusion based virtual try on : DiMVTON [Master’s thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. https://dl.lib.uom.lk/handle/123/24527
