Enhancing video generation based on text-to-image diffusion models using a multimodal approach

Sanjula Appuhamy, KD; Miriya Thanthrige, USKP

dc.contributor.author	Sanjula Appuhamy, KD
dc.contributor.author	Miriya Thanthrige, USKP
dc.date.accessioned	2026-01-14T06:56:37Z
dc.date.issued	2025
dc.description.abstract	The rapid evolution of Artificial Intelligence (AI) has transformed multimedia creation, particularly through Text-to-Image (T2I) diffusion models, which synthesize images from textual descriptions. Building on this, generating video GIFs from text has become a promising frontier. However, Text-to-Video (T2V) GIF generation remains underexplored, with limited research addressing this specific application. T2V GIF synthesis expands creative expression and finds utility in education, marketing, and entertainment, and the lack of focused work presents both novelty and an opportunity for impactful contributions. This study focuses on optimizing T2V GIF generation in terms of computational and parameter efficiency. The resulting model maintains a parameter count below 1 billion, enabling faster training, reduced inference time, lower memory usage, and compatibility with low-end hardware. Inspired by previous work that used 2 × 2 grid diffusion and frame interpolation models, this research proposes a simplified approach using a single Stable Diffusion model. It generates all 16 frames of an animated GIF within a single 4 × 4 grid, eliminating the post-processing steps used in prior work. Because the GIF format emphasizes animation over fine detail, this parameter-efficient method is well suited to the task.
dc.identifier.conference	Moratuwa Engineering Research Conference 2025
dc.identifier.department	Engineering Research Unit, University of Moratuwa
dc.identifier.email	dilankasanjula@gmail.com
dc.identifier.email	sampathk@uom.lk
dc.identifier.faculty	Engineering
dc.identifier.isbn	979-8-3315-6724-8
dc.identifier.pgnos	pp. 304-309
dc.identifier.proceeding	Proceedings of Moratuwa Engineering Research Conference 2025
dc.identifier.uri	https://dl.lib.uom.lk/handle/123/24726
dc.language.iso	en
dc.publisher	IEEE
dc.subject	text-to-image
dc.subject	text-to-video
dc.subject	text-to-video GIF
dc.title	Enhancing video generation based on text-to-image diffusion models using a multimodal approach
dc.type	Conference-Full-text
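
The abstract describes generating all 16 frames of a GIF inside one 4 × 4 grid image. As a rough illustration of how such a grid could be turned back into a frame sequence, here is a minimal sketch. It is not the authors' code: the function name `split_grid`, the 512 × 512 grid size, and the row-major frame order (left to right, then top to bottom) are all assumptions made for this example.

```python
from typing import List

import numpy as np


def split_grid(grid: np.ndarray, rows: int = 4, cols: int = 4) -> List[np.ndarray]:
    """Slice an (H, W, C) grid image into rows * cols equally sized frames.

    Frames are returned in row-major order, which is an assumption here;
    the record does not state the ordering used in the paper.
    """
    height, width = grid.shape[:2]
    if height % rows or width % cols:
        raise ValueError("grid height and width must be divisible by rows and cols")
    frame_h, frame_w = height // rows, width // cols
    return [
        grid[r * frame_h:(r + 1) * frame_h, c * frame_w:(c + 1) * frame_w]
        for r in range(rows)
        for c in range(cols)
    ]


# Stand-in for a generated grid: a blank 512 x 512 RGB image (hypothetical size).
grid = np.zeros((512, 512, 3), dtype=np.uint8)

# Mark each tile with its own index so the frame order can be checked.
for index in range(16):
    r, c = divmod(index, 4)
    grid[r * 128:(r + 1) * 128, c * 128:(c + 1) * 128] = index

frames = split_grid(grid)
print(len(frames), frames[0].shape)
```

The resulting frames could then be written out as an animated GIF with an imaging library, for example by converting each array to a Pillow image and saving with `save_all=True` and `append_images`.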

Files

Original bundle

Name: 1571153356.pdf
Size: 2.88 MB
Format: Adobe Portable Document Format

Collections

MERCon - 2025