Enhancing video generation based on text-to-image diffusion models using a multimodal approach

Date

2025

Publisher

IEEE

Abstract

The rapid evolution of Artificial Intelligence (AI) has revolutionized multimedia creation, particularly through Text-to-Image (T2I) diffusion models, which synthesize images from textual descriptions with impressive fidelity. Building on this progress, video GIF generation has emerged as a promising frontier; however, Text-to-Video (T2V) GIF synthesis remains underexplored, with little research addressing this specific application. Such synthesis expands creative expression and finds utility in education, marketing, and entertainment, and the lack of focused work highlights both the novelty of and the opportunity for impactful contributions. This study focuses on optimizing T2V GIF generation for computational and parameter efficiency. The resulting model keeps its parameter count below 1 billion, enabling faster training, reduced inference time, lower memory usage, and compatibility with low-end hardware. Inspired by previous work that combines 2×2 grid diffusion with frame-interpolation models, this research proposes a simplified approach using a single Stable Diffusion model: it generates all 16 frames of an animated GIF within a 4×4 grid, eliminating the prior post-processing steps. Given that the GIF format emphasizes animation over fine detail, this parameter-efficient method is well suited to the task.
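A minimal sketch of the grid-and-slice idea described above, assuming the Hugging Face diffusers library and a generic base Stable Diffusion checkpoint standing in for the paper's fine-tuned sub-1B-parameter model; the checkpoint name, prompt, resolution, and frame duration below are illustrative assumptions, not the authors' configuration:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed base checkpoint; the paper's model is a fine-tuned variant trained
# to lay out 16 temporally coherent frames as a single 4x4 grid image.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical prompt; a grid-trained model would not need this phrasing.
prompt = "a 4x4 grid of animation frames of a cat waving"
grid = pipe(prompt, height=512, width=512).images[0]  # one 512x512 image

# Slice the 4x4 grid into 16 individual 128x128 frames, row-major order.
rows, cols = 4, 4
fw, fh = grid.width // cols, grid.height // rows
frames = [
    grid.crop((c * fw, r * fh, (c + 1) * fw, (r + 1) * fh))
    for r in range(rows)
    for c in range(cols)
]

# Assemble the frames directly into an animated GIF; no interpolation or
# other post-processing step is applied.
frames[0].save(
    "animation.gif",
    save_all=True,
    append_images=frames[1:],
    duration=100,  # milliseconds per frame (~10 fps), an assumed value
    loop=0,
)
```

The single-pass design is the point of the sketch: because every frame comes from one diffusion call, the frame-interpolation model used in prior 2×2 grid work is no longer needed.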
