Summarization of large-scale videos to text format using supervised based simple rule-based machine learning models

Date

2022

Abstract

Video summarization (VS) has been one of the most actively researched fields since the late 2000s, driven by the growth of social media and the internet and by the need for concise, meaningful summaries of large-scale videos. Although VS has been explored through both traditional non-ML techniques and ML-based techniques, generating accurate and relevant summaries from video remains a limitation. To address this, a variety of techniques have been attempted, including vision-based and NLP-based approaches. Inspired by Transformer networks from NLP, researchers have sought to integrate such sequence-based learning algorithms into the video domain to perform spatiotemporal feature extraction. Beyond standard VS implementations, one extension of VS has received growing emphasis, namely text-based video summarization (TVS), which produces the summary of a video in text form. The evolution from VS to TVS has not been a straightforward journey, and many of its obstacles have been addressed using unsupervised (UL), reinforcement (RL), and supervised learning (SL) based frameworks. Among state-of-the-art TVS methods, Transformer-based approaches are the most prominent, alongside T5-based NLP frameworks. Since this area is still in its early stages, many open questions and issues remain to be explored; in particular, the attention-based sequence modelling of the learning algorithm must be handled carefully to achieve the best accuracy improvements. Ultimately, such improvements are intended to be applied in real-time applications. To that end, a novel standalone method with the simplest possible network layout, applicable to embedded devices, should be introduced. This is where the Simple Rule-based Machine Learning Network for Text-based Video Summarization (SiRuML-TVS) is introduced. Although the network takes a single large-scale video as input and produces a single meaningful description as output, the high-level network layout comprises three ML modules: video recognition, object detection, and finally text generation. Each module is assessed against its own evaluation criteria, while the end-to-end network is evaluated on a single metric. Different combinations of modules affect the performance of the entire pipeline; the combination of Transformers and CNNs provides the best trade-off between accuracy and inference cost. This offers the prospect of deploying the proposed method on an edge device, thereby closing the gap between theoretical explanation and practical implementation.
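
The abstract describes a single-input, single-output pipeline built from three ML modules. As a rough illustration only, the Python sketch below mirrors that high-level layout with stubbed modules; every class name, method signature, and the frame-sampling step are assumptions introduced here for clarity, not an API published by the thesis.

```python
# Minimal structural sketch of the SiRuML-TVS pipeline described in the abstract.
# Module names, interfaces, and the frame-sampling step are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """A single sampled video frame (placeholder for decoded pixel data)."""
    index: int
    data: bytes = b""


class VideoRecognitionModule:
    """Assumed Transformer-style module that labels the overall activity."""
    def recognize(self, frames: List[Frame]) -> str:
        return "unknown_activity"  # stub: a real model would infer this


class ObjectDetectionModule:
    """Assumed CNN-style module that lists salient objects across frames."""
    def detect(self, frames: List[Frame]) -> List[str]:
        return []  # stub: a real detector would return object labels


class TextGenerationModule:
    """Assumed T5-like module that turns recognized content into a summary."""
    def generate(self, activity: str, objects: List[str]) -> str:
        if objects:
            return f"The video shows {activity} involving {', '.join(objects)}."
        return f"The video shows {activity}."


def summarize_video(frames: List[Frame]) -> str:
    """End-to-end pipeline: single video input -> single text summary."""
    recognizer = VideoRecognitionModule()
    detector = ObjectDetectionModule()
    generator = TextGenerationModule()

    activity = recognizer.recognize(frames)
    objects = detector.detect(frames)
    return generator.generate(activity, objects)


if __name__ == "__main__":
    # Sample one frame every 30 (e.g. ~1 fps on a 30 fps clip) as a stand-in
    # for a real video decoder.
    sampled = [Frame(index=i) for i in range(0, 300, 30)]
    print(summarize_video(sampled))
```

In such a layout, each module could be evaluated in isolation (as the abstract notes, against its own criteria) while the composed `summarize_video` function is judged on a single end-to-end metric.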

Keywords

TEXT-BASED VIDEO SUMMARIZATION, DEEP LEARNING MODELS, VIDEO SUMMARIZATION, INFORMATION TECHNOLOGY -Dissertation, ARTIFICIAL INTELLIGENCE -Dissertation, COMPUTATIONAL MATHEMATICS -Dissertation

Citation

Sugathadasa, U.K.H.A. (2022). Summarization of large-scale videos to text format using supervised based simple rule-based machine learning models [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21478