Cross-ViT: cross-attention vision transformer for image duplicate detection

Date

2023-12-07

Publisher

Information Technology Research Unit, Faculty of Information Technology, University of Moratuwa.

Abstract

Duplicate detection in image databases is important across diverse domains, either as a standalone process or as a component of broader workflows. This study explores vision transformer architectures for feature extraction in duplicate image identification. Our proposed framework, Cross-ViT, combines the conventional transformer architecture with a cross-attention layer developed specifically for this study. This cross-attention transformer takes pairs of images as input, enabling cross-attention operations that capture the relationships between features of the two images. Through successive iterations of Cross-ViT, we assess the ranking capability of each version, highlighting the role of the cross-attention layer inserted between transformer blocks. Our research culminates in a recommended final model that combines higher-dimensional hidden embeddings with mid-size ViT variants to optimize image-pair ranking. The performance of the proposed framework was assessed through a comparative evaluation against baseline CNN models on several benchmark datasets, which further supports the effectiveness of the approach. Notably, the contribution of this study is not a new feature extraction method but a novel cross-attention layer between transformer blocks, grounded in the scaled dot-product attention mechanism.
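The abstract does not include implementation details, so the following is only a minimal sketch of what a cross-attention layer between transformer blocks based on scaled dot-product attention could look like for an image pair. The class name, head count, embedding dimension, residual connection, and the choice of which image supplies queries versus keys/values are illustrative assumptions, not the authors' actual design.

```python
# Hypothetical sketch (not the authors' code): tokens of image A attend to
# tokens of image B via scaled dot-product cross-attention.
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Cross-attention between the token sequences of two images."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)        # queries from image A tokens
        self.kv = nn.Linear(dim, 2 * dim)   # keys and values from image B tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        B, Na, D = tokens_a.shape
        Nb = tokens_b.shape[1]
        # Split into heads: (B, heads, tokens, head_dim)
        q = self.q(tokens_a).view(B, Na, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = (self.kv(tokens_b)
                .view(B, Nb, 2, self.num_heads, self.head_dim)
                .permute(2, 0, 3, 1, 4))
        # Scaled dot-product attention across the two images' tokens
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Na, D)
        # Residual connection is an assumption for illustration only
        return tokens_a + self.proj(out)


if __name__ == "__main__":
    # Example: token embeddings for a pair of images, ViT-Base-like sizes assumed
    x_a = torch.randn(2, 197, 768)   # [batch, tokens, dim] for image A
    x_b = torch.randn(2, 197, 768)   # token embeddings for image B
    cross = CrossAttention(dim=768)
    fused_a = cross(x_a, x_b)        # image A tokens enriched with image B context
    print(fused_a.shape)             # torch.Size([2, 197, 768])
```

In a pairwise setup such as the one described, a layer of this kind would typically be interleaved between standard ViT blocks (and applied symmetrically in both directions) so that each image's representation is conditioned on the other before the pair is scored for duplication.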
