Cross-ViT: cross-attention vision transformer for image duplicate detection

dc.contributor.author: Chandrasiri, MDN
dc.contributor.author: Talagala, PD
dc.contributor.editor: Piyatilake, ITS
dc.contributor.editor: Thalagala, PD
dc.contributor.editor: Ganegoda, GU
dc.contributor.editor: Thanuja, ALARR
dc.contributor.editor: Dharmarathna, P
dc.date.accessioned: 2024-02-06T08:36:41Z
dc.date.available: 2024-02-06T08:36:41Z
dc.date.issued: 2023-12-07
dc.description.abstract: Duplicate detection in image databases has immense significance across diverse domains. Its utility transcends specific applications, serving either as a standalone process or as an integrated component within broader workflows. This study explores the vision transformer architecture for feature extraction in the context of duplicate image identification. Our proposed framework combines the conventional transformer architecture with a cross-attention layer developed specifically for this study. This cross-attention transformer processes pairs of images as input, enabling cross-attention operations that capture the interconnections and relationships between the distinct features of the two images. Through successive iterations of Cross-ViT, we assess the ranking capability of each version, highlighting the vital role of the cross-attention layer integrated between transformer blocks. Our research culminates in a recommended optimal model that exploits the synergy between higher-dimensional hidden embeddings and mid-size ViT variants, thereby optimizing image-pair ranking. The performance of the proposed framework was assessed through a comprehensive comparative evaluation against baseline CNN models on several benchmark datasets, and the results underscore the potential of the vision transformer and its cross-attention layer for duplicate image detection. Notably, the contribution of this study lies not in new feature extraction methods but in a novel cross-attention layer between transformer blocks grounded in the scaled dot-product attention mechanism.
dc.identifier.conference: 8th International Conference in Information Technology Research 2023
dc.identifier.department: Information Technology Research Unit, Faculty of Information Technology, University of Moratuwa.
dc.identifier.email: dncnawodya@gmail.com
dc.identifier.email: priyangad@uom.lk
dc.identifier.faculty: IT
dc.identifier.pgnos: pp. 1-6
dc.identifier.place: Moratuwa, Sri Lanka
dc.identifier.proceeding: Proceedings of the 8th International Conference in Information Technology Research 2023
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/22194
dc.identifier.year: 2023
dc.language.iso: en
dc.publisher: Information Technology Research Unit, Faculty of Information Technology, University of Moratuwa.
dc.subject: Duplicate image detection
dc.subject: Vision transformers
dc.subject: Attention
dc.title: Cross-ViT: cross-attention vision transformer for image duplicate detection
dc.type: Conference-Full-text
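
The abstract above describes a cross-attention layer inserted between transformer blocks, built on the scaled dot-product attention mechanism and applied to the token embeddings of an image pair. The record does not include the authors' implementation, so the following is only a minimal PyTorch sketch of such a layer under stated assumptions: the single-head design, the ViT-Base-style shapes (197 tokens, 768 dimensions), the cosine-similarity pair score, and all class and variable names are hypothetical, not taken from the paper.

```python
# Minimal sketch (not the authors' code): cross-attention between the token
# embeddings of two images, using scaled dot-product attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttention(nn.Module):
    """Single-head cross-attention: queries from image A's tokens, keys/values from image B's."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # tokens_a: (batch, n_a, dim); tokens_b: (batch, n_b, dim)
        q = self.q_proj(tokens_a)
        k = self.k_proj(tokens_b)
        v = self.v_proj(tokens_b)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (batch, n_a, n_b) scaled dot-product scores
        attn = attn.softmax(dim=-1)
        return attn @ v  # (batch, n_a, dim): image-A tokens enriched with image-B context


if __name__ == "__main__":
    batch, n_tokens, dim = 2, 197, 768            # ViT-Base-like token shape (assumed)
    tokens_a = torch.randn(batch, n_tokens, dim)  # token embeddings of the first image
    tokens_b = torch.randn(batch, n_tokens, dim)  # token embeddings of the second image
    cross = CrossAttention(dim)
    fused_a = cross(tokens_a, tokens_b)           # attend image 1 over image 2
    fused_b = cross(tokens_b, tokens_a)           # and vice versa
    # One possible pair score: cosine similarity of the mean-pooled fused tokens.
    score = F.cosine_similarity(fused_a.mean(dim=1), fused_b.mean(dim=1))
    print(score.shape)  # torch.Size([2])
```

In the framework outlined in the abstract, an operation of this kind would sit between transformer blocks and contribute to ranking image pairs; the toy score above is only one plausible way to turn the fused tokens into a pairwise similarity.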

Files

Original bundle

Name: Cross-ViT.pdf
Size: 926.34 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed to upon submission
