Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Kreutzer, J.; Caswell, I.; Wang, L.; Wahab, A.; van Esch, D.; Ulzii-Orshikh, N.; Tapo, A.; Subramani, N.; Sokolov, A.; Sikasote, C.; Setyawan, M.; Sarin, S.; Samb, S.; Sagot, B.; Rivera, C.; Rios, A.; Papadimitriou, I.; Osei, S.; Suarez, P. O.; Adeyemi, M.

dc.contributor.author	Kreutzer, J.
dc.contributor.author	Caswell, I
dc.contributor.author	Wang, L
dc.contributor.author	Wahab, A.
dc.contributor.author	van Esch, D
dc.contributor.author	Ulzii-Orshikh, N
dc.contributor.author	Tapo, A
dc.contributor.author	Subramani, N
dc.contributor.author	Sokolov, A
dc.contributor.author	Sikasote, C
dc.contributor.author	Setyawan, M
dc.contributor.author	Sarin, S.
dc.contributor.author	Samb, S.
dc.contributor.author	Sagot, B
dc.contributor.author	Rivera, C.
dc.contributor.author	Rios, A
dc.contributor.author	Papadimitriou, I.
dc.contributor.author	Osei, S.
dc.contributor.author	Suarez, P. O
dc.contributor.author	Adeyemi, M
dc.date.accessioned	2023-11-24T05:46:00Z
dc.date.available	2023-11-24T05:46:00Z
dc.date.issued	2022
dc.identifier.citation	Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447	en_US
dc.identifier.issn	2307-387X	en_US
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/21721
dc.description.abstract	With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.	en_US
dc.language.iso	en	en_US
dc.publisher	MIT Press	en_US
dc.title	Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets	en_US
dc.type	Article-Full-text	en_US
dc.identifier.year	2022	en_US
dc.identifier.journal	Transactions of the Association for Computational Linguistics	en_US
dc.identifier.volume	10	en_US
dc.identifier.pgnos	50-72	en_US
dc.identifier.doi	https://doi.org/10.1162/tacl_a_00447	en_US