Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

dc.contributor.author: Kreutzer, J.
dc.contributor.author: Caswell, I.
dc.contributor.author: Wang, L.
dc.contributor.author: Wahab, A.
dc.contributor.author: van Esch, D.
dc.contributor.author: Ulzii-Orshikh, N.
dc.contributor.author: Tapo, A.
dc.contributor.author: Subramani, N.
dc.contributor.author: Sokolov, A.
dc.contributor.author: Sikasote, C.
dc.contributor.author: Setyawan, M.
dc.contributor.author: Sarin, S.
dc.contributor.author: Samb, S.
dc.contributor.author: Sagot, B.
dc.contributor.author: Rivera, C.
dc.contributor.author: Rios, A.
dc.contributor.author: Papadimitriou, I.
dc.contributor.author: Osei, S.
dc.contributor.author: Suarez, P. O.
dc.contributor.author: Adeyemi, M.
dc.date.accessioned: 2023-11-24T05:46:00Z
dc.date.available: 2023-11-24T05:46:00Z
dc.date.issued: 2022
dc.description.abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases. [en_US]
dc.identifier.citation: Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447 [en_US]
dc.identifier.doi: https://doi.org/10.1162/tacl_a_00447 [en_US]
dc.identifier.issn: 2307-387X [en_US]
dc.identifier.journal: Transactions of the Association for Computational Linguistics [en_US]
dc.identifier.pgnos: 50-72 [en_US]
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/21721
dc.identifier.volume: 10 [en_US]
dc.identifier.year: 2022 [en_US]
dc.language.iso: en [en_US]
dc.publisher: MIT Press [en_US]
dc.title: Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets [en_US]
dc.type: Article-Full-text [en_US]
