dc.contributor.author Kreutzer, J.
dc.contributor.author Caswell, I.
dc.contributor.author Wang, L.
dc.contributor.author Wahab, A.
dc.contributor.author van Esch, D.
dc.contributor.author Ulzii-Orshikh, N.
dc.contributor.author Tapo, A.
dc.contributor.author Subramani, N.
dc.contributor.author Sokolov, A.
dc.contributor.author Sikasote, C.
dc.contributor.author Setyawan, M.
dc.contributor.author Sarin, S.
dc.contributor.author Samb, S.
dc.contributor.author Sagot, B.
dc.contributor.author Rivera, C.
dc.contributor.author Rios, A.
dc.contributor.author Papadimitriou, I.
dc.contributor.author Osei, S.
dc.contributor.author Suarez, P. O.
dc.contributor.author Adeyemi, M.
dc.date.accessioned 2023-11-24T05:46:00Z
dc.date.available 2023-11-24T05:46:00Z
dc.date.issued 2022
dc.identifier.citation Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447 en_US
dc.identifier.issn 2307-387X en_US
dc.identifier.uri http://dl.lib.uom.lk/handle/123/21721
dc.description.abstract With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases. en_US
dc.language.iso en en_US
dc.publisher MIT Press en_US
dc.title Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets en_US
dc.type Article-Full-text en_US
dc.identifier.year 2022 en_US
dc.identifier.journal Transactions of the Association for Computational Linguistics en_US
dc.identifier.volume 10 en_US
dc.identifier.pgnos 50-72 en_US
dc.identifier.doi https://doi.org/10.1162/tacl_a_00447 en_US

