Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
dc.contributor.author | Kreutzer, J. | |
dc.contributor.author | Caswell, I. | |
dc.contributor.author | Wang, L. | |
dc.contributor.author | Wahab, A. | |
dc.contributor.author | van Esch, D. | |
dc.contributor.author | Ulzii-Orshikh, N. | |
dc.contributor.author | Tapo, A. | |
dc.contributor.author | Subramani, N. | |
dc.contributor.author | Sokolov, A. | |
dc.contributor.author | Sikasote, C. | |
dc.contributor.author | Setyawan, M. | |
dc.contributor.author | Sarin, S. | |
dc.contributor.author | Samb, S. | |
dc.contributor.author | Sagot, B. | |
dc.contributor.author | Rivera, C. | |
dc.contributor.author | Rios, A. | |
dc.contributor.author | Papadimitriou, I. | |
dc.contributor.author | Osei, S. | |
dc.contributor.author | Suarez, P. O. | |
dc.contributor.author | Adeyemi, M. | |
dc.date.accessioned | 2023-11-24T05:46:00Z | |
dc.date.available | 2023-11-24T05:46:00Z | |
dc.date.issued | 2022 | |
dc.description.abstract | With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases. | en_US |
dc.identifier.citation | Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447 | en_US |
dc.identifier.doi | https://doi.org/10.1162/tacl_a_00447 | en_US |
dc.identifier.issn | 2307-387X | en_US |
dc.identifier.journal | Transactions of the Association for Computational Linguistics | en_US |
dc.identifier.pgnos | 50-72 | en_US |
dc.identifier.uri | http://dl.lib.uom.lk/handle/123/21721 | |
dc.identifier.volume | 10 | en_US |
dc.identifier.year | 2022 | en_US |
dc.language.iso | en | en_US |
dc.publisher | MIT Press | en_US |
dc.title | Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets | en_US |
dc.type | Article-Full-text | en_US |