A large-scale and multilingual dataset generated from 40 monthly Common Crawl snapshots.

Community OSCAR: A Community Effort for Multilingual Web Data

Data is a key ingredient and differentiator for deep learning models such as large language models. Although this is well known in industry, the topic of data remains under-explored by the academic and open research community. Only recently has the effect of data on large-scale models been investigated with more attention. Notable examples are Scaling Data-Constrained Language Models (Muennighoff et al., 2023), OLMo 1.7–7B: A 24 point improvement on MMLU, and HuggingFace’s FineWeb....

August 26, 2024 · Manuel Brack, Malte Ostendorff, Pedro Ortiz Suarez and other contributors