Data is a key ingredient and differentiator for deep learning models such as large language models. Despite this fact being well known in the industry, the data topic is still pretty much under-explored by the academic and open research commmunity. Only recently, the effect of data on large-scale models getting investigated with more attention. Notable examples are Scaling Data-Constrained Language Models (Muennighoff et al., 2023), OLMo 1.7–7B: A 24 point improvement on MMLU, or HuggingFace’s FineWeb. However, most data work is focussed on the English language. To change this and make more multilingual data available to the community, we are releasing Community-OSCAR as a large-scale multilingual dataset.
Community-OSCAR is an unofficial version of the OSCAR Corpus created by community members. The annotation schema follows the OSCAR 23.01 release but is based on 40 monthly dumps of Common Crawl ranging from 2024-22 to 2014-42. With these forty dumps, Community-OSCAR is the largest release of the OSCAR Corpus so far and available for download on Huggingface.
About OSCAR
The OSCAR project (Open Super-large Crawled Aggregated coRpus) is an Open Source project aiming to provide web-based multilingual resources and datasets for Machine Learning (ML) and Artificial Intelligence (AI) applications. The project focuses specifically in providing large quantities of unannotated raw data that is commonly used in the pre-training of large deep learning models. The OSCAR project has developed high-performance data pipelines specifically conceived to classify and filter large amounts of web data. The project has also put special attention in improving the data quality of web-based corpora as well as providing data for low-resource languages, so that these new ML/AI technologies are accessible to as many communities as possible.
Community Effort
Community-OSCAR is a release created by members of the OSCAR community and part of an ongoing effort in close colaboration with the Occiglot research collective. We are working on extending this release to all publicly available CommonCrawl dumps and have plenty of ideas on further improvements. If you want to support our activities and collaborate with us, please join the Discord server from the OSCAR project or Occiglot research collective.
Next steps
Building the Community-OSCAR dataset is a critical step in advancing core projects at Occiglot. We realized that larger amounts of raw text data were needed to achieve our goal of training dedicated LLMs for underrepresented European languages. In the release of our Occiglot-fineweb data curation pipeline, we discussed that many languages needed more (clean) data for meaningful continual pre-training. Community-OSCAR closes that gap for multiple languages.
Consequently, we are actively working on cleaning all of this new data and deduplicating it against other data sources from our multilingual corpus. We expect to release an updated version of Occiglot-fineweb soon.
Initial experiments on continual-pretraining of LLama-3.1 using this new Occiglot-fineweb data delivered promising results. Stay tuned for future model releases.
Data Usage
OSCAR is mainly intended to pre-train language models and word representations.
NOTE: Community-OSCAR contains the raw unfiltered Common Crawl text data but with quality annotations. For language model training, we highly recommend filtering the data first with these annotations. A prefiltered version of the dataset will be released in the near future (following the approach from Occiglot-FineWeb).
Data Annotations
Each sample comes with a series of annotations that allow the removal of low quality data.
identification
: Language identification based on fastText.harmful_pp
: This perplexity comes from a KenLM model trained on harmful content, previously gathered by using the adult annotation in OSCAR 22.01. In other terms, the lower it is, the more likely a given document contains harmful/adult content.tlsh
: We use TLSH to compute a hash for each document. Locality sensitive hashing is a hashing method that computes similar hashes for similar documents.quality_warnings
: Computed through heuristics (see below).categories
: Content categories arom a URL-based blocklist
The annotation schema is the same as in the OSCAR 23.01 release.
Quality Warnings
tiny
: The document has a low (<5) number of lines.short_sentences
: The document has a high number (>50%) of short lines (<400 bytes)header
: The document has a high number of short lines at its head, suggesting the presence of low quality content.footer
: The document has a high number of short lines at its tail, suggesting the presence of low quality content.noisy
: The document has a high percentage of punctuation (>50%)adult
: The document contains adult content. This annotation uses a blocklist and labels a tiny part of the corpus: It does not catch most of the adult content.
More information about the thresholds and annotators are present in the OSCAR paper.
Data Format
The data is stored as ZSTD-compressed JSON line files. Each individual data sample has the following JSON schema:
{
"content":"English sentence\nphrase en français\n????????????", // (1)
"warc_headers":{ // (2)
"warc-identified-content-language":"fra,eng",
"warc-target-uri":"https://fr.wikipedia.org/wiki/...",
"warc-record-id":"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>",
"warc-type":"conversion",
"content-length":"35298", // (3)
"warc-refers-to":"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>",
"warc-block-digest":"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB", // (3)
"warc-date":"2022-11-26T09:45:47Z",
"content-type":"text/plain"
},
"metadata":{
"identification":{ // (4)
"label":"fr",
"prob":0.8938327
},
"harmful_pp":4063.1814, // (5)
"tlsh":"tlsh:T125315FF2B6088901EEA097015DB39B4600B...", // (6)
"quality_warnings":[ // (7)
"short_sentences",
"header",
"footer"
],
"categories":[ // (8)
"examen_pix",
"liste_bu"
],
"sentence_identifications":[ // (9)
{
"label":"fr",
"prob":0.99837273
},
{
"label":"en",
"prob":0.9992377
},
null
]
}
}
Language statistics
All the data is distributed by language and release name. Up to 151 different languages are available. The table below provides the language code as well as the number of documents and data sizes as number of bytes. The statistics are computed based on uncompressed data and on estimates calculated on a subset of 10 releases and extrapolated to all 40 releases (snapshots).
Lang Code | Language | Data (avg./release) | #Docs (avg./release) | Data (Total) | #Docs (Total) |
---|---|---|---|---|---|
Total | 8.86TiB | 1.15B | 345.68TiB | 44.96B | |
en | English | 3.45TiB | 494.16M | 134.37TiB | 19.27B |
ru | Russian | 1.20TiB | 96.07M | 46.92TiB | 3.75B |
zh | Chinese | 613.63GiB | 62.11M | 23.37TiB | 2.42B |
de | German | 561.88GiB | 81.08M | 21.40TiB | 3.16B |
es | Spanish | 475.72GiB | 60.63M | 18.12TiB | 2.36B |
fr | French | 419.17GiB | 58.92M | 15.96TiB | 2.30B |
it | Italian | 249.10GiB | 32.29M | 9.49TiB | 1.26B |
ja | Japanese | 219.77GiB | 43.40M | 8.37TiB | 1.69B |
pt | Portuguese | 203.89GiB | 27.28M | 7.77TiB | 1.06B |
pl | Polish | 170.47GiB | 23.17M | 6.49TiB | 903.45M |
nl | Dutch | 130.30GiB | 24.96M | 4.96TiB | 973.49M |
vi | Vietnamese | 114.41GiB | 12.21M | 4.36TiB | 476.32M |
tr | Turkish | 96.24GiB | 13.81M | 3.67TiB | 538.74M |
th | Thai | 88.60GiB | 5.70M | 3.37TiB | 222.36M |
el | Greek | 86.20GiB | 7.55M | 3.28TiB | 294.35M |
fa | Persian | 85.95GiB | 9.06M | 3.27TiB | 353.36M |
ar | Arabic | 79.63GiB | 8.85M | 3.03TiB | 345.20M |
cs | Czech | 79.57GiB | 13.36M | 3.03TiB | 520.91M |
hu | Hungarian | 61.75GiB | 7.64M | 2.35TiB | 298.04M |
sv | Swedish | 58.76GiB | 9.12M | 2.24TiB | 355.79M |
uk | Ukrainian | 55.63GiB | 5.35M | 2.12TiB | 208.73M |
ro | Romanian | 44.18GiB | 4.78M | 1.68TiB | 186.58M |
fi | Finnish | 41.05GiB | 5.54M | 1.56TiB | 216.22M |
bg | Bulgarian | 40.78GiB | 3.50M | 1.55TiB | 136.56M |
ko | Korean | 40.72GiB | 6.64M | 1.55TiB | 258.87M |
he | Hebrew | 35.28GiB | 3.67M | 1.34TiB | 143.27M |
hi | Hindi | 26.86GiB | 1.91M | 1.02TiB | 74.41M |
id | Indonesian | 19.90GiB | 2.97M | 776.11GiB | 115.74M |
sk | Slovak | 16.86GiB | 2.93M | 657.67GiB | 114.44M |
lt | Lithuanian | 16.51GiB | 2.30M | 644.06GiB | 89.69M |
bn | Bangla | 16.01GiB | 1.34M | 624.28GiB | 52.43M |
ca | Catalan | 15.52GiB | 2.80M | 605.22GiB | 109.04M |
da | Danish | 15.48GiB | 2.92M | 603.77GiB | 113.89M |
ta | Tamil | 12.22GiB | 592959 | 476.59GiB | 23.13M |
multi | (Multilingual | 11.25GiB | 1.23M | 438.72GiB | 48.10M |
et | Estonian | 9.69GiB | 1.58M | 377.73GiB | 61.53M |
lv | Latvian | 9.21GiB | 1.17M | 359.30GiB | 45.64M |
sr | Serbian | 8.40GiB | 707589 | 327.71GiB | 27.60M |
ka | Georgian | 8.21GiB | 584674 | 320.12GiB | 22.80M |
hy | Armenian | 5.36GiB | 429595 | 209.22GiB | 16.75M |
ml | Malayalam | 5.18GiB | 309609 | 202.15GiB | 12.07M |
az | Azerbaijani | 4.69GiB | 623101 | 182.77GiB | 24.30M |
te | Telugu | 4.20GiB | 275599 | 163.96GiB | 10.75M |
ne | Nepali | 4.07GiB | 444325 | 158.86GiB | 17.33M |
kk | Kazakh | 3.88GiB | 328963 | 151.50GiB | 12.83M |
mr | Marathi | 3.77GiB | 274081 | 146.84GiB | 10.69M |
sq | Albanian | 3.45GiB | 517937 | 134.54GiB | 20.20M |
ur | Urdu | 3.40GiB | 362277 | 132.57GiB | 14.13M |
mk | Macedonian | 3.36GiB | 400036 | 131.09GiB | 15.60M |
no | Norwegian | 3.14GiB | 1.14M | 122.33GiB | 44.36M |
gu | Gujarati | 2.80GiB | 136843 | 109.38GiB | 5.34M |
my | Burmese | 2.65GiB | 179399 | 103.43GiB | 7.00M |
is | Icelandic | 2.58GiB | 481072 | 100.54GiB | 18.76M |
kn | Kannada | 2.58GiB | 163184 | 100.44GiB | 6.36M |
be | Belarusian | 2.48GiB | 238007 | 96.75GiB | 9.28M |
mn | Mongolian | 2.24GiB | 210386 | 87.31GiB | 8.21M |
km | Khmer | 2.24GiB | 149373 | 87.30GiB | 5.83M |
si | Sinhala | 2.09GiB | 118158 | 81.58GiB | 4.61M |
sl | Slovenian | 1.46GiB | 446273 | 56.82GiB | 17.40M |
tg | Tajik | 1.24GiB | 77984 | 48.36GiB | 3.04M |
eu | Basque | 1.13GiB | 263109 | 44.00GiB | 10.26M |
tt | Tatar | 920.15MiB | 84012 | 35.04GiB | 3.28M |
pa | Punjabi | 898.29MiB | 72439 | 34.21GiB | 2.83M |
ckb | Central Kurdish | 750.60MiB | 92285 | 28.59GiB | 3.60M |
tl | Filipino | 631.67MiB | 80646 | 24.06GiB | 3.15M |
ky | Kyrgyz | 623.06MiB | 77011 | 23.73GiB | 3.00M |
eo | Esperanto | 580.29MiB | 108817 | 22.10GiB | 4.24M |
am | Amharic | 547.46MiB | 48292 | 20.85GiB | 1.88M |
or | Odia | 496.36MiB | 60706 | 18.90GiB | 2.37M |
ps | Pashto | 387.50MiB | 47916 | 14.76GiB | 1.87M |
lo | Lao | 372.95MiB | 39067 | 14.20GiB | 1.52M |
cy | Welsh | 372.43MiB | 76253 | 14.18GiB | 2.97M |
bo | Tibetan | 357.33MiB | 21258 | 13.61GiB | 829092 |
gl | Galician | 297.21MiB | 97402 | 11.32GiB | 3.80M |
as | Assamese | 272.93MiB | 21615 | 10.39GiB | 843002 |
dv | Divehi | 260.77MiB | 31759 | 9.93GiB | 1.24M |
ug | Uyghur | 242.46MiB | 20923 | 9.23GiB | 816010 |
ba | Bashkir | 217.16MiB | 27124 | 8.27GiB | 1.06M |
yi | Yiddish | 192.94MiB | 20338 | 7.35GiB | 793216 |
ku | Kurdish | 170.66MiB | 33380 | 6.50GiB | 1.30M |
sd | Sindhi | 134.92MiB | 15493 | 5.14GiB | 604257 |
hr | Croatian | 127.55MiB | 14853 | 4.86GiB | 579271 |
sa | Sanskrit | 89.94MiB | 8055 | 3.43GiB | 314149 |
fy | Western Frisian | 75.62MiB | 23863 | 2.88GiB | 930683 |
pnb | Western Panjabi | 75.06MiB | 9212 | 2.86GiB | 359285 |
sah | Yakut | 69.11MiB | 8513 | 2.63GiB | 332024 |
cv | Chuvash | 55.44MiB | 7122 | 2.11GiB | 277788 |
ga | Irish | 52.85MiB | 15642 | 2.01GiB | 610051 |
br | Breton | 51.74MiB | 23122 | 1.97GiB | 901792 |
af | Afrikaans | 40.90MiB | 12674 | 1.56GiB | 494307 |
ceb | Cebuano | 39.70MiB | 4826 | 1.51GiB | 188248 |
os | Ossetic | 31.11MiB | 7109 | 1.18GiB | 277251 |
uz | Uzbek | 28.14MiB | 14415 | 1.07GiB | 562198 |
azb | South Azerbaijani | 24.39MiB | 8158 | 951.16MiB | 318188 |
lb | Luxembourgish | 20.03MiB | 6532 | 781.04MiB | 254748 |
mg | Malagasy | 14.80MiB | 4148 | 577.37MiB | 161776 |
mhr | Eastern Mari | 14.14MiB | 2538 | 551.30MiB | 98990 |
ce | Chechen | 13.20MiB | 3722 | 514.95MiB | 145192 |
nds | Low German | 12.37MiB | 1912 | 482.44MiB | 74572 |
xmf | Mingrelian | 11.04MiB | 2970 | 430.42MiB | 115864 |
nn | Norwegian Nynorsk | 10.52MiB | 8843 | 410.29MiB | 344877 |
ms | Malay | 8.62MiB | 4859 | 336.00MiB | 189522 |
sh | Serbian | 6.55MiB | 1053 | 255.53MiB | 41080 |
la | Latin | 5.68MiB | 4928 | 221.41MiB | 192196 |
new | Newari | 5.42MiB | 764 | 211.23MiB | 29796 |
min | Minangkabau | 5.29MiB | 1186 | 206.41MiB | 46267 |
tk | Turkmen | 4.04MiB | 1755 | 157.44MiB | 68458 |
arz | Egyptian Arabic | 3.44MiB | 1406 | 134.19MiB | 54838 |
mt | Maltese | 3.35MiB | 2699 | 130.67MiB | 105291 |
gom | Goan Konkani | 2.83MiB | 158 | 110.20MiB | 6192 |
bpy | Bishnupriya | 2.56MiB | 373 | 99.73MiB | 14581 |
pms | Piedmontese | 2.03MiB | 392 | 79.18MiB | 15288 |
ast | Asturian | 1.99MiB | 1856 | 77.55MiB | 72392 |
jbo | Lojban | 1.67MiB | 241 | 65.21MiB | 9425 |
oc | Occitan | 1.64MiB | 229 | 63.88MiB | 8944 |
vo | Volapük | 1.44MiB | 502 | 55.97MiB | 19612 |
sw | Swahili | 924.46KiB | 699 | 35.21MiB | 27274 |
war | Waray | 920.54KiB | 227 | 35.06MiB | 8883 |
lez | Lezghian | 634.65KiB | 128 | 24.17MiB | 5009 |
mrj | Western Mari | 538.12KiB | 124 | 20.49MiB | 4870 |
gsw | Swiss German | 521.38KiB | 164 | 19.86MiB | 6422 |
wuu | Wu Chinese | 376.25KiB | 93 | 14.33MiB | 3631 |
gd | Scottish Gaelic | 289.86KiB | 250 | 11.04MiB | 9758 |
wa | Walloon | 258.48KiB | 38 | 9.84MiB | 1490 |
hsb | Upper Sorbian | 207.47KiB | 139 | 7.90MiB | 5425 |
su | Sundanese | 116.86KiB | 15 | 4.45MiB | 619 |
ia | Interlingua | 113.60KiB | 42 | 4.33MiB | 1668 |
mzn | Mazanderani | 99.49KiB | 29 | 3.79MiB | 1165 |
krc | Karachay-Balkar | 98.21KiB | 55 | 3.74MiB | 2149 |
lmo | Lombard | 54.95KiB | 47 | 2.09MiB | 1859 |
av | Avaric | 54.20KiB | 32 | 2.06MiB | 1256 |
kv | Komi | 46.52KiB | 32 | 1.77MiB | 1274 |
yo | Yoruba | 41.61KiB | 39 | 1.58MiB | 1529 |
bar | Bavarian | 38.47KiB | 42 | 1.47MiB | 1664 |
jv | Javanese | 36.89KiB | 33 | 1.41MiB | 1304 |
bxr | Russia Buriat | 25.13KiB | 24 | 980.09KiB | 962 |
ilo | Iloko | 23.76KiB | 19 | 926.45KiB | 771 |
mai | Maithili | 21.95KiB | 8 | 855.96KiB | 312 |
io | Ido | 20.53KiB | 22 | 800.52KiB | 866 |
an | Aragonese | 12.92KiB | 11 | 503.92KiB | 437 |
so | Somali | 12.23KiB | 13 | 476.94KiB | 526 |
bs | Bosnian | 12.01KiB | 8 | 468.58KiB | 333 |
bh | Bihari languages | 10.55KiB | 9 | 411.39KiB | 380 |
nah | Nahuatl languages | 9.60KiB | 10 | 374.53KiB | 406 |
xal | Kalmyk | 9.33KiB | 8 | 363.76KiB | 312 |
ie | Interlingue | 4.34KiB | 1 | 169.12KiB | 58 |
li | Limburgish | 3.32KiB | 3 | 129.31KiB | 130 |
gn | Guarani | 2.96KiB | 2 | 115.35KiB | 107 |
dsb | Lower Sorbian | 1.88KiB | 2 | 73.47KiB | 78 |
kw | Cornish | 1.65KiB | 1 | 64.50KiB | 66 |
ht | Haitian Creole | 1.57KiB | 1 | 61.20KiB | 39 |
qu | Quechua | 1.45KiB | 1 | 56.41KiB | 61 |
x-eml | Unknown language [x-eml] | 1.36KiB | 1 | 53.02KiB | 39 |
rue | Rusyn | 1.18KiB | 1 | 46.20KiB | 39 |
lrc | Northern Luri | 1.15KiB | 1 | 44.67KiB | 39 |
diq | Dimli (individual language) | 120.75B | 1 | 36.79KiB | 39 |
rm | Romansh | 119.75B | 1 | 36.49KiB | 39 |
scn | Sicilian | 118.75B | 1 | 36.18KiB | 39 |
Community-OSCAR was put together by community members in close collaboration with the Occiglot research collective. The main contributors are Manuel Brack, Pedro Ortiz Suarez, Malte Ostendorff, Patrick Schramowski, Georg Rehm, Kristian Kersting, Jose Javier Saiz, Iñaki Lacunza Castilla, Alexander Shvets, Jorge Palomar-Giner, and Marta Villegas. Moreover, this release is supported by and was enabled by contributions from the OSCAR team at Inria (project-team ALMAnaCH), specially by Julien Abadji, Rua Ismail and Benoit Sagot, the Common Crawl Foundation, the SLT and SAINT teams at DFKI, TU Darmstadt, the LangTech unit at the Barcelona Supercomputing Center, the 42 supercomputer and Hessian AI, the OpenGPT-X project, Fraunhofer, Jülich Supercomputing Centre, TU Dresden, Deutsche Telekom, as well as by members of the OSCAR community, in particular Sotaro Takeshita, Sebastian Nagel.
More Information
More information can be found on the paper and the dataset card on Huggingface.