Data is a key ingredient and differentiator for deep learning models such as large language models. Although this is well known in industry, the topic of data remains largely under-explored by the academic and open research community. Only recently has the effect of data on large-scale models been investigated with more attention. Notable examples are Scaling Data-Constrained Language Models (Muennighoff et al., 2023), OLMo 1.7–7B: A 24 point improvement on MMLU, and HuggingFace’s FineWeb. However, most data work is focused on the English language. To change this and make more multilingual data available to the community, we are releasing Community-OSCAR as a large-scale multilingual dataset.

Community-OSCAR is an unofficial version of the OSCAR Corpus created by community members. The annotation schema follows the OSCAR 23.01 release but is based on 40 monthly dumps of Common Crawl, ranging from 2024-22 back to 2014-42. With these forty dumps, Community-OSCAR is the largest release of the OSCAR Corpus so far and is available for download on Huggingface.

About OSCAR

The OSCAR project (Open Super-large Crawled Aggregated coRpus) is an Open Source project aiming to provide web-based multilingual resources and datasets for Machine Learning (ML) and Artificial Intelligence (AI) applications. The project focuses specifically on providing large quantities of unannotated raw data that is commonly used in the pre-training of large deep learning models. The OSCAR project has developed high-performance data pipelines specifically conceived to classify and filter large amounts of web data. The project also pays special attention to improving the data quality of web-based corpora and to providing data for low-resource languages, so that these new ML/AI technologies are accessible to as many communities as possible.

Community Effort

Community-OSCAR is a release created by members of the OSCAR community and is part of an ongoing effort in close collaboration with the Occiglot research collective. We are working on extending this release to all publicly available Common Crawl dumps and have plenty of ideas for further improvements. If you want to support our activities and collaborate with us, please join the Discord server of the OSCAR project or the Occiglot research collective.

Next steps

Building the Community-OSCAR dataset is a critical step in advancing core projects at Occiglot. We realized that larger amounts of raw text data were needed to achieve our goal of training dedicated LLMs for underrepresented European languages. In the release of our Occiglot-fineweb data curation pipeline, we discussed that many languages needed more (clean) data for meaningful continual pre-training. Community-OSCAR closes that gap for multiple languages.

Consequently, we are actively working on cleaning all of this new data and deduplicating it against other data sources from our multilingual corpus. We expect to release an updated version of Occiglot-fineweb soon.

Initial experiments on continual pre-training of Llama-3.1 using this new Occiglot-fineweb data delivered promising results. Stay tuned for future model releases.

Data Usage

OSCAR is mainly intended for pre-training language models and word representations.

NOTE: Community-OSCAR contains the raw, unfiltered Common Crawl text data together with quality annotations. For language model training, we highly recommend filtering the data first with these annotations. A prefiltered version of the dataset will be released in the near future (following the approach from Occiglot-FineWeb).
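For illustration, here is a minimal filtering sketch using the Hugging Face datasets library. The repository id, configuration name, and all thresholds are assumptions chosen for illustration, not official recommendations; the per-document metadata is assumed to match the JSON schema shown in the Data Format section below.

# Minimal filtering sketch (repo id, config name, and thresholds are illustrative assumptions).
from datasets import load_dataset

# Stream one language configuration instead of downloading everything.
ds = load_dataset(
    "oscar-corpus/community_oscar",  # hypothetical repository id
    "de",                            # hypothetical language configuration
    split="train",
    streaming=True,
)

def keep(sample):
    meta = sample["metadata"]
    ident = meta.get("identification") or {}
    harmful_pp = meta.get("harmful_pp")
    return (
        not meta.get("quality_warnings")                  # drop documents with any quality warning
        and ident.get("prob", 0.0) >= 0.8                 # keep confident language identifications
        and (harmful_pp is None or harmful_pp > 25_000)   # low perplexity ~ likely harmful/adult content
    )

filtered = ds.filter(keep)
for doc in filtered.take(3):
    print(doc["warc_headers"]["warc-target-uri"])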

Data Annotations

Each sample comes with a series of annotations that allow the removal of low-quality data.

  • identification: Language identification based on fastText.
  • harmful_pp: This perplexity comes from a KenLM model trained on harmful content, previously gathered by using the adult annotation in OSCAR 22.01. In other words, the lower it is, the more likely it is that a given document contains harmful/adult content.
  • tlsh: We use TLSH to compute a hash for each document. Locality-sensitive hashing is a hashing method that computes similar hashes for similar documents, which makes it useful for near-deduplication; a sketch of such a check follows below.
  • quality_warnings: Computed through heuristics (see below).
  • categories: Content categories from a URL-based blocklist.

The annotation schema is the same as in the OSCAR 23.01 release.
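Since similar documents receive similar TLSH digests, the distance between two digests can serve as a near-duplicate signal. Below is a minimal sketch using the py-tlsh package; the distance threshold is an illustrative assumption, not a value used by the OSCAR pipeline.

# Near-duplicate check via TLSH distances (sketch; the threshold of 100 is illustrative).
import tlsh  # pip install py-tlsh

def tlsh_distance(digest_a: str, digest_b: str) -> int:
    # The metadata stores digests with a "tlsh:" prefix, e.g. "tlsh:T1253...".
    a = digest_a.removeprefix("tlsh:")
    b = digest_b.removeprefix("tlsh:")
    return tlsh.diff(a, b)  # 0 means identical; larger means more different

def is_near_duplicate(doc_a: dict, doc_b: dict, threshold: int = 100) -> bool:
    return tlsh_distance(doc_a["metadata"]["tlsh"], doc_b["metadata"]["tlsh"]) <= threshold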

Quality Warnings

  • tiny: The document has a low (<5) number of lines.
  • short_sentences: The document has a high number (>50%) of short lines (<400 bytes).
  • header: The document has a high number of short lines at its head, suggesting the presence of low-quality content.
  • footer: The document has a high number of short lines at its tail, suggesting the presence of low-quality content.
  • noisy: The document has a high percentage of punctuation (>50%).
  • adult: The document contains adult content. This annotation uses a blocklist and labels only a tiny part of the corpus; it does not catch most of the adult content.

More information about the thresholds and annotators is available in the OSCAR paper.
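To make the heuristics more concrete, here is a minimal sketch of how the tiny, short_sentences, and noisy warnings could be computed with the thresholds listed above. It is an illustration of the rules as described, not the implementation used to produce the annotations.

# Sketch of the line-based quality heuristics described above (thresholds as listed; not the official implementation).
import string

def compute_quality_warnings(content: str) -> list[str]:
    lines = content.split("\n")
    warnings = []

    # tiny: fewer than 5 lines.
    if len(lines) < 5:
        warnings.append("tiny")

    # short_sentences: more than 50% of lines are shorter than 400 bytes.
    short = [len(line.encode("utf-8")) < 400 for line in lines]
    if lines and sum(short) / len(lines) > 0.5:
        warnings.append("short_sentences")

    # noisy: more than 50% of characters are punctuation (ASCII punctuation only, for simplicity).
    n_chars = sum(len(line) for line in lines)
    n_punct = sum(ch in string.punctuation for line in lines for ch in line)
    if n_chars and n_punct / n_chars > 0.5:
        warnings.append("noisy")

    return warnings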

Data Format

The data is stored as ZSTD-compressed JSON line files. Each individual data sample has the following JSON schema:

{
   "content":"English sentence\nphrase en français\n????????????", // (1)
   "warc_headers":{ // (2)
      "warc-identified-content-language":"fra,eng",
      "warc-target-uri":"https://fr.wikipedia.org/wiki/...",
      "warc-record-id":"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>",
      "warc-type":"conversion",
      "content-length":"35298", // (3)
      "warc-refers-to":"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>",
      "warc-block-digest":"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB", // (3)
      "warc-date":"2022-11-26T09:45:47Z",
      "content-type":"text/plain"
   },
   "metadata":{
      "identification":{ // (4)
         "label":"fr",
         "prob":0.8938327
      },
      "harmful_pp":4063.1814, // (5)
      "tlsh":"tlsh:T125315FF2B6088901EEA097015DB39B4600B...", // (6)
      "quality_warnings":[ // (7)
         "short_sentences",
         "header",
         "footer"
      ],
      "categories":[ // (8)
         "examen_pix",
         "liste_bu"
      ],
      "sentence_identifications":[ // (9)
         {
            "label":"fr",
            "prob":0.99837273
         },
         {
            "label":"en",
            "prob":0.9992377
         },
         null
      ]
   }
}
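If you prefer to process shards directly instead of going through the datasets library, the sketch below iterates over one ZSTD-compressed JSON Lines file using the zstandard package; the file name is a placeholder.

# Reading a ZSTD-compressed JSON Lines shard (sketch; the file path is a placeholder).
import io
import json
import zstandard as zstd

def iter_documents(path: str):
    with open(path, "rb") as fh:
        # A large max_window_size in case shards were compressed with a long-distance window.
        reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)

for doc in iter_documents("some_shard.jsonl.zst"):
    print(doc["metadata"]["identification"]["label"], doc["warc_headers"]["warc-target-uri"])
    break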

Language statistics

All the data is distributed by language and release name. Up to 151 different languages are available. The table below provides the language code, the number of documents, and the data size in bytes per language. The statistics are computed on uncompressed data and are estimates calculated on a subset of 10 releases and extrapolated to all 40 releases (snapshots).

Lang Code | Language | Data (avg./release) | #Docs (avg./release) | Data (Total) | #Docs (Total)
Total | | 8.86TiB | 1.15B | 345.68TiB | 44.96B
en | English | 3.45TiB | 494.16M | 134.37TiB | 19.27B
ru | Russian | 1.20TiB | 96.07M | 46.92TiB | 3.75B
zh | Chinese | 613.63GiB | 62.11M | 23.37TiB | 2.42B
de | German | 561.88GiB | 81.08M | 21.40TiB | 3.16B
es | Spanish | 475.72GiB | 60.63M | 18.12TiB | 2.36B
fr | French | 419.17GiB | 58.92M | 15.96TiB | 2.30B
it | Italian | 249.10GiB | 32.29M | 9.49TiB | 1.26B
ja | Japanese | 219.77GiB | 43.40M | 8.37TiB | 1.69B
pt | Portuguese | 203.89GiB | 27.28M | 7.77TiB | 1.06B
pl | Polish | 170.47GiB | 23.17M | 6.49TiB | 903.45M
nl | Dutch | 130.30GiB | 24.96M | 4.96TiB | 973.49M
vi | Vietnamese | 114.41GiB | 12.21M | 4.36TiB | 476.32M
tr | Turkish | 96.24GiB | 13.81M | 3.67TiB | 538.74M
th | Thai | 88.60GiB | 5.70M | 3.37TiB | 222.36M
el | Greek | 86.20GiB | 7.55M | 3.28TiB | 294.35M
fa | Persian | 85.95GiB | 9.06M | 3.27TiB | 353.36M
ar | Arabic | 79.63GiB | 8.85M | 3.03TiB | 345.20M
cs | Czech | 79.57GiB | 13.36M | 3.03TiB | 520.91M
hu | Hungarian | 61.75GiB | 7.64M | 2.35TiB | 298.04M
sv | Swedish | 58.76GiB | 9.12M | 2.24TiB | 355.79M
uk | Ukrainian | 55.63GiB | 5.35M | 2.12TiB | 208.73M
ro | Romanian | 44.18GiB | 4.78M | 1.68TiB | 186.58M
fi | Finnish | 41.05GiB | 5.54M | 1.56TiB | 216.22M
bg | Bulgarian | 40.78GiB | 3.50M | 1.55TiB | 136.56M
ko | Korean | 40.72GiB | 6.64M | 1.55TiB | 258.87M
he | Hebrew | 35.28GiB | 3.67M | 1.34TiB | 143.27M
hi | Hindi | 26.86GiB | 1.91M | 1.02TiB | 74.41M
id | Indonesian | 19.90GiB | 2.97M | 776.11GiB | 115.74M
sk | Slovak | 16.86GiB | 2.93M | 657.67GiB | 114.44M
lt | Lithuanian | 16.51GiB | 2.30M | 644.06GiB | 89.69M
bn | Bangla | 16.01GiB | 1.34M | 624.28GiB | 52.43M
ca | Catalan | 15.52GiB | 2.80M | 605.22GiB | 109.04M
da | Danish | 15.48GiB | 2.92M | 603.77GiB | 113.89M
ta | Tamil | 12.22GiB | 592959 | 476.59GiB | 23.13M
multi | Multilingual | 11.25GiB | 1.23M | 438.72GiB | 48.10M
et | Estonian | 9.69GiB | 1.58M | 377.73GiB | 61.53M
lv | Latvian | 9.21GiB | 1.17M | 359.30GiB | 45.64M
sr | Serbian | 8.40GiB | 707589 | 327.71GiB | 27.60M
ka | Georgian | 8.21GiB | 584674 | 320.12GiB | 22.80M
hy | Armenian | 5.36GiB | 429595 | 209.22GiB | 16.75M
ml | Malayalam | 5.18GiB | 309609 | 202.15GiB | 12.07M
az | Azerbaijani | 4.69GiB | 623101 | 182.77GiB | 24.30M
te | Telugu | 4.20GiB | 275599 | 163.96GiB | 10.75M
ne | Nepali | 4.07GiB | 444325 | 158.86GiB | 17.33M
kk | Kazakh | 3.88GiB | 328963 | 151.50GiB | 12.83M
mr | Marathi | 3.77GiB | 274081 | 146.84GiB | 10.69M
sq | Albanian | 3.45GiB | 517937 | 134.54GiB | 20.20M
ur | Urdu | 3.40GiB | 362277 | 132.57GiB | 14.13M
mk | Macedonian | 3.36GiB | 400036 | 131.09GiB | 15.60M
no | Norwegian | 3.14GiB | 1.14M | 122.33GiB | 44.36M
gu | Gujarati | 2.80GiB | 136843 | 109.38GiB | 5.34M
my | Burmese | 2.65GiB | 179399 | 103.43GiB | 7.00M
is | Icelandic | 2.58GiB | 481072 | 100.54GiB | 18.76M
kn | Kannada | 2.58GiB | 163184 | 100.44GiB | 6.36M
be | Belarusian | 2.48GiB | 238007 | 96.75GiB | 9.28M
mn | Mongolian | 2.24GiB | 210386 | 87.31GiB | 8.21M
km | Khmer | 2.24GiB | 149373 | 87.30GiB | 5.83M
si | Sinhala | 2.09GiB | 118158 | 81.58GiB | 4.61M
sl | Slovenian | 1.46GiB | 446273 | 56.82GiB | 17.40M
tg | Tajik | 1.24GiB | 77984 | 48.36GiB | 3.04M
eu | Basque | 1.13GiB | 263109 | 44.00GiB | 10.26M
tt | Tatar | 920.15MiB | 84012 | 35.04GiB | 3.28M
pa | Punjabi | 898.29MiB | 72439 | 34.21GiB | 2.83M
ckb | Central Kurdish | 750.60MiB | 92285 | 28.59GiB | 3.60M
tl | Filipino | 631.67MiB | 80646 | 24.06GiB | 3.15M
ky | Kyrgyz | 623.06MiB | 77011 | 23.73GiB | 3.00M
eo | Esperanto | 580.29MiB | 108817 | 22.10GiB | 4.24M
am | Amharic | 547.46MiB | 48292 | 20.85GiB | 1.88M
or | Odia | 496.36MiB | 60706 | 18.90GiB | 2.37M
ps | Pashto | 387.50MiB | 47916 | 14.76GiB | 1.87M
lo | Lao | 372.95MiB | 39067 | 14.20GiB | 1.52M
cy | Welsh | 372.43MiB | 76253 | 14.18GiB | 2.97M
bo | Tibetan | 357.33MiB | 21258 | 13.61GiB | 829092
gl | Galician | 297.21MiB | 97402 | 11.32GiB | 3.80M
as | Assamese | 272.93MiB | 21615 | 10.39GiB | 843002
dv | Divehi | 260.77MiB | 31759 | 9.93GiB | 1.24M
ug | Uyghur | 242.46MiB | 20923 | 9.23GiB | 816010
ba | Bashkir | 217.16MiB | 27124 | 8.27GiB | 1.06M
yi | Yiddish | 192.94MiB | 20338 | 7.35GiB | 793216
ku | Kurdish | 170.66MiB | 33380 | 6.50GiB | 1.30M
sd | Sindhi | 134.92MiB | 15493 | 5.14GiB | 604257
hr | Croatian | 127.55MiB | 14853 | 4.86GiB | 579271
sa | Sanskrit | 89.94MiB | 8055 | 3.43GiB | 314149
fy | Western Frisian | 75.62MiB | 23863 | 2.88GiB | 930683
pnb | Western Panjabi | 75.06MiB | 9212 | 2.86GiB | 359285
sah | Yakut | 69.11MiB | 8513 | 2.63GiB | 332024
cv | Chuvash | 55.44MiB | 7122 | 2.11GiB | 277788
ga | Irish | 52.85MiB | 15642 | 2.01GiB | 610051
br | Breton | 51.74MiB | 23122 | 1.97GiB | 901792
af | Afrikaans | 40.90MiB | 12674 | 1.56GiB | 494307
ceb | Cebuano | 39.70MiB | 4826 | 1.51GiB | 188248
os | Ossetic | 31.11MiB | 7109 | 1.18GiB | 277251
uz | Uzbek | 28.14MiB | 14415 | 1.07GiB | 562198
azb | South Azerbaijani | 24.39MiB | 8158 | 951.16MiB | 318188
lb | Luxembourgish | 20.03MiB | 6532 | 781.04MiB | 254748
mg | Malagasy | 14.80MiB | 4148 | 577.37MiB | 161776
mhr | Eastern Mari | 14.14MiB | 2538 | 551.30MiB | 98990
ce | Chechen | 13.20MiB | 3722 | 514.95MiB | 145192
nds | Low German | 12.37MiB | 1912 | 482.44MiB | 74572
xmf | Mingrelian | 11.04MiB | 2970 | 430.42MiB | 115864
nn | Norwegian Nynorsk | 10.52MiB | 8843 | 410.29MiB | 344877
ms | Malay | 8.62MiB | 4859 | 336.00MiB | 189522
sh | Serbian | 6.55MiB | 1053 | 255.53MiB | 41080
la | Latin | 5.68MiB | 4928 | 221.41MiB | 192196
new | Newari | 5.42MiB | 764 | 211.23MiB | 29796
min | Minangkabau | 5.29MiB | 1186 | 206.41MiB | 46267
tk | Turkmen | 4.04MiB | 1755 | 157.44MiB | 68458
arz | Egyptian Arabic | 3.44MiB | 1406 | 134.19MiB | 54838
mt | Maltese | 3.35MiB | 2699 | 130.67MiB | 105291
gom | Goan Konkani | 2.83MiB | 158 | 110.20MiB | 6192
bpy | Bishnupriya | 2.56MiB | 373 | 99.73MiB | 14581
pms | Piedmontese | 2.03MiB | 392 | 79.18MiB | 15288
ast | Asturian | 1.99MiB | 1856 | 77.55MiB | 72392
jbo | Lojban | 1.67MiB | 241 | 65.21MiB | 9425
oc | Occitan | 1.64MiB | 229 | 63.88MiB | 8944
vo | Volapük | 1.44MiB | 502 | 55.97MiB | 19612
sw | Swahili | 924.46KiB | 699 | 35.21MiB | 27274
war | Waray | 920.54KiB | 227 | 35.06MiB | 8883
lez | Lezghian | 634.65KiB | 128 | 24.17MiB | 5009
mrj | Western Mari | 538.12KiB | 124 | 20.49MiB | 4870
gsw | Swiss German | 521.38KiB | 164 | 19.86MiB | 6422
wuu | Wu Chinese | 376.25KiB | 93 | 14.33MiB | 3631
gd | Scottish Gaelic | 289.86KiB | 250 | 11.04MiB | 9758
wa | Walloon | 258.48KiB | 38 | 9.84MiB | 1490
hsb | Upper Sorbian | 207.47KiB | 139 | 7.90MiB | 5425
su | Sundanese | 116.86KiB | 15 | 4.45MiB | 619
ia | Interlingua | 113.60KiB | 42 | 4.33MiB | 1668
mzn | Mazanderani | 99.49KiB | 29 | 3.79MiB | 1165
krc | Karachay-Balkar | 98.21KiB | 55 | 3.74MiB | 2149
lmo | Lombard | 54.95KiB | 47 | 2.09MiB | 1859
av | Avaric | 54.20KiB | 32 | 2.06MiB | 1256
kv | Komi | 46.52KiB | 32 | 1.77MiB | 1274
yo | Yoruba | 41.61KiB | 39 | 1.58MiB | 1529
bar | Bavarian | 38.47KiB | 42 | 1.47MiB | 1664
jv | Javanese | 36.89KiB | 33 | 1.41MiB | 1304
bxr | Russia Buriat | 25.13KiB | 24 | 980.09KiB | 962
ilo | Iloko | 23.76KiB | 19 | 926.45KiB | 771
mai | Maithili | 21.95KiB | 8 | 855.96KiB | 312
io | Ido | 20.53KiB | 22 | 800.52KiB | 866
an | Aragonese | 12.92KiB | 11 | 503.92KiB | 437
so | Somali | 12.23KiB | 13 | 476.94KiB | 526
bs | Bosnian | 12.01KiB | 8 | 468.58KiB | 333
bh | Bihari languages | 10.55KiB | 9 | 411.39KiB | 380
nah | Nahuatl languages | 9.60KiB | 10 | 374.53KiB | 406
xal | Kalmyk | 9.33KiB | 8 | 363.76KiB | 312
ie | Interlingue | 4.34KiB | 1 | 169.12KiB | 58
li | Limburgish | 3.32KiB | 3 | 129.31KiB | 130
gn | Guarani | 2.96KiB | 2 | 115.35KiB | 107
dsb | Lower Sorbian | 1.88KiB | 2 | 73.47KiB | 78
kw | Cornish | 1.65KiB | 1 | 64.50KiB | 66
ht | Haitian Creole | 1.57KiB | 1 | 61.20KiB | 39
qu | Quechua | 1.45KiB | 1 | 56.41KiB | 61
x-eml | Unknown language [x-eml] | 1.36KiB | 1 | 53.02KiB | 39
rue | Rusyn | 1.18KiB | 1 | 46.20KiB | 39
lrc | Northern Luri | 1.15KiB | 1 | 44.67KiB | 39
diq | Dimli (individual language) | 120.75B | 1 | 36.79KiB | 39
rm | Romansh | 119.75B | 1 | 36.49KiB | 39
scn | Sicilian | 118.75B | 1 | 36.18KiB | 39

Community-OSCAR was put together by community members in close collaboration with the Occiglot research collective. The main contributors are Manuel Brack, Pedro Ortiz Suarez, Malte Ostendorff, Patrick Schramowski, Georg Rehm, Kristian Kersting, Jose Javier Saiz, Iñaki Lacunza Castilla, Alexander Shvets, Jorge Palomar-Giner, and Marta Villegas. Moreover, this release was supported and enabled by contributions from the OSCAR team at Inria (project-team ALMAnaCH), especially Julien Abadji, Rua Ismail and Benoit Sagot, the Common Crawl Foundation, the SLT and SAINT teams at DFKI, TU Darmstadt, the LangTech unit at the Barcelona Supercomputing Center, the 42 supercomputer and Hessian AI, the OpenGPT-X project, Fraunhofer, Jülich Supercomputing Centre, TU Dresden, Deutsche Telekom, as well as by members of the OSCAR community, in particular Sotaro Takeshita and Sebastian Nagel.

More Information

More information can be found in the paper and on the dataset card on Huggingface.