Today we release a preliminary artifact of our ongoing effort to curate a strong multilingual dataset. In this early form, the dataset contains roughly 230M heavily cleaned documents from 10 languages. Occiglot Fineweb builds on our existing collection of curated datasets and pre-filtered web data. All documents were then filtered with language-specific derivatives of the FineWeb processing pipeline and globally deduplicated.

The current version of the dataset (v0.5) is available on Hugging Face, and we will publicly release our datatrove-based pipeline shortly.
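For convenience, the dataset can be streamed directly with the datasets library. Below is a minimal sketch; the repository id occiglot/occiglot-fineweb-v0.5 and the per-language configuration name "de" are assumptions based on this release, so check the dataset card for the exact identifiers:

```python
from datasets import load_dataset

# Stream the (assumed) German split without downloading the full dataset.
ds = load_dataset("occiglot/occiglot-fineweb-v0.5", "de", streaming=True)

# Peek at the first few documents.
for doc in ds["train"].take(3):
    print(doc["text"][:200])
```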

In collaboration with DiscoResearch, we also release a set of strong German models based on Llama-3 that were continually pre-trained on the German split of occiglot-fineweb-v0.5. More information on the model release is available here.

Pipeline Details

We utilized two main data sources in our collection process. From LLM-Datasets we took all available datasets for the considered languages (excluding OSCAR). Additionally, we sourced web-crawled data from 12 Common Crawl releases from 2015 until 2023. All releases were then processed with OSCAR’s Ungoliant pipeline. In this form, the dataset largely overlaps with the training data used for the initial release of the Occiglot models.

All data was rigorously filtered using language-specific pipelines built upon Hugging Face’s FineWeb filters. In addition to some minor hyperparameter adjustments, we mainly modified three aspects to ensure language-specific quality filtering (a minimal sketch of these checks follows the list):

  1. Adjusted the average word-length filters according to the linguistic characteristics of each language
  2. Added language-specific stop words
  3. Added a language-specific filter for cookie banners and privacy-policy boilerplate
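To make these three modifications concrete, here is a minimal, self-contained sketch in plain Python. All thresholds, stop-word lists, and policy phrases are illustrative placeholders, not the values used in our pipeline, which implements these checks as filter components on top of the FineWeb filters:

```python
import re

# Illustrative per-language parameters; the actual values used in our
# pipeline differ per language and are tuned separately.
AVG_WORD_LEN = {"de": (3.0, 12.0), "cs": (3.0, 11.0)}  # (min, max), hypothetical
STOP_WORDS = {
    "de": {"der", "die", "das", "und", "ist"},
    "cs": {"a", "je", "se", "na", "to"},
}
POLICY_PHRASES = {
    "de": ["cookie-richtlinie", "datenschutzerklärung"],
    "cs": ["zásady cookies", "ochrana osobních údajů"],
}

def keep_document(text: str, lang: str) -> bool:
    words = re.findall(r"\w+", text.lower())
    if not words:
        return False

    # 1. Average word length, adjusted to each language's characteristics.
    lo, hi = AVG_WORD_LEN[lang]
    avg_len = sum(len(w) for w in words) / len(words)
    if not lo <= avg_len <= hi:
        return False

    # 2. Require a minimal share of language-specific stop words, a cheap
    #    signal that the text is natural running prose in that language.
    stop_ratio = sum(w in STOP_WORDS[lang] for w in words) / len(words)
    if stop_ratio < 0.02:  # threshold is a placeholder
        return False

    # 3. Drop cookie banners and privacy-policy boilerplate.
    lowered = text.lower()
    if any(phrase in lowered for phrase in POLICY_PHRASES[lang]):
        return False

    return True
```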

Lastly, we performed MinHash deduplication on all data of each language separately. Importantly, we always retain the copy that is not contained in the web-crawled data. For example, if a Wikipedia page also appears in OSCAR, we drop the OSCAR duplicate, thus keeping the Wikipedia subset complete. This dataset structure allows us to reliably over- or undersample the curated subsets without some of the respective documents re-appearing elsewhere in the data.
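The priority rule is easiest to see in code. The following is a simplified sketch using the datasketch library rather than our actual datatrove-based MinHash stage; the shingle size, number of permutations, and similarity threshold are illustrative:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:
        m.update(shingle.encode("utf-8"))
    return m

def dedup(curated_docs, web_docs, threshold: float = 0.8):
    """Deduplicate one language's data; curated copies always win."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []

    # Index the curated documents first, so their copies are the ones
    # retained whenever a near-duplicate pair spans both sources.
    for i, doc in enumerate(curated_docs):
        lsh.insert(f"curated-{i}", minhash(doc["text"]))
        kept.append(doc)

    # A web-crawled document is kept only if nothing similar exists yet.
    for i, doc in enumerate(web_docs):
        m = minhash(doc["text"])
        if not lsh.query(m):
            lsh.insert(f"web-{i}", m)
            kept.append(doc)

    return kept
```

Because the curated subsets are indexed before any web-crawled data, over- or undersampling them later cannot resurface dropped duplicates from the web portion.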

Insights and Next Steps

One of the key takeaways from analyzing the cleaning process was the number of duplicates across the entire dataset. While some overlap is always to be expected, prior research had suggested that different Common Crawl releases were largely disjoint. Therefore, we did not deduplicate our OSCAR data for the initial Occiglot release. However, we observed substantial amounts of duplicates in our dataset. Interestingly, though, there are significant differences between languages:

| Language   | Duplicate Documents | # Total Documents (after filtering) |
|------------|--------------------:|------------------------------------:|
| Czech      | 15.19%              | 38.71M                              |
| Greek      | 25.10%              | 17.01M                              |
| Portuguese | 35.21%              | 34.85M                              |
| Spanish    | 41.74%              | 72.17M                              |
| Italian    | 45.43%              | 31.75M                              |
| Polish     | 46.35%              | 18.68M                              |
| French     | 49.13%              | 61.80M                              |
| Dutch      | 50.20%              | 32.42M                              |
| German     | 50.92%              | 88.43M                              |
| Slovak     | 66.23%              | 8.47M                               |

The origin of these large differences remains unclear and warrants further investigation. Further, we observed a consistent improvement in the quality of Common Crawl data over time. The change in quality becomes most evident when considering the percentage of documents dropped in the filtering process. We show exemplary numbers for German, but these observations generally hold for most languages:

| Common Crawl Release (OSCAR split) | Dropped Documents (Bad Quality) | # Total Documents (before filtering) |
|------------------------------------|--------------------------------:|-------------------------------------:|
| 2015-14                            | 33.84%                          | 796,292                               |
| 2016-40                            | 25.45%                          | 2,499,685                             |
| 2017-43                            | 10.29%                          | 7,959,532                             |
| 2018-47                            | 11.53%                          | 7,901,961                             |
| 2019-22                            | 12.40%                          | 8,597,472                             |
| 2020-24                            | 13.49%                          | 8,025,944                             |
| 2020-45                            | 13.01%                          | 7,242,192                             |
| 2021-49                            | 12.77%                          | 8,784,646                             |
| 2022-27                            | 12.22%                          | 9,515,644                             |
| 2022-49                            | 11.48%                          | 11,127,806                            |
| 2023-14                            | 10.99%                          | 10,156,164                            |
| 2023-23                            | 10.52%                          | 11,078,020                            |

We are actively working on extending this preliminary dataset. For one, our unfiltered dataset contains an additional 20 languages for which we are building dedicated filters. Further, we are sourcing more data by processing additional Common Crawl releases and investigating other data sources.

We are actively seeking collaborations, so please feel free to reach out via mail or join our Discord Server.