Datasets

Today we release a preliminary artifact of our ongoing effort in curating a strong multilingual datasets. In this early form, the dataset contains roughly 230M heavily cleaned documents from 10 languages. Occiglot Fineweb builds on our existing collection of curated datasets and pre-filtered web data. Subsequently, all documents were filtered with language-specific derivatives of the fine-web processing pipeline and globally deduplicated. The current version of the dataset (v0.5) is available on huggingface and we will publicly release our datatrove based pipeline shortly....