Posts

A large-scale and multilingual dataset generated from 40 monthly Common Crawl snapshots.

Community OSCAR: A Community Effort for Multilingual Web Data

Data is a key ingredient and differentiator for deep learning models such as large language models. Although this is well known in industry, the topic of data remains under-explored by the academic and open research community. Only recently has the effect of data on large-scale models been investigated with more attention. Notable examples are Scaling Data-Constrained Language Models (Muennighoff et al., 2023), OLMo 1.7–7B: A 24 point improvement on MMLU, and Hugging Face’s FineWeb....

Announcing Occiglot-Fineweb

Today we release a preliminary artifact of our ongoing effort to curate a strong multilingual dataset. In this early form, the dataset contains roughly 230M heavily cleaned documents from 10 languages. Occiglot Fineweb builds on our existing collection of curated datasets and pre-filtered web data. All documents were then filtered with language-specific derivatives of the FineWeb processing pipeline and globally deduplicated. The current version of the dataset (v0.5) is available on Hugging Face, and we will publicly release our datatrove-based pipeline shortly....

New Set of German Language Models

In a joint effort with DiscoResearch, we release a set of new German language models, available on Hugging Face. All models are based on Llama-3-8B and were continually pre-trained on 65B high-quality German tokens from our occiglot-fineweb dataset. As with our prior releases, we provide both base and instruction-tuned versions of the model. In addition to these variants, which were trained solely with an 8k context, we also release a long-context variant (DiscoResearch/Llama3_German_8B_32k)....

Tokenizer Evaluation on European Languages

Intro: The tokenizer is a vital component of any LLM, encoding sequences of text into a pre-defined set of tokens. However, the tokenizer is built separately from the LLM itself and undergoes a separate training phase with its own training data. Consequently, the tokenizers of most commercial models are heavily optimized for English text, with varying performance for non-English languages. Since Occiglot is building LLMs for non-English languages based on existing models and tokenizers, we need to gain a thorough understanding of their inherent performance on the languages we aim to support....

A polyglot language model for the Occident.

Announcing Occiglot: Polyglot Language Models for the Occident

Mission Statement: Recent advancements in transformer-based language models have demonstrated the potentially disruptive impact of this technology. Unfortunately, the high cost and specialized skill sets required to train Large Language Models (LLMs) leave the field dominated by a handful of big tech companies and deep tech startups, making core European values such as linguistic diversity, multilingualism, and cultural richness an afterthought of economically driven decisions. Occiglot strongly believes that dedicated language modeling solutions are key to maintaining Europe’s academic and economic competitiveness and AI sovereignty....

Technical Report

TBA