Announcing Occiglot: Polyglot Language Models for the Occident

Mission Statement

Recent advancements in transformer-based language models have demonstrated the potentially disruptive impact of this technology. Unfortunately, the high cost and required skill sets associated with training Large Language Models (LLMs) leave the field dominated by a handful of big tech companies and deep tech startups, making core European values such as linguistic diversity, multilingualism, and cultural richness an afterthought of economically driven decisions.

Occiglot strongly believes that dedicated language modeling solutions are key to maintaining Europe’s academic and economic competitiveness and AI sovereignty. They are also necessary to achieve the long-term goal of digital language equality in Europe. Crucially, high-quality, fundamental research and IP-driven technological applications require direct access to these models and the data that went into training them. As an academic, non-profit research collective, Occiglot is committed to open science and, thus, open-source LLM development.

Model Release v0.1

As part of our commitment to transparent research, today we release ten intermediary 7B model checkpoints. This first release focuses on the five largest European languages: English, German, French, Spanish, and Italian.
We started from Mistral-7B, an existing pre-trained model for English, and performed bilingual continual pre-training and subsequent instruction tuning for each language. We also trained a multilingual model covering all five languages. All pre-trained and instruction-tuned checkpoints are available on Hugging Face under the Apache 2.0 license.

In total, we used 700B additional multilingual tokens during continual pre-training and roughly 1B tokens for instruction tuning. For more details, check out our technical report.
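One way to read the count of ten checkpoints is sketched below: four bilingual models (each non-English language paired with English), plus one five-language model, each in a pre-trained and an instruction-tuned variant. This breakdown is an interpretation of the description above, and the name prefix and suffixes used here are hypothetical; check the Hugging Face organization page for the actual repository names.

```python
from itertools import product

# Hypothetical naming scheme, for illustration only.
BASE = "occiglot-7b"
PARTNER_LANGUAGES = ["de", "fr", "es", "it"]  # each paired with English
VARIANTS = ["", "-instruct"]  # pre-trained and instruction-tuned


def checkpoint_names():
    """Enumerate checkpoint names under the assumed scheme."""
    # One bilingual family per non-English language.
    families = [f"{BASE}-{lang}-en" for lang in PARTNER_LANGUAGES]
    # One multilingual family covering all five languages.
    families.append(f"{BASE}-eu5")
    return [family + variant for family, variant in product(families, VARIANTS)]


names = checkpoint_names()
for name in names:
    print(name)
```

Under these assumptions the enumeration yields five model families with two variants each, which matches the ten checkpoints mentioned in the release.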

Call for Collaboration

We are actively seeking collaborations within the community and feedback from those that our work aims to benefit. Occiglot will mainly operate through our public Discord server to exchange ideas, discuss research, share findings, and coordinate projects. We strongly believe that the (academic and non-academic) Machine Learning, AI, and NLP communities can only benefit from openly sharing insights, so we aim to provide a hub for this exchange. In addition to building upon and working with our models, we see three major opportunities for collaboration:

Large-scale training data. (Continual) pre-training requires large amounts of high-quality text data in the target language, which is especially challenging for low-resource languages.
Instruction tuning data. Chat and instruction-following capabilities can only be created with corresponding examples in the target language. We hope to find partners for every European language that can help us create and curate these datasets.
Model evaluation. Evaluating the capabilities of LLMs is crucial to making informed decisions about their development and deployment. However, in non-English settings specifically, evaluation relies mostly on auto-translated, noisy benchmarks that do not accurately reflect model performance. Building better evaluation suites requires an intimate understanding of the language and culture, for which we seek local collaborators.

Furthermore, we welcome any ideas for collaboration that align with our mission and encourage researchers to reach out. Occiglot also calls upon existing localized projects to connect with the community for mutual benefit.

Roadmap

The main focus of Occiglot in the upcoming months will be the creation of one cohesive language modeling approach supporting all 24 official languages within the European Union and multiple unofficial and/or regional languages. Towards that target, we have already collected roughly 1 trillion tokens of non-English pre-training data. This corpus is continuously being expanded through additional data gathered by our collaborators and further web crawling. Furthermore, hessian.AI is committed to supporting the initiative by providing a significant amount of compute in 2024 on their AI supercomputer fortytwo (42).

We envision the creation of a European approach to follow three distinct phases.

Bilingual Models. Following the approach of the initial model release, we will train and open-source similar bilingual models for all European languages with enough readily available data. At the time of writing, these are: Dutch, Portuguese, Ukrainian, Czech, Slovak, Polish, Hungarian, Greek, and Bulgarian.
Research Questions. In the second phase, we will systematically investigate important research questions leading towards a multilingual European model. Our research will mainly focus on a) tokenizer design and extension for multilingual models, b) inter-lingual effects of joint multilingual model training for language clustering, and c) routing for multilingual mixtures of experts to, e.g., transfer cultural knowledge.
Mixture of European Experts. The efforts of phases one and two will culminate in creating a sophisticated mixture of dedicated language experts.