INSAIT introduces BgGPT-Gemma-2-9B-IT-v1.0, a state-of-the-art Bulgarian language model based on google/gemma-2-9b and google/gemma-2-9b-it. BgGPT-Gemma-2-9B-IT-v1.0 is free to use and distributed under the Gemma Terms of Use. This model was created by INSAIT, part of Sofia University St. Kliment Ohridski, in Sofia, Bulgaria.

Model description

The model was built on top of Google’s Gemma 2 9B open models. It was continuously pre-trained on around 100 billion tokens (85 billion of them in Bulgarian) using the Branch-and-Merge strategy that INSAIT presented at EMNLP’24, which allowed the model to gain strong Bulgarian cultural and linguistic capabilities while retaining its English performance. The pre-training stage used a variety of datasets, including Bulgarian web crawl data, freely available datasets such as Wikipedia, a range of specialized Bulgarian datasets sourced by the INSAIT Institute, and machine translations of popular English datasets. The model was then instruction-fine-tuned on a newly constructed Bulgarian instruction dataset created using real-world conversations.
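
To make the Branch-and-Merge idea more concrete, here is a toy sketch of the general pattern: split the training data into slices, train several branches from the same starting weights, merge them, and repeat from the merged result. It is not INSAIT's implementation. The names `branch_and_merge`, `average_weights`, and `toy_train` are hypothetical, the "model" is a plain list of floats, the training step is a stand-in, and element-wise averaging is an assumed merge rule chosen for simplicity.

```python
from typing import Callable, List, Sequence

Weights = List[float]


def average_weights(models: Sequence[Weights]) -> Weights:
    """Merge models by element-wise mean (an assumed, deliberately simple merge rule)."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def branch_and_merge(
    base: Weights,
    data_slices: Sequence[Sequence[float]],
    train_fn: Callable[[Weights, Sequence[float]], Weights],
    branches_per_round: int = 2,
) -> Weights:
    """Train branches from a shared starting point on separate slices, then merge."""
    current = list(base)
    for start in range(0, len(data_slices), branches_per_round):
        group = data_slices[start:start + branches_per_round]
        # Every branch in this round starts from the same merged weights.
        branches = [train_fn(current, data_slice) for data_slice in group]
        # The merged result becomes the starting point for the next round.
        current = average_weights(branches)
    return current


def toy_train(weights: Weights, data_slice: Sequence[float]) -> Weights:
    """Hypothetical stand-in for training: move each weight halfway to the slice mean."""
    target = sum(data_slice) / len(data_slice)
    return [w + 0.5 * (target - w) for w in weights]


merged = branch_and_merge(
    base=[0.0, 2.0],
    data_slices=[[1.0, 1.0], [3.0, 3.0], [2.0], [4.0]],
    train_fn=toy_train,
)
print(merged)  # [2.0, 2.5]
```

Because each branch only sees one slice and the branches are averaged before the next round, no single slice can pull the weights far from the shared starting point. That is the intuition behind retaining the base model's English performance while adding Bulgarian data.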