BgGPT is a Bulgarian language model built on top of Google’s Gemma 2.


BgGPT

Meet BgGPT, a Bulgarian language model built on top of Google’s Gemma 2. BgGPT is distributed under Gemma Terms of Use.

Versions 0.1 and 0.2 of the model were built on top of Mistral 0.1 and 0.2.

This model was created by INSAIT Institute, part of Sofia University, in Sofia, Bulgaria.

Model description

The model was built on top of Google’s Gemma 2 2B, 9B and 27B open models. It was continuously pre-trained on around 100 billion tokens (85 billion in Bulgarian) using the Branch-and-Merge strategy INSAIT presented at EMNLP’24, allowing the model to gain outstanding Bulgarian cultural and linguistic capabilities while retaining its English performance. During the pre-training stage, we used various datasets, including Bulgarian web crawl data, freely available datasets such as Wikipedia, a range of specialized Bulgarian datasets sourced by the INSAIT Institute, and machine translations of popular English datasets. The model was then instruction-fine-tuned on a newly constructed Bulgarian instruction dataset created from real-world conversations. For more information, check our blog post.

Usage

CLI

ollama run todorov/bggpt
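
A prompt can also be passed directly on the command line for a one-off generation. A minimal sketch (the Bulgarian prompt asks "When was Sofia University founded?"):

ollama run todorov/bggpt "Кога е основан Софийският университет?"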

API

Example:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "todorov/bggpt",
  "prompt": "Кога е основан Софийският университет?"
}'
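
The generate endpoint streams partial responses by default. For multi-turn conversations, Ollama also exposes a chat endpoint; a minimal sketch, where "stream": false requests a single complete JSON response and the prompt again asks "When was Sofia University founded?":

curl -X POST http://localhost:11434/api/chat -d '{
  "model": "todorov/bggpt",
  "messages": [
    { "role": "user", "content": "Кога е основан Софийският университет?" }
  ],
  "stream": false
}'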

References

BgGPT
BgGPT-Gemma-2-2.6B-IT-v1.0 on Hugging Face
BgGPT-Gemma-2-9B-IT-v1.0 on Hugging Face
BgGPT-Gemma-2-27B-IT-v1.0 on Hugging Face