BgGPT is a Bulgarian language model built on top of Google’s Gemma 2.
BgGPT
Meet BgGPT, a Bulgarian language model built on top of Google’s Gemma 2. BgGPT is distributed under the Gemma Terms of Use.
Earlier versions 0.1 and 0.2 of the model were built on top of Mistral 0.1 and 0.2.
This model was created by the INSAIT Institute, part of Sofia University, in Sofia, Bulgaria.
Model description
The model was built on top of Google’s Gemma 2 2B, 9B and 27B open models. It was continuously pre-trained on around 100 billion tokens (85 billion in Bulgarian) using the Branch-and-Merge strategy INSAIT presented at EMNLP’24, allowing the model to gain strong Bulgarian cultural and linguistic capabilities while retaining its English performance. During the pre-training stage, we used various datasets, including Bulgarian web crawl data, freely available datasets such as Wikipedia, a range of specialized Bulgarian datasets sourced by the INSAIT Institute, and machine translations of popular English datasets. The model was then instruction-fine-tuned on a newly constructed Bulgarian instruction dataset created using real-world conversations. For more information, see our blog post.
Usage
CLI
ollama run todorov/bggpt
API
Example (the prompt asks, in Bulgarian, “When was Sofia University founded?”):
curl -X POST http://localhost:11434/api/generate -d '{
"model": "todorov/bggpt",
"prompt":"Кога е основан Софийският университет?"
}'
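The same endpoint can also be called from code. Below is a minimal Python sketch using the requests library; it assumes an Ollama server is running locally on the default port 11434 and that the model has already been pulled as todorov/bggpt, matching the curl example above. Setting "stream": false returns a single JSON object instead of a stream of chunks.

import requests

# Minimal sketch: query a locally running Ollama server (default port 11434),
# assuming the model has been pulled as "todorov/bggpt".
payload = {
    "model": "todorov/bggpt",
    "prompt": "Кога е основан Софийският университет?",  # "When was Sofia University founded?"
    "stream": False,  # return one JSON object instead of streamed chunks
}

resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["response"])  # the model's answer, in Bulgarian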
References
BgGPT
BgGPT-Gemma-2-2.6B-IT-v1.0 on Hugging Face
BgGPT-Gemma-2-9B-IT-v1.0 on Hugging Face
BgGPT-Gemma-2-27B-IT-v1.0 on Hugging Face