
BgGPT-v1.0

BgGPT-v1.0 is a Bulgarian language model based on Google’s Gemma 2 architecture. The model is free to use and distributed under the Gemma Terms of Use. This model was developed by INSAIT, part of Sofia University St. Kliment Ohridski, in Sofia, Bulgaria.

Model Description

This model was built on top of Google's Gemma 2 open models through continued pre-training on approximately 100 billion tokens (85 billion of them in Bulgarian) using the Branch-and-Merge strategy. This training allows the model to develop Bulgarian cultural and linguistic capabilities while maintaining its English performance.

The pre-training utilized various datasets including Bulgarian web crawl data, Wikipedia, specialized Bulgarian datasets, and machine translations of popular English datasets. The model was then instruction-fine-tuned on a Bulgarian instruction dataset created from real-world conversations.

Benchmarks and Results

The model has been evaluated on standard English benchmarks, their Bulgarian translations, and Bulgarian-specific benchmarks including:

  • Winogrande challenge: World knowledge and understanding
  • Hellaswag: Sentence completion
  • ARC Easy/Challenge: Logical reasoning
  • TriviaQA: Trivia knowledge
  • GSM-8k: Grade-school mathematics word problems
  • Exams: High-school problems from the natural and social sciences
  • MON: Bulgarian Ministry of Education and Science exams for grades 4 to 12

Performance comparisons show that the model is competitive with other small open language models while retaining the English performance of the original Gemma 2 base models.

Available Models

Multiple model sizes and quantizations are available:

Model               Size    Context  Quantization
BgGPT-v1.0:2.6b     1.7GB   8K       Q4_K_M
BgGPT-v1.0:2.6b-q8  2.8GB   8K       Q8_0
BgGPT-v1.0:9b       5.8GB   8K       Q4_K_M
BgGPT-v1.0:9b-q8    9.8GB   8K       Q8_0
BgGPT-v1.0:27b      17GB    8K       Q4_K_M
BgGPT-v1.0:27b-q8   29GB    8K       Q8_0

Usage with Ollama

To use this model with Ollama, you can pull it using:

# 2.6B model
ollama pull s_emanuilov/BgGPT-v1.0:2.6b

# 9B model
ollama pull s_emanuilov/BgGPT-v1.0:9b

# 27B model
ollama pull s_emanuilov/BgGPT-v1.0:27b

# Q8 quantized versions (higher quality)
ollama pull s_emanuilov/BgGPT-v1.0:2.6b-q8
ollama pull s_emanuilov/BgGPT-v1.0:9b-q8
ollama pull s_emanuilov/BgGPT-v1.0:27b-q8

Then run it with:

ollama run s_emanuilov/BgGPT-v1.0:2.6b
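
If you prefer to script the model rather than use the interactive CLI, the same tags can be called through Ollama's REST API. A minimal sketch, assuming a local Ollama server on the default port 11434:

# Ask a question through the chat endpoint (non-streaming)
curl http://localhost:11434/api/chat -d '{
  "model": "s_emanuilov/BgGPT-v1.0:2.6b",
  "messages": [
    {"role": "user", "content": "Кога е основан Софийският университет?"}
  ],
  "stream": false
}'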

Instruction Format

To take advantage of the instruction fine-tuning, your prompt should begin with the beginning-of-sequence token <bos> and follow the Gemma 2 chat template. <bos> must appear only once, as the first token of a chat sequence.

For example:

<bos><start_of_turn>user
Кога е основан Софийският университет?<end_of_turn>
<start_of_turn>model
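
When a model is run through ollama run or the chat API, Ollama applies whatever chat template is bundled with the model, so you normally do not need to format prompts by hand; the explicit format above matters when you build prompts yourself. A sketch of inspecting the bundled template and sending a pre-formatted prompt, assuming a reasonably recent Ollama release and a local server on the default port:

# Inspect the chat template the published model ships with
ollama show s_emanuilov/BgGPT-v1.0:2.6b --template

# Send an already-formatted prompt, bypassing Ollama's templating
curl http://localhost:11434/api/generate -d '{
  "model": "s_emanuilov/BgGPT-v1.0:2.6b",
  "prompt": "<bos><start_of_turn>user\nКога е основан Софийският университет?<end_of_turn>\n<start_of_turn>model\n",
  "raw": true,
  "stream": false
}'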

Recommended Parameters

For optimal performance, we recommend the following text-generation parameters, which the model has been extensively tested with (a sketch of the Ollama equivalents follows the list):

  • max_new_tokens: 2048
  • temperature: 0.1
  • top_k: 25
  • top_p: 1
  • repetition_penalty: 1.1
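
These parameter names match the Hugging Face transformers generation API. When serving through Ollama, the closest equivalents are num_predict (for max_new_tokens) and repeat_penalty (for repetition_penalty); temperature, top_k and top_p keep their names. A minimal sketch of baking the recommended values into a derived model (the local name bggpt-tuned is just an example, not an official tag):

# Modelfile pinning the recommended generation parameters
cat > Modelfile <<'EOF'
FROM s_emanuilov/BgGPT-v1.0:2.6b
PARAMETER num_predict 2048
PARAMETER temperature 0.1
PARAMETER top_k 25
PARAMETER top_p 1
PARAMETER repeat_penalty 1.1
EOF

# Build and run the derived model
ollama create bggpt-tuned -f Modelfile
ollama run bggpt-tuned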