BgGPT is a Bulgarian language model built on top of Google’s Gemma 2.
BgGPT
Meet BgGPT, a Bulgarian language model built on top of Google’s Gemma 2. BgGPT is distributed under the Gemma Terms of Use.
Earlier versions 0.1 and 0.2 of the model were built on top of Mistral 0.1 and 0.2.
This model was created by the INSAIT Institute, part of Sofia University, in Sofia, Bulgaria.
Model description
The model was built on top of Google’s Gemma 2 2B, 9B and 27B open models. It was continuously pre-trained on around 100 billion tokens (85 billion in Bulgarian) using the Branch-and-Merge strategy INSAIT presented at EMNLP’24, allowing the model to gain strong Bulgarian cultural and linguistic capabilities while retaining its English performance. During the pre-training stage, we used various datasets, including Bulgarian web crawl data, freely available datasets such as Wikipedia, a range of specialized Bulgarian datasets sourced by the INSAIT Institute, and machine translations of popular English datasets. The model was then instruction-fine-tuned on a newly constructed Bulgarian instruction dataset created using real-world conversations. For more information, see our blog post.
Usage
CLI
ollama run todorov/bggpt
API
Example (the prompt asks, in Bulgarian, “When was Sofia University founded?”):
curl -X POST http://localhost:11434/api/generate -d '{
"model": "todorov/bggpt",
"prompt":"Кога е основан Софийският университет?"
}'
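The same endpoint can also be called from code. Below is a minimal Python sketch using the requests library; it assumes an Ollama server is running locally on the default port 11434 and that the model has already been pulled as todorov/bggpt, matching the curl example above. Setting "stream": false returns a single JSON object instead of a stream of chunks.

import requests

# Minimal sketch: query a locally running Ollama server (default port 11434),
# assuming the model has been pulled as "todorov/bggpt".
payload = {
    "model": "todorov/bggpt",
    "prompt": "Кога е основан Софийският университет?",  # "When was Sofia University founded?"
    "stream": False,  # return one JSON object instead of streamed chunks
}

resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["response"])  # the model's answer, in Bulgarian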
References
BgGPT
BgGPT-Gemma-2-2.6B-IT-v1.0 on Hugging Face
BgGPT-Gemma-2-9B-IT-v1.0 on Hugging Face
BgGPT-Gemma-2-27B-IT-v1.0 on Hugging Face