Run with Ollama:

```shell
ollama run msallai02/racka:4b-4qkm
```
Racka (Regionális Adatokon Célzottan Kialakított Alapmodell, roughly "a foundation model purposefully built on regional data") is a continually pretrained large language model designed to narrow the resource gap between Hungarian and high-resource languages. It is built with parameter-efficient continual pretraining via Low-Rank Adaptation (LoRA) on a Qwen3-4B (reasoning/instruct) backbone.

The model was trained on a mixture of 160B tokens (44% Hungarian, 24% English, 21% German, 11% code) on the Komondor HPC. To better match this training distribution, Racka uses an adapted tokenizer with substantially lower tokenization fertility (fewer subword tokens per word) for Hungarian, while remaining competitive on English and German.
The following snippet shows how to run the model with Hugging Face `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "elte-nlp/Racka-4B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the chat prompt
messages = [
    {"role": "system", "content": "You are a helpful Hungarian assistant."},
    # "Explain the essence of machine learning to kindergartners in one sentence!"
    {"role": "user", "content": "Magyarázd el a gépi tanulás lényegét óvodásoknak egy mondatban!"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking modes; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.6,
    top_p=0.8,
    top_k=50,
    repetition_penalty=1.1,
)

# generate the completion
generated_ids = model.generate(
    input_ids=model_inputs["input_ids"],
    attention_mask=model_inputs["attention_mask"],
    max_new_tokens=32768,
    generation_config=generation_config
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# split the thinking content from the final answer
try:
    # rindex of token 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
For deployment, the model can be served with vLLM, using YaRN RoPE scaling to extend the context window to 131,072 tokens and the Qwen3 reasoning parser to separate the thinking content:

```shell
vllm serve elte-nlp/Racka-4B --tokenizer elte-nlp/Racka-4B --dtype float16 --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072 --reasoning-parser qwen3
```
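The running server exposes an OpenAI-compatible API. Below is a minimal sketch of querying it with the `openai` Python client, assuming vLLM's default address `http://localhost:8000/v1` (adjust if you pass `--host`/`--port`):

```python
# Query the vLLM server started above via its OpenAI-compatible endpoint.
# Assumes the default address http://localhost:8000/v1 and no API key configured.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="elte-nlp/Racka-4B",
    messages=[
        {"role": "system", "content": "You are a helpful Hungarian assistant."},
        # "Explain the essence of machine learning to kindergartners in one sentence!"
        {"role": "user", "content": "Magyarázd el a gépi tanulás lényegét óvodásoknak egy mondatban!"},
    ],
    temperature=0.6,
    top_p=0.8,
)
print(response.choices[0].message.content)
```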
The model was trained on a 160B token corpus designed to mitigate catastrophic forgetting via data replay:
| Language | BPE Tokens | Ratio | Sources |
|---|---|---|---|
| Hungarian | ~70B | 44% | Common Crawl (heavily filtered), News, Wikipedia, Court Rulings, Subtitles, Academic Repositories. |
| English | ~38B | 24% | The Pile, FineWeb. |
| German | ~34B | 21% | Occiglot-FineWeb. |
| Code | ~18B | 11% | The Stack v2. |
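The continual pretraining itself is parameter-efficient: LoRA adapters are trained on top of the Qwen3-4B backbone. As a rough illustration only, a `peft` setup of this kind might look as follows; the rank, alpha, dropout, and target modules shown are placeholders, not the values actually used for Racka:

```python
# Illustrative sketch of a LoRA continual-pretraining setup with peft.
# Hyperparameters and target modules are placeholders; the values used for
# Racka are not specified in this card.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype="auto")

lora_config = LoraConfig(
    r=64,                                                      # placeholder rank
    lora_alpha=128,                                            # placeholder scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```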
The vocabulary was extended with 32,000 new Hungarian tokens initialized via VIPI (Vocabulary Initialization with Partial Inheritance). This reduced Hungarian subword fertility by about 47%, which translates into a roughly proportional reduction in sequence length, and hence in processing time, for Hungarian text. A quick, informal way to compare fertility yourself is sketched after the table below.
| Language | Qwen-3 4B Fertility | Racka-4B Fertility | Change |
|---|---|---|---|
| Hungarian | 3.13 | 1.66 | -46.96% |
| English | 1.57 | 1.94 | +23.44% |
| German | 2.05 | 2.31 | +12.62% |
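The snippet below is a sketch of such a fertility comparison, counting subword tokens per whitespace-separated word on a single Hungarian sample sentence; the figures in the table were measured on full corpora, so exact numbers will differ:

```python
# Rough fertility estimate: subword tokens per whitespace word on a sample sentence.
# Illustrative only; corpus-level measurements will give different values.
from transformers import AutoTokenizer

sample = "Magyarázd el a gépi tanulás lényegét óvodásoknak egy mondatban!"

for name in ("Qwen/Qwen3-4B", "elte-nlp/Racka-4B"):
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(sample, add_special_tokens=False)["input_ids"])
    print(f"{name}: {n_tokens / len(sample.split()):.2f} tokens/word")
```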
The following tables compare Racka-4B with its base models (Qwen3-4B and Qwen3-4B-Base) and with PULI-LlumiX-Llama-3.1 8B, the state-of-the-art Hungarian 8B model.
Performance on the Hungarian Language Understanding (HULU) benchmark suite. Results are averaged over multiple runs, taking the better of LoRA and full fine-tuning for each model.
| Dataset | Metric | Qwen3-4B | Racka-4B | Qwen3-4B-Base | PULI-LlumiX-Llama-3.1 8B |
|---|---|---|---|---|---|
| HuCOLA | ACC | 0.8109 | 0.8624 | 0.8254 | 0.8989 |
| | MCC | 0.3482 | 0.5657 | 0.4044 | 0.6920 |
| | F1 | 0.7840 | 0.8563 | 0.8027 | 0.8969 |
| HuCOPA | ACC | 0.5589 | 0.7990 | 0.5845 | 0.9359 |
| | MCC | 0.1181 | 0.5998 | 0.1705 | 0.8720 |
| | F1 | 0.5584 | 0.7988 | 0.5837 | 0.9359 |
| HuSST | ACC | 0.7517 | 0.7603 | 0.7539 | 0.7804 |
| | MCC | 0.5022 | 0.5137 | 0.5082 | 0.5598 |
| | F1 | 0.7433 | 0.7511 | 0.7513 | 0.7698 |
| HuRTE | ACC | 0.9078 | 0.8790 | 0.8872 | 0.8979 |
| | MCC | 0.8142 | 0.7553 | 0.7719 | 0.7936 |
| | F1 | 0.9078 | 0.8790 | 0.8872 | 0.8977 |
| HuWNLI | ACC | 0.5033 | 0.5666 | 0.5366 | 0.3800 |
| | MCC | -0.0980 | 0.1031 | -0.0600 | -0.2815 |
| | F1 | 0.3862 | 0.4548 | 0.4069 | 0.3668 |
| HuCB | ACC | 0.7378 | 0.6388 | 0.6291 | 0.4854 |
| | MCC | 0.6078 | 0.4741 | 0.4733 | 0.2742 |
| | F1 | 0.7316 | 0.6373 | 0.6112 | 0.4594 |
| Overall | Avg ACC | 0.711 | 0.751 | 0.702 | 0.729 |
| | Avg MCC | 0.382 | 0.502 | 0.378 | 0.485 |
| | Avg F1 | 0.685 | 0.7295 | 0.673 | 0.721 |
Evaluation on Hungarian reading comprehension, generation, and reasoning tasks. The Qwen and Racka models were evaluated with a patched implementation of OpenHuEval for compatibility.
| Metric | Qwen3-4B | Racka-4B | Qwen3-4B-Base | PULI-LlumiX 8B |
|---|---|---|---|---|
| HuWildBench (WBScore) | 63.03 | 57.17 | 52.59 | 17.77 |
| HuSimpleQA (Acc) | 7.30 | 10.05 | 5.90 | 20.03 |
| HuProverbRea (Acc OE) | 62.47 | 61.94 | 41.15 | 75.86 |
| HuProverbRea (Acc 2CQ) | 74.98 | 77.53 | 0.00 | 77.36 |
| HuMatchingFIB (B Acc) | 39.59 | 38.93 | 42.30 | 33.54 |
| HuMatchingFIB (Q Acc) | 5.94 | 4.68 | 5.58 | 3.96 |
| HuStandardFIB (B Acc) | 13.20 | 18.98 | 0.00 | 29.16 |
| HuStandardFIB (Q Acc) | 1.08 | 2.15 | 0.00 | 2.15 |
| Overall | 33.44 | 33.93 | 18.44 | 32.47 |
Few-shot evaluation on standard benchmarks translated to Hungarian. For each model the best result is reported (with the chat template for Racka-4B, without it for the other models).
| Dataset (Metric) | Qwen3-4B | Racka-4B | Qwen3-4B-Base | PULI-LlumiX 8B |
|---|---|---|---|---|
| Arc_hu (Acc) | 0.3202 | 0.3450 | 0.3792 | 0.3861 |
| Arc_hu (Acc_norm) | 0.3844 | 0.4101 | 0.4169 | 0.4323 |
| Hellaswag_hu (Acc) | 0.3369 | 0.3656 | 0.3610 | 0.4241 |
| Hellaswag_hu (Acc_norm) | 0.4095 | 0.4510 | 0.4557 | 0.5606 |
| MMLU_hu (Acc) | 0.5427 | 0.5378 | 0.5965 | 0.5310 |
| TruthfulQA_hu_mc1 (Acc) | 0.3177 | 0.3644 | 0.3281 | 0.3035 |
| TruthfulQA_hu_mc2 (Acc) | 0.5102 | 0.5493 | 0.5045 | 0.4883 |
| GSM8K_hu (Strict-match) | 0.6330 | 0.5299 | 0.6398 | 0.4761 |
| GSM8K_hu (Flexible extract) | 0.6285 | 0.5329 | 0.6421 | 0.4791 |
| Overall | 0.453 | 0.454 | 0.4805 | 0.4546 |
We acknowledge the Digital Government Development and Project Management Ltd. for awarding us access to the Komondor HPC facility based in Hungary.
This research was supported by the EKÖP-24 University Excellence Scholarship Program of the Ministry for Culture and Innovation, funded by the National Research, Development and Innovation Fund.
The authors acknowledge the support of the National Laboratory for Digital Heritage. Project no. 2022-2.1.1-NL-2022-00009 has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the 2022-2.1.1-NL funding scheme.
We would like to thank Levente Szabados for the name idea and initial informal discussions.
```bibtex
@article{racka2026,
  title={Racka: Efficient Hungarian LLM Adaptation on Academic Infrastructure},
  author={Csibi, Zsolt and Gortka, Bence Gy\"orgy and Nagy, Korn\'el and Nemeskey, D\'avid M\'ark and Sallai, Martin and Simonyi, Andr\'as and Szekeres, Andr\'as M\'ark and Palk\'o, G\'abor},
  journal={Proceedings of the XXII. Hungarian Computational Linguistics Conference},
  year={2026}
}
```