---
license: llama3.1
language:
- el
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
base_model:
- ilsp/Llama-Krikri-8B-Base
---
🚨 PLEASE USE THE OFFICIAL QUANTIZED VERSIONS (GGUF), OR REQUEST A SPECIFIC ONE 🚨
🚨 Third-party quantizations may not include the latest improvements, as we have since updated the model's weights. 🚨
Following the release of Meltemi-7B on 26 March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs. Krikri is built on top of Llama-3.1-8B, extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Instruct, along with the base model, Llama-Krikri-8B-Base.

The corpus used for continual pretraining consists of the following sub-corpora:
| Sub-corpus | # Tokens | Percentage |
|---|---|---|
| Greek | 56.7 B | 62.3% |
| English | 21.0 B | 23.1% |
| Parallel | 5.5 B | 6.0% |
| Math/Code | 7.8 B | 8.6% |
| **Total** | 91 B | 100% |
Chosen subsets of the 91-billion-token corpus were upsampled, resulting in a final size of 110 billion tokens.
Llama-Krikri-8B-Instruct is the result of post-training Llama-Krikri-8B-Base and features:
- Enhanced chat capabilities and instruction-following in both Greek and English.
- Document translation from Greek to English, French, German, Italian, Portuguese, and Spanish, and vice versa.
- Great performance on generation, comprehension, and editing tasks, such as summarization, creative content creation, text modification, entity recognition, and sentiment analysis.
- Domain-specific expertise for specialized legal, financial, medical, and scientific applications.
- Retrieval-Augmented Generation (RAG) over multiple documents with a 128k context length (see the sketch after this list).
- Improved coding and agentic capabilities with correct formatting and tool use.
- Conversion or structured extraction (e.g., XML, JSON) in data-to-text & text-to-data settings.
- Analytical thinking and Chain-of-Thought (CoT) reasoning for problem-solving.
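As a rough illustration of the multi-document RAG setting mentioned above, the sketch below packs several retrieved passages into a single user turn. The passages, delimiters, variable names, and the retrieval step itself are placeholders for demonstration only and are not part of the released model or this card.

```python
# Minimal sketch of multi-document RAG prompting; passages and delimiters are illustrative only.
retrieved_docs = [
    "Το κρι-κρι είναι είδος αγριοκάτσικου που ζει στην Κρήτη.",
    "Το λάμα είναι καμηλίδης που εκτρέφεται στη Νότια Αμερική.",
]

# Concatenate the retrieved passages with simple document markers.
context = "\n\n".join(f"[Έγγραφο {i + 1}]\n{doc}" for i, doc in enumerate(retrieved_docs))

rag_messages = [
    {"role": "system", "content": "Απάντησε χρησιμοποιώντας μόνο τα παρεχόμενα έγγραφα."},
    {"role": "user", "content": f"{context}\n\nΕρώτηση: Σε τι διαφέρει ένα κρικρί από ένα λάμα;"},
]
# `rag_messages` can then be passed to the chat-template or client calls shown later in this card.
```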
We used a multi-stage process in order to build Llama-Krikri-8B-Instruct, which includes:
- 2-stage Supervised Fine-Tuning with a combination of Greek & English instruction-response pairs:
  - Stage 1: 856,946 instruction-response pairs (371,379 Greek + 485,567 English)
  - Stage 2: 638,408 instruction-response pairs (279,948 Greek + 358,460 English)
- Alignment with a combination of Greek & English preference triplets (the sketch after this list shows the shape of such a triplet):
  - Length-Normalized DPO: 92,394 preference triplets (47,132 Greek + 45,262 English)
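For readers unfamiliar with preference data, the sketch below shows the typical shape of one preference triplet as used in DPO-style alignment. The field names follow common open preference datasets and the example contents are made up; the actual internal format is not published.

```python
# Hypothetical shape of a single preference triplet for DPO-style alignment.
# Field names and contents are illustrative, not the project's actual data format.
preference_triplet = {
    "prompt":   "Γράψε μια σύντομη περίληψη για το κρι-κρι της Κρήτης.",
    "chosen":   "Το κρι-κρι είναι είδος αγριοκάτσικου που ζει κυρίως στα βουνά της Κρήτης.",
    "rejected": "Το κρι-κρι είναι ένα πουλί που ζει στην Αφρική.",  # factually wrong, hence dispreferred
}
```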
To build the SFT & DPO data, we utilized various methodologies, including:
- Collecting existing high-quality datasets such as Tulu 3, SmolTalk, MAGPIE Ultra, Orca Agent Instruct, IFEval Like Data, UltraFeedback, NVIDIA HelpSteer2, Intel Orca, UltraMedical, and other datasets focused on safety, truthfulness, and instruction-following.
- Translating various data into Greek using an in-house translation tool.
- Distilling (with the MAGPIE methodology) models which exhibit strong performance in Greek, such as Gemma 2 27B IT.
- Scoring data with the Skywork Reward Gemma 2 27B v0.2 reward model and filtering with rule-based filters (a filtering sketch follows this list).
- Creating data for sentence and document translation using high-quality parallel corpora, mainly from ELRC-SHARE.
- Synthetically extracting question-answer pairs (RAG) and multi-turn dialogues from diverse sources such as Wikipedia, EUR-LEX, Greek School Books, and Kallipos.
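The concrete filters and thresholds used are not published. The sketch below only illustrates the general pattern of combining a precomputed reward-model score with simple rule-based checks; all thresholds, rules, field names, and example records are hypothetical.

```python
# Illustrative-only quality filtering: combines a precomputed reward-model score
# with simple rule-based checks. Thresholds, rules, and field names are hypothetical.
dataset = [
    {"response": "Η Αθήνα είναι η πρωτεύουσα της Ελλάδας.", "reward_score": 2.3},
    {"response": "As an AI language model, I cannot answer that.", "reward_score": -1.1},
]

def keep_example(example: dict, min_reward: float = 0.0, max_chars: int = 8000) -> bool:
    response = example["response"]
    if example["reward_score"] < min_reward:      # drop responses scored low by the reward model
        return False
    if len(response) > max_chars:                 # drop overly long responses
        return False
    if "As an AI language model" in response:     # drop boilerplate refusals / disclaimers
        return False
    return True

filtered = [ex for ex in dataset if keep_example(ex)]
print(len(filtered))  # -> 1
```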
🚨 More information on post-training, methodology, and evaluation coming soon. 🚨
To use the model with 🤗 Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
model.to(device)

system_prompt = "Είσαι το Κρικρί, ένα εξαιρετικά ανεπτυγμένο μοντέλο Τεχνητής Νοημοσύνης για τα ελληνικά και εκπαιδεύτηκες από το ΙΕΛ του Ε.Κ. \"Αθηνά\"."
user_prompt = "Σε τι διαφέρει ένα κρικρί από ένα λάμα;"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

# Build the prompt with the model's chat template, then tokenize and generate.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_prompt = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(input_prompt["input_ids"], max_new_tokens=256, do_sample=True)

print(tokenizer.batch_decode(outputs)[0])
```
Alternatively, the model can be served with an OpenAI-compatible server via vLLM:

```bash
vllm serve ilsp/Llama-Krikri-8B-Instruct \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123
```

The server can then be queried from Python through the OpenAI client:
```python
from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python. Δεν γράφεις τίποτα άλλο στις απαντήσεις σου πέρα από τις μεταφρασμένες λίστες."
user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=messages,
    temperature=0.0,
    top_p=0.95,
    max_tokens=8192,
    stream=False,
)

print(response.choices[0].message.content)
# ['Ηθική καθήκοντος', 'Μεταμοντέρνα ηθική', 'Συνεπειοκρατική ηθική', 'Ωφελιμιστική ηθική', 'Δεοντολογική ηθική', 'Ηθική αρετών', 'Σχετικιστική ηθική']
```
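Since the card also advertises structured extraction (e.g., JSON) in text-to-data settings, the sketch below reuses the `client` from the example above for a simple entity-extraction prompt. The prompt wording, the schema, and the example sentence are illustrative assumptions, not part of the official card.

```python
# Illustrative text-to-data prompt reusing the `client` defined in the example above.
# The schema and the example sentence are made up for demonstration purposes.
extraction_messages = [
    {"role": "system", "content": "Απαντάς μόνο με έγκυρο JSON, χωρίς επιπλέον κείμενο."},
    {"role": "user", "content": (
        "Εξήγαγε τα πεδία {\"όνομα\": str, \"πόλη\": str, \"έτος\": int} από την πρόταση: "
        "'Η Μαρία μετακόμισε στο Ηράκλειο το 2021.'"
    )},
]

extraction_response = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=extraction_messages,
    temperature=0.0,
    max_tokens=256,
)

print(extraction_response.choices[0].message.content)
# Expected shape (actual output may vary): {"όνομα": "Μαρία", "πόλη": "Ηράκλειο", "έτος": 2021}
```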
🚨 Instruction following and chat capability evaluation benchmarks coming soon. 🚨
The ILSP team utilized Amazon’s cloud computing services, which were made available via GRNET under the OCRE Cloud framework, providing Amazon Web Services for the Greek Academic and Research Community.