This project fine-tunes the Qwen2-1.5B model for Arabic language tasks using Quantized Low-Rank Adaptation (QLoRA).

Qwen-Arabic is a 1.5B-parameter language model based on the Qwen architecture and fine-tuned for Arabic with QLoRA. Evaluated on the ArabicMMLU benchmark, it maintains competitive performance across a range of knowledge domains while demonstrating strong parameter efficiency.
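As background, QLoRA keeps the base model weights in 4-bit precision and trains only small low-rank adapter matrices on top. Below is a minimal sketch of such a setup with Transformers, bitsandbytes, and PEFT; the hyperparameters and target modules are illustrative assumptions, not necessarily the values used in `finetune_qwen.py`.

```python
# Sketch of a QLoRA setup (assumed hyperparameters, not the project's actual ones).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit NF4 quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters to the attention projections (illustrative choice).
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the adapter weights receive gradients (typically well under 1% of the 1.5B parameters), which is what makes fine-tuning feasible on a single GPU.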
ArabicMMLU results by category:

| Category | Accuracy (%) |
|---|---|
| STEM | 42.2 |
| Social Science | 46.1 |
| Humanities | 41.8 |
| Arabic Language | 37.8 |
| Other | 42.9 |
| Average | 42.3 |
Comparison with larger Arabic-capable models (efficiency score = average accuracy in % divided by parameters in billions; for Qwen-Arabic, 42.3 / 1.5 ≈ 28.2):

| Model | Parameters | Average Accuracy | Efficiency Score |
|---|---|---|---|
| GPT-4 | ~1000B | 72.5% | 0.072 |
| Jais-chat | 30B | 62.3% | 2.077 |
| AceGPT-chat | 13B | 52.6% | 4.046 |
| Qwen-Arabic | 1.5B | 42.3% | 28.200 |
Clone this repository:
```bash
git clone https://github.com/prakash-aryan/qwen-arabic-project.git
cd qwen-arabic-project
```
Create and activate a virtual environment:
```bash
python3.10 -m venv qwen_env
source qwen_env/bin/activate
```
Install the required packages:
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
Install PyTorch with CUDA support:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
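To confirm the CUDA build is active before training, a quick sanity check (assuming the install above succeeded):

```python
# Expect a +cu118 build string and True on a CUDA-capable machine.
import torch

print(torch.__version__)
print(torch.cuda.is_available())
```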
```
qwen-arabic-project/
├── data/
│   └── arabic_instruction_dataset/
├── models/
├── results/
├── src/
│   ├── compare_qwen_models.py
│   ├── evaluate_arabic_model.py
│   ├── finetune_qwen.py
│   ├── get_datasets.py
│   ├── load_and_merge_model.py
│   ├── preprocess_datasets.py
│   └── validate_dataset.py
├── tools/
│   └── llama-quantize
├── requirements.txt
├── run_pipeline.sh
├── Modelfile
└── README.md
```
Download and prepare datasets:
```bash
python src/get_datasets.py
```
Preprocess and combine datasets:
```bash
python src/preprocess_datasets.py
```
Validate the dataset:
```bash
python src/validate_dataset.py
```
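The exact checks performed by `validate_dataset.py` are not documented here; as a purely hypothetical sketch, a validator for an instruction dataset might scan for empty fields (the `instruction`/`output` field names are assumptions):

```python
# Hypothetical validation sketch; field names are assumed, not confirmed.
from datasets import load_from_disk

ds = load_from_disk("./data/arabic_instruction_dataset")
bad = [i for i, row in enumerate(ds)
       if not row.get("instruction") or not row.get("output")]
print(f"{len(ds)} examples, {len(bad)} with empty fields")
```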
Fine-tune the model (batch size 1 with 16 gradient-accumulation steps yields an effective batch size of 16):

```bash
python src/finetune_qwen.py --data_path ./data/arabic_instruction_dataset --output_dir ./models/qwen2_arabic_finetuned --num_epochs 3 --batch_size 1 --gradient_accumulation_steps 16 --learning_rate 2e-5
```
Load and merge the fine-tuned model:
```bash
python src/load_and_merge_model.py
```
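Under the hood, merging typically means folding the LoRA adapter weights back into the base model with PEFT's `merge_and_unload`. A rough sketch (paths follow the commands above; `load_and_merge_model.py` may differ in detail):

```python
# Sketch: load the base model, apply the fine-tuned adapters, merge, and save.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "./models/qwen2_arabic_finetuned")
merged = model.merge_and_unload()  # folds LoRA deltas into the base weights

merged.save_pretrained("./models/qwen2_arabic_merged_full")
AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B").save_pretrained(
    "./models/qwen2_arabic_merged_full"
)
```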
Convert the merged model to GGUF format using llama.cpp's `convert_hf_to_gguf.py` script:

```bash
python src/convert_hf_to_gguf.py ./models/qwen2_arabic_merged_full --outfile ./models/qwen_arabic_merged_full.gguf
```
Quantize the model to 4-bit (Q4_K_M) with llama.cpp's `llama-quantize` tool:

```bash
./tools/llama-quantize ./models/qwen_arabic_merged_full.gguf ./models/qwen_arabic_merged_full_q4_k_m.gguf q4_k_m
```
Create Ollama model:
```bash
ollama create qwen-arabic-custom -f Modelfile
```
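The repository ships its own Modelfile; for illustration only, a minimal one pointing at the quantized GGUF might look like this (the parameter value and system prompt are hypothetical):

```
FROM ./models/qwen_arabic_merged_full_q4_k_m.gguf
PARAMETER temperature 0.7
SYSTEM """You are a helpful assistant that responds in Arabic."""
```

Once created, the model can be tried interactively with `ollama run qwen-arabic-custom`.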
Evaluate the model:
```bash
python src/evaluate_arabic_model.py
```
Compare models:
```bash
python src/compare_qwen_models.py
```
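For a quick manual comparison outside the script, both a stock Qwen2 model and the fine-tuned one can be queried through Ollama's local REST API. The prompt and model tags below are illustrative, and `compare_qwen_models.py` may work differently:

```python
# Sketch: side-by-side generation via the Ollama REST API (localhost:11434).
import requests

PROMPT = "ما هي عاصمة المغرب؟"  # "What is the capital of Morocco?"

for model in ["qwen2:1.5b", "qwen-arabic-custom"]:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=120,
    )
    print(model, "->", resp.json()["response"])
```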
To run the entire pipeline from data preparation to model evaluation, use the provided shell script:
```bash
chmod +x run_pipeline.sh
./run_pipeline.sh
```
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
This means:

- You can use, modify, and distribute this software.
- If you distribute modified versions, you must also distribute them under the GPL-3.0.
- You must include the original copyright notice and the license text.
- You must disclose your source code when you distribute the software.
- There is no warranty for this free software.
For more details, see the LICENSE file in this repository or visit GNU GPL v3.0.
This project uses the following main libraries and tools:

- Transformers by Hugging Face
- PyTorch
- PEFT (Parameter-Efficient Fine-Tuning)
- Ollama
- GGUF (for model conversion)