Qwen Arabic Fine-tuning Project
This project fine-tunes the Qwen2-1.5B model for Arabic language tasks using Quantized LoRA (QLoRA).
Qwen-Arabic Evaluation on ArabicMMLU
This section reports the evaluation of the Qwen-Arabic language model (1.5B parameters) on the ArabicMMLU benchmark. The model averages 42.3% accuracy across knowledge domains, which is high relative to its parameter count when compared with much larger models.
Model Overview
Qwen-Arabic is a 1.5B-parameter language model fine-tuned for Arabic language tasks. It is based on Qwen2-1.5B and was adapted using QLoRA (Quantized Low-Rank Adaptation).
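To make the "low-rank" part of QLoRA concrete, the sketch below counts the trainable parameters that LoRA adds to a single weight matrix. The layer dimensions and the rank are illustrative assumptions, not values taken from this project's training configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA keeps the original weight W frozen and learns an update B @ A,
    where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


# Hypothetical dimensions and rank, chosen for illustration only.
d_in, d_out, rank = 1536, 1536, 16

full = d_in * d_out
adapter = lora_param_count(d_in, d_out, rank)

print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {adapter:,} parameters ({adapter / full:.2%} of full)")
```

In QLoRA the frozen base weights are additionally stored in 4-bit precision, so only the small adapter matrices need gradients and optimizer state.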
Performance Results
Overall Performance
- Average Accuracy: 42.3%
- Best Category: Social Science (46.1%)
- Most Challenging: Arabic Language (37.8%)
Category-wise Performance
Category | Accuracy (%) |
---|---|
STEM | 42.2 |
Social Science | 46.1 |
Humanities | 41.8 |
Arabic Language | 37.8 |
Other | 42.9 |
Average | 42.3 |
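The category figures above can be cross-checked with a few lines of Python. The unweighted mean of the five category scores is 42.16%, slightly below the reported 42.3% average. One possible explanation, stated here as an assumption, is that the reported average is weighted by the number of questions per category. Those counts are not given in this README, so the weighted helper below takes them as an input.

```python
# Category accuracies (%) copied from the table above.
category_accuracy = {
    "STEM": 42.2,
    "Social Science": 46.1,
    "Humanities": 41.8,
    "Arabic Language": 37.8,
    "Other": 42.9,
}


def macro_average(scores: dict) -> float:
    """Unweighted mean over categories."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores: dict, counts: dict) -> float:
    """Mean weighted by the number of questions in each category."""
    total = sum(counts[name] for name in scores)
    return sum(scores[name] * counts[name] for name in scores) / total


best = max(category_accuracy, key=category_accuracy.get)
worst = min(category_accuracy, key=category_accuracy.get)

print(f"macro average: {macro_average(category_accuracy):.2f}%")
print(f"best: {best}, most challenging: {worst}")
```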
Efficiency Analysis
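- Accuracy per billion parameters: 28.20 points (42.3% average accuracy / 1.5B parameters)
- 389.0x the accuracy per billion parameters of GPT-4, using the ~1000B parameter estimate listed in the comparison table below
- Achieves 58.3% of GPT-4’s average accuracy with 0.15% of its parameters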
Comparison with Other Models
Model | Parameters | Average Accuracy | Efficiency Score (accuracy % per billion parameters) |
---|---|---|---|
GPT-4 | ~1000B | 72.5% | 0.072 |
Jais-chat | 30B | 62.3% | 2.077 |
AceGPT-chat | 13B | 52.6% | 4.046 |
Qwen-Arabic | 1.5B | 42.3% | 28.200 |
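The efficiency figures can be reproduced from the table. The sketch below assumes the efficiency score is average accuracy divided by parameter count in billions. That definition is inferred from the table values rather than stated explicitly, and the GPT-4 parameter count is the approximate figure used in the table.

```python
# (parameters in billions, average accuracy in %) copied from the table above.
models = {
    "GPT-4": (1000.0, 72.5),  # parameter count is the table's approximate figure
    "Jais-chat": (30.0, 62.3),
    "AceGPT-chat": (13.0, 52.6),
    "Qwen-Arabic": (1.5, 42.3),
}


def efficiency_score(params_billion: float, accuracy: float) -> float:
    """Accuracy points per billion parameters (assumed definition)."""
    return accuracy / params_billion


scores = {name: efficiency_score(p, acc) for name, (p, acc) in models.items()}

# Ratios quoted in the efficiency analysis.
efficiency_ratio = scores["Qwen-Arabic"] / scores["GPT-4"]
relative_accuracy = models["Qwen-Arabic"][1] / models["GPT-4"][1]
relative_size = models["Qwen-Arabic"][0] / models["GPT-4"][0]

for name, score in scores.items():
    print(f"{name:12s} {score:.3f}")
print(f"efficiency ratio vs GPT-4: {efficiency_ratio:.1f}x")
print(f"relative accuracy: {relative_accuracy:.1%}, relative size: {relative_size:.2%}")
```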
Prerequisites
- Ubuntu (or similar Linux distribution)
- Python 3.10
- CUDA-compatible GPU with at least 4GB VRAM
- At least 12GB system RAM
- Ollama installed and configured
Setup
Clone this repository:
git clone https://github.com/prakash-aryan/qwen-arabic-project.git
cd qwen-arabic-project
Create and activate a virtual environment:
python3.10 -m venv qwen_env
source qwen_env/bin/activate
Install the required packages:
pip install --upgrade pip
pip install -r requirements.txt
Install PyTorch with CUDA support:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Project Structure
qwen-arabic-project/
├── data/
│ └── arabic_instruction_dataset/
├── models/
├── results/
├── src/
│ ├── compare_qwen_models.py
│ ├── evaluate_arabic_model.py
│ ├── finetune_qwen.py
│ ├── get_datasets.py
│ ├── load_and_merge_model.py
│ ├── preprocess_datasets.py
│ └── validate_dataset.py
├── tools/
│ └── llama-quantize
├── requirements.txt
├── run_pipeline.sh
├── Modelfile
└── README.md
Usage
Download and prepare datasets:
python src/get_datasets.py
Preprocess and combine datasets:
python src/preprocess_datasets.py
Validate the dataset:
python src/validate_dataset.py
Fine-tune the model:
python src/finetune_qwen.py --data_path ./data/arabic_instruction_dataset --output_dir ./models/qwen2_arabic_finetuned --num_epochs 3 --batch_size 1 --gradient_accumulation_steps 16 --learning_rate 2e-5
Load and merge the fine-tuned model:
python src/load_and_merge_model.py
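Convert to GGUF format (convert_hf_to_gguf.py is the llama.cpp conversion script; it is not listed under src/ in the project structure above, so adjust the path if it lives elsewhere):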
python src/convert_hf_to_gguf.py ./models/qwen2_arabic_merged_full --outfile ./models/qwen_arabic_merged_full.gguf
Quantize the model:
./tools/llama-quantize ./models/qwen_arabic_merged_full.gguf ./models/qwen_arabic_merged_full_q4_k_m.gguf q4_k_m
Create Ollama model:
ollama create qwen-arabic-custom -f Modelfile
Evaluate the model:
python src/evaluate_arabic_model.py
Compare models:
python src/compare_qwen_models.py
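Once `ollama create` has registered the model as qwen-arabic-custom, it can be queried from Python through Ollama's local REST API. This is a minimal sketch using only the standard library. It assumes the Ollama server is running on its default address, and the helper names `build_request` and `generate` are illustrative, not part of this repository.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen-arabic-custom") -> urllib.request.Request:
    """Build a non-streaming generate request for the given model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen-arabic-custom") -> str:
    """Send the prompt to Ollama and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, `generate("ما هي عاصمة مصر؟")` returns the model's reply as a string.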
Running the Full Pipeline
To run the entire pipeline from data preparation to model evaluation, use the provided shell script:
chmod +x run_pipeline.sh
./run_pipeline.sh
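The contents of run_pipeline.sh are not shown in this README. Assuming it simply chains the Usage steps in order and stops at the first failure, a hypothetical Python equivalent could look like the sketch below. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Commands copied from the Usage section, in order.
PIPELINE = [
    "python src/get_datasets.py",
    "python src/preprocess_datasets.py",
    "python src/validate_dataset.py",
    "python src/finetune_qwen.py --data_path ./data/arabic_instruction_dataset "
    "--output_dir ./models/qwen2_arabic_finetuned --num_epochs 3 --batch_size 1 "
    "--gradient_accumulation_steps 16 --learning_rate 2e-5",
    "python src/load_and_merge_model.py",
    "python src/convert_hf_to_gguf.py ./models/qwen2_arabic_merged_full "
    "--outfile ./models/qwen_arabic_merged_full.gguf",
    "./tools/llama-quantize ./models/qwen_arabic_merged_full.gguf "
    "./models/qwen_arabic_merged_full_q4_k_m.gguf q4_k_m",
    "ollama create qwen-arabic-custom -f Modelfile",
    "python src/evaluate_arabic_model.py",
    "python src/compare_qwen_models.py",
]


def run_pipeline(steps=PIPELINE, dry_run=True):
    """Run each step in order; with dry_run=True, only print the commands."""
    planned = []
    for step in steps:
        argv = shlex.split(step)
        planned.append(argv)
        print("+", step)
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # which stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)
    return planned


planned = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the steps for real.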
Notes
- Ensure you have sufficient disk space for the datasets and model files.
- The fine-tuning process can take several hours to days, depending on your hardware.
- Monitor GPU memory usage during fine-tuning and adjust batch size or gradient accumulation steps if necessary.
- Ollama must be installed for the model creation and evaluation steps.
Troubleshooting
- If you encounter CUDA out-of-memory errors, reduce the batch size and raise the gradient accumulation steps by the same factor so the effective batch size stays unchanged.
- For any other issues, please check the error logs or open an issue in the GitHub repository.
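The trade-off between batch size and gradient accumulation can be sketched as follows. The effective batch size is the product of the two, so halving one while doubling the other lowers per-step memory without changing how many examples feed each optimizer update. The helper names are illustrative, and the fallback helper assumes a single GPU and a power-of-two batch size.

```python
def effective_batch_size(batch_size: int, grad_accum: int) -> int:
    """Examples contributing to each optimizer update on a single GPU."""
    return batch_size * grad_accum


def fallback_configs(batch_size: int, grad_accum: int):
    """Yield (batch_size, grad_accum) pairs with the same effective batch size
    and a progressively smaller per-step batch.

    Assumes batch_size is a power of two.
    """
    effective = effective_batch_size(batch_size, grad_accum)
    while batch_size > 1:
        batch_size //= 2
        yield batch_size, effective // batch_size


# Settings from the fine-tuning command in the Usage section.
print("effective batch size:", effective_batch_size(1, 16))

# Hypothetical starting point with a larger per-step batch.
for bs, accum in fallback_configs(8, 2):
    print(f"--batch_size {bs} --gradient_accumulation_steps {accum}")
```

With the batch size of 1 used in the fine-tuning command, `fallback_configs` yields nothing: there is no smaller batch to fall back to, so memory has to be saved elsewhere, for example by shortening input sequences if the training script exposes such an option.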
License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
This means:
- You can use, modify, and distribute this software.
- If you distribute modified versions, you must also distribute them under the GPL-3.0.
- You must include the original copyright notice and the license text.
- You must disclose your source code when you distribute the software.
- There’s no warranty for this free software.
For more details, see the LICENSE file in this repository or visit GNU GPL v3.0.
Acknowledgements
This project uses the following main libraries and tools:
- Transformers by Hugging Face
- PyTorch
- PEFT (Parameter-Efficient Fine-Tuning)
- Ollama
- GGUF (for model conversion)