ollama run mkchaou/calm_q3_32b_r16_a32_lr2e-5_e15
ollama launch claude --model mkchaou/calm_q3_32b_r16_a32_lr2e-5_e15
ollama launch codex --model mkchaou/calm_q3_32b_r16_a32_lr2e-5_e15
ollama launch opencode --model mkchaou/calm_q3_32b_r16_a32_lr2e-5_e15
ollama launch openclaw --model mkchaou/calm_q3_32b_r16_a32_lr2e-5_e15
Base Model: Qwen3-32B-Instruct
Fine-tuning Data: 12,756 valid pairs from the Climsight dataset
Task: SFT for Climsight
Fine-tuning with LoRA

To maintain efficiency and prevent catastrophic forgetting, the model was trained using LoRA (Low-Rank Adaptation) with the following configuration:
Target Modules: All linear layers (all-linear)
Rank (r): 16
Alpha (alpha): 32
Dropout: 0.05
Mixed Precision: bf16
Trainable Parameters: 134,217,728 (~0.4080% of 32,893,606,912 total parameters)
Total Duration: ~29 hours
Total Epochs: 15 (with early stopping patience of 3)
Hardware: 10 nodes, each with 4 GPUs (Total 40 GPUs).
Optimizer: adamw_torch (with 4-bit NF4 quantization)
Learning Rate: 2e-5 with cosine scheduler and 50 warmup steps
Weight Decay: 0.01
Batch Size: 1 per device (Global batch size of 240 via 6 accumulation steps across 40 GPUs).
Attention Implementation: attn_implementation='sdpa'
Gradient Checkpointing: Enabled (use_reentrant=True)
Gradient Accumulation: 6 steps
Distributed Framework: DeepSpeed / PyTorch DDP / SLURM
RoPE Scaling: YaRN applied with a factor of 4.0, extending the native 32,768-token context window for longer inputs.
Max Length: 131,072 tokens (dynamically calculated based on dataset max combined length, capped at 131k).
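The headline numbers in the list above are internally consistent; a quick sanity check in plain Python (all values taken from the configuration above):

```python
# Sanity-check the training configuration numbers quoted above.

# Global batch size: 1 sample/device x 6 accumulation steps x 40 GPUs
per_device_batch = 1
grad_accum_steps = 6
num_gpus = 10 * 4  # 10 nodes x 4 GPUs each
global_batch = per_device_batch * grad_accum_steps * num_gpus
print(global_batch)  # 240

# Trainable-parameter fraction for LoRA (r=16 on all linear layers)
trainable = 134_217_728
total = 32_893_606_912
print(f"{100 * trainable / total:.4f}%")  # 0.4080%

# YaRN context extension: native window x scaling factor
native_ctx = 32_768
yarn_factor = 4.0
print(int(native_ctx * yarn_factor))  # 131072
```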
Merging: Post-training, LoRA weights were merged back into the base model to consolidate the final weights for deployment.
Formatting: Used Qwen ChatML-style prompting: <|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n{completion}<|im_end|>.
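The template above can be expressed as a small helper (a sketch; the function name and the example strings are ours, only the ChatML template itself comes from the card):

```python
def format_chatml(system_prompt: str, prompt: str, completion: str) -> str:
    """Assemble one training example in the Qwen ChatML-style format shown above."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{completion}<|im_end|>"
    )

# Hypothetical example contents, for illustration only
example = format_chatml(
    "You are Climsight.",
    "How will local sea level change?",
    "Sea level is projected to rise...",
)
print(example.count("<|im_start|>"))  # 3 turns: system, user, assistant
```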
Split: 80% Train (10,204), 10% Val (1,276), 10% Test (1,276).
Prompt length: avg=24414, min=16461, max=35660
Completion length: avg=6875, min=931, max=14068
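The 80/10/10 split sizes follow directly from the 12,756 pairs (a sketch; the exact rounding rule is our assumption, chosen because it reproduces the counts above):

```python
total_pairs = 12_756

train_n = total_pairs * 8 // 10          # 80% -> 10,204
remainder = total_pairs - train_n        # 2,552 left for val + test
val_n = remainder // 2                   # 10% -> 1,276
test_n = remainder - val_n               # 10% -> 1,276

print(train_n, val_n, test_n)  # 10204 1276 1276
```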
ClimaQA Benchmark

ClimaQA is a benchmark proposed in 2025. It is built from questions derived from graduate-level climate science textbooks.
The evaluation dataset

The datasets are available on HuggingFace. The evaluation dataset is divided into three tasks: MCQ, FreeForm QA, and Cloze. The data is also divided by complexity level (base, reasoning, hypothetical). In total there are 3,633 rows.
Evaluation Metrics
MCQ: In the Multiple-Choice Question (MCQ) task, the model is required to select the correct answer from a set of predefined options. This task evaluates the model’s factual knowledge as well as its ability to make accurate decisions under constrained answer conditions.
Cloze: In the Cloze task, the model fills in blanks with appropriate scientific terms, evaluating its contextual understanding and use of domain-specific vocabulary.
FreeForm: In the FreeForm task, the model generates detailed, structured responses, testing its ability to reason logically and produce scientifically sound explanations.
Climsight is a system that integrates large language models with high-resolution climate model outputs, scientific literature, and heterogeneous climate-related databases to produce accurate, localized, and context-aware climate assessments. The original Climsight evaluation is conducted on a set of 30 question–answer (QA) pairs, each consisting of a question and a corresponding ground-truth answer.
Our initial objective was to directly use the same evaluation methodology as Climsight. However, the original Climsight prompt used to generate model responses is not publicly available. As a result, it was not possible to reproduce the full end-to-end Climsight evaluation pipeline.
As an alternative, we constructed a dedicated evaluation set using 100 QA pairs that were explicitly excluded from training, ensuring a strict train-test split. This evaluation set contains the question, the context / prompt, and the ground-truth answer. Using this setup, we generated responses with our fine-tuned model that could be evaluated under the same assessment criteria.
To ensure comparability and methodological consistency, we adopt the Climsight Evaluation framework for assessing the generated answers. Specifically, we follow the same LLM-as-a-judge paradigm and use the identical evaluation prompt defined in the original Climsight work.
In this setup, model-generated answers are evaluated by a large language model acting as an automatic judge. Because we do not have access to GPT-based models, we employ Meta-Llama-3.1-8B-Instruct as the judge. Using the Climsight evaluation prompt, the judge compares each generated response against the corresponding reference answer along five qualitative dimensions: Completeness, Accuracy, Relevance, Clarity, and Coherence. The aggregated evaluation results across all dimensions are reported below.
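To illustrate how per-dimension judge scores can be aggregated, here is a minimal parser sketch. The output format, score scale, and function name are entirely our assumptions for illustration; they are not part of the original Climsight evaluation prompt:

```python
import re
from statistics import mean

# The five qualitative dimensions used in the Climsight evaluation
DIMENSIONS = ["Completeness", "Accuracy", "Relevance", "Clarity", "Coherence"]

def parse_judge_scores(judge_output: str) -> dict[str, float]:
    """Extract '<Dimension>: <score>' lines from the judge model's raw text.

    Assumes (our convention) that the judge was instructed to emit one
    such line per dimension; dimensions it omits are simply skipped.
    """
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{dim}\s*:\s*([0-9]+(?:\.[0-9]+)?)", judge_output, re.IGNORECASE)
        if m:
            scores[dim] = float(m.group(1))
    return scores

# Hypothetical judge output, for illustration only
raw = "Completeness: 4\nAccuracy: 5\nRelevance: 4.5\nClarity: 4\nCoherence: 5"
scores = parse_judge_scores(raw)
print(mean(scores.values()))  # 4.5
```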
Additionally, we computed the following metrics for this evaluation dataset.