ollama run mkchaou/climsight-calm_ft_Q3_13k
Base Model: Qwen3-4B-Instruct-2507
Fine-tuning Data: 12,756 valid pairs from the Climsight dataset
Task: SFT for Climsight
Fine-tuning with LoRA
To maintain efficiency and prevent catastrophic forgetting, the model was trained using LoRA (Low-Rank Adaptation) with the following configuration:
Target Modules: All linear layers (all-linear)
Rank (r): 16
Alpha (alpha): 32
Dropout: 0.05
Mixed Precision: bf16
Trainable Parameters: ~33 million (approx. 0.81% of total parameters)
Total Duration: ~59 hours (combined runs)
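The configuration above can be sketched with the Hugging Face PEFT API. This is a minimal illustration, not the card's actual training script; the base-model checkpoint ID is an assumption.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed checkpoint ID; swap in the actual base model used for fine-tuning.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", torch_dtype=torch.bfloat16
)

lora_cfg = LoraConfig(
    r=16,                         # rank
    lora_alpha=32,                # alpha
    lora_dropout=0.05,
    target_modules="all-linear",  # adapt every linear layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # roughly 33M trainable (~0.8% of total)
```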
Final Training State (at ~14.91 epochs):
- Loss: 0.3604
- Mean Token Accuracy: 89.26%
- Processed Tokens: ~307 million
Hardware: 2 nodes, each with 4 GPUs (Total 8 GPUs).
Optimizer: paged_adamw_8bit (to reduce VRAM footprint)
Learning Rate: 1e-5 with a cosine scheduler and 50 warmup steps
Weight Decay: 0.01
Batch Size: 1 per device (Global batch size of 64 via 8 accumulation steps across 8 GPUs).
Gradient Management:
- Gradient Checkpointing: enabled to save memory during backpropagation
- Gradient Accumulation: 8 steps
Distributed Framework: DeepSpeed (ZeRO Stage 3) for multi-node/multi-GPU scaling.
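The effective global batch size follows directly from the per-device batch, the accumulation steps, and the GPU count; a quick sanity check:

```python
# Global batch = per-device batch × gradient-accumulation steps × number of GPUs.
per_device_batch = 1
grad_accum_steps = 8
num_gpus = 2 * 4  # 2 nodes × 4 GPUs each

global_batch = per_device_batch * grad_accum_steps * num_gpus
print(global_batch)  # 64
```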
Merging: Post-training, LoRA weights were merged back into the base model to consolidate the final weights for deployment.
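A merge step of this kind typically looks like the following PEFT sketch; the checkpoint ID and adapter/output paths are placeholders, not the actual artifacts.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Placeholder paths; substitute the real base checkpoint and adapter directory.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained("climsight-calm-merged")
```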
Custom Tokenization: To prevent Hugging Face’s default truncation on long-context climate data, a custom tokenization pipeline was used.
Max Length: Dynamically calculated from the dataset so that full prompts (avg. ~24k tokens, up to ~35k tokens) are preserved without truncation.
Formatting: Used ChatML-style prompting: <|im_start|>system…<|im_end|> (for Qwen).
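The ChatML-style layout above can be illustrated with a small formatting helper; the example strings are illustrative only.

```python
# Minimal sketch of ChatML-style formatting (as used by Qwen models).
def to_chatml(system: str, user: str, assistant: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>"
    )

sample = to_chatml(
    "You are a climate assessment assistant.",
    "How will local temperatures change?",
    "Based on the provided climate data ...",
)
print(sample.startswith("<|im_start|>system"))  # True
```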
Split: 80% Train (10,204), 10% Val (1,276), 10% Test (1,276).
Prompt length: avg=24414, min=16461, max=35660
Completion length: avg=6875, min=931, max=14068
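The split sizes reported above follow from an 80/10/10 partition of the 12,756 pairs:

```python
# Sanity check of the 80/10/10 train/val/test split.
total = 12_756
train = int(total * 0.8)           # floor of 10 204.8 -> 10 204
val = test = (total - train) // 2  # remaining 2 552 split evenly -> 1 276 each
print(train, val, test)  # 10204 1276 1276
```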
ClimaQA Benchmark
ClimaQA is a benchmark proposed in 2025, built from questions derived from graduate-level climate science textbooks.
The evaluation dataset
The datasets are available on Hugging Face. The evaluation dataset is divided into three tasks: MCQ, FreeForm QA, and Cloze. The data is also divided by complexity level (base, reasoning, hypothetical). In total there are 3,633 rows.
Evaluation Metrics
MCQ: In the Multiple-Choice Question (MCQ) task, the model is required to select the correct answer from a set of predefined options. This task evaluates the model’s factual knowledge as well as its ability to make accurate decisions under constrained answer conditions.
Cloze: In the Cloze task, the model fills in blanks with appropriate scientific terms, evaluating its contextual understanding and use of domain-specific vocabulary.
FreeForm: In the FreeForm task, the model generates detailed, structured responses, testing its ability to reason logically and produce scientifically sound explanations.
Climsight is a system that integrates large language models with high-resolution climate model outputs, scientific literature, and heterogeneous climate-related databases to produce accurate, localized, and context-aware climate assessments. The original Climsight evaluation is conducted on a set of 30 question–answer (QA) pairs, each consisting of a question and a corresponding ground-truth answer.
Our initial objective was to directly use the same evaluation methodology as Climsight. However, the original Climsight prompt used to generate model responses is not publicly available. As a result, it was not possible to reproduce the full end-to-end Climsight evaluation pipeline.
As an alternative, we constructed a dedicated evaluation set using 100 QA pairs that were explicitly excluded from training, ensuring a strict train-test split. This evaluation set contains the question, the context / prompt, and the ground-truth answer. Using this setup, we generated responses with our fine-tuned model that could be evaluated under the same assessment criteria.
To ensure comparability and methodological consistency, we adopt the Climsight Evaluation framework for assessing the generated answers. Specifically, we follow the same LLM-as-a-judge paradigm and use the identical evaluation prompt defined in the original Climsight work.
In this setup, model-generated answers are evaluated by a large language model acting as an automatic judge. Due to the lack of access to GPT-based models, we employ Meta-Llama-3.1-8B-Instruct as the judge model. The judge evaluates each response by comparing it against the corresponding reference answer along five qualitative dimensions: Completeness, Accuracy, Relevance, Clarity, and Coherence.
The generated answers are assessed with the Climsight evaluation prompt and the Meta-Llama-3.1-8B-Instruct judge; the aggregated results across all five dimensions are reported below.
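Aggregation over the five judge dimensions can be sketched as a simple per-dimension mean. The scoring scale and the sample scores below are assumptions for illustration, not actual results.

```python
# Dimensions from the Climsight evaluation prompt; scores here are illustrative.
DIMENSIONS = ["Completeness", "Accuracy", "Relevance", "Clarity", "Coherence"]

def aggregate(scores_per_answer):
    """Mean judge score per dimension over all evaluated answers."""
    n = len(scores_per_answer)
    return {d: sum(s[d] for s in scores_per_answer) / n for d in DIMENSIONS}

judged = [
    {"Completeness": 8, "Accuracy": 7, "Relevance": 9, "Clarity": 8, "Coherence": 8},
    {"Completeness": 6, "Accuracy": 8, "Relevance": 7, "Clarity": 9, "Coherence": 7},
]
print(aggregate(judged)["Accuracy"])  # 7.5
```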
Additionally, we computed the following metrics for this evaluation dataset.