EmbeddingGemma-300M-LawVault (Chinese Legal RAG)

📖 Model Introduction

EmbeddingGemma-300M-LawVault is a high-performance embedding model fine-tuned specifically for Chinese Legal RAG (Retrieval-Augmented Generation) scenarios.

Fine-tuned from Google’s embeddinggemma-300m with a contrastive learning objective (MultipleNegativesRankingLoss wrapped in MatryoshkaLoss), the model was trained on a high-quality dataset of over 60,000 (Query, Positive, Hard Negative) triplets. Compared to the base model, it significantly improves retrieval accuracy on legal statutes, understanding of colloquial legal inquiries, and resistance to distractor clauses.

Note: This model is fine-tuned exclusively on Chinese laws and regulations. Its performance on other languages or non-legal domains has not been evaluated and is not guaranteed.
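A minimal usage sketch with Sentence Transformers (the repo id and the candidate clause texts below are placeholders; substitute the actual model path and your own corpus):

```python
from sentence_transformers import SentenceTransformer

# Placeholder repo id; replace with the actual model path or Hub id.
model = SentenceTransformer("your-org/EmbeddingGemma-300M-LawVault")

query = "What are the legal requirements for merchants when setting product prices?"
clauses = [
    "(text of a pricing-related statute clause)",
    "(text of an unrelated statute clause)",
]

query_emb = model.encode(query)
clause_embs = model.encode(clauses)

# Cosine similarity between the query and each candidate clause.
print(model.similarity(query_emb, clause_embs))
```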

Key Highlights

  • Domain Specialization: Specifically addresses the pain point where general models fail to distinguish between “National Laws” and “Local Regulations/Administrative Rules” with similar wording.
  • Anti-Interference: Trained with “Source-aware Hard Negatives”—using the base model’s incorrect top retrievals for the same query as hard negatives—enabling the model to precisely filter out confusingly similar but incorrect clauses.
  • Colloquial Understanding: The training set includes queries generated by LLMs to simulate real-world user questions, bridging the semantic gap between formal legal terminology and everyday language.
  • Matryoshka Embeddings: Supports flexible output vector dimensions (768, 512, 256, 128), allowing for significantly reduced storage costs without major performance loss (see the sketch after this list).
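
A minimal sketch of selecting a reduced dimension via Sentence Transformers’ truncate_dim parameter (the repo id is again a placeholder):

```python
from sentence_transformers import SentenceTransformer

# truncate_dim picks one of the supported Matryoshka dimensions.
model = SentenceTransformer("your-org/EmbeddingGemma-300M-LawVault", truncate_dim=256)

emb = model.encode("How does the government financially support rural revitalization?")
print(emb.shape)  # (256,)
```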

📊 Evaluation Performance

The model was evaluated on a held-out test set constructed from real legal scenarios (containing 120 unseen colloquial legal queries generated by Deepseek V3.2). The End-to-End RAG retrieval results are as follows:

| Metric | Base Model | Finetuned Model (Ours) | Improvement |
|:---|:---:|:---:|:---|
| Hit Rate @ 10 | 85.0% | 98.0% | Significant reduction in “answer not found” cases |
| Top-1 Accuracy | 58.0% | 92.0% | +34 points; the vast majority of correct answers are ranked 1st |
| MRR @ 10 | 0.78 | 0.96 | Very high ranking quality |
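
For reference, a minimal sketch of how these three metrics are typically computed from per-query ranks (illustrative only, not the actual evaluation script):

```python
def retrieval_metrics(ranks, k=10):
    """ranks: 1-based rank of the gold clause per query, or None if outside the top k."""
    n = len(ranks)
    hits = [r for r in ranks if r is not None and r <= k]
    hit_rate = len(hits) / n                    # Hit Rate @ k
    top1 = sum(1 for r in ranks if r == 1) / n  # Top-1 Accuracy
    mrr = sum(1.0 / r for r in hits) / n        # MRR @ k
    return hit_rate, top1, mrr

# Example: three queries with gold ranks 1, 3, and "not retrieved".
print(retrieval_metrics([1, 3, None]))  # (0.666..., 0.333..., 0.444...)
```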

Note: The test environment used a LanceDB vector database indexing a full snapshot of the Chinese laws and regulations database.
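
A minimal end-to-end retrieval sketch with LanceDB (the table name, schema, corpus texts, and repo id are illustrative assumptions):

```python
import lancedb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-org/EmbeddingGemma-300M-LawVault")  # placeholder repo id

clauses = [
    "(text of statute clause 1)",
    "(text of statute clause 2)",
]
rows = [{"vector": model.encode(text).tolist(), "text": text} for text in clauses]

db = lancedb.connect("./lawvault.lancedb")
table = db.create_table("statutes", data=rows)

query = "If land is requisitioned for a large hydropower station, how is compensation calculated?"
hits = table.search(model.encode(query).tolist()).limit(10).to_list()
for hit in hits:
    print(hit["text"])
```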

Case Study

| User Query | Base Model Rank | Finetuned Rank |
|:---|:---:|:---:|
| “Can the provincial cultural relics bureau directly transfer artifacts unearthed in our area?” | ❌ Not Retrieved (10+) | 1st |
| “What are the legal requirements for merchants when setting product prices?” | ❌ Not Retrieved (10+) | 1st |
| “If land is requisitioned for a large hydropower station, how is compensation calculated?” | 2nd | 1st |
| “How does the government financially support rural revitalization?” | 6th | 1st |

Training Details (Generated by Trainer)

Dataset

  • Size: 65,783 training triplets (Anchor, Positive, Hard Negative); the hard-negative mining step is sketched below
  • Source: Chinese Laws & Regulations (Civil, Criminal, Administrative, etc.)
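
The “Source-aware Hard Negatives” from the Key Highlights were mined from the base model’s own wrong top retrievals. A simplified sketch of that mining step, assuming brute-force search over normalized embeddings (all names and texts are illustrative):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

base = SentenceTransformer("google/embeddinggemma-300m")

corpus = ["(clause A)", "(clause B)", "(clause C)"]
corpus_embs = base.encode(corpus, normalize_embeddings=True)

def mine_hard_negative(query: str, positive: str) -> str:
    """Return the base model's top-ranked clause that is NOT the gold positive."""
    q = base.encode(query, normalize_embeddings=True)
    scores = corpus_embs @ q  # cosine similarity (embeddings are normalized)
    for idx in np.argsort(-scores):
        if corpus[idx] != positive:
            return corpus[idx]  # the most confusing wrong answer
    raise ValueError("corpus contains only the positive")
```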

Training Hyperparameters

  • Batch Size: 24 (Effective Batch Size = 144 with Gradient Accumulation)
  • Learning Rate: 2e-05
  • Epochs: 3
  • Precision: bf16 (BFloat16)
  • Gradient Accumulation: 6 steps
  • Max Sequence Length: 1024 tokens
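
These settings map onto Sentence Transformers’ SentenceTransformerTrainingArguments roughly as follows (output_dir is a placeholder; the max sequence length is set on the model itself, not on the trainer):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="embeddinggemma-300m-lawvault",  # placeholder
    per_device_train_batch_size=24,
    gradient_accumulation_steps=6,  # effective batch size = 24 * 6 = 144
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
)

# Max sequence length is a model attribute:
# model.max_seq_length = 1024
```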

Loss Function

MatryoshkaLoss wrapping MultipleNegativesRankingLoss:

{
    "matryoshka_dims": [768, 512, 256, 128],
    "matryoshka_weights": [1, 1, 1, 1]
}
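
In Sentence Transformers code, this configuration corresponds roughly to the following sketch (the base model load is shown only for self-containment):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("google/embeddinggemma-300m")

# In-batch negatives plus the mined hard negative per triplet.
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model=model,
    loss=inner_loss,
    matryoshka_dims=[768, 512, 256, 128],
    matryoshka_weights=[1, 1, 1, 1],
)
```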

Training Logs

| Epoch  | Step | Training Loss |
|:------:|:----:|:-------------:|
| 0.0022 | 1    | 3.5148        |
| ...    | ...  | ...           |
| 1.0    | 457  | 0.2123        |
| 2.0    | 914  | 0.0749        |
| 3.0    | 1371 | 0.0369        |

Framework Versions

  • Python: 3.13.1
  • Sentence Transformers: 5.1.2
  • Transformers: 4.57.1
  • PyTorch: 2.9.1+cu130
  • Accelerate: 1.12.0
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Citation

If you use this model, please cite the following:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}