EmbeddingGemma-300M-LawVault is a high-performance embedding model fine-tuned specifically for Chinese legal RAG (Retrieval-Augmented Generation) scenarios.

Fine-tuned from Google's embeddinggemma-300m, the model uses a contrastive learning approach that combines MultipleNegativesRankingLoss with MatryoshkaLoss. It was trained on a high-quality dataset of over 60,000 (Query, Positive, Hard Negative) triplets, which significantly improves retrieval accuracy on legal statutes and colloquial legal inquiries, as well as robustness to noisy queries, compared to the base model.
Note: This model is fine-tuned exclusively on Chinese laws and regulations. Its performance on other languages or non-legal domains has not been evaluated and is not guaranteed.
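A minimal retrieval sketch with the sentence-transformers library is shown below. The repository id and the example texts are illustrative assumptions, not part of the official release:

```python
# Minimal usage sketch (the model id below is a placeholder; adjust to the actual hub/local path).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("EmbeddingGemma-300M-LawVault")  # hypothetical id

queries = ["商家定价需要遵守哪些法律规定？"]  # a colloquial legal query
statutes = [
    "《中华人民共和国价格法》第八条：经营者定价的基本依据是生产经营成本和市场供求状况。",
    "《中华人民共和国民法典》第五百零九条：当事人应当按照约定全面履行自己的义务。",
]

q_emb = model.encode(queries)
d_emb = model.encode(statutes)

# Cosine-similarity ranking; higher score = better match.
scores = model.similarity(q_emb, d_emb)
print(scores)
```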
The model was evaluated on a held-out test set built from real legal scenarios (120 unseen colloquial legal queries generated by Deepseek V3.2). The end-to-end RAG retrieval results are as follows:
| Metric | Base Model | Fine-tuned Model (Ours) | Improvement |
|---|---|---|---|
| Hit Rate @ 10 | 85.0% | 98.0% | +13 pts; significantly fewer "answer not found" cases |
| Top-1 Accuracy | 58.0% | 92.0% | +34 pts; the correct statute is ranked 1st in the vast majority of cases |
| MRR @ 10 | 0.78 | 0.96 | +0.18; very high ranking quality |
Note: The test environment used a LanceDB vector database indexed over the full corpus of Chinese laws and regulations.
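For reference, the sketch below shows how such an end-to-end evaluation could be wired up with LanceDB; the table name, database path, and variable names are assumptions for illustration, not the exact evaluation harness:

```python
# Evaluation sketch: Hit Rate@10, Top-1 Accuracy, and MRR@10 over a LanceDB index.
# Assumptions: `model` is the loaded SentenceTransformer, `statutes` is a list of
# (doc_id, text) pairs, and `test_set` is a list of (query, gold_doc_id) pairs.
import lancedb

db = lancedb.connect("./lawvault_eval")  # hypothetical local path
embeddings = model.encode([text for _, text in statutes])
table = db.create_table(
    "statutes",
    data=[{"doc_id": doc_id, "text": text, "vector": vec.tolist()}
          for (doc_id, text), vec in zip(statutes, embeddings)],
)

hits, top1, rr_sum = 0, 0, 0.0
for query, gold_id in test_set:
    results = table.search(model.encode(query).tolist()).limit(10).to_list()
    ranked_ids = [r["doc_id"] for r in results]
    if gold_id in ranked_ids:
        rank = ranked_ids.index(gold_id) + 1  # 1-based rank of the gold statute
        hits += 1
        top1 += rank == 1
        rr_sum += 1.0 / rank

n = len(test_set)
print(f"Hit Rate@10: {hits / n:.1%}  Top-1: {top1 / n:.1%}  MRR@10: {rr_sum / n:.2f}")
```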
Qualitative examples of retrieval ranking before and after fine-tuning:

| User Query | Base Model Rank | Fine-tuned Model Rank |
|---|---|---|
| “Can the provincial cultural relics bureau directly transfer artifacts unearthed in our area?” | ❌ Not Retrieved (10+) | ✅ 1st |
| “What are the legal requirements for merchants when setting product prices?” | ❌ Not Retrieved (10+) | ✅ 1st |
| “If land is requisitioned for a large hydropower station, how is compensation calculated?” | 2nd | ✅ 1st |
| “How does the government financially support rural revitalization?” | 6th | ✅ 1st |
The Matryoshka configuration, with MatryoshkaLoss wrapping MultipleNegativesRankingLoss:

```json
{
  "matryoshka_dims": [768, 512, 256, 128],
  "matryoshka_weights": [1, 1, 1, 1]
}
```
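A minimal training sketch with this loss setup, assuming the sentence-transformers trainer API and a triplet dataset with anchor/positive/negative columns (the example rows and the lack of training arguments are illustrative simplifications, not the exact recipe):

```python
# Sketch of the loss setup: MatryoshkaLoss wrapping MultipleNegativesRankingLoss.
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("google/embeddinggemma-300m")

# Illustrative triplets; the real dataset has 60k+ (query, positive, hard negative) rows.
train_dataset = Dataset.from_dict({
    "anchor": ["商家定价需要遵守哪些法律规定？"],
    "positive": ["《中华人民共和国价格法》第八条……"],
    "negative": ["《中华人民共和国广告法》第四条……"],
})

# In-batch negatives plus the explicit hard-negative column.
base_loss = losses.MultipleNegativesRankingLoss(model)

# Train all four embedding sizes at once so truncated vectors remain usable.
loss = losses.MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128],
    matryoshka_weights=[1, 1, 1, 1],
)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```

Because all four dimensions are weighted equally during training, embeddings can be truncated to 512, 256, or 128 dimensions at inference time with only a modest quality drop, trading accuracy for index size and speed.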
If you use this model, please cite the following:
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```