
German-OCR is a fine-tuned vision-language model based on Qwen2-VL-2B, optimized for precise text recognition from German invoices, forms, and business documents. The model extracts structured data in Markdown format.

German-OCR

High-performance German document OCR using fine-tuned Qwen2-VL-2B & Qwen2.5-VL-3B vision-language models

Model Description

German-OCR is specifically trained to extract text from German documents including invoices, receipts, forms, and other business documents. It outputs structured text in Markdown format.

  • Base Model: Qwen/Qwen2-VL-2B-Instruct
  • Fine-tuning: QLoRA (4-bit quantization)
  • Training Data: German invoices and business documents
  • Output Format: Markdown structured text

Model Variants

Model           Size     Base            HuggingFace
german-ocr      4.4 GB   Qwen2-VL-2B     Keyven/german-ocr
german-ocr-3b   7.5 GB   Qwen2.5-VL-3B   Keyven/german-ocr-3b

⚠️ Work in Progress

This model is still under active development. There are currently compatibility issues with the Ollama vision adapter.

For reliable results, use the HuggingFace version.

Usage

Option 1: Python Package (Recommended)

pip install german-ocr

from german_ocr import GermanOCR

# Using Ollama (fast, local)
ocr = GermanOCR(backend="ollama")
result = ocr.extract("document.png")
print(result)

# Using Transformers (more accurate)
ocr = GermanOCR(backend="transformers")
result = ocr.extract("document.png")
print(result)
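
Building on the same API, a minimal batch sketch (assuming GermanOCR.extract returns the Markdown string printed above; the invoices/ folder name is a placeholder):

from pathlib import Path
from german_ocr import GermanOCR

# Run OCR over every PNG in a folder and write the Markdown next to each image
ocr = GermanOCR(backend="ollama")
for image_path in Path("invoices").glob("*.png"):
    markdown = ocr.extract(str(image_path))
    image_path.with_suffix(".md").write_text(markdown, encoding="utf-8")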

Option 2: Ollama

ollama run Keyvan/german-ocr "Extrahiere den Text: image.png"
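
The model can also be called from Python through the official ollama package; a minimal sketch, assuming the tag above is pulled locally and document.png exists:

import ollama

# Send the image together with a German extraction prompt to the local Ollama server
response = ollama.chat(
    model="Keyvan/german-ocr",
    messages=[{
        "role": "user",
        "content": "Extrahiere den Text aus diesem Dokument.",
        "images": ["document.png"],
    }],
)
print(response["message"]["content"])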

Option 3: Transformers

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# Load the fine-tuned model and its processor; device_map="auto" uses a GPU if available
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Keyven/german-ocr",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Keyven/german-ocr")

image = Image.open("document.png")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extrahiere den Text aus diesem Dokument."}
    ]
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(model.device)

# Generate, then decode only the newly generated tokens (skip the prompt)
output_ids = model.generate(**inputs, max_new_tokens=512)
result = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)[0]
print(result)
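
To move towards the 1.5 GB VRAM figure listed under Performance, the model can also be loaded in 4-bit via bitsandbytes; a minimal sketch (assumes bitsandbytes is installed and simply replaces the from_pretrained call above):

import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# 4-bit NF4 quantization keeps the 2B model at roughly 1.5 GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Keyven/german-ocr",
    quantization_config=bnb_config,
    device_map="auto",
)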

Performance

Metric           Value
Base Model       Qwen2-VL-2B-Instruct
Model Size       4.4 GB
VRAM (4-bit)     1.5 GB
Inference Time   ~15s (GPU)

Training

  • Method: QLoRA (4-bit quantization)
  • Epochs: 3
  • Learning Rate: 2e-4
  • LoRA Rank: 64
  • Target Modules: All linear layers
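
For reference, a comparable QLoRA configuration with peft and transformers might look like the sketch below; values not listed above (alpha, dropout, batch size) are assumptions, not the author's exact training script:

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA rank 64 on all linear layers, as listed above
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,             # assumption: not stated in the card
    lora_dropout=0.05,          # assumption: not stated in the card
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# 3 epochs at learning rate 2e-4, as listed above
training_args = TrainingArguments(
    output_dir="german-ocr-qlora",
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=1,  # assumption: not stated in the card
    bf16=True,
)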

Limitations

  • Optimized for German documents
  • Best results with clear, high-resolution images
  • May struggle with handwritten text

License

Apache 2.0

Author

Keyvan Hardani

  • Website: keyvan.ai
  • LinkedIn: linkedin.com/in/keyvanhardani
  • GitHub: @Keyvanhardani

Links