
German-OCR is a fine-tuned vision-language model based on Qwen2-VL-2B, optimized for precise text recognition from German invoices, forms, and business documents. The model extracts structured data in Markdown format.

German-OCR

High-performance German document OCR using fine-tuned Qwen2-VL-2B & Qwen2.5-VL-3B vision-language models

Model Description

German-OCR is specifically trained to extract text from German documents including invoices, receipts, forms, and other business documents. It outputs structured text in Markdown format.

  • Base Model: Qwen/Qwen2-VL-2B-Instruct
  • Fine-tuning: QLoRA (4-bit quantization)
  • Training Data: German invoices and business documents
  • Output Format: Markdown structured text

Model Variants

Model           Size     Base            HuggingFace
german-ocr      4.4 GB   Qwen2-VL-2B     Keyven/german-ocr
german-ocr-3b   7.5 GB   Qwen2.5-VL-3B   Keyven/german-ocr-3b

⚠️ Work in Progress

This model is still under active development. There are currently compatibility issues with the Ollama vision adapter.

For reliable results, use the HuggingFace version.

Usage

Option 1: Python Package (Recommended)

pip install german-ocr

from german_ocr import GermanOCR

# Using Ollama (fast, local)
ocr = GermanOCR(backend="ollama")
result = ocr.extract("document.png")
print(result)

# Using Transformers (more accurate)
ocr = GermanOCR(backend="transformers")
result = ocr.extract("document.png")
print(result)
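
Building on the same API, a minimal batch sketch (assuming GermanOCR.extract returns the Markdown string printed above; the invoices/ folder name is a placeholder):

from pathlib import Path
from german_ocr import GermanOCR

# Run OCR over every PNG in a folder and write the Markdown next to each image
ocr = GermanOCR(backend="ollama")
for image_path in Path("invoices").glob("*.png"):
    markdown = ocr.extract(str(image_path))
    image_path.with_suffix(".md").write_text(markdown, encoding="utf-8")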

Option 2: Ollama

ollama run Keyvan/german-ocr "Extrahiere den Text: image.png"
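
The model can also be called from Python through the official ollama package; a minimal sketch, assuming the tag above is pulled locally and document.png exists:

import ollama

# Send the image together with a German extraction prompt to the local Ollama server
response = ollama.chat(
    model="Keyvan/german-ocr",
    messages=[{
        "role": "user",
        "content": "Extrahiere den Text aus diesem Dokument.",
        "images": ["document.png"],
    }],
)
print(response["message"]["content"])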

Option 3: Transformers

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# Load the fine-tuned model and its processor; device_map="auto" uses a GPU if available
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Keyven/german-ocr",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Keyven/german-ocr")

image = Image.open("document.png")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extrahiere den Text aus diesem Dokument."}
    ]
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(model.device)

# Generate, then decode only the newly generated tokens (skip the prompt)
output_ids = model.generate(**inputs, max_new_tokens=512)
result = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)[0]
print(result)
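
To move towards the 1.5 GB VRAM figure listed under Performance, the model can also be loaded in 4-bit via bitsandbytes; a minimal sketch (assumes bitsandbytes is installed and simply replaces the from_pretrained call above):

import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# 4-bit NF4 quantization keeps the 2B model at roughly 1.5 GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Keyven/german-ocr",
    quantization_config=bnb_config,
    device_map="auto",
)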

Performance

Metric           Value
Base Model       Qwen2-VL-2B-Instruct
Model Size       4.4 GB
VRAM (4-bit)     1.5 GB
Inference Time   ~15s (GPU)

Training

  • Method: QLoRA (4-bit quantization)
  • Epochs: 3
  • Learning Rate: 2e-4
  • LoRA Rank: 64
  • Target Modules: All linear layers
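
For reference, a comparable QLoRA configuration with peft and transformers might look like the sketch below; values not listed above (alpha, dropout, batch size) are assumptions, not the author's exact training script:

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA rank 64 on all linear layers, as listed above
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,             # assumption: not stated in the card
    lora_dropout=0.05,          # assumption: not stated in the card
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# 3 epochs at learning rate 2e-4, as listed above
training_args = TrainingArguments(
    output_dir="german-ocr-qlora",
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=1,  # assumption: not stated in the card
    bf16=True,
)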

Limitations

  • Optimized for German documents
  • Best results with clear, high-resolution images
  • May struggle with handwritten text

License

Apache 2.0

Author

Keyvan Hardani

  • Website: keyvan.ai
  • LinkedIn: linkedin.com/in/keyvanhardani
  • GitHub: @Keyvanhardani

Links