Devstral-Vision-Small-2507

Model Description

Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of Devstral-Small-2507 with the vision understanding of Mistral-Small-3.2-24B-Instruct-2506.

This model enables vision-augmented software engineering tasks, allowing developers to: - Analyze screenshots and UI mockups to generate code - Debug visual rendering issues with actual screenshots - Convert designs and wireframes directly into implementation - Understand and modify codebases with visual context

Model Details

Base Architecture: Mistral Small 3.2 with vision encoder
Parameters: 24B (language model) + vision components
Context Window: 128k tokens
License: Apache 2.0
Language Model: Fine-tuned Devstral weights for superior coding performance
Vision Model: Mistral-Small vision encoder and multimodal projector

How It Was Created

This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:

Started with Mistral-Small-3.2-24B-Instruct-2506 (complete multimodal model)
Replaced only the core language model weights with Devstral-Small-2507’s fine-tuned weights
Preserved Mistral’s vision encoder, multimodal projector, vision-language adapter, and token embeddings
Kept Mistral’s tokenizer to maintain proper image token handling

The result is a model that combines Devstral’s state-of-the-art coding capabilities with Mistral’s vision understanding.

Here is the script

Intended Use

Primary Use Cases

Visual Software Engineering: Analyze UI screenshots, mockups, and designs to generate implementation code
Code Review with Visual Context: Review code changes alongside their visual output
Debugging Visual Issues: Debug rendering problems by analyzing screenshots
Design-to-Code: Convert visual designs directly into code
Documentation with Visual Examples: Generate documentation that references visual elements

Example Applications

Building UI components from screenshots
Debugging CSS/styling issues with visual feedback
Converting Figma/design mockups to code
Analyzing and reproducing visual bugs
Creating visual test cases

Usage

With OpenHands

The model is optimized for use with OpenHands for agentic coding tasks:

# Using vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --tensor-parallel-size 2

# Configure OpenHands to use the model
# Set Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
# Set Base URL: http://localhost:8000/v1

With Transformers

import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image
image = Image.open("screenshot.png")

# Create a prompt
prompt = "Analyze this UI screenshot and generate React code to reproduce it."

# Process inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=2000,
    temperature=0.7
)

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)

Performance Expectations

Coding Performance

Inherits Devstral’s exceptional performance on coding tasks: - 53.6% on SWE-Bench Verified (when used with OpenHands) - Superior performance on multi-file editing and codebase exploration - Excellent tool use and agentic behavior

Vision Performance

Maintains Mistral-Small’s vision capabilities: - Strong understanding of UI elements and layouts - Accurate interpretation of charts, diagrams, and visual documentation - Reliable screenshot analysis for debugging

Hardware Requirements

GPU Memory: ~48GB for full precision, ~24GB with 4-bit quantization
Recommended: 2x RTX 4090 or better for optimal performance
Minimum: Single GPU with 24GB VRAM using quantization

Limitations

Vision capabilities are limited to what Mistral-Small-3.2 supports
Not specifically fine-tuned on vision-to-code tasks (uses Devstral’s text-only fine-tuning)
Large model size may be prohibitive for some deployment scenarios
Best performance achieved when used with appropriate scaffolding (OpenHands, Cline, etc.)

Ethical Considerations

This model inherits both the capabilities and limitations of its parent models. Users should: - Review generated code for security vulnerabilities - Verify visual interpretations are accurate - Be aware of potential biases in code generation - Use appropriate safety measures in production deployments

Citation

If you use this model, please cite:

@misc{devstral-vision-2507,
  author = {Hartford, Eric},
  title = {Devstral-Vision-Small-2507},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
}

Acknowledgments

This model builds upon the excellent work by: - Mistral AI for both Mistral-Small and Devstral - All Hands AI for their collaboration on Devstral - The open-source community for testing and feedback

License

Apache 2.0 - Same as the base models

Created with dolphin passion 🐬 by Cognitive Computations

Models

Readme