124 Downloads Updated 3 months ago
Created by Eric Hartford at QuixiAI
Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of Devstral-Small-2507 with the vision understanding of Mistral-Small-3.2-24B-Instruct-2506.
This model enables vision-augmented software engineering tasks, allowing developers to: - Analyze screenshots and UI mockups to generate code - Debug visual rendering issues with actual screenshots - Convert designs and wireframes directly into implementation - Understand and modify codebases with visual context
This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:
The result is a model that combines Devstral’s state-of-the-art coding capabilities with Mistral’s vision understanding.
Here is the script
The model is optimized for use with OpenHands for agentic coding tasks:
# Using vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tensor-parallel-size 2
# Configure OpenHands to use the model
# Set Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
# Set Base URL: http://localhost:8000/v1
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
model_id = "cognitivecomputations/Devstral-Vision-Small-2507"
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
# Load an image
image = Image.open("screenshot.png")
# Create a prompt
prompt = "Analyze this UI screenshot and generate React code to reproduce it."
# Process inputs
inputs = processor(
text=prompt,
images=image,
return_tensors="pt"
).to(model.device)
# Generate
outputs = model.generate(
**inputs,
max_new_tokens=2000,
temperature=0.7
)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
Inherits Devstral’s exceptional performance on coding tasks: - 53.6% on SWE-Bench Verified (when used with OpenHands) - Superior performance on multi-file editing and codebase exploration - Excellent tool use and agentic behavior
Maintains Mistral-Small’s vision capabilities: - Strong understanding of UI elements and layouts - Accurate interpretation of charts, diagrams, and visual documentation - Reliable screenshot analysis for debugging
This model inherits both the capabilities and limitations of its parent models. Users should: - Review generated code for security vulnerabilities - Verify visual interpretations are accurate - Be aware of potential biases in code generation - Use appropriate safety measures in production deployments
If you use this model, please cite:
@misc{devstral-vision-2507,
author = {Hartford, Eric},
title = {Devstral-Vision-Small-2507},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
}
This model builds upon the excellent work by: - Mistral AI for both Mistral-Small and Devstral - All Hands AI for their collaboration on Devstral - The open-source community for testing and feedback
Apache 2.0 - Same as the base models
Created with dolphin passion 🐬 by Cognitive Computations