
Gemma 4 E2B fine-tuned on 122k microscopy VQA · 145+ genera · 5 categories · runs offline on a sub-$100 phone · Unsloth + llama.cpp · Apache 2.0 · research/educational only, not a medical device


MicroLens v2

Research model · Apache 2.0 · Not a medical device · Not a certified instrument · Use at your own risk. Outputs are statistical pattern matches against training data, not analytical measurements. See full disclaimer below.

A small vision-language model for microscopy. Gemma 4 E2B fine-tuned on 122,399 image-question-answer pairs covering 145+ taxonomic genera across diatoms, freshwater and marine zooplankton, fungal spores, and fish larvae. Q4_K_M GGUF, 3.4 GB. Runs on a phone.

ollama run brinzaengineeringai/microlens-v2

What it does

Give it a microscopy image. Get back one line of structured taxonomic text. Same image, same answer, every time.

This is a diatom of the genus Navicula, specifically Navicula gregaria.

That format is on purpose. v2 is built for pipelines that need to ingest thousands of images and feed the result into a database. No prose, no chain-of-thought, no markdown surprises. If you want a longer scientific description (morphology, habitat, identification cues), use microlens-v3 instead.
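Because the answer is a single fixed-shape sentence, downstream ingestion can be a one-regex affair. A minimal sketch of such a parser, assuming the exact sentence shape shown above (the pattern and field names are illustrative, not part of the model's contract):

```python
import re

# Assumed answer shape:
# "This is a <category> of the genus <Genus>, specifically <Genus species>."
PATTERN = re.compile(
    r"This is an? (?P<category>[\w\s]+?) of the genus (?P<genus>\w+),"
    r" specifically (?P<species>[\w\s]+)\."
)

def parse_answer(line):
    """Split one model answer into category / genus / species fields,
    or return None if the line does not match the expected shape."""
    m = PATTERN.match(line.strip())
    return m.groupdict() if m else None

row = parse_answer(
    "This is a diatom of the genus Navicula, specifically Navicula gregaria."
)
# row -> {'category': 'diatom', 'genus': 'Navicula', 'species': 'Navicula gregaria'}
```

Returning None on a mismatch (rather than raising) lets a batch pipeline route malformed answers to a review queue instead of aborting.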

Accuracy

Stratified evaluation on 220 held-out validation images.

Category                 Category match   Genus match   Notes
Diatoms                  100%             ~50%          Largest class in training (8k+ samples)
Freshwater zooplankton   97%              ~45%          Rotifers, copepods, ciliates
Marine zooplankton       100%             ~45%          Copepods, ostracods, krill larvae
Fungal spores            100%             ~50%          Plant-pathogenic conidia
Fish larvae              100%             n/a           Pseudo-genus, see Limitations

For reference, a uniform random guess across the 145+ genera would land around 0.7% (1/145 ≈ 0.69%).
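The two metrics above are plain agreement fractions over (category, genus) pairs. A minimal sketch, assuming predictions and ground truth are available as such pairs (the sample data here is illustrative, not the real validation set):

```python
def match_rates(pairs):
    """Fraction of (predicted, truth) pairs that agree on category
    and on genus -- the two columns reported in the table above."""
    cat = sum(p[0] == t[0] for p, t in pairs) / len(pairs)
    gen = sum(p[1] == t[1] for p, t in pairs) / len(pairs)
    return cat, gen

# Tiny illustrative sample: right category both times, right genus once.
sample = [
    (("diatom", "Navicula"), ("diatom", "Navicula")),
    (("diatom", "Gomphonema"), ("diatom", "Navicula")),
]
cat_acc, gen_acc = match_rates(sample)
# cat_acc -> 1.0, gen_acc -> 0.5
```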

Performance

Measured on actual hardware:

  • RTX 3090 Ti: 0.4 to 0.6 seconds per answer
  • Sub-$100 Android phone with 8 GB RAM: 1.5 to 2.5 seconds
  • Raspberry Pi 5: about 3 seconds

The Android client uses llama.cpp + mtmd via JNI. Desktop runs llama-server and streams tokens over SSE (server-sent events).
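An SSE stream is just text lines, so the desktop reader needs very little code. A minimal sketch of the consuming side; the `content`/`stop` field names are an assumption based on llama-server's streaming `/completion` responses and may differ between llama.cpp versions:

```python
import json

def read_sse_events(lines):
    """Yield decoded JSON payloads from an SSE text stream.

    Assumes llama-server-style events: each event is one
    'data: {...}' line, with blank lines between events.
    """
    for line in lines:
        line = line.strip()
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

# A mocked stream standing in for the HTTP response body.
stream = [
    'data: {"content": "This is a diatom", "stop": false}',
    '',
    'data: {"content": " of the genus Navicula.", "stop": true}',
]
answer = "".join(ev["content"] for ev in read_sse_events(stream))
```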

Intended use and full disclaimer

MicroLens v2 is a research and educational artefact published under Apache 2.0. It is a fine-tuned neural network, not a regulated instrument.

Designed for:

  • Citizen-science screening
  • Taxonomy teaching and student labs
  • ML research, dataset benchmarking, model comparison
  • Pre-classification stages of professional pipelines, where every result is verified by a qualified person before any decision is made

This model is NOT, and must not be treated as:

  • A medical device, in-vitro diagnostic (IVD), or clinical decision-support tool
  • A regulatory-compliant water-quality measurement instrument (no ISO 17025, EPA, EU WFD, or equivalent certification)
  • A substitute for a trained taxonomist or accredited laboratory analysis
  • A calibrated, validated, or peer-reviewed analytical method

The model’s output is a probabilistic pattern match against the training data distribution, not a physical or analytical measurement. The model can be confidently wrong, particularly on:

  • specimens not represented in training (145+ genera ≠ all microscopic life)
  • damaged, atypical, or out-of-focus images
  • subjects from kingdoms or phyla outside the training categories

No warranty. This software is provided “AS IS”, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and non-infringement. In no event shall the author or contributors be liable for any claim, damages or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the model or the use or other dealings in the model.

You assume all risk when downloading, deploying, modifying, or using this model on your own hardware. Always have qualified personnel verify any result that informs a regulatory, environmental, clinical, or health-related decision.

Limitations

A few things to know:

For fish larvae, the underlying dataset has no species-level annotation. The model returns the category name as the “genus” for this class. Don’t import that into a taxonomic database.

The output format is rigid. That’s a feature for parsers and a limitation for humans. Use v3 if you want flowing text.

Long-tail genera — the roughly 100 with fewer than 100 training samples each — score noticeably lower than the 30 most-common ones. Per-genus precision and recall live in the GitHub model card.

There is no uncertainty score in the standard output. If you need confidence values, pull logprobs from llama-server.
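One way to fold those logprobs into a single number: ask llama-server for per-token probabilities and reduce them to a geometric mean. This is a sketch under assumptions, not the model's API: the `n_probs` request field follows llama.cpp's classic `/completion` endpoint and its exact name and response shape vary between versions, and the geometric mean is just one crude choice of sequence score.

```python
import json
import math

def completion_payload(prompt, n_probs=5):
    """Build a llama-server /completion request body that asks for
    token probabilities (via the n_probs field; name may vary by version)."""
    return json.dumps({"prompt": prompt, "temperature": 0, "n_probs": n_probs})

def sequence_confidence(token_probs):
    """Geometric mean of per-token probabilities: a crude 0..1
    confidence score for the whole one-line answer."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

# Mocked per-token probabilities extracted from a server response.
conf = sequence_confidence([0.98, 0.91, 0.99, 0.87])
```

A pipeline could then threshold `conf` to decide which identifications go straight to the database and which get flagged for human review.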

Built with

Three pieces did the heavy lifting:

  • Unsloth for fine-tuning. FastVisionModel with 4-bit QLoRA on a single RTX 3090 Ti. Roughly a 2× speedup and half the VRAM compared to vanilla Transformers, which is what made this trainable on consumer hardware in the first place.
  • llama.cpp + mtmd for inference. The reason this fits on a phone.
  • Gemma 4 E2B-it as the base. Apache 2.0, multimodal out of the box, small enough to ship.

Links

Apache 2.0. Built for the Kaggle Gemma 4 Good Hackathon 2026, Health & Sciences track.

Serghei Brinza · Vienna, Austria · 2026