16.3K Downloads Updated 45 minutes ago
ollama run ssfdre38/gemma4-turbo:26b
ab3236d78e85 · 15GB
Base Model: google/gemma-4-e4b-it
License: Apache 2.0
Author: ssfdre38
Performance: 🚀 51% faster than stock gemma4:e4b on CPU
This is an optimized version of Google's Gemma 4 (9B-parameter) model, tuned for high-performance CPU inference on Windows systems. Through careful quantization and parameter optimization, it achieves a 51% speed improvement over the stock gemma4:e4b model while maintaining output quality.
| Feature | Value |
|---|---|
| Context Window | 16,384 tokens |
| Quantization | int4 / Q4_K_M |
| Model Size | 9.6 GB |
| Thread Optimization | 8 threads (configurable) |
| Batch Size | 512 tokens |
| Target Hardware | CPU (Windows optimized) |
Compared to stock gemma4:e4b:
- ✅ 51% faster inference on CPU workloads
- ✅ Same quality output
- ✅ Same model capabilities (chat, tool calling, reasoning)
- ✅ Lower latency for interactive applications
Perfect for:

- 🤖 Local AI assistants (like Ash bot)
- 🔧 Tool calling with the Ollama API
- 💻 CPU-based inference where a GPU is unavailable
- ⚡ Low-latency chat applications
- 🧠 Semantic memory systems
- 🏠 On-premise deployments
```bash
# Pull from the Ollama registry
ollama pull ssfdre38/gemma4-turbo

# Run interactively
ollama run ssfdre38/gemma4-turbo
```
Or build from the Modelfile:

```bash
ollama create gemma4-turbo -f gemma4-turbo.Modelfile
ollama run gemma4-turbo
```
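The Modelfile itself isn't reproduced on this page. As a rough sketch only (the base reference and every value below are assumptions drawn from the tables on this page, not the actual file), it would pin the base model and the defaults listed in the parameter table further down:

```
FROM gemma4:e4b

# Assumed defaults, taken from the parameter table on this page
PARAMETER num_ctx 16384
PARAMETER num_thread 8
PARAMETER num_batch 512
PARAMETER temperature 0.75
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
```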
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "ssfdre38/gemma4-turbo",
  "prompt": "Explain quantum computing in simple terms"
}'
```
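By default, `/api/generate` streams the reply as newline-delimited JSON chunks. To receive a single JSON object instead, set `"stream": false`:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "ssfdre38/gemma4-turbo",
  "prompt": "Explain quantum computing in simple terms",
  "stream": false
}'
```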
```python
import ollama

response = ollama.generate(
    model='ssfdre38/gemma4-turbo',
    prompt='What are the benefits of local AI?'
)
print(response['response'])
```
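For interactive applications, the same client call can stream tokens as they are generated (standard `ollama` client behavior; the prompt is illustrative):

```python
import ollama

# stream=True returns an iterator; each chunk carries the next piece of text
for chunk in ollama.generate(
    model='ssfdre38/gemma4-turbo',
    prompt='What are the benefits of local AI?',
    stream=True,
):
    print(chunk['response'], end='', flush=True)
print()
```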
```python
import ollama

# For reliable tool calling, use a lower temperature
prompt = 'What is the weather in Tokyo?'  # example prompt; substitute your own
response = ollama.generate(
    model='ssfdre38/gemma4-turbo',
    prompt=prompt,
    options={'temperature': 0.2}
)
```
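The snippet above only lowers the temperature; actual tool calls go through the chat endpoint. A minimal sketch, assuming ollama-python 0.4+ (the `get_weather` function and its schema are hypothetical examples, not built-in tools of this model):

```python
import ollama

# Hypothetical tool, defined here purely for illustration
def get_weather(city: str) -> str:
    return f'Sunny in {city}'

response = ollama.chat(
    model='ssfdre38/gemma4-turbo',
    messages=[{'role': 'user', 'content': 'What is the weather in Tokyo?'}],
    tools=[{
        'type': 'function',
        'function': {
            'name': 'get_weather',
            'description': 'Get the current weather for a city',
            'parameters': {
                'type': 'object',
                'properties': {'city': {'type': 'string'}},
                'required': ['city'],
            },
        },
    }],
    options={'temperature': 0.2},
)

# If the model chose to call the tool, execute it with the parsed arguments
for call in response.message.tool_calls or []:
    print(get_weather(**call.function.arguments))
```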
Default parameters (can be overridden):
| Parameter | Default | Purpose |
|---|---|---|
| `temperature` | 0.75 | Creative responses (use 0.2 for tools) |
| `top_p` | 0.9 | Nucleus sampling |
| `top_k` | 40 | Top-K sampling |
| `repeat_penalty` | 1.1 | Reduce repetition |
| `num_ctx` | 16384 | Context window size |
| `num_thread` | 8 | CPU threads |
| `num_batch` | 512 | Batch size |
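Any of these can be overridden per request through the `options` field; the values below are illustrative, not recommendations:

```python
import ollama

response = ollama.generate(
    model='ssfdre38/gemma4-turbo',
    prompt='Summarize the benefits of quantized models.',
    options={
        'num_thread': 4,   # e.g. on a 4-core machine
        'num_ctx': 8192,   # smaller context window to save RAM
        'temperature': 0.2,
    },
)
print(response['response'])
```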
This model uses:

- Quantization: 4-bit integer quantization (Q4_K_M format)
- Optimization: CPU-specific inference paths and threading
- Threading: Multi-threaded token generation (8 threads by default)
- Batch Processing: 512-token batches for efficient throughput
- Memory Efficiency: Reduced memory footprint vs. FP16
Tested on a typical Windows CPU (Intel/AMD):

- Tokens/second: ~51% improvement over gemma4:e4b
- Latency: Significantly reduced response times
- Memory: 9.6 GB RAM usage
- Quality: Equivalent to the base model
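To measure throughput on your own hardware, the generate response includes timing metadata (`eval_count`, and `eval_duration` in nanoseconds), so a rough tokens-per-second figure can be computed like this:

```python
import ollama

response = ollama.generate(
    model='ssfdre38/gemma4-turbo',
    prompt='Write a short paragraph about CPUs.',
)

# eval_count = tokens generated, eval_duration = generation time in nanoseconds
tokens_per_second = response['eval_count'] / response['eval_duration'] * 1e9
print(f'{tokens_per_second:.1f} tokens/s')
```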
This model was built for Ash, a self-hosted Discord bot with long-term community memory, 20 built-in tools, and a fully customizable personality, all running locally via Ollama.
ssfdre38/gemma4-turbo is Ash's default and recommended model:

```bash
git clone https://github.com/ssfdre38/ash-bot.git
cd ash-bot
setup.bat    # Windows
./setup.sh   # Linux / macOS
```
Ash will auto-pull this model on first run if it’s not already installed.
Apache License 2.0 - Same as the base Google Gemma 4 model.
This is a derivative work based on Google’s Gemma 4 (gemma-4-e4b-it). All modifications are released under the same Apache 2.0 license, allowing commercial use, modification, and distribution.
Built with 🦞 by ssfdre38
Optimizing AI for everyone, one model at a time.
The difficult we do immediately. The impossible takes a little longer.