```
ollama run fredrezones55/Gemopus-4-it:e4b
```
Full Ollama-native audio and vision support, plus two modality-trimmed variants:
| SND (sound-only) | SCN (vision-only) |
|---|---|
| This variant can only hear: feeding it an image will crash Ollama, but stripping the vision tensors trims memory use for audio-only workloads. | This variant can only see: feeding it audio will crash Ollama, but stripping the audio tensors trims memory use for vision-only workloads. |
Space savings with E4B: the sound-only variant saves 5.24% of storage, and the vision-only variant saves 9.21%. No quality loss is observed, because only the tensors that handle the removed modality are stripped; the core text model is left untouched.
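To make the split concrete, below is a minimal sketch using the official `ollama` Python client (`pip install ollama`). Only the vision path is shown, since image input is the client's documented multimodal route; the `:e4b-scn` tag is a placeholder I made up for the vision-only variant, so check the model page for the real tags.

```python
# Minimal sketch with the official `ollama` Python client.
# The ":e4b-scn" tag is a placeholder -- substitute the actual tag
# of the vision-only variant from the model page.
import ollama

# Full model: text, vision, and audio tensors are all present.
full = ollama.chat(
    model="fredrezones55/Gemopus-4-it:e4b",
    messages=[{
        "role": "user",
        "content": "Describe this image in one sentence.",
        "images": ["photo.jpg"],  # vision input as a local file path
    }],
)
print(full["message"]["content"])

# Vision-only variant (placeholder tag): fine for images, but per the
# table above, audio input would crash because those tensors are stripped.
scn = ollama.chat(
    model="fredrezones55/Gemopus-4-it:e4b-scn",
    messages=[{
        "role": "user",
        "content": "What objects are in this picture?",
        "images": ["photo.jpg"],
    }],
)
print(scn["message"]["content"])
```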
I still remember the days of running the Llama 3.1 8B Instruct model on my MacBook Air M1. Back then, I could hardly imagine that in just two years, a model with reasoning capabilities comparable to the GPT-4 of that era would be running locally on my phone. Currently, Edge AI is experiencing a paradigm shift, transitioning from the cloud down to local environments. Tech giants are embedding AI capabilities deep into the bedrock of operating systems with unprecedented determination. Without a doubt, this form of local AI, which combines ultra-low latency with absolute privacy, represents the standard paradigm for future end-user devices.
> [!NOTE]
> Following this trend, I created 🪐 Gemopus-4-E4B-it. This is an instruction-tuned model derived from the deep fine-tuning of the latest edge computing large model, Gemma-4-E4B-it.
My core vision is to break down the barriers of expensive GPU computing power, allowing every user with an ordinary iPhone, tablet, or thin-and-light Mac (such as a MacBook Air or MacBook Neo) to fluently run their own powerful AI assistant locally, eliminating the risk of data privacy leaks. By offloading high-frequency basic reasoning tasks (such as text translation, rewriting, summarization, error correction, short text generation, and simple Q&A) to edge devices, especially since these requests often involve exactly the personal data most in need of protection, we not only significantly reduce the cost of cloud API calls but also fundamentally safeguard sensitive personal data.
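To ground this, here is a minimal sketch of such an offloaded task running entirely on-device through the `ollama` Python client; the helper function is mine for illustration, not part of any shipped API.

```python
# Illustrative only: a routine, privacy-sensitive task (summarization)
# handled locally, so the text never leaves the device.
import ollama

def summarize_locally(text: str) -> str:
    """Summarize text with the local model; no cloud API call is made."""
    resp = ollama.chat(
        model="fredrezones55/Gemopus-4-it:e4b",
        messages=[
            {"role": "system",
             "content": "Summarize the user's text in three sentences."},
            {"role": "user", "content": text},
        ],
    )
    return resp["message"]["content"]

with open("private_notes.txt") as f:  # e.g. personal notes that should stay local
    print(summarize_locally(f.read()))
```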
Admittedly, although the official Gemma-4-E4B-it possesses an excellent reasoning foundation, its native instruction-alignment strategy also introduces a handful of localized drawbacks that can be highly frustrating during daily interactions on edge devices.
It is precisely because I do not want a local machine that merely recites “Wikipedia” stiffly or acts like a cold instruction manual every day that I decided on a complete “personality remodeling” and alignment fine-tuning for it.
Currently, the full-modal Gemma-4-E4B-it stands as the optimal choice for an edge instruction model. Empowered by Apple Silicon and its high-speed unified memory architecture, models of this scale exhibit staggering inference performance on edge devices: on the latest iPhone 17 Pro Max, native inference speed steadily reaches 45 ~ 60 tokens/s, while on everyday thin-and-light laptops like the MacBook Air (M3/M4), paired with local frameworks like MLX, it can easily sustain a blazing-fast 90 ~ 120 tokens/s, truly delivering instantaneous answers that break the shackles of network dependency.
⚠️ Note: The above performance figures are based on publicly available online benchmarks and community reports. Actual results may vary depending on hardware configuration, runtime environment, and model version—please refer to real-world testing for accurate performance.
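If you want to verify throughput on your own Mac, the rough sketch below uses the `mlx-lm` package; the model path is a placeholder for whatever MLX conversion you actually have locally, and the measurement is deliberately crude (it folds prompt processing into the decode time).

```python
# Rough throughput check with mlx-lm on Apple Silicon (pip install mlx-lm).
# The model path is a placeholder -- point it at a local MLX conversion.
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/your-local-model")  # placeholder

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain unified memory in two sentences."}],
    add_generation_prompt=True,
    tokenize=False,
)

start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Crude estimate: counts only generated tokens, but the timer also
# includes prompt processing, so real decode speed is slightly higher.
n_tokens = len(tokenizer.encode(text))
print(f"~{n_tokens / elapsed:.1f} tokens/s")
```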
However, to transform this cold “hardware speed” into an interaction warmth that end-users can genuinely perceive, Gemopus-4-E4B-it underwent further deep Human Preference Alignment atop this highly efficient base.
I focused on achieving leaps in the user experience across the following three dimensions:
⏳ The current version is still at an early training and evaluation stage. Scores on mainstream benchmarks (such as MMLU) are being compiled; specific numbers will be provided in subsequent version iterations.
🚧 I’ll be updating the fine-tuning code for this model very soon—please stay tuned!
👉 GitHub Repository: Jackrong-llm-finetuning-guide Visit the repo to dive into the codebase and reproduce the results locally or on Colab.
🔗 Qwopus3.5-27b Complete Fine-Tuning Guide (PDF)
* The Full Pipeline: a step-by-step walkthrough, from downloading the base model and unifying heterogeneous data to configuring trainer hyperparameters and publishing to Hugging Face.
* Beginner Friendly: includes an introductory guide to getting started with Google Colab and Unsloth.
* Feedback welcome! If you spot any areas for improvement, please let me know and I will update it promptly.
A Note: My goal isn’t just to detail a workflow, but to demystify LLM training. Beyond the social media hype, fine-tuning isn’t an unattainable ritual—often, all you need is a Google account, a standard laptop, and relentless curiosity.
No one starts as an expert, but every expert was once brave enough to begin.
All training and testing for this project were self-funded. If you find this model or guide helpful, a Star ⭐️ on GitHub would be the greatest encouragement. Thank you! 🙏
This model adopts a high-standard SFT pipeline with the same specifications as large instruction reasoning models:
```
Base Model (Gemma-4-E4B-it)
            │
            ▼
Supervised Fine-Tuning (SFT) + Human Preference Alignment
            │
            ▼
      Gemopus-4-E4B-it
```
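Until the fine-tuning code lands in the repository, here is a condensed, illustrative sketch of what the SFT stage can look like with Unsloth and TRL. The base-model ID, dataset file, and hyperparameters are placeholders, not the exact recipe behind Gemopus-4-E4B-it.

```python
# Illustrative SFT skeleton with Unsloth + TRL -- not the project's
# actual training script. IDs, paths, and hyperparameters are examples.
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-e4b-it",  # placeholder base-model ID
    max_seq_length=4096,
    load_in_4bit=True,  # 4-bit base so training fits a free Colab GPU
)

# Attach LoRA adapters so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Rows in {"messages": [...]} chat format; recent TRL versions apply
# the tokenizer's chat template to this schema automatically.
dataset = load_dataset("json", data_files="preference_sft.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```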
The fine-tuning process relies heavily on a meticulously constructed, high-quality human-preference instruction dataset. It mixes cleaned, high-quality instruction pairs from the open-source community with a large injection of natural dialogues, interactive exchanges, and challenging deep-analysis samples, ensuring the model maintains a high level of helpfulness and a human touch when deployed on edge devices.
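As an illustration of the kind of normalization involved (not the project's actual cleaning code), the sketch below maps one open-source instruction set into a single chat schema and mixes it, producing the `preference_sft.jsonl` file consumed in the SFT sketch above.

```python
# Illustrative only: unify heterogeneous instruction sources into one
# chat schema before mixing. Dataset names and paths are examples.
from datasets import load_dataset, concatenate_datasets

def to_messages(example):
    # Normalize an (instruction, response) pair into chat-format messages.
    return {"messages": [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]}

alpaca = load_dataset("tatsu-lab/alpaca", split="train")
parts = [alpaca.map(to_messages, remove_columns=alpaca.column_names)]
# ...map further cleaned open-source sets, plus the injected natural-dialogue
# and deep-analysis samples described above, into the same schema...

mixed = concatenate_datasets(parts).shuffle(seed=42)
mixed.to_json("preference_sft.jsonl")  # consumed by the SFT sketch above
```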
Special thanks to the fellow developers in the open-source community who provided powerful computing resources and base ecosystem support. In particular, thanks to the Unsloth team for providing excellent tools for the efficient fine-tuning of large models, and to Google for open-sourcing the excellent Gemma 4 series base models.