Qwen3 4B outperforms cloud agents on code tasks—with Mahoraga research [R]

Hey everyone in ML. I've been working on Mahoraga, an open-source orchestrator that routes tasks across local and cloud AI agents using a contextual bandit (LinUCB) that learns from every decision.

Context (feel free to skip): I only started integrating AI into my workflows in late 2025, so I came on the scene broke, with no credits. That left me with local models. Like many students and employees, though, I eventually got credits from my institution to work with (I got Claude, yippee). I wanted to route seamlessly between models when the credits ran out, which is what pushed me to build an orchestrator. I used to treat Claude as a chatbot/complete workflow engine, which made local models hard to swap in because of their context windows, weaker reasoning, etc. Opus 4.5 running open-source "superpowers" ate my entire usage allowance every month.

Now I realize that wasn't an effective way to use Claude, or AI in general: I was throwing it at both heavy planning/brainstorming and minor tasks. But what about code generation specifically? It's a relatively constrained task, with correct answers and short outputs, so surely local models can compete on work that doesn't need the cloud. That's why I switched Mahoraga to an adaptive router.

I ran 192 tasks across 8 agents (4 local Ollama models, 4 cloud CLIs) on a 16GB MacBook Pro, forcing round-robin so every agent got every prompt. Quality is scored by a 4-layer heuristic system (novelty ratio, structural checks, embedding similarity, length ratio). Zero API cost for evaluation, and no LLM-as-judge.

Setup recap:

  • Forced round-robin, no bandit selection
  • 4-layer heuristic quality scoring (sketch below)
  • Hardware: 16GB M-series MacBook Pro (Nov 2024)
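
To make the scoring concrete, here's a minimal sketch of what the four layers look like. The layer weights, the structural checks, and the bag-of-words stand-in for real embeddings are simplified placeholders here, not the production scorer:

```python
# Minimal 4-layer heuristic scorer sketch. Weights, checks, and the
# bag-of-words "embedding" are illustrative placeholders, not the
# actual Mahoraga implementation.
import math
import re
from collections import Counter

def novelty_ratio(prompt: str, output: str) -> float:
    # Layer 1: fraction of output tokens that don't appear in the prompt.
    p = set(prompt.lower().split())
    o = output.lower().split()
    return sum(t not in p for t in o) / max(len(o), 1)

def structural_score(output: str) -> float:
    # Layer 2: cheap structural checks for code-like answers (assumed checks).
    checks = [
        "```" in output or "def " in output,      # contains code
        output.count("(") == output.count(")"),   # balanced parens
        not re.search(r"\bTODO\b", output),       # no placeholders left
    ]
    return sum(checks) / len(checks)

def embedding_similarity(a: str, b: str) -> float:
    # Layer 3: bag-of-words cosine as a stand-in for a real embedding model.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def length_ratio(prompt: str, output: str, target: float = 3.0) -> float:
    # Layer 4: penalize outputs far shorter/longer than a target multiple.
    r = len(output.split()) / max(len(prompt.split()), 1)
    return max(0.0, 1.0 - abs(r - target) / target)

def quality(prompt: str, output: str) -> float:
    # Equal weights are an assumption for this sketch.
    return 0.25 * (novelty_ratio(prompt, output)
                   + structural_score(output)
                   + embedding_similarity(prompt, output)
                   + length_ratio(prompt, output))
```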

Qwen3 4B in no-think mode dominates code and refactor at 33.8 t/s and 6.1s average latency. Cloud agents cluster around 0.650 on code. The local model isn't just cheaper; it's measurably better for this task class.

Other findings:

  • LFM2 hits 77.1 t/s but trades ~5 quality points vs Qwen3 4B
  • DeepSeek-R1 averages 123.5s per task on 16GB; the reasoning overhead makes it unusable as a default
  • Security scores are flat at 0.650 across all agents, which is my own error: the scorer doesn't capture security-specific signals well

The bandit (LinUCB) is the only routing strategy with sublinear regret (β = 0.659) across a 200-task simulation; it actually converges.
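
For reference, a regret exponent like that can be estimated by fitting cumulative regret R_t ≈ c·t^β on a log-log scale and checking β < 1. The synthetic regret curve below is illustrative only, not the actual simulation data:

```python
# Estimate a regret exponent beta via a log-log fit of cumulative regret.
# The regret curve here is synthetic; beta < 1 means sublinear regret.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1, 201)                               # 200-task horizon
per_step = 1.0 / np.sqrt(t) + rng.normal(0, 0.02, t.size).clip(0)
regret = np.cumsum(per_step)                        # cumulative regret R_t

beta, log_c = np.polyfit(np.log(t), np.log(regret), 1)
print(f"beta = {beta:.3f}  (beta < 1 => sublinear, i.e. the policy converges)")
```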

The routing works in two stages: a keyword classifier puts the task in a capability bucket (code, plan, research, etc.), then the bandit picks the best agent within that bucket. Under the hood: a 9-dimensional context vector, persistent state across sessions, and a warm start from the compatibility matrix.
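
A stripped-down sketch of that two-stage loop (the bucket keywords, agent rosters, and α value below are placeholders, not the real config):

```python
# Two-stage routing sketch: keyword bucket -> LinUCB over agents in bucket.
# Keywords, agent names, and alpha are placeholder assumptions.
import numpy as np

class LinUCBArm:
    def __init__(self, dim: int = 9, alpha: float = 1.0):
        self.A = np.eye(dim)      # ridge Gram matrix (identity warm start)
        self.b = np.zeros(dim)    # reward-weighted feature sum
        self.alpha = alpha        # exploration strength

    def ucb(self, x: np.ndarray) -> float:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)
        self.b += reward * x

BUCKETS = {"code": ["qwen3-4b", "lfm2", "claude-cli"],   # assumed rosters
           "plan": ["claude-cli", "gemini-cli"]}
KEYWORDS = {"refactor": "code", "implement": "code", "roadmap": "plan"}

arms = {a: LinUCBArm() for agents in BUCKETS.values() for a in agents}

def classify(task: str) -> str:
    # Stage 1: crude keyword bucket (default bucket is an assumption).
    return next((b for kw, b in KEYWORDS.items() if kw in task.lower()), "code")

def route(task: str, context: np.ndarray) -> str:
    # Stage 2: LinUCB picks the highest-UCB agent within the bucket.
    bucket = classify(task)
    return max(BUCKETS[bucket], key=lambda a: arms[a].ucb(context))
```

After the chosen agent runs, the observed quality score feeds back through `update()`, which is how the router learns from every decision.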

All local inference, all free. Cloud escalation exists but only fires on retry. Why pay for cloud when a local model handles it better?
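
The escalation policy is roughly this (the threshold and runner functions are hypothetical stand-ins):

```python
# Local-first execution with cloud escalation only on retry.
# run_local, run_cloud, score, and the threshold are stand-ins.
def execute(task, run_local, run_cloud, score, threshold=0.65):
    output = run_local(task)
    if score(task, output) >= threshold:
        return output             # local answer was good enough: $0 spent
    return run_cloud(task)        # escalate only after a failed local attempt
```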

Looking for feedback of any kind; feel free to be critical. I appreciate everyone who engages with this subreddit, and I'll keep working on this.

A star would be appreciated: github.com/pockanoodles/Mahoraga

submitted by /u/Own-Professional3092

Tagged with

#Mahoraga
#Qwen3 4B
#code generation
#local models
#cloud agents
#LinUCB