Phi-4-mini vs Gemma 3 vs Qwen3 vs SmolLM3: On-Device SLMs in 2026
A hands-on comparison of the four small language models I tested in production builds during 2026 — benchmarks, memory footprints, licensing traps, and what broke on real phones.
9 articles matching your search.
A hands-on comparison of the four small language models I tested in production builds during 2026 — benchmarks, memory footprints, licensing traps, and what broke on real phones.
A production-tested comparison of vLLM, SGLang, TensorRT-LLM, and Ollama for self-hosted LLM serving in 2026 — throughput, cold-start, cost math, and decision matrix from running a 4-product AI backend on a shared H100.
I tested four production LLM guardrail stacks across six AI products I shipped. Honest comparison of Lakera, NeMo Guardrails, Guardrails AI, and Pillar Security — latency, pricing, and what I actually run in production.
A working engineer's view of the four libraries that actually solve the malformed-JSON problem in production AI: Instructor, BAML, Outlines, and Pydantic AI. Real benchmark numbers from 1.4M monthly LLM calls.
I ran the same LoRA fine-tune of Llama 3.1 8B on four platforms with 12,400 training pairs from our SmartExam product. Real costs, training times, inference latency, and the multi-adapter math that decided which one we shipped.
A developer ran a 400 billion parameter AI model on an iPhone 17 Pro at 0.6 tokens per second. The headline is impressive but the real story is more nuanced.
Flash-Moe is a pure C/Metal inference engine that runs Qwen3.5-397B on a MacBook Pro with 48GB RAM at 4.4 tokens per second by streaming expert weights from SSD. No Python, no frameworks — just raw performance.
A researcher duplicated 3 specific layers in Devstral-24B and boosted logical deduction from 0.22 to 0.76 — no training, no weight changes. Here is how LLM Circuit Finder works and why it matters.
CanIRun.ai is a free web tool that maps AI model hardware requirements against your machine specs. With 762 upvotes on Hacker News, it covers everything from 0.5 GB edge models to 512 GB monsters.