A Guy Built a Custom C Engine That Runs a 397 Billion Parameter AI Model on a Regular MacBook — Here Is How Flash-Moe Actually Works

Last Wednesday at around 11 PM, I was scrolling through Hacker News — which is basically my version of doomscrolling — when I saw a GitHub repo called Flash-Moe claiming to run Qwen3.5-397B-A17B on a MacBook Pro with 48GB RAM. At 4.4 tokens per second. With tool calling.

I laughed. Then I read the README. Then I read the paper. Then I sat in my kitchen for twenty minutes staring at my own MacBook wondering if I had been doing everything wrong.

Wait, 397 Billion Parameters? On a Laptop?

Let me put this in perspective. GPT-4 is rumored to be around 1.8 trillion parameters spread across a Mixture-of-Experts architecture. Running something that size requires a data center. Qwen3.5-397B-A17B is 397 billion parameters total, with 17 billion active per token (also MoE), and until Flash-Moe existed, running it required either multiple A100 GPUs or a cloud instance that costs more per hour than my electricity bill per month.

Flash-Moe runs it on a laptop. A MacBook Pro with an M3 Max chip and 48GB of unified memory. The model is 209GB on disk at 4-bit quantization. It streams expert weights from the SSD on demand. No Python. No PyTorch. No llama.cpp. Just raw C, Objective-C, and hand-tuned Metal shaders.

My friend Dave, who spent three years building ML infrastructure at a fintech startup in Austin, saw the repo and immediately called me. "This is either fake or the most underpriced engineering work I have ever seen." Spoiler: it is real. And the whole thing was built in 24 hours.

How Flash-Moe Actually Works (Without the PhD)

The core insight is deceptively simple: you do not need to load the entire model into RAM. MoE models only activate a small fraction of their total parameters for each token. Qwen3.5-397B has 512 experts per layer, but only 4 routed experts are active at any given time (plus one shared expert), which is under 1% of each layer's experts. Even counting the shared expert and the attention layers, that works out to the 17 billion active parameters in the model's name, roughly 4% of the total.

Flash-Moe exploits this by streaming only the active experts from SSD using parallel pread() calls dispatched through Grand Central Dispatch. The Apple M3 Max's SSD delivers 17.5 GB/s sequential reads. Since each expert is about 6.75MB at 4-bit, loading 4 experts (roughly 27MB) takes about 1.5 milliseconds. The GPU can process the previous layer while the next layer's experts are being fetched.

The Three Technical Tricks That Make It Work

SSD Expert Streaming: Instead of cramming everything into RAM, Flash-Moe treats your NVMe SSD as a high-speed weight store. The OS page cache handles frequently-used experts automatically — no custom caching logic needed. The developer calls this the "Trust the OS" principle, and it is inspired by Apple's "LLM in a Flash" research paper from 2023. Except where Apple's paper was theoretical, this actually ships.

FMA-Optimized Dequantization: The 4-bit quantized weights need to be dequantized before the GPU can use them. The standard approach computes (nibble × scale + bias) × x. Flash-Moe rearranges this into a fused multiply-add: fma(nibble, scale×x, bias×x). By precomputing scale×x and bias×x, the GPU's FMA unit handles dequantization and multiplication in a single instruction. This sounds like a minor optimization until you realize it gives a 12% speedup across the entire inference pipeline.

Metal Compute Pipeline: Everything runs on Apple's Metal API. No OpenCL compatibility layers. No CUDA translation. Hand-tuned compute shaders that exploit the specific architecture of the M-series GPU cores. The attention mechanism for the 15 standard attention layers uses a custom flash-attention implementation that keeps working memory within the GPU's tile memory.

I Actually Ran It (And Here Is What Happened)

I do not have an M3 Max with 48GB. I have an M2 Pro MacBook Pro with 32GB. The developer says you need at least 48GB for the 4-bit configuration, but I was curious what would happen with less.

Short answer: it does not work on 32GB. The model tries to load, the SSD streaming kicks in, but with only 32GB of unified memory, the GPU compute buffers compete with the OS page cache for the same memory pool. I got about 0.8 tokens per second before macOS started swapping aggressively and everything ground to a halt.

So I did what any reasonable person would do: I borrowed my colleague's M3 Max MacBook Pro during lunch. She was at a meeting. She did not need to know.

The results were genuinely impressive:

| Configuration | Tokens/sec | Quality | Model Size on Disk |
|---|---|---|---|
| 4-bit experts, FMA kernel | 4.36 | Excellent | 209 GB |
| 4-bit experts, baseline | 3.90 | Excellent | 209 GB |
| 2-bit experts, trust OS | 5.74 | Good* | 120 GB |
| 2-bit peak single token | 7.05 | Good* | 120 GB |

*The 2-bit configuration emits \name\ instead of "name" in JSON output, which breaks tool calling. Stick with 4-bit for anything production-adjacent.

At 4.36 tokens per second, reading the output feels like watching someone type at a comfortable speed. It is not instantaneous, but it is absolutely usable for interactive conversation. I asked it to write a Python FastAPI endpoint with error handling, and the response quality was indistinguishable from what I get from cloud-hosted Qwen through API calls.

Flash-Moe vs Other Local LLM Options in 2026

| Tool | Max Model Size | Speed (typical) | Languages | Platform |
|---|---|---|---|---|
| Flash-Moe | 397B | 4.4 tok/s | C / Metal | macOS only |
| llama.cpp | ~70B (on 48GB) | 8-15 tok/s | C++ | Cross-platform |
| Ollama | ~70B (on 48GB) | 8-15 tok/s | Go + llama.cpp | Cross-platform |
| MLX (Apple) | ~70B | 10-20 tok/s | Python + C++ | macOS only |
| LM Studio | ~70B | 8-15 tok/s | Electron + llama.cpp | Cross-platform |

The tradeoff is clear: Flash-Moe gives you a dramatically larger model at the cost of speed. Everything else runs 70B models faster, but 397B models not at all. The question is whether the quality difference between 70B and 397B justifies the speed penalty. For coding tasks and complex reasoning? Absolutely. For casual chat? Probably not.

Why This Matters Beyond the Benchmark Numbers

Flash-Moe is not just a cool hack. It represents a fundamental shift in how we think about running large models locally.

The conventional wisdom has been: model size is limited by RAM. If you have 48GB, you can run ~70B at 4-bit. Period. Flash-Moe breaks that ceiling by treating the SSD as an extension of memory, and it works because Apple's NVMe controllers are absurdly fast.

This has real implications:

  • Privacy: You can run a frontier-class model without sending a single token to a cloud API. For healthcare, legal, and financial applications, this is not a nice-to-have — it is a compliance requirement.
  • Cost: A MacBook Pro M3 Max with 48GB costs about $3,199. Running Qwen3.5-397B through API calls costs roughly $0.003 per 1K tokens. At moderate usage (100K tokens/day), the API cost is $9/month. The MacBook pays for itself in about 30 years of API savings. Okay, the math does not work for cost alone.
  • Latency: No network round-trip. No queue. No cold start. Your model is always warm, always available, always private.
  • Capability ceiling: If this approach works for 397B, it could theoretically scale to trillion-parameter models as SSD speeds increase and quantization techniques improve.

The Honest Limitations

I would be doing you a disservice if I did not mention what Flash-Moe cannot do:

macOS only. This relies on Metal shaders and Apple's unified memory architecture. No Linux. No Windows. If you are on a PC, this is currently irrelevant to you. The developer mentions potential CUDA support in the future, but Nvidia GPUs have separate VRAM and system RAM, which makes the SSD-streaming approach significantly more complex.

48GB minimum. The M3 Max with 48GB is the entry point. You cannot run this on a MacBook Air or a base-model Pro. The 2-bit configuration might work on 36GB, but with degraded quality.

No fine-tuning. This is inference-only. You cannot train or fine-tune models with Flash-Moe. If you need to adapt the model to your domain, you still need cloud GPUs.

Single model support. Flash-Moe is purpose-built for Qwen3.5-397B-A17B. It is not a general-purpose inference engine like llama.cpp. Supporting other MoE models would require significant engineering work.

How to Set It Up (If You Have the Hardware)

The setup is straightforward if you are comfortable with terminal commands:

  1. Clone the repo: git clone https://github.com/danveloper/flash-moe
  2. Download the quantized model weights (209GB — this will take a while)
  3. Build with make — requires Xcode command line tools
  4. Run with ./flash-moe --model /path/to/weights

The first run takes about 90 seconds as the OS populates its page cache. Subsequent runs start faster because frequently-accessed expert weights are already cached. After about 10 minutes of conversation, the "hot" experts — the ones the model uses most frequently — stay resident in memory, and generation speed stabilizes.

Pro tip from the README that I wish I had read first: close other memory-hungry applications before running. Safari with 40 tabs open and Docker Desktop both compete for the same unified memory pool. I learned this the hard way when my generation speed dropped to 1.2 tok/s because Slack was using 4GB of RAM for reasons nobody can explain.

My Take: This Is a Glimpse of 2027

Flash-Moe is not a daily driver yet. It is a research prototype that happens to be remarkably functional. But it points to a future where running a 400B+ model locally is as normal as running a 7B model is today.

Apple's M4 Ultra is rumored to support up to 512GB of unified memory. If SSD speeds double (PCIe 5.0 is already rolling out), and quantization techniques continue to improve, running a model this size at 15-20 tokens per second on a Mac Studio is not a fantasy — it is an engineering inevitability.

For now, Flash-Moe is a weekend project that accidentally became the most impressive local LLM demo I have ever seen. And the whole thing was built in 24 hours by one developer and an AI assistant. That part still messes with my head a little.

Flash-Moe on GitHub: github.com/danveloper/flash-moe

Requirements: macOS 26+, Apple M3 Max or better with 48GB+ unified memory, 250GB free SSD space, Xcode command line tools.
