A developer going by the handle Anemll posted a video on Sunday that lit up Hacker News with over 650 upvotes and nearly 300 comments: a 400 billion parameter AI model running locally on an iPhone 17 Pro. No cloud. No Wi-Fi required. Just a phone in someone's hand generating text from one of the largest openly available language models in existence.
I saw the post at around 1 AM, and my immediate reaction was the same as most people's — that cannot possibly be real. Then I read the technical details and realized it was real, but the headline tells maybe 40 percent of the actual story.
What Actually Happened
The model in question is Qwen3.5-397B-A17B, which is a Mixture of Experts (MoE) architecture. And this distinction matters enormously.
A dense 400B model — like the kind OpenAI and Anthropic run — uses all 400 billion parameters for every single token it generates. That would require roughly 800GB of memory at half precision. Your iPhone has 8GB of RAM. The math doesn't math.
But MoE models are different. Qwen3.5-397B-A17B has 397 billion total parameters, but only 17 billion are active at any given time. The model is split into dozens of "experts," and a routing mechanism selects which experts to activate for each token. Think of it as a company with 400 employees where only 17 show up to handle any particular task.
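The routing step can be sketched in a few lines. Everything here — the sizes, the random gating matrix, the `route_tokens` helper — is illustrative, not Qwen's actual router, which uses learned weights:

```python
import numpy as np

def route_tokens(token_hidden, num_experts=64, top_k=2, seed=0):
    """Toy top-k MoE router: score every expert, keep only the best k.

    Real MoE routers use a trained linear gate; this one is random,
    purely to show the mechanism of selecting a small active subset.
    """
    rng = np.random.default_rng(seed)
    gate = rng.standard_normal((token_hidden.shape[-1], num_experts))
    scores = token_hidden @ gate                  # one score per expert
    top = np.argsort(scores)[-top_k:]             # indices of chosen experts
    exp_top = np.exp(scores[top])
    weights = exp_top / exp_top.sum()             # softmax over the chosen k
    return top, weights

hidden = np.random.default_rng(1).standard_normal(128)
experts, weights = route_tokens(hidden)
# Only `top_k` experts run for this token; the rest stay idle.
```

The key point the analogy captures: the cost per token scales with the experts that fire, not with the total parameter count.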
One commenter on the Hacker News thread put it bluntly: "You should mention that only 17B parameters are active at any point in time, not 400B, because this is a mixture-of-experts model, not a dense model." And they're right. The headline is technically accurate but practically misleading.
How Flash-MoE Makes It Work
The secret sauce is a custom inference engine called Flash-MoE, built by Anemll and collaborators. We covered the technical details in our full Flash-MoE breakdown on MacBook. It uses what they call "SSD streaming to GPU" — essentially treating the iPhone's flash storage as extended memory.
Here's the simplified version: the full model weights live on the phone's SSD (1TB on the iPhone 17 Pro). When a particular expert is needed, Flash-MoE streams just those weights into the GPU's active memory, runs the computation, then swaps them out. It's the AI equivalent of swapping to disk, a technique that desktop operating systems have used since the 1960s but that nobody thought would work for real-time neural network inference.
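The streaming idea can be sketched with a memory-mapped weights file. The layout, sizes, and `load_expert` helper below are all hypothetical — the real Flash-MoE pipeline targets Metal and the Neural Engine, not NumPy — but the principle of materializing only the experts the router picked is the same:

```python
import mmap
import os
import numpy as np

# Hypothetical layout: all expert weights concatenated in one file,
# each expert a fixed-size float16 block. Tiny sizes for the demo;
# real experts are gigabytes each.
EXPERT_DIM = 1024
BYTES_PER_EXPERT = EXPERT_DIM * 2  # float16 = 2 bytes per element

def load_expert(mm, idx):
    """Slice one expert's weights out of the memory-mapped file."""
    start = idx * BYTES_PER_EXPERT
    buf = mm[start:start + BYTES_PER_EXPERT]  # copies only this slice
    return np.frombuffer(buf, dtype=np.float16)

# Build a small demo file with 4 "experts" so the sketch is runnable.
path = "experts.bin"
with open(path, "wb") as f:
    f.write(np.arange(4 * EXPERT_DIM, dtype=np.float16).tobytes())

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    w = load_expert(mm, 2)  # stream only expert 2 into memory
    mm.close()
os.remove(path)
```

The OS page cache does most of the heavy lifting here, which is exactly why this looks so much like classic swapping.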
The performance? About 0.6 tokens per second. To put that in perspective, ChatGPT generates around 50-80 tokens per second. So this is roughly 100x slower. At 0.6 t/s, generating a single paragraph would take about two minutes.
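The two-minute figure falls straight out of the throughput; assuming a typical paragraph of roughly 75 tokens:

```python
def generation_time(tokens, tokens_per_second):
    """Seconds to generate `tokens` at a given throughput."""
    return tokens / tokens_per_second

iphone = generation_time(75, 0.6)  # 125.0 s — about two minutes
cloud = generation_time(75, 65)    # ~1.2 s at a mid-range cloud speed
```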
Last Tuesday around 2 AM, I told a colleague about this, and his response was: "So it's like having a genius friend who answers every question correctly but takes a literal nap between each word." And yeah, that's pretty much it.
The M4 Max Comparison
Anemll mentioned in follow-up posts that the same model runs at 12-15 tokens per second on an M4 Max MacBook Pro. That's roughly 20-25x faster than the iPhone — and actually usable for real work. The iPhone demo is more proof-of-concept than practical tool.
Why This Actually Matters (Beyond the Cool Factor)
Let me be honest: running a 400B model at 0.6 tokens per second on a phone is not useful today. If you tried to have a conversation with it, you'd lose patience before it finished its first sentence. But dismissing this as a party trick misses the bigger picture entirely.
The Privacy Angle
Every major AI assistant today sends your data to cloud servers. Ask Claude about your medical symptoms? Your query hits Anthropic's servers. Ask Siri to summarize a confidential document? It goes to Apple's servers (or OpenAI's, depending on the task).
Local inference means none of that data ever leaves your device. Tools like KittenTTS already prove that production-quality AI can run on consumer hardware, and this trend is only accelerating. For healthcare workers handling patient data, lawyers reviewing privileged communications, or journalists protecting sources, this isn't a nice-to-have — it's a fundamental requirement.
I had a doctor friend tell me she can't use any AI tools at work because of HIPAA compliance. The moment a capable model runs locally on her iPad, that changes overnight.
The Offline Angle
Think about every situation where you don't have internet: airplanes, remote areas, underground facilities, or during the increasingly common cloud service outages. A phone that can run a competent AI model without any connectivity opens up use cases that cloud-dependent AI literally cannot touch.
Moore's Law Hasn't Stopped
The iPhone 15 Pro (two years ago) couldn't have done this at all. The iPhone 17 Pro does it at 0.6 t/s. If we assume an optimistic 3-4x improvement per phone generation through better neural engines and more unified memory, we're looking at:
| Generation | Est. Speed | Usability |
|-----------|-----------|-----------|
| iPhone 17 Pro (2025) | 0.6 t/s | Proof of concept |
| iPhone 18 Pro (2026) | 2-3 t/s | Barely usable |
| iPhone 19 Pro (2027) | 8-12 t/s | Actually usable |
| iPhone 20 Pro (2028) | 25-40 t/s | Near-cloud speed |
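The table above is just compound growth on the 3-4x assumption (its ranges are hand-rounded, so they're a bit looser than a strict compound). A few lines make the extrapolation explicit — and make clear how sensitive it is to the assumed multiplier:

```python
def project_speed(base_tps, growth_per_gen, generations):
    """Compound a per-generation speedup from a base throughput."""
    return base_tps * growth_per_gen ** generations

# Assumed 3-4x per generation, starting from 0.6 t/s in 2025.
for gen, year in enumerate(range(2025, 2029)):
    low = project_speed(0.6, 3, gen)
    high = project_speed(0.6, 4, gen)
    print(f"{year}: {low:.1f}-{high:.1f} t/s")
# 2025: 0.6-0.6 t/s
# 2026: 1.8-2.4 t/s
# 2027: 5.4-9.6 t/s
# 2028: 16.2-38.4 t/s
```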
By 2028, your phone might generate text nearly as fast as today's cloud services. That's three years. Three iPhone upgrades. And the model architectures will get more efficient too — MoE was the first step, but techniques like speculative decoding, quantization improvements, layer duplication tricks, and hardware-aware model design will compound on top of the raw silicon improvements.
Three Recurring Misconceptions
Reading through the nearly 300 Hacker News comments, I noticed three recurring misconceptions worth addressing.
"This Is Just Marketing for the iPhone 17"
Anemll isn't an Apple employee. Flash-MoE is an open-source project on GitHub. Apple didn't sponsor this, promote this, or (as far as anyone can tell) even know about it until it went viral. This is independent research that happens to run on Apple hardware because Apple's Neural Engine and unified memory architecture are currently the best mobile platform for this kind of workload.
"0.6 Tokens Per Second Is Worthless"
For interactive chat? Yes, it's painfully slow. But there are non-interactive use cases where speed barely matters. Background document summarization while you sleep. Overnight batch processing of emails for priority scoring. Slow but accurate medical image analysis where the alternative is waiting days for a radiologist. Not every AI task needs to happen in real-time.
"Just Use the Cloud, It Will Always Be Faster"
This one bugs me. The cloud will probably always be faster — but speed isn't the only axis that matters. Privacy, reliability, cost, and ownership all factor in. Running your own model means no monthly subscription, no rate limits, no terms of service that change overnight, and no company deciding to discontinue the model you depend on.
What Comes Next
Apple's Response
Apple has been relatively quiet about on-device AI since their initial Apple Intelligence announcements. But the Flash-MoE demo proves that their hardware is significantly more capable than what their own software utilizes. I'd bet money that WWDC 2026 includes some form of larger on-device model support, possibly using similar streaming-from-SSD techniques.
The Android Side
One commenter asked: "I want to try this on my Xiaomi. How do I do it?" The short answer is: you can't, not yet. Android devices use a fragmented hardware ecosystem — different GPUs (Adreno, Mali, Xclipse), different neural processors, different memory architectures. Flash-MoE is built specifically for Apple's Metal API and Neural Engine. An Android port would essentially need to be rewritten from scratch for each chipset family.
Qualcomm's Snapdragon 8 Elite has impressive AI benchmarks, but the unified memory advantage that Apple silicon provides is hard to replicate in the Android world where RAM and GPU memory are typically separate pools.
The Open Source Ecosystem
Flash-MoE is open source (MIT license), and the code is already on GitHub. Several forks are appearing, including attempts to optimize for specific use cases like coding assistance and document analysis. I expect we'll see specialized models — smaller, faster, task-specific — that run at actually usable speeds on current hardware within the next six months.
The Honest Assessment
Is running a 400B model on an iPhone impressive? Absolutely — it's a genuine technical achievement that pushes the boundaries of what mobile hardware can do.
Is it useful right now? No. Not for most people, not for most tasks.
Will it be useful in 2-3 years? Almost certainly yes, and this demo is the clearest signal we've had that the future of AI is not exclusively cloud-based.
The real headline isn't "iPhone runs 400B model." It's "the gap between cloud AI and local AI just got measurably smaller, and it's closing faster than anyone predicted." That's the story worth paying attention to.
And look, I know this is an AI news site and we're supposed to hype everything. But I spent three hours reading through the technical details and the Hacker News comments, and the responsible take is this: it's a milestone, not a revolution. The revolution comes when your phone can do this at 30 tokens per second. We're not there yet. But Sunday proved we're heading there faster than the skeptics thought.