Why Most Mobile Vendors Cannot Deliver On-Device AI: What US Enterprise Buyers Need to Know in 2026
Fewer than 5% of mobile agencies have shipped on-device AI in production. Here is why, and what to ask before you commit.
Fewer than 5% of mobile development agencies have shipped a production on-device AI app with text inference. This is not a guess. It is the outcome of asking hundreds of vendors a simple question: show me the App Store link. The ones who can answer immediately are in the 5%. The ones who cannot are in the 95% who claim capability they do not have.
This matters to you because on-device AI is categorically harder than cloud AI integration. If your vendor has not shipped it, you are paying for their first attempt. This guide explains why the gap exists and what the hard problems look like up close.
Key findings
Fewer than 5% of mobile development agencies have shipped a production on-device AI app with text inference.
Wednesday's Off Grid is one of fewer than 10 open-source on-device AI mobile apps globally with 1,000+ GitHub stars — and the hard problems it solved are not documented anywhere.
The three hard problems (Metal abort() handling, chipset-specific QNN variants, background generation state) are discovered by shipping, not by reading documentation.
A vendor discovering these problems for the first time on your engagement adds 4-8 weeks of unplanned time to your project.
The gap between claiming and shipping
Cloud AI integration is a skill most mobile vendors have. The pattern is well-documented: add an API client, send user input to a cloud endpoint, receive a response, display it. Any competent mobile engineer can do this in a few days. The frameworks are mature, the documentation is thorough, and the error handling is straightforward.
On-device AI is a different discipline. The model runs on the device's hardware. The developer is responsible for memory management, hardware acceleration, model loading and offloading, device compatibility, and the failure modes that emerge when those constraints interact. The failure modes are not in any documentation because they are discovered in production.
When vendors say "we have AI capability," they almost always mean cloud AI integration. They have built features that call AI APIs. That is genuine work and it has value. It is not on-device AI.
The confusion is not always intentional. Some vendors do not understand the distinction themselves. They have shipped AI features, those features work, and when a client asks "can you do on-device AI?" the answer is "yes, we do AI" without recognising that the question was about a different class of problem.
What most vendors actually mean by "AI capable"
When a mobile vendor tells you they are "AI capable" or have "shipped AI features," ask one follow-up question: was the AI inference running on the device or on a remote server?
If the answer is "it called the OpenAI API" or "we used Claude" or "it connected to Google AI" — that is cloud AI. The model ran on someone else's server. The mobile app was a thin client that sent data and received a response.
Cloud AI is not a lesser capability. For many use cases, it is the right architecture. But it shares almost nothing technically with on-device AI development. The skills, the failure modes, the architecture decisions, and the device constraints are completely different.
A vendor with cloud AI experience building their first on-device AI feature will encounter the same problems every first-time on-device team encounters. Those problems will surface mid-project. They will take time to solve. Your timeline will absorb the cost.
The hard problems vendors do not talk about
Three specific problems in on-device AI development are not in any SDK documentation, framework guide, or developer forum. They are discovered by shipping to real users on real devices.
A vendor who has shipped on-device AI can describe these problems specifically and explain how they handled them. A vendor who has not will give a generic answer about device compatibility and memory management without naming the specific failure modes.
Metal abort(): the problem nobody documents
On iOS, the Metal framework handles GPU memory allocation. When an app tries to allocate more GPU memory than the device can provide — which happens when loading a large on-device model on a device with limited RAM — Metal does not return an error code. It calls abort().
abort() terminates the process immediately. It does not throw an exception. It does not return an error. It does not allow application-level code to catch the failure and show a user-facing message. The app simply terminates.
This means that if you load a model that is too large for the available RAM, the app crashes silently with no recoverable error state and no user-visible explanation. From the user's perspective, the app closed. From the developer's perspective, there is no error log entry that explains why.
The only way to handle this correctly is to pre-flight available RAM before loading the model. The app checks device RAM, checks current available RAM (which changes depending on how many other apps are open), checks the model's memory requirement, and makes a go/no-go decision before the model load begins. If the available RAM is insufficient, the app degrades gracefully — showing a lower-capability fallback or an informative message — rather than crashing.
This pre-flight logic is not documented anywhere. It is discovered when an app crashes in testing on a 4GB iPhone and the developer traces the crash through the Metal layer to understand why abort() was called.
Wednesday solved this in Off Grid's iOS implementation. The pre-flight RAM check is part of every on-device AI feature Wednesday now ships.
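The go/no-go decision itself is simple once the inputs are gathered. Below is a minimal sketch of that decision logic, written in Kotlin purely as a platform-neutral illustration; the function name, the safety margin, and the byte figures are illustrative assumptions, not Off Grid's actual values.

```kotlin
// Hypothetical sketch of the pre-flight go/no-go decision before a model load.
// Thresholds and names are illustrative, not taken from Off Grid.

enum class LoadDecision { LOAD_FULL, LOAD_FALLBACK, REFUSE }

fun preflightModelLoad(
    availableRamBytes: Long,
    modelRequirementBytes: Long,
    fallbackRequirementBytes: Long,
    safetyMargin: Double = 1.2  // headroom so the GPU layer is never pushed to abort()
): LoadDecision = when {
    // Enough free RAM (with margin) for the full model: load it.
    availableRamBytes >= (modelRequirementBytes * safetyMargin).toLong() ->
        LoadDecision.LOAD_FULL
    // Not enough for the full model, but enough for a smaller fallback model.
    availableRamBytes >= (fallbackRequirementBytes * safetyMargin).toLong() ->
        LoadDecision.LOAD_FALLBACK
    // Neither fits: refuse the load and show an informative message instead of crashing.
    else -> LoadDecision.REFUSE
}
```

On a 4GB device with roughly 1.5GB free, a 2GB model fails the first check but a 900MB fallback passes the second, so the app degrades gracefully instead of terminating. The hard part in practice is the gathering, not the arithmetic: available RAM must be re-read immediately before the load, because it changes as other apps come and go.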
QNN chipset variants on Android
Android devices run on chipsets from Qualcomm, Samsung (Exynos), Google (Tensor), and MediaTek. Each chipset manufacturer has a different NPU (Neural Processing Unit) architecture and different SDK for hardware-accelerated AI inference.
Qualcomm's SDK is QNN (Qualcomm Neural Networks). QNN provides access to the Snapdragon NPU, which is the fastest path for on-device AI inference on Android flagship devices. But QNN requires a compiled model artifact specific to each Snapdragon generation. The Snapdragon 8 Gen 1, 8 Gen 2, and 8 Gen 3 each require a different compiled binary. An app that ships one QNN artifact will only accelerate inference on one generation of Snapdragon chips.
The practical implication: an Android app shipping on-device AI to a wide user base needs a model routing system that detects the device's chipset, selects the appropriate compiled artifact, and falls back to CPU inference for devices without a matching Qualcomm variant. Exynos devices (some Samsung flagships), Tensor devices (Pixel), and MediaTek devices all need CPU inference unless specific artifacts are compiled for their respective NPU SDKs.
CPU inference is 4-8x slower than NPU inference for typical model sizes. An app that runs at full speed on Snapdragon and falls back to CPU inference on Exynos delivers a noticeably different user experience across the Android ecosystem.
Wednesday's Off Grid Android implementation includes chipset detection, model routing across QNN variants, and CPU fallback with UX adaptation to reflect the performance difference. Building this routing system the first time requires trial and error across physical devices — not simulators, not emulators, physical devices from each chipset generation.
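At its core, the routing described above reduces to a lookup from the detected SoC to a compiled artifact, with CPU inference as the default. Here is a hedged Kotlin sketch; the SoC model strings and artifact filenames are illustrative assumptions, and on a real device the SoC identifier would come from a platform API such as `Build.SOC_MODEL` (Android 12+).

```kotlin
// Hypothetical routing table: detected SoC model -> compiled QNN artifact.
// SoC identifiers and artifact names are illustrative, not Off Grid's actual files.

sealed class InferencePath {
    data class Npu(val artifact: String) : InferencePath()  // hardware-accelerated path
    object Cpu : InferencePath()                            // universal fallback
}

private val qnnArtifacts = mapOf(
    "SM8450" to "model-sd8gen1.qnn.bin",  // Snapdragon 8 Gen 1
    "SM8550" to "model-sd8gen2.qnn.bin",  // Snapdragon 8 Gen 2
    "SM8650" to "model-sd8gen3.qnn.bin"   // Snapdragon 8 Gen 3
)

// Returns the NPU path when a matching artifact exists, otherwise CPU fallback.
// Exynos, Tensor, and MediaTek SoCs fall through to Cpu unless artifacts are added.
fun selectInferencePath(socModel: String): InferencePath =
    qnnArtifacts[socModel]?.let { InferencePath.Npu(it) } ?: InferencePath.Cpu
```

The table is the easy half. The expensive half, as noted above, is validating each artifact on physical hardware and adapting the UX for the slower CPU path.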
Background generation state management
Users do not stay on a screen while AI generates a response. They navigate away, switch apps, receive notifications, and return. An on-device AI feature that does not handle this correctly produces one of two failure modes.
The first failure mode: when the user navigates away, the generation stops. The user returns to find the output incomplete. No error, no explanation — the feature simply stopped generating when the screen changed.
The second failure mode: the generation continues, but the result is never delivered because the component that would display it has been unmounted. The AI ran for several seconds (or minutes, for longer generations), consumed battery and RAM, and produced output that went nowhere.
Both failure modes have the same root cause: the generation was tied to the component lifecycle. When the component unmounted, the generation's connection to the UI was severed.
The correct architecture runs inference in a background service that is independent of any component lifecycle. The service maintains a queue. When a generation completes, the service delivers the result via a callback to whatever screen the user is currently on — even if it is a different screen than the one that triggered the generation. If the user is not in the app, the result is stored and delivered when the app resumes.
Building this background service requires understanding both the platform's background execution rules (iOS and Android have different limits on background processing) and the on-device AI framework's threading model. It is not a single engineer's afternoon — it is a week of architecture work the first time.
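The delivery half of that architecture can be sketched as a small buffer-and-callback object that outlives any screen. This is a simplified Kotlin sketch under stated assumptions: the names are hypothetical, and a production service would also own the inference queue, threading, and each platform's background-execution limits.

```kotlin
// Simplified sketch of generation-result delivery decoupled from screen lifecycle.
// Class and method names are illustrative, not Off Grid's actual API.

class GenerationService {
    private val pending = ArrayDeque<String>()        // results finished while no screen was attached
    private var listener: ((String) -> Unit)? = null  // the currently visible screen's callback

    // Called by the inference worker when a generation finishes,
    // regardless of which screen (if any) is showing.
    fun onGenerationComplete(result: String) {
        val current = listener
        if (current != null) current(result) else pending.addLast(result)
    }

    // Called when a screen appears; flushes anything generated while detached,
    // so results produced on other screens (or while backgrounded) still arrive.
    fun attach(newListener: (String) -> Unit) {
        listener = newListener
        while (pending.isNotEmpty()) newListener(pending.removeFirst())
    }

    // Called when a screen disappears. Generation keeps running; only delivery pauses.
    fun detach() {
        listener = null
    }
}
```

The key property is that `onGenerationComplete` never depends on a UI component existing: if nothing is attached, the result is buffered rather than lost, which avoids both failure modes described above.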
Wednesday built this correctly in Off Grid, where users trigger image and text generation and then use other features while the generation runs. The result is delivered to whatever screen they are on when the generation completes.
Off Grid as the verification standard
Off Grid is Wednesday's on-device AI application. It runs complete AI inference — text generation, image generation, voice transcription, and vision analysis — on the device with no cloud connection and no telemetry. It is available on iOS, Android, and macOS.
50,000+ users have downloaded Off Grid. The GitHub page has 1,700+ stars. The code is public and auditable.
Building Off Grid required discovering and solving all three problems described above: Metal abort() handling on 4GB iPhones, chipset-specific QNN variant routing on Android, and background generation state management across both platforms. The solutions are in the app, visible to anyone who audits it.
When Wednesday evaluates a vendor's on-device AI claim, the standard is simple: show the equivalent of Off Grid. A live app, a public record, and specific technical answers to the three questions above. Any vendor who meets this standard has real experience. Any vendor who cannot describe the Metal abort() problem has not shipped on-device AI on iPhone.
Why this matters for your engagement
If your board has mandated AI in the mobile app and you are evaluating vendors, the on-device AI question is the one filter that most quickly segments the field.
95% of vendors who claim AI capability have cloud AI experience. That capability is genuine but irrelevant if your compliance requirements, data residency rules, or connectivity constraints make cloud AI impractical.
The 5% who have shipped on-device AI in production have solved problems that cannot be solved by reading. They bring working solutions to your engagement. Your timeline absorbs their experience, not their learning curve.
Wednesday is in the 5%. Off Grid is the public record.
Wednesday's on-device AI capability is verifiable before you sign. The App Store link and GitHub page exist today.
More guides on evaluating mobile vendor capability are in the writing archive.
About the author
Praveen Kumar
Technical Lead, Wednesday Solutions
Praveen leads on-device AI engineering at Wednesday Solutions and was part of the team that built Off Grid, one of fewer than 10 open-source on-device AI mobile apps globally with 1,000+ GitHub stars.
Four weeks from this call, a Wednesday squad is shipping your mobile app. 30 minutes confirms the team shape and start date.