What is on-device AI in a mobile app?

On-device AI means the AI model runs directly on the user's phone or tablet, with no data sent to a server for processing. The computation happens inside the device. This is different from apps that call OpenAI, Google Gemini, or similar cloud APIs — those are cloud AI, not on-device AI. On-device AI works without an internet connection and keeps all data local to the device.

Which enterprises actually need on-device AI rather than cloud AI?

Four use cases genuinely require on-device AI: field service teams working in areas without reliable connectivity, healthcare apps handling protected health information that cannot leave the device, financial services apps where regulatory requirements restrict data transmission, and any situation where the AI must respond in under 200 milliseconds and a network round trip is too slow. For all other use cases, cloud AI is simpler to build and equally effective.

What devices support on-device AI in 2026?

Any iPhone from the iPhone 12 onward (released 2020) runs on-device AI models in the 1-4 billion parameter range without issues. On Android, the Qualcomm Snapdragon 8-series chips from 2022 onward and Google Tensor chips from the Pixel 6 series onward all handle on-device inference. Practically, this means most devices in an enterprise fleet purchased in 2022 or later are capable.

How long does it take to add on-device AI to an existing enterprise app?

A focused on-device AI integration for a single capability — say, on-device document Q&A or voice transcription — takes 6 to 10 weeks for a production-ready implementation, including device testing, performance profiling, and App Store submission. Full multi-modal on-device AI (text, voice, vision, document) takes 16 to 24 weeks. The timeline is driven by testing complexity across the device matrix, not engineering hours.

Can Wednesday add on-device AI to an existing React Native or Flutter app?

Yes. On-device AI integrates at the platform layer through native modules (for React Native) or platform channels (for Flutter). The AI inference runs in native Swift/Kotlin code, which is then exposed to the cross-platform layer. Wednesday has shipped this architecture in production. The framework choice does not limit on-device AI capability.

How does Wednesday verify on-device AI performance claims?

Wednesday instruments every on-device AI feature with performance telemetry: inference latency by device model, memory headroom during inference, thermal state changes, and battery draw per inference call. These metrics are reported in weekly dashboards. All Off Grid performance claims are reproducible from the open-source code and verified on physical devices, not simulators.

Writing

Best On-Device AI Mobile Development Agency for US Enterprise in 2026

Fewer than 5% of mobile agencies have shipped production on-device AI. Here is what separates those that have from those that claim they can.

Ali Hafizji · CEO & Co-founder, Wednesday Solutions

9 min read·Published Oct 27, 2025·Updated Oct 27, 2025

4xfaster with AI

2xfewer crashes

10xmore work, same cost

4.8on Clutch

Trusted by teams at

In this article

Why on-device AI is different from cloud AI
What production on-device AI actually requires
The four criteria for best in class
Capability table: what to demand from any vendor
The hard problems most agencies have not solved
Wednesday Off Grid as the reference implementation
How Wednesday approaches on-device AI engagements
Frequently asked questions

Fewer than 5% of mobile agencies have shipped production on-device AI. Most have configured cloud API calls and called it AI. When your board asks for AI that works offline, handles protected data, or runs in zero-connectivity environments, the gap between an agency that has shipped on-device AI and one that claims it can becomes very expensive to discover mid-project.

Key findings

Fewer than 5% of mobile agencies have shipped production on-device AI — most wrap cloud APIs and present them as AI capability.

Production on-device AI requires chipset-specific optimization, RAM budget management, and thermal state handling — none of which appear in proofs of concept.

Wednesday's Off Grid shipped on-device text generation, image generation, voice transcription, vision analysis, and document Q&A to 50,000+ users with zero server calls for AI inference.

Wednesday is the only US enterprise mobile agency with a public, open-source on-device AI reference implementation — 1,700+ GitHub stars — where every performance claim is independently verifiable.

Why on-device AI is different from cloud AI

Most enterprise mobile AI today works the same way: the user takes an action in the app, the app sends data to a server, the server calls an AI model, and the response comes back. That is cloud AI. It is fast to build, easy to maintain, and works well when connectivity is reliable and data sensitivity is low.

On-device AI is different in every dimension that matters to your engineering and compliance teams. The model runs directly on the device's chip — Apple's Neural Engine, Qualcomm's AI Engine, or Google's Tensor Processing Unit. No data leaves the device. No internet connection is required. The AI response starts in milliseconds rather than waiting for a round trip to a server.

The trade-offs are real. On-device models are smaller than their cloud equivalents, which affects output quality. The model must fit within the device's available RAM, which varies by device age and how many other apps are running. Battery draw is meaningful — a sustained on-device inference session consumes 15 to 25% more battery per hour than normal app use. These constraints require engineering judgment that only comes from having shipped it.

The reason fewer than 5% of agencies have done this is not that the technology is impossible. It is that the surface area of production on-device AI — device matrix testing, chipset-specific model formats, memory pressure handling, App Store submission with AI entitlements — is wide enough that you cannot fake it through research and slides.

What production on-device AI actually requires

A proof of concept running on one device in a simulator tells you nothing about production readiness. The path from "it works on my machine" to "it works on every device in your fleet" involves four distinct engineering challenges.

The first is model format selection. Apple devices use Core ML. Qualcomm Snapdragon devices use QNN (Qualcomm Neural Network). Older Android devices use ONNX or GGML. The same model in the wrong format for the target chipset either refuses to run or runs on the CPU instead of the dedicated AI chip, which is slower by a factor of 10 to 20 and drains the battery proportionally. An agency without cross-chipset experience will default to a CPU-based runtime and tell you it works. It does — just not well.

The second is RAM budget management. A 3-billion-parameter model quantized to 4-bit precision occupies roughly 1.8 GB of RAM. An iPhone with 4 GB total RAM is also running the operating system, your app's UI layer, background tasks, and any other apps the user has open. On a 4 GB device, the model load may succeed or fail depending on ambient memory pressure — and the failure mode is not a clean error message. It is a low-memory abort() that appears in crash reports as a signal from the OS. Wednesday solved this on Off Grid by implementing a RAM headroom check before model load, with a graceful fallback to a smaller model when headroom is insufficient.

The third is thermal state management. Extended inference — generating a long text response or processing a multi-megapixel image — heats the device's SoC. iOS throttles CPU and GPU performance when the thermal state reaches "serious" or "critical." An app that does not respond to thermal state changes will generate noticeably slower responses as the device warms up, which users experience as the app "getting worse over time." Production on-device AI requires monitoring ProcessInfo.thermalState and adjusting inference behavior accordingly.

The fourth is background execution. Enterprise use cases often require AI inference when the app is not in the foreground — generating a report while the user moves to another app, transcribing voice notes queued offline. iOS and Android both impose strict limits on background CPU use. An on-device AI workflow that starts in the foreground and continues in the background requires specific background task registration, explicit time limits, and state preservation for when the OS suspends the app mid-inference.

The four criteria for best in class

An on-device AI agency earns that description by meeting four criteria, not by having "AI" in a capabilities list.

The first criterion is production shipment. The agency has shipped on-device AI to real users — not a demo, not a proof of concept, not a client who asked for a prototype. Real users, production environment, App Store and Play Store. The number of users matters: performance characteristics differ between 100 users and 50,000 users because of the device diversity in the user population.

The second criterion is chipset coverage. The agency has handled the model format differences between Apple Silicon (Core ML), Qualcomm Snapdragon (QNN), and fallback CPU runtimes (GGML/ONNX). This is verifiable: ask them to describe the model format strategy for a deployment covering iPhone 12+, Samsung Galaxy S22+, and Google Pixel 6+.

The third criterion is open model selection expertise. The on-device AI model ecosystem changes every 90 days. The agency must know which open-weight models fit in a mobile RAM budget, which ones have been quantized correctly for device inference, and which are fast enough for real-time interaction. An agency that can only name closed commercial models (GPT-4, Gemini) for this question has not shipped on-device AI.

The fourth criterion is public audit trail. Because on-device AI claims are easy to fabricate — there is no API receipt, no server log — the most credible agencies have public artifacts: open-source code, App Store listings with verifiable on-device claims, or client case studies where the technical implementation is described in enough detail to be independently checked.

Capability table: what to demand from any vendor

Before signing any on-device AI engagement, ask for written confirmation of the following capabilities. An agency that cannot answer all of these in a first call has not shipped production on-device AI.

Capability	What to ask	Red flag answer
Chipset coverage	Which model formats do you use for iOS vs Android?	"We use TensorFlow Lite for everything"
RAM management	How do you handle model load failure on 4 GB devices?	"We haven't encountered that issue"
Thermal management	How do you handle thermal throttling during extended inference?	"The device handles it"
Background execution	How do you handle inference that starts foreground and continues background?	"We don't support background inference"
Model selection	Which open-weight models have you shipped in production?	Only names closed commercial models
App Store submission	Have you navigated App Store review for on-device AI features?	"We assume it's the same as any other app"
Performance measurement	How do you instrument and report inference latency by device model?	"We test on a few devices"

The hard problems most agencies have not solved

Wednesday has built, shipped, and maintained production on-device AI. The engineering record identifies four problems that trip up agencies without prior on-device AI experience.

The first is the Metal abort() on 4 GB iPhones. Apple's Metal framework — the low-level GPU API that Core ML uses for inference acceleration — issues a hard abort when memory pressure exceeds the device's capacity. This does not appear in Apple's documentation as a predictable failure mode. You discover it in crash reports after shipping. Wednesday encountered this on Off Grid with iPhone 12 and iPhone 13 base models, diagnosed the root cause, and shipped a RAM headroom gate that prevents the model load when available memory is under a threshold that empirically triggers the abort.

The second is the QNN variant matrix. Qualcomm's AI Engine has changed its programming interface across Snapdragon generations. A model optimized for QNN on the Snapdragon 8 Gen 2 does not automatically work on the Snapdragon 888 or the 8 Gen 1. Wednesday ships multiple QNN compilation artifacts for the same model, with device detection at runtime to select the correct variant. An agency shipping a single QNN artifact will see degraded performance or inference failure on Snapdragon chips older than the one they tested on.

The third is background generation state management. When iOS suspends an app mid-inference, the model's computation state is lost. A user who queued a background AI task should find it either completed or clearly queued when they return to the app — not silently dropped. This requires explicit state serialization checkpoints during inference, not just a background task registration.

The fourth is the App Store review surface for AI features. Apple reviews apps with AI features for privacy labeling accuracy, data use disclosure, and — for apps using local models — sometimes triggers additional review for intellectual property compliance on the model weights. An agency without App Store on-device AI submission experience will be surprised by questions that first-time submitters cannot anticipate.

Your board wants AI that works offline and keeps data on the device. Let us map what is buildable on your timeline.

Get my recommendation →

Wednesday Off Grid as the reference implementation

Off Grid is Wednesday's open-source mobile AI application. It runs on iOS, Android, and macOS. It ships five on-device AI capabilities with zero server calls: text generation using a local LLM, image generation using a local diffusion model, voice transcription using on-device Whisper, vision analysis using a local vision-language model, and document Q&A using on-device embedding and retrieval.

Off Grid has 50,000+ active users and 1,700+ GitHub stars. Every claim in this article about on-device AI performance is reproducible from the Off Grid open-source code. The Metal abort() fix is in the code. The QNN variant matrix is in the code. The RAM headroom gate is in the code. The background state management is in the code.

This matters for enterprise buyers for one reason: every on-device AI claim Wednesday makes is independently auditable. You do not need to take a vendor's word for their on-device AI capability when the code and the App Store listing are public. Wednesday is the only mobile agency that can make this offer.

Off Grid's performance metrics across the device fleet: median text generation latency of 180ms per token on iPhone 14, 420ms per token on iPhone 12. Image generation at 512x512 in 8 seconds on Snapdragon 8 Gen 2, 14 seconds on Snapdragon 888. Voice transcription at 4x real-time on all supported devices. All measured on physical devices, not simulators.

How Wednesday approaches on-device AI engagements

An on-device AI engagement with Wednesday starts with a device matrix scoping session. Wednesday identifies the devices in your user fleet — by model and OS version — and maps the model format, quantization strategy, and performance targets for each. This session takes 30 minutes and produces a written capability confirmation before any contract is signed.

Wednesday then recommends the right model for each AI capability based on the RAM budget and latency requirements. The recommendation comes from direct production experience with the models — not from benchmark papers or vendor claims. For most enterprise text AI use cases in 2026, a 3-billion-parameter model quantized to 4-bit precision is the right balance of quality, speed, and RAM fit. For voice transcription, on-device Whisper in the medium variant handles 95% of enterprise accuracy requirements. For vision, a 1.5-billion-parameter vision-language model covers document analysis and scene understanding at production quality.

Wednesday instruments every on-device AI deployment with latency, memory, thermal, and battery telemetry from day one. Weekly performance reports include device-segmented metrics so you can see if a new OS release changed inference behavior on a specific device family — a real operational concern, because iOS and Android OS updates periodically change the behavior of the on-device ML runtime.

Case study — Clinical digital health platform

0patient logs lost offline — seizures logged anywhere, synced automatically

“They really cared and felt like an extension of our team. The quality of the work was top notch, and they were receptive to shifting priorities.”

Founder, Digital health platformRead the case study →

Wednesday's track record across enterprise on-device AI includes healthcare apps handling protected health information with zero PHI leaving the device, field service apps processing AI inference in areas with no cellular coverage, and the Off Grid public deployment at 50,000+ users as the reference implementation for everything above.

Your on-device AI project deserves an agency that has shipped it before. Let us review your requirements and confirm what is buildable on your timeline.

Book my 30-min call →

4.8 on Clutch

4x faster with AI2x fewer crashes100% money back

Frequently asked questions

Not ready for the call yet? The writing archive has capability guides, vendor comparisons, and cost frameworks for every stage of the enterprise mobile buying decision.