On-device LLM mobile integration in React Native means routing inference calls through native modules that invoke hardware-accelerated runtimes directly: Core ML on iOS and AICore on Android. The JavaScript thread has no direct path to the Apple Neural Engine or Android NPU, so every team building this capability must write native bridge code. This guide gives you the concrete architecture, annotated code, and deployment strategy to ship it.
What Is On-Device LLM Mobile Integration in React Native?
On-device LLM mobile integration means running language model inference on the device's dedicated AI hardware (Neural Engine on iOS, NPU via AICore on Android) through native modules bridged to your React Native JS layer, with no cloud API call required.
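Core ML on iOS exposes hardware-accelerated inference via the Neural Engine, bridgeable through native modules that surface inference results as resolved Promises in JavaScript. Objective-C modules use the RCT_EXPORT_MODULE and RCT_EXPORT_METHOD macros; Swift modules are exposed through RCT_EXTERN_MODULE and RCT_EXTERN_METHOD declared in a small companion Objective-C file.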
Gemini Nano on Android uses the AICore system service, accessible via Kotlin/Java native modules using ReactContextBaseJavaModule, with streaming token output bridgeable to React Native via DeviceEventEmitter and NativeEventEmitter.
Why Does React Native Require Native Modules for On-Device AI?
React Native's JS thread runs on JavaScriptCore (or Hermes), which has no binding to Core ML's MLModel class or Android's AICore service. Hardware-accelerated inference is not a JavaScript API. It is a system-level runtime that requires native code to invoke.
JS-only inference approaches like ONNX.js or TensorFlow.js running inside a WebView context bypass the Neural Engine entirely. Computation falls back to the CPU or, at best, to the GPU through WebGL. In our project experience, inference that completes in under 100ms on the Apple Neural Engine can take 2 to 8 seconds on the CPU for a quantized 2B parameter model. The gap widens with model size.
Apple Neural Engine (ANE): A dedicated matrix-multiplication accelerator built into Apple Silicon and A-series chips. It is only accessible through Core ML's MLModel API. You cannot call it from JavaScript, a WebView, or a React Native background thread without a native module.
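Android NPU / AICore: Google's AICore is a system-level service introduced in Android 14 that manages on-device foundation models including Gemini Nano. This guide reaches it through the Android Generative AI API (shown here as com.google.android.gms.ai; Google has shipped this surface under more than one SDK, so confirm the package and class names against the version you depend on). Like ANE, it requires native Kotlin or Java code to invoke.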
Why Enterprises Choose On-Device Inference
Three categories of enterprise requirement push teams toward on-device inference specifically.
First, data residency. HIPAA and GDPR both create scenarios where sending user-generated text to a cloud API is legally problematic or prohibited. Legal teams at regulated enterprises commonly block cloud LLM API calls for any input that could contain PII. On-device inference eliminates the transmission entirely.
Second, offline-first applications. Field service teams in manufacturing plants, utility workers in remote locations, and retail associates in stores with unreliable Wi-Fi all need inference that works without a network connection. Cloud APIs fail silently in these environments.
Third, cost at scale. A mobile application making 50 inference calls per active user per day at 500 tokens per call generates cloud API costs that become significant above roughly 100,000 monthly active users. On-device inference has zero marginal cost per token after the model is on the device.
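To make the scale argument concrete, here is a rough back-of-envelope calculation in TypeScript. The usage figures come from the paragraph above; the per-token price is a hypothetical placeholder rather than any provider's real rate, and the calculation assumes every monthly active user is active every day, which makes it an upper bound.

```typescript
// Rough monthly cloud token volume and cost for the usage pattern above.
interface UsageProfile {
  callsPerUserPerDay: number;
  tokensPerCall: number;
  activeUsers: number;
  daysPerMonth: number;
}

function monthlyTokens(u: UsageProfile): number {
  return u.callsPerUserPerDay * u.tokensPerCall * u.daysPerMonth * u.activeUsers;
}

function monthlyCostUsd(u: UsageProfile, pricePerMillionTokensUsd: number): number {
  return (monthlyTokens(u) / 1_000_000) * pricePerMillionTokensUsd;
}

const profile: UsageProfile = {
  callsPerUserPerDay: 50,
  tokensPerCall: 500,
  activeUsers: 100_000,
  daysPerMonth: 30, // assumption: every user is active every day (upper bound)
};

// Hypothetical blended rate, for illustration only.
const PRICE_PER_MILLION_TOKENS_USD = 0.5;

console.log(monthlyTokens(profile)); // 75000000000
console.log(monthlyCostUsd(profile, PRICE_PER_MILLION_TOKENS_USD)); // 37500
```

Swapping in your real daily-active ratio and your provider's actual rate turns this into a quick break-even check against the engineering cost of the native bridge.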
Cloud LLM API vs. On-Device LLM: Direct Comparison
| Dimension | Cloud LLM API | On-Device LLM |
|---|---|---|
| Latency | 500ms to 3s (higher end when network is congested or model is large) | 50ms to 500ms (lower end on latest Neural Engine hardware with a warmed model) |
| Cost | Per-token billing | Zero marginal cost |
| Privacy | Data leaves device | Data never leaves device |
| Offline support | None | Full |
| Model size | Unlimited | 1 to 8GB practical limit |
| Model quality ceiling | GPT-4 class | ~7B parameter class (2025) |
| Integration complexity | Low (REST API) | High (native modules required) |
Cloud latency is at the low end when the network is fast and the model is small. It reaches the high end under congestion or with large context windows. On-device latency is at the low end on a Pixel 8 or iPhone 15 Pro with a warmed 2B model. It reaches the high end on older hardware running a 7B quantized model from a cold start.
How to Set Up Core ML Integration on iOS
The iOS integration path has three distinct stages: adding the model to Xcode, writing the Swift native module, and calling it from JavaScript.
Step 1: Add the model to your Xcode project.
Add your .mlpackage or .mlmodel file to the Xcode project navigator. Set target membership to your main app target. Use Core ML Tools (coremltools Python package) to validate the model before adding it:
import coremltools as ct
model = ct.models.MLModel('YourModel.mlpackage')
print(model.get_spec())
For a quantized open-source model, Mistral 7B quantized to 4-bit via Core ML Tools produces a package in the 3.5 to 4GB range. For smaller on-device use cases, Apple's on-device language model APIs (available from iOS 26 via the Foundation Models framework) provide a system-managed model with no bundle size cost. See Core ML On-Device AI iOS Enterprise Guide 2026 for model selection and quantization tradeoffs in detail.
Step 2: Write the Swift native module.
// CoreMLInferenceModule.swift
// Exposing this class to JS also needs a companion Objective-C file declaring
// RCT_EXTERN_MODULE(CoreMLInference, NSObject) and one RCT_EXTERN_METHOD per method.
import Foundation
import CoreML
import React

@objc(CoreMLInference)
class CoreMLInference: NSObject {
  // Nothing here touches UIKit, so the module can initialize off the main queue
  @objc static func requiresMainQueueSetup() -> Bool { return false }

  private var model: MLModel?
  // Serial queue: keeps work off the main thread and serializes access to `model`
  private let queue = DispatchQueue(label: "CoreMLInference.queue", qos: .userInitiated)

  @objc func loadModel(
    _ resolve: @escaping RCTPromiseResolveBlock,
    rejecter reject: @escaping RCTPromiseRejectBlock
  ) {
    queue.async {
      // Xcode compiles the .mlpackage/.mlmodel into .mlmodelc inside the bundle.
      // Reject instead of force-unwrapping so a missing file does not crash the app.
      guard let modelURL = Bundle.main.url(
        forResource: "YourModel",
        withExtension: "mlmodelc"
      ) else {
        reject("MODEL_LOAD_ERROR", "YourModel.mlmodelc not found in bundle", nil)
        return
      }
      do {
        let config = MLModelConfiguration()
        config.computeUnits = .all // Lets Core ML schedule on the Neural Engine
        self.model = try MLModel(contentsOf: modelURL, configuration: config)
        resolve(true)
      } catch {
        reject("MODEL_LOAD_ERROR", error.localizedDescription, error)
      }
    }
  }

  @objc func runInference(
    _ prompt: String,
    resolver resolve: @escaping RCTPromiseResolveBlock,
    rejecter reject: @escaping RCTPromiseRejectBlock
  ) {
    // Never run inference on the main thread
    queue.async {
      guard let model = self.model else {
        reject("MODEL_NOT_LOADED", "Call loadModel() before runInference()", nil)
        return
      }
      do {
        // Input/output keys depend on your specific model's spec
        let input = try MLDictionaryFeatureProvider(
          dictionary: ["prompt": prompt as NSString]
        )
        let output = try model.prediction(from: input)
        let result = output.featureValue(for: "output")?.stringValue ?? ""
        resolve(result)
      } catch {
        // Prediction errors here can be a symptom of memory pressure
        reject("INFERENCE_ERROR", error.localizedDescription, error)
      }
    }
  }
}
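The computeUnits = .all configuration is what permits Core ML to schedule work on the Neural Engine; Core ML still decides per layer whether the Neural Engine, GPU, or CPU runs it. .all is the default, but setting it explicitly documents the intent and protects against a stray override. Setting it to .cpuOnly drops you back to CPU-bound inference.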
Step 3: Call it from JavaScript.
import { NativeModules } from 'react-native';

const { CoreMLInference } = NativeModules;

async function runCoreMLInference(prompt: string): Promise<string> {
  try {
    await CoreMLInference.loadModel();
    const result = await CoreMLInference.runInference(prompt);
    return result;
  } catch (error: any) {
    // Codes match the reject() calls in the Swift module
    if (error.code === 'MODEL_LOAD_ERROR') {
      throw new Error('Model failed to load. Check that the model file is in the bundle (Xcode target membership).');
    }
    if (error.code === 'MODEL_NOT_LOADED') {
      throw new Error('runInference() was called before loadModel() completed.');
    }
    if (error.code === 'INFERENCE_ERROR') {
      throw new Error('Inference failed. Device may be under memory pressure.');
    }
    throw error;
  }
}
Three error conditions require explicit handling: the model file missing from the bundle (check target membership in Xcode), a prediction failure under memory pressure, which surfaces as INFERENCE_ERROR (implement a retry with model reload), and calling runInference before loadModel completes (guard with a loading state flag in your service layer).
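The retry and loading-guard handling described above could be sketched in the service layer roughly as follows. NativeInference and GuardedInference are hypothetical names introduced for illustration; the error code matches the INFERENCE_ERROR rejection in the Swift module, and the single-retry policy is an assumption rather than a recommendation from Core ML.

```typescript
// Hypothetical shape of the native module; mirrors loadModel/runInference above.
interface NativeInference {
  loadModel(): Promise<unknown>;
  runInference(prompt: string): Promise<string>;
}

class GuardedInference {
  private readonly native: NativeInference;
  // The shared promise acts as the loading-state flag:
  // concurrent callers all await the same in-flight load.
  private loading: Promise<unknown> | null = null;

  constructor(native: NativeInference) {
    this.native = native;
  }

  private ensureLoaded(): Promise<unknown> {
    if (this.loading === null) {
      this.loading = this.native.loadModel().catch((err) => {
        this.loading = null; // allow a later attempt after a failed load
        throw err;
      });
    }
    return this.loading;
  }

  async run(prompt: string): Promise<string> {
    await this.ensureLoaded();
    try {
      return await this.native.runInference(prompt);
    } catch (err) {
      const code = (err as { code?: string }).code;
      if (code !== 'INFERENCE_ERROR') throw err;
      // One retry after a forced reload; a second failure goes to the caller.
      this.loading = null;
      await this.ensureLoaded();
      return this.native.runInference(prompt);
    }
  }
}
```

Because the native module is injected, the same class can be exercised in unit tests with a fake that fails once and then succeeds.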
How to Integrate Gemini Nano on Android via AICore and Kotlin
The Android path mirrors the iOS structure but has an additional prerequisite: device eligibility and model availability checks that iOS does not require.
Step 1: Check device eligibility and trigger model download.
Gemini Nano via AICore is available on Pixel 8 and later, Samsung Galaxy S24 series, and a growing list of Android 14+ devices with sufficient NPU capability. As of mid-2025, this is not universal across Android.
// In your ReactContextBaseJavaModule
// The AICore package and class names differ between SDK releases;
// confirm them against the version you depend on.
import com.facebook.react.bridge.Promise
import com.facebook.react.bridge.ReactMethod
import com.google.android.gms.ai.GenerativeModel

@ReactMethod
fun checkAvailability(promise: Promise) {
  val availability = GenerativeModel.checkAvailability(reactApplicationContext)
  when (availability) {
    GenerativeModel.Availability.AVAILABLE -> promise.resolve("available")
    GenerativeModel.Availability.DOWNLOADING -> promise.resolve("downloading")
    GenerativeModel.Availability.NOT_AVAILABLE -> promise.resolve("unavailable")
    else -> promise.resolve("unknown")
  }
}
If the result is DOWNLOADING, the model is still being fetched and is not yet ready for inference. Your JS layer should poll this check and show a progress indicator rather than attempting inference.
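A polling loop for that check might look like the sketch below. waitForModel, PollOptions, and the injected sleep function are hypothetical names, and the interval and attempt limits are choices you would tune; the status strings match what checkAvailability resolves with.

```typescript
type Availability = 'available' | 'downloading' | 'unavailable' | 'unknown';

interface PollOptions {
  intervalMs: number;
  maxAttempts: number;
  // Injected so tests can skip real waiting; defaults to a setTimeout-based sleep.
  sleep?: (ms: number) => Promise<void>;
}

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls `check` until the status is no longer 'downloading' or attempts run out.
async function waitForModel(
  check: () => Promise<Availability>,
  onStatus: (status: Availability) => void,
  options: PollOptions,
): Promise<Availability> {
  const { intervalMs, maxAttempts, sleep = realSleep } = options;
  let status: Availability = 'unknown';
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    status = await check();
    onStatus(status); // drive the progress indicator from here
    if (status !== 'downloading') return status;
    if (attempt < maxAttempts - 1) await sleep(intervalMs);
  }
  return status; // still 'downloading' after maxAttempts
}
```

In the app, `check` would wrap `NativeModules.GeminiNano.checkAvailability()`, and a final status of 'downloading' is the signal to keep the progress state visible rather than start inference.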
Step 2: Write the Kotlin native module with coroutine-based inference.
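// GeminiNanoModule.kt
import com.facebook.react.bridge.Arguments
import com.facebook.react.bridge.Promise
import com.facebook.react.bridge.ReactApplicationContext
import com.facebook.react.bridge.ReactContextBaseJavaModule
import com.facebook.react.bridge.ReactMethod
import com.facebook.react.bridge.WritableMap
import com.facebook.react.modules.core.DeviceEventManagerModule
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.cancel
import kotlinx.coroutines.launch
// GenerativeModel and RequestOptions come from the AICore SDK imported in Step 1

class GeminiNanoModule(reactContext: ReactApplicationContext) :
    ReactContextBaseJavaModule(reactContext) {

  override fun getName() = "GeminiNano"

  // One scope for the whole module so in-flight work is cancelled on teardown
  private val scope = CoroutineScope(SupervisorJob() + Dispatchers.IO)

  private val model by lazy {
    GenerativeModel(
      modelName = "gemini-nano",
      requestOptions = RequestOptions()
    )
  }

  // All stream events share one event name; the payload key tells them apart
  private fun emit(params: WritableMap) {
    reactApplicationContext
      .getJSModule(DeviceEventManagerModule.RCTDeviceEventEmitter::class.java)
      .emit("GeminiNanoToken", params)
  }

  @ReactMethod
  fun runInference(prompt: String, promise: Promise) {
    scope.launch {
      try {
        val response = model.generateContent(prompt)
        promise.resolve(response.text)
      } catch (e: CancellationException) {
        throw e // let coroutine cancellation propagate
      } catch (e: Exception) {
        promise.reject("INFERENCE_ERROR", e.message, e)
      }
    }
  }

  @ReactMethod
  fun streamInference(prompt: String) {
    scope.launch {
      try {
        model.generateContentStream(prompt).collect { chunk ->
          chunk.text?.let { token ->
            emit(Arguments.createMap().apply { putString("token", token) })
          }
        }
        // Signal stream completion
        emit(Arguments.createMap().apply { putBoolean("done", true) })
      } catch (e: CancellationException) {
        throw e // module is shutting down; do not emit into a dead bridge
      } catch (e: Exception) {
        // Never emit a null error: the JS listener branches on its truthiness
        emit(Arguments.createMap().apply { putString("error", e.message ?: "Unknown error") })
      }
    }
  }

  // Required by NativeEventEmitter on the JS side; no-ops are sufficient here
  @ReactMethod
  fun addListener(eventName: String) {}

  @ReactMethod
  fun removeListeners(count: Int) {}

  override fun invalidate() {
    scope.cancel()
    super.invalidate()
  }
}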
Step 3: Subscribe to streaming tokens in JavaScript.
import { NativeModules, NativeEventEmitter } from 'react-native';

const { GeminiNano } = NativeModules;
const geminiEmitter = new NativeEventEmitter(GeminiNano);

// Exported so a shared service layer can reuse it
export function streamGeminiInference(
  prompt: string,
  onToken: (token: string) => void,
  onComplete: () => void,
  onError: (error: string) => void
) {
  const subscription = geminiEmitter.addListener('GeminiNanoToken', (event) => {
    if (event.error) {
      onError(event.error);
      subscription.remove();
    } else if (event.done) {
      onComplete();
      subscription.remove();
    } else if (event.token) {
      onToken(event.token);
    }
  });
  // Subscribe first, then start the stream, so no early token is missed
  GeminiNano.streamInference(prompt);
  return () => subscription.remove(); // cleanup function
}
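The event protocol above (a token, a done flag, or an error on a single event name) can also be modeled as a pure reducer, which keeps UI state handling testable without the native emitter. This is a minimal sketch: StreamEvent, StreamState, and reduceStream are hypothetical names, and the payload keys are the ones the Kotlin module emits.

```typescript
// Payload shape as emitted on 'GeminiNanoToken'.
type StreamEvent = { token?: string; done?: boolean; error?: string };

interface StreamState {
  text: string;
  status: 'streaming' | 'done' | 'error';
  error?: string;
}

const initialStreamState: StreamState = { text: '', status: 'streaming' };

// Same precedence as the listener: error first, then done, then token.
function reduceStream(state: StreamState, event: StreamEvent): StreamState {
  if (state.status !== 'streaming') return state; // ignore events after the stream ended
  if (event.error) return { ...state, status: 'error', error: event.error };
  if (event.done) return { ...state, status: 'done' };
  if (event.token) return { ...state, text: state.text + event.token };
  return state;
}
```

Feeding each emitted event through `reduceStream` gives a single state object to render from, and late events after completion are dropped rather than appended.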
One deployment detail that catches teams off guard: Gemini Nano model download is triggered by the system, not your app. On a freshly factory-reset Pixel 8, the model may not be present for 24 to 48 hours after first boot. Build your availability check into app startup and communicate the "model downloading" state clearly to users rather than silently falling back.
How to Build a Unified TypeScript AI Service Layer for Both Platforms
Wrapping both native modules behind a single TypeScript service class keeps your product code clean and makes the fallback logic testable without native dependencies.
// OnDeviceAIService.ts
import { Platform, NativeModules } from 'react-native';
// Assumes the Android Step 3 helper is exported from this path; adjust to your project
import { streamGeminiInference } from './streamGeminiInference';

interface OnDeviceAIService {
  isAvailable(): Promise<boolean>;
  runInference(prompt: string): Promise<string>;
  streamInference(
    prompt: string,
    onToken: (token: string) => void,
    onComplete: () => void
  ): () => void;
  releaseModel(): Promise<void>;
}

class OnDeviceAIServiceImpl implements OnDeviceAIService {
  private available: boolean | null = null;

  async isAvailable(): Promise<boolean> {
    if (this.available !== null) return this.available;
    if (Platform.OS === 'ios') {
      try {
        await NativeModules.CoreMLInference.loadModel();
        this.available = true;
      } catch {
        this.available = false;
      }
    } else if (Platform.OS === 'android') {
      const status = await NativeModules.GeminiNano.checkAvailability();
      // 'downloading' is transient: report false now, but re-check on the next call
      if (status === 'downloading') return false;
      this.available = status === 'available';
    } else {
      this.available = false;
    }
    return this.available;
  }

  async runInference(prompt: string): Promise<string> {
    const onDevice = await this.isAvailable();
    if (!onDevice) {
      // Cloud fallback: replace with your preferred provider
      return this.cloudFallback(prompt);
    }
    if (Platform.OS === 'ios') {
      return NativeModules.CoreMLInference.runInference(prompt);
    } else {
      return NativeModules.GeminiNano.runInference(prompt);
    }
  }

  private async cloudFallback(prompt: string): Promise<string> {
    // Route to OpenAI, Vertex AI, or your enterprise gateway
    throw new Error('Cloud fallback not configured');
  }

  async releaseModel(): Promise<void> {
    if (Platform.OS === 'ios') {
      await NativeModules.CoreMLInference.releaseModel?.();
      // Forget the cached result so the next call reloads the model
      this.available = null;
    }
    // Android model lifecycle is managed by AICore system service
  }

  streamInference(
    prompt: string,
    onToken: (token: string) => void,
    onComplete: () => void
  ): () => void {
    if (Platform.OS === 'android') {
      return streamGeminiInference(prompt, onToken, onComplete, console.error);
    }
    // iOS streaming requires a similar DeviceEventEmitter pattern
    // or chunked polling depending on your Core ML model's output format.
    // Until then, deliver the full result as one chunk so callers still complete.
    let cancelled = false;
    this.runInference(prompt)
      .then((text) => {
        if (cancelled) return;
        onToken(text);
        onComplete();
      })
      .catch(console.error);
    return () => {
      cancelled = true;
    };
  }
}

export const aiService = new OnDeviceAIServiceImpl();
Layer stack (text diagram):
React Native Product Code
↓
OnDeviceAIService.ts (TypeScript, Platform.OS routing)
↓
NativeModules bridge (React Native bridge / JSI)
↓
CoreMLInference (Swift) | GeminiNanoModule (Kotlin)
↓
Core ML MLModel API | Android AICore / Generative AI API
↓
Apple Neural Engine | Android NPU
Memory lifecycle management is where most teams cut corners. On iOS, subscribe to UIApplication.didReceiveMemoryWarningNotification in your Swift module and set self.model = nil to release the model from memory. Expose a releaseModel() method to JS so your product code can proactively free memory when the AI feature is backgrounded. On Android, AICore manages the model lifecycle at the system level, so your Kotlin module does not need to handle this directly.
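On the JS side, the proactive release could be wired to app-state changes along these lines. This is a sketch under assumptions: createBackgroundRelease and Releasable are hypothetical names, and the state strings mirror React Native's AppState values.

```typescript
type AppStateStatus = 'active' | 'background' | 'inactive' | 'unknown' | 'extension';

interface Releasable {
  releaseModel(): Promise<void>;
}

// Returns a handler for app-state change events.
// Releases only on the transition into 'background', not on every event.
function createBackgroundRelease(
  service: Releasable,
  onError: (err: unknown) => void = () => {},
): (next: AppStateStatus) => void {
  let previous: AppStateStatus = 'active';
  return (next: AppStateStatus): void => {
    if (next === 'background' && previous !== 'background') {
      service.releaseModel().catch(onError);
    }
    previous = next;
  };
}
```

In the app, the returned handler would be passed to `AppState.addEventListener('change', handler)`, with the unified service from this section as the `service` argument.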
For teams managing multiple native modules across a large codebase, the patterns in Enterprise React Native Monorepo Architecture: A Practical Guide for Multi-Team Mobile Development apply directly here: isolate each native module as a workspace package so iOS and Android bridge code can be versioned and tested independently.
How to Handle Performance, Testing, and Enterprise Deployment
Performance expectations need to be set correctly before you commit to on-device inference in a product spec.
Realistic latency figures:
In our project experience, Core ML inference on the Apple Neural Engine for a quantized 2B parameter model runs 50 to 150ms for short prompts (under 100 tokens). The low end applies when computeUnits = .all is set, the model is already loaded into memory, and the device is not thermally throttled. The high end applies when the model has been paged out or the device is thermally constrained.
A quantized 7B model on the same hardware runs 200 to 500ms. The low end applies on an iPhone 15 Pro with a warmed model, the high end on an iPhone 12 under memory pressure.
Cold-start model loading adds 1 to 3 seconds on the first call. The low end applies to a 2B model on recent hardware, the high end to a 4B+ model on a device with slower storage.
Gemini Nano on Pixel 8 returns first tokens in 50 to 200ms for short prompts, with streaming making the latency less perceptible for longer outputs. The low end applies when the model is already resident in AICore memory, the high end when the system needs to page the model back in after memory pressure.
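Ranges like these are easier to compare across devices when you collect your own samples and summarize them as percentiles. The helper below is a minimal sketch using the nearest-rank method; the sample values are hypothetical placeholders, not measurements, and timeCall is an illustrative wrapper you would put around your own inference call.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Times one async call; the clock is injectable so tests can control it.
async function timeCall<T>(
  fn: () => Promise<T>,
  now: () => number = () => Date.now(),
): Promise<{ result: T; elapsedMs: number }> {
  const start = now();
  const result = await fn();
  return { result, elapsedMs: now() - start };
}

// Hypothetical warm-inference samples, for illustration only.
const samplesMs = [62, 71, 58, 140, 66, 69, 75, 64, 90, 61];
console.log(percentile(samplesMs, 50)); // 66
console.log(percentile(samplesMs, 95)); // 140
```

Recording the first call separately from later calls keeps cold-start load time from distorting the warm-inference percentiles.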
Profiling tools:
Use the Core ML performance report in Xcode, together with the Core ML template in Instruments, to see per-layer execution time and confirm Neural Engine utilization. If the report shows CPU-only execution, check your computeUnits configuration and model compatibility. On Android, use Android GPU Inspector and Perfetto (the successor to systrace) to profile AICore inference sessions.
Testing strategy checklist:
- Mock native modules in Jest using jest.mock('react-native', ...) to unit test OnDeviceAIService.ts without native dependencies
- Write integration tests for the Swift and Kotlin modules using XCTest and JUnit respectively, independent of React Native
- Use Detox for E2E tests that exercise the full inference path on a physical device (commonly missed): teams skip this because Detox setup is time-consuming, but simulators and emulators have no Neural Engine or AICore access, making simulator-only testing meaningless for this feature
- Test the cloud fallback path explicitly by mocking isAvailable() to return false (commonly missed): teams test the happy path only and discover the fallback is broken in production when a user's device is unsupported
- Test memory warning handling by triggering Simulate Memory Warning from the iOS Simulator's Debug menu (commonly missed): the model release and reload cycle is almost never tested and frequently causes crashes on low-memory devices
- Profile cold-start model loading time on the oldest supported device in your target fleet
- Test the "model downloading" state for Gemini Nano on a freshly set up Android device
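The fallback-path item in the checklist can be illustrated without any test framework. The sketch below uses a pared-down, dependency-injected version of the routing decision; Backends, routeInference, and testFallbackPath are hypothetical names, and a Jest version would express the same check with mocks and expect().

```typescript
interface Backends {
  isAvailable(): Promise<boolean>;
  onDevice(prompt: string): Promise<string>;
  cloud(prompt: string): Promise<string>;
}

// Pared-down version of the service's routing decision.
async function routeInference(b: Backends, prompt: string): Promise<string> {
  return (await b.isAvailable()) ? b.onDevice(prompt) : b.cloud(prompt);
}

// Fallback-path check: the on-device backend must not be called when unavailable.
async function testFallbackPath(): Promise<void> {
  const calls: string[] = [];
  const fake: Backends = {
    isAvailable: async () => false,
    onDevice: async () => {
      calls.push('device');
      return 'device';
    },
    cloud: async (p: string) => {
      calls.push('cloud');
      return `cloud:${p}`;
    },
  };
  const out = await routeInference(fake, 'ping');
  if (out !== 'cloud:ping') throw new Error(`unexpected result: ${out}`);
  if (calls.join(',') !== 'cloud') throw new Error(`unexpected calls: ${calls.join(',')}`);
}
```

Flipping `isAvailable` to return true and asserting the opposite call list covers the happy path with the same fake.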
Enterprise deployment: the bundle size problem.
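A quantized 4B parameter Core ML model is roughly 2 to 3GB. Shipping that in your app bundle risks breaching store size limits and, even when accepted, causes users to abandon the download. The solution is on-demand resources (iOS) or Play Asset Delivery (Android). Both mechanisms have their own size limits per pack, so confirm that a multi-gigabyte model fits, or can be split, before committing to this route.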
On iOS, mark the .mlpackage as an on-demand resource with a tag (e.g., "ai-model-v1"). Request it at runtime:
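let request = NSBundleResourceRequest(tags: ["ai-model-v1"])
request.beginAccessingResources { error in
  if let error = error {
    // Download failed or was blocked; surface this instead of loading the model
    print("On-demand resource request failed: \(error)")
    return
  }
  // Model is now available in the bundle; keep `request` alive while it is in use
}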
On Android, Play Asset Delivery with the on-demand delivery mode achieves the same result. The model pack is downloaded after install when the user first accesses the AI feature.
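Model versioning without app releases is not something either store mechanism provides on its own. On-demand resources and Play Asset Delivery packs are built and published with the app build, so shipping a new model through those channels still means shipping a new app version. To update models independently of the app binary, host the model files yourself and have the app check a version manifest and download the newer model at runtime. This matters for enterprise deployments where app release cycles are slow due to MDM approval processes.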
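The version check itself can be sketched as a comparison between the installed model and a published manifest. Everything about the manifest here is an assumption made for illustration: the ModelManifest shape, its field names, and the dotted numeric version format are hypothetical.

```typescript
// Hypothetical manifest describing the latest published model.
interface ModelManifest {
  version: string;       // dotted numeric, e.g. "1.4.0"
  sha256: string;        // expected digest of the model file
  minAppVersion: string; // oldest app build whose bridge can load this model
}

// Compares dotted numeric versions: negative if a < b, 0 if equal, positive if a > b.
function compareVersions(a: string, b: string): number {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Download only when the app can load the model and the installed copy is older or absent.
function shouldDownload(
  installedModel: string | null,
  appVersion: string,
  manifest: ModelManifest,
): boolean {
  if (compareVersions(appVersion, manifest.minAppVersion) < 0) return false;
  return installedModel === null || compareVersions(installedModel, manifest.version) < 0;
}
```

The minAppVersion gate is a design choice worth keeping: it stops an old app build from pulling a model whose input and output keys its native module does not understand.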
For teams distributing through MDM (Jamf, Intune, VMware Workspace ONE), test that on-demand resource downloads work correctly through the enterprise network proxy. Some MDM configurations block the CDN endpoints that on-demand resources use, which surfaces as a silent download failure.
Code signing applies to bundled model files. Models included in the app bundle are covered by the app's code signature. Models downloaded post-install via on-demand resources are signed separately by Apple's CDN. Models downloaded from your own server require your own integrity verification: use a SHA-256 hash check before loading any externally downloaded model file.
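The hash check can be sketched as follows. This version uses Node's built-in crypto module, so it suits a CI step or a server that publishes the expected digest; inside the app, the same comparison would run in native code or through a hashing library, since React Native does not ship Node's crypto. The function names are hypothetical.

```typescript
import { createHash } from 'node:crypto';

// Returns the lowercase hex SHA-256 digest of the given bytes.
function sha256Hex(bytes: Uint8Array): string {
  return createHash('sha256').update(bytes).digest('hex');
}

// True only if the bytes hash to the expected digest from a trusted source.
function verifyModel(bytes: Uint8Array, expectedHex: string): boolean {
  return sha256Hex(bytes) === expectedHex.trim().toLowerCase();
}
```

For a multi-gigabyte model file you would feed the hash in chunks by calling `update()` once per chunk rather than holding the whole file in memory. The expected digest must come from a channel you trust, such as a manifest served over TLS from your own backend, not from alongside the downloaded file.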
The patterns for structuring this across a multi-platform codebase are covered in depth in AI-Augmented React Native Development Enterprise 2026, including how to manage model versioning as part of a CI/CD pipeline.
The unresolved tradeoff at the center of this architecture is model capability versus device coverage. The models that run well on-device in 2025 (2B to 7B parameter class, quantized) are meaningfully less capable than GPT-4 class cloud models for complex reasoning tasks. Expanding device coverage by supporting older hardware means using smaller, less capable models. Targeting only the latest Pixel and iPhone hardware gives you better models but excludes a significant portion of enterprise device fleets, particularly in organizations with 3 to 4 year device refresh cycles. No architectural decision resolves that tradeoff. It requires knowing your actual device fleet distribution before you commit to an on-device-first strategy.
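One way to ground that decision is to compute what share of your fleet could run the on-device path at all. The sketch below assumes you can export a device inventory from your MDM; the FleetEntry shape, device labels, and counts are hypothetical placeholders.

```typescript
// Hypothetical fleet inventory row.
interface FleetEntry {
  device: string;
  count: number;
  supportsOnDevice: boolean;
}

// Fraction of devices (0 to 1) that can run the on-device path.
function onDeviceCoverage(fleet: FleetEntry[]): number {
  const total = fleet.reduce((n, e) => n + e.count, 0);
  if (total === 0) return 0;
  const supported = fleet
    .filter((e) => e.supportsOnDevice)
    .reduce((n, e) => n + e.count, 0);
  return supported / total;
}

// Placeholder numbers for illustration only.
const exampleFleet: FleetEntry[] = [
  { device: 'phone-a (current generation)', count: 1200, supportsOnDevice: true },
  { device: 'phone-b (two years old)', count: 2300, supportsOnDevice: true },
  { device: 'phone-c (four years old)', count: 1500, supportsOnDevice: false },
];

console.log(onDeviceCoverage(exampleFleet)); // 0.7
```

Running this per candidate model size, with `supportsOnDevice` set by each model's minimum hardware, shows how much coverage each step up in capability costs.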
About the author
Anurag Rathod
Technical Lead, Wednesday Solutions
Anurag is a Technical Lead at Wednesday Solutions who specialises in React Native and enterprise AI enablement. He has shipped mobile platforms across logistics, container movement, gambling, esports, and martech, and brings compliance-ready, offline-first architecture to every engagement.