Why can't React Native call Core ML or AICore directly from JavaScript?

React Native's JS thread runs on JavaScriptCore or Hermes, neither of which has bindings to Core ML's MLModel class or Android's AICore system service. Hardware-accelerated inference is a system-level runtime, not a JavaScript API. Running inference in JavaScript with ONNX.js or TensorFlow.js (for example inside a WebView) bypasses the Neural Engine entirely and falls back to the CPU. For a quantized 2B parameter model, a call that completes in under 100ms on the Neural Engine can take 2 to 8 seconds on the CPU.

Which Android devices support Gemini Nano via AICore?

As of mid-2025, Gemini Nano via AICore is available on Pixel 8 and later, Samsung Galaxy S24 series, and a growing list of Android 14+ devices with sufficient NPU capability. Coverage is not universal across Android. Enterprise teams should build a device eligibility check using GenerativeModel.checkAvailability() at app startup and implement a cloud fallback for unsupported devices to avoid silent inference failures.

How do you ship a 3GB Core ML model without exceeding App Store size limits?

Use iOS On-Demand Resources to mark the .mlpackage with a resource tag and download it at runtime using NSBundleResourceRequest, keeping it out of the initial app bundle. On Android, Play Asset Delivery with on-demand delivery mode achieves the same result. Both mechanisms also support model versioning independent of app binary releases, which matters for enterprises with slow MDM-gated release cycles.

What is the realistic latency for on-device LLM inference in a React Native app?

Core ML inference on Apple Neural Engine for a quantized 2B model runs 50–150ms for short prompts when computeUnits is set to .all and the model is already loaded. A 7B quantized model on the same hardware runs 200–500ms. Cold-start model loading adds 1–3 seconds on first call. Gemini Nano on Pixel 8 returns first tokens in 50–200ms for short prompts, with streaming reducing perceived latency for longer outputs.

Bridging React Native to Native On-Device AI: A Practical Guide to Core ML and Gemini Nano Integrations

A concrete implementation guide for routing LLM inference through native modules to Apple Neural Engine and Android NPU—covering Swift and Kotlin bridge code, a unified TypeScript service layer, and enterprise deployment patterns for bundle size and model versioning.

Anurag Rathod · Technical Lead, Wednesday Solutions

14 min read · Published May 25, 2026 · Updated May 25, 2026

In this article

Why Does React Native Require Native Modules for On-Device AI?
How to Set Up Core ML Integration on iOS
How to Integrate Gemini Nano on Android via AICore and Kotlin
How to Build a Unified TypeScript AI Service Layer for Both Platforms
How to Handle Performance, Testing, and Enterprise Deployment

On-device LLM mobile integration in React Native means routing inference calls through native modules that invoke hardware-accelerated runtimes directly: Core ML on iOS and AICore on Android. The JavaScript thread has no direct path to the Apple Neural Engine or Android NPU, so every team building this capability must write native bridge code. This guide gives you the concrete architecture, annotated code, and deployment strategy to ship it.

What Is On-Device LLM Mobile Integration in React Native?

On-device LLM mobile integration means running language model inference on the device's dedicated AI hardware (Neural Engine on iOS, NPU via AICore on Android) through native modules bridged to your React Native JS layer, with no cloud API call required.

Core ML on iOS exposes hardware-accelerated inference via the Neural Engine, bridgeable through Objective-C/Swift native modules using RCT_EXPORT_MODULE and RCT_EXPORT_METHOD macros to surface inference results as resolved Promises in JavaScript.

Gemini Nano on Android uses the AICore system service, accessible via Kotlin/Java native modules using ReactContextBaseJavaModule, with streaming token output bridgeable to React Native via DeviceEventEmitter and NativeEventEmitter.

Why Does React Native Require Native Modules for On-Device AI?

React Native's JS thread runs on JavaScriptCore (or Hermes), which has no binding to Core ML's MLModel class or Android's AICore service. Hardware-accelerated inference is not a JavaScript API. It is a system-level runtime that requires native code to invoke.

JS-only inference approaches like ONNX.js or TensorFlow.js running inside a WebView context bypass the Neural Engine entirely. All computation falls back to the CPU. In our project experience, inference that completes in under 100ms on the Apple Neural Engine can take 2 to 8 seconds on the CPU for a quantized 2B parameter model. The gap widens with model size.

Apple Neural Engine (ANE): A dedicated matrix-multiplication accelerator built into Apple Silicon and A-series chips. It is only accessible through Core ML's MLModel API. You cannot call it from JavaScript, a WebView, or a React Native background thread without a native module.

Android NPU / AICore: Google's AICore is a system-level service introduced in Android 14 that manages on-device foundation models including Gemini Nano. It is accessible via the Android Generative AI API (com.google.android.gms.ai). Like ANE, it requires native Kotlin or Java code to invoke.

Why Enterprises Choose On-Device Inference

Three categories of enterprise requirement push teams toward on-device inference specifically.

First, data residency. HIPAA and GDPR both create scenarios where sending user-generated text to a cloud API is legally problematic or prohibited. Legal teams at regulated enterprises commonly block cloud LLM API calls for any input that could contain PII. On-device inference eliminates the transmission entirely.

Second, offline-first applications. Field service teams in manufacturing plants, utility workers in remote locations, and retail associates in stores with unreliable Wi-Fi all need inference that works without a network connection. Cloud APIs fail silently in these environments.

Third, cost at scale. A mobile application making 50 inference calls per active user per day at 500 tokens per call generates cloud API costs that become significant above roughly 100,000 monthly active users. On-device inference has zero marginal cost per token after the model is on the device.
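The arithmetic behind that threshold can be sketched as follows. This is a rough estimate only: the price per million tokens is a hypothetical placeholder rather than a vendor quote, and the scenario assumes every monthly active user is active every day, which makes it an upper bound.

```typescript
// All prices and usage figures here are illustrative assumptions.
interface UsageProfile {
  monthlyActiveUsers: number;
  callsPerUserPerDay: number;
  tokensPerCall: number;
  activeDaysPerMonth: number; // days per month a typical user is active
}

// Total tokens sent to a cloud API in one month.
function monthlyTokens(u: UsageProfile): number {
  return u.monthlyActiveUsers * u.callsPerUserPerDay * u.tokensPerCall * u.activeDaysPerMonth;
}

// Monthly bill for a given (hypothetical) price per one million tokens.
function monthlyCloudCostUsd(u: UsageProfile, usdPerMillionTokens: number): number {
  return (monthlyTokens(u) / 1_000_000) * usdPerMillionTokens;
}

const articleScenario: UsageProfile = {
  monthlyActiveUsers: 100_000,
  callsPerUserPerDay: 50,
  tokensPerCall: 500,
  activeDaysPerMonth: 30, // upper bound: every user active every day
};

// 100,000 users x 50 calls x 500 tokens x 30 days = 75 billion tokens
const scenarioTokens = monthlyTokens(articleScenario);
// At a hypothetical $0.50 per 1M tokens this comes to $37,500 per month
const scenarioCost = monthlyCloudCostUsd(articleScenario, 0.5);
```

Lowering activeDaysPerMonth to match your real engagement data is usually the single biggest correction to this estimate.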

Cloud LLM API vs. On-Device LLM: Direct Comparison

Dimension	Cloud LLM API	On-Device LLM
Latency	500ms to 3s (higher end when network is congested or model is large)	50ms to 500ms (lower end on latest Neural Engine hardware with a warmed model)
Cost	Per-token billing	Zero marginal cost
Privacy	Data leaves device	Data never leaves device
Offline support	None	Full
Model size	Unlimited	1 to 8GB practical limit
Model quality ceiling	GPT-4 class	~7B parameter class (2025)
Integration complexity	Low (REST API)	High (native modules required)

Cloud latency is at the low end when the network is fast and the model is small. It reaches the high end under congestion or with large context windows. On-device latency is at the low end on a Pixel 8 or iPhone 15 Pro with a warmed 2B model. It reaches the high end on older hardware running a 7B quantized model from a cold start.

How to Set Up Core ML Integration on iOS

The iOS integration path has three distinct stages: adding the model to Xcode, writing the Swift native module, and calling it from JavaScript.

Step 1: Add the model to your Xcode project.

Add your .mlpackage or .mlmodel file to the Xcode project navigator. Set target membership to your main app target. Use Core ML Tools (coremltools Python package) to validate the model before adding it:

import coremltools as ct
model = ct.models.MLModel('YourModel.mlpackage')
print(model.get_spec())

For a quantized open-source model, Mistral 7B quantized to 4-bit via Core ML Tools produces a package in the 3.5 to 4GB range. For smaller on-device use cases, Apple's Foundation Models framework (available from iOS 26 on devices that support Apple Intelligence) provides a system-managed model with no bundle size cost. See Core ML and On-Device AI for iOS Enterprise Apps 2026 for model selection and quantization tradeoffs in detail.

Step 2: Write the Swift native module.

// CoreMLInferenceModule.swift
import Foundation
import CoreML
import React

@objc(CoreMLInference)
class CoreMLInference: NSObject, RCTBridgeModule {

  static func moduleName() -> String! { return "CoreMLInference" }

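  // Swift modules must also be registered from an Objective-C file using
  // RCT_EXTERN_MODULE and RCT_EXTERN_METHOD so React Native can see these methods
  @objc static func requiresMainQueueSetup() -> Bool { return false }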

  private var model: MLModel?

  @objc func loadModel(
    _ resolve: @escaping RCTPromiseResolveBlock,
    rejecter reject: @escaping RCTPromiseRejectBlock
  ) {
    DispatchQueue.global(qos: .userInitiated).async {
      do {
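        guard let modelURL = Bundle.main.url(
          forResource: "YourModel",
          withExtension: "mlmodelc"
        ) else {
          // Reject instead of force-unwrapping, which would crash the app
          reject("MODEL_LOAD_ERROR", "YourModel.mlmodelc not found in app bundle", nil)
          return
        }
        let config = MLModelConfiguration()
        config.computeUnits = .all // Lets Core ML schedule on CPU, GPU, and Neural Engine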
        self.model = try MLModel(contentsOf: modelURL, configuration: config)
        resolve(true)
      } catch {
        reject("MODEL_LOAD_ERROR", error.localizedDescription, error)
      }
    }
  }

  @objc func runInference(
    _ prompt: String,
    resolver resolve: @escaping RCTPromiseResolveBlock,
    rejecter reject: @escaping RCTPromiseRejectBlock
  ) {
    // Never run inference on the main thread
    DispatchQueue.global(qos: .userInitiated).async {
      guard let model = self.model else {
        reject("MODEL_NOT_LOADED", "Call loadModel() before runInference()", nil)
        return
      }
      do {
        // Input/output keys depend on your specific model's spec
        let input = try MLDictionaryFeatureProvider(
          dictionary: ["prompt": prompt as NSString]
        )
        let output = try model.prediction(from: input)
        let result = output.featureValue(for: "output")?.stringValue ?? ""
        resolve(result)
      } catch {
        // MLError.predictionFailed often indicates memory pressure
        reject("INFERENCE_ERROR", error.localizedDescription, error)
      }
    }
  }
}

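Setting computeUnits = .all lets Core ML schedule work on the Neural Engine wherever the model's operations support it; it does not force every layer onto it. .all is also the default for MLModelConfiguration, but setting it explicitly documents the intent and guards against a stray .cpuOnly, which drops you back to CPU-bound inference.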

Step 3: Call it from JavaScript.

import { NativeModules } from 'react-native';

const { CoreMLInference } = NativeModules;

async function runCoreMLInference(prompt: string): Promise<string> {
  try {
    await CoreMLInference.loadModel();
    const result = await CoreMLInference.runInference(prompt);
    return result;
  } catch (error: any) {
    if (error.code === 'MODEL_LOAD_ERROR') {
      throw new Error('Model failed to load. Check that the model file is in the bundle (Xcode target membership).');
    }
    if (error.code === 'INFERENCE_ERROR') {
      throw new Error('Inference failed. Device may be under memory pressure.');
    }
    throw error;
  }
}

Three error conditions require explicit handling: the model file missing from the bundle (check target membership in Xcode), memory pressure causing MLError.predictionFailed (implement a retry with model reload), and calling runInference before loadModel completes (guard with a loading state flag in your service layer).
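The second and third conditions can be handled in one small wrapper. The sketch below assumes a native module shaped like the one in Step 2; InferenceBridge and GuardedInference are hypothetical names introduced for illustration. It shares a single in-flight load between concurrent callers and, on INFERENCE_ERROR, reloads the model once and retries once.

```typescript
// Hypothetical shape of the native module from Step 2.
interface InferenceBridge {
  loadModel(): Promise<unknown>;
  runInference(prompt: string): Promise<string>;
}

interface BridgeError {
  code?: string;
}

class GuardedInference {
  private readonly bridge: InferenceBridge;
  private loading: Promise<unknown> | null = null;

  constructor(bridge: InferenceBridge) {
    this.bridge = bridge;
  }

  // Memoize the load promise so concurrent callers share one load
  // and runInference can never start before loadModel has resolved.
  private ensureLoaded(): Promise<unknown> {
    if (this.loading === null) {
      this.loading = this.bridge.loadModel().catch((err: unknown) => {
        this.loading = null; // allow a later attempt after a failed load
        throw err;
      });
    }
    return this.loading;
  }

  async run(prompt: string): Promise<string> {
    await this.ensureLoaded();
    try {
      return await this.bridge.runInference(prompt);
    } catch (err) {
      if ((err as BridgeError).code !== 'INFERENCE_ERROR') throw err;
      // Possibly memory pressure: drop the cached load, reload once, retry once.
      this.loading = null;
      await this.ensureLoaded();
      return this.bridge.runInference(prompt);
    }
  }
}
```

In an app you would construct it with NativeModules.CoreMLInference. A single retry is a deliberate limit: if the second attempt also fails, the error propagates so the caller can route to the cloud fallback.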

How to Integrate Gemini Nano on Android via AICore and Kotlin

The Android path mirrors the iOS structure but has an additional prerequisite: device eligibility and model availability checks that iOS does not require.

Step 1: Check device eligibility and trigger model download.

Gemini Nano via AICore is available on Pixel 8 and later, Samsung Galaxy S24 series, and a growing list of Android 14+ devices with sufficient NPU capability. As of mid-2025, this is not universal across Android.

// In your ReactContextBaseJavaModule
import com.google.android.gms.ai.GenerativeModel
import com.google.android.gms.ai.GenerativeModelFutures
import com.google.android.gms.ai.java.GenerativeModel as JavaGenerativeModel

@ReactMethod
fun checkAvailability(promise: Promise) {
    val availability = GenerativeModel.checkAvailability(reactApplicationContext)
    when (availability) {
        GenerativeModel.Availability.AVAILABLE -> promise.resolve("available")
        GenerativeModel.Availability.DOWNLOADING -> promise.resolve("downloading")
        GenerativeModel.Availability.NOT_AVAILABLE -> promise.resolve("unavailable")
        else -> promise.resolve("unknown")
    }
}

If the result is DOWNLOADING, the model is still being fetched and is not ready for inference. Your JS layer should poll this check and show a progress indicator rather than attempting inference.
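One way to poll without hammering the bridge is capped exponential backoff. This is a minimal sketch: waitForModel and PollOptions are hypothetical names, the status strings are the ones resolved by checkAvailability above, and the sleep function is injected so the logic can be tested without real timers.

```typescript
type NanoAvailability = 'available' | 'downloading' | 'unavailable' | 'unknown';

interface PollOptions {
  check: () => Promise<NanoAvailability>; // e.g. wraps GeminiNano.checkAvailability()
  sleep: (ms: number) => Promise<void>; // injected so tests need no timers
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
}

async function waitForModel(opts: PollOptions): Promise<NanoAvailability> {
  let delay = opts.initialDelayMs;
  let last: NanoAvailability = 'unknown';
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    last = await opts.check();
    // Any status other than "downloading" ends the polling loop.
    if (last !== 'downloading') return last;
    if (attempt < opts.maxAttempts) {
      await opts.sleep(delay);
      delay = Math.min(delay * 2, opts.maxDelayMs); // exponential backoff, capped
    }
  }
  return last; // still "downloading" after the attempt budget ran out
}
```

If the function returns "downloading", the budget was exhausted; that is the point at which to route the request to the cloud fallback and try again on the next app launch.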

Step 2: Write the Kotlin native module with coroutine-based inference.

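// GeminiNanoModule.kt
import com.facebook.react.bridge.Arguments
import com.facebook.react.bridge.Promise
import com.facebook.react.bridge.ReactApplicationContext
import com.facebook.react.bridge.ReactContextBaseJavaModule
import com.facebook.react.bridge.ReactMethod
import com.facebook.react.modules.core.DeviceEventManagerModule
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
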
class GeminiNanoModule(reactContext: ReactApplicationContext) :
    ReactContextBaseJavaModule(reactContext) {

    override fun getName() = "GeminiNano"

    private val model by lazy {
        GenerativeModel(
            modelName = "gemini-nano",
            requestOptions = RequestOptions()
        )
    }

    @ReactMethod
    fun runInference(prompt: String, promise: Promise) {
        CoroutineScope(Dispatchers.IO).launch {
            try {
                val response = model.generateContent(prompt)
                promise.resolve(response.text)
            } catch (e: Exception) {
                promise.reject("INFERENCE_ERROR", e.message, e)
            }
        }
    }

    @ReactMethod
    fun streamInference(prompt: String) {
        CoroutineScope(Dispatchers.IO).launch {
            try {
                model.generateContentStream(prompt).collect { chunk ->
                    chunk.text?.let { token ->
                        val params = Arguments.createMap()
                        params.putString("token", token)
                        reactApplicationContext
                            .getJSModule(DeviceEventManagerModule.RCTDeviceEventEmitter::class.java)
                            .emit("GeminiNanoToken", params)
                    }
                }
                // Signal stream completion
                val doneParams = Arguments.createMap()
                doneParams.putBoolean("done", true)
                reactApplicationContext
                    .getJSModule(DeviceEventManagerModule.RCTDeviceEventEmitter::class.java)
                    .emit("GeminiNanoToken", doneParams)
            } catch (e: Exception) {
                val errorParams = Arguments.createMap()
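                // e.message can be null; always send a non-empty string so the JS error branch fires
                errorParams.putString("error", e.message ?: "Unknown inference error")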
                reactApplicationContext
                    .getJSModule(DeviceEventManagerModule.RCTDeviceEventEmitter::class.java)
                    .emit("GeminiNanoToken", errorParams)
            }
        }
    }
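
    // Required by NativeEventEmitter on the JS side (React Native 0.65+)
    @ReactMethod
    fun addListener(eventName: String) {}

    @ReactMethod
    fun removeListeners(count: Int) {}
}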

Step 3: Subscribe to streaming tokens in JavaScript.

import { NativeModules, NativeEventEmitter } from 'react-native';

const { GeminiNano } = NativeModules;
const geminiEmitter = new NativeEventEmitter(GeminiNano);

function streamGeminiInference(
  prompt: string,
  onToken: (token: string) => void,
  onComplete: () => void,
  onError: (error: string) => void
) {
  const subscription = geminiEmitter.addListener('GeminiNanoToken', (event) => {
    if (event.error) {
      onError(event.error);
      subscription.remove();
    } else if (event.done) {
      onComplete();
      subscription.remove();
    } else if (event.token) {
      onToken(event.token);
    }
  });

  GeminiNano.streamInference(prompt);
  return () => subscription.remove(); // cleanup function
}
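
The three event shapes emitted on GeminiNanoToken (token, done, error) map naturally onto a small state reducer, which keeps UI state handling separate from the subscription code. The sketch below is one possible shape; TokenEvent, StreamState, and reduceStream are hypothetical names, not part of any SDK.

```typescript
// Event payloads as emitted by the Kotlin module: one key is set per event.
interface TokenEvent {
  token?: string;
  done?: boolean;
  error?: string;
}

interface StreamState {
  text: string;
  phase: 'streaming' | 'complete' | 'failed';
  error: string | null;
}

const initialStreamState: StreamState = { text: '', phase: 'streaming', error: null };

function reduceStream(state: StreamState, ev: TokenEvent): StreamState {
  // Once the stream has finished or failed, ignore any late events.
  if (state.phase !== 'streaming') return state;
  if (typeof ev.error === 'string') return { ...state, phase: 'failed', error: ev.error };
  if (ev.done === true) return { ...state, phase: 'complete' };
  if (typeof ev.token === 'string') return { ...state, text: state.text + ev.token };
  return state;
}
```

Because the reducer is a pure function, it can be passed to React's useReducer and unit tested with plain objects, without a device or a native module.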

One deployment detail that catches teams off guard: Gemini Nano model download is triggered by the system, not your app. On a freshly factory-reset Pixel 8, the model may not be present for 24 to 48 hours after first boot. Build your availability check into app startup and communicate the "model downloading" state clearly to users rather than silently falling back.

How to Build a Unified TypeScript AI Service Layer for Both Platforms

Wrapping both native modules behind a single TypeScript service class keeps your product code clean and makes the fallback logic testable without native dependencies.

// OnDeviceAIService.ts
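import { Platform, NativeModules } from 'react-native';
// Assumes the streaming helper from the Android section is exported from this (hypothetical) file
import { streamGeminiInference } from './streamGeminiInference';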

interface OnDeviceAIService {
  isAvailable(): Promise<boolean>;
  runInference(prompt: string): Promise<string>;
  streamInference(
    prompt: string,
    onToken: (token: string) => void,
    onComplete: () => void
  ): () => void;
  releaseModel(): Promise<void>;
}

class OnDeviceAIServiceImpl implements OnDeviceAIService {
  private available: boolean | null = null;

  async isAvailable(): Promise<boolean> {
    if (this.available !== null) return this.available;

    if (Platform.OS === 'ios') {
      try {
        await NativeModules.CoreMLInference.loadModel();
        this.available = true;
      } catch {
        this.available = false;
      }
    } else if (Platform.OS === 'android') {
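      const status = await NativeModules.GeminiNano.checkAvailability();
      // "downloading" is transient: report unavailable now, but re-check on the next call
      if (status === 'downloading') return false;
      this.available = status === 'available';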
    } else {
      this.available = false;
    }

    return this.available;
  }

  async runInference(prompt: string): Promise<string> {
    const onDevice = await this.isAvailable();

    if (!onDevice) {
      // Cloud fallback: replace with your preferred provider
      return this.cloudFallback(prompt);
    }

    if (Platform.OS === 'ios') {
      return NativeModules.CoreMLInference.runInference(prompt);
    } else {
      return NativeModules.GeminiNano.runInference(prompt);
    }
  }

  private async cloudFallback(prompt: string): Promise<string> {
    // Route to OpenAI, Vertex AI, or your enterprise gateway
    throw new Error('Cloud fallback not configured');
  }

  async releaseModel(): Promise<void> {
    if (Platform.OS === 'ios') {
      await NativeModules.CoreMLInference.releaseModel?.();
    }
    // Android model lifecycle is managed by AICore system service
  }

  streamInference(
    prompt: string,
    onToken: (token: string) => void,
    onComplete: () => void
  ): () => void {
    if (Platform.OS === 'android') {
      return streamGeminiInference(prompt, onToken, onComplete, console.error);
    }
    // iOS streaming is not implemented here: it needs an RCTEventEmitter-based
    // Swift module, or chunked polling, depending on your Core ML model's output format
    return () => {};
  }
}

export const aiService = new OnDeviceAIServiceImpl();

Layer stack (text diagram):

React Native Product Code
        ↓
OnDeviceAIService.ts (TypeScript, Platform.OS routing)
        ↓
NativeModules bridge (React Native bridge / JSI)
        ↓
CoreMLInference (Swift) | GeminiNanoModule (Kotlin)
        ↓
Core ML MLModel API   | Android AICore / Generative AI API
        ↓
Apple Neural Engine   | Android NPU

Memory lifecycle management is where most teams cut corners. On iOS, observe UIApplication.didReceiveMemoryWarningNotification in your Swift module and set self.model = nil to release the model from memory. Expose a releaseModel() method to JS (it is not part of the Swift module shown earlier) so your product code can proactively free memory when the AI feature is backgrounded. On Android, AICore manages the model lifecycle at the system level, so your Kotlin module does not need to handle this directly.

For teams managing multiple native modules across a large codebase, the patterns in Enterprise React Native Monorepo Architecture: A Practical Guide for Multi-Team Mobile Development apply directly here: isolate each native module as a workspace package so iOS and Android bridge code can be versioned and tested independently.

How to Handle Performance, Testing, and Enterprise Deployment

Performance expectations need to be set correctly before you commit to on-device inference in a product spec.

Realistic latency figures:

In our project experience, Core ML inference on the Apple Neural Engine for a quantized 2B parameter model runs 50 to 150ms for short prompts (under 100 tokens). The low end applies when computeUnits = .all is set, the model is already loaded into memory, and the device is not thermally throttled. The high end applies when the model has been paged out or the device is thermally constrained.

A quantized 7B model on the same hardware runs 200 to 500ms. Expect the low end on an iPhone 15 Pro with a warmed model and the high end on an iPhone 12 under memory pressure.

Cold-start model loading adds 1 to 3 seconds on first call. Expect the low end for a 2B model on recent hardware and the high end for a 4B+ model on a device with slower storage.

Gemini Nano on Pixel 8 returns first tokens in 50 to 200ms for short prompts, with streaming making the latency less perceptible for longer outputs. Expect the low end when the model is already resident in AICore memory and the high end when the system needs to page the model back in after memory pressure.
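These ranges are worth verifying on your own device fleet before they go into a product spec. A minimal sketch of how to do that from the JS side: time each call with an injected clock and summarize the samples as percentiles using the nearest-rank method. The function names are hypothetical.

```typescript
// Times one inference call. The clock is injected (e.g. Date.now) so the
// helper stays testable and independent of any particular runtime.
async function timeInference(run: () => Promise<string>, now: () => number): Promise<number> {
  const start = now();
  await run();
  return now() - start;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error('no samples');
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}
```

Reporting p50 and p95 separately matters here: a single cold start or thermally throttled call can dominate an average while leaving the median untouched.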

Profiling tools:

Use Xcode Instruments with the Core ML performance report template to see per-layer execution time and confirm Neural Engine utilization. If the report shows CPU-only execution, check your computeUnits configuration and model compatibility. On Android, use Android GPU Inspector and systrace to profile AICore inference sessions.

Testing strategy checklist:

Mock native modules in Jest using jest.mock('react-native', ...) to unit test OnDeviceAIService.ts without native dependencies
Write integration tests for the Swift and Kotlin modules using XCTest and JUnit respectively, independent of React Native
Use Detox for E2E tests that exercise the full inference path on a physical device (commonly missed): teams skip this because Detox setup is time-consuming, but simulators and emulators have no Neural Engine or AICore access, making simulator-only testing meaningless for this feature
Test the cloud fallback path explicitly by mocking isAvailable() to return false (commonly missed): teams test the happy path only and discover the fallback is broken in production when a user's device is unsupported
Test memory warning handling by triggering a simulated memory warning in Xcode's Debug menu (commonly missed): the model release and reload cycle is almost never tested and frequently causes crashes on low-memory devices
Profile cold-start model loading time on the oldest supported device in your target fleet
Test the "model downloading" state for Gemini Nano on a freshly set up Android device
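The fallback-path item in the checklist becomes easy to test when routing is written against injected dependencies rather than NativeModules directly. The following is a sketch under that assumption, not the service class shown earlier: createInferenceRouter and InferenceDeps are hypothetical names, and the check at the bottom is written as a plain function so it runs in any test runner.

```typescript
interface InferenceDeps {
  isAvailable: () => Promise<boolean>;
  onDevice: (prompt: string) => Promise<string>;
  cloud: (prompt: string) => Promise<string>;
}

interface RoutedResult {
  text: string;
  route: 'device' | 'cloud';
}

function createInferenceRouter(deps: InferenceDeps) {
  return async function infer(prompt: string): Promise<RoutedResult> {
    if (await deps.isAvailable()) {
      return { text: await deps.onDevice(prompt), route: 'device' };
    }
    return { text: await deps.cloud(prompt), route: 'cloud' };
  };
}

// Fallback-path check: the device reports unavailable, so the on-device
// function must never be called and the cloud result must be returned.
async function fallbackPathIsUsed(): Promise<boolean> {
  let deviceCalls = 0;
  const infer = createInferenceRouter({
    isAvailable: async () => false,
    onDevice: async (p) => { deviceCalls += 1; return 'device:' + p; },
    cloud: async (p) => 'cloud:' + p,
  });
  const result = await infer('hello');
  return result.route === 'cloud' && result.text === 'cloud:hello' && deviceCalls === 0;
}
```

In a Jest suite the body of fallbackPathIsUsed would become a test case with expect assertions; the point of the design is that no react-native mock is needed at all.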

Enterprise deployment: the bundle size problem.

A quantized 4B parameter Core ML model is roughly 2 to 3GB. Shipping that in your app bundle produces an App Store submission that will be rejected or will cause users to abandon the download. The solution is on-demand resources (iOS) or Play Asset Delivery (Android).

On iOS, mark the .mlpackage as an on-demand resource with a tag (e.g., "ai-model-v1"). Request it at runtime:

let request = NSBundleResourceRequest(tags: ["ai-model-v1"])
request.beginAccessingResources { error in
    guard error == nil else { return } // download failed: retry or use the cloud fallback
    // Model is now available in the bundle
}

On Android, Play Asset Delivery with the on-demand delivery mode achieves the same result. The model pack is downloaded after install when the user first accesses the AI feature.

Model versioning without app releases is achievable through both mechanisms. Update the asset pack or on-demand resource with a new version tag, and your app can check for and download updated models independently of the app binary. This matters for enterprise deployments where app release cycles are slow due to MDM approval processes.

For teams distributing through MDM (Jamf, Intune, VMware Workspace ONE), test that on-demand resource downloads work correctly through the enterprise network proxy. Some MDM configurations block the CDN endpoints that on-demand resources use, which surfaces as a silent download failure.

Code signing applies to bundled model files. Models included in the app bundle are covered by the app's code signature. Models downloaded post-install via on-demand resources are signed separately by Apple's CDN. Models downloaded from your own server require your own integrity verification: use a SHA-256 hash check before loading any externally downloaded model file.
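For self-hosted models, the version check and the integrity check can both be driven from a small manifest. The sketch below assumes a hypothetical manifest format published by your own server, and assumes the SHA-256 digest of the downloaded file is computed in native code (for example with CryptoKit on iOS or MessageDigest on Android) and passed to JS as a hex string.

```typescript
// Hypothetical manifest your own server would publish next to the model file.
interface ModelManifest {
  version: number;
  sha256: string; // expected digest, hex encoded
  sizeBytes: number;
}

// A download is needed when nothing is installed or the server has a newer version.
function needsDownload(installedVersion: number | null, remote: ModelManifest): boolean {
  return installedVersion === null || remote.version > installedVersion;
}

// Accept the downloaded file only if both size and digest match the manifest.
// Hex digests are compared case-insensitively.
function isIntact(remote: ModelManifest, actualSha256: string, actualSizeBytes: number): boolean {
  return (
    actualSizeBytes === remote.sizeBytes &&
    actualSha256.toLowerCase() === remote.sha256.toLowerCase()
  );
}
```

The manifest itself must be fetched over a channel you trust (TLS at minimum), since an attacker who can replace both the model and the manifest defeats the check.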

The patterns for structuring this across a multi-platform codebase are covered in depth in AI-Augmented React Native Development Enterprise 2026, including how to manage model versioning as part of a CI/CD pipeline.

The unresolved tradeoff at the center of this architecture is model capability versus device coverage. The models that run well on-device in 2025 (2B to 7B parameter class, quantized) are meaningfully less capable than GPT-4 class cloud models for complex reasoning tasks. Expanding device coverage by supporting older hardware means using smaller, less capable models. Targeting only the latest Pixel and iPhone hardware gives you better models but excludes a significant portion of enterprise device fleets, particularly in organizations with 3 to 4 year device refresh cycles. No architectural decision resolves that tradeoff. It requires knowing your actual device fleet distribution before you commit to an on-device-first strategy.
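Putting a number on that fleet distribution is straightforward once you have a device inventory. The sketch below computes the share of devices that could run on-device inference; the fleet counts and the supported-device set are illustrative only and should be replaced with an export from your MDM inventory and your own compatibility list.

```typescript
interface FleetEntry {
  deviceModel: string;
  count: number;
}

// Fraction of the fleet (0 to 1) whose device model is in the supported set.
function onDeviceCoverage(fleet: FleetEntry[], supported: Set<string>): number {
  const total = fleet.reduce((sum, d) => sum + d.count, 0);
  if (total === 0) return 0;
  const covered = fleet
    .filter((d) => supported.has(d.deviceModel))
    .reduce((sum, d) => sum + d.count, 0);
  return covered / total;
}

// Illustrative numbers only.
const sampleFleet: FleetEntry[] = [
  { deviceModel: 'Pixel 8', count: 200 },
  { deviceModel: 'Galaxy S24', count: 300 },
  { deviceModel: 'Galaxy A14', count: 500 },
];

// 500 of 1,000 devices are supported, so coverage is 0.5
const sampleCoverage = onDeviceCoverage(sampleFleet, new Set(['Pixel 8', 'Galaxy S24']));
```

A coverage figure well below 1 means the cloud fallback is the primary path for a large share of users, which changes the cost and privacy arguments made earlier.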

About the author

Anurag Rathod

Technical Lead, Wednesday Solutions

Anurag is a Technical Lead at Wednesday Solutions who specialises in React Native and enterprise AI enablement. He has shipped mobile platforms across logistics, container movement, gambling, esports, and martech, and brings compliance-ready, offline-first architecture to every engagement.
