Local AI on Android - Do More On-Device with LiteRT & MediaPipe
28 Aug 2025

Running ML on-device isn't just a cool demo anymore: it's how you ship private, fast, resilient features without praying to the network gods. Modern Android hardware is plenty capable, and the tooling around LiteRT (formerly TensorFlow Lite) and MediaPipe has matured to the point where you can add image recognition, text classification, and even AR pipelines with surprisingly little code. Here's the practical, opinionated guide I wish more apps followed.
What "local ML" means (and where it shines)
Local ML = your model runs entirely on the device CPU/GPU/NPUs via LiteRT or MediaPipe Tasks. No server call, no streaming tensors over the wire. Sweet spots:
- Image recognition/segmentation (e.g., "Is this a receipt?" "Where's the product label?")
- Text classification (moderation, topic/routing, sentiment, intent)
- AR/real-time perception (object detection, face/pose landmarks feeding AR overlays)
If you've got latency-sensitive UX (camera previews, autocomplete, smart replies), on-device is almost always the right default.
Why now? Devices finally caught up
Mid-range phones ship with efficient NPUs/GPUs and more RAM; Android's NNAPI has solid coverage; LiteRT/MediaPipe add high-level Task APIs that hide the scary parts. You don't need a research team to get excellent results. In 2025, shipping on-device is the mature, modern path, not the experimental one.
The Big Wins
- Privacy: user data never leaves the phone by default.
- Speed: <100ms inference is normal for medium models; camera loops feel instant.
- Offline: works on trains, planes, basements, and spotty 3G.
Project setup (do this first)
Add dependencies. I recommend starting with the Task libraries (they wrap preprocessing/postprocessing):
dependencies {
    // Core LiteRT runtime
    implementation("org.tensorflow:tensorflow-lite:<latest>")
    // Task libraries (high-level APIs)
    implementation("org.tensorflow:tensorflow-lite-task-vision:<latest>")
    implementation("org.tensorflow:tensorflow-lite-task-text:<latest>")
    // Or MediaPipe Tasks alternatives (great for vision/AR pipelines)
    implementation("com.google.mediapipe:tasks-vision:<latest>")
}
Bundle your model(s) in app/src/main/assets/models/…. Enable memory-mapping and avoid compression so LiteRT can map the file directly (note: noCompress takes the file extension, .tflite):
android {
    aaptOptions { noCompress "tflite" } // older AGP syntax
    // or on newer AGP:
    // androidResources { noCompress += "tflite" }
}
Opinion: Always memory-map models. It slashes startup time and RAM spikes.
Option A: Vision in minutes with LiteRT Task APIs
Goal: classify a camera frame with a MobileNet-style classifier.
import android.content.Context
import android.graphics.Bitmap
import org.tensorflow.lite.support.image.TensorImage
import org.tensorflow.lite.task.core.BaseOptions
import org.tensorflow.lite.task.vision.classifier.ImageClassifier
import org.tensorflow.lite.task.vision.classifier.ImageClassifier.ImageClassifierOptions

class LocalImageClassifier(private val context: Context) {
    private val classifier: ImageClassifier by lazy {
        val base = BaseOptions.builder()
            .setNumThreads(4)
            .useNnapi() // or .useGpu() depending on device mix
            .build()
        val options = ImageClassifierOptions.builder()
            .setBaseOptions(base)
            .setMaxResults(3)
            .setScoreThreshold(0.5f)
            .build()
        ImageClassifier.createFromFileAndOptions(
            context,
            "models/mobilenet_v3.tflite",
            options
        )
    }

    fun classify(bitmap: Bitmap): List<Pair<String, Float>> {
        val image = TensorImage.fromBitmap(bitmap)
        val results = classifier.classify(image)
        // First head, top categories
        return results.firstOrNull()?.categories?.map {
            it.label to it.score
        } ?: emptyList()
    }
}
Tips that pay off:
- Reuse the classifier instance (don't re-create per frame).
- On camera: run on a background dispatcher and throttle (e.g., every 2–3 frames); see the sketch after this list.
- Device strategy: NNAPI for Pixels/newer flagships, GPU for mid-range; test both.
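Here is a minimal sketch of that camera-loop pattern, assuming CameraX's ImageAnalysis; ThrottledAnalyzer and the every-third-frame rule are illustrative, not a library API:

import androidx.camera.core.ImageAnalysis
import androidx.camera.core.ImageProxy
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.launch
import java.util.concurrent.Executors

class ThrottledAnalyzer(
    private val classifier: LocalImageClassifier,
    private val scope: CoroutineScope,
    private val onResult: (List<Pair<String, Float>>) -> Unit
) : ImageAnalysis.Analyzer {
    // Single-threaded dispatcher keeps inference latency deterministic.
    private val inferenceDispatcher = Executors.newSingleThreadExecutor().asCoroutineDispatcher()
    private var frameCount = 0

    override fun analyze(image: ImageProxy) {
        // Throttle: run inference on every third frame, drop the rest immediately.
        if (frameCount++ % 3 != 0) {
            image.close()
            return
        }
        val bitmap = image.toBitmap() // CameraX 1.3+; convert YUV manually on older versions
        image.close()
        scope.launch(inferenceDispatcher) {
            onResult(classifier.classify(bitmap))
        }
    }
}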
Option B: Text classification (routing, moderation, intent)
For small-to-medium classifiers, a quantized BERT-like or CNN/RNN works well:
import android.content.Context
import org.tensorflow.lite.task.text.nlclassifier.BertNLClassifier

class LocalTextClassifier(context: Context) {
    private val model = BertNLClassifier.createFromFile(
        context, "models/bert_sentiment_int8.tflite"
    )

    fun classify(text: String): List<Pair<String, Float>> =
        model.classify(text).map { it.label to it.score }
}
Pro tips:
- Normalize input (lowercase, trim, collapse whitespace) outside the hot path; a one-liner follows this list.
- Keep sequence length small (e.g., 128 tokens). It's the quickest lever for speed.
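For the normalization step, something this small is enough (the function name is illustrative):

fun normalizeForClassifier(raw: String): String =
    raw.trim().lowercase().replace(Regex("\\s+"), " ")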
Option C: MediaPipe for perception + AR
MediaPipe Tasks give you production-ready detectors/segmenters with robust tracking, perfect for powering AR overlays.
import android.content.Context
import com.google.mediapipe.framework.image.MPImage
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.core.Delegate
import com.google.mediapipe.tasks.vision.core.RunningMode
import com.google.mediapipe.tasks.vision.objectdetector.ObjectDetector
import com.google.mediapipe.tasks.vision.objectdetector.ObjectDetectorResult

class LocalObjectDetector(context: Context) {
    private val detector: ObjectDetector

    init {
        val base = BaseOptions.builder()
            .setModelAssetPath("models/efficientdet_lite_int8.tflite")
            .setDelegate(Delegate.CPU) // or Delegate.GPU
            .build()
        val options = ObjectDetector.ObjectDetectorOptions.builder()
            .setBaseOptions(base)
            .setRunningMode(RunningMode.IMAGE) // or LIVE_STREAM for camera
            .setMaxResults(5)
            .setScoreThreshold(0.5f)
            .build()
        detector = ObjectDetector.createFromOptions(context, options)
    }

    // Wrap a Bitmap with BitmapImageBuilder(bitmap).build() to get an MPImage.
    fun detect(mpImage: MPImage): ObjectDetectorResult = detector.detect(mpImage)
}
Pair it with ARCore (anchors, hit tests) for realistic placement/occlusion while MediaPipe provides the perception signals.
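For live camera feeds, the LIVE_STREAM running mode delivers results via a listener instead of a return value. A rough sketch, reusing `base` from the snippet above (the listener wiring is from memory, and renderOverlay is a hypothetical hook; verify against current MediaPipe docs):

val liveOptions = ObjectDetector.ObjectDetectorOptions.builder()
    .setBaseOptions(base)
    .setRunningMode(RunningMode.LIVE_STREAM)
    .setResultListener { result, _ ->
        // Called asynchronously on MediaPipe's thread; forward detections to the AR layer.
        renderOverlay(result) // hypothetical rendering hook
    }
    .build()

// Then per camera frame:
// detector.detectAsync(mpImage, frameTimestampMs)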
Make models small & fast with selective quantization
You can usually cut model size/latency by 2–4x via quantization, without destroying accuracy, if you're intentional.
Quick glossary:
- Dynamic range (weights-only) int8: fast, easiest; good first step.
- Full integer (int8 activations + weights): best for CPU/NNAPI; needs calibration data.
- Float16: great for GPU; tiny accuracy drop; still uses float math.
- Per-channel quantization: better accuracy for convs; turn it on if available.
Post-training quantization (Python, during build)
Dynamic range (no dataset required):
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
open("model_dr_int8.tflite", "wb").write(tflite_model)
Full integer (representative dataset for calibration):
def rep_data():
    for batch in calibration_ds.take(100):
        yield [batch]  # match your model's input signature

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8,  # prefer int8
    tf.lite.OpsSet.TFLITE_BUILTINS,       # allow float fallback if needed
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
open("model_full_int8.tflite", "wb").write(tflite_model)
Selective quantization (the pragmatic way): allow both INT8 and FLOAT ops so layers that quantize poorly (e.g., embeddings, the final softmax) stay float while the rest quantizes. That's exactly what the supported_ops list above accomplishes: INT8 when possible, transparent float fallback when accuracy demands it. If accuracy still dips, try float16:
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # required for float16 quantization
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
open("model_fp16.tflite", "wb").write(tflite_model)
Opinion: Ship INT8 on CPU/NNAPI, FP16 for GPU paths. Keep a float model around for A/B checks.
If you need low-level control (raw LiteRT Interpreter)
You won't usually need this with Task APIs, but it's handy for custom ops:
import android.content.Context
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.FileInputStream
import java.nio.ByteBuffer
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Requires the asset to be uncompressed (see the noCompress config above).
private fun loadModel(context: Context, path: String): MappedByteBuffer =
    context.assets.openFd(path).use { afd ->
        FileInputStream(afd.fileDescriptor).channel.use { fc ->
            fc.map(FileChannel.MapMode.READ_ONLY, afd.startOffset, afd.declaredLength)
        }
    }

class RawLiteRT(context: Context) {
    private val delegate = NnApiDelegate()
    private val interpreter = Interpreter(
        loadModel(context, "models/custom_int8.tflite"),
        Interpreter.Options().apply {
            setNumThreads(4)
            addDelegate(delegate)
        }
    )

    fun run(input: ByteBuffer, output: ByteBuffer) {
        interpreter.run(input, output) // Match the model's tensor shapes/dtypes
    }

    fun close() {
        interpreter.close()
        delegate.close()
    }
}
Gotchas: allocate direct ByteBuffers, reuse them, and never allocate in your 60fps loop.
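For instance, assuming an int8 224×224 RGB input and a 1001-class head (shapes here are illustrative; query interpreter.getInputTensor(0) for the real ones):

import java.nio.ByteBuffer
import java.nio.ByteOrder

// Allocate once, outside the frame loop; reuse on every inference.
val input: ByteBuffer = ByteBuffer
    .allocateDirect(1 * 224 * 224 * 3) // batch * height * width * channels, 1 byte per int8 value
    .order(ByteOrder.nativeOrder())
val output: ByteBuffer = ByteBuffer
    .allocateDirect(1 * 1001) // e.g., 1001 class scores for an int8 MobileNet head
    .order(ByteOrder.nativeOrder())

// Per frame: input.rewind(); output.rewind(); fill input; rawModel.run(input, output)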
Performance checklist (these move the needle)
- Warm up the model at app start or feature entry (run one dummy inference).
- Pin a background thread (e.g., Dispatchers.Default.limitedParallelism(1)) for deterministic latency.
- Downscale inputs to the model's native resolution; avoid runtime resizing to arbitrary sizes.
- Batch wisely (often batch=1 is fastest on mobile).
- Switch delegates by device (simple heuristic: Pixels/new flagships → NNAPI; others → GPU; fallback CPU); a sketch follows this list.
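A rough version of that heuristic plus the warm-up step from the first item (the device checks are illustrative, not a vetted allowlist; benchmark on your real device mix):

import android.graphics.Bitmap
import android.os.Build

enum class DelegateChoice { NNAPI, GPU, CPU }

// Illustrative heuristic only.
fun chooseDelegate(): DelegateChoice = when {
    Build.MANUFACTURER.equals("Google", ignoreCase = true) &&
        Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q -> DelegateChoice.NNAPI
    Build.VERSION.SDK_INT >= Build.VERSION_CODES.O -> DelegateChoice.GPU
    else -> DelegateChoice.CPU
}

// Warm-up: one dummy inference so the first real call doesn't pay delegate-init cost.
fun warmUp(classifier: LocalImageClassifier) {
    val dummy = Bitmap.createBitmap(224, 224, Bitmap.Config.ARGB_8888)
    classifier.classify(dummy)
}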
Shipping & size management
- For large models, consider Play Asset Delivery or on-first-run download (signed, checksummed).
- Gate features by Device Capability (RAM, NNAPI support) instead of OS version alone.
- Store the model version in preferences; expose a hidden debug screen to inspect model/engine/delegate at runtime.
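For the version bookkeeping, plain SharedPreferences is enough (key and value names here are illustrative):

import android.content.Context

fun recordModelMetadata(context: Context) {
    context.getSharedPreferences("ml_meta", Context.MODE_PRIVATE)
        .edit()
        .putString("model_version", "mobilenet_v3_int8_2025_08") // hypothetical version tag
        .putString("active_delegate", "NNAPI")
        .apply()
}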
Testing & monitoring
- Add a micro-benchmark around inference (log P50/P95); see the sketch after this list.
- Compare INT8 vs FP16 vs FLOAT accuracy on a validation slice; don't guess.
- Use androidx.tracing to mark preprocessing → inference → postprocessing spans.
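A minimal percentile logger for that micro-benchmark (the class name, window size, and nearest-rank percentile are illustrative):

class InferenceBenchmark(private val capacity: Int = 500) {
    private val samplesMs = ArrayDeque<Double>(capacity)

    fun <T> measure(block: () -> T): T {
        val start = System.nanoTime()
        val result = block()
        val elapsedMs = (System.nanoTime() - start) / 1_000_000.0
        if (samplesMs.size == capacity) samplesMs.removeFirst() // sliding window
        samplesMs.addLast(elapsedMs)
        return result
    }

    fun percentile(p: Double): Double {
        val sorted = samplesMs.sorted()
        if (sorted.isEmpty()) return 0.0
        val idx = ((p / 100.0) * (sorted.size - 1)).toInt() // nearest-rank approximation
        return sorted[idx]
    }
}

// Usage: val labels = benchmark.measure { classifier.classify(bitmap) }
// Log.d("ML", "P50=${benchmark.percentile(50.0)}ms P95=${benchmark.percentile(95.0)}ms")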
Opinionated TL;DR
Start with LiteRT Task (or MediaPipe Tasks for vision/AR), quantize aggressively but allow float fallback (that's "selective quantization"), memory-map your .tflite, and pick delegates per device. You'll get private, fast, offline features that feel magic, and you'll stop burning cloud budget on requests your users' phones can handle just fine.