Local AI on Android - Do More On-Device with LiteRT & MediaPipe

Running ML on-device isn’t just a cool demo anymore — it’s how you ship private, fast, resilient features without praying to the network gods. Modern Android hardware is plenty capable, and the tooling around LiteRT (formerly TensorFlow Lite) and MediaPipe has matured to the point where you can add image recognition, text classification, and even AR pipelines with surprisingly little code. Here’s the practical, opinionated guide I wish more apps followed.


What “local ML” means (and where it shines)

Local ML = your model runs entirely on the device CPU/GPU/NPU via LiteRT or MediaPipe Tasks. No server call, no streaming tensors over the wire. Sweet spots:

- Camera and vision features (classification, detection, segmentation)
- Text classification for routing, moderation, and smart replies
- AR perception pipelines that need camera-rate inference
- Anything that must keep working offline or keep data on the device

If you’ve got latency-sensitive UX (camera previews, autocomplete, smart replies), on-device is almost always the right default.


Why now? Devices finally caught up

Mid-range phones ship with efficient NPUs/GPUs and more RAM; Android’s NNAPI has solid coverage; LiteRT/MediaPipe add high-level Task APIs that hide the scary parts. You don’t need a research team to get excellent results. In 2025, shipping on-device is the mature, modern path — not the experimental one.


The Big Wins

- Privacy: inputs are processed locally and never leave the device
- Latency: no network round trip, so camera-rate inference is feasible
- Resilience: features keep working offline or on flaky connections
- Cost: inference on the user's phone is cloud budget you don't spend


Project setup (do this first)

Add dependencies. I recommend starting with the Task libraries (they wrap preprocessing/postprocessing):

dependencies {
  // Core LiteRT runtime
  implementation("org.tensorflow:tensorflow-lite:<latest>")

  // Task libraries (high-level APIs)
  implementation("org.tensorflow:tensorflow-lite-task-vision:<latest>")
  implementation("org.tensorflow:tensorflow-lite-task-text:<latest>")

  // Or MediaPipe Tasks alternatives (great for vision/AR pipelines)
  implementation("com.google.mediapipe:tasks-vision:<latest>")
}

Bundle your model(s) in app/src/main/assets/models/…. Enable memory-mapping and avoid compression so LiteRT can map the file directly:

android {
  aaptOptions { noCompress "tflite" } // older AGP syntax
  // or on newer AGP:
  // androidResources { noCompress += "tflite" }
}

Opinion: Always memory-map models. It slashes startup time and RAM spikes.


Option A — Vision in minutes with LiteRT Task APIs

Goal: classify a camera frame with a MobileNet-style classifier.

import android.content.Context
import android.graphics.Bitmap
import org.tensorflow.lite.support.image.TensorImage
import org.tensorflow.lite.task.core.BaseOptions
import org.tensorflow.lite.task.vision.classifier.ImageClassifier
import org.tensorflow.lite.task.vision.classifier.ImageClassifier.ImageClassifierOptions

class LocalImageClassifier(private val context: Context) {

  private val classifier: ImageClassifier by lazy {
    val base = BaseOptions.builder()
      .setNumThreads(4)
      .useNnapi() // or .useGpu() depending on device mix
      .build()

    val options = ImageClassifierOptions.builder()
      .setBaseOptions(base)
      .setMaxResults(3)
      .setScoreThreshold(0.5f)
      .build()

    ImageClassifier.createFromFileAndOptions(
      context,
      "models/mobilenet_v3.tflite",
      options
    )
  }

  fun classify(bitmap: Bitmap): List<Pair<String, Float>> {
    val image = TensorImage.fromBitmap(bitmap)
    val results = classifier.classify(image)
    // First head, top categories
    return results.firstOrNull()?.categories?.map {
      it.label to it.score
    } ?: emptyList()
  }
}
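
To see it at the call site: a minimal sketch, assuming `context` and `bitmap` come from your Activity and camera pipeline (the `logTopLabels` helper is hypothetical):

import android.content.Context
import android.graphics.Bitmap
import android.util.Log

// Hypothetical call site: classify one frame and log the top labels.
fun logTopLabels(context: Context, bitmap: Bitmap) {
  val classifier = LocalImageClassifier(context) // in real code, create once and reuse
  classifier.classify(bitmap).forEach { (label, score) ->
    Log.d("LocalML", "$label: ${"%.2f".format(score)}")
  }
}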

Tips that pay off:

- Match the delegate to your device mix: NNAPI for broad coverage, GPU only where you've verified support
- Tune setNumThreads per device rather than hardcoding one magic number
- Set setScoreThreshold and setMaxResults so you skip postprocessing work you'd just throw away
- Create the classifier once (the lazy block above) and reuse it across frames


Option B — Text classification (routing, moderation, intent)

For small-to-medium classifiers, a quantized BERT-style model or a compact CNN/RNN works well:

import android.content.Context
import org.tensorflow.lite.task.text.nlclassifier.BertNLClassifier

class LocalTextClassifier(context: Context) {
  private val model = BertNLClassifier.createFromFile(
    context, "models/bert_sentiment_int8.tflite"
  )

  fun classify(text: String): List<Pair<String, Float>> =
    model.classify(text).map { it.label to it.score }
}
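
A quick routing sketch (the `routeMessage` helper is hypothetical; in production you'd create the classifier once, not per call):

import android.content.Context

// Return the highest-scoring label, e.g. an intent name for routing.
fun routeMessage(context: Context, text: String): String? =
  LocalTextClassifier(context).classify(text)
    .maxByOrNull { it.second }
    ?.first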

Pro tips:

- Ship the int8 variant; quantized text models are dramatically smaller with little accuracy loss (see the quantization section below)
- Create the classifier once and reuse it; loading the model is the expensive part
- Keep a float model around to sanity-check the quantized one's outputs


Option C — MediaPipe for perception + AR

MediaPipe Tasks give you production-ready detectors/segmenters with robust tracking—perfect for powering AR overlays.

import android.content.Context
import com.google.mediapipe.framework.image.MPImage
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.core.Delegate
import com.google.mediapipe.tasks.vision.core.RunningMode
import com.google.mediapipe.tasks.vision.objectdetector.ObjectDetector
import com.google.mediapipe.tasks.vision.objectdetector.ObjectDetectorResult

class LocalObjectDetector(context: Context) {

  private val detector: ObjectDetector

  init {
    val base = BaseOptions.builder()
      .setModelAssetPath("models/efficientdet_lite_int8.tflite")
      .setDelegate(Delegate.CPU) // or GPU
      .build()

    val options = ObjectDetector.ObjectDetectorOptions.builder()
      .setBaseOptions(base)
      .setRunningMode(RunningMode.IMAGE) // or LIVE_STREAM for camera
      .setMaxResults(5)
      .setScoreThreshold(0.5f)
      .build()

    detector = ObjectDetector.createFromOptions(context, options)
  }

  fun detect(mpImage: MPImage): ObjectDetectorResult {
    return detector.detect(mpImage)
  }
}
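
Note that the detector consumes MediaPipe's MPImage rather than a raw Bitmap. A minimal conversion sketch using BitmapImageBuilder (the overlay drawing is left to your renderer):

import android.graphics.Bitmap
import com.google.mediapipe.framework.image.BitmapImageBuilder

fun detectOnBitmap(detector: LocalObjectDetector, bitmap: Bitmap) {
  val mpImage = BitmapImageBuilder(bitmap).build() // wraps the Bitmap as an MPImage
  val result = detector.detect(mpImage)
  result.detections().forEach { detection ->
    val box = detection.boundingBox() // RectF in image coordinates
    // feed box + detection.categories() into your overlay / AR renderer
  }
}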

Pair it with ARCore (anchors, hit tests) for realistic placement/occlusion while MediaPipe provides the perception signals.


Make models small & fast with selective quantization

You can usually cut model size/latency by 2–4x via quantization — without destroying accuracy — if you’re intentional.

Quick glossary:

- Dynamic range quantization: weights go to int8, activations stay float; no calibration data needed
- Full integer quantization: weights and activations go to int8; needs a representative dataset for calibration
- Float16 quantization: halves model size; pairs well with GPU delegates
- Selective quantization: int8 where it holds up, float fallback where accuracy demands it

Post-training quantization (Python, during build)

Dynamic range (no dataset required):

import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
open("model_dr_int8.tflite", "wb").write(tflite_model)

Full integer (representative dataset for calibration):

def rep_data():
  for batch in calibration_ds.take(100):
    yield [batch]  # match your model’s input signature

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data
converter.target_spec.supported_ops = [
  tf.lite.OpsSet.TFLITE_BUILTINS_INT8,  # prefer int8
  tf.lite.OpsSet.TFLITE_BUILTINS        # allow float fallback if needed
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
open("model_full_int8.tflite", "wb").write(tflite_model)

Selective quantization (the pragmatic way): allow both INT8 and FLOAT ops so layers that quantize poorly (e.g., embeddings, the final softmax) stay float while the rest quantizes. That’s exactly what the supported_ops list above accomplishes—INT8 when possible, transparent float fallback when accuracy demands it. If accuracy still dips, try float16:

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # required for fp16 quantization
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
open("model_fp16.tflite", "wb").write(tflite_model)

Opinion: Ship INT8 on CPU/NNAPI, FP16 for GPU paths. Keep a float model around for A/B checks.
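
What that can look like at startup: a sketch, assuming you bundle both variants (file names are hypothetical) and depend on the tensorflow-lite-gpu artifact for CompatibilityList:

import org.tensorflow.lite.gpu.CompatibilityList
import org.tensorflow.lite.task.core.BaseOptions

// Hypothetical helper: prefer the fp16 model on GPU-capable devices, int8 + NNAPI elsewhere.
fun pickModelAndOptions(): Pair<String, BaseOptions> {
  val gpuSupported = CompatibilityList().isDelegateSupportedOnThisDevice
  return if (gpuSupported) {
    "models/classifier_fp16.tflite" to BaseOptions.builder().useGpu().build()
  } else {
    "models/classifier_int8.tflite" to BaseOptions.builder().useNnapi().build()
  }
}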


If you need low-level control (raw LiteRT Interpreter)

You won’t usually need this with Task APIs, but it’s handy for custom ops:

import android.content.Context
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.FileInputStream
import java.nio.ByteBuffer
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

private fun loadModel(context: Context, path: String): MappedByteBuffer =
  // Memory-map the model straight from assets (requires the noCompress setup above)
  context.assets.openFd(path).use { afd ->
    FileInputStream(afd.fileDescriptor).channel.use { fc ->
      fc.map(FileChannel.MapMode.READ_ONLY, afd.startOffset, afd.declaredLength)
    }
  }

class RawLiteRT(context: Context) {
  private val delegate = NnApiDelegate()
  private val interpreter = Interpreter(
    loadModel(context, "models/custom_int8.tflite"),
    Interpreter.Options().apply {
      setNumThreads(4)
      addDelegate(delegate)
    }
  )

  fun run(input: ByteBuffer, output: ByteBuffer) {
    interpreter.run(input, output) // Match the model’s tensor shapes/dtypes
  }

  fun close() {
    interpreter.close()
    delegate.close()
  }
}

Gotchas: allocate direct ByteBuffers, reuse them, and never allocate in your 60fps loop.
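
Concretely, the allocate-once pattern looks like this (the 224x224x3 float input and 1000-class output are assumptions; match your model's actual tensor shapes):

import java.nio.ByteBuffer
import java.nio.ByteOrder

// Allocate direct buffers once, outside the frame loop.
val input: ByteBuffer = ByteBuffer
  .allocateDirect(1 * 224 * 224 * 3 * 4) // element count * 4 bytes per float32
  .order(ByteOrder.nativeOrder())
val output: ByteBuffer = ByteBuffer
  .allocateDirect(1 * 1000 * 4)
  .order(ByteOrder.nativeOrder())

fun onFrame(model: RawLiteRT) {
  input.rewind()
  // ... write this frame's preprocessed pixels into `input` ...
  output.rewind()
  model.run(input, output)
}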


Performance checklist (these move the needle)

- Memory-map models (assets + noCompress); it cuts startup time and RAM spikes
- Quantize: int8 for CPU/NNAPI paths, fp16 for GPU paths
- Benchmark delegates (CPU vs NNAPI vs GPU) on your real device mix; the winner varies
- Allocate direct ByteBuffers once and reuse them; never allocate inside the frame loop
- Tune thread counts per device instead of hardcoding one value


Shipping & size management

- Bundle models under app/src/main/assets/ with noCompress so they memory-map cleanly
- Quantization typically cuts size 2-4x; ship the smallest variant that holds accuracy
- If you bundle multiple variants (int8 and fp16), pick the right one at runtime per device


Testing & monitoring

- Keep a float reference model and A/B its outputs against the quantized build
- Test on low-end and mid-range hardware, not just flagships; delegate behavior differs
- Measure real on-device inference latency and treat it like any other performance metric you track


Opinionated TL;DR

Start with LiteRT Task (or MediaPipe Tasks for vision/AR), quantize aggressively but allow float fallback (that’s “selective quantization”), memory-map your .tflite, and pick delegates per device. You’ll get private, fast, offline features that feel magic — and you’ll stop burning cloud budget on requests your users’ phones can handle just fine.