Local AI on Android - Do More On-Device with LiteRT & MediaPipe
28 Aug 2025

Running ML on-device isn't just a cool demo anymore: it's how you ship private, fast, resilient features without praying to the network gods. Modern Android hardware is plenty capable, and the tooling around LiteRT (formerly TensorFlow Lite) and MediaPipe has matured to the point where you can add image recognition, text classification, and even AR pipelines with surprisingly little code. Here's the practical, opinionated guide I wish more apps followed.
What "local ML" means (and where it shines)
Local ML = your model runs entirely on the device CPU/GPU/NPUs via LiteRT or MediaPipe Tasks. No server call, no streaming tensors over the wire. Sweet spots:
- Image recognition/segmentation (e.g., "Is this a receipt?" "Where's the product label?")
- Text classification (moderation, topic/routing, sentiment, intent)
- AR/real-time perception (object detection, face/pose landmarks feeding AR overlays)
If you've got latency-sensitive UX (camera previews, autocomplete, smart replies), on-device is almost always the right default.
Why now? Devices finally caught up
Mid-range phones ship with efficient NPUs/GPUs and more RAM; Android's NNAPI has solid coverage; LiteRT/MediaPipe add high-level Task APIs that hide the scary parts. You don't need a research team to get excellent results. In 2025, shipping on-device is the mature, modern path, not the experimental one.
The Big Wins
- Privacy: user data never leaves the phone by default.
- Speed: <100ms inference is normal for medium models; camera loops feel instant.
- Offline: works on trains, planes, basements, and spotty 3G.
Project setup (do this first)
Add dependencies. I recommend starting with the Task libraries (they wrap preprocessing/postprocessing):
dependencies {
    // Core LiteRT runtime
    implementation("org.tensorflow:tensorflow-lite:<latest>")
    // Task libraries (high-level APIs)
    implementation("org.tensorflow:tensorflow-lite-task-vision:<latest>")
    implementation("org.tensorflow:tensorflow-lite-task-text:<latest>")
    // Or MediaPipe Tasks alternatives (great for vision/AR pipelines)
    implementation("com.google.mediapipe:tasks-vision:<latest>")
}
Bundle your model(s) in app/src/main/assets/models/…. Enable memory-mapping and avoid compression so LiteRT can map the file directly (note: noCompress takes the file extension, .tflite):
android {
    aaptOptions { noCompress "tflite" } // older AGP syntax
    // or on newer AGP:
    // androidResources { noCompress += "tflite" }
}
Opinion: Always memory-map models. It slashes startup time and RAM spikes.
Option A: Vision in minutes with LiteRT Task APIs
Goal: classify a camera frame with a MobileNet-style classifier.
import android.content.Context
import android.graphics.Bitmap
import org.tensorflow.lite.support.image.TensorImage
import org.tensorflow.lite.task.core.BaseOptions
import org.tensorflow.lite.task.vision.classifier.ImageClassifier
import org.tensorflow.lite.task.vision.classifier.ImageClassifier.ImageClassifierOptions

class LocalImageClassifier(private val context: Context) {
    private val classifier: ImageClassifier by lazy {
        val base = BaseOptions.builder()
            .setNumThreads(4)
            .useNnapi() // or .useGpu() depending on device mix
            .build()
        val options = ImageClassifierOptions.builder()
            .setBaseOptions(base)
            .setMaxResults(3)
            .setScoreThreshold(0.5f)
            .build()
        ImageClassifier.createFromFileAndOptions(
            context,
            "models/mobilenet_v3.tflite",
            options
        )
    }

    fun classify(bitmap: Bitmap): List<Pair<String, Float>> {
        val image = TensorImage.fromBitmap(bitmap)
        val results = classifier.classify(image)
        // First head, top categories
        return results.firstOrNull()?.categories?.map {
            it.label to it.score
        } ?: emptyList()
    }
}
Tips that pay off:
- Reuse the classifier instance (don't re-create per frame).
- On camera: run on a background dispatcher and throttle (e.g., every 2–3 frames); see the sketch after this list.
- Device strategy: NNAPI for Pixels/newer flagships, GPU for mid-range; test both.
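Here is a minimal sketch of that camera-loop pattern, assuming CameraX's ImageAnalysis; ThrottledAnalyzer and the every-third-frame rule are illustrative, not a library API:

import androidx.camera.core.ImageAnalysis
import androidx.camera.core.ImageProxy
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.launch
import java.util.concurrent.Executors

class ThrottledAnalyzer(
    private val classifier: LocalImageClassifier,
    private val scope: CoroutineScope,
    private val onResult: (List<Pair<String, Float>>) -> Unit
) : ImageAnalysis.Analyzer {
    // Single-threaded dispatcher keeps inference latency deterministic.
    private val inferenceDispatcher = Executors.newSingleThreadExecutor().asCoroutineDispatcher()
    private var frameCount = 0

    override fun analyze(image: ImageProxy) {
        // Throttle: run inference on every third frame, drop the rest immediately.
        if (frameCount++ % 3 != 0) {
            image.close()
            return
        }
        val bitmap = image.toBitmap() // CameraX 1.3+; convert YUV manually on older versions
        image.close()
        scope.launch(inferenceDispatcher) {
            onResult(classifier.classify(bitmap))
        }
    }
}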
Option B: Text classification (routing, moderation, intent)
For small-to-medium classifiers, a quantized BERT-like or CNN/RNN works well:
import android.content.Context
import org.tensorflow.lite.task.text.nlclassifier.BertNLClassifier

class LocalTextClassifier(context: Context) {
    private val model = BertNLClassifier.createFromFile(
        context, "models/bert_sentiment_int8.tflite"
    )

    fun classify(text: String): List<Pair<String, Float>> =
        model.classify(text).map { it.label to it.score }
}
Pro tips:
- Normalize input (lowercase, trim, collapse whitespace) outside the hot path; a one-liner follows this list.
- Keep sequence length small (e.g., 128 tokens). It's the quickest lever for speed.
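For the normalization step, something this small is enough (the function name is illustrative):

fun normalizeForClassifier(raw: String): String =
    raw.trim().lowercase().replace(Regex("\\s+"), " ")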
Option C: MediaPipe for perception + AR
MediaPipe Tasks give you production-ready detectors/segmenters with robust tracking, perfect for powering AR overlays.
import android.content.Context
import com.google.mediapipe.framework.image.MPImage
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.core.Delegate
import com.google.mediapipe.tasks.vision.core.RunningMode
import com.google.mediapipe.tasks.vision.objectdetector.ObjectDetector
import com.google.mediapipe.tasks.vision.objectdetector.ObjectDetectorResult

class LocalObjectDetector(context: Context) {
    private val detector: ObjectDetector

    init {
        val base = BaseOptions.builder()
            .setModelAssetPath("models/efficientdet_lite_int8.tflite")
            .setDelegate(Delegate.CPU) // or Delegate.GPU
            .build()
        val options = ObjectDetector.ObjectDetectorOptions.builder()
            .setBaseOptions(base)
            .setRunningMode(RunningMode.IMAGE) // or LIVE_STREAM for camera
            .setMaxResults(5)
            .setScoreThreshold(0.5f)
            .build()
        detector = ObjectDetector.createFromOptions(context, options)
    }

    // Wrap a Bitmap with BitmapImageBuilder(bitmap).build() to get an MPImage.
    fun detect(mpImage: MPImage): ObjectDetectorResult = detector.detect(mpImage)
}
Pair it with ARCore (anchors, hit tests) for realistic placement/occlusion while MediaPipe provides the perception signals.
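For live camera feeds, the LIVE_STREAM running mode delivers results via a listener instead of a return value. A rough sketch, reusing `base` from the snippet above (the listener wiring is from memory, and renderOverlay is a hypothetical hook; verify against current MediaPipe docs):

val liveOptions = ObjectDetector.ObjectDetectorOptions.builder()
    .setBaseOptions(base)
    .setRunningMode(RunningMode.LIVE_STREAM)
    .setResultListener { result, _ ->
        // Called asynchronously on MediaPipe's thread; forward detections to the AR layer.
        renderOverlay(result) // hypothetical rendering hook
    }
    .build()

// Then per camera frame:
// detector.detectAsync(mpImage, frameTimestampMs)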
Make models small & fast with selective quantization
You can usually cut model size/latency by 2–4x via quantization, without destroying accuracy, if you're intentional.
Quick glossary:
- Dynamic range (weights-only) int8: fast, easiest; good first step.
- Full integer (int8 activations + weights): best for CPU/NNAPI; needs calibration data.
- Float16: great for GPU; tiny accuracy drop; still uses float math.
- Per-channel quantization: better accuracy for convs; turn it on if available.
Post-training quantization (Python, during build)
Dynamic range (no dataset required):
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
open("model_dr_int8.tflite", "wb").write(tflite_model)
Full integer (representative dataset for calibration):
def rep_data():
    for batch in calibration_ds.take(100):
        yield [batch]  # match your model's input signature

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8,  # prefer int8
    tf.lite.OpsSet.TFLITE_BUILTINS,       # allow float fallback if needed
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
open("model_full_int8.tflite", "wb").write(tflite_model)
Selective quantization (the pragmatic way): allow both INT8 and FLOAT ops so layers that quantize poorly (e.g., embeddings, the final softmax) stay float while the rest quantizes. That's exactly what the supported_ops list above accomplishes: INT8 when possible, transparent float fallback when accuracy demands it. If accuracy still dips, try float16:
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # required for float16 quantization
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
open("model_fp16.tflite", "wb").write(tflite_model)
Opinion: Ship INT8 on CPU/NNAPI, FP16 for GPU paths. Keep a float model around for A/B checks.
If you need low-level control (raw LiteRT Interpreter)
You won't usually need this with Task APIs, but it's handy for custom ops:
import android.content.Context
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.FileInputStream
import java.nio.ByteBuffer
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Requires the asset to be uncompressed (see the noCompress config above).
private fun loadModel(context: Context, path: String): MappedByteBuffer =
    context.assets.openFd(path).use { afd ->
        FileInputStream(afd.fileDescriptor).channel.use { fc ->
            fc.map(FileChannel.MapMode.READ_ONLY, afd.startOffset, afd.declaredLength)
        }
    }

class RawLiteRT(context: Context) {
    private val delegate = NnApiDelegate()
    private val interpreter = Interpreter(
        loadModel(context, "models/custom_int8.tflite"),
        Interpreter.Options().apply {
            setNumThreads(4)
            addDelegate(delegate)
        }
    )

    fun run(input: ByteBuffer, output: ByteBuffer) {
        interpreter.run(input, output) // Match the model's tensor shapes/dtypes
    }

    fun close() {
        interpreter.close()
        delegate.close()
    }
}
Gotchas: allocate direct ByteBuffers, reuse them, and never allocate in your 60fps loop.
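For instance, assuming an int8 224×224 RGB input and a 1001-class head (shapes here are illustrative; query interpreter.getInputTensor(0) for the real ones):

import java.nio.ByteBuffer
import java.nio.ByteOrder

// Allocate once, outside the frame loop; reuse on every inference.
val input: ByteBuffer = ByteBuffer
    .allocateDirect(1 * 224 * 224 * 3) // batch * height * width * channels, 1 byte per int8 value
    .order(ByteOrder.nativeOrder())
val output: ByteBuffer = ByteBuffer
    .allocateDirect(1 * 1001) // e.g., 1001 class scores for an int8 MobileNet head
    .order(ByteOrder.nativeOrder())

// Per frame: input.rewind(); output.rewind(); fill input; rawModel.run(input, output)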
Performance checklist (these move the needle)
- Warm up the model at app start or feature entry (run one dummy inference).
- Pin a background thread (e.g., Dispatchers.Default.limitedParallelism(1)) for deterministic latency.
- Downscale inputs to the model's native resolution; avoid runtime resizing to arbitrary sizes.
- Batch wisely (often batch=1 is fastest on mobile).
- Switch delegates by device (simple heuristic: Pixels/new flagships → NNAPI; others → GPU; fallback CPU); a sketch follows this list.
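A rough version of that heuristic plus the warm-up step from the first item (the device checks are illustrative, not a vetted allowlist; benchmark on your real device mix):

import android.graphics.Bitmap
import android.os.Build

enum class DelegateChoice { NNAPI, GPU, CPU }

// Illustrative heuristic only.
fun chooseDelegate(): DelegateChoice = when {
    Build.MANUFACTURER.equals("Google", ignoreCase = true) &&
        Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q -> DelegateChoice.NNAPI
    Build.VERSION.SDK_INT >= Build.VERSION_CODES.O -> DelegateChoice.GPU
    else -> DelegateChoice.CPU
}

// Warm-up: one dummy inference so the first real call doesn't pay delegate-init cost.
fun warmUp(classifier: LocalImageClassifier) {
    val dummy = Bitmap.createBitmap(224, 224, Bitmap.Config.ARGB_8888)
    classifier.classify(dummy)
}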
Shipping & size management
- For large models, consider Play Asset Delivery or on-first-run download (signed, checksummed).
- Gate features by Device Capability (RAM, NNAPI support) instead of OS version alone.
- Store the model version in preferences; expose a hidden debug screen to inspect model/engine/delegate at runtime.
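For the version bookkeeping, plain SharedPreferences is enough (key and value names here are illustrative):

import android.content.Context

fun recordModelMetadata(context: Context) {
    context.getSharedPreferences("ml_meta", Context.MODE_PRIVATE)
        .edit()
        .putString("model_version", "mobilenet_v3_int8_2025_08") // hypothetical version tag
        .putString("active_delegate", "NNAPI")
        .apply()
}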
Testing & monitoring
- Add a micro-benchmark around inference (log P50/P95); see the sketch after this list.
- Compare INT8 vs FP16 vs FLOAT accuracy on a validation slice; don't guess.
- Use androidx.tracing to mark preprocessing → inference → postprocessing spans.
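A minimal percentile logger for that micro-benchmark (the class name, window size, and nearest-rank percentile are illustrative):

class InferenceBenchmark(private val capacity: Int = 500) {
    private val samplesMs = ArrayDeque<Double>(capacity)

    fun <T> measure(block: () -> T): T {
        val start = System.nanoTime()
        val result = block()
        val elapsedMs = (System.nanoTime() - start) / 1_000_000.0
        if (samplesMs.size == capacity) samplesMs.removeFirst() // sliding window
        samplesMs.addLast(elapsedMs)
        return result
    }

    fun percentile(p: Double): Double {
        val sorted = samplesMs.sorted()
        if (sorted.isEmpty()) return 0.0
        val idx = ((p / 100.0) * (sorted.size - 1)).toInt() // nearest-rank approximation
        return sorted[idx]
    }
}

// Usage: val labels = benchmark.measure { classifier.classify(bitmap) }
// Log.d("ML", "P50=${benchmark.percentile(50.0)}ms P95=${benchmark.percentile(95.0)}ms")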
Opinionated TL;DR
Start with LiteRT Task (or MediaPipe Tasks for vision/AR), quantize aggressively but allow float fallback (that's "selective quantization"), memory-map your .tflite, and pick delegates per device. You'll get private, fast, offline features that feel magic, and you'll stop burning cloud budget on requests your users' phones can handle just fine.