How to build an offline voice assistant on NVIDIA Jetson Orin
A builder's guide to running a fully offline voice agent on Jetson Orin — local STT, an SLM, and TTS, with real latency numbers and where the DIY Whisper + llama.cpp + Piper stack breaks down.
If you're putting a voice interface on a robot running NVIDIA Jetson Orin, you've probably already discovered the trap: the fastest way to a demo is to glue three cloud APIs together, and the fastest way to a product is to never do that. The moment your robot leaves good Wi-Fi — a warehouse aisle, a vehicle, a field — a cloud-dependent voice agent stops working exactly when it matters most.
This is a builder-to-builder guide to doing it the right way: a fully offline voice assistant on Jetson Orin, with nothing leaving the device. We'll walk the DIY path — whisper.cpp for speech-to-text, llama.cpp running a small language model, and Piper for text-to-speech — show real numbers, and be honest about where stitching those three pieces together starts to hurt.
The buying trigger here is latency. A voice agent lives or dies on time-to-first-token, and every architectural choice below is in service of that one number.
Why offline, and why Jetson Orin
Three forces push voice on-device for physical AI:
- Latency. A cloud round-trip for STT, then the LLM, then TTS is three serialized network hops. Even on a good connection that's often over a second before the robot starts talking — and humans read a pause that long as "it's broken."
- Privacy and sovereignty. Every utterance can contain faces, names, proprietary process data, or in-vehicle context. For automotive, defense, and healthcare buyers, streaming that to third-party servers is a compliance non-starter.
- Offline reliability. Robots, drones, and vehicles operate where connectivity is flaky or absent. The voice loop cannot depend on a network that isn't there.
The Jetson Orin Nano Super makes this practical at the low end: NVIDIA rates it at up to 67 INT8 TOPS with 8 GB of LPDDR5 and 102 GB/s of memory bandwidth, in a 25 W envelope, for $249. That's enough to run all three stages of a voice pipeline locally — if you budget your memory and latency carefully.
The DIY offline voice stack
Here's the architecture nearly everyone starts with once they commit to on-device:
mic → VAD → whisper.cpp (STT) → llama.cpp (SLM) → Piper (TTS) → speaker
local local local
No network in the path. Three separate engines, each a good piece of open source, wired together by your application code. Let's go stage by stage.
1. Speech-to-text: whisper.cpp with CUDA
whisper.cpp is a C/C++ port of OpenAI's Whisper with a CUDA backend, so it runs on the Orin GPU rather than pegging the CPU. Build it with CUDA enabled and pick a model sized to your latency budget — tiny.en or base.en for real-time, larger models only if you can tolerate the lag.
# On the Jetson, build whisper.cpp with CUDA
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release
# Grab a small English model
sh ./models/download-ggml-model.sh base.en
# Stream from the mic
./build/bin/whisper-stream -m models/ggml-base.en.bin -t 4 --step 500 --length 5000Developers have shown real-time STT working with whisper.cpp + CUDA on the Orin Nano Super (writeup here). If you need more headroom, NVIDIA's WhisperTRT optimizes Whisper with TensorRT and reports roughly 3x faster inference at about 60% of the memory of the PyTorch baseline for base.en on the Orin Nano. The tradeoff: TensorRT engines are hardware-and-version-specific, so you give up some portability.
The practical lesson from the field: on smaller Jetsons, the tiny/base models stream comfortably, but pushing to small or larger quickly blows your real-time budget. Choose the smallest model that's accurate enough for your vocabulary.
2. Reasoning: a small language model on llama.cpp
For the brain, you want a small language model (SLM), not a 70B. Llama 3.2 3B is a sensible default for the Orin Nano class — compact enough to fit in 8 GB alongside STT and TTS, capable enough for command parsing and short conversational turns. Run it through llama.cpp, quantized.
# Build llama.cpp with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release
# Run a quantized SLM as a local server
./build/bin/llama-server -m llama-3.2-3b-instruct-Q4_K_M.gguf \
--n-gpu-layers 99 --ctx-size 2048 --port 8080A few hard-won tuning notes for Jetson:
Q4_K_Mis the sweet spot for most Jetson deployments — a good accuracy/size tradeoff that keeps the model in memory.- Offload to the GPU (
--n-gpu-layers). On Jetson's unified memory, GPU offload is meaningfully faster than CPU-only. - Run
sudo jetson_clocksand set the power mode to max before benchmarking, or your tokens/sec will look artificially bad.
Expect on the order of single-digit-to-low-double-digit tokens per second for a 3B model on an Orin Nano, scaling up substantially on Orin NX and AGX Orin. That's fine for short voice turns, but it's why streaming matters — you want the first tokens flowing to TTS while the model is still generating.
3. Text-to-speech: Piper
Piper is a fast, local neural TTS from the Rhasspy project. It uses VITS-trained models exported to ONNX, runs entirely offline, supports 35+ languages, and was specifically optimized to run on low-power ARM hardware like the Raspberry Pi — so it's comfortable on a Jetson CPU, leaving the GPU for STT and the SLM.
echo "Navigation complete. Standing by." | \
./piper --model en_US-lessac-medium.onnx --output_file out.wavPiper synthesizes quickly enough to start audio playback shortly after the first sentence is ready, which is the behavior you need for natural turn-taking.
Where the DIY stack starts to hurt
Get all three running and you have a genuinely offline voice assistant. You'll also have a pile of integration problems that no single component owns:
- Latency from the seams. Three processes talking over pipes, sockets, or a localhost HTTP loop means serialization and copies on the critical path. Each hop is small; together they're the difference between "snappy" and "laggy."
- Barge-in is nobody's job. When a human talks over the robot, you need VAD, generation, and playback to coordinate so playback cancels and the mic reopens instantly. That control loop lives between the components — so you have to build and own it.
- Memory contention. STT, a 3B SLM, and TTS all competing for 8 GB means careful, manual budgeting. One model upgrade and you're OOM.
- Three release cadences. whisper.cpp, llama.cpp, and Piper all move independently. Every bump is your integration test to re-run.
- Python in the hot path. Most glue ends up in Python, which doesn't love deterministic real-time audio and doesn't drop cleanly into the C++ control stack robots already run.
None of these are reasons to use the cloud — they're reasons the stitched-together on-device stack isn't the finish line. We wrote about hitting exactly this wall in why we ripped cloud voice out of our robots.
The integrated alternative
The thesis behind EdgeAI is that a robot's voice agent shouldn't be three engines and a glue script — it should be one on-device C++ runtime built around the things that actually matter:
- End-to-end on-device — local STT, an SLM for reasoning, and local TTS, with nothing leaving the device and no network in the critical path.
- Streaming by design — partial transcripts feed the model and tokens feed TTS as they're produced, so the agent starts speaking while it's still thinking.
- Barge-in as a first-class operation — VAD, generation, and playback share one orchestration loop, so a human speaking instantly cancels playback and reopens the mic.
- C++ on the hot path — deterministic and low-overhead, running on a tuned inference engine built to drop into the control stack you already ship.
Same offline, same privacy — without owning the integration burden of three independent projects.
DIY stack vs integrated runtime
| DIY (whisper.cpp + llama.cpp + Piper) | Integrated C++ runtime | |
|---|---|---|
| Offline / private | Yes | Yes |
| Critical path | 3 processes + IPC | One runtime |
| Barge-in / interruption | You build it | Built in |
| Memory budgeting | Manual | Managed |
| Hot path language | Mostly Python glue | C++ |
| Maintenance | 3 release cadences | One stack |
| Time to first prototype | Fast | Fast |
| Time to a shippable product | Slow | Faster |
DIY is the right way to learn the problem and prove offline voice works on your hardware. An integrated runtime is how you ship it without a multi-month integration project.
Measure it on your own hardware
Whichever path you take, benchmark honestly on your target Jetson:
- Measure time-to-first-token (mic-stops to first-audio-out) across 50 utterances.
- Track end-to-end turn latency, barge-in stop latency, and peak memory.
- Run with
jetson_clockson and the correct power mode — and report the distribution, not just the average. Voice quality lives in the slow tail.
FAQ
Can a Jetson Orin Nano run STT, an LLM, and TTS at the same time?
Yes, within limits. The Orin Nano Super's 8 GB is enough for a small Whisper model, a quantized 3B SLM, and Piper TTS if you budget memory carefully — typically GPU for STT and the SLM, CPU for TTS. For larger models or more headroom, step up to Orin NX or AGX Orin.
Which Whisper model should I use for real-time voice on Jetson?
Start with tiny.en or base.en built with CUDA. They stream in real time on the Orin Nano class. Larger models improve accuracy but quickly exceed a real-time latency budget. NVIDIA's WhisperTRT can buy back headroom via TensorRT at the cost of portability.
What's the best small language model for an on-device voice agent?
For the Orin Nano class, a 3B-parameter instruct model such as Llama 3.2 3B at Q4_K_M quantization is a strong default — small enough to coexist with STT and TTS, capable enough for command parsing and short conversational turns. Larger Orin modules can run 7–8B models.
Is an offline voice assistant actually private?
If every stage runs locally and no audio or text leaves the device, then yes — there are no third-party servers in the loop and nothing to intercept in transit. That's the core reason to go on-device for automotive, defense, and healthcare use cases.
Why not just use cloud voice APIs?
Cloud wins the prototype and loses the product for physical AI: three network round-trips of latency, a hard dependency on connectivity, per-word costs that scale with your fleet, and user audio leaving the device. See our writeup on moving off cloud voice.
Ship it
A fully offline voice assistant on Jetson Orin is very achievable today — whisper.cpp, llama.cpp, and Piper prove the offline and privacy axes out of the box. The work that remains is integration: latency in the seams, barge-in, and memory budgeting across three independently-moving projects.
If you'd rather not own that integration project, that's exactly what we're building. Start with EdgeAI free, read the getting-started docs, or reach out — we'd love to compare numbers on your hardware.