Voice AI just got a lot more controllable. Hume has shipped configurable turn detection and interruption handling to its EVI (Empathic Voice Interface) API, and if you're building live voice agents, this is the release you've been waiting for. The update exposes explicit developer controls over how EVI decides when a user has finished speaking, when a barge-in is legitimate, and how much silence to tolerate before committing to a response. These aren't cosmetic settings. They directly determine whether your voice agent feels like a conversation or a bad phone tree.
Here's what shipped, why it matters operationally, and what your team should do with it this week.
What Actually Shipped
Hume's updated EVI turn detection documentation exposes three core configuration parameters:
| Parameter | Range | Default |
|---|---|---|
| turn_detection.end_of_turn_silence_ms | 500–3000 ms | Not specified |
| turn_detection.speech_detection_threshold | 0.0–1.0 | 0.5 |
| turn_detection.prefix_padding_ms | — | 300 ms |
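As a rough sketch of how a team might guard these values before sending them anywhere, the snippet below validates the three parameters against the ranges and defaults in the table. The class name `TurnDetectionConfig`, the `to_payload` helper, and the nested `turn_detection` payload shape are assumptions inferred from the dotted parameter names, not Hume's documented request format. The non-negative check on `prefix_padding_ms` is also an assumption, since no range is listed for it.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TurnDetectionConfig:
    """Hypothetical container; field names mirror the documented parameters."""
    end_of_turn_silence_ms: int               # no documented default, so required here
    speech_detection_threshold: float = 0.5   # documented default
    prefix_padding_ms: int = 300              # documented default

    def __post_init__(self) -> None:
        if not 500 <= self.end_of_turn_silence_ms <= 3000:
            raise ValueError("end_of_turn_silence_ms must be between 500 and 3000")
        if not 0.0 <= self.speech_detection_threshold <= 1.0:
            raise ValueError("speech_detection_threshold must be between 0.0 and 1.0")
        # ASSUMPTION: the table gives no range for prefix padding; only reject negatives.
        if self.prefix_padding_ms < 0:
            raise ValueError("prefix_padding_ms must be non-negative")

    def to_payload(self) -> dict:
        # ASSUMPTION: nesting under "turn_detection" is inferred from the dotted names.
        return {"turn_detection": asdict(self)}


config = TurnDetectionConfig(end_of_turn_silence_ms=1000)
payload = config.to_payload()
```

Rejecting out-of-range values locally keeps a bad experiment from silently falling back to whatever the platform does with invalid settings.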
Alongside turn detection, the release adds `interruption.min_interruption_ms`, which sets the minimum speech duration required before EVI treats incoming audio as a legitimate barge-in. If a user makes a short noise, a cough, or a filler sound, the system won't misread it as an intentional interruption unless it crosses your defined threshold. This is a speech-to-speech architecture, not a text-to-speech pipeline bolted onto a transcription service. That distinction matters enormously for latency and naturalness, and these new controls sit natively inside that speech layer.
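To make the barge-in rule concrete, here is a minimal sketch of a duration filter. It is not Hume's implementation: the `(start_ms, end_ms)` segment format, the function name `filter_barge_ins`, and the choice to treat a segment exactly at the threshold as a valid interruption are all assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms) of detected user speech; hypothetical format


def filter_barge_ins(segments: List[Segment], min_interruption_ms: int) -> List[Segment]:
    """Keep only speech segments long enough to count as an intentional barge-in."""
    return [
        (start, end)
        for start, end in segments
        # ASSUMPTION: a segment exactly at the threshold counts as an interruption.
        if end - start >= min_interruption_ms
    ]


# A 120 ms cough, a 450 ms "wait, actually", and a 250 ms filler sound.
detected = [(0, 120), (1000, 1450), (2000, 2250)]
accepted = filter_barge_ins(detected, min_interruption_ms=300)
# accepted == [(1000, 1450)]
```

Only the 450 ms segment survives a 300 ms threshold, so the agent would keep talking through the cough and the filler.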
Why This Is Harder Than It Looks
Turn detection is one of the genuinely unsolved UX problems in voice AI. The failure modes are asymmetric and both are painful:
- Commit too early: You cut the user off mid-sentence. They feel unheard. Drop-off spikes.
- Wait too long: The agent hangs in silence. The conversation feels broken. Users assume it crashed.
Most voice stacks have historically hidden this complexity behind fixed heuristics. That worked when voice agents were narrow IVR replacements. It breaks down the moment you're handling real-world speech: filled pauses ("um," "uh"), domain jargon, users who speak slowly, or users calling from noisy environments.
The `end_of_turn_silence_ms` range of 500 to 3000 milliseconds captures almost the entire realistic distribution of human conversational pauses. A 500ms threshold is aggressive and appropriate for a fast-paced sales qualification call. A 3000ms threshold is appropriate for a medical intake agent where users are reading back information or thinking through answers. No single default fits both domains. Hume is correct to expose this as a configurable surface rather than baking in a one-size-fits-all value.
The `speech_detection_threshold` (0.0–1.0, default 0.5) controls voice activity detection sensitivity. Turn this up and you reduce false positives from background noise. Turn it down and you catch softer speakers or phone audio that's been compressed. Again, domain matters: a contact center handling calls from mobile users in cars needs a different setting than an enterprise scheduling bot used on desktop Zoom audio.
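The interaction between the silence threshold and the VAD threshold is easier to see in a toy simulation. The sketch below is a generic silence-based end-of-turn detector, not EVI's actual algorithm. It assumes fixed-length audio frames that each carry a speech probability, and the function name `detect_end_of_turn` and the `frame_ms` parameter are hypothetical.

```python
from typing import Optional, Sequence


def detect_end_of_turn(
    speech_probs: Sequence[float],
    frame_ms: int,
    speech_detection_threshold: float,
    end_of_turn_silence_ms: int,
) -> Optional[int]:
    """Return the time (ms) at which the turn is committed, or None if it never is."""
    heard_speech = False
    silence_ms = 0
    for i, prob in enumerate(speech_probs):
        if prob >= speech_detection_threshold:
            heard_speech = True
            silence_ms = 0  # any speech frame resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= end_of_turn_silence_ms:
                return (i + 1) * frame_ms
    return None


# 1 s of speech, a 700 ms thinking pause, 1 s of speech, then 3.5 s of silence.
utterance = [0.9] * 10 + [0.1] * 7 + [0.9] * 10 + [0.1] * 35

aggressive = detect_end_of_turn(utterance, 100, 0.5, 500)    # 1500: commits inside the pause
moderate = detect_end_of_turn(utterance, 100, 0.5, 1000)     # 3700: waits out the pause
patient = detect_end_of_turn(utterance, 100, 0.5, 3000)      # 5700: long wait after the user is done
```

With a 500 ms threshold the detector commits in the middle of the user's thinking pause, which is the "commit too early" failure. At 3000 ms it waits three full seconds after the user has finished, which is the "wait too long" failure. Raising `speech_detection_threshold` has a related effect: a noisy frame that would have reset the silence timer at 0.5 is ignored at 0.7, so the turn commits sooner.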
The Operational Cost Angle Nobody Is Talking About
Coverage of this release will focus on UX. The more important story for engineering leaders is cost. Better turn boundary detection reduces wasted model invocations. Every false end-of-turn triggers a model inference cycle. Every false barge-in interrupts a generation and initiates a new one. At scale, these errors aren't just UX friction: they're compute spend. If your voice agent handles 100,000 user turns per month and 5% of them are falsely detected, you're paying for 5,000 unnecessary inference calls per month, plus the latency penalty each one introduces.
The `min_interruption_ms` control directly addresses the barge-in false positive problem. A short cough or a background speaker shouldn't reset your agent's response. With an explicit minimum duration threshold, you can filter that noise at the edge before it ever reaches the model layer. That's not just better UX. That's a lower cloud bill.
Formalize this in your evaluation framework before you tune anything. Instrument your pipeline to capture:
- End-of-turn latency (time from user silence to first agent audio byte)
- Interruption false positive rate (barge-ins that weren't intentional)
- User cut-off rate (how often the agent speaks over the end of a user utterance)
- Session abandonment rate correlated with silence threshold settings
You need a before-and-after baseline. Without it, you're tuning by feel, which is how teams ship regressions.
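One possible shape for that baseline is sketched below. The `TurnRecord` fields and both function names are hypothetical: they assume your pipeline already logs per-turn latency and that barge-ins and cut-offs have been labeled, for example by human review of a sample. Nothing here comes from Hume's API.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median
from typing import Dict, Iterable, List, Tuple


@dataclass
class TurnRecord:
    """One logged turn; all fields are assumed to come from your own instrumentation."""
    end_of_turn_latency_ms: float       # user silence -> first agent audio byte
    barge_in: bool = False              # the system treated user audio as an interruption
    barge_in_intentional: bool = False  # label: the user really meant to interrupt
    user_cut_off: bool = False          # label: the agent spoke over the user


def summarize_turns(turns: List[TurnRecord]) -> Dict[str, float]:
    """Compute the three turn-level metrics; raises StatisticsError if turns is empty."""
    barge_ins = [t for t in turns if t.barge_in]
    false_positives = [t for t in barge_ins if not t.barge_in_intentional]
    return {
        "median_end_of_turn_latency_ms": median(t.end_of_turn_latency_ms for t in turns),
        "interruption_false_positive_rate": (
            len(false_positives) / len(barge_ins) if barge_ins else 0.0
        ),
        "user_cut_off_rate": sum(t.user_cut_off for t in turns) / len(turns),
    }


def abandonment_by_setting(sessions: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each end_of_turn_silence_ms setting to its session abandonment rate."""
    counts = defaultdict(lambda: [0, 0])  # setting -> [abandoned, total]
    for silence_ms, abandoned in sessions:
        counts[silence_ms][0] += int(abandoned)
        counts[silence_ms][1] += 1
    return {setting: a / n for setting, (a, n) in counts.items()}


turns = [
    TurnRecord(400),
    TurnRecord(600, barge_in=True, barge_in_intentional=True),
    TurnRecord(800, barge_in=True),
    TurnRecord(1000, user_cut_off=True),
]
baseline = summarize_turns(turns)
abandonment = abandonment_by_setting([(500, True), (500, False), (1000, False), (1000, False)])
```

Running the same two functions on logs captured before and after a settings change gives you the comparison directly, instead of an impression.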
Competitive Context: What Other Stacks Offer
To be direct about the landscape: EVI is not the only serious contender in speech-to-speech voice AI. OpenAI's Realtime API, Deepgram's voice stack, and Retell AI all compete in this space, and engineering leaders are actively benchmarking them. Here's an honest comparison of control surface transparency across the major players as of mid-2026:
| Capability | EVI (Hume) | Retell AI |
|---|---|---|
| Configurable silence threshold | ✅ | ❌ |
| VAD sensitivity tuning | ✅ | ❌ |
| Min interruption duration control | ✅ | ❌ |
| Prefix padding control | ✅ | ❌ |
| Speech-to-speech architecture | ✅ | ❌ |
This comparison reflects publicly documented capabilities. Deepgram's VAD controls are well-documented at the transcription layer. Retell AI abstracts most of these settings. OpenAI's Realtime API exposes silence threshold but not VAD sensitivity or interruption minimums at the time of writing.
The pattern is clear. Hume is betting that enterprise buyers want control surfaces, not black boxes. That's the right bet. The teams building production voice agents in support, scheduling, and sales have already learned that demo quality and production quality are different things. Demo quality is easy to achieve with good defaults. Production quality requires tuning for your specific call distribution, your user population, and your domain vocabulary. Hume's architecture positions EVI as the platform for teams who've moved past the demo stage. The configurable interruption threshold in particular is a signal that Hume is thinking about production edge cases that competitors haven't published solutions for yet.
What Your Team Should Do This Week
This is not a "wait and evaluate" situation if you're already running voice agents in production. The controls are live. The risk of not evaluating them is that you're leaving UX quality and cost efficiency on the table. Here's a concrete action plan:
1. Establish your baseline metrics now. Instrument your current pipeline for the four metrics listed above. You need numbers before you touch any settings.
2. Identify your domain-appropriate silence threshold. Run a sample of your call recordings and measure actual pause distributions between user utterances. If your p50 inter-turn silence is 800ms, starting at `end_of_turn_silence_ms: 1000` is a reasonable first test.
3. Set `min_interruption_ms` conservatively to start. A value of 300–500ms filters most incidental noise without suppressing genuine barge-ins. Tighten from there based on your false positive data.
4. Run the VAD sensitivity at default (0.5) first. Only adjust `speech_detection_threshold` after you've isolated whether your false detection issues come from noise sensitivity or silence timing. Changing both simultaneously makes root cause analysis impossible.
5. Run a side-by-side benchmark against your current stack. If you're using a competitor's voice API, pull 1,000 sessions through EVI with your tuned settings and compare the four metrics directly. This is the only way to make a defensible platform decision.
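The silence threshold step in the plan above can be turned into a small helper. This is a sketch under stated assumptions: the function name `suggest_initial_silence_ms` is hypothetical, and the 1.25 headroom factor is chosen only because it reproduces the 800ms to 1000ms example, not because it is a recommended constant. The clamp uses the documented 500 to 3000 ms range.

```python
from statistics import median
from typing import Sequence


def suggest_initial_silence_ms(
    pauses_ms: Sequence[float],
    headroom: float = 1.25,   # ASSUMPTION: illustrative factor, tune against your own data
    lowest: int = 500,        # documented lower bound for end_of_turn_silence_ms
    highest: int = 3000,      # documented upper bound for end_of_turn_silence_ms
) -> int:
    """Suggest a first end_of_turn_silence_ms to test, based on measured pauses."""
    p50 = median(pauses_ms)
    candidate = round(p50 * headroom)
    return max(lowest, min(highest, candidate))


measured_pauses = [600, 700, 800, 900, 1000]  # pause lengths from call recordings, in ms
first_test = suggest_initial_silence_ms(measured_pauses)
# first_test == 1000
```

Treat the result as the first value to test, then move it based on the cut-off and latency numbers from your baseline.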
The Evaluation Environment Matters
Don't tune against a clean audio test setup. Use real call recordings or realistic noise injection. A setting that performs well on studio audio will fail on a user calling from a car. Your evaluation environment should match your production environment or the data is meaningless.
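If you only have clean recordings, one common way to approximate realistic conditions is to mix recorded noise into them at a chosen signal-to-noise ratio. The sketch below uses plain Python lists of float samples so it runs without dependencies. The function names are hypothetical, and a real pipeline would more likely work on arrays loaded from audio files.

```python
import math
import random
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Add noise to speech, scaled so the result has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    # SNR(dB) = 20 * log10(speech_rms / noise_rms), solved for the noise level.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]  # 1 s test tone
car_noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]                 # stand-in for road noise
noisy = mix_at_snr(clean, car_noise, snr_db=10.0)
```

Sweeping `snr_db` across a few values for the same recordings shows where a given `speech_detection_threshold` starts to misfire.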
The Bigger Signal
This release is about more than turn detection. It's about what kind of voice AI platform Hume is building. The decision to expose `prefix_padding_ms` as a developer control is particularly telling. This is a niche parameter: it controls how much audio is included before detected speech to avoid clipping utterance starts. Most platforms either don't expose it or hard-code it. Hume exposing it signals a philosophy: give developers visibility into the full signal processing chain, not just the model layer.
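For readers who haven't met prefix padding before, the idea fits in a few lines. This is a generic illustration of the concept described above, not EVI's internals. The function name `start_with_prefix_padding` and the sample-index interface are assumptions.

```python
from typing import List, Sequence


def start_with_prefix_padding(
    samples: Sequence[float],
    onset_index: int,
    sample_rate_hz: int,
    prefix_padding_ms: int = 300,   # default taken from the parameter table
) -> List[float]:
    """Return audio starting prefix_padding_ms before the detected speech onset."""
    pad_samples = int(sample_rate_hz * prefix_padding_ms / 1000)
    start = max(0, onset_index - pad_samples)  # never reach before the buffer start
    return list(samples[start:])


audio = [float(i) for i in range(16000)]   # 2 s of placeholder samples at 8 kHz
utterance = start_with_prefix_padding(audio, onset_index=4000, sample_rate_hz=8000)
# The utterance now begins 2400 samples (300 ms) before the detected onset.
```

Because voice activity detection tends to fire slightly after speech really begins, keeping a short window before the onset is what stops the first syllable from being clipped.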
That philosophy is increasingly important as voice agents move into higher-stakes domains. Medical intake, legal research assistants, financial advisory bots: these use cases require auditability and predictability. A platform that exposes its tuning surface is a platform you can reason about. A platform that hides its heuristics is a platform you're betting on blindly. Hume's EVI is building toward the former. The configurable turn detection and interruption controls shipped this week are evidence that the architecture is designed for teams who need to own their interaction quality, not just rent it. The teams that formalize voice interaction QA now, before their competitors do, will have a compounding advantage: better data, better baselines, and a tighter feedback loop between product changes and measured outcomes. Start that process with this release.
