Realtime vs. Pipeline Voice Agent: Architecture Guide 2026

Realtime or STT-LLM-TTS pipeline for AI voice agents? Latency, tool calls, telephony and compliance compared – with the Famulor production playbook

Industry Insight
Famulor AI Team · April 27, 2026

Realtime vs. Pipeline: Which Voice Agent Architecture Should You Pick?

If you're building an AI voice agent today, you face an architecture decision early on that will shape your project for months: realtime model (speech-to-speech, S2S) or STT–LLM–TTS pipeline (cascade)? The short answer: both ship in production, but they're optimized for different problems. Realtime wins on conversational naturalness and latency; pipelines win on control, telephony, compliance, and cost.

Famulor is built on a hybrid pipeline approach that gives you all the cascade advantages — swappable STT/LLM/TTS, granular tool calls, SIP-ready telephony, EU-hosted compliance — and optionally lets you plug in a realtime model for emotion-sensitive use cases. In this article we walk you through when each architecture makes sense, the trade-offs, and how to ship both productively with Famulor.

The Two Architectures at a Glance

Both architectures share one shape: audio in, audio out, "brain" in the middle. They differ in how many models you need for that middle step — and therefore how much you, the operator, can control.

Realtime / Speech-to-Speech (S2S)

A single multimodal model handles the entire conversational turn: it ingests raw audio, reasons about it, and streams audio back — all in a single model call. Because audio never converts to text, the model can pick up on tone, pacing, hesitation, and emotional coloring that would disappear in transcription.

Common players: OpenAI Realtime API (gpt-4o realtime), Google Gemini Live, ElevenLabs Conversational, native S2S from Cartesia.

STT–LLM–TTS Pipeline (Cascade)

Three specialist models in sequence:

  • STT (speech-to-text / ASR): transcribes caller audio. Examples: Deepgram Nova, Cartesia Ink, Gladia, AssemblyAI.
  • LLM: processes the transcript, reasons, calls tools, generates a text response. Examples: GPT-4o, Claude, Gemini Pro, Llama 3.
  • TTS: converts the response into natural-sounding audio. Examples: Cartesia Sonic, ElevenLabs v3, MiniMax, Gemini Flash TTS.

Famulor is, at its core, a modern, heavily optimized pipeline with streaming overlap — exposed through our No-Code AI Voice Agent for any language and use case depth.
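
To make the cascade shape concrete, here is a minimal Python sketch of one conversational turn. The three functions are stubs standing in for real STT/LLM/TTS providers; the stubbed transcript and reply are invented for illustration.

```python
# Minimal sketch of the cascade flow. All three stages are stubs; in
# production each would be a call to a provider SDK (Deepgram, GPT-4o,
# Cartesia, etc.).

def stt(audio: bytes) -> str:
    """Transcribe caller audio to text (stand-in for an ASR provider)."""
    return "what time do you open tomorrow"  # stubbed transcript

def llm(transcript: str) -> str:
    """Reason over the transcript and draft a reply (stand-in for an LLM)."""
    return "Thanks for asking! We open at 9 a.m. tomorrow."

def tts(text: str) -> bytes:
    """Synthesize the reply as audio (stand-in for a TTS engine)."""
    return text.encode("utf-8")  # stand-in for an audio buffer

def handle_turn(caller_audio: bytes) -> bytes:
    # Text sits between every stage, which is why each hop can be
    # logged, redacted, and audited independently.
    transcript = stt(caller_audio)
    reply = llm(transcript)
    return tts(reply)
```

The key property to notice: because every hop is text, each stage can be swapped or inspected without touching the others.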

Latency: Where Realtime Wins Structurally — and Pipelines Catch Up

Latency is the most-discussed trade-off, and rightly so. Realtime has a structural edge: no audio-to-text-and-back serialization, no handoff between separate models. Audio flows through once.

Pipelines have it harder. Even if every component is fast, latency compounds: STT + LLM time-to-first-token + TTS time-to-first-audio + network. An unoptimized pipeline is almost always slower than a good realtime model.

But: modern pipelines don't wait for each stage to finish. They stream partial STT transcripts to the LLM while the user is still speaking, and pipe LLM tokens into TTS as they arrive. That streaming overlap is why competitive pipeline latencies under 700 ms are absolutely doable. Famulor uses this streaming model by default — see the Core Concepts in the docs.
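
As a hedged illustration of that overlap, the sketch below wires three stub async generators together with asyncio. Providers and outputs are placeholders; the point is the shape: each downstream stage starts consuming upstream output before the upstream stage has finished.

```python
import asyncio

# Streaming-overlap sketch with stub generators. Real systems stream
# over sockets; asyncio.sleep(0) stands in for waiting on the network.

async def stt_stream():
    # Partial transcripts arrive while the caller is still speaking.
    for partial in ["book", "book a table", "book a table for two"]:
        await asyncio.sleep(0)
        yield partial

async def llm_stream(prompt: str):
    # Tokens stream out as soon as the model starts generating.
    # (prompt would drive generation in a real model; fixed tokens here.)
    for token in ["Sure,", " table", " for", " two."]:
        await asyncio.sleep(0)
        yield token

async def tts_stream(tokens):
    # Synthesis starts on the first token, not the finished sentence.
    async for token in tokens:
        yield token.encode()  # stand-in for an audio chunk

async def run_turn() -> bytes:
    final = ""
    async for partial in stt_stream():
        final = partial  # keep the latest transcript hypothesis
    audio = b""
    async for chunk in tts_stream(llm_stream(final)):
        audio += chunk
    return audio
```

Because TTS begins on the first LLM token, the user hears audio long before the full response text exists, which is where the sub-700 ms figures come from.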

Tool Calling: Where Pipelines Score Their Biggest Practical Win

Most production voice agents need to do more than talk. They look up accounts, check orders, book appointments, trigger workflows. How each architecture handles function calling has direct UX impact.

In a pipeline, tool calling lives at the LLM layer using standard text-based function calling — the mature mechanism familiar from chat apps. You get:

  • Structured error handling (retry, fallback, clear error messages)
  • Parallel tool calls (hit several APIs simultaneously)
  • Full control over what happens during a tool execution (filler audio, "let me check that for you…")

Realtime models also support tool calling, but the experience varies wildly by provider. Some models block silently while waiting for a tool result. Newer versions support non-blocking calls, but reliability is measurably lower than pipeline LLMs because the model is simultaneously hearing, reasoning, and speaking.

In Famulor you implement tool calls via Tools & Functions or Custom Mid-Call Tools — wired through the internal automation platform (300+ integrations, similar to Zapier or Make).
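
A hedged sketch of two of those patterns, parallel tool calls plus filler audio while the caller waits, using Python stubs rather than the actual Famulor tooling. Function names, timings, and return values are illustrative.

```python
import asyncio

# Two pipeline tool-calling patterns in one turn: fire several APIs
# simultaneously, and keep the caller engaged while they run.

async def lookup_order(order_id: str) -> dict:
    await asyncio.sleep(0.05)  # simulated API latency
    return {"order": order_id, "status": "shipped"}

async def check_calendar(day: str) -> dict:
    await asyncio.sleep(0.05)
    return {"day": day, "free_slots": 3}

async def play_filler() -> str:
    # In production this would stream pre-recorded "let me check..." audio.
    return "Let me check that for you..."

async def handle_tools():
    filler = asyncio.create_task(play_filler())
    # Parallel tool calls: both APIs run at the same time, so total
    # wait is max(latencies), not their sum.
    order, slots = await asyncio.gather(
        lookup_order("A-1042"), check_calendar("tuesday")
    )
    await filler
    return order, slots
```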

Turn Detection: When Did the Caller Stop Talking?

One of the trickiest problems in voice AI: when is the caller done — versus when are they just thinking? Get it wrong and you either interrupt or leave dead air.

Realtime models rely on built-in end-of-turn detection. It often works well, but you're stuck with what the provider exposes. Tuning is limited.

Pipelines give you more freedom:

  • Pick your own turn-detection model
  • Combine VAD (Voice Activity Detection) with semantic turn detection
  • Tune adaptive interruption handling
  • Set sensitivity per use case (helpdesk vs. healthcare vs. outbound)

In Famulor, turn detection is preconfigured but tunable per voice agent in the General Settings — useful when older patients on the phone need longer pause tolerance than B2B sales callbacks.
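
To show what "combine VAD with semantic turn detection" means in practice, here is a deliberately tiny sketch. The thresholds and the trailing-word heuristic are illustrative placeholders, not Famulor's actual implementation (a real system would use a trained end-of-turn classifier).

```python
# End-of-turn sketch: require BOTH acoustic silence (VAD) and a
# semantically complete utterance before the agent starts speaking.

def semantically_complete(transcript: str) -> bool:
    # Crude stand-in for a semantic classifier: a trailing filler word
    # or conjunction strongly suggests the caller is still thinking.
    hedges = ("um", "uh", "and", "so", "but")
    return not transcript.rstrip().lower().endswith(hedges)

def end_of_turn(silence_ms: int, transcript: str,
                pause_tolerance_ms: int = 600) -> bool:
    # Raising pause_tolerance_ms per agent suits callers who need
    # longer pauses (e.g. a healthcare hotline vs. B2B callbacks).
    return (silence_ms >= pause_tolerance_ms
            and semantically_complete(transcript))
```

Tuning `pause_tolerance_ms` per voice agent is exactly the kind of knob a pipeline exposes and a closed realtime model usually does not.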

Voice Quality and Conversational Feel: Realtime's Sweet Spot

This is where realtime has its most interesting advantage — and one that's hard to quantify. When audio is transcribed, a lot of information disappears: no tone, no emotional coloring, no hesitation. The LLM only sees words.

A realtime model hears all of that. It can react to tone — if someone sounds frustrated, it can respond with more empathy without the text instructing it to.

That said, modern TTS engines are catching up fast. Cartesia Sonic, ElevenLabs v3, and Gemini 3.1 Flash TTS produce speech with breathing, laughter, and emotional inflection. Pipelines can sound great — they just work with less information about how the user spoke. Voice picking guidance is in Voice Selection.

Control, Modularity, and Debugging: Where Pipelines Shine

This is the clearest pipeline advantage — and the biggest practical limitation of realtime models in production.

A pipeline is transparent by design. Text sits between every stage. You can log exactly what was transcribed, what the LLM produced, and what was synthesized. When something goes sideways — a misheard word, an off-target response — you can pinpoint the issue. Famulor logs every call including transcript and LLM output for exactly this reason.

Pipelines are also easy to reconfigure:

  • Swap STT providers without touching your LLM prompt
  • Switch TTS voice without changing anything else
  • Move from GPT-4o to Claude because a use case fits Anthropic better

Realtime models are opaque: audio in, audio out. You can't swap components, and you're locked into one provider's ecosystem.

Cost: Realtime Is Harder to Control

Realtime APIs typically bill per second of audio in and audio out. That makes cost scale directly with conversation length — and as system prompts and history grow, costs become hard to predict.

Pipelines let you optimize per layer:

  • Lightweight LLM for simple queries, premium LLM for complex ones (routing!)
  • Cost-efficient STT for high-volume transcription
  • Premium TTS only where voice quality matters (brand voice use cases)

Famulor uses transparent per-minute pricing. The pricing page has the current model — and the Twilio calculator lets you model telephony cost precisely upfront.
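
The routing pattern from the first bullet can be sketched as a tiny classifier in front of the LLM call. The model names and keyword heuristic below are illustrative placeholders, not Famulor defaults; a production router might use a small classifier model instead.

```python
# Per-layer cost routing: send routine turns to a cheap model and
# reserve the premium model for complex or sensitive ones.

CHEAP_MODEL = "gpt-4o-mini"   # placeholder model names
PREMIUM_MODEL = "gpt-4o"

ESCALATION_HINTS = ("refund", "cancel", "complaint", "lawyer", "manager")

def route_model(transcript: str) -> str:
    text = transcript.lower()
    if any(hint in text for hint in ESCALATION_HINTS):
        return PREMIUM_MODEL  # escalations justify the premium price
    return CHEAP_MODEL        # the bulk of routine queries stay cheap
```

Because routing happens at the text layer, this optimization is simply unavailable to a single opaque realtime model.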

| Cost Dimension | Realtime / S2S | Pipeline (Famulor) |
| --- | --- | --- |
| Billing | Per second of audio in + out | Per minute, summed across components |
| Optimization | Hard — everything in one model | Per layer, individually tunable |
| Routing cheap vs. premium models | Not possible | Standard pattern in Famulor |
| Predictability | Low (token usage opaque) | High (clean per-minute pricing) |
| Telephony surcharges | Variable | Modeled exactly via Twilio or Telnyx calculator |


Telephony: The Underrated Factor

This one catches teams off guard. Traditional phone networks carry audio at 8 kHz using codecs like G.711 (PCMU/PCMA). Realtime models, however, are trained on 16–48 kHz web audio (WebRTC, Opus). The mismatch means worse recognition, muffled TTS quality, and more misunderstandings.
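
To make the codec gap tangible: G.711 μ-law packs each sample into 8 bits, which must be expanded back to 16-bit linear PCM before any model trained on web audio can consume it. The function below is the standard ITU-T G.711 μ-law expansion written out by hand, included as a worked illustration of what telephony-optimized STT handles for you.

```python
# ITU-T G.711 mu-law expansion: one 8-bit telephony sample becomes a
# 16-bit linear PCM sample. (Classic Sun g711.c algorithm.)

def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law sample to 16-bit linear PCM."""
    byte = ~byte & 0xFF                 # G.711 stores samples inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07       # 3-bit segment number
    mantissa = byte & 0x0F              # 4-bit step within the segment
    sample = ((mantissa << 3) + 0x84) << exponent
    sample -= 0x84                      # remove the encoding bias
    return -sample if sign else sample
```

Even after expansion the signal is still 8 kHz: resampling to 16 kHz adds no information the codec already discarded, which is why models trained on wideband audio underperform on phone calls.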

For phone-based deployments — AI call centers, IVR replacement, outbound dialing — pipelines with telephony-optimized STT are the more reliable choice. Famulor supports SIP trunking natively, so your existing VoIP/PBX provider plugs in (Telnyx, Twilio, Sipgate, your own Asterisk).

If your voice agent only runs inside a web widget (no phone), the telephony argument is moot — latency and voice quality become the deciders.

Compliance: Where Pipelines Become Non-Negotiable

For regulated industries — healthcare, financial services, legal, government — compliance is not an add-on, it's a hard requirement. Pipelines offer:

  • Component selection per region (e.g. EU-hosted STT/LLM for GDPR)
  • PII redaction at the text layer before data hits the LLM or logs
  • Components with HIPAA, GDPR, SOC 2, ISO 27001 certifications available
  • Audit logs at every step: said, transcribed, generated

Realtime models, by contrast, are mostly hosted by a handful of US hyperscalers in centralized data centers. Audio in, audio out — content filtering, PII redaction, and detailed audit logging are far harder to enforce.

Famulor offers EU hosting, GDPR-compliant setups, and transparent data flow — see Understanding Billing for the operational side and Industries for regulated-industry examples.
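
The text-layer PII redaction listed above is only possible because a transcript exists between STT and LLM. A minimal sketch, with deliberately simple example patterns (a production setup would use a vetted PII library and locale-aware rules):

```python
import re

# Redact PII in the transcript BEFORE it reaches the LLM or the logs.
# Patterns here are minimal illustrations, not production-grade.

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
    "IBAN":  re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def redact(transcript: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        transcript = pattern.sub(f"[{label}]", transcript)
    return transcript
```

With a realtime model there is no text between audio-in and audio-out, so this hook point simply does not exist.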

Hybrid: The Best of Both Worlds

You don't have to commit to one architecture. Two hybrid patterns we see regularly with Famulor customers:

Realtime + separate STT

You need a reliable, timestamped transcript (for compliance, QA, or downstream analytics)? Let the realtime model handle audio reasoning — and run a dedicated STT in parallel for transcription. The two streams stay independent.

Realtime + separate TTS ("Half-Cascade")

Use the realtime model for audio input (preserving its ability to hear tone, hesitation, emotion) but output text instead of audio. The text routes through a dedicated TTS engine of your choice — full control over voice, including voice cloning, brand voices, and scripted speech.

In Famulor you can configure both hybrid modes per voice agent — the Flow Builder handles this without code.
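
The half-cascade pattern can be sketched in a few lines. Both functions are stubs standing in for provider SDK calls; some realtime APIs expose text-only output via a modalities setting, but the exact mechanism varies by provider.

```python
# Half-cascade sketch: the realtime model hears the audio (tone,
# hesitation, emotion) but emits TEXT, which a dedicated TTS voices.

def realtime_model_text_out(caller_audio: bytes) -> str:
    # Stub for an S2S model configured for text output. It still
    # "hears" prosody, so the wording can react to the caller's tone.
    return "I hear some frustration - let me fix that right away."

def brand_tts(text: str) -> bytes:
    # Dedicated TTS of your choice: full control over voice, cloning,
    # brand voices, and scripted speech.
    return text.encode("utf-8")  # stand-in for an audio buffer

def half_cascade_turn(caller_audio: bytes) -> bytes:
    return brand_tts(realtime_model_text_out(caller_audio))
```

The design trade: you keep the realtime model's prosodic hearing on the input side while regaining pipeline-style control on the output side.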

Comparison Matrix

| Criterion | STT–LLM–TTS Pipeline | Realtime / S2S |
| --- | --- | --- |
| Getting started | More components to orchestrate | Simpler initial integration |
| Latency | Conversation-grade with tuning | Structurally faster |
| Turn detection | Full context-aware support | Built-in, limited customization |
| Voice naturalness | Excellent with modern TTS | Prosodic awareness |
| Modularity / Debugging | Fully modular and inspectable | Opaque; LLM/voice choice limited |
| Tool calling | Mature text-based function calls | Supported, varies by provider |
| Customization | Highly configurable | Limited to what the model exposes |
| Cost | Optimize each layer | Hard to optimize |
| Compliance / EU hosting | Full data flow control | Centralized; data residency varies |
| 8 kHz telephony | Optimized via STT choice | Mismatch with 16–48 kHz training |

Which Architecture for Which Use Case?

| Use Case | Recommendation | Why |
| --- | --- | --- |
| Inbound healthcare hotline | Pipeline (Famulor) | GDPR/HIPAA, EU hosting, PII redaction, audit logs |
| Outbound sales calls | Pipeline (Famulor) | SIP trunking, telephony-optimized STT, CRM tool calls |
| Empathy-heavy consumer bot | Hybrid (Realtime + TTS) | Audio-in for tone detection, controlled brand voice |
| E-commerce web widget | Pipeline or Realtime | WebRTC = no 8 kHz issue; compliance need decides |
| Law firm intake | Pipeline (Famulor) | Audit logs, regional hosting, conflict-check tool calls |
| Quick MVP / Prototype | Realtime | Fastest start with fewest components — migrate later |

Implementing With Famulor — Step by Step

  1. Define the use case: inbound vs. outbound, industry, primary language, compliance level. Industry-specific examples on the Industries page.
  2. Create the voice agent: in the No-Code editor, start from a template. Initial message, system prompt, tools — all without code.
  3. Pick STT/LLM/TTS stack: for English telephony we typically recommend Deepgram Nova + GPT-4o + Cartesia Sonic. For premium brand voices: Deepgram + Claude + ElevenLabs v3. Famulor swaps stacks with one click.
  4. Wire up tools/integrations: CRM, calendar, helpdesk via Famulor integrations or webhook.
  5. Phone number via SIP: bring your own VoIP number or provision one through Famulor. Model Twilio/Telnyx cost upfront with the Twilio calculator.
  6. Test: in the browser tester first, then on real telephony. Famulor logs every call including transcript for iteration.
  7. Go live: activate inbound webhooks, set up outbound campaigns, push reporting to your CRM.

Best Practices & Common Mistakes

  • Don't: decide architecture from a browser demo. Demos run on WebRTC with perfect audio — phone is a different planet.
  • Do: define latency budgets per use case. Healthcare tolerates ≈800 ms; outbound sales must stay under 600 ms.
  • Don't: let your system prompt grow unbounded. Every turn re-sends it — cost and latency both climb.
  • Do: use knowledge bases instead of stuffing the prompt. Standard pattern in Famulor.
  • Don't: test realtime if your use case is primarily phone. Validate a pipeline setup with telephony-optimized STT first.
  • Do: add filler audio when tool calls take more than 1 second. See Filler Audio in the docs.

Industry Examples From Real Famulor Deployments

  • Dr. Becker dental practice (60 staff): inbound appointment hotline. Pipeline with GPT-4o + Cartesia Sonic, Cal.com integration, EU hosting. Result: hang-up rate dropped from 22% to 6%.
  • Berlin real estate brokerage: outbound qualification of listing inquiries. Pipeline with Claude + Cartesia Ink + ElevenLabs, GoHighLevel sync. Result: 3× more qualified first conversations per day.
  • Shopify shop (DACH, 8,000 orders/month): web widget plus phone. Pipeline with GPT-4o-mini for standard inquiries, GPT-4o for escalations. Result: 45% of tickets resolved without a human agent.
  • Law firm intake: inbound with EU hosting, PII redaction before LLM call, audit log per turn. Pipeline with German STT, Claude, MiniMax TTS.

Conclusion: For Most Production Voice Agents, a Modern Pipeline Wins

Realtime models are technically impressive and have real advantages in WebRTC consumer use cases. But for the overwhelming majority of production voice agent deployments — telephony, regulated industries, controlled cost, multilingual setups, granular tool calls — a modern streaming pipeline is the clear winner. You keep control, modularity, compliance, and cost predictability.

With Famulor you get exactly that pipeline architecture as a no-code platform: 40+ languages, SIP trunking, 300+ integrations, swappable STT/LLM/TTS components, EU hosting, transparent per-minute pricing — and optional hybrid realtime modes for emotion-sensitive use cases. Check the pricing or jump straight into a voice agent for your setup.


FAQ — Common Questions on Realtime vs. Pipeline Voice Agents

What is the main difference between realtime and pipeline voice agents?

Realtime (speech-to-speech) models process audio in a single multimodal model. Pipeline (cascade) architectures use three specialist models: STT for transcription, LLM for reasoning, TTS for output. Pipelines are more controllable; realtime is structurally lower-latency.

Which architecture does Famulor use?

Famulor uses a modern streaming pipeline with overlap between STT, LLM, and TTS — and optionally supports hybrid realtime setups for emotion-sensitive consumer use cases.

When should I pick realtime over pipeline?

If your voice agent runs only in a web widget (no phone), you don't have strict GDPR or HIPAA requirements, and emotional tone detection matters more than tool-calling reliability. For telephony and regulated industries, pipeline wins.

Does realtime work well with traditional telephony?

Limited. Telephony carries 8 kHz audio; realtime models are trained on 16–48 kHz web audio. The mismatch measurably degrades recognition and TTS quality. For phone, pipelines with telephony-optimized STT (e.g. via Famulor) are more reliable.

How fast can a pipeline voice agent be?

With streaming overlap, modern pipelines hit conversational latencies under 700 ms. Famulor optimizes STT streaming, LLM token streaming, and TTS streaming so first reaction lands around 500–800 ms — depending on use case and stack.

Can I switch STT, LLM, and TTS providers in Famulor?

Yes. Famulor supports multiple STT providers (Cartesia, Deepgram, Gladia), multiple LLMs (GPT-4o, Claude, Gemini), and multiple TTS engines (Cartesia Sonic, ElevenLabs, MiniMax, Gemini TTS). Switch with one click in the voice selection setting.

What does a pipeline voice agent cost compared to realtime?

Pipelines allow per-layer optimization: cheaper LLMs for standard queries, premium LLMs only for escalations. Realtime APIs are billed per audio second and are hard to optimize. Famulor uses transparent per-minute pricing — see pricing.

Is GDPR or HIPAA compliance possible with realtime?

Hard. Realtime models are mostly hosted by US hyperscalers without EU data residency. Pipelines allow EU components, PII redaction before the LLM call, and audit logs at every stage. Famulor offers EU hosting by default.

How does tool calling work in Famulor?

Famulor uses text-based function calling at the LLM layer — the mature mechanism familiar from chat apps. Define tools in the Tools & Functions editor, optionally as Custom Mid-Call Tools via webhook or directly inside the Famulor automation platform.

Can I migrate from pipeline to realtime later?

Yes. In Famulor the architecture is a per-voice-agent configuration. Most teams start with pipeline for stability and compliance — and add hybrid realtime modes for selected use cases later, without rebuilding tools or integrations.
