Realtime vs. Pipeline: Which Voice Agent Architecture Should You Pick?
If you're building an AI voice agent today, you face an architecture decision early on that will shape your project for months: realtime model (speech-to-speech, S2S) or STT–LLM–TTS pipeline (cascade)? The short answer: both ship in production, but they're optimized for different problems. Realtime wins on conversational naturalness and latency; pipelines win on control, telephony, compliance, and cost.
Famulor is built on a hybrid pipeline approach that gives you all the cascade advantages — swappable STT/LLM/TTS, granular tool calls, SIP-ready telephony, EU-hosted compliance — and optionally lets you plug in a realtime model for emotion-sensitive use cases. In this article we walk you through when each architecture makes sense, the trade-offs, and how to ship both productively with Famulor.
The Two Architectures at a Glance
Both architectures share one shape: audio in, audio out, "brain" in the middle. They differ in how many models you need for that middle step — and therefore how much you, the operator, can control.
Realtime / Speech-to-Speech (S2S)
A single multimodal model handles the entire conversational turn: it ingests raw audio, reasons about it, and streams audio back — all in a single model call. Because audio never converts to text, the model can pick up on tone, pacing, hesitation, and emotional coloring that would disappear in transcription.
Common players: OpenAI Realtime API (gpt-4o realtime), Google Gemini Live, ElevenLabs Conversational, native S2S from Cartesia.
STT–LLM–TTS Pipeline (Cascade)
Three specialist models in sequence:
- STT (speech-to-text / ASR): transcribes caller audio. Examples: Deepgram Nova, Cartesia Ink, Gladia, AssemblyAI.
- LLM: processes the transcript, reasons, calls tools, generates a text response. Examples: GPT-4o, Claude, Gemini Pro, Llama 3.
- TTS: converts the response into natural-sounding audio. Examples: Cartesia Sonic, ElevenLabs v3, MiniMax, Gemini Flash TTS.
Famulor is, at its core, a modern, heavily optimized pipeline with streaming overlap, exposed through our No-Code AI Voice Agent for any language and any depth of use case.
Latency: Where Realtime Wins Structurally — and Pipelines Catch Up
Latency is the most-discussed trade-off, and rightly so. Realtime has a structural edge: no audio-to-text-and-back serialization, no handoff between separate models. Audio flows through once.
Pipelines have it harder. Even if every component is fast, latency compounds: STT + LLM time-to-first-token + TTS time-to-first-audio + network. An unoptimized pipeline is almost always slower than a good realtime model.
But: modern pipelines don't wait for each stage to finish. They stream partial STT transcripts to the LLM while the user is still speaking, and pipe LLM tokens into TTS as they arrive. That streaming overlap is why competitive pipeline latencies under 700 ms are absolutely doable. Famulor uses this streaming model by default — see the Core Concepts in the docs.
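The arithmetic behind that overlap can be sketched with assumed stage latencies. The numbers below are illustrative placeholders, not measured provider benchmarks:

```python
# Rough time-to-first-audio arithmetic with illustrative stage latencies
# (placeholders, not measured provider numbers), all in seconds.
stt_finalize = 0.20      # STT endpointing/finalize after the caller stops
llm_first_token = 0.25   # LLM time-to-first-token
tts_first_audio = 0.15   # TTS time-to-first-audio
network = 0.10           # round trips

# Naive sequential pipeline: each stage waits for the previous to finish.
sequential = stt_finalize + llm_first_token + tts_first_audio + network

# Streaming overlap: the LLM has been consuming partial transcripts while
# the caller spoke, and TTS starts on the first tokens, so STT finalize and
# LLM first-token largely overlap on the critical path.
overlapped = max(stt_finalize, llm_first_token) + tts_first_audio + network

print(f"sequential: {sequential * 1000:.0f} ms, overlapped: {overlapped * 1000:.0f} ms")
# → sequential: 700 ms, overlapped: 500 ms
```

Even with conservative per-stage numbers, the overlap shaves the serialization penalty that makes naive pipelines feel sluggish.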
Tool Calling: Where Pipelines Score Their Biggest Practical Win
Most production voice agents need to do more than talk. They look up accounts, check orders, book appointments, trigger workflows. How each architecture handles function calling has direct UX impact.
In a pipeline, tool calling lives at the LLM layer using standard text-based function calling — the mature mechanism familiar from chat apps. You get:
- Structured error handling (retry, fallback, clear error messages)
- Parallel tool calls (hit several APIs simultaneously)
- Full control over what happens during a tool execution (filler audio, "let me check that for you…")
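A minimal sketch of what those three points buy you in practice, with hypothetical CRM lookups standing in for real tool backends:

```python
import asyncio

# Hypothetical backend lookups — stand-ins for real CRM/order APIs,
# not Famulor's or any provider's actual interfaces.
async def lookup_account(phone: str) -> dict:
    await asyncio.sleep(0.3)  # simulated API latency
    return {"phone": phone, "tier": "gold"}

async def lookup_orders(phone: str) -> list:
    await asyncio.sleep(0.4)
    return [{"id": "A-1001", "status": "shipped"}]

async def play_filler():
    """What 'filler audio' buys you: the agent speaks while tools run."""
    print('agent: "Let me check that for you..."')

async def handle_tool_turn(phone: str) -> dict:
    # The LLM emitted two function calls for this turn; run them in
    # parallel while filler audio covers the wait.
    filler = asyncio.create_task(play_filler())
    account, orders = await asyncio.gather(
        lookup_account(phone), lookup_orders(phone)
    )
    await filler
    # Structured results go back to the LLM as text for the next turn.
    return {"account": account, "orders": orders}

result = asyncio.run(handle_tool_turn("+49 30 1234567"))
print(result["orders"][0]["status"])  # shipped
```

Because both lookups run concurrently, the turn costs roughly the slower of the two calls rather than their sum, and the structured return value makes retries and fallbacks straightforward.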
Realtime models also support tool calling, but the experience varies wildly by provider. Some models block silently while waiting for a tool result. Newer versions support non-blocking calls, but reliability is measurably lower than pipeline LLMs because the model is simultaneously hearing, reasoning, and speaking.
In Famulor you implement tool calls via Tools & Functions or Custom Mid-Call Tools — wired through the internal automation platform (300+ integrations, similar to Zapier or Make).
Turn Detection: When Did the Caller Stop Talking?
One of the trickiest problems in voice AI: when is the caller done — versus when are they just thinking? Get it wrong and you either interrupt or leave dead air.
Realtime models rely on built-in end-of-turn detection. It often works well, but you're stuck with what the provider exposes. Tuning is limited.
Pipelines give you more freedom:
- Pick your own turn-detection model
- Combine VAD (Voice Activity Detection) with semantic turn detection
- Tune adaptive interruption handling
- Set sensitivity per use case (helpdesk vs. healthcare vs. outbound)
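To make the VAD-plus-semantic idea concrete, here is a minimal sketch. The thresholds and the trailing-word heuristic are illustrative stand-ins, not Famulor's actual turn-detection model:

```python
# Minimal sketch: combine a VAD silence timer with a cheap semantic check.
# Thresholds and the heuristic are illustrative, not a production model.

TRAILING_WORDS_THAT_CONTINUE = {"and", "but", "so", "um", "because"}

def looks_semantically_complete(transcript: str) -> bool:
    """Cheap stand-in for a semantic end-of-turn model."""
    words = transcript.strip().rstrip(".?!").lower().split()
    return bool(words) and words[-1] not in TRAILING_WORDS_THAT_CONTINUE

def end_of_turn(silence_ms: int, transcript: str,
                base_silence_ms: int = 400, patient_mode: bool = False) -> bool:
    """End the turn on silence — sooner if the sentence looks finished,
    later for use cases that need longer pause tolerance."""
    budget = base_silence_ms * (2 if patient_mode else 1)
    if not looks_semantically_complete(transcript):
        budget *= 2  # caller trailed off mid-thought; wait longer
    return silence_ms >= budget

print(end_of_turn(500, "I'd like to book an appointment"))           # True
print(end_of_turn(500, "I'd like to book an appointment and"))       # False
print(end_of_turn(500, "my birthday is in May", patient_mode=True))  # False
```

The `patient_mode` flag mirrors the per-use-case sensitivity point above: the same silence reading ends the turn in a B2B callback but keeps listening on a healthcare line.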
In Famulor, turn detection is preconfigured but tunable per voice agent in the General Settings — useful when older patients on the phone need longer pause tolerance than B2B sales callbacks.
Voice Quality and Conversational Feel: Realtime's Sweet Spot
This is where realtime has its most interesting advantage — and one that's hard to quantify. When audio is transcribed, a lot of information disappears: no tone, no emotional coloring, no hesitation. The LLM only sees words.
A realtime model hears all of that. It can react to tone — if someone sounds frustrated, it can respond with more empathy without the text instructing it to.
That said, modern TTS engines are catching up fast. Cartesia Sonic, ElevenLabs v3, and Gemini 3.1 Flash TTS produce speech with breathing, laughter, and emotional inflection. Pipelines can sound great — they just work with less information about how the user spoke. Voice picking guidance is in Voice Selection.
Control, Modularity, and Debugging: Where Pipelines Shine
This is the clearest pipeline advantage — and the biggest practical limitation of realtime models in production.
A pipeline is transparent by design. Text sits between every stage. You can log exactly what was transcribed, what the LLM produced, and what was synthesized. When something goes sideways — a misheard word, an off-target response — you can pinpoint the issue. Famulor logs every call including transcript and LLM output for exactly this reason.
Pipelines are also easy to reconfigure:
- Swap STT providers without touching your LLM prompt
- Switch TTS voice without changing anything else
- Move from GPT-4o to Claude because a use case fits Anthropic better
Realtime models are opaque: audio in, audio out. You can't swap components, and you're locked into one provider's ecosystem.
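The swap-one-layer idea can be sketched as a stack config. The provider names match the article's examples, but the config shape itself is an illustration, not Famulor's actual API:

```python
from dataclasses import dataclass, replace

# Illustrative stack config — an assumption for this sketch,
# not Famulor's real configuration schema.
@dataclass(frozen=True)
class VoiceStack:
    stt: str
    llm: str
    tts: str

telephony = VoiceStack(stt="deepgram-nova", llm="gpt-4o", tts="cartesia-sonic")

# Swap exactly one layer without touching the others:
brand_voice = replace(telephony, tts="elevenlabs-v3")   # new voice, same brain
claude_run = replace(telephony, llm="claude")           # new LLM, same voice

print(brand_voice)
```

With a realtime model there is no equivalent of `replace` here: changing the voice or the reasoning model means changing the entire model.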
Cost: Realtime Is Harder to Control
Realtime APIs typically bill per second of audio in and audio out. That makes cost scale directly with conversation length — and as system prompts and history grow, costs become hard to predict.
Pipelines let you optimize per layer:
- Lightweight LLM for simple queries, premium LLM for complex ones (routing!)
- Cost-efficient STT for high-volume transcription
- Premium TTS only where voice quality matters (brand voice use cases)
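The routing point above can be sketched in a few lines. Prices and the intent list are made-up placeholders, not actual provider rates:

```python
# Illustrative router: cheap model for routine intents, premium for the rest.
# Prices per 1K input tokens and the intent set are placeholder assumptions.
PRICE_PER_1K_TOKENS = {"gpt-4o-mini": 0.00015, "gpt-4o": 0.0025}
ROUTINE_INTENTS = {"opening_hours", "order_status", "address_change"}

def pick_model(intent: str) -> str:
    return "gpt-4o-mini" if intent in ROUTINE_INTENTS else "gpt-4o"

def turn_cost(intent: str, tokens: int) -> float:
    model = pick_model(intent)
    return tokens / 1000 * PRICE_PER_1K_TOKENS[model]

# A 500-token routine turn vs. an escalation of the same size:
print(f"routine:    ${turn_cost('order_status', 500):.6f}")
print(f"escalation: ${turn_cost('complaint_escalation', 500):.6f}")
```

The exact numbers matter less than the mechanism: in a pipeline, this routing decision is a one-line function; in a realtime model, every second of audio runs through the same premium model.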
Famulor uses transparent per-minute pricing. The pricing page has the current model — and the Twilio calculator lets you model telephony cost precisely upfront.
| Cost Dimension | Realtime / S2S | Pipeline (Famulor) |
|---|---|---|
| Billing | Per second of audio in + out | Per minute, summed across components |
| Optimization | Hard — everything in one model | Per layer, individually tunable |
| Routing cheap vs. premium models | Not possible | Standard pattern in Famulor |
| Predictability | Low (token usage opaque) | High (clean per-minute pricing) |
| Telephony surcharges | Variable | Modeled exactly via Twilio or Telnyx calculator |
Telephony: The Underrated Factor
This one catches teams off guard. Traditional phone networks carry audio at 8 kHz using codecs like G.711 (PCMU/PCMA). Realtime models, however, are trained on 16–48 kHz web audio (WebRTC, Opus). The mismatch means worse recognition, muffled TTS output, and more misunderstandings.
For phone-based deployments — AI call centers, IVR replacement, outbound dialing — pipelines with telephony-optimized STT are the more reliable choice. Famulor supports SIP trunking natively, so your existing VoIP/PBX provider plugs in (Telnyx, Twilio, Sipgate, your own Asterisk).
If your voice agent only runs inside a web widget (no phone), the telephony argument is moot — latency and voice quality become the deciders.
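The sample-rate mismatch is easy to see in miniature. Naive upsampling matches the rate a wideband model expects but cannot restore the frequencies the phone channel already discarded, which is why telephony-trained STT beats resampled web-audio models on phone calls:

```python
# Minimal sketch: naive 8 kHz → 16 kHz upsampling by linear interpolation.
# Doubling the sample rate does not bring back the high frequencies the
# 8 kHz channel never carried — the audio stays narrowband.

def upsample_2x(samples: list[float]) -> list[float]:
    out = []
    for a, b in zip(samples, samples[1:]):
        out.append(a)
        out.append((a + b) / 2)  # interpolated midpoint, no new information
    out.append(samples[-1])
    return out

narrowband = [0.0, 0.5, 1.0, 0.5]    # toy 8 kHz frame
wideband = upsample_2x(narrowband)   # 16 kHz-rate frame, same spectral content
print(len(narrowband), len(wideband))  # 4 7
```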
Compliance: Where Pipelines Become Non-Negotiable
For regulated industries — healthcare, financial services, legal, government — compliance is not an add-on, it's a hard requirement. Pipelines offer:
- Component selection per region (e.g. EU-hosted STT/LLM for GDPR)
- PII redaction at the text layer before data hits the LLM or logs
- Components with HIPAA, GDPR, SOC 2, ISO 27001 certifications available
- Audit logs at every step: what was said, what was transcribed, what was generated
Realtime models, by contrast, are mostly hosted by a handful of US hyperscalers in centralized data centers. Audio in, audio out — content filtering, PII redaction, and detailed audit logging are far harder to enforce.
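Text-layer PII redaction, the second bullet above, is only possible because a transcript exists between the stages. A minimal sketch, with patterns that are illustrative and far from production-complete (real deployments combine NER models with pattern matching):

```python
import re

# Illustrative PII patterns — a sketch, not a production redaction layer.
# Order matters: redact IBANs before the greedier phone pattern.
PATTERNS = {
    "IBAN":  re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d ()/-]{6,}\d"),
}

def redact(transcript: str) -> str:
    """Replace PII with typed placeholders before the text reaches the LLM or logs."""
    for label, pattern in PATTERNS.items():
        transcript = pattern.sub(f"[{label}]", transcript)
    return transcript

redacted = redact("My IBAN is DE89370400440532013000, call me at +49 30 123456.")
print(redacted)
# → My IBAN is [IBAN], call me at [PHONE].
```

In a pure realtime setup there is no text hop where this function could run: the audio with the IBAN in it goes straight to the model.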
Famulor offers EU hosting, GDPR-compliant setups, and transparent data flow — see Understanding Billing for the operational side and Industries for regulated-industry examples.
Hybrid: The Best of Both Worlds
You don't have to commit to one architecture. Two hybrid patterns we see regularly with Famulor customers:
Realtime + separate STT
Need a reliable, timestamped transcript for compliance, QA, or downstream analytics? Let the realtime model handle audio reasoning and run a dedicated STT in parallel for transcription. The two streams stay independent.
Realtime + separate TTS ("Half-Cascade")
Use the realtime model for audio input (preserving its ability to hear tone, hesitation, emotion) but output text instead of audio. The text routes through a dedicated TTS engine of your choice — full control over voice, including voice cloning, brand voices, and scripted speech.
In Famulor you can configure both hybrid modes per voice agent — the Flow Builder handles this without code.
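The half-cascade turn loop can be sketched in a few lines. Every function here is a hypothetical stand-in, not Famulor's or any provider's real API:

```python
# Half-cascade sketch: realtime model for audio *input* (it still hears
# tone), emitting text that a dedicated TTS engine speaks.
# All functions are hypothetical stand-ins for illustration only.

def realtime_understand(audio_chunk: bytes) -> dict:
    """Stand-in for an S2S model configured for text output."""
    return {"text": "I hear you're frustrated — let me fix that.",
            "detected_tone": "frustrated"}

def tts_speak(text: str, voice: str) -> bytes:
    """Stand-in for a dedicated TTS engine (brand voice, cloning, etc.)."""
    return f"<{voice}>{text}</{voice}>".encode()

def half_cascade_turn(audio_chunk: bytes, brand_voice: str) -> bytes:
    turn = realtime_understand(audio_chunk)      # audio in: tone preserved
    return tts_speak(turn["text"], brand_voice)  # controlled voice out

audio_out = half_cascade_turn(b"...caller audio...", brand_voice="acme-voice")
print(audio_out.decode().startswith("<acme-voice>"))  # True
```

The split keeps the emotional signal on the input side while the output side stays fully controllable, which is exactly the trade the half-cascade is designed to make.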
Comparison Matrix
| Criterion | STT–LLM–TTS Pipeline | Realtime / S2S |
|---|---|---|
| Getting started | More components to orchestrate | Simpler initial integration |
| Latency | Conversation-grade with tuning | Structurally faster |
| Turn detection | Full context-aware support | Built-in, limited customization |
| Voice naturalness | Excellent with modern TTS | Prosodic awareness |
| Modularity / Debugging | Fully modular and inspectable | Opaque; LLM/voice choice limited |
| Tool calling | Mature text-based function calls | Supported, varies by provider |
| Customization | Highly configurable | Limited to what the model exposes |
| Cost | Optimize each layer | Hard to optimize |
| Compliance / EU hosting | Full data flow control | Centralized; data residency varies |
| 8 kHz telephony | Optimized via STT choice | Mismatch with 16–48 kHz training |
Which Architecture for Which Use Case?
| Use Case | Recommendation | Why |
|---|---|---|
| Inbound healthcare hotline | Pipeline (Famulor) | GDPR/HIPAA, EU hosting, PII redaction, audit logs |
| Outbound sales calls | Pipeline (Famulor) | SIP trunking, telephony-optimized STT, CRM tool calls |
| Empathy-heavy consumer bot | Hybrid (Realtime + TTS) | Audio-in for tone detection, controlled brand voice |
| E-commerce web widget | Pipeline or Realtime | WebRTC = no 8 kHz issue; compliance need decides |
| Law firm intake | Pipeline (Famulor) | Audit logs, regional hosting, conflict-check tool calls |
| Quick MVP / Prototype | Realtime | Fastest start with fewest components — migrate later |
Implementing With Famulor — Step by Step
- Define the use case: inbound vs. outbound, industry, primary language, compliance level. Industry-specific examples on the Industries page.
- Create the voice agent: in the No-Code editor, start from a template. Initial message, system prompt, tools — all without code.
- Pick STT/LLM/TTS stack: for English telephony we typically recommend Deepgram Nova + GPT-4o + Cartesia Sonic. For premium brand voices: Deepgram + Claude + ElevenLabs v3. Famulor swaps stacks with one click.
- Wire up tools/integrations: CRM, calendar, helpdesk via Famulor integrations or webhook.
- Phone number via SIP: bring your own VoIP number or provision one through Famulor. Model Twilio/Telnyx cost upfront with the Twilio calculator.
- Test: in the browser tester first, then on real telephony. Famulor logs every call including transcript for iteration.
- Go live: activate inbound webhooks, set up outbound campaigns, push reporting to your CRM.
Best Practices & Common Mistakes
- Don't: decide architecture from a browser demo. Demos run on WebRTC with perfect audio — phone is a different planet.
- Do: define latency budgets per use case. Healthcare tolerates ≈800 ms; outbound sales must stay under 600 ms.
- Don't: let your system prompt grow unbounded. Every turn re-sends it — cost and latency both climb.
- Do: use knowledge bases instead of stuffing the prompt. Standard pattern in Famulor.
- Don't: test realtime if your use case is primarily phone. Validate a pipeline setup with telephony-optimized STT first.
- Do: add filler audio when tool calls take more than 1 second. See Filler Audio in the docs.
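The unbounded-system-prompt pitfall from the list above can be made concrete with rough arithmetic. Token prices and counts are placeholder assumptions:

```python
# Rough cost arithmetic for a growing system prompt. The price and token
# counts are illustrative placeholders, not real provider rates.
PRICE_PER_1K_INPUT_TOKENS = 0.0025

def call_prompt_cost(system_prompt_tokens: int, turns: int,
                     avg_turn_tokens: int = 150) -> float:
    """The system prompt is re-sent on every turn, so its input cost
    scales linearly with conversation length."""
    total_input = turns * (system_prompt_tokens + avg_turn_tokens)
    return total_input / 1000 * PRICE_PER_1K_INPUT_TOKENS

lean = call_prompt_cost(system_prompt_tokens=800, turns=20)
bloated = call_prompt_cost(system_prompt_tokens=6000, turns=20)
print(f"lean: ${lean:.4f}  bloated: ${bloated:.4f}  ratio: {bloated / lean:.1f}x")
```

A knowledge base sidesteps this because retrieved snippets are fetched per question instead of riding along in every single turn.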
Industry Examples From Real Famulor Deployments
- Dr. Becker dental practice (60 staff): inbound appointment hotline. Pipeline with GPT-4o + Cartesia Sonic, Cal.com integration, EU hosting. Result: hang-up rate dropped from 22% to 6%.
- Berlin real estate brokerage: outbound qualification of listing inquiries. Pipeline with Claude + Cartesia Ink + ElevenLabs, GoHighLevel sync. Result: 3× more qualified first conversations per day.
- Shopify shop (DACH, 8,000 orders/month): web widget plus phone. Pipeline with GPT-4o-mini for standard inquiries, GPT-4o for escalations. Result: 45% of tickets resolved without a human agent.
- Law firm intake: inbound with EU hosting, PII redaction before LLM call, audit log per turn. Pipeline with German STT, Claude, MiniMax TTS.
Conclusion: For Most Production Voice Agents, a Modern Pipeline Wins
Realtime models are technically impressive and have real advantages in WebRTC consumer use cases. But for the overwhelming majority of production voice agent deployments — telephony, regulated industries, controlled cost, multilingual setups, granular tool calls — a modern streaming pipeline is the clear winner. You keep control, modularity, compliance, and cost predictability.
With Famulor you get exactly that pipeline architecture as a no-code platform: 40+ languages, SIP trunking, 300+ integrations, swappable STT/LLM/TTS components, EU hosting, transparent per-minute pricing — and optional hybrid realtime modes for emotion-sensitive use cases. Check the pricing or jump straight into a voice agent for your setup.
FAQ — Common Questions on Realtime vs. Pipeline Voice Agents
What is the main difference between realtime and pipeline voice agents?
Realtime (speech-to-speech) models process audio in a single multimodal model. Pipeline (cascade) architectures use three specialist models: STT for transcription, LLM for reasoning, TTS for output. Pipelines are more controllable; realtime is structurally lower-latency.
Which architecture does Famulor use?
Famulor uses a modern streaming pipeline with overlap between STT, LLM, and TTS — and optionally supports hybrid realtime setups for emotion-sensitive consumer use cases.
When should I pick realtime over pipeline?
If your voice agent runs only in a web widget (no phone), you don't have strict GDPR or HIPAA requirements, and emotional tone detection matters more than tool-calling reliability. For telephony and regulated industries, pipeline wins.
Does realtime work well with traditional telephony?
Limited. Telephony carries 8 kHz audio; realtime models are trained on 16–48 kHz web audio. The mismatch measurably degrades recognition and TTS quality. For phone, pipelines with telephony-optimized STT (e.g. via Famulor) are more reliable.
How fast can a pipeline voice agent be?
With streaming overlap, modern pipelines hit conversational latencies under 700 ms. Famulor optimizes STT streaming, LLM token streaming, and TTS streaming so first reaction lands around 500–800 ms — depending on use case and stack.
Can I switch STT, LLM, and TTS providers in Famulor?
Yes. Famulor supports multiple STT providers (Cartesia, Deepgram, Gladia), multiple LLMs (GPT-4o, Claude, Gemini), and multiple TTS engines (Cartesia Sonic, ElevenLabs, MiniMax, Gemini TTS). Switch with one click in the voice selection setting.
What does a pipeline voice agent cost compared to realtime?
Pipelines allow per-layer optimization: cheaper LLMs for standard queries, premium LLMs only for escalations. Realtime APIs are billed per audio second and are hard to optimize. Famulor uses transparent per-minute pricing — see pricing.
Is GDPR or HIPAA compliance possible with realtime?
Hard. Realtime models are mostly hosted by US hyperscalers without EU data residency. Pipelines allow EU components, PII redaction before the LLM call, and audit logs at every stage. Famulor offers EU hosting by default.
How does tool calling work in Famulor?
Famulor uses text-based function calling at the LLM layer — the mature mechanism familiar from chat apps. Define tools in the Tools & Functions editor, optionally as Custom Mid-Call Tools via webhook or directly inside the Famulor automation platform.
Can I migrate from pipeline to realtime later?
Yes. In Famulor the architecture is a per-voice-agent configuration. Most teams start with pipeline for stability and compliance — and add hybrid realtime modes for selected use cases later, without rebuilding tools or integrations.
Related blog posts
- Gemini 3.1 Flash TTS Prompting Guide for AI Voice Agents
- Full Control Over Your Phone Numbers with Famulor BYOC