GPT-Realtime-Translate for AI Phone Agents: Live Call Translation in 70+ Languages
In May 2026, OpenAI released GPT-Realtime-Translate, the first production-grade speech-to-speech translation model that keeps conversational pace with the speaker — 70+ input languages, 13 output languages, and sub-second end-to-end latency. For AI phone agents this changes the economics of multilingual customer service. One bot can now answer a call in Polish, translate the conversation live into German for a human agent on the other end, and stream the German reply back to the caller in Polish — without separate translators, without three stitched-together models, and without the audible delay that has made real-time interpretation impractical for telephony. If you plan to automate global voice contact in 2026, you should understand exactly what GPT-Realtime-Translate does, where it stops, and how it plugs into a voice-AI platform like Famulor.
What GPT-Realtime-Translate is — and what it is not
GPT-Realtime-Translate is part of the expanded OpenAI Realtime API and runs in the same audio-in, audio-out mode as GPT-Realtime-2. It takes a continuous audio stream of a speaker in one of the 70+ supported input languages and emits, in near real time, an audio stream in one of 13 output languages. Unlike a traditional pipeline that chains speech-to-text, machine translation, and text-to-speech, the translation happens inside a single model — which dramatically cuts latency and model-to-model loss of meaning.
Worth saying clearly: GPT-Realtime-Translate is not a conversational model. It will not answer questions, will not call tools, and will not negotiate appointments. It translates — and that is all. Anyone building a full-fledged voice agent that books, qualifies, classifies, or transacts will combine Translate with a reasoning model such as GPT-Realtime-2, or switch between the two modes depending on the conversation phase. Famulor handles that orchestration inside its flow-builder and seamlessly transitions between translate, dialogue, and tool-use modes mid-call.
Traditional STT-MT-TTS pipeline vs. GPT-Realtime-Translate
Until now, live translation in a telephony context required a three-stage pipeline: Deepgram or Whisper for STT, a translation model such as Google Translate, DeepL, or GPT-4, and finally a TTS engine such as ElevenLabs or Cartesia Sonic. It works — but it costs latency and it costs prosody, because the translated audio is regenerated from scratch and the speaker's pacing, emphasis, and emotional color are lost on the way.
| Criterion | Traditional STT → MT → TTS | GPT-Realtime-Translate |
|---|---|---|
| End-to-end latency | 1,500–2,500 ms | 400–900 ms |
| Model count | 3 (STT + MT + TTS) | 1 |
| Input languages | 50–95 (depends on STT) | 70+ |
| Output languages | near-unlimited (depends on TTS) | 13 |
| Prosody and emphasis | largely lost | largely preserved |
| Billing | per STT minute + per MT token + per TTS character | per minute (single bill) |
| Code complexity | high (stream sync, error handling) | low (one endpoint) |
| Tool-calling during translation | possible (via separate LLM) | not directly |
The takeaway for telephony: traditional pipelines still make sense when you need a rare output language or when tool calls must fire inside the translation step. For everything else — and that covers the majority of multilingual phone use cases in 2026 — Realtime-Translate is both faster and cheaper.
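The decision rule in that takeaway can be written down in a few lines. The following is a minimal sketch, not Famulor's or OpenAI's actual routing logic: the function and class names are hypothetical, and the language set contains only the eleven output languages named in the FAQ of this article (as ISO 639-1 codes), not the full list of 13.

```python
from dataclasses import dataclass

# Assumption: illustrative subset of the 13 output languages, taken from the
# FAQ in this article. The authoritative list is in OpenAI's documentation.
TRANSLATE_OUTPUT_LANGS = {
    "en", "de", "fr", "es", "it", "pt", "nl", "pl", "ja", "ko", "zh",
}


@dataclass(frozen=True)
class CallRequirements:
    """Hypothetical description of what one translated call needs."""
    output_lang: str                        # ISO 639-1 code of the target language
    tools_during_translation: bool = False  # must tool calls fire mid-translation?


def choose_engine(req: CallRequirements) -> str:
    """Return 'realtime-translate' or 'stt-mt-tts' for one call."""
    # Tool calls inside the translation step need the classic pipeline.
    if req.tools_during_translation:
        return "stt-mt-tts"
    # So does any output language outside the supported set.
    if req.output_lang not in TRANSLATE_OUTPUT_LANGS:
        return "stt-mt-tts"
    return "realtime-translate"
```

For example, `choose_engine(CallRequirements("de"))` selects the single-model path, while a Slovenian target (`"sl"`) falls through to the classic pipeline.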
Implementing GPT-Realtime-Translate with Famulor, step by step
Famulor was built to orchestrate multiple voice models inside a single call. GPT-Realtime-Translate is wired in through the OpenAI integration and configured in the assistant builder as "Translation Mode". A typical inbound setup for global customers looks like this:
- Provision the SIP trunk. Connect an international phone number through Twilio or attach a local Telnyx or Plivo SIP trunk. Famulor supports any standards-compliant VoIP or PBX provider — no vendor lock-in.
- Detect the input language. Inside the Famulor flow, a 1.5-second language detection runs on the first audio frames. If the caller speaks Polish, the flow automatically switches into Translate mode with input "pl" and output "de".
- Activate the Translate model. In the assistant setup, GPT-Realtime-Translate is selected as the audio engine; the target language is bound to a flow variable so it can be set dynamically.
- Bridge to a human agent. Famulor streams the translated audio to a human team member through intelligent call forwarding. The agent hears German; the caller hears Polish.
- Fallback for unsupported output languages. If the caller's preferred language is not on the 13-output list (for example, Slovenian or Hindi), Famulor automatically falls back to the classic STT-MT-TTS pipeline using Cartesia Sonic or ElevenLabs.
- Reporting and QA. Every translation session is transcribed in both languages and stored. Compliance teams can spot-check the original alongside the translation.
The entire process is configurable in the no-code builder in under 30 minutes. For more advanced setups — for example, when CRM tool calls must fire after the translation step — the flow is augmented with a GPT-Realtime-2 node that takes over the conversation as soon as the caller requests a concrete action.
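Conceptually, the hand-off between translation and the GPT-Realtime-2 node is a small state machine. The sketch below illustrates that idea only; the event names (`action_requested`, `action_completed`) and the `CallFlow` class are assumptions made for this example and are not part of the Famulor flow-builder or the OpenAI API.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    TRANSLATE = "translate"  # pure live translation
    DIALOGUE = "dialogue"    # reasoning model that can call tools


class CallFlow:
    """Hypothetical per-call mode tracker."""

    def __init__(self) -> None:
        self.mode = Mode.TRANSLATE
        self.history: List[Mode] = [Mode.TRANSLATE]

    def _switch(self, mode: Mode) -> None:
        # Record a transition only when the mode actually changes.
        if mode is not self.mode:
            self.mode = mode
            self.history.append(mode)

    def on_event(self, event: str) -> Mode:
        """Update the mode for one flow event and return the active mode."""
        if event == "action_requested":     # caller asks for a booking, lookup, ...
            self._switch(Mode.DIALOGUE)
        elif event == "action_completed":   # tool call finished, resume translating
            self._switch(Mode.TRANSLATE)
        return self.mode                    # any other event leaves the mode unchanged
```

A call that books one appointment would pass through `TRANSLATE`, `DIALOGUE`, and back to `TRANSLATE`, which is exactly what `history` records.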
Industry examples where live translation pays for itself immediately
Hospitality. A 60-room boutique hotel in Vienna with 65% international guests handles after-hours reservation and concierge calls. Before: three languages during the day, English on weekends, everything else went to voicemail. With GPT-Realtime-Translate plus the Famulor inbound bot for hospitality, inquiries in Polish, Czech, Arabic, and Russian are translated live into German, handled by the night concierge, and translated back. Direct-booking conversion from non-English markets: +38% in the first quarter.
Contact center for e-commerce. A DACH-region online retailer expands into Spain, France, Italy, and Poland. Instead of building four local support teams, the Famulor contact-center agent takes inbound calls in the local language, translates them for the German support team, and returns the reply. Headcount expansion is avoided; time-to-market drops from nine months to three weeks.
Roadside assistance and emergency dispatch. Roadside assistance providers contracted across the EU regularly receive calls in languages their dispatchers do not speak. Old workflow: guess the GPS location, request a callback in another language. With Realtime-Translate, the dispatcher hears the address in German on the first try, dispatches the tow truck immediately, and the caller stays on the line.
Home care and social services. Outpatient care providers serving multicultural neighborhoods can conduct intake calls and routine check-ins in Arabic, Turkish, or Ukrainian. Care coordinators in the back office still operate in German, with the entire conversation transcribed bilingually for the patient record.
Tax and legal advisory with international clients. A mid-sized accounting firm serving Polish, Italian, and French business clients used to limit first-contact intake to half-day bilingual reception. A Famulor inbound agent in Translate mode now covers every after-hours call, summarizes the intent in a structured email for the responsible accountant, and logs the interaction in the CRM. First-contact rate outside business hours: from 12% to 71%.
Travel and tour operators. Travel agencies with direct customers across multiple EU markets can handle cancellations, rebookings, and service inquiries around the clock in the traveller's native language — without building local service teams in every market. Particularly valuable during weather or strike disruptions when call volume spikes suddenly.
Cost and ROI: when the switch pays off
OpenAI bills GPT-Realtime-Translate per minute, not per token like the reasoning models. That makes cost planning for telephony predictable: a five-minute translated conversation costs a fixed amount regardless of how much speech occurred. Traditional STT-MT-TTS pipelines come in 15–40% more expensive per minute, and their extra latency adds a 2–4% call-abandonment rate as impatient callers hang up.
Before migrating, model your call volume by language. If a market generates fewer than 200 translated minutes per month, the setup effort may not pay back. If the volume exceeds 800 minutes, the ROI is almost always positive. Between those two thresholds the outcome depends on your per-minute costs, so run the numbers for each market.
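As a rough illustration of that volume check, the sketch below sorts markets by their monthly translated minutes. The 200- and 800-minute thresholds come from the paragraph above; the verdict for the band in between is an assumption of this example, and the per-language volumes are invented sample figures, not measured data.

```python
def classify_market(translated_minutes_per_month: float) -> str:
    """Rough ROI verdict for one language market."""
    if translated_minutes_per_month < 200:
        return "unlikely to pay back"
    if translated_minutes_per_month > 800:
        return "likely positive"
    # Assumption: volumes between the two thresholds need a detailed model.
    return "model in detail"


# Hypothetical monthly volumes per caller language (ISO 639-1 codes).
volumes = {"pl": 950, "it": 420, "cs": 120}

plan = {lang: classify_market(minutes) for lang, minutes in volumes.items()}
```

With these sample figures, Polish clears the upper threshold, Czech stays below the lower one, and Italian lands in the band that needs a closer look.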
Best practices and common mistakes
Detect the input language fast and stably. The detector must commit in under two seconds, otherwise the caller hears a delayed "hello". Famulor uses a dual-path approach: a fast, lower-precision detector runs in the first 800 ms while a slower, more accurate detector validates the result — if the precise detector revises the language, the user perceives at most a one-syllable hiccup.
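The dual-path idea can be sketched as follows. This is a simplified, sequential illustration rather than Famulor's implementation: a production system would run both detectors concurrently on the live stream. The detector callables are placeholders for whatever models are used, and the audio format (8 kHz, 16-bit mono PCM) is an assumption typical for telephony.

```python
from typing import Callable, Tuple

BYTES_PER_MS = 16  # assumption: 8 kHz * 2 bytes per sample = 16 bytes per ms

Detector = Callable[[bytes], str]  # takes raw audio, returns a language code


def first_ms(audio: bytes, ms: int) -> bytes:
    """Return the first `ms` milliseconds of the audio buffer."""
    return audio[: ms * BYTES_PER_MS]


def detect_language(
    audio: bytes, fast: Detector, accurate: Detector
) -> Tuple[str, str]:
    """Return (provisional, confirmed) language codes."""
    # Fast path: commit on the first 800 ms so the caller hears no delay.
    provisional = fast(first_ms(audio, 800))
    # Slow path: validate on a longer window, capped at the two-second budget.
    confirmed = accurate(first_ms(audio, 2000))
    return provisional, confirmed


# Stub detectors standing in for real models (hypothetical behaviour).
def fast_stub(chunk: bytes) -> str:
    return "cs"


def accurate_stub(chunk: bytes) -> str:
    return "pl"


provisional, confirmed = detect_language(bytes(32000), fast_stub, accurate_stub)
needs_switch = provisional != confirmed  # True here: the flow re-targets to "pl"
```

Returning both values lets the flow start immediately on the provisional language and switch only when the confirmed one differs.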
Audit cultural register and idiomatic phrasing. Translate preserves semantic meaning, but cultural register, regional forms of address, and industry-specific terminology can drift. Set up a glossary in the Famulor knowledge base for required translations ("IBAN" stays "IBAN"; "cancellation fee" must always render as "Stornogebühr" in German).
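A glossary like this can also be enforced after the fact on the bilingual transcripts. The sketch below is one possible QA check, not a Famulor feature: the function name is hypothetical, the two glossary entries are the examples from the paragraph above, and the plain substring match is a deliberate simplification that ignores inflected forms.

```python
from typing import Dict, List

# Source term -> required rendering in the German translation.
GLOSSARY: Dict[str, str] = {
    "IBAN": "IBAN",
    "cancellation fee": "Stornogebühr",
}


def glossary_violations(
    source_text: str, translated_text: str, glossary: Dict[str, str] = GLOSSARY
) -> List[str]:
    """Return source terms whose required translation is missing."""
    src = source_text.casefold()
    dst = translated_text.casefold()
    violations = []
    for term, required in glossary.items():
        # Flag the term only if it occurs in the source but its mandated
        # rendering does not occur anywhere in the translation.
        if term.casefold() in src and required.casefold() not in dst:
            violations.append(term)
    return violations
```

Running this over each stored transcript pair gives compliance teams a short list of calls to spot-check instead of a full manual review.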
Disclose AI use at the start of the call. Under Article 50 of the EU AI Act (transparency obligations, applicable from 2 August 2026), callers must be told they are interacting with an AI system, even if the AI is only translating. Famulor ships pre-built disclosure templates in all 70+ input languages and can play the disclosure automatically based on the detected language.
Plan for the 13-output-language ceiling. If your main market's output language is not in the Translate list (for example, Norwegian or Hindi), the flow has to route to a classic pipeline. That fallback path must be tested before go-live, or you will lose exactly the calls the new service was meant to win.
Never translate twice. A common MVP bug: Translate converts the caller's audio into German, the human agent responds in German, and that response is pushed through the same caller-to-agent translation a second time even though it is already in the target language. Famulor recognises the speaker role and sends each leg through the translation model exactly once: caller audio toward the agent's language, agent audio back toward the caller's language.
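One way to sketch role-aware routing, assuming every audio frame carries a speaker label, a language code, and a flag that marks already-translated audio. All names here are hypothetical; the point is only that each frame gets at most one translation pass, in the direction that matches its speaker.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AudioFrame:
    speaker: str              # "caller" or "agent"
    lang: str                 # language of the audio in this frame
    translated: bool = False  # True if the frame came out of the translator


def route(
    frame: AudioFrame, caller_lang: str, agent_lang: str
) -> Optional[Tuple[str, str]]:
    """Return (source, target) for translation, or None to pass through."""
    # Output of the translator must never be fed back into it.
    if frame.translated:
        return None
    # Translate toward the language of the party on the other end.
    target = agent_lang if frame.speaker == "caller" else caller_lang
    # Audio that is already in the listener's language needs no translation.
    if frame.lang == target:
        return None
    return (frame.lang, target)
```

For a Polish caller and a German agent, caller frames are routed as `("pl", "de")`, agent frames as `("de", "pl")`, and anything already translated passes through untouched.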
When GPT-Realtime-Translate is not the right tool
Translate is not for outbound sales calls where the AI agent argues, handles objections, and books appointments — that requires GPT-Realtime-2 or a comparable conversational stack with tool calls. Translate is also unsuitable for highly regulated settings where translation must be notarised or signed off by a certified interpreter (court proceedings, asylum hearings). And anyone covering hyper-regional dialects or extremely rare languages will often be better served by specialist STT vendors plus a manual glossary than by the generalised Translate model.
How Famulor differs from competing voice-AI platforms
Vapi, Bland, and Retell offer OpenAI Realtime connectivity but do not ship a built-in translation orchestration layer with language detection, fallback routing, and disclosure templates. Synthflow and PolyAI focus more squarely on pure conversational AI and treat translation as an edge case. Famulor differs on three counts: first, EU hosting with full data processing agreements for DACH customers; second, the native no-code flow-builder that orchestrates translate mode, tool calls, and human handoff in the same interface; and third, the 300+ integrations that write translated content directly into CRMs, helpdesks, or booking systems without custom code. We have written more about this multi-engine architecture in our language and accent diversity guide.
Conclusion: if you plan to scale globally, switch now
GPT-Realtime-Translate is the first production-ready live translation model that is fast enough for real phone calls and cheap enough for any mid-market use case. If you currently receive calls in more than three languages and handle them with localised teams, voicemail, or stitched translation pipelines, you can stand up a Translate-enabled AI agent in Famulor within days, cut your cost per translated call by 20–35%, and respond noticeably faster. The next step is a 15-minute setup call where we map your language mix, call volume, and SIP setup together.
FAQ
Which languages does GPT-Realtime-Translate accept as input?
More than 70 languages, including every major European language plus Arabic, Turkish, Russian, Mandarin, Hindi, Japanese, and Korean. OpenAI publishes the complete list in the Realtime API documentation.
Which languages can GPT-Realtime-Translate speak as output?
13 output languages, including English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Japanese, Korean, and Mandarin. For unsupported output languages, Famulor automatically routes to a classic STT-MT-TTS pipeline.
How fast is the translation in a real phone call?
End-to-end latency sits between 400 and 900 milliseconds depending on network conditions and audio codec. That stays well under the threshold for natural conversational pacing.
What does a translated phone minute cost with GPT-Realtime-Translate?
OpenAI charges per minute in the low single-cent range, undercutting most classic STT-MT-TTS setups. In Famulor, telephony costs (Twilio, Telnyx) and the Famulor platform fee are added on top; the Famulor ROI calculator models this in detail.
Is a Translate session compliant with the EU AI Act and GDPR?
Yes, provided the AI disclosure is played at the start of the call and audio is processed inside a GDPR-compliant infrastructure. Famulor hosts in the EU, signs data processing agreements, and ships disclosure templates in all input languages.
Can the AI agent fire tools (such as booking an appointment) while translating?
GPT-Realtime-Translate itself does not support tool calls. Famulor solves this in the flow-builder through mode-switching: when the caller requests a concrete action, the flow hands off to GPT-Realtime-2, which executes the tool calls, and then returns to Translate mode.
Does this work on landline numbers as well, or only on VoIP apps?
It works on both. Famulor connects to any SIP trunk, so landline and mobile numbers are covered through Twilio, Telnyx, Plivo, or a local PBX provider. The caller does not need a special app.
What happens with background noise or poor audio quality?
GPT-Realtime-Translate is trained on realistic audio environments and handles street noise or mobile compression well. On very poor connections, Famulor automatically inserts a reconfirmation ("Could you repeat that, please?") before the translation is passed to the agent.
How many concurrent translation calls can Famulor handle?
Famulor scales horizontally — concurrency is bounded only by your plan and SIP capacity. For 50–500 concurrent inbound translate calls the standard OpenAI Realtime API limits are sufficient; higher volumes are coordinated through enterprise quotas.
How do I get started with GPT-Realtime-Translate in Famulor?
Create a new assistant in the Famulor dashboard, pick "Translation Mode" as the audio engine, attach your SIP number, and test the bot with inbound calls in three sample languages. The full setup takes under 30 minutes. For a deeper look at SIP options, we recommend our Telnyx, Twilio, and SIP trunks guide before going live.
















