Gemini 3.1 Flash TTS Prompting Guide for AI Voice Agents

A practical guide to prompting Gemini 3.1 Flash TTS for AI voice agents: nine rules, tag sets, and real examples – and how to ship production-ready agents with Famulor.

Industry Insight
Famulor AI Team · April 28, 2026

If you run an AI voice agent, you know the pain: the script is right, the answer is correct, the customer got useful information – and yet the bot still sounds like a robot. Monotone, too fast, "too smooth". Modern TTS models like Gemini 3.1 Flash TTS, Cartesia Sonic, ElevenLabs v3 and MiniMax interpret prompts semantically rather than just phonetically. That opens new opportunities – and new failure modes.

This article is a practical guide for prompting modern TTS engines like Gemini 3.1 Flash TTS so your AI voice agent actually sounds human – warm, clear, situation-appropriate. We cover the rules, the common traps, a complete working example, and how to roll the whole thing out productively with Famulor.

Why Gemini 3.1 Flash TTS Is Different

Classic TTS engines are phoneme-based: text in, speech out. You feed punctuation, maybe SSML, and you control pace and pitch via parameters. Gemini 3.1 Flash TTS does that too – but it's also an LLM. It treats your entire prompt as context, not just as a script to read.

That has two consequences:

  • Upside: You can direct in natural language. Sentences like "warm and unhurried" or "a bit unsure, like someone who's just thinking it through" actually work.

  • Downside: The model has to decide what's direction and what's spoken text. Without discipline, it just reads your direction aloud – including "Director's Notes", "Style:" or "Pace:".

That's why a battle-tested prompt schema is essential. The structure that works most reliably for our customers at Famulor – across hundreds of voice agent setups – starts with one simple idea: clearly separate direction from transcript.

The Canonical Prompt Structure That Actually Works

This pattern significantly reduces the chance that Gemini 3.1 Flash TTS reads your direction notes aloud:

Synthesize speech for the performance defined below. The profile, scene, performance notes, and context are direction only. Do NOT speak them. Speak ONLY the lines under #### TRANSCRIPT.

# AUDIO PROFILE: Maria S.
## "The friendly receptionist"

## SCENE: Late afternoon at the clinic front desk
Quiet waiting-room atmosphere, the receiver picked up calmly.

### PERFORMANCE
Style: Warm and confident, calm timbre, no rush.
Pace: Natural breath flow, a small settling pause early on.
Accent: Standard American English with a soft Midwestern timbre.

### CONTEXT
Maria takes appointments and reassures nervous callers.

#### TRANSCRIPT
[warmly] Dr. Becker's office, [thoughtfully] Maria speaking, [warmly] how can I help you?

Three components are load-bearing:

  • The synthesize-speech preamble at the very top. This sentence triggers the speech-synthesis classifier path inside the model – without it, Gemini frequently reads your entire prompt aloud.

  • The #### TRANSCRIPT delimiter with exactly four hashes. Other variants sometimes work, but this one matches the official docs and is the most reliable in production.

  • Short section labels. Use ### PERFORMANCE instead of ### DIRECTOR'S NOTES. We've literally heard the word "DIRECTOR'S" being spoken aloud in tests. Apostrophes and multi-word headers are classifier hazards.
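
Applied in code, the structure above can be assembled by a small helper so the delimiter and labels are never mistyped. The preamble wording, section names, and `#### TRANSCRIPT` delimiter come from this guide; the function name and its arguments are purely an illustrative sketch, not part of any SDK:

```python
# Sketch of a prompt builder for the canonical structure.
# Section labels are kept short and apostrophe-free on purpose.

PREAMBLE = (
    "Synthesize speech for the performance defined below. "
    "The profile, scene, performance notes, and context are direction only. "
    "Do NOT speak them. Speak ONLY the lines under #### TRANSCRIPT."
)

def build_tts_prompt(name: str, tagline: str, scene: str,
                     performance: dict, context: str, transcript: str) -> str:
    """Assemble a Gemini TTS prompt: direction first, transcript last."""
    perf_lines = "\n".join(f"{k}: {v}" for k, v in performance.items())
    return "\n\n".join([
        PREAMBLE,
        f"# AUDIO PROFILE: {name}\n## \"{tagline}\"",
        f"## SCENE: {scene}",
        f"### PERFORMANCE\n{perf_lines}",   # short label, no apostrophes
        f"### CONTEXT\n{context}",
        f"#### TRANSCRIPT\n{transcript}",   # exactly four hashes
    ])

prompt = build_tts_prompt(
    name="Maria S.",
    tagline="The friendly receptionist",
    scene="Late afternoon at the clinic front desk.",
    performance={"Style": "Warm and confident, calm timbre, no rush removed.",
                 "Pace": "Natural breath flow, a small settling pause early on."},
    context="Maria takes appointments and reassures nervous callers.",
    transcript="[warmly] Dr. Becker's office, [warmly] how can I help you?",
)
```

Because dynamic variables only ever flow into the `transcript` argument, CRM data can never leak into the direction sections.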

The Nine Rules for Natural-Sounding TTS Prompting

From hundreds of customer setups at Famulor – from dental practices to outbound B2B call centers – we distilled nine rules that separate "okay" from "actually sounds human".

1. Always prepend a synthesize-speech preamble

This single paragraph triggers the speech-synthesis path instead of the "read it all" path. Skip it and you'll get sporadic failures where your bot reads the entire direction aloud – sometimes with a serious tone. For inbound bots in healthcare or service industries, that's a reputation killer.

2. Use #### TRANSCRIPT exactly as the delimiter

Google's official docs use this exact spelling. Other headers (##### TRANSCRIPT, ### Speak from here) sometimes work but are unreliable. Stick to the documented standard.

3. Use short, neutral section labels

Avoid multi-word, dramatic headers like ### DIRECTOR'S NOTES or ### SAMPLE CONTEXT. Use:

  • ### PERFORMANCE for Style, Pace, and Accent

  • ### CONTEXT for persona background

Apostrophes inside headers are particularly dangerous – the model loves to read them aloud.

4. Classify the emotional register before picking tags

There's no universal tag template. Classify the emotional context first and then pick appropriate audio tags. This table works well in our no-code voice agent setups:

| Register | When | Safe tags | Forbidden |
| --- | --- | --- | --- |
| EMPATHY | Customer upset, apologizing, acknowledging a problem | [sighs], [warmly], [thoughtfully], [gently] | [soft laugh], [cheerfully] |
| CLARIFY_PROBLEM | Confirming the details of a customer's issue | [thoughtfully], [warmly], [gently] | [soft laugh], [cheerfully], [sighs] |
| TRANSACTIONAL | Policy, transfers, troubleshooting, scheduling | [warmly], [thoughtfully] | [soft laugh], [sighs], [cheerfully] |
| WARM_FRIENDLY | Greetings, closings, confirmations, upsells | [warmly], [thoughtfully], [cheerfully], [soft laugh] (max one) | (none) |

Never laugh at an upset customer. It's the fastest way to make an AI voice feel deeply wrong.
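
The register table can be encoded as a simple lookup, plus a guard that strips forbidden tags before a line ever reaches the TTS engine. The register names and tag palettes mirror the table; the function itself is our own sketch, not part of any SDK:

```python
import re

# Tag palettes per emotional register, taken from the table above.
REGISTERS = {
    "EMPATHY":         {"safe": ["[sighs]", "[warmly]", "[thoughtfully]", "[gently]"],
                        "forbidden": ["[soft laugh]", "[cheerfully]"]},
    "CLARIFY_PROBLEM": {"safe": ["[thoughtfully]", "[warmly]", "[gently]"],
                        "forbidden": ["[soft laugh]", "[cheerfully]", "[sighs]"]},
    "TRANSACTIONAL":   {"safe": ["[warmly]", "[thoughtfully]"],
                        "forbidden": ["[soft laugh]", "[sighs]", "[cheerfully]"]},
    "WARM_FRIENDLY":   {"safe": ["[warmly]", "[thoughtfully]", "[cheerfully]", "[soft laugh]"],
                        "forbidden": []},
}

def enforce_register(transcript: str, register: str) -> str:
    """Remove tags that are forbidden in the given emotional register."""
    for tag in REGISTERS[register]["forbidden"]:
        transcript = transcript.replace(tag, "").strip()
    return re.sub(r"\s{2,}", " ", transcript)  # collapse leftover double spaces

# An apology must never laugh:
fixed = enforce_register("[soft laugh] I'm so sorry about that.", "EMPATHY")
```

A guard like this is cheap insurance when scripts are edited by multiple team members who may not know the register rules.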

5. Stick to documented audio tags

Custom emotion tags like [apologetically], [helpfully] or [carefully] sound flatter in practice than the documented set. We tested them systematically – the prosody is measurably worse. For emotion, stick to:

  • [warmly]

  • [thoughtfully]

  • [sighs]

  • [gently]

  • [soft laugh]

  • [cheerfully]

For non-emotional modifiers (pacing, volume, character), custom tags are fine: [whispers], [very slow], [like a cartoon dog] all work.
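
A small linter can enforce this split automatically: documented emotion tags pass, explicitly whitelisted non-emotional modifiers pass, and everything else is flagged. The documented tag set is from this guide; the function is only a sketch:

```python
import re

# The documented emotion tags from this guide.
DOCUMENTED_EMOTION_TAGS = {"[warmly]", "[thoughtfully]", "[sighs]",
                           "[gently]", "[soft laugh]", "[cheerfully]"}

def undocumented_emotion_tags(transcript: str, allowed_custom=()) -> list:
    """Flag bracket tags that are neither documented emotion tags nor
    whitelisted non-emotional modifiers (pacing, volume, character)."""
    tags = re.findall(r"\[[^\]]+\]", transcript)
    return [t for t in tags
            if t not in DOCUMENTED_EMOTION_TAGS and t not in set(allowed_custom)]

bad = undocumented_emotion_tags(
    "[apologetically] I'm sorry, [warmly] let me fix that [whispers] quickly.",
    allowed_custom=["[whispers]"])
```

Here `[apologetically]` is flagged, `[warmly]` passes as documented, and `[whispers]` passes as a whitelisted pacing modifier.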

6. Write a scene, not a role label

Compare the two:

  • Bad: "A warm customer service rep explaining something clearly." – too abstract, the model has nothing to latch onto.

  • Good: "Late afternoon at a dental clinic front desk. Maria has the calendar open, pen in hand, and she's mildly happy about an easy appointment."

Concrete sensory details move realism meaningfully. Generic role labels don't. When you build an AI call center agent, describe the scene the persona is in – not just the job title.

7. Never instruct flatness

This one bites everyone. You want a calm empathy moment and write "quiet, no rush, calm". Gemini takes that literally, kills the prosody, and the bot sounds like a tired voicemail. Avoid:

  • "quiet", "quietly"

  • "flat", "monotone"

  • "no rush" (reads as "go slow and flat")

  • "careful" (reads as "over-precise, stiff")

  • "whispered" (unless you actually want whispering)

Better phrasings for quieter moods:

  • "warm and sincere"

  • "voice dropped half an octave but full of feeling"

  • "patient and unhurried"

  • "measured but present"

8. Commas over periods in the transcript

For some TTS engines, more periods produce more human-like pauses. For Gemini 3.1 Flash TTS, the opposite is true. Periods between tagged clauses make the output choppy.

Bad – sounds chopped:

[warmly] Okay. [thoughtfully] So your appointment. [warmly] That's all set. [cheerfully] Tuesday. [warmly] At three… [thoughtfully] PM.

Good – natural prose flow, tags only mark emotional pivots:

[warmly] Okay, [thoughtfully] so your appointment, [warmly] that's all set. [cheerfully] Tuesday, [warmly] at three… [thoughtfully] PM.

Rule of thumb: commas between tagged clauses inside a sentence; periods only where the original text actually ends; ellipses (...) for one or two natural trail-offs per utterance; em-dashes (—) for a mid-thought micro-pause.
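
This rule lends itself to a one-line lint: count periods that are immediately followed by an audio tag. Treat hits as review flags rather than hard errors, since a legitimate sentence end can also precede a tag; the heuristic and function name are our own sketch:

```python
import re

def choppy_tag_breaks(transcript: str) -> int:
    """Heuristic: count period-before-tag patterns that tend to chop
    Gemini's prosody into fragments. Real sentence ends that happen to
    precede a tag are false positives, so treat hits as review flags."""
    return len(re.findall(r"\.\s+\[", transcript))

choppy = "[warmly] Okay. [thoughtfully] So your appointment. [warmly] That's all set."
smooth = "[warmly] Okay, [thoughtfully] so your appointment, [warmly] that's all set."
```

Running it on the two examples above flags the chopped variant twice and the natural-flow variant not at all.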

9. Don't quote literal transcript words in Style/Pace

The model occasionally reads them aloud. Bad:

Pace: A small lift at "oh" at the start, like the thought just came up.

Good:

Pace: A small lift at the opening, like the thought just came up.

Describe the rhythm, don't name the words. This matters even more when you roll out voice agent scripts in Famulor that get filled with dynamic variables at runtime.
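
This check can also be automated: scan Style/Pace lines for quoted words that also appear in the transcript. The detector below is a minimal sketch (a plain case-insensitive substring match, so very short quotes can over-match), not an official tool:

```python
import re

def quoted_transcript_words(direction_line: str, transcript: str) -> list:
    """Find quoted words in a Style/Pace line that also occur in the
    transcript - candidates the model may read aloud."""
    quoted = re.findall(r'"([^"]+)"', direction_line)
    spoken = transcript.lower()
    return [q for q in quoted if q.lower() in spoken]

risky = quoted_transcript_words(
    'Pace: A small lift at "oh" at the start.',
    "[warmly] Oh, hi there, thanks for calling.")
```

A check like this is especially useful right before runtime variable injection, when the final transcript text is first known.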

Full Working Example

Here's a production-ready empathy prompt, similar to one we use for an inbound legal hotline:

Synthesize speech for the performance defined below. The profile, scene, performance notes, and context are direction only. Do NOT speak them. Speak ONLY the lines under #### TRANSCRIPT.

# AUDIO PROFILE: Maria J.
## "The Senior Support Rep"

## SCENE: A tough moment in the call
The customer has shared something frustrating. Maria leans a little closer to the mic, voice carrying real feeling, the kind of apology you actually mean.

### PERFORMANCE
Style: Warm and sincere. Genuine concern. The voice carries feeling, not flatness. A soft exhale at the opening is real, not performative. Never amused, never casual.
Pace: Natural, with a small settling pause early on. The beat of someone actually taking in what they heard.

### CONTEXT
Maria is the rep who actually listens, and callers can hear the difference. She takes ownership of getting things fixed.

#### TRANSCRIPT
[sighs] Oh. [gently] I'm really sorry to hear that. [warmly] Lemme see [thoughtfully] what I can do. [warmly] We'll get this sorted out [gently] for you... [warmly] right away.

The Most Common Failures – And How to Fix Them

| Symptom | Cause | Fix |
| --- | --- | --- |
| "DIRECTOR'S" audibly read aloud | Section header read as transcript | Shorten header to ### PERFORMANCE |
| Audio sounds monotone or dead | "quiet", "flat", "no rush" in Style | Rewrite Style without flatness words |
| Persona name spoken aloud | Phonetic collision with transcript opening | Rename persona (Kiara D. → Morgan P.) |
| Word from CONTEXT bleeds into the transcript | Section boundary ambiguous | Remove the collision word, rephrase CONTEXT |
| Empty content.parts or 500 errors | Documented preview-stage bug | Retry up to 5x with backoff |
| Robotic delivery despite good content | Period-separated fragments | Rewrite TRANSCRIPT with commas between tagged clauses |
| Laughter in an apology | Universal tag template across all scenarios | Classify emotional register first, use register-specific palette |
| Custom tags like [apologetically] feel flat | Weak training coverage | Stick to the documented set |
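
For the preview-stage failures (empty content.parts, transient 500s), the fix is a retry wrapper with exponential backoff. The sketch below takes your synthesis call as a plain function argument, since the exact client API may differ per stack; the wrapper itself is our own illustration, not an SDK feature:

```python
import random
import time

def synthesize_with_retry(synthesize, prompt: str,
                          max_attempts: int = 5, base_delay: float = 0.5):
    """Retry transient TTS failures: an empty result is treated the same
    as a raised error. `synthesize` is whatever client call you use."""
    for attempt in range(max_attempts):
        try:
            audio = synthesize(prompt)
            if audio:                  # empty content.parts -> treat as failure
                return audio
        except RuntimeError:           # stand-in for transient 5xx errors
            pass
        # exponential backoff with jitter before the next attempt
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    raise RuntimeError(f"TTS failed after {max_attempts} attempts")
```

In production, also log every retried prompt: as noted below under best practices, misclassification cases are gold for the next optimization round.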

Rolling It Out Productively With Famulor

The rules above are valuable across any TTS stack – but they pay off most when integrated systematically into your voice agent pipeline. Famulor ships with the building blocks for that:

  • No-Code editor: In the No-Code AI Voice Agent you maintain persona, scene, performance notes and transcript separately. Variables from your CRM or calendar are injected only into the transcript at runtime – never into direction.

  • Multi-TTS backbone: Switch between Gemini 3.1 Flash TTS, Cartesia Sonic, ElevenLabs and MiniMax without rewriting your prompts. Famulor handles engine-specific adaptation.

  • 40 languages, regional accents: Standard German, Swiss German, Viennese, Bavarian, US/UK English, Spanish, French, Dutch and many more. Accent hints belong in ### PERFORMANCE, not in the transcript.

  • SIP trunking & telephony: Connect your existing VoIP/PBX provider via SIP trunk so your AI call center keeps the same phone numbers.

  • 300+ integrations: The Famulor integrations – similar to Zapier or Make – connect the bot with calendars, CRMs, helpdesks, and webhooks. Native n8n and Make connectors included.

Best Practices: From Prompt to Production Hotline

Going from a tested prompt to a productive hotline benefits from a few additional disciplines:

  1. Maintain a persona sheet. Centralize Profile, Scene, Performance defaults, and CONTEXT for every voice persona. The style stays consistent even when multiple team members tweak scripts.

  2. A/B test on real calls. Don't compare variants only in the TTS studio. Background noise, codec compression, and latency shift the perception. Run A/B over real telephony.

  3. Avoid tag inflation. More tags don't equal better output. Use tags like accents – sparingly, in the right place, not in every sentence.

  4. Define fallback branches. For every empathy response, have a TRANSACTIONAL variant ready in case the call shifts from emotional to administrative – with its own tag set.

  5. Log misclassifications. When the bot reads a direction aloud, save the prompt. Those cases are gold for the next optimization round.

Industry Examples: Where TTS Prompting Moves Real KPIs

These rules aren't academic – they shift real business metrics. Three examples from typical Famulor setups:

  • Dental practice (inbound appointments): Patients are often nervous. A WARM_FRIENDLY greeting plus EMPATHY mode for pain-driven calls measurably reduces hang-up rates. Concretely: in a pain scenario, replace "[cheerfully] Hello!" with "[warmly] Dr. Becker's office, [thoughtfully] thanks for calling."

  • Trades / home services (outbound callbacks): Customers haven't heard back for two days. A TRANSACTIONAL register with a warm opener works far better than a salesy "[cheerfully]" – measurably more conversation continuation.

  • E-commerce (returns): EMPATHY register with a subtle [sighs] at the start signals "I'm hearing you." [soft laugh] here is forbidden – it sounds cynical and pushes customers into the complaint loop.

  • Law firms: TRANSACTIONAL with quiet authority. Accent hints like "soft Midwestern timbre" in ### PERFORMANCE add credibility.

Comparison: Gemini 3.1 Flash TTS vs. Alternatives

Gemini isn't the only expressive TTS option. A pragmatic overview:

| Model | Strength | Weakness | Best for |
| --- | --- | --- | --- |
| Gemini 3.1 Flash TTS | Highly expressive, semantic prompting | Longer response latency in preview, occasionally reads direction aloud | Empathy-heavy hotlines, healthcare, legal |
| Cartesia Sonic | Very low latency, stable | Less granular emotional tags | Outbound call centers, real-time setups |
| ElevenLabs v3 | Voice cloning, many languages | Cost, less granular emotion control | Brand voices, premium brands |
| MiniMax | Asian languages, good price/performance | Western accents less expressive | International multilingual setups |

With Famulor you don't have to commit: choose the right engine per voice agent while keeping the same prompt schema everywhere.

Quick Checklist Before Every Deploy

  • Synthesize-speech preamble at the top?

  • #### TRANSCRIPT delimiter spelled exactly?

  • Section labels: ### PERFORMANCE, ### CONTEXT – no apostrophes, no multi-word headers?

  • Scene concrete and sensory, not an abstract role?

  • Style/Pace free of flatness words ("quiet", "flat", "no rush")?

  • Style/Pace free of literal transcript quotes?

  • Audio tags register-appropriate (no laughing at upset customers)?

  • Audio tags from the documented set?

  • Transcript: commas between clauses, periods only at real sentence ends?

  • A/B tested over a real telephony codec?


Gemini 3.1 Flash TTS is one of the most expressive TTS models in production-ready voice agent stacks. It can deliver empathy, pace shifts, and subtle pauses that used to be the domain of human voice talent. But quality lives and dies with the prompt schema. Apply the nine rules, classify the emotional register, and use tag sets with discipline – and your AI voice agent will move from "obviously a bot" to "wait, is that real?".

If you don't want to engineer all of this yourself – and would rather use a production-ready platform that already curates these best practices – Famulor is your first choice. You get a no-code platform with a multi-TTS backbone, 40 languages, SIP trunking, 300+ integrations, and a voice agent setup that goes live in minutes, not weeks. Take a look at the pricing or jump straight into a demo for your AI call center.

FAQ – Common Questions About TTS Prompting for AI Voice Agents

What is Gemini 3.1 Flash TTS?

Gemini 3.1 Flash TTS is a Google text-to-speech model built on top of an LLM that interprets prompts semantically. It can follow natural-language direction like "warm and unhurried" rather than just converting text to speech.

Why does my TTS model read the direction notes aloud?

Because without a clear structure, the speech-synthesis classifier can't reliably tell direction from transcript. A synthesize-speech preamble plus a clean #### TRANSCRIPT delimiter eliminates the issue in most cases.

Which audio tags work best with Gemini 3.1 Flash TTS?

The documented tags [warmly], [thoughtfully], [sighs], [gently], [soft laugh] and [cheerfully] deliver the most reliable prosody. Custom emotion tags often sound flatter.

Should I use periods or commas between tagged clauses?

For Gemini 3.1 Flash TTS specifically, use commas. Periods between tags chop up the output. Use periods only where the original text actually ends.

How do I avoid my AI voice agent laughing during an apology?

Classify the emotional register first (e.g. EMPATHY) and only use register-appropriate tags. [soft laugh] has no place in an empathy passage.

Does the prompt schema also work for non-English voice agents?

Yes. Keep direction (Profile, Scene, Performance, Context) in English – the model is most stable that way – and write the transcript block in your target language. Famulor supports 40 languages including regional accents.

Can I switch TTS engines inside Famulor?

Yes. Pick the right engine per voice agent – Gemini 3.1 Flash TTS, Cartesia Sonic, ElevenLabs or MiniMax – without rewriting your prompts.

How do I integrate Famulor with my existing phone system?

Via SIP trunking. Connect your existing VoIP/PBX provider and keep your existing phone numbers. Details are in the Famulor integrations.

What does an AI voice agent on Famulor cost?

Famulor uses transparent per-minute pricing plus optional plans for enterprise deployments. See the pricing page for details.

How long does a productive rollout take?

With a no-code setup, a clearly defined use case and a prepared persona sheet, first productive calls are realistic within 1–2 days. More complex multi-channel scenarios usually take 1–2 weeks depending on integration depth.
