Feature page
AI Voiceover for script-to-video narration without timeline surgery.
AI voiceover is automated generation of human-sounding narration from text. In Outbox, it sits directly inside the production pipeline, turning an approved script into a timed voice track that flows into alignment, captions, editing, and publishing.
Workflow
What does AI voiceover actually solve?
Most teams still treat voiceover like a separate production job. The script lives in one tool, narration renders in another, editing happens somewhere else, and every script revision resets part of the process.
Outbox collapses that workflow into one pipeline stage. Your script flows in. A voice track flows out. Everything downstream stays connected.
1. Write the script in Docs or Notion.
2. Paste it into a separate TTS tool.
3. Tune speed, tone, and pronunciation.
4. Render and download the audio file.
5. Import it into your video editor.
6. Manually sync the narration to footage.
7. Redo the process every time the script changes.
Mechanics
How AI voiceover works in Outbox
1. Voiceover starts after the script is approved or auto-approved.
2. Outbox applies the voice ID, speed, narrator brief, and audience context.
3. Narration is generated with the current text-to-speech provider.
4. Alignment, captions, editing, and publishing continue from the rendered track.
Example narrator brief: "Warm, premium founder delivery. Proud of the product but not pushy. Short pauses between feature demonstrations."
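The steps above can be sketched as a single stage call: approved script in, timed voice track out. Everything here is illustrative (`voiceoverStage`, `renderWithProvider`, and the field names are assumptions, not Outbox's actual API):

```typescript
// Hypothetical shape of a voiceover stage request; names are illustrative.
interface VoiceoverRequest {
  script: string;          // approved (or auto-approved) script text
  voiceId: string;         // e.g. "echo"
  speed: number;           // playback rate, 0.25–4.0
  narratorBrief?: string;  // plain-language delivery direction
  audienceHint?: string;   // who is watching
}

interface VoiceoverResult {
  audioUrl: string;        // rendered narration track
  durationSec: number;     // consumed by alignment and captions downstream
}

// Stand-in for the current text-to-speech provider call.
async function renderWithProvider(req: VoiceoverRequest): Promise<VoiceoverResult> {
  // Rough duration estimate: ~2.5 words per second of narration.
  const words = req.script.trim().split(/\s+/).length;
  return { audioUrl: `tts://${req.voiceId}/${Date.now()}`, durationSec: words / 2.5 };
}

// The stage itself: validate the settings, render, and hand the track
// to alignment, captions, editing, and publishing.
async function voiceoverStage(req: VoiceoverRequest): Promise<VoiceoverResult> {
  if (req.speed < 0.25 || req.speed > 4.0) throw new Error("speed out of range");
  return renderWithProvider(req);
}
```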
Control
What is a narrator brief?
The narrator brief is the main creative control surface. Instead of exposing a wall of low-level sliders, the page lets you describe how the voice should sound in plain language.
That brief combines with provider-safe instructions and workspace defaults so the voice output stays expressive without drifting away from your brand or readability standards.
| Use case | Narrator brief |
|---|---|
| SaaS product demo | Confident, measured pace. Short pauses between features. Sounds like a founder walking through their own product. |
| Developer tutorial | Calm, clear, and technical. No hype. Explain like pair-programming with a colleague. |
| Faceless explainer channel | Warm but authoritative. Slightly faster than conversational. Think documentary narrator for a tech audience. |
| E-commerce product walkthrough | Friendly, upbeat, concise. Highlight benefits without overselling. Natural energy. |
Voices
Available voices and pacing options
The current feature concept uses eleven voice profiles. Each one renders at any speed from 0.25x to 4.0x, which lets tutorials stay deliberate while short-form content moves faster, without changing tools or rebuilding your edit.
| Delivery | Best for |
|---|---|
| Balanced, neutral | General narration, product overviews |
| Warm, steady | Tutorials, walkthroughs |
| Smooth, measured | Storytelling, case studies |
| Bright, articulate | Explainers, marketing content |
| Clear, direct | Technical docs, developer content |
| Expressive, dynamic | Storytelling, education |
| Energetic, upbeat | Short-form, social clips |
| Deep, authoritative | Commentary, thought leadership |
| Calm, knowledgeable | Educational series, course content |
| Light, approachable | Lifestyle, product unboxing |
| Refined, polished | Brand storytelling, premium content |
Architecture
Three-tier prompt architecture
Outbox does not pass raw text to a voice API and hope for the best. It separates system-enforced quality controls, workspace-level guidance, and per-run creative direction, so teams get consistency without losing flexibility.
| Layer | Who controls it | Purpose |
|---|---|---|
| Provider-safe instructions | Outbox (system-enforced) | Baseline quality rules for pacing, pronunciation, and emphasis. Always active. |
| Base instructions | Workspace admin | Brand constraints, pronunciation guidance, and quality guardrails shared across all runs. |
| Narrator brief | You (per run) | Per-video creative direction for tone, energy, and audience fit. |
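One way to picture the layering: the provider-safe baseline always applies, the workspace base instructions are admin-set, and the narrator brief varies per run. The sketch below is an assumption about how the composition could work, not the real implementation:

```typescript
// Illustrative three-tier prompt composition; all names are assumptions.
// Tier 1: system-enforced baseline, always active.
const PROVIDER_SAFE =
  "Maintain natural pacing, correct pronunciation, and clear emphasis.";

// Tier 2: shared across all runs in the workspace, set by an admin.
interface WorkspaceDefaults {
  baseInstructions: string; // brand constraints, pronunciation guidance
}

// Tier 3: the per-run narrator brief, optional.
function composeVoicePrompt(
  workspace: WorkspaceDefaults,
  narratorBrief?: string,
): string {
  // Later layers refine the baseline; they never replace it.
  return [PROVIDER_SAFE, workspace.baseInstructions, narratorBrief]
    .filter((layer): layer is string => Boolean(layer))
    .join("\n");
}
```

For example, `composeVoicePrompt({ baseInstructions: "Pronounce 'Outbox' as one word." }, "Warm, premium founder delivery.")` yields a three-line prompt with the system baseline first; omitting the brief yields two lines.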
Settings
Voice configuration options
The configuration model stays intentionally small: enough surface area to shape delivery, not enough complexity to force operators back into manual audio editing workflows.
| Setting | What it does | Example value |
|---|---|---|
| Voice ID | Selects the voice profile | echo |
| Voice speed | Playback rate from 0.25x to 4.0x | 1.1 for tutorials |
| Narrator brief | Plain-language tone description | Warm, premium founder delivery |
| Style hint | Audience tone context | Professional but approachable |
| Audience hint | Who is watching | SaaS founders, 30-45, technical |
| Environment hint | Content type context | Product demo or tutorial |
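The settings table maps naturally onto a small typed config. This type and the speed-clamping helper are a sketch (the field names mirror the table but are not a real Outbox type); the 0.25–4.0 bounds come from the documented speed range:

```typescript
// Hypothetical voice configuration shape; field names follow the table above.
interface VoiceConfig {
  voiceId: string;          // voice profile, e.g. "echo"
  voiceSpeed: number;       // playback rate, 0.25–4.0
  narratorBrief?: string;   // plain-language tone description
  styleHint?: string;       // audience tone context
  audienceHint?: string;    // who is watching
  environmentHint?: string; // content type context
}

// Clamp out-of-range speeds into the supported range
// rather than failing the run.
function normalizeSpeed(speed: number): number {
  return Math.min(4.0, Math.max(0.25, speed));
}
```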
Pipeline
How voiceover fits the full production flow
- Uses the voiced track to generate timed subtitles (alignment and captions).
- Enforces shared voice rules across users and clients (workspace-level base instructions).
Audience
Who uses this page and feature concept?
- Channel operators: run multiple channels without recording voice tracks manually. Lock in a narrator style per channel and keep output consistent across volume.
- Product teams: turn feature walkthroughs into polished voice-led demos without opening a separate TTS tool or rebuilding a timeline after every script edit.
- Tutorial creators: convert long screen recordings into calm, technical tutorials with a narration style that fits product education instead of ad copy.
- Agencies: use workspace-level guidance to keep voice output on-brand while letting operators tailor delivery per client, campaign, or content format.
FAQ
Common questions about AI voiceover
What voice providers does Outbox support?
The current page is designed around OpenAI text-to-speech voices. The pipeline shape stays provider-ready, so the UI can expand to additional providers later without changing the route structure.
Can I preview a voice before running a full render?
That is the intended workflow. Pick a voice, enter a short sample line, and preview before starting the full pipeline run.
What happens if the script changes after voiceover?
Only downstream stages need to rerun. The analysis stage can stay cached while voiceover, alignment, captions, editing, rendering, metadata, and publishing refresh from the updated script.
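The rerun behavior described here can be sketched as an ordered stage list where everything from voiceover onward is invalidated. The stage names come from this page; the function itself is illustrative:

```typescript
// Ordered pipeline stages as described on this page.
const STAGES = [
  "analysis", "voiceover", "alignment", "captions",
  "editing", "rendering", "metadata", "publishing",
] as const;
type Stage = (typeof STAGES)[number];

// After a script edit, every stage from `from` onward reruns;
// earlier stages (e.g. analysis) keep their cached results.
function stagesToRerun(from: Stage): Stage[] {
  return STAGES.slice(STAGES.indexOf(from)) as Stage[];
}
```

A script change therefore invalidates `stagesToRerun("voiceover")`, which covers voiceover through publishing while leaving the cached analysis stage untouched.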
How long does rendering usually take?
It depends on script length, but the intended UX is that rendering completes quickly and flows directly to the next stage without file export or manual import work.
Get started
Raw footage in. Final video out.
This is a plain-TSX feature page for the `/features/ai-voiceover` URL in the sitemap. It is file-based, crawlable, and ready for iteration while the rest of the dedicated feature routes are still being built.