The AI voiceover stage renders narration. Alignment syncs audio to footage and produces timed segments. Captions receive those segments — already split, already synced.
AI Voiceover → Feature page
Auto captions that ship inside your video — not as an afterthought.
Auto captions are timed on-screen subtitles generated from your narration and burned directly into the final video. In Outbox, captioning is stage five of a nine-stage production pipeline — no separate subtitle tool needed.
Problem
What do auto captions actually solve?
Adding captions is tedious. You export audio, upload to a transcription service, fix errors, download an SRT file, import it into your editor, adjust timing, style everything, and re-export per aspect ratio.
Eight steps. Three tools. Every single video. In Outbox, captions are one pipeline stage. Your narration flows in. Styled, timed captions flow out. The pipeline continues.
1. Export your final audio or video from the editor.
2. Upload to a transcription service like Descript or Rev.
3. Wait for the transcript. Fix errors manually.
4. Download an .srt or .vtt file.
5. Import the subtitle file into your video editor.
6. Adjust timing, fix overlaps, tweak positioning.
7. Style the captions: font, size, color, outline, shadow.
8. Re-export for each aspect ratio (16:9, 9:16, 1:1).
Mechanics
How auto captions work in the pipeline
1. Segment intake: the alignment stage delivers script segments with start time, end time, and text.
2. Chunking: long segments split into 2–8 word chunks at natural pause points.
3. Styling: font, size, color, positioning, and emphasis rules apply from your selected preset.
4. ASS generation: every timed caption event is written to an ASS file with full styling.
5. Burn-in: FFmpeg burns captions into the final video frames. No sidecar files.
Accuracy
Why script-based captions beat transcription
Most caption tools start by transcribing audio. That introduces errors — especially on technical terms, product names, and abbreviations. Outbox takes a different path: captions come from your approved script via the scripting stage and AI voiceover. The text is already correct.
| Approach | Accuracy | Timing source | Editing required |
|---|---|---|---|
| Speech-to-text transcription | 80–95% depending on audio quality | Inferred from audio waveform | Manual correction of errors, names, technical terms |
| Outbox script-based captions | 100% — uses your approved script | Segment timing from the alignment stage | None — the script is already reviewed |
Presets
Three caption presets, ready to render
Each preset controls font, color, outline, shadow, alignment, and text behavior. Presets adapt to your video's aspect ratio automatically — a 16:9 video gets different margins and font sizes than a 9:16 Short.
| Preset | Look | Best for |
|---|---|---|
| classic_bold | Large white text, strong black outline, center-bottom | YouTube long-form, tutorials, product demos |
| minimal_clean | Smaller text, subtle shadow, lower-third positioning | Professional content, SaaS demos, course videos |
| highlight_brand | Bold text with accent color highlights, center placement | Short-form content, Reels, Shorts, TikTok |
Readability
Smart chunking keeps text readable
A script segment might be 15 seconds and 40 words. Displaying everything as one subtitle block is unreadable. The chunking engine splits segments into smaller events at natural pause points.
Each chunk targets 2–8 words and 1–2 lines. Time allocation is proportional — longer chunks get more screen time. Breaks happen at punctuation, conjunctions, and phrase boundaries.
| Time | Caption |
|---|---|
| 0.0–1.2s | Welcome to Outbox.run |
| 1.2–2.6s | The automated video pipeline |
| 2.6–4.1s | that turns raw footage |
| 4.1–5.4s | into published content |
| 5.4–7.0s | with scripting, voiceover, |
| 7.0–8.5s | captions, and metadata |
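The chunking behavior above can be sketched in a few lines of Python. This is an illustration, not Outbox's actual implementation; the function name `chunk_segment` and the dict shape are assumptions:

```python
import re

def chunk_segment(text, start, end, max_words=8):
    """Split a timed segment into caption chunks at natural pauses,
    allocating screen time proportionally to each chunk's word count."""
    # Prefer breaks after punctuation; fall back to hard word-count splits.
    pieces = re.split(r"(?<=[,.;:!?])\s+", text.strip())
    chunks = []
    for piece in pieces:
        words = piece.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    total_words = sum(len(c.split()) for c in chunks)
    duration = end - start
    events, cursor = [], start
    for c in chunks:
        # Proportional allocation: longer chunks get more screen time.
        share = duration * len(c.split()) / total_words
        events.append({"start": round(cursor, 2),
                       "end": round(cursor + share, 2),
                       "text": c})
        cursor += share
    return events
```

Feeding it the segment "Welcome to Outbox.run, the automated video pipeline" over 0.0–2.6s yields two events split at the comma, with the four-word chunk holding the screen slightly longer than the three-word one.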
Formats
Aspect-ratio-aware layouts across platforms
Different platforms demand different formats. Safe-area margins keep captions away from platform UI elements — like TikTok's share button or YouTube's progress bar. Your text stays visible regardless of where the viewer watches.
| Aspect ratio | Platform | Caption behavior |
|---|---|---|
| 16:9 | YouTube, Vimeo | Bottom-center placement, larger font, wider margins |
| 9:16 | YouTube Shorts, TikTok, Reels | Center-screen placement, safe-area margins to avoid platform UI overlays |
| 1:1 | Instagram feed, LinkedIn | Lower-third placement, compact font sizing |
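One way to model the layout table is a per-ratio preset resolved against the frame height. The alignment codes follow the ASS numpad convention (2 = bottom-center, 5 = middle-center); the percentages are illustrative values, not Outbox's actual numbers:

```python
# Hypothetical layout presets -- values are illustrative only.
LAYOUTS = {
    "16:9": {"alignment": 2, "font_scale": 1.0,  "margin_v_pct": 0.06},
    "9:16": {"alignment": 5, "font_scale": 1.15, "margin_v_pct": 0.15},  # clear platform UI
    "1:1":  {"alignment": 2, "font_scale": 0.9,  "margin_v_pct": 0.12},
}

def margins_for(aspect_ratio, frame_height):
    """Resolve a pixel MarginV for the given aspect ratio and frame height."""
    layout = LAYOUTS[aspect_ratio]
    return {**layout, "margin_v": int(frame_height * layout["margin_v_pct"])}
```

Expressing the vertical margin as a fraction of frame height keeps the safe area proportional whether the render is 1080p or 4K.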
Effects
Script effects shape caption presentation
Script segments carry an effects field from the scripting stage. Captions respond to those effects automatically — an upbeat intro looks different from a technical walkthrough without you changing settings mid-run.
| Script effect | Caption behavior |
|---|---|
| upbeat | Slightly larger font, faster chunk transitions |
| clear | High-contrast white on black outline, no decoration |
| instructional | Stable lower-third positioning, smaller font |
| technical | Tighter layout, narrower line lengths |
| professional | Restrained preset, clean lines |
| fade_out | Reduced opacity near segment end |
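Conceptually, each effect is a small set of style overrides merged onto the base preset. The override keys and values below are assumptions for illustration, not Outbox's schema:

```python
# Hypothetical effect-to-style overrides (illustrative values only).
EFFECT_OVERRIDES = {
    "upbeat":        {"font_scale": 1.1, "transition_ms": 80},
    "clear":         {"fill": "white", "outline": "black", "decoration": None},
    "instructional": {"position": "lower-third", "font_scale": 0.9},
    "technical":     {"max_chars_per_line": 28},
    "professional":  {"decoration": None},
    "fade_out":      {"end_opacity": 0.4},
}

def apply_effect(style, effect):
    """Merge an effect's overrides into a base style dict; unknown effects are a no-op."""
    return {**style, **EFFECT_OVERRIDES.get(effect, {})}
```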
Engagement
Active-word highlighting for short-form
The currently spoken word gets a visual accent while the surrounding phrase stays visible. This is the karaoke-style captioning format popular in Hormozi-style Shorts and viral TikToks — generated automatically from your pipeline run.
1. Word-level timing data is extracted from the voiceover alignment.
2. Each caption phrase displays with all words visible.
3. The active word gets a color highlight or weight change.
4. The highlight advances word-by-word through the phrase.
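The steps above can be sketched by emitting one ASS dialogue event per word, with the active word wrapped in a color override. Note that ASS stores colors as `&HBBGGRR&`, so the default highlight `#adff2f` becomes `&H2FFFAD&`. A minimal sketch, assuming `words` is a list of `(word, start, end)` tuples from alignment:

```python
def highlight_events(words, color="&H2FFFAD&"):
    """One event per word: the full phrase stays on screen while the
    active word gets an inline ASS color override (\\c sets, \\r resets)."""
    events = []
    for i, (_, start, end) in enumerate(words):
        parts = []
        for j, (w, _, _) in enumerate(words):
            if i == j:
                parts.append(r"{\c" + color + "}" + w + r"{\r}")
            else:
                parts.append(w)
        events.append((start, end, " ".join(parts)))
    return events
```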
Comparison
Auto captions vs. the manual path
| Dimension | Manual workflow | Outbox Auto Captions |
|---|---|---|
| Tools required | Transcription service + subtitle editor + video editor | One pipeline stage |
| Time per video | 20–45 min (transcribe, correct, style, position, export) | Automatic — generates and passes to next stage |
| Accuracy | 80–95% from speech-to-text (manual cleanup needed) | 100% — sourced from your approved script |
| Multi-format support | Re-position and re-export per aspect ratio | Aspect-ratio-aware layouts generated automatically |
| Style change impact | Re-edit every video manually | Re-run from captions stage. Upstream cached. |
| Active-word highlighting | Manual animation in CapCut or After Effects | Generated from word-level timing data |
| Brand consistency | Depends on the editor remembering the style guide | Preset-locked per workspace |
Settings
Caption configuration options
The settings stay intentionally focused: enough control to shape appearance, not enough complexity to force you back into a subtitle editor.
| Setting | What it does | Example value |
|---|---|---|
| Caption preset | Selects the visual style | classic_bold |
| Captions enabled | Toggle captions on/off per run | true |
| Max words per chunk | Controls chunk size for readability | 6 |
| Uppercase mode | Force uppercase text per preset | false |
| Highlight color | Accent color for active-word highlighting | #adff2f |
| Target aspect ratio | Determines layout and safe areas | 16:9, 9:16, or 1:1 |
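Taken together, the options above amount to a small config object per run. The field names below are assumptions drawn from the settings table, not Outbox's documented schema:

```json
{
  "captions": {
    "enabled": true,
    "preset": "classic_bold",
    "max_words_per_chunk": 6,
    "uppercase": false,
    "highlight_color": "#adff2f",
    "aspect_ratio": "9:16"
  }
}
```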
Artifacts
The caption data pipeline under the hood
Every artifact is stored and inspectable in the Outbox dashboard. If a caption looks wrong, trace it back through the chain to see where the text, timing, or styling originated.
| Artifact | What it contains | Purpose |
|---|---|---|
| script_segments.json | Timed text segments with start/end times and effects | Source of truth from the alignment stage |
| word_timings.json | Per-word start and end times from voiceover alignment | Enables active-word highlighting |
| caption_events.json | Chunked, styled caption events with timing and positioning | Intermediate representation for debugging and editing |
| captions.ass | ASS subtitle file with style definitions and timed events | FFmpeg input for final burn-in |
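For orientation, here is a trimmed illustration of what `captions.ass` might look like, using the timings from the chunking example. The `Format:` lines list only a subset of the usual V4+ fields, and the style values are placeholders, not Outbox's presets:

```
[Script Info]
ScriptType: v4.00+
PlayResX: 1920
PlayResY: 1080

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, OutlineColour, Bold, Outline, Alignment, MarginV
Style: classic_bold,Arial,72,&H00FFFFFF,&H00000000,1,4,2,60

[Events]
Format: Layer, Start, End, Style, Text
Dialogue: 0,0:00:00.00,0:00:01.20,classic_bold,Welcome to Outbox.run
Dialogue: 0,0:00:01.20,0:00:02.60,classic_bold,The automated video pipeline
```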
Pipeline
How captions fit the full production flow
The editing stage receives the ASS subtitle file. FFmpeg burns captions directly into video frames. The render artifact ships with captions embedded.
Re-run from the captions stage only. Analysis, script, voiceover, and alignment stay cached. New render with updated captions — no full reprocess.
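The burn-in step corresponds to FFmpeg's `ass` video filter. A minimal equivalent command, with file names assumed for illustration:

```shell
# Hard-burn captions.ass into the frames; audio passes through untouched.
ffmpeg -i render_input.mp4 -vf "ass=captions.ass" -c:a copy final_video.mp4
```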
- AI voiceover: renders narration from the script; caption timing syncs to voiceover pacing.
- Styling: the visual customization layer for presets, colors, fonts, and positioning rules.
- Scripting: produces the script text that captions display; script accuracy means caption accuracy.
- Metadata: caption text feeds into description and tag generation for search optimization.
- Workspaces: lock caption presets per workspace so every team member produces on-brand captions.
Audience
Who uses auto captions in the pipeline?
- Multi-channel creators: publish 3–5 videos per week across channels. Captioning every video manually doesn't scale. Outbox captions every run automatically — no extra steps.
- Short-form creators: Shorts, Reels, and TikToks live or die by captions. Active-word highlighting is generated from the same pipeline that produced the narration — no CapCut animation by hand.
- Product and SaaS teams: screen recordings need captions for LinkedIn auto-play and accessibility compliance. Upload the footage and get a captioned demo out of the same pipeline run.
- Course creators: accurate captions on 40-minute tutorials without transcription errors on technical terms, product names, and code references. Script-based means zero guesswork.
- Agencies: fifteen channels, three aspect ratios each, different caption styles per client brand. Preset configurations per workspace handle variation without per-video manual work.
FAQ
Common questions about auto captions
How accurate are the captions?
100% accurate. Captions come from your approved script, not speech-to-text. The words match exactly what the narration says because they share the same source.
Can I edit captions after they generate?
Caption events are stored as inspectable artifacts in the dashboard. A visual caption editor for direct text and timing adjustments is on the roadmap.
What subtitle format does Outbox use?
ASS (Advanced SubStation Alpha). It supports styled text, precise positioning, color, outline, shadow, and timed effects — far more than SRT or VTT. Captions are burned directly into video frames.
Can I get an SRT file instead of burned-in captions?
The pipeline generates an ASS file as an intermediate artifact. SRT export is on the roadmap for creators who need sidecar subtitle files for YouTube closed-caption upload.
Do captions work for short-form and long-form?
Yes. Aspect-ratio-aware layouts adjust font size, positioning, and safe-area margins for 16:9 (YouTube), 9:16 (Shorts, Reels, TikTok), and 1:1 (Instagram feed, LinkedIn).
What happens if the script changes after captions are generated?
Outbox re-runs from the affected stage. Voiceover re-renders, alignment re-syncs, and captions regenerate automatically. The analysis stage stays cached.
Can I use different caption styles for different videos?
Yes. Caption presets are set per pipeline run. Your workspace admin can set a locked default preset, and individual runs can override the style.
Does active-word highlighting work on all presets?
Yes. It uses word-level timing data from the voiceover alignment. Set a highlight color (default: #adff2f) and the pipeline handles the rest.
Get started
Your first captioned video is one pipeline run away.
Upload your footage. Let the pipeline script and voice your video. Pick a caption preset. Captions generate, burn into the video, and flow to publishing — automatically.