The AI voiceover stage renders narration. Alignment syncs audio to footage and produces timed segments. Captions receive those segments — already split, already synced.
AI Voiceover → Feature page
Auto captions that ship inside your video — not as an afterthought.
Auto captions are timed on-screen subtitles generated from your narration and burned directly into the final video. In Outbox, captioning is stage five of a nine-stage production pipeline — no separate subtitle tool needed.
Problem
What do auto captions actually solve?
Adding captions is tedious. You export audio, upload to a transcription service, fix errors, download an SRT file, import it into your editor, adjust timing, style everything, and re-export per aspect ratio.
Eight steps. Three tools. Every single video. In Outbox, captions are one pipeline stage. Your narration flows in. Styled, timed captions flow out. The pipeline continues.
1. Export your final audio or video from the editor.
2. Upload to a transcription service like Descript or Rev.
3. Wait for the transcript. Fix errors manually.
4. Download an .srt or .vtt file.
5. Import the subtitle file into your video editor.
6. Adjust timing, fix overlaps, tweak positioning.
7. Style the captions: font, size, color, outline, shadow.
8. Re-export for each aspect ratio (16:9, 9:16, 1:1).
Mechanics
How auto captions work in the pipeline
1. Segment intake: the alignment stage delivers script segments with start time, end time, and text.
2. Chunking: long segments split into 2–8 word chunks at natural pause points.
3. Styling: font, size, color, positioning, and emphasis rules apply from your selected preset.
4. ASS generation: every timed caption event is written to an ASS file with full styling.
5. Burn-in: FFmpeg burns captions into the final video frames. No sidecar files.
Accuracy
Why script-based captions beat transcription
Most caption tools start by transcribing audio. That introduces errors — especially on technical terms, product names, and abbreviations. Outbox takes a different path: captions come from your approved script via the scripting stage and AI voiceover. The text is already correct.
| Approach | Accuracy | Timing source | Editing required |
|---|---|---|---|
| Speech-to-text transcription | 80–95% depending on audio quality | Inferred from audio waveform | Manual correction of errors, names, technical terms |
| Outbox script-based captions | 100% — uses your approved script | Segment timing from the alignment stage | None — the script is already reviewed |
Presets
Three caption presets, ready to render
Each preset controls font, color, outline, shadow, alignment, and text behavior. Presets adapt to your video's aspect ratio automatically — a 16:9 video gets different margins and font sizes than a 9:16 Short.
| Preset | Look | Best for |
|---|---|---|
| classic_bold | Large white text, strong black outline, center-bottom | YouTube long-form, tutorials, product demos |
| minimal_clean | Smaller text, subtle shadow, lower-third positioning | Professional content, SaaS demos, course videos |
| highlight_brand | Bold text with accent color highlights, center placement | Short-form content, Reels, Shorts, TikTok |
Readability
Smart chunking keeps text readable
A script segment might be 15 seconds and 40 words. Displaying everything as one subtitle block is unreadable. The chunking engine splits segments into smaller events at natural pause points.
Each chunk targets 2–8 words and 1–2 lines. Time allocation is proportional — longer chunks get more screen time. Breaks happen at punctuation, conjunctions, and phrase boundaries.
| Time | Caption |
|---|---|
| 0.0–1.2s | Welcome to Outbox.run |
| 1.2–2.6s | The automated video pipeline |
| 2.6–4.1s | that turns raw footage |
| 4.1–5.4s | into published content |
| 5.4–7.0s | with scripting, voiceover, |
| 7.0–8.5s | captions, and metadata |
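The chunking behavior above can be sketched in a few lines of Python. This is an illustration, not Outbox's actual implementation; the function name `chunk_segment` and the dict shape are assumptions:

```python
import re

def chunk_segment(text, start, end, max_words=8):
    """Split a timed segment into caption chunks at natural pauses,
    allocating screen time proportionally to each chunk's word count."""
    # Prefer breaks after punctuation; fall back to hard word-count splits.
    pieces = re.split(r"(?<=[,.;:!?])\s+", text.strip())
    chunks = []
    for piece in pieces:
        words = piece.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    total_words = sum(len(c.split()) for c in chunks)
    duration = end - start
    events, cursor = [], start
    for c in chunks:
        # Proportional allocation: longer chunks get more screen time.
        share = duration * len(c.split()) / total_words
        events.append({"start": round(cursor, 2),
                       "end": round(cursor + share, 2),
                       "text": c})
        cursor += share
    return events
```

Feeding it the segment "Welcome to Outbox.run, the automated video pipeline" over 0.0–2.6s yields two events split at the comma, with the four-word chunk holding the screen slightly longer than the three-word one.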
Formats
Aspect-ratio-aware layouts across platforms
Different platforms demand different formats. Safe-area margins keep captions away from platform UI elements — like TikTok's share button or YouTube's progress bar. Your text stays visible regardless of where the viewer watches.
| Aspect ratio | Platform | Caption behavior |
|---|---|---|
| 16:9 | YouTube, Vimeo | Bottom-center placement, larger font, wider margins |
| 9:16 | YouTube Shorts, TikTok, Reels | Center-screen placement, safe-area margins to avoid platform UI overlays |
| 1:1 | Instagram feed, LinkedIn | Lower-third placement, compact font sizing |
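One way to model the layout table is a per-ratio preset resolved against the frame height. The alignment codes follow the ASS numpad convention (2 = bottom-center, 5 = middle-center); the percentages are illustrative values, not Outbox's actual numbers:

```python
# Hypothetical layout presets -- values are illustrative only.
LAYOUTS = {
    "16:9": {"alignment": 2, "font_scale": 1.0,  "margin_v_pct": 0.06},
    "9:16": {"alignment": 5, "font_scale": 1.15, "margin_v_pct": 0.15},  # clear platform UI
    "1:1":  {"alignment": 2, "font_scale": 0.9,  "margin_v_pct": 0.12},
}

def margins_for(aspect_ratio, frame_height):
    """Resolve a pixel MarginV for the given aspect ratio and frame height."""
    layout = LAYOUTS[aspect_ratio]
    return {**layout, "margin_v": int(frame_height * layout["margin_v_pct"])}
```

Expressing the vertical margin as a fraction of frame height keeps the safe area proportional whether the render is 1080p or 4K.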
Effects
Script effects shape caption presentation
Script segments carry an effects field from the scripting stage. Captions respond to those effects automatically — an upbeat intro looks different from a technical walkthrough without you changing settings mid-run.
| Script effect | Caption behavior |
|---|---|
| upbeat | Slightly larger font, faster chunk transitions |
| clear | High-contrast white on black outline, no decoration |
| instructional | Stable lower-third positioning, smaller font |
| technical | Tighter layout, narrower line lengths |
| professional | Restrained preset, clean lines |
| fade_out | Reduced opacity near segment end |
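Conceptually, each effect is a small set of style overrides merged onto the base preset. The override keys and values below are assumptions for illustration, not Outbox's schema:

```python
# Hypothetical effect-to-style overrides (illustrative values only).
EFFECT_OVERRIDES = {
    "upbeat":        {"font_scale": 1.1, "transition_ms": 80},
    "clear":         {"fill": "white", "outline": "black", "decoration": None},
    "instructional": {"position": "lower-third", "font_scale": 0.9},
    "technical":     {"max_chars_per_line": 28},
    "professional":  {"decoration": None},
    "fade_out":      {"end_opacity": 0.4},
}

def apply_effect(style, effect):
    """Merge an effect's overrides into a base style dict; unknown effects are a no-op."""
    return {**style, **EFFECT_OVERRIDES.get(effect, {})}
```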
Engagement
Active-word highlighting for short-form
The currently spoken word gets a visual accent while the surrounding phrase stays visible. This is the karaoke-style captioning format popular in Hormozi-style Shorts and viral TikToks — generated automatically from your pipeline run.
1. Word-level timing data is extracted from the voiceover alignment.
2. Each caption phrase displays with all words visible.
3. The active word gets a color highlight or weight change.
4. The highlight advances word-by-word through the phrase.
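The steps above can be sketched by emitting one ASS dialogue event per word, with the active word wrapped in a color override. Note that ASS stores colors as `&HBBGGRR&`, so the default highlight `#adff2f` becomes `&H2FFFAD&`. A minimal sketch, assuming `words` is a list of `(word, start, end)` tuples from alignment:

```python
def highlight_events(words, color="&H2FFFAD&"):
    """One event per word: the full phrase stays on screen while the
    active word gets an inline ASS color override (\\c sets, \\r resets)."""
    events = []
    for i, (_, start, end) in enumerate(words):
        parts = []
        for j, (w, _, _) in enumerate(words):
            if i == j:
                parts.append(r"{\c" + color + "}" + w + r"{\r}")
            else:
                parts.append(w)
        events.append((start, end, " ".join(parts)))
    return events
```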
Comparison
Auto captions vs. the manual path
| Dimension | Manual workflow | Outbox Auto Captions |
|---|---|---|
| Tools required | Transcription service + subtitle editor + video editor | One pipeline stage |
| Time per video | 20–45 min (transcribe, correct, style, position, export) | Automatic — generates and passes to next stage |
| Accuracy | 80–95% from speech-to-text (manual cleanup needed) | 100% — sourced from your approved script |
| Multi-format support | Re-position and re-export per aspect ratio | Aspect-ratio-aware layouts generated automatically |
| Style change impact | Re-edit every video manually | Re-run from captions stage. Upstream cached. |
| Active-word highlighting | Manual animation in CapCut or After Effects | Generated from word-level timing data |
| Brand consistency | Depends on the editor remembering the style guide | Preset-locked per workspace |
Settings
Caption configuration options
The settings stay intentionally focused: enough control to shape appearance, not enough complexity to force you back into a subtitle editor.
| Setting | What it does | Example value |
|---|---|---|
| Caption preset | Selects the visual style | classic_bold |
| Captions enabled | Toggle captions on/off per run | true |
| Max words per chunk | Controls chunk size for readability | 6 |
| Uppercase mode | Force uppercase text per preset | false |
| Highlight color | Accent color for active-word highlighting | #adff2f |
| Target aspect ratio | Determines layout and safe areas | 16:9, 9:16, or 1:1 |
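Taken together, the options above amount to a small config object per run. The field names below are assumptions drawn from the settings table, not Outbox's documented schema:

```json
{
  "captions": {
    "enabled": true,
    "preset": "classic_bold",
    "max_words_per_chunk": 6,
    "uppercase": false,
    "highlight_color": "#adff2f",
    "aspect_ratio": "9:16"
  }
}
```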
Artifacts
The caption data pipeline under the hood
Every artifact is stored and inspectable in the Outbox dashboard. If a caption looks wrong, trace it back through the chain to see where the text, timing, or styling originated.
| Artifact | What it contains | Purpose |
|---|---|---|
| script_segments.json | Timed text segments with start/end times and effects | Source of truth from the alignment stage |
| word_timings.json | Per-word start and end times from voiceover alignment | Enables active-word highlighting |
| caption_events.json | Chunked, styled caption events with timing and positioning | Intermediate representation for debugging and editing |
| captions.ass | ASS subtitle file with style definitions and timed events | FFmpeg input for final burn-in |
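For orientation, here is a trimmed illustration of what `captions.ass` might look like, using the timings from the chunking example. The `Format:` lines list only a subset of the usual V4+ fields, and the style values are placeholders, not Outbox's presets:

```
[Script Info]
ScriptType: v4.00+
PlayResX: 1920
PlayResY: 1080

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, OutlineColour, Bold, Outline, Alignment, MarginV
Style: classic_bold,Arial,72,&H00FFFFFF,&H00000000,1,4,2,60

[Events]
Format: Layer, Start, End, Style, Text
Dialogue: 0,0:00:00.00,0:00:01.20,classic_bold,Welcome to Outbox.run
Dialogue: 0,0:00:01.20,0:00:02.60,classic_bold,The automated video pipeline
```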
Pipeline
How captions fit the full production flow
The editing stage receives the ASS subtitle file. FFmpeg burns captions directly into video frames. The render artifact ships with captions embedded.
Re-run from the captions stage only. Analysis, script, voiceover, and alignment stay cached. New render with updated captions — no full reprocess.
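The burn-in step corresponds to FFmpeg's `ass` video filter. A minimal equivalent command, with file names assumed for illustration:

```shell
# Hard-burn captions.ass into the frames; audio passes through untouched.
ffmpeg -i render_input.mp4 -vf "ass=captions.ass" -c:a copy final_video.mp4
```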
- AI voiceover: renders narration from the script; caption timing syncs to voiceover pacing.
- Styling: the visual customization layer for presets, colors, fonts, and positioning rules.
- Scripting: produces the script text that captions display; script accuracy means caption accuracy.
- Metadata: caption text feeds into description and tag generation for search optimization.
- Workspaces: lock caption presets per workspace so every team member produces on-brand captions.
Audience
Who uses auto captions in the pipeline?
- Multi-channel creators: publish 3–5 videos per week across channels. Captioning every video manually doesn't scale. Outbox captions every run automatically — no extra steps.
- Short-form creators: Shorts, Reels, and TikToks live or die by captions. Active-word highlighting is generated from the same pipeline that produced the narration — no CapCut animation by hand.
- Product and SaaS teams: screen recordings need captions for LinkedIn auto-play and accessibility compliance. Upload the footage and get a captioned demo out of the same pipeline run.
- Course creators: accurate captions on 40-minute tutorials without transcription errors on technical terms, product names, and code references. Script-based means zero guesswork.
- Agencies: fifteen channels, three aspect ratios each, different caption styles per client brand. Preset configurations per workspace handle variation without per-video manual work.
FAQ
Common questions about auto captions
How accurate are the captions?
100% accurate. Captions come from your approved script, not speech-to-text. The words match exactly what the narration says because they share the same source.
Can I edit captions after they generate?
Caption events are stored as inspectable artifacts in the dashboard. A visual caption editor for direct text and timing adjustments is on the roadmap.
What subtitle format does Outbox use?
ASS (Advanced SubStation Alpha). It supports styled text, precise positioning, color, outline, shadow, and timed effects — far more than SRT or VTT. Captions are burned directly into video frames.
Can I get an SRT file instead of burned-in captions?
The pipeline generates an ASS file as an intermediate artifact. SRT export is on the roadmap for creators who need sidecar subtitle files for YouTube closed-caption upload.
Do captions work for short-form and long-form?
Yes. Aspect-ratio-aware layouts adjust font size, positioning, and safe-area margins for 16:9 (YouTube), 9:16 (Shorts, Reels, TikTok), and 1:1 (Instagram feed, LinkedIn).
What happens if the script changes after captions are generated?
Outbox re-runs from the affected stage. Voiceover re-renders, alignment re-syncs, and captions regenerate automatically. The analysis stage stays cached.
Can I use different caption styles for different videos?
Yes. Caption presets are set per pipeline run. Your workspace admin can set a locked default preset, and individual runs can override the style.
Does active-word highlighting work on all presets?
Yes. It uses word-level timing data from the voiceover alignment. Set a highlight color (default: #adff2f) and the pipeline handles the rest.
Get started
Your first captioned video is one pipeline run away.
Upload your footage. Let the pipeline script and voice your video. Pick a caption preset. Captions generate, burn into the video, and flow to publishing — automatically.