

Auto captions that ship inside your video — not as an afterthought.

Auto captions are timed on-screen subtitles generated from your narration and burned directly into the final video. In Outbox, captioning is stage five of a nine-stage production pipeline — no separate subtitle tool needed.

TL;DR: Outbox generates styled captions from your narration, chunks them into readable groups, and burns them into the final video. Pick a preset, set your format — captions render and flow to the next stage. No subtitle editor. No manual timing.
3 caption presets · 100% script accuracy · Stage 5 of 9 · Active-word highlighting · Multi-format layouts
Pipeline stage
Captions is stage 5 of 9.
01 Analyze
02 Script
03 Voiceover
04 Align
05 Captions
06 Edit
07 Render
08 Metadata
09 Publish

Problem

What do auto captions actually solve?

Adding captions is tedious. You export audio, upload to a transcription service, fix errors, download an SRT file, import it into your editor, adjust timing, style everything, and re-export per aspect ratio.

Eight steps. Three tools. Every single video. In Outbox, captions are one pipeline stage. Your narration flows in. Styled, timed captions flow out. The pipeline continues.

Manual alternative
  1. Export your final audio or video from the editor.
  2. Upload to a transcription service like Descript or Rev.
  3. Wait for the transcript. Fix errors manually.
  4. Download an .srt or .vtt file.
  5. Import the subtitle file into your video editor.
  6. Adjust timing, fix overlaps, tweak positioning.
  7. Style the captions — font, size, color, outline, shadow.
  8. Re-export for each aspect ratio (16:9, 9:16, 1:1).
Outbox result
One stage instead of eight disconnected steps.
Analyze → Script → Voiceover → Align → Captions → Edit → Render → Metadata → Publish

Mechanics

How auto captions work in the pipeline

01 Receive timed segments

The alignment stage delivers script segments with start time, end time, and text.

02 Chunk into readable groups

Long segments split into 2–8 word chunks at natural pause points.

03 Apply caption preset

Font, size, color, positioning, and emphasis rules from your selected style.

04 Generate ASS subtitle file

Every timed caption event gets written to an ASS file with full styling.

05 Pass to edit and render

FFmpeg burns captions into the final video frames. No sidecar files.
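
To make steps 02 through 04 concrete, here is a minimal sketch of how timed chunks can become ASS Dialogue events. The chunk dictionaries and style name are illustrative assumptions; the event layout itself (Layer, Start, End, Style, Name, margins, Effect, Text) is the standard ASS format. A real .ass file also carries [Script Info] and [V4+ Styles] sections that hold the preset's font, outline, and margin definitions.

```python
# Illustrative sketch: each timed caption chunk becomes one "Dialogue" event
# in the ASS file. The chunk dicts and style name are assumptions; the event
# field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV,
# Effect, Text) follows the standard ASS layout.

def to_ass_time(seconds: float) -> str:
    """Format seconds as H:MM:SS.cc, the timestamp format ASS uses."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours)}:{int(minutes):02d}:{secs:05.2f}"

def dialogue_events(chunks, style="Default"):
    """Yield one ASS Dialogue line per caption chunk."""
    for chunk in chunks:
        yield (
            f"Dialogue: 0,{to_ass_time(chunk['start'])},{to_ass_time(chunk['end'])},"
            f"{style},,0,0,0,,{chunk['text']}"
        )

chunks = [
    {"start": 0.0, "end": 1.2, "text": "Welcome to Outbox.run"},
    {"start": 1.2, "end": 2.6, "text": "The automated video pipeline"},
]
print("\n".join(dialogue_events(chunks)))
```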

Caption presets
Three styles. Zero manual work.
Classic Bold (classic_bold): WELCOME TO OUTBOX
Minimal Clean (minimal_clean): Welcome to Outbox
Brand Pop (highlight_brand): Welcome to Outbox

Accuracy

Why script-based captions beat transcription

Most caption tools start by transcribing audio. That introduces errors — especially on technical terms, product names, and abbreviations. Outbox takes a different path: captions come from your approved script via the scripting stage and AI voiceover. The text is already correct.

Approach | Accuracy | Timing source | Editing required
Speech-to-text transcription | 80–95% depending on audio quality | Inferred from audio waveform | Manual correction of errors, names, technical terms
Outbox script-based captions | 100% — uses your approved script | Segment timing from the alignment stage | None — the script is already reviewed

Presets

Three caption presets, ready to render

Each preset controls font, color, outline, shadow, alignment, and text behavior. Presets adapt to your video's aspect ratio automatically — a 16:9 video gets different margins and font sizes than a 9:16 Short.

Preset | Look | Best for
classic_bold | Large white text, strong black outline, center-bottom | YouTube long-form, tutorials, product demos
minimal_clean | Smaller text, subtle shadow, lower-third positioning | Professional content, SaaS demos, course videos
highlight_brand | Bold text with accent color highlights, center placement | Short-form content, Reels, Shorts, TikTok

Readability

Smart chunking keeps text readable

A script segment might be 15 seconds and 40 words. Displaying everything as one subtitle block is unreadable. The chunking engine splits segments into smaller events at natural pause points.

Each chunk targets 2–8 words and 1–2 lines. Time allocation is proportional — longer chunks get more screen time. Breaks happen at punctuation, conjunctions, and phrase boundaries.

Time | Caption
0.0–1.2s | Welcome to Outbox.run
1.2–2.6s | The automated video pipeline
2.6–4.1s | that turns raw footage
4.1–5.4s | into published content
5.4–7.0s | with scripting, voiceover,
7.0–8.5s | captions, and metadata
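
For readers who like to see the idea in code, a simplified sketch of the chunking heuristic follows. The actual pipeline's break rules (conjunctions, phrase boundaries) are richer than this; the example only splits at punctuation, caps chunks at a word limit, and allocates screen time proportional to word count.

```python
import re

def chunk_segment(text: str, start: float, end: float, max_words: int = 6):
    """Split one timed segment into small caption chunks.

    Simplified sketch: break at punctuation, cap each piece at max_words,
    then allocate screen time proportional to each chunk's word count.
    """
    words = re.findall(r"\S+", text)
    pieces, current = [], []
    for word in words:
        current.append(word)
        if len(current) >= max_words or word[-1] in ".,!?;:":
            pieces.append(" ".join(current))
            current = []
    if current:
        pieces.append(" ".join(current))

    total = sum(len(p.split()) for p in pieces)
    duration, cursor, events = end - start, start, []
    for piece in pieces:
        share = duration * len(piece.split()) / total  # proportional screen time
        events.append({"start": round(cursor, 2), "end": round(cursor + share, 2), "text": piece})
        cursor += share
    return events

print(chunk_segment(
    "Welcome to Outbox.run, the automated video pipeline "
    "that turns raw footage into published content",
    start=0.0, end=8.5,
))
```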

Formats

Aspect-ratio-aware layouts across platforms

Different platforms demand different formats. Safe-area margins keep captions away from platform UI elements — like TikTok's share button or YouTube's progress bar. Your text stays visible regardless of where the viewer watches.

Aspect ratio | Platform | Caption behavior
16:9 | YouTube, Vimeo | Bottom-center placement, larger font, wider margins
9:16 | YouTube Shorts, TikTok, Reels | Center-screen placement, safe-area margins to avoid platform UI overlays
1:1 | Instagram feed, LinkedIn | Lower-third placement, compact font sizing
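
As a rough illustration, one preset can resolve to different layout numbers per format. The placements echo the table above, but the specific font scales and pixel margins below are invented for the example, not Outbox's actual values.

```python
# Invented example values: the point is the mapping, not the exact numbers.
# One caption preset resolves to different layout parameters per format.
LAYOUTS = {
    "16:9": {"placement": "bottom-center", "font_scale": 1.0, "margin_v_px": 60},
    "9:16": {"placement": "center", "font_scale": 1.2, "margin_v_px": 320},  # clears platform UI
    "1:1": {"placement": "lower-third", "font_scale": 0.9, "margin_v_px": 180},
}

def layout_for(aspect_ratio: str) -> dict:
    """Pick safe-area margins and sizing for the target format."""
    return LAYOUTS[aspect_ratio]
```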

Effects

Script effects shape caption presentation

Script segments carry an effects field from the scripting stage. Captions respond to those effects automatically — an upbeat intro looks different from a technical walkthrough without you changing settings mid-run.

Script effect | Caption behavior
upbeat | Slightly larger font, faster chunk transitions
clear | High-contrast white on black outline, no decoration
instructional | Stable lower-third positioning, smaller font
technical | Tighter layout, narrower line lengths
professional | Restrained preset, clean lines
fade_out | Reduced opacity near segment end

Engagement

Active-word highlighting for short-form

The currently spoken word gets a visual accent while the surrounding phrase stays visible. This is the karaoke-style captioning format popular in Hormozi-style Shorts and viral TikToks — generated automatically from your pipeline run.

How it works
  1. Word-level timing data is extracted from the voiceover alignment.
  2. Each caption phrase displays with all words visible.
  3. The active word gets a color highlight or weight change.
  4. The highlight advances word-by-word through the phrase.
Example: "Ship videos, not edits." with the active-word highlight on word 3 of 4.
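
One common way to express this in ASS, assuming per-word timings are available, is to emit one event per word position and wrap the active word in an inline color override. The phrase and timings below are invented; the override tags are standard ASS syntax, which stores colors in BGR order, so #adff2f becomes &H2FFFAD&.

```python
# Sketch of karaoke-style highlighting: one ASS event per word position,
# with the active word wrapped in an inline color override. Word timings
# are invented; the override tags are standard ASS syntax.
HIGHLIGHT = r"{\c&H2FFFAD&}"   # #adff2f expressed in ASS's &HBBGGRR& (BGR) order
RESET = r"{\c}"                # fall back to the style's default color

def to_ass_time(seconds: float) -> str:
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours)}:{int(minutes):02d}:{secs:05.2f}"

def highlight_events(words, style="Default"):
    """Yield one Dialogue event per word, keeping the whole phrase visible."""
    for active, (_, start, end) in enumerate(words):
        text = " ".join(
            f"{HIGHLIGHT}{word}{RESET}" if i == active else word
            for i, (word, _, _) in enumerate(words)
        )
        yield f"Dialogue: 0,{to_ass_time(start)},{to_ass_time(end)},{style},,0,0,0,,{text}"

phrase = [("Ship", 0.0, 0.35), ("videos,", 0.35, 0.80), ("not", 0.80, 1.05), ("edits.", 1.05, 1.50)]
for event in highlight_events(phrase):
    print(event)
```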

Comparison

Auto captions vs. the manual path

Dimension | Manual workflow | Outbox Auto Captions
Tools required | Transcription service + subtitle editor + video editor | One pipeline stage
Time per video | 20–45 min (transcribe, correct, style, position, export) | Automatic — generates and passes to next stage
Accuracy | 80–95% from speech-to-text (manual cleanup needed) | 100% — sourced from your approved script
Multi-format support | Re-position and re-export per aspect ratio | Aspect-ratio-aware layouts generated automatically
Style change impact | Re-edit every video manually | Re-run from captions stage. Upstream cached.
Active-word highlighting | Manual animation in CapCut or After Effects | Generated from word-level timing data
Brand consistency | Depends on the editor remembering the style guide | Preset-locked per workspace

Settings

Caption configuration options

The settings stay intentionally focused: enough control to shape appearance, not enough complexity to force you back into a subtitle editor.

Setting | What it does | Example value
Caption preset | Selects the visual style | classic_bold
Captions enabled | Toggle captions on/off per run | true
Max words per chunk | Controls chunk size for readability | 6
Uppercase mode | Force uppercase text per preset | false
Highlight color | Accent color for active-word highlighting | #adff2f
Target aspect ratio | Determines layout and safe areas | 16:9, 9:16, or 1:1
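
Taken together, a run's caption settings might look something like the snippet below. The key names mirror the table; the exact shape of Outbox's run configuration is an assumption made for illustration.

```python
# Illustrative run configuration mirroring the settings table above.
# Key names follow the table; the exact config shape is an assumption.
caption_settings = {
    "captions_enabled": True,
    "caption_preset": "classic_bold",
    "max_words_per_chunk": 6,
    "uppercase": False,
    "highlight_color": "#adff2f",
    "aspect_ratio": "16:9",
}
```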

Artifacts

The caption data pipeline under the hood

Every artifact is stored and inspectable in the Outbox dashboard. If a caption looks wrong, trace it back through the chain to see where the text, timing, or styling originated.

script_segments.json → word_timings.json → caption_events.json → captions.ass → burned video
Artifact | What it contains | Purpose
script_segments.json | Timed text segments with start/end times and effects | Source of truth from the alignment stage
word_timings.json | Per-word start and end times from voiceover alignment | Enables active-word highlighting
caption_events.json | Chunked, styled caption events with timing and positioning | Intermediate representation for debugging and editing
captions.ass | ASS subtitle file with style definitions and timed events | FFmpeg input for final burn-in
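
To make the chain concrete, here is what a single record in caption_events.json could plausibly hold. The field names are inferred from the artifact descriptions above, not taken from Outbox's documented schema.

```python
# Hypothetical caption_events.json record: field names are inferred from
# the artifact descriptions above, not Outbox's documented schema.
caption_event = {
    "start": 1.2,
    "end": 2.6,
    "text": "The automated video pipeline",
    "preset": "classic_bold",
    "position": "bottom-center",
    "words": [  # per-word timing that powers active-word highlighting
        {"word": "The", "start": 1.20, "end": 1.35},
        {"word": "automated", "start": 1.35, "end": 1.80},
        {"word": "video", "start": 1.80, "end": 2.15},
        {"word": "pipeline", "start": 2.15, "end": 2.60},
    ],
}
```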

Pipeline

How captions fit the full production flow

Upstream
Voiceover → Align → Captions

The AI voiceover stage renders narration. Alignment syncs audio to footage and produces timed segments. Captions receive those segments — already split, already synced.

Downstream
Captions → Edit → Render

The editing stage receives the ASS subtitle file. FFmpeg burns captions directly into video frames. The render artifact ships with captions embedded.
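
The burn-in itself comes down to FFmpeg's ass filter, which uses libass to render the subtitle file onto each frame. A minimal sketch with placeholder filenames:

```python
# Minimal sketch of the burn-in call. FFmpeg's "ass" filter renders the
# styled subtitle file onto every video frame; audio is copied untouched.
# Filenames are placeholders, not Outbox's real artifact paths.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "edited_video.mp4",      # output of the edit stage
        "-vf", "ass=captions.ass",     # burn captions into the frames
        "-c:a", "copy",                # keep the narration audio as-is
        "captioned_video.mp4",
    ],
    check=True,
)
```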

Stage isolation
Change the preset, keep the rest

Re-run from the captions stage only. Analysis, script, voiceover, and alignment stay cached. New render with updated captions — no full reprocess.

Related feature

Renders narration from the script. Caption timing syncs to voiceover pacing.

Related feature

Visual customization layer for presets, colors, fonts, and positioning rules.

Related feature

Produces the script text that captions display. Script accuracy means caption accuracy.

Related feature

Caption text feeds into description and tag generation for search optimization.

Related feature

Locks caption presets per workspace so every team member produces on-brand captions.

Audience

Who uses auto captions in the pipeline?

Faceless YouTube operators

Publish 3–5 videos per week across channels. Captioning every video manually doesn't scale. Outbox captions every run automatically — no extra steps.

Short-form content producers

Shorts, Reels, and TikToks live or die by captions. Active-word highlighting is generated from the same pipeline that produced the narration — no CapCut animation by hand.

SaaS founders

Screen recordings need captions for LinkedIn auto-play and accessibility compliance. Upload the footage and get a captioned demo out of the same pipeline run.

Course creators and educators

Accurate captions on 40-minute tutorials without transcription errors on technical terms, product names, and code references. Script-based means zero guesswork.

Agencies

Fifteen channels, three aspect ratios each, different caption styles per client brand. Preset configurations per workspace handle variation without per-video manual work.

FAQ

Common questions about auto captions

How accurate are the captions?

100% accurate. Captions come from your approved script, not speech-to-text. The words match exactly what the narration says because they share the same source.

Can I edit captions after they generate?

Caption events are stored as inspectable artifacts in the dashboard. A visual caption editor for direct text and timing adjustments is on the roadmap.

What subtitle format does Outbox use?

ASS (Advanced SubStation Alpha). It supports styled text, precise positioning, color, outline, shadow, and timed effects — far more than SRT or VTT. Captions are burned directly into video frames.

Can I get an SRT file instead of burned-in captions?

The pipeline generates an ASS file as an intermediate artifact. SRT export is on the roadmap for creators who need sidecar subtitle files for YouTube closed-caption upload.

Do captions work for short-form and long-form?

Yes. Aspect-ratio-aware layouts adjust font size, positioning, and safe-area margins for 16:9 (YouTube), 9:16 (Shorts, Reels, TikTok), and 1:1 (Instagram feed, LinkedIn).

What happens if the script changes after captions are generated?

Outbox re-runs from the affected stage. Voiceover re-renders, alignment re-syncs, and captions regenerate automatically. The analysis stage stays cached.

Can I use different caption styles for different videos?

Yes. Caption presets are set per pipeline run. Your workspace admin can lock default presets, while individual runs can override the style.

Does active-word highlighting work on all presets?

Yes. It uses word-level timing data from the voiceover alignment. Set a highlight color (default: #adff2f) and the pipeline handles the rest.

Get started

Your first captioned video is one pipeline run away.

Upload your footage. Let the pipeline script and voice your video. Pick a caption preset. Captions generate, burn into the video, and flow to publishing — automatically.