TTML to WebVTT Caption Converter
Convert TTML (Timed Text Markup Language) caption files to WebVTT format instantly. Handles styles, timing, and nested spans with accurate parsing.
About
TTML (Timed Text Markup Language), a W3C standard formerly known as DFXP, encodes subtitle data in XML with a styling model that does not map directly to WebVTT. A naive conversion can lose italic markers, drop alignment settings, or miscalculate end times by treating dur as an absolute timestamp instead of adding it to begin. This tool parses the full TTML DOM including <styling> blocks and <span> nesting, resolves style references by id, and emits spec-compliant WebVTT with proper cue tags. It handles offset-time formats (12.5s, 500ms) and clock-time with frames. The conversion runs entirely in the browser with no server round-trip. Limitation: TTML region-based positioning is approximated because WebVTT positioning semantics differ from TTML's region model.
Formulas
The core timing computation converts a TTML duration-based cue into an absolute end timestamp:
$$t_{end} = t_{begin} + t_{dur}$$
where $t_{begin}$ is the parsed begin attribute in milliseconds and $t_{dur}$ is the parsed dur attribute in milliseconds. If an end attribute is present instead, it is used directly as $t_{end}$.
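The end-time rule can be sketched in a few lines of Python. This is a minimal illustration rather than the tool's actual code; the function name resolve_end_ms and its parameters are hypothetical, and all values are assumed to be already parsed to milliseconds.

```python
from typing import Optional


def resolve_end_ms(begin_ms: int,
                   dur_ms: Optional[int] = None,
                   end_ms: Optional[int] = None) -> Optional[int]:
    """Return the absolute cue end time in milliseconds, or None if unknown."""
    if end_ms is not None:
        # An explicit end attribute is already absolute and is used directly.
        return end_ms
    if dur_ms is not None:
        # dur is relative to begin, so the two are added.
        return begin_ms + dur_ms
    # Neither attribute is present: the caller decides what to do with the cue.
    return None
```

For example, a cue with begin at 10000 ms and dur of 5000 ms ends at 15000 ms, not at 5000 ms.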
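Timestamp parsing normalizes all TTML time expressions to milliseconds. For clock-time expressions (HH:MM:SS.mmm), with H, M, and S the hours, minutes, and seconds fields and $ms$ the fractional part in milliseconds:
$$t = 3600000\,H + 60000\,M + 1000\,S + ms$$
For frame-based timestamps (HH:MM:SS:FF), the frame count F replaces the fractional part and is converted assuming a default frame rate of 30 fps:
$$ms = \frac{F}{30} \times 1000$$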
For offset-time expressions, the numeric value is multiplied by the unit factor: h → 3600000, m → 60000, s → 1000, ms → 1.
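The parsing rules above might look like the following in Python. This is a sketch under stated assumptions: parse_ttml_time and the two regular expressions are hypothetical names, the frame rate is fixed at the 30 fps default mentioned above, and only the three expression shapes described in this section are accepted.

```python
import re

FRAME_RATE = 30  # default frame rate assumed for HH:MM:SS:FF timestamps

# Unit factors for offset-time expressions, in milliseconds.
UNIT_MS = {"h": 3_600_000, "m": 60_000, "s": 1_000, "ms": 1}

# "ms" is listed before "m" and "s" so that "500ms" is not read as minutes.
OFFSET_RE = re.compile(r"^(\d+(?:\.\d+)?)(h|ms|m|s)$")
CLOCK_RE = re.compile(r"^(\d{2,}):(\d{2}):(\d{2})(?:\.(\d+)|:(\d{2,}))?$")


def parse_ttml_time(expr: str) -> int:
    """Convert a TTML time expression to integer milliseconds."""
    expr = expr.strip()

    match = OFFSET_RE.match(expr)
    if match:
        # Offset-time: numeric value multiplied by the unit factor.
        return round(float(match.group(1)) * UNIT_MS[match.group(2)])

    match = CLOCK_RE.match(expr)
    if match:
        hours, minutes, seconds = (int(match.group(i)) for i in (1, 2, 3))
        total = hours * 3_600_000 + minutes * 60_000 + seconds * 1_000
        if match.group(4) is not None:
            # Fractional seconds, e.g. ".250" contributes 250 ms.
            total += round(float("0." + match.group(4)) * 1000)
        elif match.group(5) is not None:
            # Frame count, converted at the assumed frame rate.
            total += round(int(match.group(5)) * 1000 / FRAME_RATE)
        return total

    raise ValueError(f"Unsupported TTML time expression: {expr!r}")
```

With these rules, 12.5s parses to 12500 ms and 00:00:01:15 parses to 1500 ms, since 15 frames at 30 fps is half a second.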
Style resolution follows a lookup chain: each style attribute value on a <p> or <span> is matched against the id of <style> elements in the TTML <head>. The resolved properties are then mapped to WebVTT cue tags: tts:fontStyle="italic" → <i>, tts:fontWeight="bold" → <b>, tts:textDecoration="underline" → <u>.
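The lookup chain can be illustrated with Python's standard xml.etree.ElementTree module. This is a rough sketch, not the tool's implementation: collect_styles, tags_for, and TAG_RULES are hypothetical names, and the code assumes style ids are written as xml:id and that documents use the standard TTML and TTML styling namespaces.

```python
import xml.etree.ElementTree as ET

TT_NS = "http://www.w3.org/ns/ttml"
TTS_NS = "http://www.w3.org/ns/ttml#styling"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# (styling property, required value, WebVTT tag), as listed in the text above.
TAG_RULES = [
    ("fontStyle", "italic", "i"),
    ("fontWeight", "bold", "b"),
    ("textDecoration", "underline", "u"),
]


def collect_styles(root):
    """Map each <style> id to its tts:* properties (local names only)."""
    styles = {}
    for style in root.iter(f"{{{TT_NS}}}style"):
        style_id = style.get(f"{{{XML_NS}}}id")
        if style_id:
            styles[style_id] = {
                key.split("}", 1)[1]: value
                for key, value in style.attrib.items()
                if key.startswith(f"{{{TTS_NS}}}")
            }
    return styles


def tags_for(element, styles):
    """Return the WebVTT tags implied by the element's style references."""
    props = {}
    # The style attribute may hold several space-separated id references.
    for ref in element.get("style", "").split():
        props.update(styles.get(ref, {}))
    return [tag for prop, wanted, tag in TAG_RULES if props.get(prop) == wanted]
```

A <span> whose style attribute references one italic style and one bold style would yield the tags i and b, in the order given by TAG_RULES.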
Reference Data
| TTML Feature | TTML Syntax | WebVTT Equivalent | Support Status |
|---|---|---|---|
| Italic text | tts:fontStyle="italic" | <i>...</i> | Full |
| Bold text | tts:fontWeight="bold" | <b>...</b> | Full |
| Underline | tts:textDecoration="underline" | <u>...</u> | Full |
| Text alignment | tts:textAlign="left", "center", or "right" | align:left, align:center, or align:right | Full |
| Font color | tts:color="#RRGGBB" | <c.colorname> or inline | Mapped |
| Background color | tts:backgroundColor | <c> with class | Approximated |
| Duration attribute | dur="00:00:05.000" | Computed end time | Full |
| End attribute | end="00:00:10.000" | Direct end time | Full |
| Clock-time format | HH:MM:SS.mmm | HH:MM:SS.mmm | Full |
| Clock-time with frames | HH:MM:SS:FF | Converted to .mmm | Full (assumes 30 fps) |
| Offset-time seconds | 12.5s | Converted to HH:MM:SS.mmm | Full |
| Offset-time milliseconds | 500ms | Converted to HH:MM:SS.mmm | Full |
| Offset-time hours | 2.5h | Converted to HH:MM:SS.mmm | Full |
| Offset-time minutes | 5m | Converted to HH:MM:SS.mmm | Full |
| Nested <span> | Inline style spans | Nested WebVTT tags | Full |
| Line breaks | <br /> | Newline character | Full |
| Region positioning | <region> with origin/extent | position/line settings | Approximated |
| Font size | tts:fontSize | Not supported in WebVTT | Dropped |
| Font family | tts:fontFamily | Not supported in WebVTT | Dropped |
| Writing mode | tts:writingMode | vertical cue setting | Partial |
| Opacity | tts:opacity | Not supported in WebVTT | Dropped |
| Multiple <div> blocks | Separate content divisions | Sequential cues | Full |
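Several rows in the table end with a conversion to HH:MM:SS.mmm. Assuming times have already been normalized to integer milliseconds, that final formatting step could be sketched as follows; format_vtt_timestamp is a hypothetical helper name, not part of the tool.

```python
def format_vtt_timestamp(total_ms: int) -> str:
    """Format integer milliseconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, millis = divmod(remainder, 1_000)
    # Zero-pad every field; hours may grow beyond two digits for long media.
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
```

Under this sketch, the offset 12.5s (12500 ms) becomes 00:00:12.500 and 2.5h (9000000 ms) becomes 02:30:00.000.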
Frequently Asked Questions
When a <p> element has a dur attribute, the converter adds that duration to the begin timestamp to compute the end time. When an end attribute is present, it is used directly. If both are present, end takes precedence. If neither exists, the cue is skipped with a warning.
Nested <span> elements are supported inside a <p> element. A <span> referencing a style with tts:fontStyle="italic" will produce <i>...</i> in the WebVTT output. Multiple nesting levels (e.g., bold inside italic) produce nested tags like <i><b>text</b></i>.
The converter processes <p> elements across all <div> containers within the <body>. Cues are emitted in document order. A <div> does not create a separate WebVTT file; all cues are merged into a single output.
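The nested-tag behavior can be sketched as a small recursive renderer in Python. This is an illustration only: render_cue_text is a hypothetical name, and style_tags is an assumed, precomputed mapping from style id to a list of WebVTT tag names. Text is escaped so that literal & and < characters stay valid in cue text.

```python
import xml.etree.ElementTree as ET
from html import escape

TT = "{http://www.w3.org/ns/ttml}"


def render_cue_text(element, style_tags):
    """Recursively render a <p> or <span> as WebVTT cue text."""
    parts = [escape(element.text or "", quote=False)]
    for child in element:
        if child.tag == TT + "br":
            # <br /> becomes a newline inside the cue.
            parts.append("\n")
        else:
            # Nested spans are rendered first, so their tags end up innermost.
            parts.append(render_cue_text(child, style_tags))
        # Text that follows the child element belongs to the parent.
        parts.append(escape(child.tail or "", quote=False))
    body = "".join(parts)

    tags = []
    for ref in element.get("style", "").split():
        tags.extend(style_tags.get(ref, []))
    # Wrap from the last tag inward so the first tag is outermost.
    for tag in reversed(tags):
        body = f"<{tag}>{body}</{tag}>"
    return body
```

Given an italic span containing a bold span, this produces output of the form <i>one <b>two</b></i>, matching the nesting described above.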