About

Data URIs embed file content directly in documents using the data: scheme, with the payload encoded as Base64 or percent-encoded text. Decoding the payload incorrectly corrupts non-ASCII characters: é becomes Ã©, breaking internationalized content. This tool parses the URI structure, detects the encoding (the explicit ;base64 flag, or percent-encoding when the flag is absent), extracts the MIME type, and reconstructs the original text from its UTF-8 byte sequence using the TextDecoder API. Multi-byte characters (Chinese, Arabic, emoji) require byte-level reconstruction, not naive character-by-character conversion.

Common failure modes: a truncated Base64 payload or partial = padding can make atob() throw InvalidCharacterError, unescaped + symbols may be read as spaces in form-encoded URL contexts, and a BOM (0xEF 0xBB 0xBF) may prefix UTF-8 payloads. The tool handles these edge cases and displays the extracted MIME type for verification. It is useful for debugging embedded images in CSS, decoding API responses, or extracting text from data URLs in email templates.

Formulas

The data URI structure follows RFC 2397. The parser extracts the components using pattern matching:

dataURI = "data:" + [mimeType] + [";base64"] + "," + payload

where mimeType defaults to text/plain;charset=US-ASCII if omitted. Base64 decoding converts the ASCII payload to binary bytes:

bytes[i] = atob(payload).charCodeAt(i) ∀ i ∈ [0, n), where n is the length of the string returned by atob

UTF-8 reconstruction processes the byte array through TextDecoder:

utf8String = new TextDecoder("utf-8").decode(new Uint8Array(bytes))
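
The two formulas above can be combined into one small function. The sketch below is a minimal illustration, not the tool's actual source: the name base64ToUtf8 is hypothetical, and it assumes an environment that provides atob and TextDecoder (browsers, or a recent Node.js).

```javascript
// Hypothetical helper: decode a Base64 payload into a UTF-8 string.
function base64ToUtf8(payload) {
  // atob returns a "binary string": one character per byte, code units 0-255.
  const binary = atob(payload);

  // Copy each code unit into a real byte array.
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }

  // Interpret the bytes as UTF-8. With default options, malformed
  // sequences are replaced by U+FFFD instead of throwing.
  return new TextDecoder("utf-8").decode(bytes);
}

// "Y2Fmw6k=" encodes the bytes 63 61 66 C3 A9, i.e. "café" in UTF-8.
const decodedText = base64ToUtf8("Y2Fmw6k=");
```

The intermediate Uint8Array is the important step: it keeps 0xC3 0xA9 together as two bytes of one character rather than two separate characters.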

For percent-encoded URIs (without ;base64 flag), decoding applies the inverse transformation:

decoded = decodeURIComponent(payload)
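
As a hedged sketch of the percent-decoding path, the wrapper below calls decodeURIComponent and reports malformed input instead of letting the exception escape. The name decodePercentPayload and the shape of its return value are assumptions made for illustration.

```javascript
// Hypothetical wrapper around decodeURIComponent for data URI payloads.
function decodePercentPayload(payload) {
  try {
    // %XX escapes are read as UTF-8 bytes; a literal "+" is left unchanged.
    return { ok: true, text: decodeURIComponent(payload) };
  } catch (err) {
    // decodeURIComponent throws URIError for a stray "%" or for escapes
    // that do not form valid UTF-8.
    if (err instanceof URIError) {
      return { ok: false, text: null };
    }
    throw err;
  }
}

const plain = decodePercentPayload("Hello%20World"); // text: "Hello World"
const accented = decodePercentPayload("caf%C3%A9");  // text: "café"
const plus = decodePercentPayload("a+b");            // text: "a+b"
const broken = decodePercentPayload("100%");         // ok: false
```

Note that decodeURIComponent never turns + into a space; that substitution only happens in form-encoded contexts, so it has to be applied separately if the payload came from one.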

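Multi-byte UTF-8 sequences follow an encoding pattern where the byte count n is determined by the leading bits of the first byte:

n = 1 if byte < 0x80, n = 2 if 0xC0 ≤ byte < 0xE0, n = 3 if 0xE0 ≤ byte < 0xF0, n = 4 if 0xF0 ≤ byte < 0xF8

Bytes in the range 0x80 to 0xBF are continuation bytes and never start a sequence. The values 0xC0, 0xC1, and 0xF5 and above never occur in valid UTF-8.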
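
The leading-byte rule can be written as a small lookup function. This is an illustrative sketch: the name utf8SequenceLength is hypothetical, and it returns 0 for bytes that cannot start a valid sequence (continuation bytes 0x80 to 0xBF, the overlong leads 0xC0 and 0xC1, and anything above 0xF4).

```javascript
// Hypothetical helper: number of bytes in the UTF-8 sequence that
// starts with the given lead byte, or 0 if the byte cannot be a lead.
function utf8SequenceLength(byte) {
  if (byte < 0x80) return 1;                  // 0xxxxxxx: ASCII
  if (byte >= 0xc2 && byte < 0xe0) return 2;  // 110xxxxx
  if (byte >= 0xe0 && byte < 0xf0) return 3;  // 1110xxxx
  if (byte >= 0xf0 && byte <= 0xf4) return 4; // 11110xxx, up to U+10FFFF
  return 0; // continuation byte or invalid lead byte
}

const lengths = [0x41, 0xc3, 0xe4, 0xf0, 0xa9].map(utf8SequenceLength);
// 0x41 "A" -> 1, 0xC3 (lead of "é") -> 2, 0xE4 -> 3, 0xF0 -> 4, 0xA9 -> 0
```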

Reference Data

MIME Type	Common Use	Typical Encoding	Max Practical Size
text/plain	Plain text embedding	Base64 / Percent	~2KB (URL limits)
text/html	Inline HTML frames	Base64	~32KB
text/css	Embedded stylesheets	Base64	~64KB
text/javascript	Inline scripts	Base64	~32KB
application/json	Configuration data	Base64 / Percent	~128KB
application/xml	SVG, config files	Base64	~64KB
image/svg+xml	Vector graphics	Base64 / Percent	~96KB
image/png	Raster images	Base64	~128KB
image/jpeg	Photos	Base64	~96KB
image/gif	Animations	Base64	~64KB
image/webp	Modern images	Base64	~128KB
font/woff	Web fonts	Base64	~256KB
font/woff2	Compressed fonts	Base64	~192KB
audio/mpeg	Sound effects	Base64	~512KB
audio/wav	Uncompressed audio	Base64	~256KB
video/mp4	Short clips	Base64	~1MB
application/pdf	Documents	Base64	~512KB
application/octet-stream	Binary fallback	Base64	~256KB
text/csv	Spreadsheet data	Base64 / Percent	~64KB
application/x-www-form-urlencoded	Form data	Percent	~8KB

Frequently Asked Questions

Corruption occurs when Base64 payloads are decoded character-by-character instead of byte-by-byte. UTF-8 uses multi-byte sequences for non-ASCII characters. The character é requires bytes 0xC3 0xA9. Naive decoding treats each byte as a separate character, producing Ã©. This tool reconstructs the proper byte array before applying TextDecoder, preserving multi-byte sequences correctly.
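
The difference between the two approaches is easy to reproduce. The snippet below is a small demonstration, assuming an environment with atob and TextDecoder; the variable names are illustrative only.

```javascript
// "w6k=" is the Base64 form of the two bytes 0xC3 0xA9 ("é" in UTF-8).
const payload = "w6k=";

// Naive: use the atob result directly. Each byte becomes its own
// character, so the two bytes show up as "Ã" (U+00C3) and "©" (U+00A9).
const naive = atob(payload);

// Byte-level: rebuild the byte array first, then decode it as UTF-8.
const bytes = Uint8Array.from(naive, (ch) => ch.charCodeAt(0));
const correct = new TextDecoder("utf-8").decode(bytes);
```

Here naive has length 2 and correct has length 1, which is a quick way to spot this kind of corruption.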

Detection relies on the presence of the ;base64 flag before the comma separator. Data URIs structured as data:text/plain;base64,SGVsbG8= use Base64. URIs like data:text/plain,Hello%20World use percent encoding. If the flag is absent, percent decoding via decodeURIComponent is applied. In some edge cases the payload may superficially resemble Base64 without the flag; such payloads are still treated as percent-encoded text.
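
The detection rule can be sketched as a small parser that splits the URI at the first comma and inspects the header. This is an assumption-laden illustration rather than the tool's real parser: the function name parseDataUri and the returned field names are hypothetical, and media type parameters are passed through without validation.

```javascript
// Hypothetical parser for the structure data:[mimeType][;base64],payload
function parseDataUri(uri) {
  if (uri.slice(0, 5).toLowerCase() !== "data:") {
    throw new TypeError("not a data: URI");
  }
  const comma = uri.indexOf(",");
  if (comma === -1) {
    throw new TypeError("missing comma separator");
  }

  const header = uri.slice(5, comma);   // e.g. "text/plain;base64"
  const payload = uri.slice(comma + 1); // everything after the first comma
  const parts = header.split(";");

  // The flag only counts when it is the last ";"-separated token.
  const isBase64 =
    parts.length > 1 && parts[parts.length - 1].trim().toLowerCase() === "base64";

  let mimeType = (isBase64 ? parts.slice(0, -1) : parts).join(";");
  if (mimeType === "") {
    mimeType = "text/plain;charset=US-ASCII"; // RFC 2397 default
  } else if (mimeType.startsWith(";")) {
    mimeType = "text/plain" + mimeType;       // type omitted, parameters kept
  }
  return { mimeType, isBase64, payload };
}

const b64 = parseDataUri("data:text/plain;base64,SGVsbG8=");
const pct = parseDataUri("data:text/plain,Hello%20World");
const bare = parseDataUri("data:,A%20b");
```

With this shape, the caller can branch on isBase64 to pick the Base64 or percent-decoding path, and show mimeType for verification.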

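Base64 output is padded with = so that its length is a multiple of 4 characters. Truncated URIs may lose some or all of the trailing = characters, or end in the middle of a 4-character group. Spec-compliant atob() implementations accept fully omitted padding, but throw InvalidCharacterError on partial padding or when the length leaves a remainder of 1 modulo 4. Additionally, URL-safe Base64 variants use - and _ instead of + and /, which standard atob() rejects. This tool normalizes URL-safe characters and adds missing padding before decoding, handling both variants transparently.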
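
A normalization step along these lines might look like the sketch below. The name normalizeBase64 is hypothetical and the exact rules the tool applies are not documented here, so treat this as one reasonable approach: map the URL-safe alphabet back to the standard one, drop whatever padding is present, and re-pad to a multiple of 4.

```javascript
// Hypothetical helper: make a Base64 payload acceptable to atob().
function normalizeBase64(payload) {
  let s = payload
    .replace(/\s+/g, "") // remove line breaks and spaces
    .replace(/-/g, "+")  // URL-safe alphabet -> standard alphabet
    .replace(/_/g, "/");

  // Drop existing (possibly partial) padding, then rebuild it.
  s = s.replace(/=+$/, "");
  const remainder = s.length % 4;
  if (remainder === 1) {
    // A single leftover character cannot encode a whole byte.
    throw new Error("invalid Base64 length");
  }
  if (remainder > 0) {
    s += "=".repeat(4 - remainder);
  }
  return s;
}

const padded = normalizeBase64("SGVsbG8");  // "SGVsbG8="
const urlSafe = normalizeBase64("_-8");     // "/+8="
const partial = normalizeBase64("SGVsbA="); // "SGVsbA=="
```

A payload whose length leaves a remainder of 1 is reported as an error rather than padded, since no amount of padding makes it decodable.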

The tool decodes the binary payload but displays the result as UTF-8 text. For image/png or application/pdf payloads, the output appears as garbled characters because binary data is not valid UTF-8 text. The extracted MIME type indicates when binary content is detected. For actual file extraction, a dedicated binary-to-file converter would be required.

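Some text editors prepend the UTF-8 Byte Order Mark (bytes 0xEF 0xBB 0xBF, displayed as ï»¿ when misread as Latin-1, or as the single invisible character U+FEFF) when saving files. If BOM-prefixed content was encoded into a Data URI, the marker can persist in the decoded output. This is not an error: it reflects the original source encoding. Note that TextDecoder strips a leading BOM unless it is constructed with ignoreBOM: true, whereas decodeURIComponent keeps it as U+FEFF. Remove the first three bytes, or the leading U+FEFF character, if clean output is required.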
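
The following sketch shows where a BOM survives decoding and one way to remove it afterwards. The helper name stripBom is hypothetical; the TextDecoder ignoreBOM option and decodeURIComponent are standard APIs.

```javascript
// Hypothetical helper: drop a leading U+FEFF, the decoded form of EF BB BF.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

const bomBytes = new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]); // BOM + "hi"

// Default TextDecoder consumes the BOM.
const stripped = new TextDecoder("utf-8").decode(bomBytes);

// With ignoreBOM: true the BOM is kept as U+FEFF in the output.
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bomBytes);

// Percent-decoding also keeps it.
const fromPercent = decodeURIComponent("%EF%BB%BFhi");

const cleaned = stripBom(fromPercent);
```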

JavaScript engines cap the length of a single string; the figure is approximately 512MB of text data in some engines, but the exact limit varies by engine and version. Practical limits are lower due to memory constraints. Data URIs over 2MB may cause sluggish UI response. The tool processes synchronously without Web Workers, so payloads exceeding 5MB are not recommended. For extremely large URIs, chunk-based processing or server-side tools would be more appropriate.

For Base64-encoded payloads, the charset parameter is informational: decoding always produces raw bytes, and TextDecoder interprets them as UTF-8 regardless of the declared charset. For percent-encoded payloads, decodeURIComponent always assumes UTF-8, as defined by the ECMAScript specification. If the original content used a different encoding (ISO-8859-1, Shift_JIS), the result is wrong: TextDecoder substitutes U+FFFD for byte sequences that are not valid UTF-8, and decodeURIComponent throws a URIError. This tool explicitly targets UTF-8 content.
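
To make the charset mismatch concrete, the snippet below feeds a single ISO-8859-1 byte through both UTF-8 decoding paths. It is a demonstration only; the variable names are illustrative, and the strict variant uses the standard fatal option of TextDecoder.

```javascript
// 0xE9 is "é" in ISO-8859-1, but on its own it is not valid UTF-8.
const latin1Byte = new Uint8Array([0xe9]);

// Default TextDecoder: the invalid byte becomes U+FFFD.
const lenient = new TextDecoder("utf-8").decode(latin1Byte);

// Strict TextDecoder: fatal: true makes the same input throw.
let strictError = null;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(latin1Byte);
} catch (err) {
  strictError = err;
}

// Percent-encoded path: %E9 is rejected outright.
let percentError = null;
try {
  decodeURIComponent("%E9");
} catch (err) {
  percentError = err;
}
```

The lenient path fails silently, so checking the output for U+FFFD is a cheap hint that the source was probably not UTF-8.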