About

Internationalized source files often contain characters outside the ASCII range (0-127). Some build systems, Java .properties files, and certain legacy toolchains require every non-ASCII code point to be expressed as a \uXXXX escape sequence. Getting this wrong means garbled strings at runtime, broken localization bundles, or silent data loss when a pipeline strips high bytes. This tool performs the same transformation as the classic native2ascii utility: each code point above U+007F is replaced by a \u escape with four hexadecimal digits. Supplementary characters above U+FFFF are emitted as a surrogate pair, that is, two consecutive escapes. The reverse mode parses every \uXXXX token and reconstructs the original native text, including recombining surrogate pairs into their proper code points.

The conversion is deterministic and lossless for all Unicode planes (0-16). Note: the tool assumes input is valid UTF-8/UTF-16 as delivered by the browser. If your source file uses a single-byte encoding like ISO-8859-1, open it in a text editor with the correct encoding first, then paste the text here. Pro tip: before Java 9, .properties resource bundles were read as ISO-8859-1, so any character outside that encoding had to be written as a \uXXXX escape. Running every localized bundle through native-to-ASCII before packaging keeps it safe for those toolchains.

Formulas

The native-to-ASCII conversion operates on each code point c in the input string:

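$$
c \mapsto
\begin{cases}
c \ \text{(unchanged)}, & \text{if } c \le \mathtt{0x7F} \\
\backslash\mathtt{u} + \operatorname{hex}(c)_4, & \text{if } \mathtt{0x80} \le c \le \mathtt{0xFFFF} \\
\backslash\mathtt{u} + \operatorname{hex}(H)_4 + \backslash\mathtt{u} + \operatorname{hex}(L)_4, & \text{if } c > \mathtt{0xFFFF}
\end{cases}
$$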

For supplementary code points (c > 0xFFFF), the surrogate pair is computed as:

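$$
H = \left\lfloor \frac{c - \mathtt{0x10000}}{\mathtt{0x400}} \right\rfloor + \mathtt{0xD800}
$$

$$
L = \bigl( (c - \mathtt{0x10000}) \bmod \mathtt{0x400} \bigr) + \mathtt{0xDC00}
$$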

where H = high surrogate (0xD800 - 0xDBFF), L = low surrogate (0xDC00 - 0xDFFF), and hex(n)₄ denotes the zero-padded four-digit hexadecimal representation.
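The three cases and the surrogate formulas above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual source: the function name nativeToAscii and the helper hex4 are hypothetical, and uppercase hex digits are assumed to match the reference table.

```javascript
// Sketch of the native-to-ASCII direction; nativeToAscii is a hypothetical name.
function nativeToAscii(text) {
  // Zero-padded, four-digit uppercase hex (the hex(n)4 notation in the formulas).
  const hex4 = (n) => n.toString(16).toUpperCase().padStart(4, "0");
  let out = "";
  // for...of iterates by code point, so supplementary characters arrive whole.
  for (const ch of text) {
    const c = ch.codePointAt(0);
    if (c <= 0x7f) {
      out += ch; // ASCII passes through unchanged
    } else if (c <= 0xffff) {
      out += "\\u" + hex4(c); // single escape for the Basic Multilingual Plane
    } else {
      const offset = c - 0x10000;
      const high = Math.floor(offset / 0x400) + 0xd800; // H
      const low = (offset % 0x400) + 0xdc00; // L
      out += "\\u" + hex4(high) + "\\u" + hex4(low);
    }
  }
  return out;
}

console.log(nativeToAscii("日本 😀")); // \u65E5\u672C \uD83D\uDE00
```

Iterating by code point rather than by UTF-16 code unit keeps the code aligned with the formulas, which are stated in terms of c.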

The reverse operation uses the regex pattern /\\u([0-9A-Fa-f]{4})/g to locate each escape token and replaces it with String.fromCharCode(parseInt(hex, 16)), where hex is the captured group of four digits. Because JavaScript strings are sequences of UTF-16 code units, a high surrogate immediately followed by a low surrogate is then read as the original supplementary code point without any extra step.
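A minimal sketch of the reverse direction, built on the regex described above. The name asciiToNative is hypothetical, and the tool's real implementation may differ in details.

```javascript
// Sketch of the ASCII-to-native direction; asciiToNative is a hypothetical name.
function asciiToNative(text) {
  // The capture group holds the four hex digits; the full match (with "\u")
  // is ignored because parseInt could not parse it.
  return text.replace(/\\u([0-9A-Fa-f]{4})/g, (_match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

console.log(asciiToNative("\\u65E5\\u672C")); // 日本
// Two adjacent surrogate escapes decode to two code units that together
// form one supplementary code point.
console.log(asciiToNative("\\uD83D\\uDE00").codePointAt(0).toString(16)); // 1f600
```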

Reference Data

Character	Code Point	Unicode Escape	Category	Script
é	U+00E9	`\u00E9`	Lowercase Letter	Latin
ñ	U+00F1	`\u00F1`	Lowercase Letter	Latin
ü	U+00FC	`\u00FC`	Lowercase Letter	Latin
中	U+4E2D	`\u4E2D`	CJK Ideograph	Han
日	U+65E5	`\u65E5`	CJK Ideograph	Han
本	U+672C	`\u672C`	CJK Ideograph	Han
한	U+D55C	`\uD55C`	Syllable	Hangul
글	U+AE00	`\uAE00`	Syllable	Hangul
Ω	U+03A9	`\u03A9`	Uppercase Letter	Greek
π	U+03C0	`\u03C0`	Lowercase Letter	Greek
Д	U+0414	`\u0414`	Uppercase Letter	Cyrillic
я	U+044F	`\u044F`	Lowercase Letter	Cyrillic
א	U+05D0	`\u05D0`	Letter	Hebrew
ع	U+0639	`\u0639`	Letter	Arabic
₹	U+20B9	`\u20B9`	Currency Symbol	Common
€	U+20AC	`\u20AC`	Currency Symbol	Common
£	U+00A3	`\u00A3`	Currency Symbol	Common
©	U+00A9	`\u00A9`	Symbol	Common
™	U+2122	`\u2122`	Symbol	Common
∞	U+221E	`\u221E`	Math Symbol	Common
→	U+2192	`\u2192`	Arrow	Common
😀	U+1F600	`\uD83D\uDE00`	Emoji	Common (Surrogate Pair)
🎵	U+1F3B5	`\uD83C\uDFB5`	Emoji	Common (Surrogate Pair)
𝄞	U+1D11E	`\uD834\uDD1E`	Musical Symbol	Common (Surrogate Pair)
𐍈	U+10348	`\uD800\uDF48`	Letter	Gothic (Surrogate Pair)

Frequently Asked Questions

Characters with code points above U+FFFF cannot be represented by a single \uXXXX escape. The converter splits them into a UTF-16 surrogate pair: a high surrogate in the range 0xD800-0xDBFF followed by a low surrogate in 0xDC00-0xDFFF. For example, the emoji 😀 (U+1F600) becomes \uD83D\uDE00. The reverse operation detects adjacent surrogates and recombines them into the original code point.
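To make the recombination step concrete, here is a small sketch of the arithmetic that turns a surrogate pair back into a code point. The helper combineSurrogates is hypothetical; in practice the JavaScript engine does this implicitly when it reads adjacent surrogate code units.

```javascript
// Hypothetical helper: rebuild a supplementary code point from a surrogate pair.
function combineSurrogates(high, low) {
  if (high < 0xd800 || high > 0xdbff) {
    throw new RangeError("not a high surrogate");
  }
  if (low < 0xdc00 || low > 0xdfff) {
    throw new RangeError("not a low surrogate");
  }
  // Inverse of the split: undo both offsets, then add back 0x10000.
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

const codePoint = combineSurrogates(0xd83d, 0xde00);
console.log(codePoint.toString(16).toUpperCase()); // 1F600
console.log(String.fromCodePoint(codePoint)); // 😀
```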

ASCII characters are not escaped. All code points at or below U+007F (decimal 127) pass through unchanged, including control characters such as tab (\t), line feed (\n), and carriage return (\r). Only characters above this threshold are escaped. If you need to escape control characters as well, pre-process them separately with standard backslash notation.
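One possible pre-processing step is sketched below. It is not part of the converter: the name escapeControls is hypothetical, and the sketch handles only the three control characters named above.

```javascript
// Hypothetical pre-processing step: write tab, line feed, and carriage return
// in backslash notation before running the native-to-ASCII conversion.
function escapeControls(text) {
  const names = { "\t": "\\t", "\n": "\\n", "\r": "\\r" };
  return text.replace(/[\t\n\r]/g, (ch) => names[ch]);
}

console.log(escapeControls("key\tvalue\r\n")); // key\tvalue\r\n (as literal backslash sequences)
```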

The regex pattern strictly matches four hexadecimal digits [0-9A-Fa-f]. A sequence like \u00GZ does not match and is left as literal text in the output. The converter does not throw an error; it simply skips non-conforming patterns. Check the output for any remaining \u literals to identify malformed sequences.
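Leftover \u literals can be checked by hand, or with a short script such as the sketch below. The function findMalformedEscapes is a hypothetical helper, not a feature of the tool. It reports every \u that is not followed by exactly four hexadecimal digits.

```javascript
// Hypothetical helper: list positions of "\u" tokens that the converter's
// regex would skip because four hex digits do not follow.
function findMalformedEscapes(text) {
  const pattern = /\\u(?![0-9A-Fa-f]{4})/g;
  return Array.from(text.matchAll(pattern), (m) => ({
    index: m.index, // offset of the backslash in the input
    snippet: text.slice(m.index, m.index + 6), // the token plus what follows it
  }));
}

console.log(findMalformedEscapes("ok \\u00E9 bad \\u00GZ"));
// [ { index: 14, snippet: '\\u00GZ' } ]
```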

The output is suitable for Java .properties files. Before Java 9, property resource bundles were read as ISO-8859-1, so every character outside that encoding had to be written as a \uXXXX escape. This converter performs the same transformation as the JDK native2ascii utility. For Java 9+ properties files that support UTF-8, conversion is optional but still useful for backward compatibility with older toolchains.

In native-to-ASCII mode, a literal backslash followed by "u" and four hex digits in the source text will be escaped character-by-character: the backslash becomes \u005C, and the rest remain ASCII. In reverse mode, the pattern \uXXXX is consumed and replaced. If your input intentionally contains literal \u sequences that should not be decoded, you need to escape the backslash first (\\u) before running the reverse conversion.

The converter runs entirely in the browser. Practical limits depend on available memory. Text inputs up to approximately 5 MB process in under one second on modern hardware. For files exceeding 10 MB, consider splitting them. The file upload feature reads the entire file into memory before conversion, so ensure your device has sufficient RAM for very large files.