About

ANSI is not a single encoding. It is a family of single-byte code pages (Windows-1252, ISO-8859-1, Windows-1251, etc.) in which bytes 0x80 - 0xFF map to different Unicode code points depending on the locale. Moving text between ANSI and UTF-8 systems without proper conversion produces mojibake: UTF-8 bytes read as Windows-1252 show garbled sequences like “Ã¤” instead of “ä”, while ANSI bytes read as UTF-8 typically show up as the replacement character �. Database imports, legacy CSV migrations, and subtitle files are common failure points. This tool performs real byte-level remapping using complete code page lookup tables for 15 ANSI standards. For a given code page it does not guess: each byte in the 0x80 - 0xFF range is resolved to its exact Unicode code point per the selected code page, then re-encoded as valid UTF-8. Auto-detection scores your input against all supported code pages and selects the most probable match.
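The round trip described above can be sketched in a few lines of Python. This is only an illustration: it relies on Python's built-in `cp1252` codec rather than the tool's own lookup tables, and the function name `ansi_to_utf8` is hypothetical.

```python
def ansi_to_utf8(data: bytes, code_page: str = "cp1252") -> bytes:
    """Decode single-byte ANSI data with the given code page, then re-encode as UTF-8."""
    text = data.decode(code_page)   # each byte -> one Unicode code point
    return text.encode("utf-8")     # each code point -> 1 to 4 UTF-8 bytes


raw = b"M\xe4dchen"                 # "Mädchen" stored as Windows-1252
converted = ansi_to_utf8(raw)       # b"M\xc3\xa4dchen"

# Reading the UTF-8 bytes back with the wrong code page yields mojibake.
mojibake = converted.decode("cp1252")
print(mojibake)                     # MÃ¤dchen
```

The single byte 0xE4 becomes the two bytes 0xC3 0xA4 in UTF-8; decoding those two bytes as Windows-1252 is what produces “Ã¤”.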

Limitations: auto-detection works best on natural-language text longer than 50 characters. Short strings or binary data may produce ambiguous results. Mixed-encoding files (partially UTF-8, partially ANSI) require manual segment handling. Pro tip: if your source is a database dump, check the character set and COLLATION settings before converting, because the declared encoding may differ from the actual byte content.

Formulas

ANSI-to-UTF-8 conversion is a two-stage remapping process. Stage 1 resolves each ANSI byte to a Unicode code point. Stage 2 encodes that code point as a UTF-8 sequence of one to four bytes.

Stage 1: Code Page Lookup

For input byte b:

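\[
U = \begin{cases}
b & \text{if } b < \texttt{0x80} \text{ (ASCII)} \\
\mathrm{TABLE}_{cp}[\,b - \texttt{0x80}\,] & \text{if } b \ge \texttt{0x80}
\end{cases}
\]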

where U = Unicode code point, cp = selected code page, TABLE_cp = the 128-entry lookup array for bytes 0x80 - 0xFF.
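A minimal sketch of the Stage 1 lookup follows. It assumes the 128-entry table can be derived from Python's built-in codecs (the tool's own tables are not shown here); `build_table` and `lookup` are hypothetical names, and `None` is used here to mark a byte the code page leaves undefined.

```python
from typing import List, Optional


def build_table(code_page: str) -> List[Optional[int]]:
    """Return the 128-entry table for bytes 0x80-0xFF (None = undefined byte)."""
    table: List[Optional[int]] = []
    for b in range(0x80, 0x100):
        try:
            table.append(ord(bytes([b]).decode(code_page)))
        except UnicodeDecodeError:
            table.append(None)
    return table


def lookup(b: int, table: List[Optional[int]]) -> Optional[int]:
    """Stage 1: ASCII bytes map to themselves, high bytes go through the table."""
    return b if b < 0x80 else table[b - 0x80]


TABLE_1251 = build_table("cp1251")
print(hex(lookup(0xC0, TABLE_1251)))  # 0x410, CYRILLIC CAPITAL LETTER A
```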

Stage 2: UTF-8 Encoding

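\[
\begin{cases}
\text{1 byte: } \texttt{0xxxxxxx} & \text{if } U \le \texttt{0x7F} \\
\text{2 bytes: } \texttt{110xxxxx 10xxxxxx} & \text{if } U \le \texttt{0x7FF} \\
\text{3 bytes: } \texttt{1110xxxx 10xxxxxx 10xxxxxx} & \text{if } U \le \texttt{0xFFFF} \\
\text{4 bytes: } \texttt{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx} & \text{if } U \le \texttt{0x10FFFF}
\end{cases}
\]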
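The bit layout above can be written out by hand as follows. This is a sketch for illustration only (in practice `str.encode("utf-8")` does the same job); `encode_utf8` is a hypothetical name, and the rejection of surrogate code points is standard UTF-8 behaviour that the layout above does not spell out.

```python
def encode_utf8(u: int) -> bytes:
    """Stage 2: encode one Unicode code point as 1 to 4 UTF-8 bytes."""
    if u < 0 or u > 0x10FFFF or 0xD800 <= u <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {u:#x}")
    if u <= 0x7F:        # 0xxxxxxx
        return bytes([u])
    if u <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
    if u <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (u >> 18),
                  0x80 | ((u >> 12) & 0x3F),
                  0x80 | ((u >> 6) & 0x3F),
                  0x80 | (u & 0x3F)])


print(encode_utf8(0x20AC).hex())  # e282ac (the euro sign)
```

Code points produced by ANSI code pages all lie below 0x10000, so only the first three branches are ever taken during ANSI conversion.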

Auto-Detection Scoring

For each code page cp, a score S_cp is computed:

\[
S_{cp} = \frac{1}{n} \sum_{i=0}^{n} w\left(\mathrm{TABLE}_{cp}[b_i]\right)
\]

where w(c) returns a weight based on whether code point c is a printable letter (2), punctuation (1), or undefined (−5). The code page with the highest S is selected.
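The scoring rule can be sketched as below. Several details are assumptions not stated in the text: only bytes ≥ 0x80 are scored (ASCII bytes decode identically in every code page), the sum is divided by the number of scored bytes, characters that are neither letters nor punctuation get weight 0, and Python's `unicodedata` categories stand in for the tool's own classification. The names `weight`, `score`, and `detect` are hypothetical.

```python
import unicodedata
from typing import Iterable, Optional


def weight(ch: Optional[str]) -> int:
    """Weights from the text: letter = 2, punctuation = 1, undefined = -5."""
    if ch is None:
        return -5
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return 2
    if category.startswith("P"):
        return 1
    return 0  # assumption: other defined characters are neutral


def score(data: bytes, code_page: str) -> float:
    """Average weight of the high bytes of data under one code page."""
    high = [b for b in data if b >= 0x80]
    if not high:
        return 0.0
    total = 0
    for b in high:
        try:
            ch = bytes([b]).decode(code_page)
        except UnicodeDecodeError:
            ch = None  # byte is undefined in this code page
        total += weight(ch)
    return total / len(high)


def detect(data: bytes, candidates: Iterable[str]) -> str:
    """Pick the candidate code page with the highest score."""
    return max(candidates, key=lambda cp: score(data, cp))


sample = b"\xaa\xd2"  # both bytes are undefined in Windows-1253
print(detect(sample, ["cp1253", "cp1252"]))  # cp1252
```

With weights this coarse, many inputs tie: any byte that is a letter in two code pages scores 2 in both, which is one reason short or ambiguous inputs need manual selection.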

Reference Data

Code Page	Name	Primary Languages	Unique Range	Notable Characters
1250	Windows-1250	Polish, Czech, Hungarian, Romanian	0x80 - 0xFF	Š, š, Ž, ž, Ł, ł
1251	Windows-1251	Russian, Ukrainian, Bulgarian, Serbian	0x80 - 0xFF	А - я, Ё, ё
1252	Windows-1252	English, French, German, Spanish, Portuguese	0x80 - 0x9F	€, „, “, ”, –, —
1253	Windows-1253	Greek	0x80 - 0xFF	Α - Ω, α - ω
1254	Windows-1254	Turkish	0x80 - 0xFF	Ğ, ğ, İ, ı, Ş, ş
1255	Windows-1255	Hebrew	0xC0 - 0xFA	א - ת (Alef - Tav)
1256	Windows-1256	Arabic, Persian, Urdu	0x80 - 0xFF	ء - ي, ی
1257	Windows-1257	Estonian, Latvian, Lithuanian	0x80 - 0xFF	Ā, ā, Č, č, Ē
1258	Windows-1258	Vietnamese	0x80 - 0xFF	Ơ, ơ, Ư, ư
28591	ISO-8859-1	Western European (Latin-1)	0xA0 - 0xFF	¿, Ñ, ñ, ß, þ
28592	ISO-8859-2	Central European (Latin-2)	0xA0 - 0xFF	Ą, ą, Ď, ď
28595	ISO-8859-5	Cyrillic	0xA0 - 0xFF	А - я (sequential block)
28597	ISO-8859-7	Greek	0xA0 - 0xFF	Α - Ω (ISO standard)
28599	ISO-8859-9	Turkish (Latin-5)	0xD0, 0xDD, 0xDE, 0xF0, 0xFD, 0xFE	Ğ, İ, Ş, ğ, ı, ş replace Ð, Ý, Þ, ð, ý, þ
28605	ISO-8859-15	Western European (Latin-9)	0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC - 0xBE	€, Š, š, Ž, ž, Œ, œ, Ÿ replace ¤, ¦, ¨, ´, ¸, ¼, ½, ¾

Frequently Asked Questions

Windows-1252 and ISO-8859-1 both cover Western European languages, but they differ in the range 0x80 - 0x9F. ISO-8859-1 maps these bytes to C1 control characters (non-printable). Windows-1252 reassigns most of them to printable characters: the euro sign € at 0x80, smart quotes “ ” at 0x93 - 0x94, and the em dash — at 0x97. In practice, many files labeled ISO-8859-1 actually use Windows-1252. If you see € or curly quotes in your data, select Windows-1252.
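The difference in the 0x80 - 0x9F range is easy to see by decoding the same bytes both ways. The sketch below uses Python's built-in `cp1252` and `latin-1` codecs as stand-ins for the tool's tables.

```python
import unicodedata

sample = b"\x80 \x93quoted\x94 \x97"

as_cp1252 = sample.decode("cp1252")    # euro sign, curly quotes, em dash
as_latin1 = sample.decode("latin-1")   # same bytes become C1 control characters

print(as_cp1252)
print([unicodedata.category(ch) for ch in as_latin1 if ord(ch) >= 0x80])
# ['Cc', 'Cc', 'Cc', 'Cc']
```

Category `Cc` marks control characters, which is why text decoded as ISO-8859-1 appears to lose its quotes and euro signs.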

Auto-detection uses statistical scoring against all 15 code page tables. It works well on natural-language text longer than 50 bytes because letter frequency distributions differ between languages. Short strings (under 20 bytes), numeric data, or mixed-encoding files produce ambiguous scores. In those cases, manual code page selection is necessary. The detector cannot distinguish between code pages that share most of their mapping (e.g., ISO-8859-1 vs ISO-8859-15 differ in only 8 positions).
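To see why near-identical code pages are hard to tell apart, the positions where two tables disagree can be listed directly. This sketch compares Python's built-in `iso-8859-1` and `iso-8859-15` codecs; `differing_positions` is a hypothetical helper name.

```python
from typing import List


def differing_positions(cp_a: str, cp_b: str) -> List[int]:
    """Return the high bytes that decode differently under the two code pages."""
    diffs = []
    for b in range(0x80, 0x100):
        raw = bytes([b])
        # errors="replace" turns undefined bytes into U+FFFD instead of raising
        if raw.decode(cp_a, errors="replace") != raw.decode(cp_b, errors="replace"):
            diffs.append(b)
    return diffs


diffs = differing_positions("iso-8859-1", "iso-8859-15")
print([hex(b) for b in diffs])
# ['0xa4', '0xa6', '0xa8', '0xb4', '0xb8', '0xbc', '0xbd', '0xbe']
```

Unless the input happens to contain one of these eight bytes, the two code pages produce identical output and identical scores.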

The converter does not handle input that is already partly or fully UTF-8. It treats every byte as an ANSI-encoded value. If your file contains valid UTF-8 multi-byte sequences, those bytes will be individually remapped through the code page table, producing corrupted output. For mixed-encoding files, you must first identify and split the UTF-8 and ANSI segments, then convert only the ANSI portions. A common sign of mixed encoding: some characters display correctly while others appear as mojibake.
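A simple pre-check can catch input that is already UTF-8 before it is converted a second time. The sketch below is a heuristic, not part of the tool: `looks_like_utf8` is a hypothetical helper that reports whether the data decodes as strict UTF-8 and contains at least one multi-byte sequence.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data is valid UTF-8 and contains at least one non-ASCII character."""
    try:
        text = data.decode("utf-8")   # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return any(ord(ch) > 0x7F for ch in text)


already_utf8 = "ä".encode("utf-8")    # b"\xc3\xa4"
ansi_bytes = b"\xe4"                  # "ä" in Windows-1252

print(looks_like_utf8(already_utf8))  # True
print(looks_like_utf8(ansi_bytes))    # False

# Converting UTF-8 input as if it were ANSI double-encodes it.
double_encoded = already_utf8.decode("cp1252").encode("utf-8")
```

Random ANSI text rarely forms valid UTF-8 multi-byte sequences, so a `True` result is a strong hint that the segment should be left alone.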

The Unicode replacement character � (U+FFFD) appears when a byte maps to an undefined position in the selected code page. For example, Windows-1253 (Greek) leaves positions 0xAA, 0xD2, and 0xFF undefined. If your source data uses those byte values, the original encoding is almost certainly not Windows-1253. Try a different code page or use auto-detection.
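The behaviour can be reproduced with Python's built-in `cp1253` codec, used here as a stand-in for the tool's table. `undefined_bytes` is a hypothetical helper that lists every byte a code page leaves unmapped.

```python
from typing import List


def undefined_bytes(code_page: str) -> List[int]:
    """Return all byte values that have no mapping in the given code page."""
    missing = []
    for b in range(0x100):
        try:
            bytes([b]).decode(code_page)
        except UnicodeDecodeError:
            missing.append(b)
    return missing


# 0xE1 and 0xE2 are Greek alpha and beta; 0xAA is undefined in Windows-1253.
text = b"\xe1\xaa\xe2".decode("cp1253", errors="replace")
print(text)  # α�β

missing_1253 = undefined_bytes("cp1253")
```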

ANSI encodings are single-byte and have no byte order. UTF-8 output is also byte-order independent by design (unlike UTF-16, which uses a BOM to signal byte order). However, some applications prepend a UTF-8 BOM (0xEF 0xBB 0xBF) to signal the encoding. This tool offers an optional BOM toggle. Enable it if your target application requires BOM detection (e.g., older versions of Excel reading CSV files).
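What a BOM toggle amounts to can be sketched in a few lines. The function `to_utf8` and its `with_bom` parameter are hypothetical names, not the tool's API; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs


def to_utf8(text: str, with_bom: bool = False) -> bytes:
    """Encode text as UTF-8, optionally prefixed with the 3-byte BOM."""
    payload = text.encode("utf-8")
    return codecs.BOM_UTF8 + payload if with_bom else payload


with_bom = to_utf8("é", with_bom=True)
print(with_bom.hex())                 # efbbbfc3a9

# The "utf-8-sig" codec strips a leading BOM when reading the data back.
print(with_bom.decode("utf-8-sig"))   # é
```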

The tool processes files up to 50 MB using a Web Worker to prevent browser UI freezing. Files above 100 KB are automatically offloaded to the worker thread and processed in chunks of 64 KB. Processing speed depends on your device: a typical laptop converts 10 MB in under 2 seconds. For files larger than 50 MB, use a command-line tool like iconv.
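Chunked conversion is straightforward for ANSI input because every byte is a complete character, so a chunk boundary can never split one. The sketch below mirrors the 64 KB chunk size mentioned above but is otherwise an assumption about how such a loop might look; `convert_stream` is a hypothetical name, and it runs on ordinary file-like objects rather than in a Web Worker.

```python
import io
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # 64 KB, matching the chunk size described above


def convert_stream(src: BinaryIO, dst: BinaryIO, code_page: str,
                   chunk_size: int = CHUNK_SIZE) -> int:
    """Convert src to UTF-8 chunk by chunk; return the number of bytes written."""
    written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # Safe at any boundary: single-byte code pages have no multi-byte characters.
        out = chunk.decode(code_page).encode("utf-8")
        dst.write(out)
        written += len(out)
    return written


source = io.BytesIO(b"\xe4" * 100_000)   # 100,000 x "ä" in Windows-1252
target = io.BytesIO()
total = convert_stream(source, target, "cp1252")
print(total)  # 200000
```

The same loop would not be safe in the opposite direction (UTF-8 to ANSI), where a chunk boundary can fall in the middle of a multi-byte sequence.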