About

ANSI is not a single encoding. It is a family of code pages (Windows-1252, ISO-8859-1, Windows-1251, etc.) in which bytes 0x80 - 0xFF map to different Unicode code points depending on the locale. Pasting ANSI-encoded data into a UTF-8 system without proper conversion produces mojibake: garbled sequences like “Ã¤” instead of “ä”. Database imports, legacy CSV migrations, and subtitle files are common failure points. This tool performs real byte-level remapping using complete code page lookup tables for 15 ANSI standards. Once a code page is selected, the mapping involves no guessing: each byte in the 0x80 - 0xFF range is resolved to its exact Unicode code point per that code page, then re-encoded as valid UTF-8. If you do not know the code page, auto-detection scores your input against all supported code pages and selects the most probable match.

Limitations: auto-detection works best on natural-language text longer than 50 characters. Short strings or binary data may produce ambiguous results. Mixed-encoding files (partially UTF-8, partially ANSI) require manual segment handling. Pro tip: if your source is a database dump, check the COLLATION setting before converting - the declared encoding may differ from the actual byte content.

Formulas

ANSI-to-UTF-8 conversion is a two-stage remapping process. Stage 1 resolves each ANSI byte to a Unicode code point. Stage 2 encodes that code point as a UTF-8 multi-byte sequence.

Stage 1: Code Page Lookup

For input byte b:

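$$
U = \begin{cases}
b & \text{if } b < \mathtt{0x80} \text{ (ASCII)} \\
\mathrm{TABLE}_{cp}[b - \mathtt{0x80}] & \text{if } b \geq \mathtt{0x80}
\end{cases}
$$

where $U$ = Unicode code point, $cp$ = selected code page, $\mathrm{TABLE}_{cp}$ = the 128-entry lookup array for bytes 0x80 - 0xFF.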
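As a minimal sketch of Stage 1, the lookup can be written in a few lines of Python. The helper names `build_table` and `to_code_point` are hypothetical, and the table is derived from Python's built-in `cp1252` codec rather than from this tool's own lookup arrays, so treat it as an illustration of the idea, not as the tool's implementation.

```python
# Sketch of Stage 1 (code page lookup). Helper names are hypothetical; the
# table comes from Python's bundled cp1252 codec, not from the tool itself.

def build_table(codec: str) -> list[int]:
    """Return the 128-entry table for bytes 0x80-0xFF.

    Positions the codec leaves undefined are filled with U+FFFD.
    """
    table = []
    for b in range(0x80, 0x100):
        ch = bytes([b]).decode(codec, errors="replace")
        table.append(ord(ch))
    return table


def to_code_point(b: int, table: list[int]) -> int:
    """Map one input byte to a Unicode code point."""
    if b < 0x80:
        return b              # ASCII passes through unchanged
    return table[b - 0x80]    # high bytes go through the code page table


TABLE_1252 = build_table("cp1252")
```

With this table, byte 0x80 resolves to U+20AC (the euro sign) and byte 0xE4 resolves to U+00E4 (ä).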

Stage 2: UTF-8 Encoding

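$$
\begin{cases}
\text{1 byte: } \texttt{0xxxxxxx} & \text{if } U \leq \mathtt{0x7F} \\
\text{2 bytes: } \texttt{110xxxxx 10xxxxxx} & \text{if } U \leq \mathtt{0x7FF} \\
\text{3 bytes: } \texttt{1110xxxx 10xxxxxx 10xxxxxx} & \text{if } U \leq \mathtt{0xFFFF} \\
\text{4 bytes: } \texttt{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx} & \text{if } U \leq \mathtt{0x10FFFF}
\end{cases}
$$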
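The bit layouts above can be turned into code directly. The following Python sketch is a hand-written encoder for illustration only (the function name `encode_utf8` is hypothetical); in practice `str.encode("utf-8")` does the same job.

```python
# Sketch of Stage 2 (UTF-8 encoding of a single code point).

def encode_utf8(u: int) -> bytes:
    """Encode one Unicode code point as a UTF-8 byte sequence."""
    if u < 0 or u > 0x10FFFF or 0xD800 <= u <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {u:#x}")
    if u <= 0x7F:                       # 0xxxxxxx
        return bytes([u])
    if u <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (u >> 6),
                      0x80 | (u & 0x3F)])
    if u <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])
    return bytes([0xF0 | (u >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((u >> 12) & 0x3F),
                  0x80 | ((u >> 6) & 0x3F),
                  0x80 | (u & 0x3F)])
```

For example, U+00E4 (ä) becomes the two bytes 0xC3 0xA4, and U+20AC (€) becomes the three bytes 0xE2 0x82 0xAC.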

Auto-Detection Scoring

For each code page $cp$, a score $S_{cp}$ is computed:

$$
S_{cp} = \frac{\sum_{i=0}^{n} w(\mathrm{TABLE}_{cp}[b_i])}{n}
$$

where $w(c)$ returns a weight based on whether code point $c$ is a printable letter (2), punctuation (1), or undefined (−5). The code page with the highest $S_{cp}$ is selected.
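A rough Python sketch of this scoring follows. Several details are assumptions, not taken from the tool: only bytes at or above 0x80 are scored, n is the count of those bytes, any code point that is neither a letter nor punctuation gets weight 0, and Python's built-in codecs stand in for the lookup tables. The names `weight`, `score`, and `detect` are hypothetical.

```python
import unicodedata

# Sketch of the scoring formula under the assumptions stated above.

def weight(ch):
    """Weight for one decoded character; None means 'undefined position'."""
    if ch is None:
        return -5
    category = unicodedata.category(ch)
    if category.startswith("L"):   # letters
        return 2
    if category.startswith("P"):   # punctuation
        return 1
    return 0                       # assumption: everything else is neutral


def score(data: bytes, codec: str) -> float:
    """Average weight of the high bytes in data under the given codec."""
    high = [b for b in data if b >= 0x80]
    if not high:
        return 0.0
    total = 0
    for b in high:
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            ch = None              # byte is undefined in this code page
        total += weight(ch)
    return total / len(high)


def detect(data: bytes, codecs=("cp1250", "cp1251", "cp1252", "cp1253")):
    """Return the candidate codec with the highest score (first one on ties)."""
    return max(codecs, key=lambda c: score(data, c))
```

With weights this coarse, ties between code pages are common, because most high bytes decode to letters in several code pages at once. That is consistent with the limitation noted in About: short or ambiguous input may need manual selection.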

Reference Data

| Code Page | Name | Primary Languages | Unique Range | Notable Characters |
|---|---|---|---|---|
| 1250 | Windows-1250 | Polish, Czech, Hungarian, Romanian | 0x80 - 0xFF | Š, š, Ž, ž, Ł, ł |
| 1251 | Windows-1251 | Russian, Ukrainian, Bulgarian, Serbian | 0x80 - 0xFF | А - я, Ё, ё |
| 1252 | Windows-1252 | English, French, German, Spanish, Portuguese | 0x80 - 0x9F | €, „, “, ”, –, — |
| 1253 | Windows-1253 | Greek | 0x80 - 0xFF | Α - Ω, α - ω |
| 1254 | Windows-1254 | Turkish | 0x80 - 0xFF | Ğ, ğ, İ, ı, Ş, ş |
| 1255 | Windows-1255 | Hebrew | 0xC0 - 0xFA | א - ת (Alef - Tav) |
| 1256 | Windows-1256 | Arabic, Persian, Urdu | 0x80 - 0xFF | ء - ي, ی |
| 1257 | Windows-1257 | Estonian, Latvian, Lithuanian | 0x80 - 0xFF | Ā, ā, Č, č, Ē |
| 1258 | Windows-1258 | Vietnamese | 0x80 - 0xFF | Ơ, ơ, Ư, ư |
| 28591 | ISO-8859-1 | Western European (Latin-1) | 0xA0 - 0xFF | ¿, Ñ, ñ, ß, þ |
| 28592 | ISO-8859-2 | Central European (Latin-2) | 0xA0 - 0xFF | Ą, ą, Ď, ď |
| 28595 | ISO-8859-5 | Cyrillic | 0xA0 - 0xFF | А - я (sequential block) |
| 28597 | ISO-8859-7 | Greek | 0xA0 - 0xFF | Α - Ω (ISO standard) |
| 28599 | ISO-8859-9 | Turkish (Latin-5) | 0xD0, 0xDD, 0xF0, 0xFD | Ğ, İ, ğ, ı replace Ð, Ý, ð, ý |
| 28605 | ISO-8859-15 | Western European (Latin-9) | 0xA4, 0xA6, 0xA8 | €, Š, š replace ¤, ¦, ¨ |

Frequently Asked Questions

Windows-1252 and ISO-8859-1 both cover Western European languages, but they differ in the range 0x80 - 0x9F. ISO-8859-1 maps these bytes to C1 control characters (non-printable). Windows-1252 reassigns them to useful characters: the euro sign € at 0x80, smart quotes “ ” at 0x93 - 0x94, and the em dash — at 0x97. In practice, many files labeled ISO-8859-1 actually use Windows-1252. If you see € or curly quotes in your data, select Windows-1252.
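The difference is easy to see by decoding the same four bytes both ways. This short Python sketch uses the standard `cp1252` and `latin-1` codecs; the variable names are only for illustration.

```python
# The same bytes decoded as Windows-1252 and as ISO-8859-1 (latin-1).
raw = bytes([0x80, 0x93, 0x94, 0x97])

as_1252 = raw.decode("cp1252")      # euro sign, curly quotes, em dash
as_latin1 = raw.decode("latin-1")   # four invisible C1 control characters

code_points_1252 = [f"U+{ord(c):04X}" for c in as_1252]
code_points_latin1 = [f"U+{ord(c):04X}" for c in as_latin1]
# code_points_1252   -> ['U+20AC', 'U+201C', 'U+201D', 'U+2014']
# code_points_latin1 -> ['U+0080', 'U+0093', 'U+0094', 'U+0097']
```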
Auto-detection uses statistical scoring against all 15 code page tables. It works well on natural-language text longer than 50 bytes because letter frequency distributions differ between languages. Short strings (under 20 bytes), numeric data, or mixed-encoding files produce ambiguous scores. In those cases, manual code page selection is necessary. The detector cannot distinguish between code pages that share most of their mapping (e.g., ISO-8859-1 vs ISO-8859-15 differ in only 8 positions).
The converter does not handle files that are already partly or fully UTF-8. It treats every byte as an ANSI-encoded value, so if your file contains valid UTF-8 multi-byte sequences, those bytes will be individually remapped through the code page table, producing corrupted output. For mixed-encoding files, you must first identify and split the UTF-8 and ANSI segments, then convert only the ANSI portions. A common sign of mixed encoding: some characters display correctly while others appear as mojibake.
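One way to avoid converting UTF-8 a second time is to check the input before running it through a code page. The Python sketch below is a heuristic, not something this tool is known to do: the function name `looks_like_utf8` is hypothetical, and a short ANSI string can occasionally pass as valid UTF-8 by coincidence.

```python
# Heuristic pre-check: is this byte string probably UTF-8 already?

def looks_like_utf8(data: bytes) -> bool:
    """True if data has non-ASCII bytes and decodes as strict UTF-8."""
    if data.isascii():
        return False           # pure ASCII is identical in ANSI and UTF-8
    try:
        data.decode("utf-8")   # strict mode: raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True


# What goes wrong when UTF-8 bytes are remapped as Windows-1252:
already_utf8 = "ä".encode("utf-8")           # b'\xc3\xa4'
double_encoded = already_utf8.decode("cp1252")
# double_encoded is the two-character string 'Ã¤'
```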
The Unicode replacement character � (U+FFFD) appears when a byte maps to an undefined position in the selected code page. For example, Windows-1253 (Greek) leaves positions 0xAA, 0xD2, and 0xFF undefined. If your source data uses those byte values, the original encoding is not Windows-1253. Try a different code page or use auto-detection.
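The same behavior can be reproduced with Python's built-in `cp1253` codec, which is assumed here to match the tool's Windows-1253 table for these positions. In strict mode an undefined byte raises an error; with `errors="replace"` it becomes U+FFFD.

```python
# 0xC1 is GREEK CAPITAL LETTER ALPHA in Windows-1253; 0xAA is undefined.
raw = bytes([0xC1, 0xAA])

text = raw.decode("cp1253", errors="replace")   # Alpha followed by U+FFFD

try:
    raw.decode("cp1253")                         # strict mode
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# List every high byte that Python's cp1253 codec leaves undefined.
undefined = []
for b in range(0x80, 0x100):
    try:
        bytes([b]).decode("cp1253")
    except UnicodeDecodeError:
        undefined.append(b)
```

Scanning a file for bytes in the `undefined` list is a quick way to rule a code page out before converting.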
ANSI encodings are single-byte and have no byte order. UTF-8 output is also byte-order independent by design (unlike UTF-16, which needs a BOM or an explicitly declared byte order). However, some applications prepend a UTF-8 BOM (0xEF 0xBB 0xBF) to signal the encoding. This tool offers an optional BOM toggle. Enable it if your target application relies on the BOM to detect UTF-8 (e.g., older versions of Excel reading CSV files).
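For reference, this is what the BOM looks like at the byte level. The Python sketch uses the standard `utf-8-sig` codec, which prepends the three BOM bytes on encoding; it illustrates the output format only and says nothing about how the tool's toggle is implemented.

```python
import codecs

text = "ä"

without_bom = text.encode("utf-8")       # b'\xc3\xa4'
with_bom = text.encode("utf-8-sig")      # b'\xef\xbb\xbf\xc3\xa4'

# The only difference is the three-byte prefix EF BB BF.
prefix_matches = with_bom == codecs.BOM_UTF8 + without_bom
```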
The tool processes files up to 50 MB using a Web Worker to prevent browser UI freezing. Files above 100 KB are automatically offloaded to the worker thread and processed in chunks of 64 KB. Processing speed depends on your device: a typical laptop converts 10 MB in under 2 seconds. For files larger than 50 MB, use a command-line tool like iconv.
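Chunked conversion is safe for ANSI input because every character is exactly one byte, so a chunk boundary can never split a character. The Python sketch below shows the idea with the same 64 KB chunk size; it is not the tool's Web Worker code, and `convert_stream` and its default codec are hypothetical.

```python
import io

CHUNK_SIZE = 64 * 1024  # 64 KB, matching the chunk size mentioned above


def convert_stream(src, dst, codec="cp1252"):
    """Read ANSI bytes from src in chunks and write UTF-8 bytes to dst."""
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        # Undefined bytes become U+FFFD instead of aborting the conversion.
        dst.write(chunk.decode(codec, errors="replace").encode("utf-8"))


# Usage with in-memory streams; real files would be opened in binary mode.
source = io.BytesIO(b"\xe4" * 100_000)   # 100,000 x 'ä' in Windows-1252
target = io.BytesIO()
convert_stream(source, target)
```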