Zawgyi and Unicode Myanmar Text Converter
Convert Myanmar text between Zawgyi and Unicode encoding instantly. Auto-detects encoding, supports bulk conversion with the complete Parabaik ruleset.
About
Myanmar digital text exists in two incompatible encodings: Zawgyi-One (a legacy font-encoding hybrid) and Unicode (the international standard, specifically the Myanmar block U+1000 - U+109F). Misidentifying or failing to convert between them causes garbled text ("mojibake"), broken search indexing, and corrupted database records. This tool implements the complete Parabaik conversion ruleset - approximately 200 sequential regex substitution rules per direction - to perform accurate, deterministic conversion. It auto-detects the input encoding by scoring occurrences of Zawgyi-specific code points (e.g., U+1033, U+1060 - U+1097), which lie inside the Myanmar block but are not used by standard Burmese Unicode text.
Limitation: the detection heuristic needs at least a few Myanmar syllables to reach a confident result. Single-character inputs, or text that mixes both encodings in one string, may produce ambiguous results. For mixed-encoding documents, split the text and convert each section independently. The conversion is purely rule-based and does not handle Shan, Mon, or other minority-language text written with the extended Myanmar code points, including those above U+109F. Because Unicode assigns U+1033 and parts of U+1060 - U+1097 to those languages, such text may also be misdetected as Zawgyi.
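The limitations above can be turned into a quick pre-check before conversion. The following is a minimal sketch, not the tool's own code: the function name `precheck`, the warning labels, and the threshold `MIN_MYANMAR_CHARS` are hypothetical, and the extended ranges checked are the Unicode blocks Myanmar Extended-A (U+AA60 - U+AA7F) and Myanmar Extended-B (U+A9E0 - U+A9FF).

```python
import re
from typing import List

# Characters in the main Myanmar block, U+1000-U+109F.
MYANMAR_BASE = re.compile("[\u1000-\u109f]")
# Myanmar Extended-A (U+AA60-U+AA7F) and Extended-B (U+A9E0-U+A9FF).
MYANMAR_EXTENDED = re.compile("[\uaa60-\uaa7f\ua9e0-\ua9ff]")

# Hypothetical threshold standing in for "a few syllables";
# the page does not state the real minimum.
MIN_MYANMAR_CHARS = 6


def precheck(text: str) -> List[str]:
    """Return warning labels for inputs the converter may handle poorly."""
    warnings = []
    if len(MYANMAR_BASE.findall(text)) < MIN_MYANMAR_CHARS:
        warnings.append("too_short")
    if MYANMAR_EXTENDED.search(text):
        warnings.append("extended_block")
    return warnings
```

An empty list means neither limitation was triggered. It does not guarantee that detection will be correct.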
Formulas
Encoding detection uses a weighted scoring heuristic. For each input string, the detector counts occurrences of Zawgyi-exclusive code points and computes a confidence score:
$$S_{zg} = \frac{n_{zawgyi}}{n_{total}}$$
where $n_{zawgyi}$ is the count of characters matching the Zawgyi-only set {U+1033, U+1060 - U+1097} and $n_{total}$ is the total Myanmar character count in range U+1000 - U+109F. If $S_{zg} > 0$, the text is classified as Zawgyi.
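As a rough illustration of the detection step, the sketch below assumes the score is the plain ratio of Zawgyi-only characters to all Myanmar-block characters, which is how the definitions above read. The page calls the heuristic weighted but gives no weights, so none are applied here. The names `zawgyi_score` and `is_zawgyi` are hypothetical.

```python
import re

# Code points the page lists as Zawgyi indicators: U+1033 and U+1060-U+1097.
ZAWGYI_ONLY = re.compile("[\u1033\u1060-\u1097]")
# Every character in the Myanmar block, U+1000-U+109F.
MYANMAR = re.compile("[\u1000-\u109f]")


def zawgyi_score(text: str) -> float:
    """Return n_zawgyi / n_total, or 0.0 when the text has no Myanmar characters."""
    n_total = len(MYANMAR.findall(text))
    if n_total == 0:
        return 0.0  # avoid dividing by zero on non-Myanmar input
    n_zawgyi = len(ZAWGYI_ONLY.findall(text))
    return n_zawgyi / n_total


def is_zawgyi(text: str) -> bool:
    """Classify as Zawgyi when the score is above zero (the threshold stated above)."""
    return zawgyi_score(text) > 0
```

For example, `zawgyi_score("\u1000\u1064")` counts one Zawgyi-only character out of two Myanmar characters, so the string is classified as Zawgyi. A threshold of zero means a single indicator code point is enough, which is why very short or mixed inputs are unreliable.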
The conversion itself is a sequential rule application pipeline:
$$T_{out} = R_n(R_{n-1}(\cdots R_2(R_1(T_{in})) \cdots))$$
where $T_{in}$ is the input text, $T_{out}$ is the converted text, and each $R_i$ is a regex substitution rule: $R_i = \mathrm{replace}(pattern_i, replacement_i)$. Order matters: rules handle medial reordering, Kinzi normalization, vowel sign remapping, and stacked consonant decomposition in a specific sequence to avoid conflicts.
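The pipeline idea can be sketched as a loop over (pattern, replacement) pairs, each applied to the output of the previous one. The two rules below are illustrative stand-ins based on the mappings in the reference table (U+1033 to U+102F, and Zawgyi Kinzi U+1064 to the Unicode Kinzi sequence). They are not the actual Parabaik rules, and the names `ILLUSTRATIVE_RULES` and `apply_rules` are hypothetical.

```python
import re
from typing import List, Tuple

Rule = Tuple[str, str]  # (regex pattern, replacement)

# Illustrative Zawgyi -> Unicode rules only; NOT the real Parabaik table.
ILLUSTRATIVE_RULES: List[Rule] = [
    # Vowel sign remapping: Zawgyi U+1033 becomes U+102F.
    ("\u1033", "\u102f"),
    # Kinzi normalization (simplified): Zawgyi stores Kinzi (U+1064) after the
    # consonant; Unicode stores U+1004 U+103A U+1039 before the consonant.
    ("([\u1000-\u1021])\u1064", "\u1004\u103a\u1039\\1"),
]


def apply_rules(text: str, rules: List[Rule]) -> str:
    """Apply each rule in list order; every rule sees the previous rule's output."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text
```

Because each rule consumes the output of the one before it, reordering the list can change the result whenever one rule's replacement overlaps with another rule's pattern. A full ruleset of about 200 entries would rely on that fixed order.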
Reference Data
| Feature | Zawgyi-One | Unicode (Myanmar) |
|---|---|---|
| Standard | Proprietary font hack | Unicode Consortium (ISO 10646) |
| Block Range | Reuses U+1000 - U+109F + PUA | U+1000 - U+109F (strict) |
| Character Order | Visual (left-to-right render order) | Logical (phonetic order) |
| Medial Stacking | Separate code points per visual form | Virama + consonant sequence |
| Kinzi (င်္) | Dedicated code point U+1064 | Sequence: U+1004 U+103A U+1039 |
| U+1033 | Used as vowel sign | Not used in standard Burmese (replaced by U+102F) |
| U+1060 - U+1097 | Pre-composed stacked consonants | Not used in standard Burmese (use Virama stacking) |
| Sorting | Broken (visual order) | Correct (logical order) |
| Search/Regex | Unreliable | Standards-compliant |
| OS Support (2020+) | Requires font install | Native on Android 10+, iOS 13+, Win 10+ |
| Database Storage | Collation errors common | UTF-8 compatible |
| NLP / ML Usage | Requires pre-conversion | Direct tokenization possible |
| Myanmar NRC Format | Often garbled in forms | Correctly rendered |
| Facebook (Myanmar) | Supported until 2019 | Forced migration in 2019 |
| Government Mandate | No official status | Mandated by Myanmar govt (2019) |
| Conversion Direction | Zawgyi → Unicode (recommended) | Unicode → Zawgyi (legacy support) |
| Detection Method | Code point frequency scoring | Absence of Zawgyi-only points |
| Rule Count (Parabaik) | ~200 regex rules per direction | ~200 regex rules per direction |
| Common Error | Garbled syllables (wrong rendering) | Correct syllable boundary |
| Virama Usage | Rare / visual workaround | Core mechanism (U+1039) |