About

Myanmar digital text exists in two incompatible encodings: Zawgyi-One (a legacy font-encoding hybrid) and Unicode (the international standard, specifically the Myanmar block U+1000 - U+109F). Misidentifying the encoding, or failing to convert between the two, causes garbled text ("mojibake"), broken search indexing, and corrupted database records. This tool implements the complete Parabaik conversion ruleset - approximately 200 sequential regex substitution rules per direction - to perform accurate, deterministic encoding conversion. It auto-detects the input encoding by scoring occurrences of Zawgyi-specific code points (e.g., U+1033, U+1060 - U+1097), which sit inside the Myanmar block but are not used in standard Burmese Unicode text.

Limitation: the detection heuristic needs at least a few Myanmar syllables to give a confident result. Single-character inputs, or text that mixes both encodings in one string, may produce ambiguous results. For mixed-encoding documents, split the text and convert each section independently. The conversion is purely rule-based and does not handle Shan, Mon, or other extended uses of the Myanmar script, including those with code points above U+109F.

Formulas

Encoding detection uses a scoring heuristic based on code point frequency. For each input string, the detector counts occurrences of Zawgyi-exclusive code points and computes a confidence score:

S_zg = n_zawgyi / n_total

where n_zawgyi is the count of characters matching the Zawgyi-only set {U+1033, U+1060 - U+1097} and n_total is the total Myanmar character count in range U+1000 - U+109F. If S_zg > 0, the text is classified as Zawgyi.
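As a minimal Python sketch of this score: the function names zawgyi_score and is_zawgyi are illustrative, and returning 0 for a string with no Myanmar characters is an assumption the text does not state.

```python
def zawgyi_score(text: str) -> float:
    """Return S_zg = n_zawgyi / n_total for the given string."""
    # n_total: characters in the Myanmar block U+1000 - U+109F.
    n_total = sum(1 for ch in text if 0x1000 <= ord(ch) <= 0x109F)
    # n_zawgyi: characters in the Zawgyi-only set {U+1033, U+1060 - U+1097}.
    n_zawgyi = sum(
        1 for ch in text
        if ord(ch) == 0x1033 or 0x1060 <= ord(ch) <= 0x1097
    )
    # Assumption: a string with no Myanmar characters scores 0.
    return n_zawgyi / n_total if n_total else 0.0


def is_zawgyi(text: str) -> bool:
    """Classify as Zawgyi when S_zg > 0, as described above."""
    return zawgyi_score(text) > 0
```

Because the threshold is S_zg > 0, a single Zawgyi-only code point is enough to classify the whole string as Zawgyi; the ratio itself only indicates how strong the evidence is.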

The conversion itself is a sequential rule application pipeline:

T_out = (R_n ∘ R_(n−1) ∘ … ∘ R_2 ∘ R_1)(T_in)

where each R_i is a regex substitution rule: R_i = replace(pattern_i, replacement_i). Order matters: rules handle medial reordering, Kinzi normalization, vowel sign remapping, and stacked consonant decomposition in a specific sequence to avoid conflicts.
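The pipeline can be sketched in Python as a fold over an ordered rule list. The two rules below are illustrative only and are not the actual Parabaik rules: they are derived from the Kinzi and U+1033 rows of the reference table, and the Kinzi rule assumes that Zawgyi stores U+1064 after the consonant it sits on. The names RULES and apply_rules are hypothetical.

```python
import re
from functools import reduce

# Illustrative Zawgyi -> Unicode rules, applied first to last (R_1, R_2, ...).
RULES = [
    # Assumed Kinzi rule: consonant + U+1064 becomes the Unicode sequence
    # U+1004 U+103A U+1039 placed before that consonant.
    (re.compile("([\u1000-\u1021])\u1064"), "\u1004\u103A\u1039\\1"),
    # Zawgyi vowel sign U+1033 maps to Unicode U+102F.
    (re.compile("\u1033"), "\u102F"),
]


def apply_rules(text: str, rules=RULES) -> str:
    """Compute R_n(...R_2(R_1(text))) by applying each rule in list order."""
    return reduce(lambda acc, rule: rule[0].sub(rule[1], acc), rules, text)
```

Keeping the rules in a plain ordered list makes the sequence explicit, so reordering the list is the only way to change which rule sees the text first.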

Reference Data

Feature	Zawgyi-One	Unicode (Myanmar)
Standard	Proprietary font hack	Unicode Consortium (ISO 10646)
Block Range	Reuses U+1000 - U+109F + PUA	U+1000 - U+109F (strict)
Character Order	Visual (left-to-right render order)	Logical (phonetic order)
Medial Stacking	Separate code points per visual form	Virama + consonant sequence
Kinzi (င်္)	Dedicated code point U+1064	Sequence: U+1004 U+103A U+1039
U+1033	Used as vowel sign	Not used (replaced by U+102F)
U+1060 - U+1097	Pre-composed stacked consonants	Not used (use Virama stacking)
Sorting	Broken (visual order)	Correct (logical order)
Search/Regex	Unreliable	Standards-compliant
OS Support (2020+)	Requires font install	Native on Android 10+, iOS 13+, Win 10+
Database Storage	Collation errors common	UTF-8 compatible
NLP / ML Usage	Requires pre-conversion	Direct tokenization possible
Myanmar NRC Format	Often garbled in forms	Correctly rendered
Facebook (Myanmar)	Supported until 2019	Forced migration in 2019
Government Mandate	No official status	Mandated by Myanmar govt (2019)
Conversion Direction	Zawgyi → Unicode (recommended)	Unicode → Zawgyi (legacy support)
Detection Method	Code point frequency scoring	Absence of Zawgyi-only points
Rule Count (Parabaik)	~200 regex rules per direction	~200 regex rules per direction
Common Error	လာမရ (wrong rendering)	Correct syllable boundary
Virama Usage	Rare / visual workaround	Core mechanism (U+1039)

Frequently Asked Questions

The detector scans for code points that are used only by the Zawgyi encoding, specifically U+1033 and the range U+1060 - U+1097. These code points do not appear in standard Burmese Unicode text. If any are found, the text is classified as Zawgyi. If none are present and the text contains characters in the Myanmar block (U+1000 - U+109F), it is classified as Unicode. Very short strings with no Zawgyi-exclusive characters default to Unicode, because in that case the two encodings cannot be told apart.

Zawgyi stores characters in visual rendering order, while Unicode uses logical phonetic order. Converting between them requires reordering medial consonants, vowel signs, and tone marks. A rule that moves a medial consonant may create a substring that matches a later rule. If rules execute out of order, intermediate states get corrupted. The Parabaik ruleset is specifically ordered to process Kinzi (U+1004 U+103A U+1039) first, then stacked consonants, then medials, then vowel signs, and finally tone marks.
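The order dependence can be shown with a toy Python example. The rules use placeholder ASCII letters ("m" for a marker stored before a consonant "c") and are not taken from the Parabaik ruleset; the names run, MOVE, and NORMALIZE are hypothetical.

```python
import re


def run(text: str, rules) -> str:
    """Apply (pattern, replacement) rules one after another."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Placeholder rule 1: move the marker "m" from before "c" to after it.
MOVE = (r"m(c)", r"\1m")
# Placeholder rule 2: rewrite a marker that already follows "c" as "M".
NORMALIZE = (r"(c)m", r"\1M")

in_order = run("mc", [MOVE, NORMALIZE])      # "mc" -> "cm" -> "cM"
out_of_order = run("mc", [NORMALIZE, MOVE])  # "mc" -> "mc" -> "cm"
```

The second rule only matches text that the first rule has already reordered, so running it too early leaves the marker unconverted. This is the same kind of dependency that fixes the sequence of the real rules.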

Text that mixes both encodings in a single string cannot be converted reliably. Mixed-encoding strings are inherently ambiguous because certain code points (U+1000 - U+1032, U+1034 - U+105F) are shared between both encodings. The converter treats the entire input as one encoding. For mixed documents, you must isolate the Zawgyi and Unicode sections manually, convert each separately, then recombine them. Attempting to convert an already-Unicode segment with the Zawgyi→Unicode ruleset will corrupt it.
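Where the sections of a mixed document happen to fall on line boundaries, part of that isolation could be scripted. The Python sketch below is one possible approach, not the tool's behaviour: convert_by_line and has_zawgyi_only are hypothetical names, and zg_to_unicode stands in for whatever Zawgyi→Unicode converter is available.

```python
from typing import Callable


def has_zawgyi_only(segment: str) -> bool:
    """True if the segment contains U+1033 or any of U+1060 - U+1097."""
    return any(ch == "\u1033" or "\u1060" <= ch <= "\u1097" for ch in segment)


def convert_by_line(document: str, zg_to_unicode: Callable[[str], str]) -> str:
    """Convert only the lines that contain Zawgyi-only code points."""
    converted = []
    for line in document.split("\n"):
        if has_zawgyi_only(line):
            converted.append(zg_to_unicode(line))
        else:
            # Left untouched, so Unicode lines are never run through
            # the Zawgyi→Unicode rules.
            converted.append(line)
    return "\n".join(converted)
```

A Zawgyi line that contains no Zawgyi-only code points is still left unconverted by this sketch, so the output needs a manual check.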

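This tool covers Burmese text in the core Myanmar block U+1000 - U+109F. Extended characters for Mon (U+105A - U+1060), Shan (U+1075 - U+108D, with Shan digits at U+1090 - U+1099), Khamti Shan and Aiton (Myanmar Extended-A: U+AA60 - U+AA7F), and Kayah Li are partially handled only where they overlap with the Parabaik ruleset. Because the Shan letters fall inside the U+1060 - U+1097 range that the detector treats as Zawgyi-only, Shan text is likely to be classified as Zawgyi. Dedicated Shan or Mon conversion requires additional rules not included here. Characters outside the base Myanmar block pass through unchanged.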

The converter processes text in the browser with no server round-trip. For texts under 100 KB (roughly 50,000 Myanmar characters), conversion is near-instantaneous. For larger texts, the tool processes the input in chunks to prevent the browser from freezing. It has been tested on up to 1 MB of Myanmar text without errors. Beyond that, browser memory limits may apply depending on the device. For multi-megabyte datasets, split the input into smaller files.
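The text does not say how the tool chunks its input. For splitting a large dataset yourself, one possible approach is to cut at a space or newline so that no syllable is divided between two chunks. The Python sketch below assumes that approach; the name chunk_text and the 50,000-character default are illustrative.

```python
from typing import Iterator


def chunk_text(text: str, max_chars: int = 50_000) -> Iterator[str]:
    """Yield consecutive pieces of text, each at most max_chars long."""
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut just after the last space or newline in the window.
            cut = max(text.rfind("\n", start, end), text.rfind(" ", start, end))
            if cut > start:
                end = cut + 1
            # Otherwise fall back to a hard cut at max_chars.
        yield text[start:end]
        start = end
```

Joining the yielded chunks in order reproduces the original text, so each chunk can be converted separately and the results concatenated. A hard cut can still land inside a syllable when a long run has no spaces, which is common in Myanmar text, so chunk boundaries are worth spot-checking.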

When converted text still renders incorrectly, there are three common causes: (1) The original Zawgyi text was malformed, using non-standard code point sequences that no ruleset can parse. (2) The rendering font does not support Unicode Myanmar correctly - ensure your OS or browser uses a Unicode-compliant Myanmar font such as Noto Sans Myanmar or Padauk. (3) The text contains PUA (Private Use Area) characters from custom Zawgyi font variants that differ from the standard Zawgyi-One encoding this tool targets.