About

Myanmar digital text exists in two incompatible encodings: Zawgyi-One (a legacy font-encoding hybrid) and Unicode (the international standard, specifically the Myanmar block U+1000 - U+109F). Misidentifying the encoding, or failing to convert between the two, causes garbled text ("mojibake"), broken search indexing, and database corruption. This tool implements the complete Parabaik conversion ruleset, approximately 200 sequential regex substitution rules per direction, to perform accurate, deterministic conversion. It auto-detects the input encoding by scoring occurrences of Zawgyi-specific code points (e.g., U+1033, U+1060 - U+1097). These lie inside the Myanmar block but are not used by standard Burmese Unicode text.

Limitation: the detection heuristic needs at least a few Myanmar syllables to be reliable. Single-character inputs, or text that mixes both encodings in one string, may produce ambiguous results. For mixed-encoding documents, split the text and convert each section independently. The conversion is purely rule-based and does not handle Shan, Mon, or other extended Myanmar script subsets, including those that use code points above U+109F.
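
The split-and-convert advice above can be sketched as follows, assuming each line of a document is written in a single encoding. The names `looks_zawgyi` and `convert_mixed` are hypothetical, and `zawgyi_to_unicode` stands in for whatever converter you use. This is not the tool's own code.

```python
from typing import Callable, Iterable, List

# Zawgyi-only code points as listed on this page: U+1033 and U+1060 - U+1097.
ZAWGYI_ONLY = {0x1033, *range(0x1060, 0x1098)}


def looks_zawgyi(segment: str) -> bool:
    """Return True if the segment contains any Zawgyi-only code point."""
    return any(ord(ch) in ZAWGYI_ONLY for ch in segment)


def convert_mixed(
    lines: Iterable[str],
    zawgyi_to_unicode: Callable[[str], str],
) -> List[str]:
    """Convert only the lines detected as Zawgyi; pass the rest through.

    Assumption: no single line mixes both encodings.
    """
    return [
        zawgyi_to_unicode(line) if looks_zawgyi(line) else line
        for line in lines
    ]
```

Lines that are already Unicode are never passed to the converter, which avoids corrupting them with the Zawgyi-to-Unicode rules.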

Formulas

Encoding detection uses a weighted scoring heuristic. For each input string, the detector counts occurrences of Zawgyi-exclusive code points and computes a confidence score:

$$S_{zg} = \frac{n_{zawgyi}}{n_{total}}$$

where $n_{zawgyi}$ is the count of characters matching the Zawgyi-only set {U+1033, U+1060 - U+1097} and $n_{total}$ is the total Myanmar character count in the range U+1000 - U+109F. If $S_{zg} > 0$, the text is classified as Zawgyi.
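
A minimal sketch of this score, written directly from the definitions above. The function names and the "unknown" result for text with no Myanmar characters are assumptions made for illustration, not part of the tool.

```python
# Zawgyi-only set from the definition above: U+1033 and U+1060 - U+1097.
ZAWGYI_ONLY = {0x1033, *range(0x1060, 0x1098)}


def zawgyi_score(text: str) -> float:
    """Return n_zawgyi / n_total, or 0.0 if there are no Myanmar characters."""
    code_points = [ord(ch) for ch in text]
    n_total = sum(1 for cp in code_points if 0x1000 <= cp <= 0x109F)
    if n_total == 0:
        return 0.0
    n_zawgyi = sum(1 for cp in code_points if cp in ZAWGYI_ONLY)
    return n_zawgyi / n_total


def detect_encoding(text: str) -> str:
    """Classify text as 'zawgyi', 'unicode', or 'unknown' (assumed label)."""
    if not any(0x1000 <= ord(ch) <= 0x109F for ch in text):
        return "unknown"  # assumption: no Myanmar characters to judge by
    return "zawgyi" if zawgyi_score(text) > 0 else "unicode"
```

Because the threshold is zero, a single Zawgyi-only code point is enough to classify the whole string as Zawgyi. The ratio itself only indicates how much of the text carries that evidence.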

The conversion itself is a sequential rule application pipeline:

$$T_{out} = (R_n \circ R_{n-1} \circ \cdots \circ R_2 \circ R_1)(T_{in})$$

where each $R_i$ is a regex substitution rule: $R_i = \mathrm{replace}(pattern_i, replacement_i)$. Order matters: rules handle medial reordering, Kinzi normalization, vowel sign remapping, and stacked consonant decomposition in a specific sequence to avoid conflicts.
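
The pipeline can be sketched with three illustrative rules. They are simplified stand-ins, not the actual Parabaik rules: the U+1033 and Kinzi mappings follow the reference data on this page, and the vowel-sign reordering rule is an assumption added to show why order matters.

```python
import re
from typing import List, Pattern, Tuple

# Illustrative rules only (not the real Parabaik ruleset), applied top to bottom.
RULES: List[Tuple[Pattern[str], str]] = [
    # R1: Zawgyi vowel sign U+1033 becomes U+102F.
    (re.compile("\u1033"), "\u102F"),
    # R2: Zawgyi Kinzi U+1064 after a consonant becomes the Unicode
    #     sequence U+1004 U+103A U+1039 placed before that consonant.
    (re.compile("([\u1000-\u1021])\u1064"), "\u1004\u103A\u1039\\1"),
    # R3: vowel sign E (U+1031) typed before the consonant moves after it.
    #     The optional Kinzi prefix assumes R2 has already run.
    (re.compile("\u1031((?:\u1004\u103A\u1039)?[\u1000-\u1021])"), "\\1\u1031"),
]


def apply_rules(text: str, rules: List[Tuple[Pattern[str], str]] = RULES) -> str:
    """Apply each substitution in sequence: T_out = R_n(...R_1(T_in))."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

Running the same rules in reverse order leaves the Kinzi in `"\u1031\u1000\u1064"` unconverted, because the reordering step moves U+1031 between the consonant and U+1064 before the Kinzi rule can match.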

Reference Data

| Feature | Zawgyi-One | Unicode (Myanmar) |
| --- | --- | --- |
| Standard | Proprietary font hack | Unicode Consortium (ISO 10646) |
| Block Range | Reuses U+1000 - U+109F + PUA | U+1000 - U+109F (strict) |
| Character Order | Visual (left-to-right render order) | Logical (phonetic order) |
| Medial Stacking | Separate code points per visual form | Virama + consonant sequence |
| Kinzi (င်္) | Dedicated code point U+1064 | Sequence: U+1004 U+103A U+1039 |
| U+1033 | Used as vowel sign | Not used (replaced by U+102F) |
| U+1060 - U+1097 | Pre-composed stacked consonants | Not used (use Virama stacking) |
| Sorting | Broken (visual order) | Correct (logical order) |
| Search/Regex | Unreliable | Standards-compliant |
| OS Support (2020+) | Requires font install | Native on Android 10+, iOS 13+, Win 10+ |
| Database Storage | Collation errors common | UTF-8 compatible |
| NLP / ML Usage | Requires pre-conversion | Direct tokenization possible |
| Myanmar NRC Format | Often garbled in forms | Correctly rendered |
| Facebook (Myanmar) | Supported until 2019 | Forced migration in 2019 |
| Government Mandate | No official status | Mandated by Myanmar govt (2019) |
| Conversion Direction | Zawgyi → Unicode (recommended) | Unicode → Zawgyi (legacy support) |
| Detection Method | Code point frequency scoring | Absence of Zawgyi-only points |
| Rule Count (Parabaik) | ~200 regex rules per direction | ~200 regex rules per direction |
| Common Error | လာမရ (wrong rendering) | Correct syllable boundary |
| Virama Usage | Rare / visual workaround | Core mechanism (U+1039) |

Frequently Asked Questions

The detector scans for code points that exist only in Zawgyi encoding, specifically U+1033 and the range U+1060 - U+1097. These code points are not used in standard Unicode Myanmar text. If any are found, the text is classified as Zawgyi. If none are present and the text contains characters in the Myanmar block (U+1000 - U+109F), it is classified as Unicode. For very short strings with no Zawgyi-exclusive characters, the result defaults to Unicode, because the shared code points alone cannot distinguish the two encodings.
Zawgyi stores characters in visual rendering order, while Unicode uses logical phonetic order. Converting between them requires reordering medial consonants, vowel signs, and tone marks. A rule that moves a medial consonant may create a substring that matches a later rule. If rules execute out of order, intermediate states get corrupted. The Parabaik ruleset is specifically ordered to process Kinzi (U+1004 U+103A U+1039) first, then stacked consonants, then medials, then vowel signs, and finally tone marks.
Mixed-encoding strings cannot be converted reliably. They are inherently ambiguous because certain code points (U+1000 - U+1032, U+1034 - U+105F) are shared between both encodings. The converter treats the entire input as one encoding. For mixed documents, you must isolate Zawgyi and Unicode sections manually, convert each separately, then recombine. Attempting to convert an already-Unicode segment with the Zawgyi→Unicode ruleset will corrupt it.
This tool covers the core Myanmar (Burmese) script block U+1000 - U+109F. Extended characters for Shan (U+1075 - U+108D), Mon (U+105A - U+1060), the Myanmar Extended-A block (U+AA60 - U+AA7F), and Kayah Li are handled only partially, where they overlap with the Parabaik ruleset. Because Unicode assigns the Shan and Mon characters inside the range that Zawgyi reuses, Unicode text in those languages may be misdetected as Zawgyi. Dedicated Shan or Mon conversion requires additional rules not included here. Characters outside the base Myanmar block pass through unchanged.
The converter processes text in the browser with no server round-trip. For texts under 100 KB (roughly 50,000 Myanmar characters), conversion is near-instantaneous. For larger texts, the tool processes the input in chunks to prevent the browser from freezing. It has been tested with up to 1 MB of Myanmar text without errors. Beyond that, browser memory limits may apply, depending on the device. For multi-megabyte datasets, split the input into smaller files.
Garbled output after conversion usually has one of three causes: (1) The original Zawgyi text was malformed, using non-standard code point sequences that no ruleset can parse. (2) The rendering font does not support Unicode Myanmar correctly. Make sure your OS or browser uses a Unicode-compliant Myanmar font such as Noto Sans Myanmar or Padauk. (3) The text contains PUA (Private Use Area) characters from custom Zawgyi font variants that differ from the standard Zawgyi-One encoding this tool targets.
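
To check converted output for causes (1) and (3), one option is to scan it for leftover Zawgyi-only or Private Use Area code points. The sketch below is a hypothetical diagnostic helper, not a feature of this tool. It assumes the Basic Multilingual Plane PUA range U+E000 - U+F8FF and the Zawgyi-only set described on this page.

```python
from typing import List, Tuple


def find_suspect_chars(text: str) -> List[Tuple[int, str, str]]:
    """Return (index, code point, reason) for characters that should not
    appear in cleanly converted Burmese Unicode text."""
    suspects = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if 0xE000 <= cp <= 0xF8FF:
            # Private Use Area: likely from a custom Zawgyi font variant.
            suspects.append((index, f"U+{cp:04X}", "private use"))
        elif cp == 0x1033 or 0x1060 <= cp <= 0x1097:
            # Zawgyi-only code point that survived conversion.
            suspects.append((index, f"U+{cp:04X}", "zawgyi-only"))
    return suspects
```

An empty result does not prove the text is correct, since a missing font (cause 2) cannot be detected from the code points alone.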