About

Testing text rendering engines, database storage constraints, or parsing algorithms requires robust data inputs. Standard ASCII or localized text often fails to trigger edge cases associated with multi-byte encoding or unassigned Unicode planes. This generator synthesizes random valid Unicode code points represented in UTF-32 (4-byte fixed length).

Unlike variable-length encodings such as UTF-8 or UTF-16, UTF-32 represents every code point using exactly 32 bits. This tool excludes the UTF-16 surrogate range (0xD800 to 0xDFFF) by construction, so every generated value is a valid, standalone code point (a Unicode scalar value) drawn from the full 0x000000 to 0x10FFFF range.


Formulas

When selecting the full valid Unicode range, the algorithm must skip the surrogate gap. Generating a raw random number between 0 and 1114111 (0x10FFFF) introduces the risk of hitting an invalid surrogate. Instead, we calculate the total valid space:

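$$V = \text{Max} - S_{\text{count}}$$

Where $\text{Max}$ is 1114112 and $S_{\text{count}}$ (surrogates) is 2048. We generate a random integer $R$ such that $0 \le R < V$. To map $R$ back to a valid code point $C$:

$$C = \begin{cases} R & \text{if } R < \text{0xD800} \\ R + \text{0x0800} & \text{if } R \ge \text{0xD800} \end{cases}$$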

This ensures uniform distribution across all valid Unicode characters without requiring recalculation or looping upon hitting an invalid block.
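
The mapping above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual source: the names `index_to_code_point` and `random_code_point` are hypothetical, and Python's standard `random` module stands in for whatever random source the tool really uses.

```python
import random

MAX_CODE_POINTS = 0x110000   # "Max": 1114112 code points, 0x000000..0x10FFFF
SURROGATE_START = 0xD800     # first surrogate code point
SURROGATE_COUNT = 0x0800     # "Scount": 2048 code points, 0xD800..0xDFFF
VALID_COUNT = MAX_CODE_POINTS - SURROGATE_COUNT   # V


def index_to_code_point(r: int) -> int:
    """Map an index 0 <= r < V to a code point, jumping over the surrogate gap."""
    if not 0 <= r < VALID_COUNT:
        raise ValueError("index must satisfy 0 <= r < V")
    # Indices below the gap map to themselves; the rest shift up by the gap size.
    return r if r < SURROGATE_START else r + SURROGATE_COUNT


def random_code_point(rng=random) -> int:
    """Draw one code point uniformly from all non-surrogate code points.

    `rng` is assumed to be any object exposing randrange(), such as the
    `random` module itself or a seeded random.Random instance.
    """
    return index_to_code_point(rng.randrange(VALID_COUNT))
```

Because every index in `[0, V)` maps to exactly one non-surrogate code point, a uniform draw of the index gives a uniform draw of the code point, and no retry loop is needed.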

Reference Data

| Plane | Name | Range (Hex) | Primary Content |
|-------|------|-------------|-----------------|
| 0 | Basic Multilingual Plane (BMP) | 0000-FFFF | Common modern languages, symbols |
| 1 | Supplementary Multilingual (SMP) | 10000-1FFFF | Historic scripts, musical symbols, emojis |
| 2 | Supplementary Ideographic (SIP) | 20000-2FFFF | Rare and historic CJK ideographs |
| 3 | Tertiary Ideographic (TIP) | 30000-3FFFF | Additional ancient CJK ideographs |
| 4-13 | Unassigned | 40000-DFFFF | Reserved for future use |
| 14 | Supplementary Special (SSP) | E0000-EFFFF | Format control characters |
| 15 | Private Use Area (PUA-A) | F0000-FFFFF | Custom corporate/user defined |
| 16 | Private Use Area (PUA-B) | 100000-10FFFF | Custom corporate/user defined |
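
Since each plane spans exactly 0x10000 code points, the plane number in the table is simply the code point shifted right by 16 bits. The helper below is a small sketch of that lookup; the name `plane_of` is hypothetical and chosen only for illustration.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane number (0..16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside 0x000000..0x10FFFF")
    # Each plane holds 0x10000 (65536) code points, so the plane is the
    # value of the bits above the low 16.
    return code_point >> 16


# U+1F600 sits in plane 1 (SMP); U+10FFFF is the last code point of plane 16.
smp_example = plane_of(0x1F600)
last_plane = plane_of(0x10FFFF)
```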

Frequently Asked Questions

The Unicode standard defines a maximum code point of 0x10FFFF, which requires 21 bits to represent. Those 21 bits would fit in 3 bytes, but 24-bit units do not align with the native word sizes of common hardware, so the next practical fixed size is 32 bits (4 bytes). With a fixed width, looking up the nth code point is O(1), unlike UTF-8, where finding the nth code point requires scanning from the beginning.
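
As a rough illustration of the constant-time lookup, the sketch below reads the nth code point straight from a big-endian UTF-32 buffer by computing its byte offset. The function name `nth_code_point` is hypothetical, and the buffer is assumed to carry no BOM.

```python
def nth_code_point(utf32_be: bytes, n: int) -> int:
    """Return the n-th code point (0-based) of a BOM-less UTF-32BE buffer."""
    if n < 0:
        raise IndexError("index must be non-negative")
    offset = 4 * n                       # every code point occupies 4 bytes
    unit = utf32_be[offset:offset + 4]
    if len(unit) != 4:
        raise IndexError("index past the end of the buffer")
    return int.from_bytes(unit, "big")


text = "a\u00e9\U0001F600z"              # 1-, 2-, 4- and 1-byte characters in UTF-8
utf32 = text.encode("utf-32-be")         # explicit-endian codec writes no BOM
utf8 = text.encode("utf-8")

third = nth_code_point(utf32, 2)         # direct jump to byte offset 8
```

The same four characters take 16 bytes in UTF-32 and 8 bytes in UTF-8, which is the usual trade: UTF-32 spends space to make the offset computable without scanning.
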
This tool strictly prevents the generation of surrogate code points (0xD800 to 0xDFFF). Surrogates exist only so that UTF-16 can encode supplementary code points as pairs; UTF-32 is wide enough to represent every code point directly, so a surrogate value has no meaning there. A surrogate in a UTF-32 stream is ill-formed under the Unicode standard and may cause a strict parser to fail.
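
One way to see this behaviour is with Python's built-in `utf-32-be` codec, which, in CPython 3.4 and later, refuses to decode a surrogate value. The sketch below is only a demonstration of that one decoder; other parsers may be more lenient. The helper name `is_scalar_value` is hypothetical.

```python
def is_scalar_value(code_point: int) -> bool:
    """True for code points in 0..0x10FFFF that are not surrogates."""
    return 0 <= code_point <= 0x10FFFF and not (0xD800 <= code_point <= 0xDFFF)


# The 4-byte big-endian form of the surrogate U+D800: 00 00 D8 00.
surrogate_bytes = (0xD800).to_bytes(4, "big")

try:
    surrogate_bytes.decode("utf-32-be")
    rejected = False
except UnicodeDecodeError:
    # Strict decoders treat a surrogate in UTF-32 as ill-formed input.
    rejected = True
```
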
Endianness determines the byte order. For the code point U+1F600 (represented in 32 bits as 00 01 F6 00), Big Endian stores the most significant byte first: [00, 01, F6, 00]. Little Endian stores the least significant byte first: [00, F6, 01, 00]. Different systems expect different byte orders, and a stream often signals its order with a Byte Order Mark (BOM, U+FEFF) at the start.
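
The two byte orders can be reproduced with Python's standard `struct` module, as a small sketch. The variable names are illustrative only; the BOM constants come from the standard `codecs` module.

```python
import codecs
import struct

code_point = 0x1F600

big = struct.pack(">I", code_point)      # most significant byte first
little = struct.pack("<I", code_point)   # least significant byte first

# The built-in explicit-endian codecs produce the same bytes for one character.
same_as_codec_be = big == chr(code_point).encode("utf-32-be")
same_as_codec_le = little == chr(code_point).encode("utf-32-le")

# The BOM is U+FEFF written in the stream's own byte order, so a reader can
# tell the two orders apart from the first four bytes.
bom_be = struct.pack(">I", 0xFEFF)       # equals codecs.BOM_UTF32_BE
bom_le = struct.pack("<I", 0xFEFF)       # equals codecs.BOM_UTF32_LE
```
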
The tool generates the mathematical code point. If your operating system or browser lacks a font file that contains a glyph for that specific code point (especially common in Planes 2-14), it will render a "tofu" block (a hollow rectangle) as a fallback mechanism.