About

Romanized Japanese (romaji) introduces ambiguity that compounds with text length. A single syllable like shi maps unambiguously to し or シ, but ki maps to over 40 distinct kanji, including 気, 木, 機, and 記, each with a radically different meaning. Misselecting a kanji in formal writing or signage is not a typo; it is a semantic error that can change contracts, addresses, or medical instructions. This converter implements a greedy left-to-right tokenizer against the complete Modified Hepburn romanization table (107 mora entries) covering gojūon, dakuten (゛), handakuten (゜), yōon combinations, and sokuon (っ) doubling rules. The kanji dictionary returns all common kanji for each reading so you can cross-reference, not guess.

Limitations: kanji selection is reading-based, not context-based. Natural language disambiguation (e.g., distinguishing 橋 from 箸 for hashi) requires sentence-level NLP beyond this tool's scope. For particle-aware conversion or compound-word kanji, consult a full IME. This tool approximates dictionary lookup assuming isolated mora input.

Formulas

The converter uses a greedy left-to-right tokenization algorithm over the input romaji string. At each position i, the algorithm attempts to match the longest possible romaji token.

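tokenize(s): start at i = 0; try the 3-character slice s[i..i+3), then the 2-character slice s[i..i+2), then the single character s[i] against lookup table T; on the first match, emit the kana and advance i by the length of the match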

Sokuon detection: if s[i] = s[i+1] and both are consonants ≠ n, emit っ (hiragana) or ッ (katakana) and advance i by 1. Syllabic n detection: emit ん/ン when n appears before a consonant (not a, i, u, e, o, y) or at string end.

Where s = input romaji string, i = current scan position, T = mora lookup dictionary containing 107 entries mapping Modified Hepburn romaji to Unicode kana codepoints. Kanji mode queries a secondary dictionary K keyed by romaji reading, returning an array of all kanji sharing that on'yomi or kun'yomi reading.
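The scan described above can be sketched in a few lines of Python. This is a minimal illustration rather than the converter's actual code: the table T below is a hypothetical excerpt of a dozen or so entries instead of the full 107, only hiragana output is shown, and the names tokenize, VOWELS, and CONSONANTS are assumptions chosen for the example.

```python
# Hypothetical excerpt standing in for the 107-entry lookup table T.
T = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ko": "こ", "ga": "が",
    "na": "な", "ni": "に", "shi": "し", "chi": "ち",
    "kya": "きゃ", "kyo": "きょ",
}
VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmpqrstvwxyz")  # deliberately excludes n


def tokenize(s: str) -> str:
    """Greedy left-to-right romaji -> hiragana conversion (sketch)."""
    out = []
    i = 0
    while i < len(s):
        c = s[i]
        nxt = s[i + 1] if i + 1 < len(s) else ""
        # Sokuon: a doubled consonant other than n emits っ and advances by 1.
        if c in CONSONANTS and c == nxt:
            out.append("っ")
            i += 1
            continue
        # Syllabic n: n before a non-vowel, non-y character or at string end.
        if c == "n" and nxt not in VOWELS and nxt != "y":
            out.append("ん")
            i += 1
            continue
        # Longest match first: 3 characters, then 2, then 1.
        for length in (3, 2, 1):
            chunk = s[i:i + length]
            if chunk in T:
                out.append(T[chunk])
                i += len(chunk)
                break
        else:
            # No table entry: pass the character through unchanged.
            out.append(c)
            i += 1
    return "".join(out)


print(tokenize("gakkou"))  # がっこう
print(tokenize("konna"))   # こんな
```

Trying the longest slice first is what keeps "kyo" from being split into "k" + "yo"; the for/else falls through to the pass-through branch only when none of the three slice lengths is in T.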

Reference Data

Romaji	Hiragana	Katakana	Type	Notes
a	あ	ア	Gojūon	Vowel
ka	か	カ	Gojūon	K-row
shi	し	シ	Gojūon	Hepburn: し, not si
chi	ち	チ	Gojūon	Hepburn: ち, not ti
tsu	つ	ツ	Gojūon	Hepburn: つ, not tu
fu	ふ	フ	Gojūon	Hepburn: ふ, not hu
n	ん	ン	Gojūon	Syllabic nasal; standalone before consonant
ga	が	ガ	Dakuten	Voiced K-row
za	ざ	ザ	Dakuten	Voiced S-row
da	だ	ダ	Dakuten	Voiced T-row
ba	ば	バ	Dakuten	Voiced H-row
pa	ぱ	パ	Handakuten	Semi-voiced H-row
kya	きゃ	キャ	Yōon	K-row combo
sha	しゃ	シャ	Yōon	S-row combo (Hepburn)
cha	ちゃ	チャ	Yōon	T-row combo (Hepburn)
nya	にゃ	ニャ	Yōon	N-row combo
hya	ひゃ	ヒャ	Yōon	H-row combo
mya	みゃ	ミャ	Yōon	M-row combo
rya	りゃ	リャ	Yōon	R-row combo
gya	ぎゃ	ギャ	Yōon	Voiced K-row combo
ja	じゃ	ジャ	Yōon	Voiced S-row combo
bya	びゃ	ビャ	Yōon	Voiced H-row combo
pya	ぴゃ	ピャ	Yōon	Semi-voiced H-row combo
kk*	っk*	ッk*	Sokuon	Double consonant → っ/ッ prefix
ss*	っs*	ッs*	Sokuon	Double consonant → っ/ッ prefix
tt*	っt*	ッt*	Sokuon	Double consonant → っ/ッ prefix
pp*	っp*	ッp*	Sokuon	Double consonant → っ/ッ prefix
wo	を	ヲ	Gojūon	Particle を
wi	ゐ	ヰ	Archaic	Historical kana
we	ゑ	ヱ	Archaic	Historical kana
di	ぢ	ヂ	Dakuten	Voiced T-row alternate
du	づ	ヅ	Dakuten	Voiced T-row alternate

Frequently Asked Questions

The tokenizer uses one character of lookahead. When it encounters "n", it checks the next character. If the next character is a vowel (a, i, u, e, o) or "y", it treats "n" as the start of a multi-character mora (e.g., "na" → な). If the next character is a consonant, a space, or the string ends, it emits ん (or ン). To force syllabic n before a vowel, write "n'" with an apostrophe, the standard Modified Hepburn notation (e.g., "shin'ichi" → しんいち, whereas "shinichi" → しにち).
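The lookahead rule can be isolated as a small predicate. The sketch below is an illustration under assumptions: the function name is_syllabic_n and its signature are hypothetical, and it only classifies an "n"; it does not emit kana or consume the apostrophe.

```python
VOWELS = set("aeiou")


def is_syllabic_n(s: str, i: int) -> bool:
    """Return True if the 'n' at s[i] should be emitted on its own as ん/ン."""
    if s[i] != "n":
        return False
    nxt = s[i + 1] if i + 1 < len(s) else ""
    # A vowel or y means the n begins a longer mora (na, nya, ...).
    # Anything else (consonant, space, apostrophe, end of string) makes it syllabic.
    return nxt not in VOWELS and nxt != "y"


print(is_syllabic_n("shinichi", 3))   # False: "ni" is read as に
print(is_syllabic_n("shin'ichi", 3))  # True: the apostrophe forces ん
```

Because the apostrophe is neither a vowel nor "y", it needs no special case in the predicate; a full tokenizer would simply skip it after emitting ん.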

Japanese kanji are logographic. Multiple kanji can share identical readings. The mora "ki" maps to 木 (tree), 気 (spirit), 機 (machine), 記 (record), and over 35 others. This tool returns all common kanji for that reading. Selecting the correct kanji requires sentence context (semantic disambiguation), which is the domain of full Input Method Editors (IME), not a reading-based lookup tool.
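A reading-based lookup of this kind amounts to a dictionary from romaji reading to a list of candidates. The following sketch assumes a hypothetical dictionary K holding only the four "ki" kanji named above; the real entry is much longer, and the function name kanji_candidates is an assumption for illustration.

```python
# Hypothetical excerpt of the kanji dictionary K, keyed by romaji reading.
K = {
    "ki": ["木", "気", "機", "記"],  # tree, spirit, machine, record
}


def kanji_candidates(reading: str) -> list:
    """Return every kanji listed for a reading; no context is consulted."""
    return list(K.get(reading.lower(), []))


print(kanji_candidates("ki"))  # ['木', '気', '機', '記']
print(kanji_candidates("xx"))  # []
```

The function returns all candidates in stored order and makes no attempt to rank them, which mirrors the limitation described above: choosing among them is left to the reader or to an IME.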

The dictionary accepts both Hepburn and Kunrei-shiki spellings. Hepburn "shi" and Kunrei-shiki "si" both map to し/シ. Similarly, "chi"/"ti" → ち, "tsu"/"tu" → つ, "fu"/"hu" → ふ, "ja"/"zya" → じゃ. Hepburn forms are prioritized in the reference table because Hepburn is the romanization most widely used internationally; the system standardized as ISO 3602 is Kunrei-shiki, not Hepburn.
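One simple way to accept both spellings is to normalize Kunrei-shiki tokens to their Hepburn form before the table lookup. The sketch below assumes that design; the converter may instead store both spellings as separate keys. ALIASES, KANA, and lookup are hypothetical names, and only the five pairs listed above are included.

```python
# Kunrei-shiki spelling -> Hepburn spelling (only the pairs named in the text).
ALIASES = {"si": "shi", "ti": "chi", "tu": "tsu", "hu": "fu", "zya": "ja"}

# Hypothetical excerpt of the kana table: (hiragana, katakana) per Hepburn token.
KANA = {
    "shi": ("し", "シ"),
    "chi": ("ち", "チ"),
    "tsu": ("つ", "ツ"),
    "fu": ("ふ", "フ"),
    "ja": ("じゃ", "ジャ"),
}


def lookup(token: str, katakana: bool = False):
    """Resolve a romaji token in either system; return None if unknown."""
    canonical = ALIASES.get(token, token)
    pair = KANA.get(canonical)
    if pair is None:
        return None
    return pair[1] if katakana else pair[0]


print(lookup("si"), lookup("shi"))   # し し
print(lookup("zya", katakana=True))  # ジャ
```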

When the tokenizer detects two identical consecutive consonants (excluding "n"), it emits the sokuon character っ (hiragana) or ッ (katakana) for the first consonant, then processes the second consonant normally as part of the next mora. So "gakkou" becomes が・っ・こ・う (gakkō). The sokuon represents a geminate consonant: a brief pause or glottal stop before the following consonant sound.

Non-alphabetic characters pass through unchanged. Numbers, spaces, punctuation marks, and existing Japanese characters (hiragana, katakana, kanji) are preserved in their original position. Only sequences of Latin letters a-z are tokenized and converted. This allows mixed-script input like "toukyou 2024" to produce "とうきょう 2024" without data loss. Note that long vowels must be spelled out: "tokyo" would tokenize as to + kyo and yield ときょ.
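Pass-through behavior like this can be obtained by converting only the runs of Latin letters and leaving everything between them untouched. The sketch below uses Python's re.sub with a replacement function; convert_latin_runs and the toy converter are hypothetical stand-ins, not the tool's own code.

```python
import re


def convert_latin_runs(text: str, convert) -> str:
    """Apply `convert` to each run of a-z; every other character is kept as-is."""
    return re.sub(r"[a-z]+", lambda m: convert(m.group(0)), text)


# Toy stand-in for the real romaji converter, just enough for the demo.
DEMO = {"toukyou": "とうきょう"}


def toy_convert(run: str) -> str:
    return DEMO.get(run, run)


print(convert_latin_runs("toukyou 2024", toy_convert))    # とうきょう 2024
print(convert_latin_runs("日本 toukyou!", toy_convert))    # 日本 とうきょう!
```

The character class here is exactly a-z, as stated above. Supporting the "n'" notation would require adding the apostrophe to the class so that a word such as "shin'ichi" reaches the converter as a single run.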

The converter handles long vowels through standard romaji doubling: "ou" → おう, "uu" → うう, "oo" → おお. Macron characters (ō, ū, ā) are not natively supported because they are display conventions, not input standards. If you need おう, type "ou". For katakana long vowels written with the chōon mark ー, type a double vowel: "raamen" → ラーメン. The converter resolves these through an explicit mapping for common patterns.
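As an illustration of the katakana case, the sketch below replaces a bare vowel with ー whenever it repeats the vowel of the preceding mora. This is a general rule swapped in for demonstration, not the converter's explicit pattern mapping, and it assumes the input has already been split into morae. KATAKANA is a hypothetical table excerpt.

```python
VOWELS = set("aeiou")

# Hypothetical excerpt of the katakana table.
KATAKANA = {
    "a": "ア", "i": "イ", "o": "オ",
    "ra": "ラ", "me": "メ", "ko": "コ", "hi": "ヒ", "n": "ン",
}


def katakana_with_choon(morae: list) -> str:
    """Join non-empty romaji morae as katakana, using ー for a repeated vowel."""
    out = []
    prev_vowel = ""
    for m in morae:
        if m in VOWELS and m == prev_vowel:
            out.append("ー")  # same vowel as the previous mora: lengthen it
        else:
            out.append(KATAKANA[m])
        # Track the vowel that ends this mora ("" for ン).
        prev_vowel = m[-1] if m[-1] in VOWELS else ""
    return "".join(out)


print(katakana_with_choon(["ra", "a", "me", "n"]))  # ラーメン
print(katakana_with_choon(["ko", "o", "hi", "i"]))  # コーヒー
```

Only identical doubled vowels are collapsed here; a sequence such as "ou" is left as two separate kana.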