About

DOCX files store content as compressed XML (OOXML, ISO/IEC 29500). Converting them to Markdown requires parsing the ZIP archive, resolving XML namespaces, and mapping WordprocessingML elements to plain-text syntax. Getting this wrong produces broken formatting: lost headings, collapsed lists, stripped hyperlinks. This tool performs the ZIP decompression and XML tree traversal entirely client-side; no file leaves your browser. It handles w:pStyle mappings for headings (levels 1 to 6), nested w:numPr structures for ordered and unordered lists, w:tbl grids for pipe tables, and inline run properties for bold, italic, strikethrough, and code spans.

Limitations: embedded images are extracted as base64 data URIs, which inflates output size. Complex layouts (text boxes, SmartArt, equations via OMML) are approximated as plain text. Footnotes and endnotes are appended at the document end. The parser assumes well-formed OOXML; files produced by non-Microsoft editors (LibreOffice, Google Docs export) may have non-standard namespace prefixes that this tool normalizes.

Formulas

DOCX files conform to the Office Open XML (OOXML) standard. The file is a ZIP archive. The conversion pipeline follows a deterministic sequence:

parse(file) → unzip(buffer) → extractXML(entries) → walkTree(dom) → markdown

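Each entry begins with a local file header, identified by the signature 0x04034b50 (the bytes "PK\x03\x04" read as a little-endian 32-bit integer). The header has a fixed-size part followed by the variable-length filename and extra field, so the entry's data starts at:

dataOffset = headerOffset + headerSize + filenameLen + extraLen

Where headerOffset = byte position of the signature, headerSize = 30 (the fixed part of the local file header), and filenameLen and extraLen = the two 16-bit length fields stored at header offsets 26 and 28.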

Each compressed entry uses DEFLATE (method 8). Stored entries (method 0) are read directly. The browser's DecompressionStream("deflate-raw") handles inflation natively without external libraries.
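The header arithmetic above can be sketched as follows. This is a minimal illustration, not the tool's actual source: the function names readLocalHeader and inflateRaw are hypothetical, and the sketch reads sizes from the local header only. Archives that set the data-descriptor flag (general-purpose bit 3) may store zero there, so a robust parser would take sizes from the central directory instead.

```javascript
// Signature "PK\x03\x04" read as a little-endian uint32.
const LOCAL_HEADER_SIGNATURE = 0x04034b50;
// Fixed part of the local file header, before filename and extra field.
const LOCAL_HEADER_FIXED_SIZE = 30;

// Reads one local file header starting at headerOffset and reports where
// the entry's data begins. `bytes` is a Uint8Array over the whole archive.
function readLocalHeader(bytes, headerOffset) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (view.getUint32(headerOffset, true) !== LOCAL_HEADER_SIGNATURE) {
    throw new Error("Not a ZIP local file header at offset " + headerOffset);
  }
  const method = view.getUint16(headerOffset + 8, true); // 0 = stored, 8 = DEFLATE
  const compressedSize = view.getUint32(headerOffset + 18, true);
  const filenameLen = view.getUint16(headerOffset + 26, true);
  const extraLen = view.getUint16(headerOffset + 28, true);
  const nameStart = headerOffset + LOCAL_HEADER_FIXED_SIZE;
  const filename = new TextDecoder().decode(
    bytes.subarray(nameStart, nameStart + filenameLen)
  );
  const dataOffset = nameStart + filenameLen + extraLen;
  return { filename, method, compressedSize, dataOffset };
}

// Inflates a method-8 entry with the native stream API, in environments
// that support DecompressionStream("deflate-raw").
async function inflateRaw(compressed) {
  const stream = new Blob([compressed])
    .stream()
    .pipeThrough(new DecompressionStream("deflate-raw"));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}
```

For a stored entry (method 0), the payload is simply bytes.subarray(dataOffset, dataOffset + compressedSize); for method 8 the same slice would be passed to inflateRaw.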

Heading level mapping:

level = parseInt(styleName.match(/Heading(\d)/)[1])

prefix = "#".repeat(clamp(level, 1, 6))
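The two formulas above fail on non-heading styles, because match() returns null when the pattern is absent. A guarded version might look like the sketch below. The helper names clamp and headingPrefix are illustrative, and the anchored, case-insensitive pattern (which also accepts "heading 1") is an assumption based on the style-name normalization described in the FAQ.

```javascript
// Restricts value to the inclusive range [min, max].
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Returns the Markdown heading prefix for a w:pStyle value, or null when
// the style is not a heading style.
function headingPrefix(styleName) {
  const match = /^heading\s*(\d)$/i.exec(styleName ?? "");
  if (!match) return null;
  const level = parseInt(match[1], 10);
  return "#".repeat(clamp(level, 1, 6));
}
```

Returning null rather than an empty string lets the caller distinguish "not a heading" from a heading and fall through to the other paragraph mappings.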

List indentation depth:

indent = "  ".repeat(ilvl)

Where ilvl is the indent level from the w:ilvl attribute (0-based); styleName is the value of the w:pStyle w:val attribute; buffer is the raw ArrayBuffer of the uploaded DOCX file; entries is a map of filename → decompressed Uint8Array; and dom is the parsed XML document from DOMParser.
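As an illustration of the list indentation rule, the sketch below renders a single list item using two spaces per level. The function name listItemLine and its ordered and counter parameters are hypothetical; in practice the caller would derive them from the numbering definitions.

```javascript
// Renders one Markdown list item.
// ilvl:    0-based nesting depth from the w:ilvl attribute
// ordered: true for numbered lists, false for bullets (assumed input)
// counter: current item number, used only for ordered lists (assumed input)
function listItemLine(text, ilvl, ordered = false, counter = 1) {
  const indent = "  ".repeat(ilvl); // two spaces per nesting level
  const marker = ordered ? `${counter}.` : "-";
  return `${indent}${marker} ${text}`;
}
```

Some Markdown renderers expect deeper indentation under ordered parents, so nested output may be worth checking in the target renderer.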

Reference Data

OOXML Element	XML Path	Markdown Output	Notes
Heading 1	w:pStyle val="Heading1"	`# Text`	Mapped via pStyle name matching
Heading 2	w:pStyle val="Heading2"	`## Text`	Levels 1 - 6 supported
Bold	w:b / w:b val="true"	`**text**`	Handles toggle and explicit values
Italic	w:i / w:i val="true"	`*text*`	Combined with bold: `***text***`
Strikethrough	w:strike	`~~text~~`	GFM extension
Hyperlink	w:hyperlink r:id	`[text](url)`	Resolved via .rels file
Unordered List	w:numPr + bullet numFmt	`- item`	Indent via w:ilvl
Ordered List	w:numPr + decimal numFmt	`1. item`	Counter resets per list
Table	w:tbl → w:tr → w:tc	Pipe table `| a | b |`	Alignment from w:jc
Code (Inline)	w:rFonts monospace family	`` `code` ``	Courier, Consolas, monospace detection
Block Quote	w:pStyle val="Quote"	`> text`	Also IntenseQuote
Horizontal Rule	w:pBdr bottom border only	`---`	Paragraph border detection
Line Break	w:br	Two trailing spaces + newline	Soft break within paragraph
Page Break	w:br type="page"	`---`	Converted to thematic break
Image	w:drawing → a:blip r:embed	`![alt](data:...)`	Base64 embedded or downloadable
Footnote Ref	w:footnoteReference	`[^1]`	Collected and appended at end
Superscript	w:vertAlign val="superscript"	`<sup>text</sup>`	HTML fallback in Markdown
Subscript	w:vertAlign val="subscript"	`<sub>text</sub>`	HTML fallback in Markdown
Underline	w:u	`<u>text</u>`	No native MD; HTML used
Highlight/Color	w:highlight	Plain text (stripped)	Color info discarded
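
The run-level rows of the table can be sketched as a single formatting function. This is an illustrative reading of the table, not the converter's actual code: the names isToggleOn and formatRun and the shape of the props object are assumptions, and the sketch uses standard Markdown emphasis markers with HTML fallbacks where Markdown has no native syntax.

```javascript
// Interprets an OOXML on/off attribute. A toggle element with no w:val
// (e.g. <w:b/>) is on; "false", "0", and "off" switch it off.
function isToggleOn(val) {
  if (val === null || val === undefined) return true;
  return !["false", "0", "off"].includes(val);
}

// Wraps the text of one run according to its resolved properties.
// props is a hypothetical plain object, e.g. { bold: true, strike: true }.
function formatRun(text, props = {}) {
  if (text === "") return "";
  // Code spans are emitted verbatim; emphasis inside backticks is literal.
  if (props.code) return "`" + text + "`";
  let out = text;
  if (props.bold && props.italic) out = `***${out}***`;
  else if (props.bold) out = `**${out}**`;
  else if (props.italic) out = `*${out}*`;
  if (props.strike) out = `~~${out}~~`;
  if (props.underline) out = `<u>${out}</u>`;
  if (props.vertAlign === "superscript") out = `<sup>${out}</sup>`;
  if (props.vertAlign === "subscript") out = `<sub>${out}</sub>`;
  return out; // highlight and color are intentionally ignored
}
```

A caller would typically build props from the run's w:rPr children, for example by setting bold to isToggleOn(value of w:val) whenever a w:b element is present.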

Frequently Asked Questions

No file is uploaded. All processing happens in your browser using the File API, native ZIP decompression (DecompressionStream), and DOMParser. Your DOCX never leaves your device, and no network requests are made during conversion.

Markdown has no native image embedding format beyond URL references. Since the images exist only inside the DOCX ZIP archive, they are extracted and encoded as base64 data URIs (e.g., data:image/png;base64,...). For large documents with many images, consider downloading the Markdown file and using a post-processor to extract images into separate files.
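
The encoding step can be sketched as below. The function names toDataUri and imageMarkdown are hypothetical, and the MIME type is assumed to be known already (for example, from the file extension of the media entry). Base64 encodes every 3 bytes as 4 characters, which is why the output grows by roughly a third.

```javascript
// Encodes raw image bytes (a Uint8Array) as a base64 data URI.
function toDataUri(bytes, mimeType) {
  let binary = "";
  const chunkSize = 0x8000; // convert in chunks to keep argument lists small
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Builds the Markdown image reference for an embedded image.
function imageMarkdown(alt, bytes, mimeType) {
  return `![${alt}](${toDataUri(bytes, mimeType)})`;
}
```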

The converter parses word/numbering.xml to determine list type (bullet vs. decimal) per numId and ilvl combination. Nested levels are indented with two spaces per level. If numbering.xml is missing or malformed (common in Google Docs exports), the converter falls back to unordered list syntax (- item) for all list items.
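
The fallback behaviour might be sketched like this. The name resolveListKind and the lookup structure are assumptions: numbering is taken to be a Map keyed by "numId:ilvl" with the numFmt string as its value, built beforehand from word/numbering.xml, or null when that part is missing or unparseable. Treating only "decimal" as ordered follows the bullet-versus-decimal distinction described above.

```javascript
// Decides whether a list item is ordered or unordered.
// numbering: Map of "numId:ilvl" -> numFmt (hypothetical structure), or
//            null/undefined when numbering.xml could not be read.
function resolveListKind(numbering, numId, ilvl) {
  const numFmt =
    numbering instanceof Map ? numbering.get(`${numId}:${ilvl}`) : undefined;
  // Anything that is not explicitly decimal falls back to a bullet.
  return numFmt === "decimal" ? "ordered" : "unordered";
}
```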

Markdown pipe tables do not support cell spanning (colspan/rowspan). Merged cells detected via w:gridSpan or w:vMerge are expanded into individual cells with duplicated content. The converter adds an HTML comment to flag these for manual review. For complex table layouts, consider using the HTML table output option instead.
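
A minimal sketch of the expansion follows. The function name renderTable, the input shape (an array of rows, each an array of { text, gridSpan, vMerge } objects), and the exact wording of the review comment are all assumptions made for illustration.

```javascript
// Renders rows of cells as a pipe table, duplicating merged content.
// gridSpan: number of columns a cell covers (default 1)
// vMerge:   "continue" marks a cell that continues the cell above it
function renderTable(rows) {
  let merged = false;
  const grid = [];
  for (const row of rows) {
    const cells = [];
    for (const cell of row) {
      const span = cell.gridSpan ?? 1;
      let text = cell.text ?? "";
      if (cell.vMerge === "continue") {
        // Copy the content of the cell in the same column one row up.
        const above = grid[grid.length - 1];
        text = above ? above[cells.length] ?? "" : "";
        merged = true;
      }
      if (span > 1) merged = true;
      for (let i = 0; i < span; i++) cells.push(text);
    }
    grid.push(cells);
  }
  if (grid.length === 0) return "";
  // Escape literal pipes at render time so copied cells are escaped once.
  const line = (cells) =>
    `| ${cells.map((c) => c.replace(/\|/g, "\\|")).join(" | ")} |`;
  const out = [
    line(grid[0]),
    line(grid[0].map(() => "---")),
    ...grid.slice(1).map(line),
  ];
  if (merged) out.push("<!-- merged cells expanded; review manually -->");
  return out.join("\n");
}
```

The first row is treated as the header because pipe tables require one; a document table without a header row would still render, just with its first row promoted.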

Files from LibreOffice and Google Docs are supported, with caveats. These applications sometimes use non-standard style names (e.g., "heading 1" instead of "Heading1") or omit the word/ prefix in relationship targets. The converter normalizes style names via case-insensitive matching and resolves relative paths in .rels files. However, proprietary extensions specific to LibreOffice (e.g., lo:custom-style) are ignored.

The converter processes files up to 50 MB. Larger files may cause memory pressure in the browser tab: the ZIP parser operates in a streaming fashion where possible, but the full decompressed XML must still fit in memory. A typical 50 MB DOCX contains roughly 200-300 pages with embedded images.

Footnote references in the body text are converted to Markdown footnote syntax [^N] where N is the footnote number. The actual footnote content is parsed from word/footnotes.xml and appended at the end of the document as [^N]: content. Endnotes from word/endnotes.xml follow the same pattern. Self-referential or nested footnotes are flattened to a single level.
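
The append step could look like the sketch below. The name appendFootnotes and the input shape (a Map from footnote number to its plain-text content, already extracted from word/footnotes.xml) are assumptions; collapsing internal whitespace so each definition stays on one line is a simplification chosen for this example.

```javascript
// Appends footnote definitions to the converted body.
// footnotes: Map of footnote number -> content string (assumed input)
function appendFootnotes(body, footnotes) {
  if (footnotes.size === 0) return body;
  const definitions = [...footnotes.entries()]
    .sort(([a], [b]) => a - b) // emit in numeric order
    .map(([n, content]) => `[^${n}]: ${content.replace(/\s+/g, " ").trim()}`);
  return `${body.trimEnd()}\n\n${definitions.join("\n")}\n`;
}
```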