About

WordPress WXR (WordPress eXtended RSS) export files use a namespaced XML schema in which each item node carries its content in a content:encoded CDATA block. Migrating to Octopress (a Jekyll-based framework) requires each post to become a discrete Markdown file named YYYY-MM-DD-slug.markdown, with YAML front matter specifying the layout, title, date, categories, and comments fields. A wrong date format breaks Jekyll's build, and unescaped YAML colons in titles cause silent parse failures that surface only at deploy time.

This tool parses the raw WXR XML using the browser's native DOMParser, walks each item node to extract the WordPress-specific namespaced elements, converts the embedded HTML content to clean Markdown via recursive DOM traversal, and packages the result as downloadable files. It handles draft vs. published status, preserves category and tag taxonomies, and strips WordPress shortcodes that have no Octopress equivalent. The conversion assumes UTF-8 encoding and the standard WXR 1.2 schema. Custom post types and attachments are excluded by default.
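The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: the function names `childText` and `extractPosts` and the shape of the returned objects are assumptions. It relies on the fact that, for XML documents, `getElementsByTagName` matches on the qualified name, so prefixed names such as `wp:post_type` can be looked up directly.

```javascript
// Hypothetical helper: text of the first descendant with the given
// qualified name, or "" when the element is absent.
function childText(item, qualifiedName) {
  const found = item.getElementsByTagName(qualifiedName)[0];
  return found ? found.textContent : "";
}

// Hypothetical extractor: keep only items whose wp:post_type is "post"
// and pull out the fields the later conversion steps need.
function extractPosts(doc) {
  return Array.from(doc.getElementsByTagName("item"))
    .filter((item) => childText(item, "wp:post_type") === "post")
    .map((item) => ({
      title: childText(item, "title"),
      postDate: childText(item, "wp:post_date"),
      postName: childText(item, "wp:post_name"),
      status: childText(item, "wp:status"),
      html: childText(item, "content:encoded"),
    }));
}

// Browser usage (DOMParser is not available in plain Node.js):
//   const doc = new DOMParser().parseFromString(xmlText, "application/xml");
//   const posts = extractPosts(doc);
```

Because `extractPosts` only touches `getElementsByTagName` and `textContent`, it can be exercised outside a browser with any object that provides those two members.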

Formulas

The conversion pipeline follows a deterministic transformation sequence from WXR XML to Octopress-compatible Markdown files.

Parse(WXR) → Filter(post_type = post) → Extract(metadata + content) → HTMLtoMD(content) → YAML(front matter) → File(date-slug.markdown)

The filename generation follows the Octopress convention:

filename = format(post_date, YYYY-MM-DD) + "-" + slugify(post_name) + ".markdown"
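As a rough sketch of this filename rule, the following JavaScript derives the name from the raw wp:post_date and wp:post_name strings. The function names `slugify` and `octopressFilename` are illustrative, and the exact normalization (collapsing every run of other characters into a single hyphen) is an assumption about what "lowercase alphanumeric + hyphen" means in practice.

```javascript
// Hypothetical slug normalizer: lowercase alphanumerics separated by hyphens.
function slugify(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse any other run of characters into one hyphen
    .replace(/^-+|-+$/g, "");    // trim leading and trailing hyphens
}

// Build "YYYY-MM-DD-slug.markdown" from wp:post_date and wp:post_name values.
function octopressFilename(postDate, postName) {
  const datePart = postDate.slice(0, 10); // wp:post_date begins with YYYY-MM-DD
  if (!/^\d{4}-\d{2}-\d{2}$/.test(datePart)) {
    throw new Error("Unexpected wp:post_date value: " + postDate);
  }
  return datePart + "-" + slugify(postName) + ".markdown";
}
```

For example, `octopressFilename("2013-05-04 09:30:15", "Hello_World!")` yields `2013-05-04-hello-world.markdown`.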

YAML title escaping rule prevents parser breakage:

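\[
\text{safe\_title} =
\begin{cases}
\texttt{"} + \text{title} + \texttt{"} & \text{if title contains } \texttt{:} \text{ or } \texttt{\#} \text{ or } \texttt{"} \\
\text{title} & \text{otherwise}
\end{cases}
\]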

Where WXR = WordPress eXtended RSS export format (XML), post_type = WordPress content type discriminator, post_date = publication timestamp from wp:post_date element, post_name = URL slug from wp:post_name element, slugify = lowercase alphanumeric + hyphen normalization function, HTMLtoMD = recursive DOM tree walker converting HTML nodes to Markdown syntax.
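The title escaping rule can be illustrated with a short JavaScript sketch. The name `safeTitle` is hypothetical, and the backslash escaping of internal quotes is an assumption added so the quoted scalar stays valid YAML; the rule above only states when quoting is triggered.

```javascript
// Hypothetical implementation of the safe_title rule.
function safeTitle(title) {
  // Quote only when the title contains ":", "#", or a double quote.
  if (!/[:#"]/.test(title)) return title;
  // Assumption: escape backslashes first, then double quotes,
  // so the result is a valid YAML double-quoted scalar.
  const escaped = title.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return '"' + escaped + '"';
}
```

With this sketch, `Node.js: A Guide` becomes `"Node.js: A Guide"`, while a title without any of the trigger characters is returned unchanged.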

Reference Data

WordPress WXR Element	Octopress Front Matter	Notes
title	title	Wrapped in quotes if contains colons
wp:post_date	date	Format: YYYY-MM-DD HH:MM
content:encoded	Body (below ---)	HTML converted to Markdown
excerpt:encoded	description	Plain text, truncated to 160 chars
wp:status = publish	published: true	Drafts set to false
wp:status = draft	published: false	File still generated
category domain="category"	categories	YAML array format
category domain="post_tag"	tags (custom)	Optional inclusion
wp:post_name	Filename slug	Sanitized to lowercase alphanumeric + hyphens
wp:post_type = post	Included	Only post type processed
wp:post_type = page	Optional	Toggled via checkbox
wp:post_type = attachment	Excluded	Media files not downloadable from XML
wp:comment	Excluded	Octopress uses Disqus; comments not migrated
wp:post_id	Not mapped	WordPress internal ID discarded
dc:creator	author	Included in front matter
link	Not mapped	Original URL preserved in comment only
wp:meta_key	Excluded	Custom fields not portable
HTML <a>	Markdown [text](url)	Relative URLs preserved
HTML <img>	Markdown ![alt](src)	WordPress media URLs kept as-is
HTML <pre><code>	Fenced code block ```	Language detection not attempted
HTML <blockquote>	Markdown > prefix	Nested blockquotes supported
WordPress [shortcode]	Stripped	No Octopress equivalent; logged in console
HTML <table>	Raw HTML preserved	Markdown tables unreliable for complex layouts
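To make the mapping in this table concrete, here is a minimal sketch of how a front matter block might be assembled from one extracted post. The function name `buildFrontMatter`, the shape of the `post` object, and the fixed values `layout: post` and `comments: true` are assumptions for illustration. For simplicity it quotes every string with `JSON.stringify` (a JSON string is also a valid YAML double-quoted scalar) instead of quoting titles only when needed.

```javascript
// Hypothetical front matter builder for one post record.
function buildFrontMatter(post) {
  const categories = post.categories.map((c) => JSON.stringify(c)).join(", ");
  const lines = [
    "---",
    "layout: post",                                   // assumed default layout
    "title: " + JSON.stringify(post.title),           // always quoted in this sketch
    "date: " + post.postDate.slice(0, 16),            // YYYY-MM-DD HH:MM, seconds dropped
    "comments: true",                                 // assumed default
    "published: " + (post.status === "publish"),      // drafts become false
    "author: " + JSON.stringify(post.creator),
    "categories: [" + categories + "]",               // YAML array format
    "---",
  ];
  return lines.join("\n") + "\n";
}
```

The Markdown body would then be appended directly after the closing `---` line.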

Frequently Asked Questions

WordPress shortcodes have no equivalent in Octopress or Jekyll. The converter strips all content matching the pattern [shortcode attr="val"]...[/shortcode] and logs each removed shortcode to the conversion report. If a shortcode wraps meaningful content (like [caption] around an image), the inner content is preserved but the shortcode wrapper is removed. Review the conversion log to identify posts requiring manual shortcode replacement.
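One possible way to strip shortcode wrappers while keeping their inner content is sketched below. The function name `stripShortcodes` and the idea of passing a list of known shortcode names are assumptions; the list is there so ordinary bracketed text such as [sic] is left alone. Unlike the behavior described above, this simplified version always keeps the inner content.

```javascript
// Hypothetical shortcode stripper: removes opening and closing tags for
// the given shortcode names and reports which shortcodes were removed.
function stripShortcodes(text, knownNames) {
  const known = new Set(knownNames);
  const removed = [];
  // Matches [name], [name attr="val"], and [/name].
  const tagPattern = /\[(\/?)([A-Za-z][\w-]*)(?:\s[^\]]*)?\]/g;
  const cleaned = text.replace(tagPattern, (match, slash, name) => {
    if (!known.has(name)) return match;   // not a known shortcode: keep as-is
    if (slash === "") removed.push(name); // record each shortcode once, at its opening tag
    return "";
  });
  return { cleaned, removed };
}
```

The `removed` array could feed whatever log or report the conversion produces.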

YAML front matter treats colons as key-value delimiters. A title like Node.js: A Guide would break parsing if left unquoted. The converter detects titles containing colons (:), hash symbols (#), square brackets, or existing quotes and wraps them in double quotes with internal quotes escaped. This follows the YAML 1.2 specification for scalar quoting.

The converter processes the XML on the browser's main thread using DOMParser, which loads the entire document into memory. For exports exceeding approximately 100 MB, browser memory limits may cause failures. Large exports are often split into multiple files (WP-CLI's wp export command, for example, splits at 15 MB by default). Process each split file individually. The converter displays a file size warning above 50 MB.

Yes. Draft posts are converted with published: false in their YAML front matter. Octopress and Jekyll will not render these posts during build unless you pass the --unpublished flag. You can toggle draft inclusion off in the converter settings to exclude them entirely from the output.

Image tags (<img>) are converted to Markdown image syntax ![alt](src). The src URL is preserved as-is from the WordPress export, meaning it still points to your original WordPress media uploads directory (typically /wp-content/uploads/YYYY/MM/). If your WordPress site is offline, these links will break. You must manually download and re-host images, then find-and-replace the URLs in the generated files.
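The find-and-replace step could be scripted along these lines. This is a sketch only: the function name `rehostImages` and both base URLs in the example are hypothetical, and it handles only the plain `![alt](src)` form described above (no optional title after the URL).

```javascript
// Hypothetical helper: rewrite image URLs that start with oldBase so
// they point at newBase instead. Other links and images are untouched.
function rehostImages(markdown, oldBase, newBase) {
  const imagePattern = /!\[([^\]]*)\]\(([^)\s]+)\)/g; // ![alt](url)
  return markdown.replace(imagePattern, (match, alt, url) => {
    if (!url.startsWith(oldBase)) return match;
    return "![" + alt + "](" + newBase + url.slice(oldBase.length) + ")";
  });
}
```

For instance, with `oldBase` set to `https://old.example.com/wp-content/uploads/` and `newBase` set to `/images/`, the image `![Logo](https://old.example.com/wp-content/uploads/2013/05/logo.png)` becomes `![Logo](/images/2013/05/logo.png)`.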

The converter uses the wp:post_date element, which stores local time without a timezone offset, formatted as YYYY-MM-DD HH:MM:SS. The output front matter uses the YYYY-MM-DD HH:MM format (seconds dropped). If your WordPress installation was configured for a non-UTC timezone, the times reflect that local timezone. Octopress treats dates as local time by default. No timezone conversion is applied.
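A simple way to get this behavior is to treat the timestamp as a plain string, as in the sketch below. The function name `frontMatterDate` is hypothetical. Avoiding the `Date` object is a deliberate choice here: constructing a `Date` would interpret the value in the browser's own timezone, which is exactly the conversion the text says is not applied.

```javascript
// Hypothetical formatter: "YYYY-MM-DD HH:MM:SS" -> "YYYY-MM-DD HH:MM".
// Pure string handling, so no timezone conversion can occur.
function frontMatterDate(wpPostDate) {
  const m = /^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}):\d{2}$/.exec(wpPostDate.trim());
  if (!m) {
    throw new Error("Unexpected wp:post_date value: " + wpPostDate);
  }
  return m[1] + " " + m[2];
}
```

Rejecting unexpected input early keeps a malformed date from reaching the front matter, where it would break the build later.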

No. Page builders (Elementor, WPBakery, Divi) inject deeply nested <div> structures with proprietary CSS classes. The converter strips all <div> wrappers and extracts only semantic content elements (<p>, <a>, <img>, headings, lists, code blocks). Posts built entirely with page builders will produce structurally correct but visually simplified Markdown. Manual review is recommended for such posts.
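The wrapper-stripping behavior falls out naturally from a recursive walk in which unknown elements contribute only their children. The sketch below covers a handful of tags and is an assumption about the general approach, not the converter's actual code; the name `toMarkdown` is hypothetical. It uses only standard DOM members (`nodeType`, `nodeName`, `nodeValue`, `childNodes`, `getAttribute`).

```javascript
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical recursive walker: converts a DOM subtree to Markdown.
function toMarkdown(node) {
  if (node.nodeType === TEXT_NODE) return node.nodeValue;
  if (node.nodeType !== ELEMENT_NODE) return ""; // skip comments and the like

  // Convert children first, then wrap the result according to the tag.
  const inner = Array.from(node.childNodes).map(toMarkdown).join("");
  const tag = node.nodeName.toLowerCase();

  const heading = /^h([1-6])$/.exec(tag);
  if (heading) return "#".repeat(Number(heading[1])) + " " + inner.trim() + "\n\n";

  switch (tag) {
    case "p":
      return inner.trim() + "\n\n";
    case "a":
      return "[" + inner + "](" + node.getAttribute("href") + ")";
    case "img":
      return "![" + (node.getAttribute("alt") || "") + "](" + node.getAttribute("src") + ")";
    case "strong":
    case "b":
      return "**" + inner + "**";
    case "em":
    case "i":
      return "*" + inner + "*";
    default:
      return inner; // div, span, and other wrappers vanish; their content stays
  }
}
```

In a browser, the input could come from `new DOMParser().parseFromString(html, "text/html").body`. Nested page-builder `<div>` wrappers then disappear without any special handling, because they hit the default branch.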