Category HTML/XML Utilities

Options: Preserve link URLs, Show image alt text, Uppercase headings, Mark bold/italic (*/_), Auto-convert on input

About

HTML markup carries semantic weight that vanishes when tags are removed carelessly. A naive regex strip (s/<[^>]*>//g) destroys paragraph boundaries, collapses table data into unreadable streams, and leaves encoded entities like &amp; as literal text. This converter instead walks the parsed DOM tree node by node, applying block-level line-break rules for elements like <p>, <div>, and <li> while preserving inline flow for <span> and <a>. Entities are decoded natively by the browser's own parser. The result is structurally faithful plain text suitable for email bodies, CMS migrations, accessibility audits, or NLP preprocessing.

Limitation: this tool approximates visual layout. Complex CSS-driven layouts (flexbox reordering, display:none content) cannot be inferred from markup alone. Table output uses tab separation, which works for simple grids but may misalign in deeply nested structures. For best results, supply clean semantic HTML.

Formulas

The conversion is not a mathematical formula but a deterministic tree-walking algorithm. The core logic can be expressed as a recursive function T(node) that maps each DOM node to a plain text string:

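$$
T(n) =
\begin{cases}
n.\mathrm{textContent} & \text{if } n \text{ is a TEXT\_NODE} \\
\texttt{""} & \text{if } n \in \{\mathrm{SCRIPT}, \mathrm{STYLE}, \mathrm{HEAD}\} \\
\texttt{"\textbackslash n"} & \text{if } n = \mathrm{BR} \\
\texttt{"---\textbackslash n"} & \text{if } n = \mathrm{HR} \\
\mathrm{prefix} + \mathrm{join}(T(\mathrm{child})) + \mathrm{suffix} & \text{if } n \text{ is an ELEMENT}
\end{cases}
$$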

Where prefix and suffix are determined by the element's display category. For block elements: suffix = "\n" (or "\n\n" for paragraph-like elements). For inline elements: prefix = suffix = "". A final normalization pass collapses runs of 3+ consecutive newlines down to 2, and trims trailing whitespace per line.
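The recursion and the normalization pass can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual code: the tool runs in the browser on a real DOM, whereas the `Node` class, the `T` and `normalize` functions, and the tag sets below are hypothetical stand-ins, and list, table, and link handling are left out. Prefixes are always empty here, and trailing whitespace is trimmed before newline runs are collapsed so that whitespace-only lines do not hide a run.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical tag sets for illustration; the real tool covers more elements.
SKIPPED = {"script", "style", "head"}
PARAGRAPH_LIKE = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK = {"div", "li"}


@dataclass
class Node:
    """Tiny stand-in for a DOM node. An empty tag marks a text node."""
    tag: str = ""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def T(n: Node) -> str:
    """Map a node to plain text, following the piecewise definition above."""
    if n.tag == "":
        return n.text
    if n.tag in SKIPPED:
        return ""
    if n.tag == "br":
        return "\n"
    if n.tag == "hr":
        return "---\n"
    body = "".join(T(child) for child in n.children)
    if n.tag in PARAGRAPH_LIKE:
        return body + "\n\n"
    if n.tag in BLOCK:
        return body + "\n"
    return body  # inline element: no prefix, no suffix


def normalize(text: str) -> str:
    """Trim trailing spaces/tabs per line, then collapse 3+ newlines to 2."""
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return re.sub(r"\n{3,}", "\n\n", text)


sample = Node("div", children=[
    Node("p", children=[
        Node(text="Hello "),
        Node("span", children=[Node(text="world")]),
    ]),
    Node("script", children=[Node(text="alert(1)")]),
    Node("hr"),
    Node("p", children=[Node(text="Bye")]),
])

print(repr(normalize(T(sample))))  # 'Hello world\n\n---\nBye\n\n'
```

The closing `<div>` adds a third newline after the last paragraph, which is exactly the kind of run the normalization pass collapses.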

Reference Data

HTML Element	Category	Plain Text Behavior	Output Example
<p>	Block	Content + double newline after	Text\n\n
<div>	Block	Content + newline after	Text\n
<br>	Void	Single newline	\n
<h1> - <h6>	Block	Content uppercased (optional) + double newline	HEADING\n\n
<li>	Block	Bullet prefix + content + newline	• Item\n
<ol> <li>	Block	Numbered prefix + content + newline	1. Item\n
<a>	Inline	Text [href] (if option enabled)	Click [https://...]
<img>	Void	Alt text in brackets (if option enabled)	[Photo of cat]
<table>	Block	Tab-separated columns, newline per row	A\tB\nC\tD
<hr>	Void	Horizontal separator line	---\n
<blockquote>	Block	Indented with > prefix	> Quote text
<pre> / <code>	Block/Inline	Whitespace preserved exactly	code as-is
<span>	Inline	Content only, no breaks	text
<strong> / <b>	Inline	Content only (or *wrapped* if option)	*bold*
<em> / <i>	Inline	Content only (or _wrapped_ if option)	_italic_
<script>	Meta	Stripped entirely	(nothing)
<style>	Meta	Stripped entirely	(nothing)
<head>	Meta	Stripped entirely	(nothing)
<!-- -->	Comment	Stripped	(nothing)
&amp;	Entity	Decoded to &	&
&nbsp;	Entity	Decoded to space	(space)
&lt; / &gt;	Entity	Decoded to < / >	< / >

Frequently Asked Questions

A regex like s/<[^>]*>//g operates on raw text and has no understanding of document structure. It merges the end of one paragraph and the start of the next into one continuous string with no whitespace between them. It cannot decode HTML entities (&amp; stays as &amp;). It also fails on malformed HTML and on edge cases such as angle brackets inside attribute values and CDATA sections. This tool parses the HTML into a real DOM tree using the browser's native DOMParser, then walks the tree with full knowledge of element semantics.
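The failure modes are easy to reproduce. The snippet below is a small demonstration using Python's `re` module as a stand-in for the sed-style expression; `naive_strip` is a hypothetical helper name, not part of the tool.

```python
import re


def naive_strip(markup: str) -> str:
    """Remove anything that looks like a tag, with no structural awareness."""
    return re.sub(r"<[^>]*>", "", markup)


# Paragraph boundary lost and the entity left encoded.
print(naive_strip("<p>Fish &amp; chips</p><p>Tea</p>"))  # Fish &amp; chipsTea

# A '>' inside an attribute value ends the match too early.
print(repr(naive_strip('<a title="a > b">x</a>')))  # ' b">x'
```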

No, script and style contents never appear in the output. The converter explicitly skips SCRIPT, STYLE, HEAD, NOSCRIPT, and TEMPLATE elements, and their entire subtrees are excluded. This prevents JavaScript code or CSS rules from polluting your plain text.
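One way to exclude whole subtrees is to track a skip depth while traversing. The sketch below uses Python's event-based `html.parser` rather than the browser DOM the tool relies on, so it is an analogy under that assumption; `VisibleText` and `visible_text` are hypothetical names.

```python
from html.parser import HTMLParser

SKIPPED = {"script", "style", "head", "noscript", "template"}


class VisibleText(HTMLParser):
    """Collect text, ignoring everything nested inside a skipped element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def visible_text(markup: str) -> str:
    parser = VisibleText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


page = ("<head><title>T</title></head><p>Hi</p>"
        "<script>var a = '<b>';</script><style>p{}</style>")
print(visible_text(page))  # Hi
```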

Each table row becomes one line of output. Cells within a row are separated by tab characters (\t). This produces output that can be pasted directly into a spreadsheet application. For deeply nested tables or complex colspan/rowspan layouts, the output approximates the visual structure but may not perfectly align.
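The row and cell rule can be illustrated with a short Python sketch. It assumes a flat table with no nesting and ignores colspan/rowspan; `TableToTsv` and `table_to_text` are hypothetical names, and Python's `html.parser` stands in for the browser DOM.

```python
from html.parser import HTMLParser


class TableToTsv(HTMLParser):
    """Collect one tab-separated line per <tr>. Nested tables are not handled."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell stays on one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append("\t".join(self._row))
            self._row = None


def table_to_text(markup: str) -> str:
    parser = TableToTsv()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.rows)


grid = "<table><tr><td>A</td><td>B</td></tr><tr><td>C</td><td>D</td></tr></table>"
print(repr(table_to_text(grid)))  # 'A\tB\nC\tD'
```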

Malformed HTML is handled. The browser's DOMParser is highly tolerant of it and applies the same error-recovery rules used when rendering web pages. Unclosed tags, missing attributes, and improperly nested elements are handled gracefully. You can paste a partial HTML snippet (for example, a single element with no surrounding wrapper) and it will be processed correctly.

All HTML entities are decoded to their Unicode equivalents by the native DOM parser. &nbsp; decodes to a no-break space (U+00A0), which is then normalized to a regular space (U+0020) during whitespace collapsing. &amp; becomes &, &lt; becomes <, &euro; becomes €, and so on. This includes all numeric character references, such as &#8212; (em dash).
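The same decoding behavior can be checked outside the browser. The sketch below uses Python's `html.unescape` as a stand-in for the DOM parser; `decode_entities` is a hypothetical helper, and the no-break-space replacement mirrors the normalization step described above.

```python
import html


def decode_entities(text: str) -> str:
    """Decode named and numeric references, then normalize U+00A0 to U+0020."""
    decoded = html.unescape(text)
    return decoded.replace("\u00a0", " ")


print(decode_entities("Fish &amp; chips &lt;3"))  # Fish & chips <3
print(decode_entities("10&nbsp;&euro;"))          # 10 €
```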

Link URLs can be kept by enabling the "Preserve link URLs" option. When it is active, anchor elements render as: link text [https://example.com]. When it is disabled, only the link text is kept. Keeping URLs is useful for creating reference-style documents or archiving web content with functional context.
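The anchor rendering rule is simple enough to sketch. The following Python example assumes the bracketed URL is appended when the anchor closes; `LinkText` and `render_links` are hypothetical names, and `html.parser` again stands in for the browser DOM.

```python
from html.parser import HTMLParser


class LinkText(HTMLParser):
    """Emit text as-is and optionally append ' [href]' after each anchor."""

    def __init__(self, preserve_urls: bool):
        super().__init__(convert_charrefs=True)
        self.preserve_urls = preserve_urls
        self.parts = []
        self._hrefs = []  # stack of hrefs for currently open anchors

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._hrefs.append(dict(attrs).get("href"))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._hrefs:
            href = self._hrefs.pop()
            if self.preserve_urls and href:
                self.parts.append(f" [{href}]")


def render_links(markup: str, preserve_urls: bool = True) -> str:
    parser = LinkText(preserve_urls)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


snippet = 'See <a href="https://example.com">link text</a> now.'
print(render_links(snippet))         # See link text [https://example.com] now.
print(render_links(snippet, False))  # See link text now.
```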