About

HTML markup carries semantic weight that vanishes when tags are removed carelessly. A naive regex strip (s/<[^>]*>//g) destroys paragraph boundaries, collapses table data into unreadable streams, and leaves encoded entities like &amp; as literal text. This converter walks the parsed DOM tree node by node, applying block-level line break rules for elements like <p>, <div>, and <li>, while preserving inline flow for <span> and <a>. Entities are decoded natively by the browser's own parser. The result is structurally faithful plain text suitable for email bodies, CMS migrations, accessibility audits, or NLP preprocessing.
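To make the failure mode of the naive strip concrete, here is a minimal sketch in Python (chosen only for illustration; the tool itself runs in the browser). The sample markup is made up for the example.

```python
import re

sample = "<p>Fish &amp; chips</p><p>Second paragraph</p>"

# Naive strip: delete anything that looks like a tag, nothing else.
stripped = re.sub(r"<[^>]*>", "", sample)

# The paragraph boundary is gone and the entity is still encoded.
print(stripped)  # Fish &amp; chipsSecond paragraph
```

The two paragraphs run together and &amp; survives untouched, which is exactly what a structure-aware walk avoids.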

Limitation: this tool approximates visual layout. Complex CSS-driven layouts (flexbox reordering, display:none content) cannot be inferred from markup alone. Table output uses tab separation, which works for simple grids but may misalign in deeply nested structures. For best results, supply clean semantic HTML.

Formulas

The conversion is not a mathematical formula but a deterministic tree-walking algorithm. The core logic can be expressed as a recursive function T(node) that maps each DOM node to a plain text string:

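\[
T(n) =
\begin{cases}
n.\mathrm{textContent} & \text{if } n \text{ is TEXT\_NODE} \\
\texttt{""} & \text{if } n \in \{\text{SCRIPT}, \text{STYLE}, \text{HEAD}\} \\
\texttt{"\textbackslash n"} & \text{if } n = \text{BR} \\
\texttt{"---\textbackslash n"} & \text{if } n = \text{HR} \\
\mathrm{prefix} + \mathrm{join}(T(\mathrm{child})) + \mathrm{suffix} & \text{if } n \text{ is ELEMENT}
\end{cases}
\]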

Where prefix and suffix are determined by the element's display category. For block elements: suffix = "\n" (or "\n\n" for paragraph-like elements). For inline elements: prefix = suffix = "". A final normalization pass collapses runs of 3+ consecutive newlines down to 2, and trims trailing whitespace per line.
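The recursion and the normalization pass can be sketched as follows. This is an illustrative approximation in Python, not the tool's browser code: it builds a tiny tree with the standard library's html.parser instead of DOMParser, and the names Node, TreeBuilder, PREFIX, SUFFIX, normalize, and html_to_text are hypothetical. Only a handful of elements are given a prefix or suffix here.

```python
import re
from html.parser import HTMLParser

SKIP = {"script", "style", "head"}               # subtrees dropped entirely
VOID = {"br", "hr", "img", "meta", "link", "input"}  # no closing tag expected
PREFIX = {"li": "\u2022 "}                       # bullet prefix (assumed subset)
SUFFIX = {"p": "\n\n", "div": "\n", "li": "\n"}  # inline tags default to ""


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree; text is stored as plain str children."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive decoded
        self.root = self.cur = Node("#root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Climb to the matching open element; ignore stray closing tags.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.children.append(data)


def T(node):
    """Recursive mapping from a node to plain text, following the cases above."""
    if isinstance(node, str):
        return node
    if node.tag in SKIP:
        return ""
    if node.tag == "br":
        return "\n"
    if node.tag == "hr":
        return "---\n"
    body = "".join(T(child) for child in node.children)
    return PREFIX.get(node.tag, "") + body + SUFFIX.get(node.tag, "")


def normalize(text):
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)  # trim line ends
    return re.sub(r"\n{3,}", "\n\n", text)                   # 3+ newlines -> 2


def html_to_text(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return normalize(T(builder.root))


print(repr(html_to_text("<div><p>Hello <span>world</span></p><p>A<br>B</p></div>")))
```

Unlike a browser parser, this sketch performs no real error recovery beyond ignoring stray closing tags, so it should be read as a picture of the recursion rather than a drop-in replacement.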

Reference Data

| HTML Element | Category | Plain Text Behavior | Output Example |
| --- | --- | --- | --- |
| <p> | Block | Content + double newline after | Text\n\n |
| <div> | Block | Content + newline after | Text\n |
| <br> | Void | Single newline | \n |
| <h1> - <h6> | Block | Content uppercased (optional) + double newline | HEADING\n\n |
| <li> | Block | Bullet prefix + content + newline | • Item\n |
| <ol> <li> | Block | Numbered prefix + content + newline | 1. Item\n |
| <a> | Inline | Text [href] (if option enabled) | Click [https://...] |
| <img> | Void | Alt text in brackets (if option enabled) | [Photo of cat] |
| <table> | Block | Tab-separated columns, newline per row | A\tB\nC\tD |
| <hr> | Void | Horizontal separator line | ---\n |
| <blockquote> | Block | Indented with > prefix | > Quote text |
| <pre> / <code> | Block/Inline | Whitespace preserved exactly | code as-is |
| <span> | Inline | Content only, no breaks | text |
| <strong> / <b> | Inline | Content only (or *wrapped* if option) | *bold* |
| <em> / <i> | Inline | Content only (or _wrapped_ if option) | _italic_ |
| <script> | Meta | Stripped entirely | (nothing) |
| <style> | Meta | Stripped entirely | (nothing) |
| <head> | Meta | Stripped entirely | (nothing) |
| <!-- --> | Comment | Stripped | (nothing) |
| &amp; | Entity | Decoded to & | & |
| &nbsp; | Entity | Decoded to space | (space) |
| &lt; / &gt; | Entity | Decoded to < / > | < / > |

Frequently Asked Questions

A regex like s/<[^>]*>//g operates on raw text and has no understanding of document structure. It merges the end of one paragraph and the start of the next into one continuous string with no whitespace between them. It cannot decode HTML entities (&amp; stays as &amp;). It also fails on malformed HTML and on edge cases such as angle brackets inside attribute values and CDATA sections. This tool parses the HTML into a real DOM tree using the browser's native DOMParser, then walks the tree with full knowledge of element semantics.
Script and style content is never included in the output. The converter explicitly skips SCRIPT, STYLE, HEAD, NOSCRIPT, and TEMPLATE elements. Their entire subtrees are excluded from the output. This prevents JavaScript code or CSS rules from polluting your plain text.
Each table row becomes one line of output. Cells within a row are separated by tab characters (\t). This produces output that can be pasted directly into a spreadsheet application. For deeply nested tables or complex colspan/rowspan layouts, the output approximates the visual structure but may not perfectly align.
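The row-per-line, tab-per-cell rule can be sketched with Python's standard html.parser. This is an assumption-laden illustration rather than the tool's own code: TableToTsv and table_to_tsv are hypothetical names, cells must have explicit closing tags, and colspan, rowspan, and nested tables are ignored.

```python
from html.parser import HTMLParser


class TableToTsv(HTMLParser):
    """Collects <td>/<th> text per <tr>; ignores colspan/rowspan and nesting."""

    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())  # finish one cell
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append("\t".join(self.row))        # one line per row
            self.row = None

    def handle_data(self, data):
        if self.cell is not None:  # text outside cells is dropped
            self.cell.append(data)


def table_to_tsv(markup):
    parser = TableToTsv()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.rows)


grid = "<table><tr><td>A</td><td>B</td></tr><tr><td>C</td><td>D</td></tr></table>"
print(repr(table_to_tsv(grid)))  # 'A\tB\nC\tD'
```

Pasting that output into a spreadsheet yields a 2x2 grid, matching the simple case described above.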
Malformed HTML is handled as well. The browser's DOMParser is highly tolerant of it and applies the same error-recovery rules used when rendering web pages. Unclosed tags, missing attributes, and improperly nested elements are handled gracefully. You can paste a partial HTML snippet (for example, an element without its usual wrapper) and it will be processed correctly.
All HTML entities are decoded to their Unicode equivalents by the native DOM parser. &nbsp; becomes a non-breaking space (U+00A0), which is then normalized to a regular space (U+0020) during whitespace collapsing. &amp; becomes &, &lt; becomes <, &euro; becomes €, and so on. This includes numeric character references such as &#8212; (em dash).
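The same decoding behaviour can be reproduced outside the browser. The sketch below uses Python's html.unescape as a stand-in for the DOM parser (a substitution made for illustration); the explicit replacement of U+00A0 with U+0020 mirrors the whitespace normalization step described above.

```python
import html

# Named and numeric character references decode to their Unicode characters.
decoded = html.unescape("&amp; &lt; &gt; &euro; &#8212;")
print(decoded == "& < > \u20ac \u2014")  # True

# &nbsp; decodes to U+00A0; normalizing it to U+0020 is a separate step.
nbsp = html.unescape("a&nbsp;b")
print(nbsp == "a\u00a0b")                # True
normalized = nbsp.replace("\u00a0", " ")
print(normalized == "a b")               # True
```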
Link URLs can be kept by enabling the "Preserve link URLs" option. When active, anchor elements render as: link text [https://example.com]. When disabled, only the link text is kept. This is useful for creating reference-style documents or archiving web content with functional context.
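A minimal sketch of that option, again in Python with the standard html.parser rather than the tool's browser code. The class name LinkText, the function render, and the preserve_urls flag are hypothetical stand-ins for the "Preserve link URLs" setting.

```python
from html.parser import HTMLParser


class LinkText(HTMLParser):
    """Emits text content; optionally appends [href] after each link's text."""

    def __init__(self, preserve_urls=False):
        super().__init__()
        self.preserve_urls = preserve_urls
        self.parts = []
        self.hrefs = []  # stack of href values for currently open <a> elements

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if self.preserve_urls and href:
                self.parts.append(f" [{href}]")

    def handle_data(self, data):
        self.parts.append(data)


def render(markup, preserve_urls=False):
    parser = LinkText(preserve_urls)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


snippet = 'See <a href="https://example.com">link text</a> now'
print(render(snippet, preserve_urls=True))   # See link text [https://example.com] now
print(render(snippet, preserve_urls=False))  # See link text now
```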