HTML to Plain Text Converter - Strip Tags & Extract Clean Text
Convert HTML markup to clean plain text instantly. Strips all tags, decodes entities, preserves structure with line breaks. Free online tool.
About
HTML markup carries semantic weight that vanishes when tags are removed carelessly. A naive regex strip (s/<[^>]*>//g) destroys paragraph boundaries, collapses table data into unreadable streams, and leaves encoded entities like &amp;amp; as literal text. This converter walks the parsed DOM tree node by node, applying block-level line break rules for elements like <p>, <div>, and <li>, while preserving inline flow for <span> and <a>. Entities are decoded natively by the browser's own parser. The result is structurally faithful plain text suitable for email bodies, CMS migrations, accessibility audits, or NLP preprocessing.
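To make the failure mode of the naive approach concrete, here is a minimal sketch in Python. It is not the tool's code (the tool runs in the browser); the function name naive_strip and the sample markup are assumptions chosen for illustration.

```python
import html
import re

# Sample markup (hypothetical): two paragraphs containing encoded entities.
SAMPLE = "<p>Fish &amp; chips</p><p>Tea &lt; coffee</p>"


def naive_strip(markup: str) -> str:
    """Delete anything that looks like a tag and do nothing else."""
    return re.sub(r"<[^>]*>", "", markup)


stripped = naive_strip(SAMPLE)
print(stripped)                 # Fish &amp; chipsTea &lt; coffee
print(html.unescape(stripped))  # Fish & chipsTea < coffee
```

The paragraph boundary is lost and the entities survive as literal text. Decoding them afterwards with html.unescape repairs the entities but cannot bring the paragraph break back, which is why a structure-aware walk is needed.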
Limitation: this tool approximates visual layout. Complex CSS-driven layouts (flexbox reordering, display:none content) cannot be inferred from markup alone. Table output uses tab separation, which works for simple grids but may misalign in deeply nested structures. For best results, supply clean semantic HTML.
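The tab-separated table output can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: it uses Python's standard html.parser instead of the browser DOM, the names TableToTsv and table_to_tsv are hypothetical, and it expects explicit closing tags on cells and rows.

```python
from html.parser import HTMLParser


class TableToTsv(HTMLParser):
    """Collect <td>/<th> text per <tr> and join cells with tabs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive decoded
        self.rows = []      # finished rows, each a tab-joined string
        self._row = None    # cells of the row in progress, or None
        self._cell = None   # text fragments of the cell in progress, or None

    def _end_cell(self):
        # Close the open cell, collapsing internal whitespace runs.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr" and self._row is not None:
            self._end_cell()
            self.rows.append("\t".join(self._row))
            self._row = None


def table_to_tsv(markup: str) -> str:
    parser = TableToTsv()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.rows)


grid = "<table><tr><td>A</td><td>B</td></tr><tr><td>C</td><td>D</td></tr></table>"
print(repr(table_to_tsv(grid)))  # 'A\tB\nC\tD'
```

Nested tables are deliberately not handled here: an inner row replaces the outer row in progress, which is one way the misalignment mentioned above can arise.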
Formulas
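The conversion is not a mathematical formula but a deterministic tree-walking algorithm. The core logic can be expressed as a recursive function T(node) that maps each DOM node to a plain text string:

$$
T(\text{node}) =
\begin{cases}
\text{text}(\text{node}) & \text{if node is a text node} \\
\text{""} & \text{if node is a comment or a stripped element (script, style, head)} \\
\text{prefix} + T(c_1) + \dots + T(c_n) + \text{suffix} & \text{otherwise, for children } c_1, \dots, c_n
\end{cases}
$$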
Where prefix and suffix are determined by the element's display category. For block elements: suffix = "\n" (or "\n\n" for paragraph-like elements). For inline elements: prefix = suffix = "". A final normalization pass collapses runs of 3+ consecutive newlines down to 2, and trims trailing whitespace per line.
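The recursion and the normalization pass can be sketched in Python as below. This is a minimal illustration, not the tool's code: the Node class, the element sets, and the function names are assumptions, the tree is built by hand rather than parsed, and the order of the two normalization steps (trim first, then collapse) is a guess, since the text does not specify it.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Assumed category sets; the tool's actual lists are not given in the text.
PARAGRAPH_LIKE = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK = {"div", "li"}
STRIPPED = {"script", "style", "head"}


@dataclass
class Node:
    tag: str | None = None          # None marks a text node
    text: str = ""
    children: list[Node] = field(default_factory=list)


def affixes(tag: str) -> tuple[str, str]:
    """Return (prefix, suffix) for an element based on its display category."""
    if tag in PARAGRAPH_LIKE:
        return "", "\n\n"
    if tag in BLOCK:
        return "", "\n"
    return "", ""                   # inline elements add nothing


def T(node: Node) -> str:
    """Map a node to plain text by recursing over its children."""
    if node.tag is None:
        return node.text
    if node.tag in STRIPPED:
        return ""
    prefix, suffix = affixes(node.tag)
    return prefix + "".join(T(child) for child in node.children) + suffix


def normalize(text: str) -> str:
    """Trim trailing whitespace per line, then collapse 3+ newlines to 2."""
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return re.sub(r"\n{3,}", "\n\n", text)


doc = Node("div", children=[
    Node("p", children=[Node(text="Hello, "),
                        Node("span", children=[Node(text="world")])]),
    Node("script", children=[Node(text="alert(1)")]),
    Node("p", children=[Node(text="Bye  ")]),
])

print(repr(T(doc)))             # 'Hello, world\n\nBye  \n\n\n'
print(repr(normalize(T(doc))))  # 'Hello, world\n\nBye\n\n'
```

The closing <div> adds a third newline after the last paragraph, and the normalization pass reduces that run back to two.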
Reference Data
| HTML Element | Category | Plain Text Behavior | Output Example |
|---|---|---|---|
| <p> | Block | Content + double newline after | Text\n\n |
| <div> | Block | Content + newline after | Text\n |
| <br> | Void | Single newline | \n |
| <h1> - <h6> | Block | Content uppercased (optional) + double newline | HEADING\n\n |
| <li> | Block | Bullet prefix + content + newline | • Item\n |
| <ol> <li> | Block | Numbered prefix + content + newline | 1. Item\n |
| <a> | Inline | Text [href] (if option enabled) | Click [https://...] |
| <img> | Void | Alt text in brackets (if option enabled) | [Photo of cat] |
| <table> | Block | Tab-separated columns, newline per row | A\tB\nC\tD |
| <hr> | Void | Horizontal separator line | ---\n |
| <blockquote> | Block | Indented with > prefix | > Quote text |
| <pre> / <code> | Block/Inline | Whitespace preserved exactly | code as-is |
| <span> | Inline | Content only, no breaks | text |
| <strong> / <b> | Inline | Content only (or *wrapped* if option) | *bold* |
| <em> / <i> | Inline | Content only (or _wrapped_ if option) | _italic_ |
| <script> | Meta | Stripped entirely | (nothing) |
| <style> | Meta | Stripped entirely | (nothing) |
| <head> | Meta | Stripped entirely | (nothing) |
| <!-- --> | Comment | Stripped | (nothing) |
| &amp;amp; | Entity | Decoded to & | & |
| &amp;nbsp; | Entity | Decoded to space | (space) |
| &amp;lt; / &amp;gt; | Entity | Decoded to < / > | < / > |
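A few of the rows above (list items, numbered items, links with the href option, <br>, stripped <script>/<style>, and entity decoding) can be sketched with Python's standard html.parser. This is an event-based approximation, not the tool's DOM walk. The class name ReferenceRowsText, the function html_to_text, and the show_links flag are hypothetical, and mapping the non-breaking space to an ordinary space is an assumption based on the "Decoded to space" row.

```python
from html.parser import HTMLParser


class ReferenceRowsText(HTMLParser):
    """Apply a subset of the reference rules while streaming over the markup."""

    def __init__(self, show_links: bool = True):
        super().__init__(convert_charrefs=True)  # entities arrive decoded
        self.show_links = show_links
        self.out = []
        self._lists = []   # one entry per open list: None for <ul>, int for <ol>
        self._hrefs = []   # href of each open <a>
        self._skip = 0     # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "ul":
            self._lists.append(None)
        elif tag == "ol":
            self._lists.append(0)
        elif tag == "li":
            if self._lists and self._lists[-1] is not None:
                self._lists[-1] += 1
                self.out.append(f"{self._lists[-1]}. ")   # numbered prefix
            else:
                self.out.append("\u2022 ")                # bullet prefix
        elif tag == "a":
            self._hrefs.append(dict(attrs).get("href"))
        elif tag == "br":
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in ("ul", "ol") and self._lists:
            self._lists.pop()
        elif tag == "li":
            self.out.append("\n")
        elif tag == "a" and self._hrefs:
            href = self._hrefs.pop()
            if self.show_links and href:
                self.out.append(f" [{href}]")

    def handle_data(self, data):
        if not self._skip:
            # Assumption: treat U+00A0 (from &nbsp;) as a plain space.
            self.out.append(data.replace("\u00a0", " "))


def html_to_text(markup: str, show_links: bool = True) -> str:
    parser = ReferenceRowsText(show_links)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


sample = ('<ol><li>Tea &amp; milk</li>'
          '<li><a href="https://example.com">Click</a></li></ol>'
          '<script>x()</script>')
print(repr(html_to_text(sample)))
# '1. Tea & milk\n2. Click [https://example.com]\n'
```

Keeping one counter per open list lets nested <ol> elements number independently, and turning show_links off yields the link text alone.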