HTML to JSON Parser
Parse and convert any HTML page or snippet into structured JSON format with custom extraction rules, CSS selectors, and JSONP callback support.
Define CSS selector rules to extract specific fields from the HTML. Each rule maps a JSON key to a CSS selector.
About
Structured data extraction from HTML remains error-prone when done manually. A missing closing tag, an unexpected attribute, or a deeply nested div hierarchy can silently corrupt your parsed output. This tool uses the browser's native DOMParser API to build a standards-compliant DOM tree from raw HTML, then recursively walks every node to produce a JSON Abstract Syntax Tree. Each element becomes an object with tag, attributes, and children fields. You can also define custom CSS selector rules to extract only the fields you need, similar to server-side libraries like Cheerio. The parser handles malformed HTML gracefully because browsers auto-correct structure during parsing. Note: this tool approximates server-side scraping behavior but runs entirely client-side. CORS restrictions apply when fetching remote URLs, so a public proxy is used as a fallback.
Formulas
The parser performs a recursive depth-first traversal of the DOM tree produced by the browser's native DOMParser. For each node at depth d, the algorithm produces a JSON object based on the node type:

json(node, d) =
  { tag, attributes, children: [ json(c, d+1) for each child c ] }   if nodeType = 1 and d ≤ maxDepth
  { text: trim(textContent) }                                        if nodeType = 3 and the trimmed text is non-empty

Here nodeType = 1 is an Element node and nodeType = 3 is a Text node. The maxDepth parameter defaults to 50 to prevent stack overflow on pathological inputs; nodes beyond that depth are replaced with a truncation marker.
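A minimal sketch of this traversal, run against plain objects shaped like DOM nodes rather than a live document so it works outside a browser. The field names (nodeType, tagName, childNodes) mirror the DOM API, but the node objects here are hand-built for illustration.

```javascript
// Depth-first walk: element nodes (nodeType 1) become {tag, attributes,
// children}; text nodes (nodeType 3) become {text}; whitespace-only text
// is dropped; maxDepth guards pathological nesting.
function nodeToJson(node, depth = 0, maxDepth = 50) {
  if (depth > maxDepth) return { truncated: true };
  if (node.nodeType === 3) {
    const text = node.nodeValue.trim();
    return text ? { text } : null; // skip whitespace-only text nodes
  }
  if (node.nodeType === 1) {
    return {
      tag: node.tagName.toLowerCase(),
      attributes: node.attributes || {},
      children: (node.childNodes || [])
        .map((child) => nodeToJson(child, depth + 1, maxDepth))
        .filter((child) => child !== null),
    };
  }
  return null; // other node types omitted in this sketch
}

// Hand-built stand-in for the DOM of <ul><li>A</li></ul>
const tree = {
  nodeType: 1,
  tagName: "UL",
  attributes: {},
  childNodes: [
    { nodeType: 1, tagName: "LI", attributes: {},
      childNodes: [{ nodeType: 3, nodeValue: "A" }] },
    { nodeType: 3, nodeValue: "\n  " }, // whitespace-only, dropped
  ],
};

console.log(JSON.stringify(nodeToJson(tree)));
// {"tag":"ul","attributes":{},"children":[{"tag":"li","attributes":{},"children":[{"text":"A"}]}]}
```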
Custom extraction rules use CSS selectors evaluated via querySelectorAll(selector). Each rule maps a key → selector pair, and the extracted value is either textContent or an attribute value depending on configuration. Multiple matches produce an array.
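The rule evaluation can be sketched as follows. The rule shape {selector, attr} is an assumption for illustration; any object exposing querySelectorAll (a parsed Document in the browser, or the stub below) will do.

```javascript
// Each rule maps a JSON key to a CSS selector; the extracted value is
// textContent (trimmed) or a named attribute. A single match yields a
// scalar, multiple matches yield an array.
function applyRules(root, rules) {
  const out = {};
  for (const [key, rule] of Object.entries(rules)) {
    const matches = Array.from(root.querySelectorAll(rule.selector)).map(
      (el) => (rule.attr ? el.getAttribute(rule.attr) : el.textContent.trim())
    );
    out[key] = matches.length === 1 ? matches[0] : matches;
  }
  return out;
}

// Minimal stand-in for a parsed document (no browser required).
const stubDoc = {
  querySelectorAll(selector) {
    const data = {
      "h1.main": [{ textContent: " Widget " }],
      ".price": [{ textContent: "$5" }, { textContent: "$7" }],
    };
    return data[selector] || [];
  },
};

console.log(applyRules(stubDoc, {
  title: { selector: "h1.main" },   // one match -> scalar
  prices: { selector: ".price" },   // two matches -> array
}));
// { title: 'Widget', prices: [ '$5', '$7' ] }
```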
JSONP wrapping encloses the JSON output in a call to the named callback function: callback(json);
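The wrapping step is a pure string operation, sketched here as a small helper:

```javascript
// Serialize the value, then wrap it in the callback name when one is
// supplied; an empty callback yields plain JSON.
function toJsonp(value, callback) {
  const json = JSON.stringify(value);
  return callback ? `${callback}(${json});` : json;
}

console.log(toJsonp({ ok: true }, "gogogo")); // gogogo({"ok":true});
console.log(toJsonp({ ok: true }, ""));       // {"ok":true}
```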
Reference Data
| Node Type | JSON Output Key | Description | Example Input | Example Output |
|---|---|---|---|---|
| Element | tag | HTML tag name, lowercased | <div> | "div" |
| Element | attributes | Object of all attributes | <a href="/"> | {"href":"/"} |
| Element | children | Array of child nodes | <ul><li>A</li></ul> | Nested array |
| Text | text | Trimmed text content | Hello | "Hello" |
| Comment | comment | Comment content (optional) | <!-- note --> | "note" |
| Document | doctype | DOCTYPE declaration info | <!DOCTYPE html> | "html" |
| CDATA | cdata | CDATA section content | <![CDATA[...]]> | Raw string |
| Custom Rule | key | User-defined extraction field | Selector: .price | Matched text/html |
| Attribute Filter | data-* | Extract only data attributes | data-id="5" | {"data-id":"5"} |
| Meta Tags | meta | Extracted meta information | <meta name="desc"> | {"name":"desc"} |
| Max Depth | depth | Recursion limit guard | Depth > 50 | Truncated marker |
| Empty Text | (skipped) | Whitespace-only text nodes removed | \n | Not included |
| Script Tag | (skipped) | Script elements excluded by default | <script>...</script> | Not included |
| Style Tag | (skipped) | Style elements excluded by default | <style>...</style> | Not included |
| SVG Element | tag | SVG tags preserved with namespace | <svg><circle> | Nested object |
| Boolean Attr | attributes | Boolean attributes set to true | <input disabled> | {"disabled":true} |
| JSONP Wrap | Callback wrapper | Wraps output in function call | callback=fn | fn({...}) |
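The attribute rows in the table above (plain attributes become key/value pairs, boolean attributes become true) can be sketched as a small mapping function. The input is a plain array standing in for the DOM's NamedNodeMap:

```javascript
// Map a list of {name, value} attribute records to a JSON object.
// The DOM reports boolean attributes such as disabled with an empty
// string value, so empty values are coerced to true.
function attributesToObject(attrs) {
  const out = {};
  for (const { name, value } of attrs) {
    out[name] = value === "" ? true : value;
  }
  return out;
}

// Stand-in for the attributes of <input type="text" disabled>
console.log(attributesToObject([
  { name: "type", value: "text" },
  { name: "disabled", value: "" },
]));
// { type: 'text', disabled: true }
```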
Frequently Asked Questions
How does the parser handle malformed HTML?
The browser's parser auto-corrects structure before the JSON is built: missing <tbody> elements are inserted into tables, and nested block elements inside inline elements are restructured. The resulting JSON reflects the corrected DOM, not the raw source. If you need to preserve the original broken structure, use the raw text input mode without DOM parsing.

Are <script> and <style> contents included in the output?
The parser excludes <script>, <style>, and <noscript> by default. You can toggle this behavior in the settings panel to include them if your use case requires parsing inline scripts or stylesheets.

Can I extract only specific fields instead of the full tree?
Yes. Define custom extraction rules with CSS selectors (e.g. .product-title, meta[name='description']) mapped to named JSON keys. This is analogous to how server-side scrapers like Cheerio work. You define a rule like title → h1.main, and the output becomes {"title": "extracted text"}. Rules support extracting textContent, a specific attribute value, or the innerHTML of matched elements.

Can I fetch a remote URL directly?
Cross-origin fetches fall back to a public CORS proxy (api.allorigins.win). Some sites may still block proxy requests or return CAPTCHAs. For reliable results, copy the page source manually (Ctrl+U in browser) and paste it into the input field. The URL fetch feature works best with APIs and pages that serve permissive CORS headers.

What is the JSONP callback option for?
JSONP wraps the JSON output in a function call, e.g. myCallback({...});. This was historically used to bypass same-origin policy restrictions before CORS existed. If you specify a callback name like gogogo, the output becomes gogogo({...}); instead of raw JSON. This is useful when integrating with legacy systems that expect JSONP responses. Leave the callback field empty for standard JSON output.

Is attribute order preserved?
Attributes are read from the element.attributes NamedNodeMap in the order the browser exposes them. If attribute order is critical for your use case, verify against the source HTML.