
About

Structured data extraction from HTML remains error-prone when done manually. A missing closing tag, an unexpected attribute, or a deeply nested div hierarchy can silently corrupt your parsed output. This tool uses the browser's native DOMParser API to build a standards-compliant DOM tree from raw HTML, then recursively walks every node to produce a JSON Abstract Syntax Tree. Each element becomes an object with tag, attributes, and children fields. You can also define custom CSS selector rules to extract only the fields you need, similar to server-side libraries like Cheerio. The parser handles malformed HTML gracefully because browsers auto-correct structure during parsing. Note: this tool approximates server-side scraping behavior but runs entirely client-side. CORS restrictions apply when fetching remote URLs, so a public proxy is used as a fallback.


Formulas

The parser performs a recursive depth-first traversal of the DOM tree produced by the browser's native DOMParser. For each node at depth d, the algorithm produces a JSON object based on the node type.

parse(node, d) =
    { tag, attributes, children: map(node.childNodes, d + 1) }    if nodeType = 1
    { text: trim(node.textContent) }                               if nodeType = 3 and trim(node.textContent) ≠ ""
    NULL                                                           if d > maxDepth

Where nodeType = 1 is an Element node and nodeType = 3 is a Text node. The maxDepth parameter defaults to 50 to prevent stack overflow on pathological inputs.
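The traversal above can be sketched in JavaScript. This is a minimal, illustrative version: in the browser the node would come from DOMParser, but the sketch relies only on the standard nodeType, tagName, attributes, and childNodes fields, so any DOM-shaped object works. The names parseNode and MAX_DEPTH are assumptions, not the tool's actual identifiers.

```javascript
// Depth-first walk of a DOM(-shaped) tree, mirroring the piecewise
// definition above. MAX_DEPTH matches the documented default of 50.
const MAX_DEPTH = 50;

function parseNode(node, depth = 0) {
  if (depth > MAX_DEPTH) return { truncated: "[MAX_DEPTH]" };

  if (node.nodeType === 3) {                    // Text node
    const text = node.textContent.trim();
    return text ? { text } : null;              // skip whitespace-only text
  }

  if (node.nodeType === 1) {                    // Element node
    const attributes = {};
    for (const { name, value } of node.attributes) attributes[name] = value;
    return {
      tag: node.tagName.toLowerCase(),
      attributes,
      children: Array.from(node.childNodes)
        .map(child => parseNode(child, depth + 1))
        .filter(child => child !== null),       // drop skipped nodes
    };
  }

  return null;                                  // other node types omitted in this sketch
}
```

In the browser you would call it as `parseNode(new DOMParser().parseFromString(html, "text/html").documentElement)`.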

Custom extraction rules use CSS selectors evaluated via querySelectorAll(selector). Each rule maps a key to a selector, and the extracted value is either textContent or an attribute value depending on configuration. Multiple matches produce an array.

result[key] = querySelectorAll(selector).map(el => el.textContent)
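A sketch of that rule evaluation in JavaScript, assuming root exposes querySelectorAll (a real Document in the browser). The rule shape ({ selector, attr }) and the single-match-collapses-to-scalar behavior are illustrative assumptions, not the tool's exact internals.

```javascript
// Evaluate a map of { key: { selector, attr? } } extraction rules against a
// document-like root. Without `attr`, the rule extracts trimmed textContent.
function applyRules(root, rules) {
  const result = {};
  for (const [key, rule] of Object.entries(rules)) {
    const matches = Array.from(root.querySelectorAll(rule.selector)).map(el =>
      rule.attr ? el.getAttribute(rule.attr) : el.textContent.trim()
    );
    // Assumption: a single match is returned as a scalar, multiple as an array.
    result[key] = matches.length === 1 ? matches[0] : matches;
  }
  return result;
}
```

For example, a rule set like { title: { selector: "h1.main" } } would yield { "title": "extracted text" }.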

JSONP wrapping appends the callback function name around the JSON output: callback(json);
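The wrapping step is a one-line string operation; a minimal sketch (the function name wrapJsonp is illustrative):

```javascript
// Serialize the parsed result, then wrap it in a call to the user-supplied
// callback name. An empty callback name yields plain JSON.
function wrapJsonp(data, callback) {
  const json = JSON.stringify(data);
  return callback ? `${callback}(${json});` : json;
}
```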

Reference Data

Node Type        | JSON Output Key    | Description                          | Example Input          | Example Output
-----------------|--------------------|--------------------------------------|------------------------|----------------
Element          | tag                | HTML tag name, lowercased            | <div>                  | "div"
Element          | attributes         | Object of all attributes             | <a href="/">           | {"href":"/"}
Element          | children           | Array of child nodes                 | <ul><li>A</li></ul>    | Nested array
Text             | text               | Trimmed text content                 | Hello                  | "Hello"
Comment          | comment            | Comment content (optional)           | <!-- note -->          | "note"
Document         | doctype            | DOCTYPE declaration info             | <!DOCTYPE html>        | "html"
CDATA            | cdata              | CDATA section content                | <![CDATA[...]]>        | Raw string
Custom Rule      | key                | User-defined extraction field        | Selector: .price       | Matched text/html
Attribute Filter | data-*             | Extract only data attributes         | data-id="5"            | {"data-id":"5"}
Meta Tags        | meta               | Extracted meta information           | <meta name="desc">     | {"name":"desc"}
Max Depth        | depth              | Recursion limit guard                | Depth > 50             | Truncated marker
Empty Text       | (skipped)          | Whitespace-only text nodes removed   | \n                     | Not included
Script Tag       | (skipped)          | Script elements excluded by default  | <script>...</script>   | Not included
Style Tag        | (skipped)          | Style elements excluded by default   | <style>...</style>     | Not included
SVG Element      | tag                | SVG tags preserved with namespace    | <svg><circle>          | Nested object
Boolean Attr     | attributes         | Boolean attributes set to true       | <input disabled>       | {"disabled":true}
JSONP Wrap       | (callback wrapper) | Wraps output in function call        | callback=fn            | fn({...})

Frequently Asked Questions

How does the tool handle malformed or broken HTML?

The tool relies on the browser's native DOMParser API, which implements the HTML5 parsing specification. This means it applies the same error-recovery algorithms that browsers use when rendering pages. Unclosed tags are auto-closed, missing <tbody> elements are inserted into tables, and nested block elements inside inline elements are restructured. The resulting JSON reflects the corrected DOM, not the raw source. If you need to preserve the original broken structure, use the raw text input mode without DOM parsing.
Why are script and style elements excluded by default?

Script and style elements contain code, not content. Including them inflates the JSON output with non-structural data and introduces potential security concerns if the JSON is later injected into another document. The parser excludes tags matching <script>, <style>, and <noscript> by default. You can toggle this behavior in the settings panel to include them if your use case requires parsing inline scripts or stylesheets.
Are there size or performance limits?

The parser enforces a 5 MB input limit to prevent browser tab crashes. Documents exceeding approximately 500 KB are processed in a Web Worker to avoid blocking the UI thread. For very deeply nested documents (depth > 50 levels), the recursion is truncated with a "[MAX_DEPTH]" marker in the output. Typical web pages with 2000 to 5000 nodes parse in under 200 ms.
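The two size thresholds above amount to a small dispatch decision. A sketch, with the function name and return values as illustrative assumptions (the actual Web Worker handoff is browser-only and omitted here):

```javascript
// Size-based dispatch implied by the documented limits: reject inputs over
// 5 MB, offload anything over ~500 KB to a Web Worker, otherwise parse on
// the main thread.
const MAX_INPUT_BYTES = 5 * 1024 * 1024;   // hard input limit
const WORKER_THRESHOLD = 500 * 1024;       // offload point

function chooseStrategy(inputBytes) {
  if (inputBytes > MAX_INPUT_BYTES) return "reject";
  if (inputBytes > WORKER_THRESHOLD) return "worker"; // parse off the UI thread
  return "main";                                      // small input: parse inline
}
```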
What is the difference between full conversion and custom extraction rules?

Full conversion produces a complete AST of every node in the document. Custom rules let you define targeted CSS selectors (e.g., .product-title, meta[name='description']) mapped to named JSON keys. This is analogous to how server-side scrapers like Cheerio work. You define a rule like title → h1.main, and the output becomes {"title": "extracted text"}. Rules support extracting textContent, a specific attribute value, or the innerHTML of matched elements.
Can the tool fetch and parse a remote URL directly?

Yes, but with caveats. Browsers enforce CORS (Cross-Origin Resource Sharing) policies, which block direct fetch requests to most external domains. The tool attempts a direct fetch first, then falls back to a public CORS proxy (api.allorigins.win). Some sites may still block proxy requests or return CAPTCHAs. For reliable results, copy the page source manually (Ctrl+U in the browser) and paste it into the input field. The URL fetch feature works best with APIs and pages that serve permissive CORS headers.
What is the JSONP output option?

JSONP (JSON with Padding) wraps the JSON output in a function call, e.g., myCallback({...});. This was historically used to bypass same-origin policy restrictions before CORS existed. If you specify a callback name like gogogo, the output becomes gogogo({...}); instead of raw JSON. This is useful when integrating with legacy systems that expect JSONP responses. Leave the callback field empty for standard JSON output.
Is attribute order preserved in the JSON output?

The HTML specification does not guarantee attribute order, and browsers may reorder attributes during parsing. DOMParser typically preserves source order in modern browsers (Chrome, Firefox, Safari), but this is an implementation detail, not a guarantee. The JSON output iterates attributes via the element.attributes NamedNodeMap in the order the browser exposes them. If attribute order is critical for your use case, verify against the source HTML.