
About

Structured data extraction from HTML remains error-prone when done manually. A missing closing tag, an unexpected attribute, or a deeply nested div hierarchy can silently corrupt your parsed output. This tool uses the browser's native DOMParser API to build a standards-compliant DOM tree from raw HTML, then recursively walks every node to produce a JSON Abstract Syntax Tree. Each element becomes an object with tag, attributes, and children fields. You can also define custom CSS selector rules to extract only the fields you need, similar to server-side libraries like Cheerio. The parser handles malformed HTML gracefully because browsers auto-correct structure during parsing. Note: this tool approximates server-side scraping behavior but runs entirely client-side. CORS restrictions apply when fetching remote URLs, so a public proxy is used as a fallback.


Formulas

The parser performs a recursive depth-first traversal of the DOM tree produced by the browser's native DOMParser. For each node at depth d, the algorithm produces a JSON object based on the node type.

parse(node, d) =
    { tag, attributes, children: map(node.childNodes, d + 1) }    if nodeType = 1
    { text: trim(node.textContent) }                               if nodeType = 3 and trim(node.textContent) ≠ ""
    NULL                                                           if d > maxDepth

Where nodeType = 1 is an Element node and nodeType = 3 is a Text node. The maxDepth parameter defaults to 50 to prevent stack overflow on pathological inputs.
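The traversal above can be sketched in JavaScript. This is a minimal, illustrative version: in the browser the node would come from DOMParser, but the sketch relies only on the standard nodeType, tagName, attributes, and childNodes fields, so any DOM-shaped object works. The names parseNode and MAX_DEPTH are assumptions, not the tool's actual identifiers.

```javascript
// Depth-first walk of a DOM(-shaped) tree, mirroring the piecewise
// definition above. MAX_DEPTH matches the documented default of 50.
const MAX_DEPTH = 50;

function parseNode(node, depth = 0) {
  if (depth > MAX_DEPTH) return { truncated: "[MAX_DEPTH]" };

  if (node.nodeType === 3) {                    // Text node
    const text = node.textContent.trim();
    return text ? { text } : null;              // skip whitespace-only text
  }

  if (node.nodeType === 1) {                    // Element node
    const attributes = {};
    for (const { name, value } of node.attributes) attributes[name] = value;
    return {
      tag: node.tagName.toLowerCase(),
      attributes,
      children: Array.from(node.childNodes)
        .map(child => parseNode(child, depth + 1))
        .filter(child => child !== null),       // drop skipped nodes
    };
  }

  return null;                                  // other node types omitted in this sketch
}
```

In the browser you would call it as `parseNode(new DOMParser().parseFromString(html, "text/html").documentElement)`.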

Custom extraction rules use CSS selectors evaluated via querySelectorAll(selector). Each rule maps a key to a selector, and the extracted value is either textContent or an attribute value depending on configuration. Multiple matches produce an array.

result[key] = querySelectorAll(selector).map(el => el.textContent)
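A sketch of that rule evaluation in JavaScript, assuming root exposes querySelectorAll (a real Document in the browser). The rule shape ({ selector, attr }) and the single-match-collapses-to-scalar behavior are illustrative assumptions, not the tool's exact internals.

```javascript
// Evaluate a map of { key: { selector, attr? } } extraction rules against a
// document-like root. Without `attr`, the rule extracts trimmed textContent.
function applyRules(root, rules) {
  const result = {};
  for (const [key, rule] of Object.entries(rules)) {
    const matches = Array.from(root.querySelectorAll(rule.selector)).map(el =>
      rule.attr ? el.getAttribute(rule.attr) : el.textContent.trim()
    );
    // Assumption: a single match is returned as a scalar, multiple as an array.
    result[key] = matches.length === 1 ? matches[0] : matches;
  }
  return result;
}
```

For example, a rule set like { title: { selector: "h1.main" } } would yield { "title": "extracted text" }.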

JSONP wrapping appends the callback function name around the JSON output: callback(json);
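The wrapping step is a one-line string operation; a minimal sketch (the function name wrapJsonp is illustrative):

```javascript
// Serialize the parsed result, then wrap it in a call to the user-supplied
// callback name. An empty callback name yields plain JSON.
function wrapJsonp(data, callback) {
  const json = JSON.stringify(data);
  return callback ? `${callback}(${json});` : json;
}
```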

Reference Data

Node Type        | JSON Output Key    | Description                          | Example Input          | Example Output
-----------------|--------------------|--------------------------------------|------------------------|----------------
Element          | tag                | HTML tag name, lowercased            | <div>                  | "div"
Element          | attributes         | Object of all attributes             | <a href="/">           | {"href":"/"}
Element          | children           | Array of child nodes                 | <ul><li>A</li></ul>    | Nested array
Text             | text               | Trimmed text content                 | Hello                  | "Hello"
Comment          | comment            | Comment content (optional)           | <!-- note -->          | "note"
Document         | doctype            | DOCTYPE declaration info             | <!DOCTYPE html>        | "html"
CDATA            | cdata              | CDATA section content                | <![CDATA[...]]>        | Raw string
Custom Rule      | key                | User-defined extraction field        | Selector: .price       | Matched text/html
Attribute Filter | data-*             | Extract only data attributes         | data-id="5"            | {"data-id":"5"}
Meta Tags        | meta               | Extracted meta information           | <meta name="desc">     | {"name":"desc"}
Max Depth        | depth              | Recursion limit guard                | Depth > 50             | Truncated marker
Empty Text       | (skipped)          | Whitespace-only text nodes removed   | \n                     | Not included
Script Tag       | (skipped)          | Script elements excluded by default  | <script>...</script>   | Not included
Style Tag        | (skipped)          | Style elements excluded by default   | <style>...</style>     | Not included
SVG Element      | tag                | SVG tags preserved with namespace    | <svg><circle>          | Nested object
Boolean Attr     | attributes         | Boolean attributes set to true       | <input disabled>       | {"disabled":true}
JSONP Wrap       | (callback wrapper) | Wraps output in function call        | callback=fn            | fn({...})

Frequently Asked Questions

How does the tool handle malformed or broken HTML?

The tool relies on the browser's native DOMParser API, which implements the HTML5 parsing specification. This means it applies the same error-recovery algorithms that browsers use when rendering pages. Unclosed tags are auto-closed, missing <tbody> elements are inserted into tables, and nested block elements inside inline elements are restructured. The resulting JSON reflects the corrected DOM, not the raw source. If you need to preserve the original broken structure, use the raw text input mode without DOM parsing.
Why are script and style elements excluded by default?

Script and style elements contain code, not content. Including them inflates the JSON output with non-structural data and introduces potential security concerns if the JSON is later injected into another document. The parser excludes tags matching <script>, <style>, and <noscript> by default. You can toggle this behavior in the settings panel to include them if your use case requires parsing inline scripts or stylesheets.
Are there size or performance limits?

The parser enforces a 5 MB input limit to prevent browser tab crashes. Documents exceeding approximately 500 KB are processed in a Web Worker to avoid blocking the UI thread. For very deeply nested documents (depth > 50 levels), the recursion is truncated with a "[MAX_DEPTH]" marker in the output. Typical web pages with 2000 to 5000 nodes parse in under 200 ms.
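The two size thresholds above amount to a small dispatch decision. A sketch, with the function name and return values as illustrative assumptions (the actual Web Worker handoff is browser-only and omitted here):

```javascript
// Size-based dispatch implied by the documented limits: reject inputs over
// 5 MB, offload anything over ~500 KB to a Web Worker, otherwise parse on
// the main thread.
const MAX_INPUT_BYTES = 5 * 1024 * 1024;   // hard input limit
const WORKER_THRESHOLD = 500 * 1024;       // offload point

function chooseStrategy(inputBytes) {
  if (inputBytes > MAX_INPUT_BYTES) return "reject";
  if (inputBytes > WORKER_THRESHOLD) return "worker"; // parse off the UI thread
  return "main";                                      // small input: parse inline
}
```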
What is the difference between full conversion and custom extraction rules?

Full conversion produces a complete AST of every node in the document. Custom rules let you define targeted CSS selectors (e.g., .product-title, meta[name='description']) mapped to named JSON keys. This is analogous to how server-side scrapers like Cheerio work. You define a rule like title → h1.main, and the output becomes {"title": "extracted text"}. Rules support extracting textContent, a specific attribute value, or the innerHTML of matched elements.
Can the tool fetch and parse a remote URL directly?

Yes, but with caveats. Browsers enforce CORS (Cross-Origin Resource Sharing) policies, which block direct fetch requests to most external domains. The tool attempts a direct fetch first, then falls back to a public CORS proxy (api.allorigins.win). Some sites may still block proxy requests or return CAPTCHAs. For reliable results, copy the page source manually (Ctrl+U in the browser) and paste it into the input field. The URL fetch feature works best with APIs and pages that serve permissive CORS headers.
What is the JSONP output option?

JSONP (JSON with Padding) wraps the JSON output in a function call, e.g., myCallback({...});. This was historically used to bypass same-origin policy restrictions before CORS existed. If you specify a callback name like gogogo, the output becomes gogogo({...}); instead of raw JSON. This is useful when integrating with legacy systems that expect JSONP responses. Leave the callback field empty for standard JSON output.
Is attribute order preserved in the JSON output?

The HTML specification does not guarantee attribute order, and browsers may reorder attributes during parsing. DOMParser typically preserves source order in modern browsers (Chrome, Firefox, Safari), but this is an implementation detail, not a guarantee. The JSON output iterates attributes via the element.attributes NamedNodeMap in the order the browser exposes them. If attribute order is critical for your use case, verify against the source HTML.