About

Extracting tabular data from HTML is error-prone. Tables use colspan and rowspan attributes that create merged cell regions. A naive row-by-row copy loses alignment: column j in row i may actually map to column j + k due to a preceding span. This tool parses the full merge grid, reconstructing a rectangular matrix of m × n cells before export. It generates RFC 4180-compliant CSV with proper quoting and a valid OOXML XLSX binary (ZIP archive of XML parts) - not an HTML file renamed to .xlsx.

Limitations: nested tables (a <table> inside a <td>) are flattened to text. Formatting (colors, fonts, borders) is not preserved in output - only raw cell text. The XLSX writer stores shared strings without compression, so tables with more than 50000 cells may produce larger-than-expected files. For production spreadsheets, validate column alignment against the original after export.

Formulas

The core challenge is resolving merged cells. Given a table with R rows, a fill-grid algorithm constructs a rectangular matrix G of dimensions R × C, where C is the effective column count.

fillGrid(row, cell): cs = cell.colspan || 1, rs = cell.rowspan || 1, col = first unoccupied column in G[row] → for dr = 0 to rs − 1, dc = 0 to cs − 1: G[row + dr][col + dc] = (dr = 0 and dc = 0) ? cell.text : ""

For each cell in the source HTML, the algorithm skips to the first unoccupied column in the current row, then fills an rs × cs block (rs rows by cs columns). Only the top-left cell of a merged region receives the text value; remaining cells are set to empty strings.
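The fill step described above can be sketched in JavaScript as follows. This is a minimal illustration, not the tool's actual source: the input shape (an array of rows, each an array of { text, colspan, rowspan } objects) and the name buildGrid are assumptions made for the example, and clipping a rowspan at the last row is also an assumption.

```javascript
// Hypothetical input shape: rows is an array of rows, each an array of
// { text, colspan, rowspan } objects. A real tool would read these from the DOM.
function buildGrid(rows) {
  const grid = rows.map(() => []);
  rows.forEach((cells, r) => {
    let col = 0;
    for (const cell of cells) {
      // Skip positions already claimed by a span from an earlier cell or row.
      while (grid[r][col] !== undefined) col++;
      const cs = cell.colspan || 1;
      const rs = cell.rowspan || 1;
      // Assumption: a rowspan that runs past the last row is clipped.
      for (let dr = 0; dr < rs && r + dr < rows.length; dr++) {
        for (let dc = 0; dc < cs; dc++) {
          // Only the top-left position keeps the text; the rest stay empty.
          grid[r + dr][col + dc] = dr === 0 && dc === 0 ? cell.text : "";
        }
      }
      col += cs;
    }
  });
  // Pad every row to the widest one so the result is rectangular.
  const width = Math.max(0, ...grid.map((row) => row.length));
  return grid.map((row) =>
    Array.from({ length: width }, (_, c) => row[c] ?? "")
  );
}

const example = buildGrid([
  [{ text: "A", colspan: 2, rowspan: 2 }, { text: "B" }],
  [{ text: "C" }],
  [{ text: "D" }, { text: "E" }, { text: "F" }],
]);
// example is [["A", "", "B"], ["", "", "C"], ["D", "E", "F"]]
```

Note how "C" lands in the third column of the second row: the first two positions were already claimed by the 2 × 2 merged region above it.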

CSV encoding per RFC 4180:

escapeCSV(field) = if field contains , or " or \r or \n → " + field.replace(/"/g, '""') + ", else: field
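As a runnable sketch of the quoting rule, the snippet below escapes individual fields and joins a grid into CSV text with CRLF line endings. The helper name toCSV is an assumption for illustration, and this version omits the optional line break after the final record.

```javascript
// Quote a field only when it contains a comma, a double quote, or a line break;
// embedded double quotes are doubled, as RFC 4180 requires.
function escapeCSV(field) {
  const text = String(field);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Hypothetical helper: join a rectangular grid of strings into CSV text.
function toCSV(grid) {
  return grid.map((row) => row.map(escapeCSV).join(",")).join("\r\n");
}

const csv = toCSV([
  ["name", "note"],
  ["Smith, J.", 'said "hi"'],
]);
// csv is: name,note\r\n"Smith, J.","said ""hi"""
```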

XLSX structure follows the OOXML standard (ECMA-376). The minimal valid archive contains 7 XML parts packed into a ZIP container using the store method (no deflate compression). Cell references use the column-letter system: column index c maps to letters via base-26 conversion where 0 → A, 25 → Z, 26 → AA.

colToLetter(c) = if c < 26: String.fromCharCode(65 + c), else: colToLetter(⌊c / 26⌋ − 1) + String.fromCharCode(65 + (c mod 26))
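The same conversion can be written iteratively. The sketch below is one possible form, equivalent to the recursive definition; the − 1 on each step is what makes this a "bijective" base-26 system, in which there is no zero digit and AA follows directly after Z.

```javascript
// Convert a zero-based column index to spreadsheet column letters.
function colToLetter(c) {
  let letters = "";
  // Peel off the last letter, then shift to the next position.
  for (let n = c; n >= 0; n = Math.floor(n / 26) - 1) {
    letters = String.fromCharCode(65 + (n % 26)) + letters;
  }
  return letters;
}

// colToLetter(0) is "A", colToLetter(25) is "Z", colToLetter(26) is "AA",
// and colToLetter(16383) is "XFD", the last column the XLSX format allows.
```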

Where c = zero-based column index. The ZIP local file header uses signature 0x04034b50, with CRC-32 computed per ISO 3309 for each file entry.
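For illustration, a table-driven CRC-32 and the little-endian encoding of the header signature might look like the sketch below. It uses the reflected polynomial 0xEDB88320, which is the standard CRC-32 variant used by ZIP; the function names are assumptions, not the tool's own.

```javascript
// Precompute the 256-entry lookup table for the reflected polynomial 0xEDB88320.
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  }
  CRC_TABLE[n] = c;
}

// Compute the CRC-32 of a byte sequence (for example a Uint8Array).
function crc32(bytes) {
  let crc = 0xffffffff;
  for (const b of bytes) {
    crc = CRC_TABLE[(crc ^ b) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0; // force an unsigned 32-bit result
}

// Hypothetical helper: the local file header signature as it appears on disk.
// ZIP fields are little-endian, so 0x04034b50 is written as 50 4B 03 04 ("PK\x03\x04").
function localHeaderSignature() {
  const view = new DataView(new ArrayBuffer(4));
  view.setUint32(0, 0x04034b50, true);
  return new Uint8Array(view.buffer);
}
```

A common sanity check is the ASCII string "123456789", whose CRC-32 is 0xCBF43926.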

Reference Data

Feature	CSV Output	XLSX Output
File Format	Plain text (RFC 4180)	OOXML ZIP archive
Excel Compatible	Yes (with BOM)	Yes (native)
Google Sheets Compatible	Yes	Yes
LibreOffice Compatible	Yes	Yes
Unicode Support	UTF-8 with BOM	UTF-8 XML
Multiple Sheets	No (single file per table)	Single sheet per file
Colspan Handling	Empty cells inserted	Empty cells inserted
Rowspan Handling	Empty cells inserted	Empty cells inserted
Cell Formatting	None (text only)	None (text only)
Max Size (practical)	Limited by browser memory	~100000 cells (browser memory)
Max Columns (XLSX spec)	N/A	16384 (XFD)
Delimiter	Comma (,)	N/A (XML cells)
Quoting Rule	Double-quote if field contains , " or newline	N/A
File Extension	.csv	.xlsx
MIME Type	text/csv	application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Nested Tables	Flattened to text	Flattened to text
Header Detection	<thead> / <th> treated as regular cells	Same
Empty Cells	Empty field (,,)	Omitted cell element (Excel reads as blank)
Line Endings	CRLF (per RFC 4180)	N/A

Frequently Asked Questions

The converter builds a fill-grid: a 2D matrix sized to the effective row × column dimensions of the table. When a cell has colspan=3 and rowspan=2, the algorithm writes the cell text to the top-left position and fills the remaining 5 positions with empty strings. This preserves column alignment in the output. The original merged cell's text appears once; it is not duplicated across the spanned region.

Excel on Windows defaults to the system locale encoding (often Windows-1252) when opening CSV files. This converter prepends a UTF-8 Byte Order Mark (BOM: EF BB BF) to signal UTF-8 encoding to Excel. If characters still appear broken, use Excel's Data → From Text/CSV import wizard and explicitly select UTF-8 (65001) as the encoding. The XLSX format avoids this issue entirely since it uses XML with declared UTF-8 encoding.
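A minimal sketch of the BOM step, assuming the CSV text is already available as a string (the function name encodeCsvWithBom is made up for this example):

```javascript
// Encode CSV text as UTF-8 bytes with a leading byte order mark.
function encodeCsvWithBom(csvText) {
  // TextEncoder always emits UTF-8; the character U+FEFF encodes to EF BB BF.
  return new TextEncoder().encode("\uFEFF" + csvText);
}

const bytes = encodeCsvWithBom("caf\u00e9,1\r\n");
// The first three bytes are 0xEF, 0xBB, 0xBF; the CSV content follows.
```

In a browser, the resulting bytes could then be wrapped for download with new Blob([bytes], { type: "text/csv;charset=utf-8" }).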

Nested tables (a <table> element inside a <td>) are flattened: the inner table's text content is extracted as plain text and concatenated into the parent cell's value. The structural rows and columns of the inner table are lost. If you need to preserve nested table structure, extract each table separately - the converter lists all tables found in the HTML and lets you select which one to export.

Practical limits depend on available browser memory. Tables up to approximately 100,000 cells (e.g., 1000 rows × 100 columns) convert reliably. Beyond that, XLSX generation may cause memory pressure because the ZIP archive is assembled in memory as an ArrayBuffer. CSV output is more lightweight and can handle larger datasets. If the browser tab crashes, reduce the table size or split it into multiple exports.

It is a genuine OOXML spreadsheet. The tool constructs a valid ZIP archive containing the required XML parts: [Content_Types].xml, workbook.xml, sheet1.xml, sharedStrings.xml, styles.xml, and relationship files. Excel, Google Sheets, and LibreOffice all open it natively. Unlike some exporters that wrap HTML in an .xls extension (which triggers compatibility warnings), this produces a standards-compliant .xlsx binary.
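To make the archive contents more concrete, the sketch below lists the part paths conventionally used for a one-sheet workbook and builds the sheet and shared-string XML fragments for a small grid. The exact paths and markup this tool emits are not stated here, so treat the names (XLSX_PARTS, buildSheetData) and layout as assumptions based on the common OOXML conventions.

```javascript
// Conventional part paths for a minimal single-sheet workbook (assumed layout).
const XLSX_PARTS = [
  "[Content_Types].xml",
  "_rels/.rels",
  "xl/workbook.xml",
  "xl/_rels/workbook.xml.rels",
  "xl/worksheets/sheet1.xml",
  "xl/sharedStrings.xml",
  "xl/styles.xml",
];

function escapeXml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Zero-based column index to letters (0 is A, 26 is AA).
function colToLetter(c) {
  let letters = "";
  for (let n = c; n >= 0; n = Math.floor(n / 26) - 1) {
    letters = String.fromCharCode(65 + (n % 26)) + letters;
  }
  return letters;
}

// Hypothetical helper: build the <sheetData> body and the shared-string items.
function buildSheetData(grid) {
  const strings = []; // unique strings in first-seen order
  const index = new Map(); // string -> position in `strings`
  const rows = grid.map((row, r) => {
    const cells = row.map((text, c) => {
      if (text === "") return ""; // empty cells are omitted entirely
      if (!index.has(text)) {
        index.set(text, strings.length);
        strings.push(text);
      }
      // t="s" marks the value as an index into the shared-string table.
      return `<c r="${colToLetter(c)}${r + 1}" t="s"><v>${index.get(text)}</v></c>`;
    });
    return `<row r="${r + 1}">${cells.join("")}</row>`;
  });
  const items = strings
    .map((s) => `<si><t xml:space="preserve">${escapeXml(s)}</t></si>`)
    .join("");
  return { sheetData: `<sheetData>${rows.join("")}</sheetData>`, sharedStrings: items };
}
```

Because repeated text is stored once and referenced by index, a table with many identical cell values adds little to sharedStrings.xml even without compression.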

The parser first scans all rows to compute the effective column count by summing colspan values per row and taking the maximum. The fill-grid is then pre-allocated to this width. Rows with fewer cells than the maximum simply leave trailing grid positions empty, which become blank cells in the output. This matches how browsers render ragged tables.
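The width calculation described above can be sketched as a simple reduction. The input shape (rows of objects with an optional colspan) and the function name are assumptions for illustration. Note that a per-row colspan sum does not count columns occupied by a rowspan carried down from an earlier row, so a fuller implementation may still need to widen the grid during the fill step.

```javascript
// Hypothetical helper: the widest row, measured as the sum of its colspans.
function effectiveColumnCount(rows) {
  return rows.reduce((max, cells) => {
    const width = cells.reduce((sum, cell) => sum + (cell.colspan || 1), 0);
    return Math.max(max, width);
  }, 0);
}

const width = effectiveColumnCount([
  [{ colspan: 3 }],     // one cell spanning 3 columns
  [{}, {}],             // ragged row with only 2 cells
  [{}, {}, {}, {}],     // 4 plain cells
]);
// width is 4, so the 2-cell row ends with two blank grid positions.
```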