About

This is not just a converter; it is a forensic data reconstruction engine designed for financial analysts, accountants, and data engineers. Standard PDF converters often fail because they treat text as a linear stream, ignoring the spatial relationships that define a table. When a bank statement or invoice is saved as a PDF, the grid structure is lost, replaced by absolute x,y coordinates.

Our tool uses a Projection Profile Algorithm combined with a large heuristic database. It scans the document for vertical whitespace rivers to identify column boundaries and for horizontal text densities to define rows. Unlike simple parsers, it includes an integrated OCR Correction Layer that fixes common scanning errors (such as the letter 'O' appearing in place of the digit '0', or 'l' in place of '1') based on column context. The system performs real-time type inference to distinguish between Dates, Currencies, and Strings, so that the Excel file you download is ready for immediate calculation via Pivot Tables or VLOOKUP.

Formulas

The core logic uses a density-based clustering approach to reconstruct the grid. For each horizontal position, we measure how much of the vertical extent at that position is whitespace.

1. Column Probability Density:

Let W(x, y) be the whitespace function, where 1 is empty space and 0 is text.

$$\mathrm{Gap}(i) = \int_0^h W(x_i, y)\,dy$$

If Gap(i) > threshold, a column divider is placed at index i.
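The rule above can be sketched in discrete form, with the integral replaced by a sum over rows. This is a minimal illustration, not the tool's actual implementation: the character-grid input, the helper names `text_to_grid` and `column_gaps`, and the threshold value are all assumptions made for the example.

```python
def text_to_grid(lines):
    """Build W(x, y) from text lines: 1 for a space, 0 for any other character.

    Hypothetical helper: lines shorter than the widest one are padded with
    spaces so that every row has the same width.
    """
    width = max(len(line) for line in lines)
    return [[1 if ch == " " else 0 for ch in line.ljust(width)] for line in lines]


def column_gaps(grid, threshold):
    """Return every x index i where Gap(i) > threshold.

    Gap(i) is computed as the sum of W(x_i, y) over all rows y, which is the
    discrete counterpart of the integral in the formula.
    """
    height = len(grid)
    width = len(grid[0]) if grid else 0
    gap = [sum(grid[y][x] for y in range(height)) for x in range(width)]
    return [x for x in range(width) if gap[x] > threshold]


# Three rows whose only shared whitespace column is index 5.
sample = ["Date  Amount", "01/02  12.50", "03/04 100.00"]
dividers = column_gaps(text_to_grid(sample), threshold=2)
```

With a threshold of 2 and three rows, only positions that are empty in every row qualify, so `dividers` holds the single index 5. Lowering the threshold would also admit positions that are empty in just some rows.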

2. OCR Correction Probability:

For a cell value v, if the column type T is numeric (N), we apply a transformation map M:

$$v' = \begin{cases} \mathrm{sub}(v, \text{'O'}, \text{'0'}) & \text{if } T \in N \\ v & \text{otherwise} \end{cases}$$

This ensures that strictly numeric columns do not contain alphanumeric noise that breaks Excel formulas.
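As a rough sketch of the piecewise rule above, the function below applies the substitution only when the column type is numeric. The name `correct_cell` and the string labels used for column types are assumptions for illustration, and the map M contains only the single 'O' to '0' pair that the formula shows.

```python
# Transformation map M, limited here to the one pair given in the formula.
M = str.maketrans({"O": "0"})


def correct_cell(value, column_type):
    """Return v' for a cell value v, given its column type T.

    Hypothetical convention: column_type is the string "numeric" when T is
    in N. Any other label leaves the value untouched.
    """
    if column_type == "numeric":
        return value.translate(M)
    return value


fixed = correct_cell("1O5.OO", "numeric")   # the letter O becomes the digit 0
kept = correct_cell("OSLO", "string")       # non-numeric column, left unchanged
```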

Reference Data

Feature	Standard Converters	Our Heuristic Engine	Mathematical Basis
Spatial Logic	Linear Text Stream	2D Grid Projection	P(x) > threshold
Data Cleaning	None (Raw Strings)	Context-Aware Sanitization	f: S → R
OCR Repair	Manual Fix Required	Auto-Correction Dictionary	Probabilistic Mapping
Multi-Page	Separate Sheets	Intelligent Stitching	Header Pattern Matching
Number Parsing	Text (e.g. "1,200 cr")	Floats (-1200.00)	Regex Tokenizer
Dates	Unrecognized	ISO-8601 Normalization	DD/MM ↔ MM/DD
Confidence	Binary (Hit/Miss)	Heatmap Visualization	Low-Confidence Highlighting
Security	Server Upload	100% Client-Side (Sandboxed)	Local Memory Only
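The Number Parsing row in the table can be illustrated with a small regex-based parser. This is a sketch under stated assumptions rather than the tool's tokenizer: the pattern, the name `parse_amount`, and the choice to treat a trailing "cr" as a negative sign are inferred only from the table's single example ("1,200 cr" becoming -1200.00).

```python
import re

# Optional leading minus, digits with optional thousands separators,
# optional decimal part, optional trailing "cr" marker.
_AMOUNT = re.compile(
    r"^\s*(?P<sign>-)?\s*"
    r"(?P<digits>[0-9]{1,3}(?:,[0-9]{3})*|[0-9]+)"
    r"(?P<frac>\.[0-9]+)?"
    r"\s*(?P<suffix>cr)?\s*$",
    re.IGNORECASE,
)


def parse_amount(text):
    """Convert a statement-style amount to a float, or return None if it does not match."""
    m = _AMOUNT.match(text)
    if m is None:
        return None
    value = float(m.group("digits").replace(",", "") + (m.group("frac") or ""))
    # Assumption: both a leading "-" and a trailing "cr" mean a negative amount.
    negative = m.group("sign") is not None or m.group("suffix") is not None
    return -value if negative else value


amount = parse_amount("1,200 cr")
```

Returning `None` for unmatched text, instead of raising, lets a caller leave such cells as plain strings.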

Frequently Asked Questions

Direct PDF parsing can be hit-or-miss because PDFs use many different internal font encodings (Identity-H, CID, etc.). By copying text directly from the PDF viewer (Ctrl+A, Ctrl+C) and pasting it into our "Deep Parser", you rely on the viewer and the operating system's clipboard to decode the text, which often resolves encoding issues before our algorithms even touch the data.

We maintain a massive internal dictionary of "look-alike" characters. If the tool detects a column is 90% numeric, it assumes the remaining 10% are OCR errors. For example, it will automatically convert a lowercase "l" to the number "1", or a capital "S" to "5", saving hours of manual cleanup.
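The column-level rule described above could look roughly like the following. It is a minimal sketch, not the tool's dictionary: the three look-alike pairs are the ones mentioned on this page, while the function names, the numeric-cell pattern, and the reading of "90% numeric" as "at least 90% of cells" are assumptions.

```python
import re

# Look-alike pairs named on this page; the real dictionary is said to be larger.
LOOKALIKES = str.maketrans({"l": "1", "S": "5", "O": "0"})

# Assumed definition of a numeric cell: digits, optional commas, optional decimals.
_NUMERIC = re.compile(r"[0-9][0-9,]*(\.[0-9]+)?")


def is_numeric(cell):
    return _NUMERIC.fullmatch(cell) is not None


def repair_column(cells, min_numeric_share=0.9):
    """Replace look-alike characters in the non-numeric cells of a mostly numeric column."""
    if not cells:
        return []
    numeric_count = sum(1 for cell in cells if is_numeric(cell))
    if numeric_count / len(cells) < min_numeric_share:
        return list(cells)  # not numeric enough: leave the column untouched
    return [cell if is_numeric(cell) else cell.translate(LOOKALIKES) for cell in cells]


column = ["10", "20", "30", "40", "50", "60", "70", "80", "90", "l5"]
repaired = repair_column(column)  # nine of ten cells are numeric, so "l5" is repaired
```

A column of free text such as `["abc", "Sl", "12"]` falls below the threshold and is returned unchanged, which keeps names and descriptions from being corrupted.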

Absolutely. This application runs entirely in your browser's "Sandbox". No data leaves your device. You can verify this by loading the tool, turning off your Wi-Fi, and processing your documents offline. It requires no server connection to function.

We support "Sanitized CSV" (Standard), "Raw CSV" (Unmodified), "TSV" (Tab Separated for old systems), and "HTML-XLS" (which preserves basic formatting for Excel). We also offer a JSON export for developers.

Use the "Settings" panel to define your source Date Locale. The tool differentiates between 10/02 (October 2nd - US) and 10/02 (February 10th - EU) based on this setting. We normalize everything to ISO-8601 (YYYY-MM-DD) before export to ensure Excel interprets it correctly.
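The locale-dependent normalization can be sketched with Python's standard `datetime` module. The locale keys "US" and "EU", the name `to_iso`, and the presence of a four-digit year in the input are assumptions for this example. The page's own "10/02" example omits the year, and the sketch does not guess one.

```python
from datetime import datetime

# Hypothetical mapping from the source Date Locale setting to a parse format.
FORMATS = {
    "US": "%m/%d/%Y",  # month first
    "EU": "%d/%m/%Y",  # day first
}


def to_iso(text, locale):
    """Parse a slash-separated date under the given locale and return YYYY-MM-DD."""
    return datetime.strptime(text, FORMATS[locale]).date().isoformat()


us_date = to_iso("10/02/2024", "US")  # read as October 2nd
eu_date = to_iso("10/02/2024", "EU")  # read as February 10th
```

The same input string yields two different ISO dates, which is why the locale has to come from a setting and cannot be inferred reliably from an ambiguous value.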