About

Optical Character Recognition (OCR) converts static images into editable text formats. Most commercial tools upload documents to remote servers for processing. This creates a security vulnerability when handling sensitive data like identification cards or legal contracts. This tool mitigates that risk by executing the extraction engine directly within the browser client. No data leaves the local device environment.

Accuracy in OCR relies heavily on input quality. Shadows, low contrast, and skewing significantly degrade performance. The integrated pre-processing engine allows users to manipulate the image histogram before extraction. Converting color channels to grayscale and applying binary thresholding isolates glyphs from background noise. This step is critical for legacy documents or scans with poor lighting conditions.
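
The grayscale step described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the helper name `toGrayscale` is hypothetical, the input is assumed to be RGBA bytes laid out as in the browser's `ImageData.data`, and the Rec. 601 luminance weights are an assumption, since the weighting the tool uses is not stated.

```typescript
// Hypothetical helper: collapse RGBA pixel data (4 bytes per pixel, as in
// ImageData.data) into one luminance byte per pixel.
// Assumption: Rec. 601 weights; the tool's real weighting is not documented.
function toGrayscale(rgba: Uint8ClampedArray): Uint8ClampedArray {
  const gray = new Uint8ClampedArray(rgba.length / 4);
  for (let i = 0; i < gray.length; i++) {
    const r = rgba[4 * i];
    const g = rgba[4 * i + 1];
    const b = rgba[4 * i + 2];
    // The alpha channel (index 4 * i + 3) is ignored in this sketch.
    gray[i] = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
  }
  return gray;
}
```

Working on a single channel keeps the later thresholding step to one comparison per pixel.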

Formulas

The core mechanism for separating text from the background involves image binarization. We utilize a thresholding function to convert the pixel array into binary data.

$$
P(x,y) =
\begin{cases}
255 & \text{if } I(x,y) \ge T \\
0 & \text{if } I(x,y) < T
\end{cases}
$$

Where I(x,y) is the intensity of the pixel at coordinates (x, y) and T is the user-defined threshold value.
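
As a rough sketch, the thresholding rule can be applied to a flat array of 8-bit intensities. The function name `binarize` and the flat-array layout are assumptions made for illustration; only the comparison against T comes from the definition above.

```typescript
// Hypothetical helper: apply the threshold rule to every pixel.
// `gray` holds one 8-bit intensity I(x,y) per pixel; `threshold` is T.
function binarize(gray: Uint8ClampedArray, threshold: number): Uint8ClampedArray {
  const out = new Uint8ClampedArray(gray.length);
  for (let i = 0; i < gray.length; i++) {
    // P = 255 (white) when I >= T, otherwise 0 (black).
    out[i] = gray[i] >= threshold ? 255 : 0;
  }
  return out;
}
```

For example, with T = 128, the intensities 0, 127, 128, and 255 map to 0, 0, 255, and 255: a pixel exactly at the threshold becomes white.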

Reference Data

Factor	Ideal Value / State	Impact on Confidence	Correction Method
Resolution (DPI)	300 dpi	High	Rescan source document
Text Size	10-12 pt (min)	Critical	Crop and scale
Skew Angle	±0.5°	Severe	Deskew algorithm
Binarization	Black & White	High	Threshold filter
Noise Level	Zero Salt/Pepper	Moderate	Median blur filter
Font Type	Sans-Serif	Variable	N/A (Engine limitation)
Contrast Ratio	7:1	High	Histogram equalization
Language Pack	Matching Source	Critical	Load correct Tesseract data
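
The median blur filter listed as the correction for salt-and-pepper noise could look roughly like the sketch below. It assumes a 3x3 window and edge replication at the borders; the function name `medianBlur3x3` is hypothetical, and the window size and border handling the tool actually uses are not stated.

```typescript
// Hypothetical helper: 3x3 median filter over a grayscale image stored
// row by row. Assumption: border pixels reuse the nearest valid neighbor.
function medianBlur3x3(
  gray: Uint8ClampedArray,
  width: number,
  height: number
): Uint8ClampedArray {
  const out = new Uint8ClampedArray(gray.length);
  const window: number[] = new Array(9).fill(0);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let k = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          // Clamp coordinates so the window never reads outside the image.
          const yy = Math.min(height - 1, Math.max(0, y + dy));
          const xx = Math.min(width - 1, Math.max(0, x + dx));
          window[k++] = gray[yy * width + xx];
        }
      }
      // The median is the middle of the nine sorted values.
      window.sort((a, b) => a - b);
      out[y * width + x] = window[4];
    }
  }
  return out;
}
```

Because an isolated black or white speck is never the middle value of its neighborhood, it is replaced by the surrounding intensity while edges stay comparatively sharp.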

Frequently Asked Questions

Handwriting lacks the uniform geometric structure of printed typefaces. The Tesseract engine used here is trained primarily on standard fonts (such as Arial and Times New Roman). It can often interpret block capitals, but cursive or inconsistent handwriting usually results in low confidence scores.

No image data is uploaded. The entire process runs inside a Web Worker in your browser. The network tab in your developer tools will confirm that no image data is sent to a remote server. The only network activity is the initial download of the language training data files.

Thermal paper receipts often fade or have low contrast. Use the "Binarization" toggle in the settings panel. Adjust the threshold slider until the text appears clearly black against a purely white background. This removes the grey noise that confuses the OCR algorithm.

The browser allocates a specific amount of memory for the WebAssembly heap. Generally, images up to 4000x4000 pixels process efficiently. Exceeding this may cause the browser tab to crash due to memory exhaustion during the tensor operations.
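
One way to stay under that limit is to compute a downscale factor before handing the image to the engine. The sketch below is illustrative only: `fitWithin` is a hypothetical helper, and the 4000-pixel default simply mirrors the guideline above rather than a hard limit of any browser.

```typescript
// Hypothetical helper: compute target dimensions so that neither side
// exceeds `maxSide`, preserving the aspect ratio. Never upscales.
function fitWithin(
  width: number,
  height: number,
  maxSide: number = 4000
): { width: number; height: number; scale: number } {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
    scale,
  };
}
```

An 8000 x 6000 scan would be reduced to 4000 x 3000 (scale 0.5), while a 1200 x 800 image is left unchanged. Note that downscaling also lowers the effective DPI, so it trades memory safety against recognition accuracy.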

Misread characters usually occur when the engine has low confidence in a specific glyph. It often happens with similar characters like "l" (lowercase L) and "1" (one). Increasing the DPI of the source image or using the "Contrast" pre-processing filter helps the engine distinguish these edges.
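
If the recognition result exposes a confidence score per word, doubtful words can be flagged for manual review. The sketch below assumes a hypothetical `RecognizedWord` shape with a 0 to 100 confidence scale and an arbitrary cutoff of 60; neither the shape nor the cutoff is taken from this tool.

```typescript
// Hypothetical result shape: one entry per recognized word.
// Assumption: confidence is reported on a 0-100 scale.
interface RecognizedWord {
  text: string;
  confidence: number;
}

// Return the words whose confidence falls below the cutoff so they can be
// highlighted for a human to check (for example "l" versus "1").
function flagLowConfidence(
  words: RecognizedWord[],
  cutoff: number = 60
): RecognizedWord[] {
  return words.filter((w) => w.confidence < cutoff);
}
```

The cutoff is a tuning choice: a higher value flags more words for review, a lower one flags only the most doubtful.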