About

In the high-stakes world of Search Engine Optimization (SEO) and data mining, the precision of data extraction is paramount. Manual link auditing is error-prone and inefficient. This Link Extractor is engineered to eliminate noise, utilizing RFC-compliant heuristic patterns to identify HTTP/HTTPS resources, FTP endpoints, and email addresses from massive unstructured datasets.

Accuracy matters. A broken link or a missed redirect can cost valuable link equity. This tool does not merely regex-match; it normalizes paths, detects duplicates, and offers granular filtering (by extension, protocol, or domain). It addresses the common pitfall of phantom characters attached to URLs during copy-paste operations, ensuring the output is strictly valid and ready for server-side analysis or reporting.
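As an illustration of that cleanup step, the sketch below (with a hypothetical trimPhantomChars helper; the tool's actual internals are not published) strips trailing punctuation and quote characters that commonly cling to pasted URLs:

```typescript
// Strip "phantom" characters that often cling to URLs after copy-paste:
// trailing punctuation, closing brackets, and smart quotes.
function trimPhantomChars(candidate: string): string {
  // Characters that are rarely a legitimate final character of a URL.
  return candidate.replace(/[.,;:!?)\]}>'"»”’]+$/u, "");
}

console.log(trimPhantomChars("https://example.com/page)."));
// -> "https://example.com/page"
```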

Tags: url-parser, link-scraper, seo-audit, regex-tool, href-extractor

Formulas

The core extraction logic relies on a composite regular-expression pattern designed to capture RFC 3986-style URIs. The simplified logic flow is represented below:

Match P ≡ (Protocol) ∨ (www.)
Match D ≡ [a-z0-9.-]+     (Domain)
Match T ≡ \.[a-z]{2,63}   (TLD)
Match R ≡ [:/?#][^\s]*    (Resource Path)
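Assembled into a single pattern, those four components might look like the following sketch; treat it as an illustrative approximation, not necessarily the tool's exact expression:

```typescript
// Composite pattern, assembled per the formula above:
//   P: protocol or "www." prefix; D: domain labels; T: TLD (2-63 chars);
//   R: optional port/path/query/fragment, stopping at whitespace and quotes.
const LINK_PATTERN =
  /(?:https?:\/\/|ftp:\/\/|www\.)[a-z0-9.-]+\.[a-z]{2,63}(?:[:\/?#][^\s'"<>]*)?/gi;

const text = "Visit https://example.com/a?b=1 or www.test.org today.";
console.log(text.match(LINK_PATTERN));
// -> ["https://example.com/a?b=1", "www.test.org"]
```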

The extraction efficiency E is calculated as the ratio of unique valid links to the total candidate strings found:

E = |L_unique| / |L_total|

where L represents the set of extracted entities. Deduplication is handled via Set Theory logic:

S_clean = { x ∈ L | ∀ y ∈ S_clean \ {x}, x ≠ y }
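In code, that deduplication and the efficiency ratio E reduce to a few lines. The sketch below uses a JavaScript Set and compares verbatim strings, whereas the real pipeline presumably normalizes before comparing:

```typescript
// Deduplicate candidates and compute efficiency E = |L_unique| / |L_total|.
function dedupeAndScore(candidates: string[]): { unique: string[]; efficiency: number } {
  const unique = [...new Set(candidates)]; // Set membership enforces x ≠ y
  const efficiency = candidates.length === 0 ? 0 : unique.length / candidates.length;
  return { unique, efficiency };
}

const { unique, efficiency } = dedupeAndScore([
  "https://a.com", "https://b.com", "https://a.com",
]);
console.log(unique.length, efficiency); // -> 2 0.6666666666666666
```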

Reference Data

Category        | Extensions / Patterns                                    | MIME Type Estimate                  | Usage Context
----------------|----------------------------------------------------------|-------------------------------------|-----------------------------------
Raster Images   | .jpg, .jpeg, .png, .gif, .webp, .bmp, .ico               | image/*                             | Visual assets, favicons, banners.
Vector & Design | .svg, .eps, .ai, .psd, .cdr                              | image/svg+xml, application/*        | Logos, source design files.
Documents       | .pdf, .doc, .docx, .xls, .xlsx, .ppt, .pptx, .txt, .rtf  | application/pdf, application/msword | Whitepapers, reports, data sheets.
Web Code        | .html, .htm, .php, .asp, .jsp, .css, .js, .json, .xml    | text/html, application/json         | Source code, stylesheets, scripts.
Archives        | .zip, .rar, .7z, .tar, .gz, .iso                         | application/zip                     | Software packages, backups.
Audio           | .mp3, .wav, .aac, .ogg, .flac, .m4a                      | audio/*                             | Podcasts, music, sound effects.
Video           | .mp4, .avi, .mov, .mkv, .webm, .flv                      | video/*                             | Streaming content, movies.
Protocols       | http://, https://, ftp://, sftp://, mailto:, tel:        | N/A                                 | Communication & transfer layers.
Executables     | .exe, .msi, .dmg, .apk, .bat, .sh                        | application/octet-stream            | Installers, scripts (high risk).
Data/Font       | .csv, .sql, .ttf, .otf, .woff, .woff2                    | text/csv, font/*                    | Database dumps, typography.
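As a rough illustration of how an extracted link could be bucketed against this table, the sketch below maps a file extension to a category; the lookup table is abbreviated and the default bucket is an assumption, not documented behavior:

```typescript
// Map a URL's file extension to one of the reference categories above
// (abbreviated: a real table would list every extension per row).
const EXTENSION_CATEGORIES: Record<string, string> = {
  jpg: "Raster Images", png: "Raster Images", svg: "Vector & Design",
  pdf: "Documents", zip: "Archives", mp3: "Audio", mp4: "Video",
  exe: "Executables", csv: "Data/Font", woff2: "Data/Font",
};

function categorize(url: string): string {
  // Take the extension from the pathname, ignoring query strings.
  const path = new URL(url).pathname;
  const ext = path.includes(".") ? path.split(".").pop()!.toLowerCase() : "";
  return EXTENSION_CATEGORIES[ext] ?? "Web Code"; // assumed default for pages/scripts
}

console.log(categorize("https://example.com/report.pdf?v=2")); // -> "Documents"
```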

Frequently Asked Questions

Why does the health check fail for links that open fine in my browser?

Browsers enforce Cross-Origin Resource Sharing (CORS) policies. When you click "Check Health", the tool sends a HEAD request. If the target server (e.g., Google, Facebook) blocks direct browser requests, the check returns "Network Error" or "Restricted". This does not necessarily mean the link is broken (404), only that it cannot be verified client-side without a backend proxy. We flag these as "Unknown/CORS".
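Client-side, that health check reduces to something like the following sketch; note that fetch() rejects on CORS blocks and network failures alike, which is exactly why the two cases cannot be told apart in the browser:

```typescript
// Probe a link with a HEAD request; rejections are not treated as 404s.
async function checkHealth(url: string): Promise<string> {
  try {
    const res = await fetch(url, { method: "HEAD" });
    return res.ok ? "OK" : `HTTP ${res.status}`;
  } catch {
    // fetch() throws the same way for CORS blocks and real network errors,
    // so client-side we cannot distinguish "blocked" from "down".
    return "Unknown/CORS";
  }
}

checkHealth("https://example.com").then(console.log);
```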
Can I paste raw HTML source code?

Yes. The regex engine is markup-agnostic. If you paste raw HTML (e.g., <a href='...'>), it extracts the URL inside the quotes, ignoring the HTML tags and focusing purely on the URL structure (http/https/www).
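To see why the markup is irrelevant, consider running the illustrative pattern from the Formulas section over an HTML fragment:

```typescript
// Same illustrative pattern as sketched in the Formulas section.
const LINK_PATTERN =
  /(?:https?:\/\/|ftp:\/\/|www\.)[a-z0-9.-]+\.[a-z]{2,63}(?:[:\/?#][^\s'"<>]*)?/gi;

const html = `<a href='https://example.com/page'>Example</a>
              <img src="https://cdn.example.com/x.png">`;
// The pattern keys on the URL structure itself, so the surrounding
// tags and quote style never enter into the match.
console.log(html.match(LINK_PATTERN));
// -> ["https://example.com/page", "https://cdn.example.com/x.png"]
```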
Does the tool remove tracking parameters such as utm_source?

By default, the tool extracts the exact URL found. However, if you enable the "Clean UTM/Ref" option in the settings panel, the tool parses the query string and strips standard marketing parameters (utm_source, ref, etc.) before deduplication, ensuring you get the canonical URL.
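The stripping step can be expressed with the standard WHATWG URL API; the parameter list below is an assumption, since the tool's exact list is not documented:

```typescript
// Remove common tracking parameters before deduplication (assumed list).
const TRACKING_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "ref",
];

function canonicalize(link: string): string {
  const url = new URL(link);
  for (const param of TRACKING_PARAMS) url.searchParams.delete(param);
  return url.toString();
}

console.log(canonicalize("https://example.com/?utm_source=x&id=7"));
// -> "https://example.com/?id=7"
```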
Why are email addresses missing from my results?

The extractor prioritizes web resources (http/https). To extract emails, ensure the "Include Emails (mailto)" filter is checked. The system uses an RFC 5322-based regex for email detection, which looks for the "@" symbol flanked by valid local-part and domain characters.
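A fully RFC 5322-compliant regex is notoriously long, so a pragmatic approximation of the kind described ("@" flanked by valid characters) might look like this sketch:

```typescript
// Pragmatic email matcher: local part, "@", domain labels, TLD.
const EMAIL_PATTERN = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,63}/gi;

const sample = "Contact sales@example.com or support@sub.example.org.";
console.log(sample.match(EMAIL_PATTERN));
// -> ["sales@example.com", "support@sub.example.org"]
```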
Are internationalized domain names (IDNs) supported?

Partially. Standard Punycode (xn--) domains are fully supported, but raw UTF-8 characters in domains (e.g., http://сайт.рф) are complex to validate via regex alone. The tool attempts to capture them when they follow standard URI percent-encoding formats.
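One common approach, sketched below, is to let the WHATWG URL parser normalize an IDN to its Punycode form before regex validation; this shows the approach, not necessarily what the tool does internally:

```typescript
// WHATWG URL converts Unicode hostnames to their ASCII (Punycode) form,
// which a regex layer can then validate like any other domain.
const idn = new URL("http://сайт.рф/path");
console.log(idn.hostname); // -> "xn--80aswg.xn--p1ai"
console.log(idn.href);     // -> "http://xn--80aswg.xn--p1ai/path"
```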