About

In the high-stakes world of Search Engine Optimization (SEO) and data mining, the precision of data extraction is paramount. Manual link auditing is error-prone and inefficient. This Link Extractor is engineered to eliminate noise, utilizing RFC-compliant heuristic patterns to identify HTTP/HTTPS resources, FTP endpoints, and email addresses from massive unstructured datasets.

Accuracy matters. A broken link or a missed redirect can cost valuable link equity. This tool does not merely regex-match; it normalizes paths, detects duplicates, and offers granular filtering (by extension, protocol, or domain). It addresses the common pitfall of phantom characters attached to URLs during copy-paste operations, ensuring the output is strictly valid and ready for server-side analysis or reporting.
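As an illustration of that cleanup step, the sketch below (with a hypothetical trimPhantomChars helper; the tool's actual internals are not published) strips trailing punctuation and quote characters that commonly cling to pasted URLs:

```typescript
// Strip "phantom" characters that often cling to URLs after copy-paste:
// trailing punctuation, closing brackets, and smart quotes.
function trimPhantomChars(candidate: string): string {
  // Characters that are rarely a legitimate final character of a URL.
  return candidate.replace(/[.,;:!?)\]}>'"»”’]+$/u, "");
}

console.log(trimPhantomChars("https://example.com/page)."));
// -> "https://example.com/page"
```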

Tags: url-parser, link-scraper, seo-audit, regex-tool, href-extractor

Formulas

The core extraction logic relies on a composite regular-expression pattern designed to capture RFC 3986-style URIs. The simplified logic flow is represented below:

Match P ≡ (Protocol) ∨ (www.)
Match D ≡ [a-z0-9.-]+     (Domain)
Match T ≡ \.[a-z]{2,63}   (TLD)
Match R ≡ [:/?#][^\s]*    (Resource Path)
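Assembled into a single pattern, those four components might look like the following sketch; treat it as an illustrative approximation, not necessarily the tool's exact expression:

```typescript
// Composite pattern, assembled per the formula above:
//   P: protocol or "www." prefix; D: domain labels; T: TLD (2-63 chars);
//   R: optional port/path/query/fragment, stopping at whitespace and quotes.
const LINK_PATTERN =
  /(?:https?:\/\/|ftp:\/\/|www\.)[a-z0-9.-]+\.[a-z]{2,63}(?:[:\/?#][^\s'"<>]*)?/gi;

const text = "Visit https://example.com/a?b=1 or www.test.org today.";
console.log(text.match(LINK_PATTERN));
// -> ["https://example.com/a?b=1", "www.test.org"]
```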

The extraction efficiency E is calculated as the ratio of unique valid links to the total candidate strings found:

E = |L_unique| / |L_total|

where L represents the set of extracted entities. Deduplication is handled via Set Theory logic:

S_clean = { x ∈ L | ∀ y ∈ S_clean \ {x}, x ≠ y }
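In code, that deduplication and the efficiency ratio E reduce to a few lines. The sketch below uses a JavaScript Set and compares verbatim strings, whereas the real pipeline presumably normalizes before comparing:

```typescript
// Deduplicate candidates and compute efficiency E = |L_unique| / |L_total|.
function dedupeAndScore(candidates: string[]): { unique: string[]; efficiency: number } {
  const unique = [...new Set(candidates)]; // Set membership enforces x ≠ y
  const efficiency = candidates.length === 0 ? 0 : unique.length / candidates.length;
  return { unique, efficiency };
}

const { unique, efficiency } = dedupeAndScore([
  "https://a.com", "https://b.com", "https://a.com",
]);
console.log(unique.length, efficiency); // -> 2 0.6666666666666666
```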

Reference Data

Category        | Extensions / Patterns                                    | MIME Type Estimate                  | Usage Context
----------------|----------------------------------------------------------|-------------------------------------|-----------------------------------
Raster Images   | .jpg, .jpeg, .png, .gif, .webp, .bmp, .ico               | image/*                             | Visual assets, favicons, banners.
Vector & Design | .svg, .eps, .ai, .psd, .cdr                              | image/svg+xml, application/*        | Logos, source design files.
Documents       | .pdf, .doc, .docx, .xls, .xlsx, .ppt, .pptx, .txt, .rtf  | application/pdf, application/msword | Whitepapers, reports, data sheets.
Web Code        | .html, .htm, .php, .asp, .jsp, .css, .js, .json, .xml    | text/html, application/json         | Source code, stylesheets, scripts.
Archives        | .zip, .rar, .7z, .tar, .gz, .iso                         | application/zip                     | Software packages, backups.
Audio           | .mp3, .wav, .aac, .ogg, .flac, .m4a                      | audio/*                             | Podcasts, music, sound effects.
Video           | .mp4, .avi, .mov, .mkv, .webm, .flv                      | video/*                             | Streaming content, movies.
Protocols       | http://, https://, ftp://, sftp://, mailto:, tel:        | N/A                                 | Communication & transfer layers.
Executables     | .exe, .msi, .dmg, .apk, .bat, .sh                        | application/octet-stream            | Installers, scripts (high risk).
Data/Font       | .csv, .sql, .ttf, .otf, .woff, .woff2                    | text/csv, font/*                    | Database dumps, typography.
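As a rough illustration of how an extracted link could be bucketed against this table, the sketch below maps a file extension to a category; the lookup table is abbreviated and the default bucket is an assumption, not documented behavior:

```typescript
// Map a URL's file extension to one of the reference categories above
// (abbreviated: a real table would list every extension per row).
const EXTENSION_CATEGORIES: Record<string, string> = {
  jpg: "Raster Images", png: "Raster Images", svg: "Vector & Design",
  pdf: "Documents", zip: "Archives", mp3: "Audio", mp4: "Video",
  exe: "Executables", csv: "Data/Font", woff2: "Data/Font",
};

function categorize(url: string): string {
  // Take the extension from the pathname, ignoring query strings.
  const path = new URL(url).pathname;
  const ext = path.includes(".") ? path.split(".").pop()!.toLowerCase() : "";
  return EXTENSION_CATEGORIES[ext] ?? "Web Code"; // assumed default for pages/scripts
}

console.log(categorize("https://example.com/report.pdf?v=2")); // -> "Documents"
```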

Frequently Asked Questions

Why does the health check fail for links that open fine in my browser?

Browsers enforce Cross-Origin Resource Sharing (CORS) policies. When you click "Check Health", the tool sends a HEAD request. If the target server (e.g., Google, Facebook) blocks direct browser requests, the check returns "Network Error" or "Restricted". This does not necessarily mean the link is broken (404), only that it cannot be verified client-side without a backend proxy. We flag these as "Unknown/CORS".
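Client-side, that health check reduces to something like the following sketch; note that fetch() rejects on CORS blocks and network failures alike, which is exactly why the two cases cannot be told apart in the browser:

```typescript
// Probe a link with a HEAD request; rejections are not treated as 404s.
async function checkHealth(url: string): Promise<string> {
  try {
    const res = await fetch(url, { method: "HEAD" });
    return res.ok ? "OK" : `HTTP ${res.status}`;
  } catch {
    // fetch() throws the same way for CORS blocks and real network errors,
    // so client-side we cannot distinguish "blocked" from "down".
    return "Unknown/CORS";
  }
}

checkHealth("https://example.com").then(console.log);
```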
Can I paste raw HTML source code?

Yes. The regex engine is markup-agnostic. If you paste raw HTML (e.g., <a href='...'>), it extracts the URL inside the quotes, ignoring the HTML tags and focusing purely on the URL structure (http/https/www).
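To see why the markup is irrelevant, consider running the illustrative pattern from the Formulas section over an HTML fragment:

```typescript
// Same illustrative pattern as sketched in the Formulas section.
const LINK_PATTERN =
  /(?:https?:\/\/|ftp:\/\/|www\.)[a-z0-9.-]+\.[a-z]{2,63}(?:[:\/?#][^\s'"<>]*)?/gi;

const html = `<a href='https://example.com/page'>Example</a>
              <img src="https://cdn.example.com/x.png">`;
// The pattern keys on the URL structure itself, so the surrounding
// tags and quote style never enter into the match.
console.log(html.match(LINK_PATTERN));
// -> ["https://example.com/page", "https://cdn.example.com/x.png"]
```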
Does the tool remove tracking parameters such as utm_source?

By default, the tool extracts the exact URL found. However, if you enable the "Clean UTM/Ref" option in the settings panel, the tool parses the query string and strips standard marketing parameters (utm_source, ref, etc.) before deduplication, ensuring you get the canonical URL.
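The stripping step can be expressed with the standard WHATWG URL API; the parameter list below is an assumption, since the tool's exact list is not documented:

```typescript
// Remove common tracking parameters before deduplication (assumed list).
const TRACKING_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "ref",
];

function canonicalize(link: string): string {
  const url = new URL(link);
  for (const param of TRACKING_PARAMS) url.searchParams.delete(param);
  return url.toString();
}

console.log(canonicalize("https://example.com/?utm_source=x&id=7"));
// -> "https://example.com/?id=7"
```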
Why are email addresses missing from my results?

The extractor prioritizes web resources (http/https). To extract emails, ensure the "Include Emails (mailto)" filter is checked. The system uses an RFC 5322-based regex for email detection, which looks for the "@" symbol flanked by valid local-part and domain characters.
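A fully RFC 5322-compliant regex is notoriously long, so a pragmatic approximation of the kind described ("@" flanked by valid characters) might look like this sketch:

```typescript
// Pragmatic email matcher: local part, "@", domain labels, TLD.
const EMAIL_PATTERN = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,63}/gi;

const sample = "Contact sales@example.com or support@sub.example.org.";
console.log(sample.match(EMAIL_PATTERN));
// -> ["sales@example.com", "support@sub.example.org"]
```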
Are internationalized domain names (IDNs) supported?

Partially. Standard Punycode (xn--) domains are fully supported, but raw UTF-8 characters in domains (e.g., http://сайт.рф) are complex to validate via regex alone. The tool attempts to capture them when they follow standard URI percent-encoding formats.
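One common approach, sketched below, is to let the WHATWG URL parser normalize an IDN to its Punycode form before regex validation; this shows the approach, not necessarily what the tool does internally:

```typescript
// WHATWG URL converts Unicode hostnames to their ASCII (Punycode) form,
// which a regex layer can then validate like any other domain.
const idn = new URL("http://сайт.рф/path");
console.log(idn.hostname); // -> "xn--80aswg.xn--p1ai"
console.log(idn.href);     // -> "http://xn--80aswg.xn--p1ai/path"
```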