Link Extractor from Text
Advanced SEO utility to parse, validate, and organize URLs from unstructured text. Features robust RegEx extraction, duplicate removal, UTM cleaning, and CORS-aware status checking.
About
In the high-stakes world of Search Engine Optimization (SEO) and data mining, the precision of data extraction is paramount. Manual link auditing is error-prone and inefficient. This Link Extractor is engineered to eliminate noise, utilizing RFC-compliant heuristic patterns to identify HTTP/HTTPS resources, FTP endpoints, and email addresses from massive unstructured datasets.
Accuracy matters. A broken link or a missed redirect can cost valuable link equity. This tool does not merely regex-match; it normalizes paths, detects duplicates, and offers granular filtering (by extension, protocol, or domain). It addresses the common pitfall of phantom characters attached to URLs during copy-paste operations, ensuring the output is strictly valid and ready for server-side analysis or reporting.
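As a rough illustration of the cleanup described above, the following Python sketch strips trailing punctuation, drops `utm_*` tracking parameters, and lowercases the scheme and host. The function name `clean_url` and the `TRAILING_JUNK` character list are assumptions made for this example, not the tool's actual implementation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Characters that often cling to a URL after copy-paste (assumed list).
# Caveat: this also trims a legitimate closing ")" at the end of a path.
TRAILING_JUNK = ".,;:!?)]}>'\""


def clean_url(raw: str) -> str:
    """Strip trailing punctuation, drop utm_* parameters, normalize case."""
    url = raw.strip().rstrip(TRAILING_JUNK)
    parts = urlsplit(url)
    # Keep every query parameter except the utm_* tracking family.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # simplification: also lowercases any userinfo
        parts.path or "/",     # treat an empty path as the root path
        urlencode(query),
        parts.fragment,
    ))


print(clean_url("https://Example.com/page?utm_source=x&id=7)."))
```

Normalizing before comparison matters because two links that differ only in host casing or tracking parameters would otherwise survive as separate entries.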
Formulas
The core extraction logic relies on a composite regular expression designed to capture URIs as defined in RFC 3986, combining patterns for HTTP/HTTPS and FTP resources with a separate pattern for email addresses.
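A composite pattern of this kind might look like the Python sketch below. Both expressions are deliberately simplified illustrations and are assumptions for this example; they are not the tool's actual pattern and do not cover the full RFC 3986 grammar.

```python
import re

# Illustrative patterns only (assumed, much looser than RFC 3986).
URL_RE = re.compile(r"""(?:https?|ftp)://[^\s<>"']+""", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_candidates(text: str) -> list[str]:
    """Return URL candidates first, then email candidates."""
    urls = URL_RE.findall(text)
    emails = EMAIL_RE.findall(text)
    return urls + emails


sample = "See https://example.com/a?b=1, or mail bob@example.org."
print(extract_candidates(sample))
```

Note that the URL candidate in this sample still carries the trailing comma from the sentence. A loose pattern over-captures by design, which is why a separate cleanup pass is needed before the results are treated as valid links.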
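The extraction efficiency $E$ is the ratio of unique valid links to the total number of candidate strings found:

$$E = \frac{|S_{\text{clean}}|}{|L|}$$

where $L$ represents the set of extracted entities and $S_{\text{clean}}$ is the deduplicated set of valid links drawn from it. Deduplication follows set semantics, read procedurally: an entity $x$ is admitted only if it differs from every entity $y$ already admitted:

$$S_{\text{clean}} = \{\, x \in L \mid \forall y \in S_{\text{clean}},\ x \neq y \,\}$$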
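The deduplication rule and the efficiency ratio can be sketched in a few lines of Python. The helper names `deduplicate` and `efficiency` are hypothetical, and the sketch assumes links are compared as exact strings after any normalization has already been applied.

```python
def deduplicate(links: list[str]) -> list[str]:
    """Keep the first occurrence of each link, preserving input order."""
    seen: set[str] = set()
    clean: list[str] = []
    for link in links:
        if link not in seen:  # admit x only if no equal y was admitted before
            seen.add(link)
            clean.append(link)
    return clean


def efficiency(candidates: list[str], clean: list[str]) -> float:
    """Ratio of unique links to total candidates; 0.0 for empty input."""
    return len(clean) / len(candidates) if candidates else 0.0


candidates = ["a", "b", "a", "c", "b"]
unique = deduplicate(candidates)
print(unique, efficiency(candidates, unique))
```

A plain `set(links)` would also remove duplicates, but it discards the order in which links appeared in the source text, which is usually worth keeping for audit reports.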
Reference Data
| Category | Extensions / Patterns | MIME Type Estimate | Usage Context |
|---|---|---|---|
| Raster Images | .jpg, .jpeg, .png, .gif, .webp, .bmp, .ico | image/* | Visual assets, favicons, banners. |
| Vector & Design | .svg, .eps, .ai, .psd, .cdr | image/svg+xml, application/* | Logos, source design files. |
| Documents | .pdf, .doc, .docx, .xls, .xlsx, .ppt, .pptx, .txt, .rtf | application/pdf, application/msword | Whitepapers, reports, data sheets. |
| Web Code | .html, .htm, .php, .asp, .jsp, .css, .js, .json, .xml | text/html, application/json | Source code, stylesheets, scripts. |
| Archives | .zip, .rar, .7z, .tar, .gz, .iso | application/zip | Software packages, backups. |
| Audio | .mp3, .wav, .aac, .ogg, .flac, .m4a | audio/* | Podcasts, music, sound effects. |
| Video | .mp4, .avi, .mov, .mkv, .webm, .flv | video/* | Streaming content, movies. |
| Protocols | http://, https://, ftp://, sftp://, mailto:, tel: | N/A | Communication & Transfer layers. |
| Executables | .exe, .msi, .dmg, .apk, .bat, .sh | application/octet-stream | Installers, scripts (High Risk). |
| Data/Font | .csv, .sql, .ttf, .otf, .woff, .woff2 | text/csv, font/* | Database dumps, typography. |
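Filtering by extension, using the categories in the table above, could be sketched as follows. The `CATEGORIES` mapping copies the extension lists from the table; the function name `categorize` and the `"Other"` fallback label are assumptions for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Extension lists copied from the reference table above.
CATEGORIES = {
    "Raster Images": {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp", ".ico"},
    "Vector & Design": {".svg", ".eps", ".ai", ".psd", ".cdr"},
    "Documents": {".pdf", ".doc", ".docx", ".xls", ".xlsx",
                  ".ppt", ".pptx", ".txt", ".rtf"},
    "Web Code": {".html", ".htm", ".php", ".asp", ".jsp",
                 ".css", ".js", ".json", ".xml"},
    "Archives": {".zip", ".rar", ".7z", ".tar", ".gz", ".iso"},
    "Audio": {".mp3", ".wav", ".aac", ".ogg", ".flac", ".m4a"},
    "Video": {".mp4", ".avi", ".mov", ".mkv", ".webm", ".flv"},
    "Executables": {".exe", ".msi", ".dmg", ".apk", ".bat", ".sh"},
    "Data/Font": {".csv", ".sql", ".ttf", ".otf", ".woff", ".woff2"},
}


def categorize(url: str) -> str:
    """Map a URL to a table category by the extension of its path."""
    # Use only the path so query strings and fragments do not interfere.
    ext = PurePosixPath(urlsplit(url).path).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if ext in extensions:
            return category
    return "Other"  # assumed fallback for unknown or missing extensions


print(categorize("https://example.com/files/Report.PDF?dl=1"))
```

Only the final suffix is inspected, so `archive.tar.gz` is classified by `.gz`. An extension is also just a hint: the MIME types in the table are estimates, and the server's `Content-Type` header remains the authoritative source.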