patternhelloworld/url-knife

Url-knife

Overview

Extract and decompose (fuzzy) URLs (including emails, which are conceptually a part of URLs) in texts with Area-Pattern-based modularity.

  • This library is currently being refactored into TypeScript, as it was originally developed in JavaScript.

LIVE DEMO

Area-Pattern-Based Modularity

The Area represents a designated section of content, such as general text, XML (HTML) areas, URL areas, or EMAIL areas. Each Area is associated with a specific set of Patterns (regular expressions) tailored to its context.

Example:

  1. In a TextArea (general plain text), the system applies a URL-specific regular expression to extract potential URLs.
  2. Once the area is narrowed down to contain URLs, UrlArea logic is used, applying URL-specific patterns to decompose the URL into its components (e.g., protocol, domain, path, query parameters).
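The two-step flow above can be pictured with a rough sketch. It is only an illustration: the regular expression and the helper names `extractUrlCandidates` and `decomposeUrl` are hypothetical and much simpler than the library's own patterns, and the decomposition step relies on the built-in WHATWG `URL` class rather than on url-knife.

```javascript
// Step 1 (TextArea-like): find URL candidates in plain text, keeping their offsets.
// The pattern is a deliberately simple assumption, not the library's pattern.
const URL_CANDIDATE = /https?:\/\/[^\s<>"']+/g;

function extractUrlCandidates(text) {
  const found = [];
  for (const m of text.matchAll(URL_CANDIDATE)) {
    found.push({ value: m[0], index: { start: m.index, end: m.index + m[0].length } });
  }
  return found;
}

// Step 2 (UrlArea-like): decompose one candidate into its components.
function decomposeUrl(urlStr) {
  const u = new URL(urlStr);
  return {
    protocol: u.protocol.replace(/:$/, ''),
    domain: u.hostname,
    port: u.port || null,
    path: u.pathname,
    params: Object.fromEntries(u.searchParams),
  };
}

const text = 'Docs live at https://example.com:8443/guide/start?lang=en and nowhere else.';
const parts = extractUrlCandidates(text).map((c) => ({ ...c, parts: decomposeUrl(c.value) }));
console.log(JSON.stringify(parts, null, 2));
```

Keeping the two steps separate means the extraction pattern can stay permissive while the decomposition logic only ever sees text already known to be a URL.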

Enhanced Accuracy with Regular Expression Indexes:

To further improve accuracy, the system leverages the index (or offset) values from regular expressions. These indexes help pinpoint exact locations of matches within the text, ensuring precise extraction and minimizing false positives.

For example:

  • If a CommentArea is processed using its specific patterns, the system identifies indexes for matches within that area.
  • These indexes can then be used to exclude matched URLs from a broader TextArea, ensuring only relevant URLs are processed and avoiding redundant or incorrect extractions.
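The index-based exclusion described above can be sketched as follows. The patterns are simplified assumptions and the helpers `findRanges` and `isInside` are hypothetical names; the sketch only shows how match offsets let one area's results be separated from another's.

```javascript
// Collect { value, start, end } for every match of a global regex.
function findRanges(text, pattern) {
  return [...text.matchAll(pattern)].map((m) => ({
    value: m[0],
    start: m.index,
    end: m.index + m[0].length,
  }));
}

// True when range `inner` lies fully within range `outer`.
function isInside(inner, outer) {
  return inner.start >= outer.start && inner.end <= outer.end;
}

const COMMENT = /<!--[\s\S]*?-->/g;         // simplified comment pattern (assumption)
const URL_LIKE = /https?:\/\/[^\s<>"']+/g;  // simplified URL pattern (assumption)

const html = '<p>http://visible.example/a</p><!-- http://hidden.example/b -->';

const comments = findRanges(html, COMMENT);
const allUrls = findRanges(html, URL_LIKE);

// Tag each URL with the area it was found in, using offsets only.
const tagged = allUrls.map((u) => ({
  ...u,
  area: comments.some((c) => isInside(u, c)) ? 'comment' : 'text',
}));

// URLs that belong to the plain-text area only.
const textOnly = tagged.filter((u) => u.area === 'text');
console.log(tagged, textOnly);
```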

Key Benefits:

This modular approach ensures that each Area is processed efficiently with the most relevant and optimized regular expressions. By incorporating index-based matching, it enables robust, scalable, and highly accurate parsing for various content types while preventing conflicts between overlapping patterns.

Installation

For ES5 users, refer to public/index.html.

<html>
<body>
<script src="../dist/url-knife.bundle.js"></script>
<!-- OR -->
<script src="https://cdn.jsdelivr.net/gh/patternknife/url-knife@4.1.6/dist/url-knife.bundle.min.js"></script>
</body>
</html>

For ES6 npm users, run 'npm install --save url-knife' in the console. (Required: Node v18.20.4)

import {TextArea, UrlArea, XmlArea} from 'url-knife';

For ES5, add Pattern before usage:

Pattern.UrlArea...

Syntax & Usage

Chapter 1. Normalize or parse one URL

Chapter 2. Extract all URLs or emails

Chapter 3. Extract URIs with certain names

Chapter 4. Extract all URLs in raw HTML or XML

Chapter 1. Normalize or parse one URL

The following two methods should be used for processing a single URL, not for multiple URLs within a text. (For handling multiple URLs, refer to Chapters 2 and 4.)

normalizeUrl vs parseUrl

If you need to parse a standard URL without any typos, it is safe to use parseUrl. However, normalizeUrl is designed to handle URLs that may contain human errors.
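To see why a separate, error-tolerant path is useful, consider what a strict parser does with a mistyped URL. The sketch below uses the built-in WHATWG `URL` class, not url-knife, and `tryStrictParse` is a hypothetical helper name.

```javascript
// Strict parsing with the built-in WHATWG URL class: returns null on failure.
function tryStrictParse(input) {
  try {
    const u = new URL(input);
    return { protocol: u.protocol.replace(/:$/, ''), host: u.host, path: u.pathname };
  } catch (err) {
    return null; // malformed input is rejected rather than repaired
  }
}

// A well-formed URL is decomposed.
console.log(tryStrictParse('http://gooppalgo.com/park/tree/?abc=1'));

// A URL with human errors is rejected outright by the strict parser.
console.log(tryStrictParse('htp/:/abcgermany.,def;:9094 #park//noon??abc=retry'));
```

A strict parser can only accept or reject; repairing the input is the job normalizeUrl is designed for.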

  • Run normalizeUrl
/**
* @brief
* Normalize a URL with potential human errors (intranet URLs are not allowed.)
*/
var sample1 = Pattern.UrlArea.normalizeUrl("htp/:/abcgermany.,def;:9094 #park//noon??abc=retry")
var sample2 = Pattern.UrlArea.normalizeUrl("'://abc.jppp:9091 /park/noon'")
var sample3 = Pattern.UrlArea.normalizeUrl("ss hd : /university,.acd. ;jpkp: 9091/adc??abc=.com")
  • Results
{
"url": "htp/:/abcgermany.,def;:9094 #park//noon??abc=retry",
"normalizedUrl": "http://abcgermany.de:9094#park/noon?abc=retry",
"removedTailOnUrl": "",
"protocol": "http",
"onlyDomain": "abcgermany.de",
"onlyParams": "?abc=retry",
"onlyUri": "#park/noon",
"onlyUriWithParams": "#park/noon?abc=retry",
"onlyParamsJsn": {
"abc": "retry"
},
"type": "domain",
"port": "9094"
}
{
"url": "'://abc.jppp:9091 /park/noon'",
"normalizedUrl": "abc.jp:9091/park/noon",
"removedTailOnUrl": "'",
"protocol": null,
"onlyDomain": "abc.jp",
"onlyParams": null,
"onlyUri": "/park/noon'",
"onlyUriWithParams": "/park/noon'",
"onlyParamsJsn": null,
"type": "domain",
"port": "9091"
}
{
"url": "ss hd : /university,.acd. ;jpkpro jeobsog",
"normalizedUrl": "ssh://university.ac.jp",
"removedTailOnUrl": "",
"protocol": "ssh",
"onlyDomain": "university.ac.jp",
"onlyParams": null,
"onlyUri": null,
"onlyUriWithParams": null,
"onlyParamsJsn": null,
"type": "domain",
"port": null
}
  • Run parseUrl
/**
* @brief
* Parse a URL with no potential human errors
*/
var url = Pattern.UrlArea.parseUrl("xtp://gooppalgo.com/park/tree/?abc=1")
console.log(url)
{
"url": "xtp://gooppalgo.com/park/tree/?abc=1",
"removedTailOnUrl": "",
"protocol": "xtp (unknown protocol)",
"onlyDomain": "gooppalgo.com",
"onlyParams": "?abc=1",
"onlyUri": "/park/tree/",
"onlyUriWithParams": "/park/tree/?abc=1",
"onlyParamsJsn": {
"abc": "1"
},
"type": "domain",
"port": null
}

Chapter 2. Extract all URLs or emails

The following methods are recommended for most cases.
  • extractAllUrls
var textStr = 'http://[::1]:8000eseo http ://www.example.com/wpstyle/?p=364 is ok \n' +
'HTTP://foo.com/blah_blah_(wikipedia) https://www.google.com/maps/place/USA/@36.2218457,... tnae1ver.com:8000on the internet Asterisk\n ' +
'the packed1book.net. fakeshouldnotbedetected.url?abc=fake s5houl7Shi Qi Ri dbedetected.jp?japan=go&html=ganada@pacbook.net; abc.com/ad/fg/?kk=5 abc@daum.net' +
'Have you visited http://goasidaio.ac.kr?abd=5annyeonghaseyo?5...,.&kkk=5rk.,, ' +
'http://df.ws/123\n' +
'http://142.42.1.1:8080/\n' +
'http://-.~_!$&\'()*+,;=:%40:80%2f::::::@example.com ' +
'Have you visited goasidaio.ac.kr?abd=5hell0?5...&kkk=5rk.,. ';

/**
* @brief
* Distill all URLs from normal text
* @author Andrew Kang
* @param textStr string required
* @param noProtocolJsn object
* default : {
'ipV4' : false,
'ipV6' : false,
'localhost' : false,
'intranet' : false
}
*/

var urls = Pattern.TextArea.extractAllUrls(textStr, {
'ipV4' : true,
'ipV6' : false,
'localhost' : false,
'intranet' : true
})
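The option names in noProtocolJsn suggest that each flag decides whether protocol-less hosts of that kind are accepted. The sketch below illustrates that reading only; the classification rules and the names `classifyHost` and `filterNoProtocolHosts` are assumptions, not the library's logic.

```javascript
// Classify a protocol-less host string. These rules are simplified assumptions.
function classifyHost(host) {
  if (/^(\d{1,3}\.){3}\d{1,3}$/.test(host)) return 'ipV4';
  if (/^\[[0-9a-f:]+\]$/i.test(host)) return 'ipV6';
  if (host.toLowerCase() === 'localhost') return 'localhost';
  if (!host.includes('.')) return 'intranet'; // single-label name, e.g. "buildserver"
  return 'domain';
}

// Same defaults as in the doc comment above.
const DEFAULTS = { ipV4: false, ipV6: false, localhost: false, intranet: false };

// Always keep ordinary domains; keep the other kinds only when their flag is on.
function filterNoProtocolHosts(hosts, options = {}) {
  const flags = { ...DEFAULTS, ...options };
  return hosts.filter((h) => {
    const kind = classifyHost(h);
    return kind === 'domain' || flags[kind] === true;
  });
}

const hosts = ['example.com', '142.42.1.1', 'localhost', 'buildserver', '[::1]'];
console.log(filterNoProtocolHosts(hosts));
console.log(filterNoProtocolHosts(hosts, { ipV4: true, intranet: true }));
```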
  • extractAllEmails
/**
* @brief
* Distill all emails from normal text
* @author Andrew Kang
* @param textStr string required
* @param prefixSanitizer boolean (default : false)
* @return array
*/

var emails = Pattern.TextArea.extractAllEmails(textStr, true)

console.log(emails)
The 'pass' property in each result indicates whether the extracted email strictly conforms to the RFC rules: if 'pass' is true, it does.
[{
"value": {
"email": "ganada@apacbook.ac.kr",
"removedTailOnEmail": null,
"type": "domain"
},
"area": "text",
"index": {
"start": 222,
"end": 240
},
"pass": false
},
{
"value": {
"email": "adssd@asdasd.ac.jp",
"removedTailOnEmail": null,
"type": "domain",
"removedTailOnUrl": "..."
},
"area": "text",
"index": {
"start": 242,
"end": 263
},
"pass": true
}]
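The split between finding an email-like token and judging it strictly, as reflected in the 'pass' flag, can be sketched like this. Both patterns are simplified assumptions (the strict one is only an approximation of RFC-style rules, not a full implementation), and `extractEmailsWithPass` is a hypothetical name.

```javascript
// Loose pattern: good enough to *find* email-like tokens in running text.
const LOOSE_EMAIL = /[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+/g;

// Stricter pattern: a simplified approximation of RFC-style rules (assumption),
// allowing only common local-part characters and dot-separated host labels.
const STRICT_EMAIL = /^[A-Za-z0-9!#$%&'*+\/=?^_`{|}~-]+(\.[A-Za-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)+$/;

function extractEmailsWithPass(text) {
  return [...text.matchAll(LOOSE_EMAIL)].map((m) => ({
    value: { email: m[0] },
    index: { start: m.index, end: m.index + m[0].length },
    pass: STRICT_EMAIL.test(m[0]), // true only when the stricter rules also hold
  }));
}

// The second address has consecutive dots in its local part, so it is
// still extracted but flagged with pass: false.
const sample = 'Write to abc@daum.net or to bad..name@daum.net today';
console.log(extractEmailsWithPass(sample));
```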

LIVE DEMO

Chapter 3. Extract URIs with certain names

var sampleText = 'https://google.com/abc/777?a=5&b=7 abc/def 333/kak abc/55eseo abc/53 abc/533/ka abc/53a/ka /123a/abc/556/dd /abc/123?a=5&b=tkt /xyj/asff' +
'a333/kak nice/guy/ bad/or/nice/guy ssh://nice.guy.com/?a=dkdfl';

/**
* @brief
* Distill uris with certain names from normal text
* @author Andrew Kang
* @param textStr string required
* @param uris array required
* for example, [['a','b'], ['c','d']]
* If you use {number}, this means 'only number' ex) [['a','{number}'], ['c','d']]
* @param endBoundary boolean (default : false)
* @return array
*/

var uris = Pattern.TextArea.extractCertainUris(sampleText,
[['{number}', 'kak'], ['nice','guy'],['abc', '{number}']], true)

// If endBoundary is set to false, more URIs are detected.
// This detects all URIs containing '{number}/kak', 'nice/guy' or 'abc/{number}'
console.log(uris)
[
{
"uriDetected": {
"value": {
"url": "/abc/777?a=5&b=7",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": "",
"onlyParams": "?a=5&b=7",
"onlyUri": "/abc/777",
"onlyUriWithParams": "/abc/777?a=5&b=7",
"onlyParamsJsn": {
"a": "5",
"b": "7"
},
"type": "domain",
"port": null
},
"area": "text",
"index": {
"start": 18,
"end": 34
}
},
"inWhatUrl": {
"value": {
"url": "https://google.com/abc/777?a=5&b=7",
"removedTailOnUrl": "",
"protocol": "https",
"onlyDomain": "google.com",
"onlyParams": "?a=5&b=7",
"onlyUri": "/abc/777",
"onlyUriWithParams": "/abc/777?a=5&b=7",
"onlyParamsJsn": {
"a": "5",
"b": "7"
},
"type": "domain",
"port": null
},
"area": "text",
"index": {
"start": 0,
"end": 34
}
}
},
{
"uriDetected": {
"value": {
"url": "333/kak",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "333/kak",
"onlyUriWithParams": "333/kak",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 43,
"end": 51
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "abc/53",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "abc/53",
"onlyUriWithParams": "abc/53",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 60,
"end": 67
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "abc/533/ka",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "abc/533/ka",
"onlyUriWithParams": "abc/533/ka",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 67,
"end": 77
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "/123a/abc/556/dd",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "/123a/abc/556/dd",
"onlyUriWithParams": "/123a/abc/556/dd",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 89,
"end": 105
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "/abc/123?a=5&b=tkt",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": "?a=5&b=tkt",
"onlyUri": "/abc/123",
"onlyUriWithParams": "/abc/123?a=5&b=tkt",
"onlyParamsJsn": {
"a": "5",
"b": "tkt"
},
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 106,
"end": 124
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "nice/guy",
"removedTailOnUrl": "/",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "nice/guy",
"onlyUriWithParams": "nice/guy",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 144,
"end": 153
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "/or/nice/guy",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "/or/nice/guy",
"onlyUriWithParams": "/or/nice/guy",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 157,
"end": 170
}
},
"inWhatUrl": null
}
]
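One way to picture the `uris` argument is as a list of segment sequences compiled into a single regular expression, with '{number}' standing for digits only. The sketch below is a hypothetical illustration (`buildUriPattern` is not part of the library) and does not reproduce the endBoundary option or the way the results above include the surrounding path.

```javascript
// Escape regex metacharacters in a literal path segment.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Turn [['abc', '{number}'], ['nice', 'guy']] into one alternation regex.
function buildUriPattern(uris) {
  const alternatives = uris.map((segments) =>
    segments
      .map((seg) => (seg === '{number}' ? '\\d+' : escapeRegExp(seg)))
      .join('/')
  );
  // No word character may sit directly before or after the match,
  // so "abc/53a" is not accepted for "abc/{number}".
  return new RegExp(`(?<![\\w])(?:${alternatives.join('|')})(?![\\w])`, 'g');
}

const pattern = buildUriPattern([['{number}', 'kak'], ['nice', 'guy'], ['abc', '{number}']]);
const text = 'abc/def 333/kak abc/53 abc/53a/ka nice/guy/';
const hits = [...text.matchAll(pattern)].map((m) => ({ value: m[0], start: m.index }));
console.log(hits);
```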

Chapter 4. Extract all URLs in raw HTML or XML

// The sample of 'XML (HTML)'
var xmlStr =
'en.wikipedia.org/wiki/Wikipedia:About\n' +
'packed1book.net?user[name][first]=tj&user[name][last]=holowaychuk\n' +
'fakeshouldnotbedetected.url?abc=fake -s5houl7Shi Qi Ri dbedetected.jp?japan=go- ' +
'plus.google.co.kr0eseo.., \n' +
'https://plus.google.com/+google\n' +
'https://www.google.com/maps/place/USA/@36.2218457,...' +
' float : none ; height: 200px;max-width: 50%;margin-top : 3%\' alt="undefined" src="http://www.aaagaga.com/image/showWorkOrderImg?fileName=12345.png"/>\n' +
' "abc@daum.net"ro bonaejuseyo. ' +
'-gigi.dau.ac.kr?mac=10 -dau.ac.kr?mac=10 ' +
'abcd@daum.co.kreseo ganada@pacbook.netPlease align the paper to the left. \n' +
'guru.com undefined\n' +
'http: //ne1ver.com:8000?abc=1&dd=5 localhost:80 estonia.ee/ estonia.ee? ' +
'https://flaviocopes.com/how-to-inspect-javascript-object/ *Please ask 203.35.33.555:8000 if you have any issues! * ' +
'Have you visited goasidaioaaa.ac.kr';

var urls = Pattern.XmlArea.extractAllUrls(xmlStr);
console.log(urls)
[
// Not all listed
{
"value": {
"url": "packed1book.net?user[name][first]=tj&user[name][last]=holowaychuk",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": "packed1book.net",
"onlyParams": "?user[name][first]=tj&user[name][last]=holowaychuk",
"onlyUri": null,
"onlyUriWithParams": "?user[name][first]=tj&user[name][last]=holowaychuk",
"onlyParamsJsn": {
"user": {
"name": {
"first": "tj",
"last": "holowaychuk"
}
}
},
"type": "domain",
"port": null
},
"area": "text"
},
{
"value": {
"url": "adackedbooked.co.kr",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": "adackedbooked.co.kr",
"onlyParams": null,
"onlyUri": null,
"onlyUriWithParams": null,
"onlyParamsJsn": null,
"type": "domain",
"port": null
},
"area": "comment"
}
.....
]
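The nested "onlyParamsJsn" value in the first result shows bracketed keys such as user[name][first] being expanded into nested objects. A minimal sketch of that kind of expansion follows; `parseNestedParams` is a hypothetical helper built on the standard `URLSearchParams` class, and the library's own handling (for example of arrays or repeated keys) may differ.

```javascript
// Parse "?user[name][first]=tj&user[name][last]=holowaychuk" into a nested object.
function parseNestedParams(query) {
  const result = {};
  for (const [rawKey, value] of new URLSearchParams(query)) {
    // "user[name][first]" -> ["user", "name", "first"]
    const path = rawKey.split(/[\[\]]+/).filter(Boolean);
    if (path.length === 0) continue;
    // Skip keys that could tamper with object prototypes.
    if (path.some((p) => p === '__proto__' || p === 'constructor' || p === 'prototype')) continue;

    // Walk (and create) intermediate objects, then assign the leaf value.
    let node = result;
    for (let i = 0; i < path.length - 1; i++) {
      if (typeof node[path[i]] !== 'object' || node[path[i]] === null) {
        node[path[i]] = {};
      }
      node = node[path[i]];
    }
    node[path[path.length - 1]] = value;
  }
  return result;
}

const nested = parseNestedParams('?user[name][first]=tj&user[name][last]=holowaychuk');
console.log(JSON.stringify(nested, null, 2));
```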

License

MIT (LICENSE); an additional license.txt of unknown type is also present.