URL normalization (also called URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal is to transform a URL into a canonical form so that it is possible to determine whether two syntactically different URLs are equivalent.
Web crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine whether a link has been visited or whether a page has been cached.
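As an illustration of how a crawler might use normalization to avoid fetching the same resource twice, the following Python sketch keeps a set of already-seen canonical keys. The names `canonical_key` and `unique_urls` are hypothetical, and the key applies only two simple normalizations (lower-casing the scheme and host, dropping the fragment); a real crawler would likely apply more.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_key(url: str) -> str:
    """Hypothetical helper: build a comparison key for a URL.

    Assumption: the URL has no userinfo part, so lower-casing the
    whole network location is safe.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",   # treat an empty path as the root path
        parts.query,
        "",                  # drop the fragment
    ))


def unique_urls(urls):
    """Return the URLs whose canonical key has not been seen before."""
    seen = set()
    result = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result
```

With this sketch, `HTTP://www.FooBar.com/` and `http://www.foobar.com/#top` map to the same key, so only the first of the two would be scheduled for crawling.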
Normalization process
There are several types of normalization that may be performed:
- Converting the scheme and host to lower case – The scheme and host components of a URL are case-insensitive, so most normalizers convert them to lower case. Example:
  HTTP://www.FooBar.com/ → http://www.foobar.com/
- Converting the entire URL to lower case – Some web servers that run on top of case-insensitive file systems treat URLs as case-insensitive. URLs from such a server may therefore be converted to lower case to avoid ambiguity. This is only safe when the server is known to be case-insensitive, because paths are case-sensitive in general. Example:
  http://foo.org/BAR.html → http://foo.org/bar.html
- Capitalizing hexadecimal digits – The hexadecimal digits within a percent-encoding triplet (e.g., "%3a") are case-insensitive, so the letters a-f should be converted to upper case (A-F). Example:
  http://foo.org/?mode=%3a%b1+abc → http://foo.org/?mode=%3A%B1+abc
- Removing the fragment – The fragment is handled by the client and is not sent to the server, so a URL with and without the fragment retrieves the same resource. The fragment is therefore usually removed. Example:
  http://foo.org/bar.html#section1 → http://foo.org/bar.html
- Removing the default port – The default port (port 80 for the "http" scheme) may be removed from (or added to) a URL. Example:
  http://foo.org:80/bar.html → http://foo.org/bar.html
- Removing ".." and "." segments – The ".." and "." segments are usually removed from a URL. Many normalizers use the algorithm described in RFC 3986, section 5.2.4 (or a similar algorithm) to remove the segments. Example:
  http://foo.org/../a/b/../c/./d.html → http://foo.org/a/c/d.html
- Adding a terminating slash – A terminating slash may be added at the end of a URL that points to a directory. Most web servers will redirect HTTP requests that are missing a terminating slash to a URL with the terminating slash. Examples:
  http://foo.org → http://foo.org/
  http://foo.org/dir → http://foo.org/dir/
- Removing the "www" prefix – Some websites can be reached both with and without an optional "www" prefix. For example, http://foo.org/ and http://www.foo.org/ may access the same website. Although many websites will redirect the user to the non-www version (or vice versa), some do not. A normalizer may perform extra processing to determine whether a non-www version exists and then normalize all URLs to the non-www form. Example:
  http://www.foo.org/ → http://foo.org/
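The purely syntactic steps in the list above can be sketched in Python with the standard `urllib.parse` module. This is a minimal sketch under stated assumptions, not a complete normalizer: the function names and the `DEFAULT_PORTS` table are illustrative, only "http" and "https" URLs with absolute paths are considered, and the dot-segment removal is a simplified variant of the RFC 3986 approach. Steps that depend on server behaviour (lower-casing the whole URL, adding a slash to directory paths, removing "www") are deliberately left out.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these two schemes and their default ports are handled.
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path: str) -> str:
    """Drop "." and ".." segments from an absolute path (cf. RFC 3986, section 5.2.4)."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            pass  # "." refers to the current level and is dropped
        elif segment == "..":
            # Step up one level, but never above the root (keep the leading "").
            if len(output) > 1:
                output.pop()
        else:
            output.append(segment)
            continue
        if is_last:
            # A trailing "." or ".." names a directory, so keep the final slash.
            output.append("")
    return "/".join(output)


def upper_percent_hex(text: str) -> str:
    """Upper-case the hex digits of every percent-encoding triplet."""
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), text)


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""   # urlsplit returns the host in lower case
    if ":" in host:               # IPv6 literal: restore the brackets
        host = f"[{host}]"
    userinfo, at, _ = parts.netloc.rpartition("@")  # userinfo is kept as-is
    netloc = userinfo + at + host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += f":{parts.port}"  # keep only non-default ports
    # An empty path becomes "/" (the first terminating-slash example).
    path = upper_percent_hex(remove_dot_segments(parts.path)) or "/"
    query = upper_percent_hex(parts.query)
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment dropped
```

For instance, `normalize_url("http://foo.org:80/../a/b/../c/./d.html#section1")` yields `http://foo.org/a/c/d.html`, combining the port, dot-segment and fragment examples from the list.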
References
- RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
- Sang Ho Lee, Sung Jin Kim and Seok Hoo Hong (2005). "On URL normalization". Proceedings of the International Conference on Computational Science and its Applications (ICCSA 2005), 1076-1085.