Character encodings in HTML
HTML has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

The document character encoding

When HTML documents are served, there are three ways to tell the browser what specific character encoding is to be used for display to the reader. First, HTTP headers can be sent by the web server along with each web page (HTML document). A typical HTTP header looks like this:

 Content-Type: text/html; charset=ISO-8859-1 

For HTML (not usually XHTML), the second method is for the HTML document to include this information at its top, inside the HEAD element:

 <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII"> 

XHTML documents have a third option: to express the character encoding in the XML declaration at the start of the document, for example:

 <?xml version="1.0" encoding="ISO-8859-1"?> 

These methods each advise the receiver that the file being sent uses the character encoding specified. The character encoding is often referred to as the "character set", and it does indeed limit the characters in the raw source text. However, the HTML standard states that the "charset" is to be treated as an encoding of Unicode characters and provides a way to specify characters that the "charset" does not cover. The term "code page" is also used similarly.


It is a bad idea to send incorrect information about the character encoding used by a document. For example, a server where multiple users may place files created on different machines cannot promise that all the files it sends will conform to the server's specification, because some users may have machines with different character sets. For this reason, many servers simply do not send the information at all, thus avoiding making false promises. However, this may result in the equally bad situation where the user agent (the browser) displays the document incorrectly because neither the server nor the document has specified a character encoding.


The HTTP header specification supersedes all HTML (or XHTML) meta tag specifications, which can be a problem if the header is incorrect and one does not have the access or the knowledge to change it.
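The precedence described above can be sketched in Python. This is only an illustration under stated assumptions: the helper name `declared_encoding` and the simplified regular expression for the meta element are invented for this example, and real user agents apply more elaborate rules.

```python
import re
from email.message import Message

# Hypothetical, simplified pattern: finds charset=... inside a <meta> tag.
# Real parsers handle many more attribute layouts than this.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)

def declared_encoding(content_type, body):
    """Return the declared encoding: HTTP header first, then meta, else None."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lowercased, or None if absent
        if charset:
            return charset
    # Only consulted when the header gave no charset.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None

page = b'<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">'
from_header = declared_encoding("text/html; charset=ISO-8859-1", page)
from_meta = declared_encoding(None, page)
undeclared = declared_encoding("text/html", b"<p>hi</p>")
```

When both declarations are present the header value is returned, which is why an incorrect header cannot be repaired from inside the document.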


Browsers receiving a file with no character encoding information must make a blind assumption. For Western European languages, it is typical and fairly safe to assume windows-1252 (which is similar to ISO-8859-1 but has printable characters in place of some control codes that are forbidden in HTML anyway), but it is also common for browsers to assume the character set native to the machine on which they are running. The consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In CJK (Chinese, Japanese, and Korean) environments, where several different multi-byte encodings are in use, auto-detection is often employed.
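The effect of a wrong guess is easy to reproduce. The following Python sketch decodes the same bytes under different assumptions; the sample word and variable names are chosen only for illustration.

```python
# Bytes as actually written by the author, who saved the file as UTF-8.
raw = "café".encode("utf-8")

as_utf8 = raw.decode("utf-8")            # correct guess
as_cp1252 = raw.decode("windows-1252")   # wrong guess: each byte becomes one character

# windows-1252 and ISO-8859-1 disagree in the 0x80-0x9F range:
# byte 0x99 is a printable character in one and a control code in the other.
byte_99_cp1252 = b"\x99".decode("windows-1252")
byte_99_latin1 = b"\x99".decode("iso-8859-1")
```

The ASCII letters survive the wrong guess unchanged, which is why English text rarely shows the problem.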


It is increasingly common for multilingual websites to use one of the Unicode/ISO 10646 transformation formats, as this allows use of the same encoding for all languages. Generally UTF-8 is used rather than UTF-16 or UTF-32 because it is easier to handle in programming languages that assume a byte-oriented ASCII superset encoding, and it is efficient for ASCII-heavy text (which HTML tends to be).
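The size argument can be checked with a short Python sketch. The sample markup is invented for illustration, and the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
# ASCII-heavy markup containing a single non-ASCII character.
markup = "<p>Greek letter: λ</p>"

# Encoded size in bytes under each transformation format.
sizes = {
    name: len(markup.encode(name))
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}
```

For this 22-character string, UTF-8 spends one byte on each ASCII character and two on the lambda, while the other formats spend two or four bytes on every character.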


Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some machine-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers with different native sets will not see the page as intended.


Character references

In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.


Character entity references have the format &name; where "name" is a case-sensitive alphanumeric string. For example, the character 'λ' can be encoded as &lambda; in an HTML 4 document. Characters <, >, " and & are used to delimit tags, attribute values, and character references. Character entity references &lt;, &gt;, &quot; and &amp;, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of the characters.
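As a small illustration, Python's standard `html` module can produce and reverse these references. The sample string is invented for the example.

```python
import html

text = 'if a < b && name == "x"'

# Replace the markup-significant characters with character references.
escaped = html.escape(text, quote=True)

# Converting the references back recovers the original text.
round_trip = html.unescape(escaped)
```

The ampersand has to be escaped as well, since it is the character that introduces every reference.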


Numeric character references can be in decimal format, &#DD;, where DD is a variable-width string of decimal digits. Similarly there is a hexadecimal format, &#xHHHH;, where HHHH is a variable-width string of hexadecimal digits, though many consider it good practice to never use fewer than four hex digits, and never use an odd number of hex digits (due to the correspondence of two hex digits to one byte). Unlike named entities, hexadecimal character references are case-insensitive in HTML. For example, λ can also be represented as &#955;, &#x03BB; or &#X03bb;.
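The equivalence of these forms can be sketched in Python. The helper names `decimal_ref` and `hex_ref` are hypothetical; `html.unescape` is the standard library function that resolves references.

```python
import html

def decimal_ref(ch):
    """Decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"

def hex_ref(ch):
    """Hexadecimal reference, padded to at least four hex digits."""
    return f"&#x{ord(ch):04X};"

lam = "λ"
refs = [decimal_ref(lam), hex_ref(lam), "&#X03bb;", "&lambda;"]

# Every spelling resolves to the same character.
decoded = [html.unescape(r) for r in refs]
```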


Numeric references always refer to Universal Character Set code points, regardless of the page's encoding. Using numeric references that refer to UCS control code ranges is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "&#153;", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.
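The ranges above can be written as a simple check, and the backward-compatible behaviour can be observed with Python's `html.unescape`, which follows the HTML5 rules for invalid references. The names `FORBIDDEN_RANGES` and `is_forbidden_reference` are hypothetical and exist only for this sketch.

```python
import html

# Ranges listed above, as (first, last) code points.
FORBIDDEN_RANGES = [(0x00, 0x08), (0x0B, 0x0C), (0x0E, 0x1F), (0x7F, 0x7F), (0x80, 0x9F)]

def is_forbidden_reference(code_point):
    """True if a numeric reference to this code point is not allowed."""
    return any(lo <= code_point <= hi for lo, hi in FORBIDDEN_RANGES)

# A tolerant parser maps the forbidden reference to the Windows-1252 character.
legacy = html.unescape("&#153;")
cp1252_char = b"\x99".decode("windows-1252")
```

Here `legacy` and `cp1252_char` are the same character, the trade mark sign U+2122, even though code point 153 itself is a control code.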


Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately then HTML character references are usually only required for a few special characters (or not at all if a native Unicode encoding like UTF-8 is used).
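A brief Python sketch shows how the choice of encoding determines how many references are needed. It uses the standard `xmlcharrefreplace` error handler, which substitutes a decimal numeric reference for any character the target encoding cannot represent; the sample text is invented.

```python
text = "café λ"

# ISO-8859-1 covers é but not λ, so only λ becomes a reference.
latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# ASCII covers neither, so both become references.
ascii_only = text.encode("ascii", errors="xmlcharrefreplace")

# UTF-8 covers everything, so no references are needed.
utf8 = text.encode("utf-8")
```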


XML character entity references

Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:

  • &amp; → & (ampersand, U+0026)
  • &lt; → < (less-than sign, U+003C)
  • &gt; → > (greater-than sign, U+003E)
  • &quot; → " (quotation mark, U+0022)
  • &apos; → ' (apostrophe, U+0027)

All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin small letter E with acute, U+00E9, in HTML) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML 4 entity set and XML's &apos; entity, which does not appear in HTML 4.


However, use of &apos; in XHTML should generally be avoided for compatibility reasons. &#39; may be used instead.
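These XML rules can be observed with the XML parser in Python's standard library. The helper `parses` is hypothetical and simply reports whether a string is accepted; the sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """True if the standard library parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The five predefined entities need no declaration.
predefined = ET.fromstring("<p>&amp;&lt;&gt;&quot;&apos;</p>").text

# An HTML entity name is rejected unless the document declares it first.
undeclared_ok = parses("<p>&eacute;</p>")
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>&eacute;</p>'
).text

# The x of a hexadecimal reference must be lowercase.
lower_x_ok = parses("<p>&#xE9;</p>")
upper_x_ok = parses("<p>&#XE9;</p>")
```

A numeric reference such as `&#39;` or `&#xE9;` works without any declaration, which is one reason it is the safer choice when a document may be read by both HTML and XML tools.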


HTML character entity references

For a list of all named HTML character entity references, see List of XML and HTML character entity references (approximately 250 entries).


The Wikipedia article included on this page is licensed under the GFDL.