Character entity reference

HTML has been in use since 1991 (the W3C has since also standardized XHTML, an XML-based reformulation), but the first standardized version with a reasonably complete treatment of international characters was version 4.0, which was not published until 1997. Care must be taken when creating HTML pages that contain characters outside the range of 7-bit ASCII, in order to meet two goals: preserving the integrity of the information stored in the HTML document, and ensuring that the document displays properly in the largest possible variety of browsers.


The document character set

When HTML documents are served to the viewer, there are two ways to tell the browser what specific character encoding is used. First, HTTP headers can be sent by the server along with each page. A typical header looks like this:

Content-Type: text/html; charset=ISO-8859-1

The other method is for the HTML document to include this information at its top, inside the HEAD element.

<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">

Either method advises the receiver that the file being sent uses the character set specified. Of course, it would be a very bad idea to send incorrect information. For example, a server where multiple users may place files created on different machines cannot promise that all the files it sends will conform (some users may have machines with different character sets). For this reason, many servers simply do not send the information at all, to avoid making any false promises.
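
As a minimal sketch of how a receiver might read the declared character set from such a header, the Python standard library's email.message.Message class can parse the charset parameter. The helper name declared_charset is hypothetical and exists only for this illustration; a real browser also has to handle the META element and the case where the two sources disagree.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


print(declared_charset("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(declared_charset("text/html"))                      # None
```

The second call models a server that sends no character set information, which leaves the choice of encoding to the browser.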


Browsers receiving a file with no character set information must make a blind assumption. The safest is probably to assume ISO 8859-1, but it is also common for browsers to assume the character set native to the machine on which they are running. The consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) may appear incorrectly. This presents few problems for English-speaking users, but other languages require characters outside that range for everyday use.
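
The effect of a wrong guess can be reproduced in a few lines of Python. This is only an illustration: the sample word and the pairing of UTF-8 with ISO 8859-1 are assumptions chosen for the example, not something a particular browser is known to do.

```python
# Bytes written by an author whose editor saves files as UTF-8.
original = "café"
raw_bytes = original.encode("utf-8")

# A receiver that wrongly assumes ISO 8859-1 treats every byte as one character.
wrong_guess = raw_bytes.decode("iso-8859-1")
right_guess = raw_bytes.decode("utf-8")

print(wrong_guess)  # cafÃ©
print(right_guess)  # café

# Text restricted to printable ASCII (32 to 126) survives either guess.
ascii_only = "cafe".encode("utf-8")
assert ascii_only.decode("iso-8859-1") == ascii_only.decode("utf-8")
```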


For maximum compatibility, it is increasingly common for multilingual websites to use the UTF-8 encoding of the ISO 10646/Unicode character set, which provides a superset of almost all existing character sets.
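
The superset property can be checked informally in Python. In this sketch the helper encodable and the sample characters are assumptions made for illustration; ISO 8859-7 is the Greek member of the ISO 8859 family.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# One character each from Western European, Greek, Cyrillic, and Chinese text.
samples = ["é", "λ", "Ж", "中"]

for ch in samples:
    print(
        ch,
        encodable(ch, "iso-8859-1"),
        encodable(ch, "iso-8859-7"),
        encodable(ch, "utf-8"),
    )
```

Each legacy set covers only some of the samples, while UTF-8 can encode all of them.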


It is important to point out that successful viewing of a page is not necessarily an indication that it is encoded correctly. If the creator of a page and the reader are both assuming some machine-specific character set, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers with different native sets will not.


Character entity references

In addition to native character encodings, characters can also be encoded as HTML entities, using the encoding format derived from the use of entities in SGML.


Many symbolic character entities have been defined. For example, the character 'λ' can be encoded as &lambda;. Because the '&' character acts as the escape character for character entities, a literal '&' in HTML must itself be encoded as an entity, &amp;. A similar escape is required for the '<' character, which is encoded as &lt;. The '>' character does not strictly need to be encoded, but writing it as &gt; is recommended, particularly inside attribute values, where some older browsers mistake it for the end of a tag. Note that this encoding is different from URL encoding, which uses a different syntax (a percent sign followed by hexadecimal digits) and is far more strict.
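
As a short illustration, Python's standard html module performs this kind of escaping, and urllib.parse.quote shows how differently URL encoding treats the same text. The sample string is an assumption made for the example.

```python
import html
from urllib.parse import quote

text = "AT&T says 1 < 2"

# Entity-escape the markup-significant characters '&', '<' and '>'.
escaped = html.escape(text, quote=False)
print(escaped)                    # AT&amp;T says 1 &lt; 2

# Symbolic entities are resolved back to characters by html.unescape.
print(html.unescape("&lambda;"))  # λ
assert html.unescape(escaped) == text

# URL encoding uses percent signs and escapes far more characters.
url_encoded = quote(text)
print(url_encoded)                # AT%26T%20says%201%20%3C%202
```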


Decimal and hexadecimal HTML character references can also be used, based on the Unicode code point of the character encoded. For example, λ can also be written as the decimal character reference &#955; or as the hexadecimal character reference &#x3BB;. Numeric references always refer to Unicode, irrespective of the page encoding. Numeric references that fall within the reserved control areas of Unicode (which are shared with ISO 8859-1) are therefore illegal: that is, the (hex) ranges 00–1F, 7F, and 80–9F, or &#0; to &#31; and &#127; to &#159;. The exceptions are tab (&#9;), line feed (&#10;), and carriage return (&#13;), which are permitted.
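
The following Python sketch builds both kinds of numeric reference from a character's Unicode code point and flags code points in the control ranges listed above. The function names are hypothetical, and the check assumes, as HTML 4 does, that tab, line feed, and carriage return are permitted even though they lie in the 00–1F range.

```python
import html


def decimal_reference(ch: str) -> str:
    """Build a decimal numeric character reference such as &#955;."""
    return f"&#{ord(ch)};"


def hex_reference(ch: str) -> str:
    """Build a hexadecimal numeric character reference such as &#x3BB;."""
    return f"&#x{ord(ch):X};"


def is_disallowed_control(code_point: int) -> bool:
    """Return True for code points in 00-1F, 7F, or 80-9F.

    Assumption: tab (9), line feed (10) and carriage return (13) are allowed.
    """
    if code_point in (0x09, 0x0A, 0x0D):
        return False
    return code_point <= 0x1F or 0x7F <= code_point <= 0x9F


print(decimal_reference("λ"))  # &#955;
print(hex_reference("λ"))      # &#x3BB;

# Both forms resolve to the same character, whatever the page encoding.
assert html.unescape("&#955;") == html.unescape("&#x3BB;") == "λ"
```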


Note that unnecessary use of HTML character references may significantly reduce the readability of HTML. If the character encoding for a web page is chosen appropriately then HTML character references are usually only required for a few special characters. The characters & and < always need to be encoded, as noted above.


In XML there are five built-in character entity references:

  • &amp; = & (ampersand, U+0026)
  • &lt; = < (left angle bracket, less-than sign, U+003C)
  • &gt; = > (right angle bracket, greater-than sign, U+003E)
  • &quot; = " (quotation mark, U+0022)
  • &apos; = ' (apostrophe, U+0027)

All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin small letter E with acute, U+00E9, in HTML) in an XML document will generate an error unless the entity has already been defined.
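
This behaviour can be observed with Python's xml.etree.ElementTree parser, used here purely as one example of an XML parser. The element name p and the sample text are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# The five built-in references work without any declaration.
builtin = ET.fromstring("<p>&amp; &lt; &gt; &quot; &apos;</p>")
print(builtin.text)  # & < > " '

# An HTML-style entity that was never declared is rejected.
rejected = False
try:
    ET.fromstring("<p>caf&eacute;</p>")
except ET.ParseError as err:
    rejected = True
    print("rejected:", err)

# Declaring the entity in the document's internal DTD subset makes it usable.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute;</p>'
)
print(declared.text)  # café
```

A numeric character reference such as &#233; needs no declaration, so it is often the simplest alternative in XML documents.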


The Wikipedia article included on this page is licensed under the GFDL.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.