Universal Character Set

The international standard ISO/IEC 10646 defines the Universal Character Set (UCS) as a character encoding. It contains nearly a hundred thousand abstract characters, each identified by an unambiguous name and an integer number called its code point.


Characters (letters, numbers, symbols, ideograms, logograms, etc.) from the many languages, scripts, and traditions of the world are represented in the UCS with unique code points. The inclusiveness of the UCS is continually improving as characters from previously unrepresented writing systems are added.


Since 1991, the Unicode Consortium has worked with ISO to develop The Unicode Standard ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Version 2.0 of Unicode exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After the publication of Unicode 3.0 in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000.


The UCS has over 1.1 million code points, but only the first 65,536 (the Basic Multilingual Plane, or BMP) had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) mandated in 2000 that computer systems sold in its territory must support GB18030, which in turn required those systems to move beyond the BMP.


The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimize conflicts with other encoding forms.
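The size of the code space can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the usual division of the code space into 17 planes of 65,536 code points each, ending at hexadecimal 10FFFF, and the names `total_code_points`, `plane_of`, and `in_bmp` are hypothetical helpers rather than part of any standard.

```python
PLANE_SIZE = 0x10000        # 65,536 code points per plane
PLANE_COUNT = 17            # planes 0 (the BMP) through 16
MAX_CODE_POINT = 0x10FFFF   # last code point of plane 16


def total_code_points() -> int:
    """Number of code points in the whole code space."""
    return PLANE_COUNT * PLANE_SIZE


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that contains a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"{code_point:#x} is outside the code space")
    return code_point >> 16


def in_bmp(code_point: int) -> bool:
    """True when the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


print(total_code_points())
print(in_bmp(ord("A")), in_bmp(0x1F600))
```

The total comes to 1,114,112, which is the "over 1.1 million" figure quoted above; only plane 0 fits in 16 bits.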


Encoding forms of the Universal Character Set

ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, uses a single code value (defined as one or more numbers representing a code point) between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. UCS-2 cannot represent code points outside the BMP.
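As a rough illustration of the fixed two-byte form, the following sketch encodes a string as UCS-2. Several things here are assumptions rather than anything stated in the text: the function name `encode_ucs2_be` is hypothetical, big-endian byte order is chosen arbitrarily, and code points in the range D800 to DFFF are rejected because they are not assigned to characters.

```python
def encode_ucs2_be(text: str) -> bytes:
    """Encode text as UCS-2: exactly two bytes per character.

    Big-endian order is an assumption made for this sketch.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no way to express code points beyond the BMP.
            raise ValueError(f"U+{cp:04X} lies outside the BMP")
        if 0xD800 <= cp <= 0xDFFF:
            # These BMP code points are not assigned to characters.
            raise ValueError(f"U+{cp:04X} is not a character")
        out += cp.to_bytes(2, "big")
    return bytes(out)


print(encode_ucs2_be("A\u20ac").hex())
```

Every accepted character costs exactly two bytes, so the encoded length is always twice the number of characters.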


The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Each pair consists of an "RC-element" (a two-octet sequence comprising the R-octet and the C-octet from the four octet sequence that corresponds to a cell in the coding space of a coded character set) from the high-half zone and an "RC-element" from the low-half zone. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates".
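The pairing can be made concrete with a short sketch. It assumes the standard UTF-16 ranges (high surrogates D800 to DBFF, low surrogates DC00 to DFFF), which the text above does not spell out; the function names `to_surrogates` and `from_surrogates` are hypothetical.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point beyond the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need a pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")
```

Each half carries 10 bits, so a pair addresses 2^20 code points on top of the BMP, which is why UTF-16 reaches exactly up to 10FFFF.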


Another encoding, UCS-4, uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also take place in that range). UCS-4 allows representation of each value as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but of course it requires twice as much storage as UCS-2.
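A minimal sketch of the four-byte form follows. The function name `encode_ucs4_be` is hypothetical and big-endian byte order is an assumption made for the example; the comparison with Python's built-in `utf-32-be` codec relies on both forms agreeing for code points up to 10FFFF.

```python
def encode_ucs4_be(text: str) -> bytes:
    """Encode text as UCS-4: exactly four bytes per code point.

    Big-endian order is an assumption made for this sketch.
    """
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)


sample = "A\U0001F600"
encoded = encode_ucs4_be(sample)
print(encoded.hex())
# Fixed width: the byte length is always four times the character count.
print(len(encoded) == 4 * len(sample))
```

Because every character occupies the same four bytes, the n-th character always starts at byte offset 4n, at the cost of doubling the storage that two-byte text would need.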


Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". UCS-16 does not exist; the authors who make this error usually intend to refer to UCS-2 or to UTF-16.


History of ISO 10646

The International Organization for Standardization (ISO) set out to compose the universal character set in 1989, and published the draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects. That standard differed markedly from the current one. It defined 128 groups of 256 planes of 256 rows of 256 cells, for an apparent total of 2,147,483,648 characters, but actually the standard could code only 679,477,248 characters, as the policy forbade byte values of control characters (0x00 to 0x1F and 0x80 to 0x9F, in hexadecimal notation) anywhere. The Latin capital letter A, for example, had a location in group 0x20, plane 0x20, row 0x20, cell 0x41.
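Both totals can be reproduced by counting the permitted byte values. The sketch below assumes that the 128 groups correspond to group octets 0x00 to 0x7F; that reading is an inference from the figures above, and the helper names `is_control_value` and `draft_octets` are hypothetical.

```python
def is_control_value(octet: int) -> bool:
    """True for byte values reserved for control characters."""
    return octet <= 0x1F or 0x80 <= octet <= 0x9F


# Plane, row and cell octets range over all 256 byte values.
ALLOWED_PER_OCTET = sum(not is_control_value(o) for o in range(256))
# Assumption: the 128 groups use octets 0x00..0x7F.
ALLOWED_GROUPS = sum(not is_control_value(o) for o in range(128))

APPARENT_TOTAL = 128 * 256 ** 3
USABLE_TOTAL = ALLOWED_GROUPS * ALLOWED_PER_OCTET ** 3


def draft_octets(group: int, plane: int, row: int, cell: int) -> bytes:
    """Four-octet location of a character in the 1990 draft layout."""
    octets = (group, plane, row, cell)
    if group > 0x7F or any(is_control_value(o) for o in octets):
        raise ValueError("location uses a forbidden byte value")
    return bytes(octets)


print(APPARENT_TOTAL, USABLE_TOTAL)
print(draft_octets(0x20, 0x20, 0x20, 0x41).hex())
```

Under that assumption each unrestricted octet has 192 usable values and the group octet has 96, so a single plane holds 192 × 192 = 36,864 cells and the whole space holds 96 × 192³ = 679,477,248.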


One could code the characters of this primordial ISO 10646 standard in one of three ways:

  1. UCS-4, four bytes for every character, enabling the simple encoding of all characters;
  2. UCS-2, two bytes for every character, enabling the straightforward encoding of the first plane (0x20, the Basic Multilingual Plane, containing the first 36,864 code points), with other planes and groups reached by switching to them with ISO 2022 escape sequences;
  3. UTF-1, which encodes all the characters in sequences of bytes of varying length (1 to 5 bytes, none of which takes a control-character value).

In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO 10646. The software companies refused to accept the complexity and size requirement of the ISO standard and were able to convince a number of ISO National Bodies to vote against it. The ISO standardisers realised they could not continue to support the standard in its current state and negotiated the unification of their standard with Unicode. Two changes took place: the lifting of the limitation upon characters (prohibition of control character values), thus permitting characters like 0x0000101F; and the synchronisation of the repertoire of the Basic Multilingual Plane with that of Unicode.


Meanwhile, the situation changed in the Unicode standard itself: 65,536 characters came to appear insufficient, and the standard from version 2.0 onwards supports encoding of 1,112,064 characters by means of the UTF-16 surrogate mechanism. For that reason, ISO 10646 was limited to contain as many characters as could be encoded by UTF-16 and no more, that is, a little over a million characters instead of over 2,000 million. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard with the limitation to the UTF-16 range and under the name UTF-32. As for UTF-1, no one used it, because of its bad design (no way of distinguishing between single bytes, lead bytes and trail bytes, a problem similar to that of the Shift-JIS encoding of Japanese) and its poor performance (many division operations). Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed mixed width encoding, which came to be called UTF-8.
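The property that UTF-1 lacked is easy to see in UTF-8, where the role of a byte can be read from its value alone. The sketch below uses the byte ranges of current UTF-8 (which excludes C0, C1 and F5 to FF); the function name `classify_utf8_byte` is hypothetical. It also checks the 1,112,064 figure, on the assumption that it equals the full code space minus the 2,048 surrogate code points.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte of UTF-8 by its value alone."""
    if b <= 0x7F:
        return "single"    # a complete one-byte character
    if 0x80 <= b <= 0xBF:
        return "trail"     # continuation byte of a longer sequence
    if 0xC2 <= b <= 0xF4:
        return "lead"      # first byte of a 2- to 4-byte sequence
    return "invalid"       # never appears in well-formed UTF-8


roles = [classify_utf8_byte(b) for b in "a\u00e9\u20ac".encode("utf-8")]
print(roles)

# Code points encodable through UTF-16: everything except the surrogates.
ENCODABLE = 0x110000 - (0xDFFF - 0xD800 + 1)
print(ENCODABLE)
```

Since the three ranges never overlap, a decoder dropped into the middle of a stream can skip trail bytes until it finds the next character boundary.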


Differences between ISO 10646 and Unicode

ISO 10646 and Unicode have an identical repertoire and numbers: the same characters with the same numbers exist in both standards. The difference between them is that Unicode adds rules and specifications that are outside the scope of ISO 10646. ISO 10646 is a simple character map, an extension of previous standards like ISO 8859. In contrast, Unicode adds rules for collation, normalisation of forms, and the bidirectional algorithm for scripts like Hebrew and Arabic. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO 10646; Unicode must be implemented.
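Normalisation is a convenient example of a rule that a bare character map cannot supply. The sketch below uses Python's standard `unicodedata` module; the helper name `same_text` is hypothetical, and comparing NFC forms is only one of several reasonable ways to test equivalence.

```python
import unicodedata

composed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The code point sequences differ, so a plain comparison fails.
print(composed == decomposed)
# After normalisation they are recognised as the same text.
print(same_text(composed, decomposed))
```

Both spellings use valid code points, so only the normalisation rules say that they represent the same text.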


To support these rules and algorithms, Unicode adds many properties to each character in the set such as properties determining a character’s default bidirectional class and properties to determine how the character combines with other characters. If the character represents a numeric value such as the European number ‘8’, or the vulgar fraction ‘¼’, that numeric value is also added as a property of the character. Unicode intends these properties to support interoperable text handling with a mixture of languages.
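A few of these properties can be inspected directly. The sketch below reads them through Python's standard `unicodedata` module, which exposes data from the Unicode Character Database; the `describe` helper and its dictionary keys are hypothetical.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode properties of a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "bidi": unicodedata.bidirectional(ch),    # default bidirectional class
        "combining": unicodedata.combining(ch),   # canonical combining class
        "numeric": unicodedata.numeric(ch, None), # numeric value, if any
    }


for ch in ("8", "\u00bc", "\u05d0", "\u0301"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

The digit and the fraction carry numeric values, the Hebrew letter carries a right-to-left class, and the combining accent carries a non-zero combining class.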


Some applications support ISO 10646 characters but do not fully support Unicode. One such application, Linux xterm, can properly display all ISO 10646 characters that have a one-to-one character-to-glyph mapping and a single directionality. It can handle some combining marks by simple overstriking methods, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features). Most GUI applications use standard OS text drawing routines which handle such scripts, although the applications themselves still do not always handle them correctly. For instance, selecting text in certain scripts in Mozilla Firefox causes the text to jump around.


Citing the Universal Character Set

ISO 10646, a general, informal citation for the ISO/IEC 10646 family of standards, is acceptable in most prose. Although Unicode is a separate standard, the term is used just as often, informally, when discussing the UCS. However, any normative references to the UCS as a publication should cite a particular part and version, using the form ISO/IEC 10646-{part}:{year}; for example: ISO/IEC 10646-1:1993.
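The citation form is simple enough to generate mechanically. The helper below is a hypothetical convenience, not part of any standard; it assumes that editions published without a part number are cited without the hyphenated part.

```python
from typing import Optional


def cite_ucs(year: int, part: Optional[int] = None) -> str:
    """Build a normative citation string for an edition of the UCS."""
    if part is None:
        # Assumption: single-part editions omit the "-{part}" element.
        return f"ISO/IEC 10646:{year}"
    return f"ISO/IEC 10646-{part}:{year}"


print(cite_ucs(1993, part=1))
print(cite_ucs(2003))
```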


Correlation to Unicode

  • ISO/IEC 10646-1:1993 ≈ Unicode 1.1
  • ISO/IEC 10646-1:2000 ≈ Unicode 3.0
  • ISO/IEC 10646-2:2001 ≈ Unicode 3.2
  • ISO/IEC 10646:2003 ≈ Unicode 4.0
  • ISO/IEC 10646:2003 plus Amendment 1 ≈ Unicode 4.1
  • ISO/IEC 10646:2003 plus Amendment 1, Amendment 2, and part of Amendment 3 ≈ Unicode 5.0

See §D.1 of The Unicode Standard for more detail.



External links

  • Freely available ISO standards – includes a copy of ISO 10646:2003 (82 MB ZIP file, released 2006-09-28)
  • ISO/IEC JTC1/SC2/WG2, the working group in charge of ISO 10646
  • UTF-8 and Unicode FAQ
  • SIL's freeware fonts, editors and documentation


 

The Wikipedia article included on this page is licensed under the GFDL.