In computing, UCS-2 and UTF-16 are alternative names for a 16-bit Unicode Transformation Format, a character encoding form that represents a series of abstract characters from Unicode and ISO/IEC 10646 as a series of 16-bit words suitable for storage or for transmission over data networks. UTF-16 is often used to refer to systems that support surrogates, whilst UCS-2 is used to refer to systems that support only the Basic Multilingual Plane (BMP); this usage is not universal, however, and the terms are often used interchangeably. UTF-16 is officially defined in Annex Q of ISO/IEC 10646-1. It is also described in The Unicode Standard version 3.0 and higher, as well as in the IETF's RFC 2781.
Unicode reserves 1,114,112 (= 2^20 + 2^16) code points, and currently assigns characters to more than 96,000 of those code points.
UTF-16 represents a character that has been assigned within the lower 65,536 code points of Unicode or ISO/IEC 10646 as a single code value equal to the character's code point: 0000 for 0, hexadecimal FFFD for FFFD, for example. A character above hexadecimal FFFF is represented as a surrogate pair of code values from the range D800-DFFF, the first taken from D800-DBFF and the second from DC00-DFFF. For example, the character at code point hexadecimal 10000 becomes the code value sequence D800 DC00, and the character at hexadecimal 10FFFD, the highest code point to which a character can be assigned, becomes the code value sequence DBFF DFFD. Unicode and ISO/IEC 10646 do not assign characters to any of the code points in the D800-DFFF range, so an individual code value from a surrogate pair never represents a character on its own.

These code values are then serialized as 16-bit words, one word per code value. Because the endianness of these words varies with the computer architecture, UTF-16 specifies three encoding schemes: UTF-16, UTF-16LE, and UTF-16BE.
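The mapping between code points and code values described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names `encode_utf16_code_values` and `decode_utf16_code_values` are made up for this example.

```python
from __future__ import annotations


def encode_utf16_code_values(code_point: int) -> list[int]:
    """Return the UTF-16 code value(s) for a single Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points do not represent characters")
    if code_point <= 0xFFFF:
        # Lower 65,536 code points: one code value, equal to the code point.
        return [code_point]
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 | (offset >> 10)       # upper 10 bits -> D800-DBFF
    low = 0xDC00 | (offset & 0x3FF)      # lower 10 bits -> DC00-DFFF
    return [high, low]


def decode_utf16_code_values(values: list[int]) -> list[int]:
    """Turn a sequence of UTF-16 code values back into code points."""
    code_points = []
    i = 0
    while i < len(values):
        value = values[i]
        if 0xD800 <= value <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(values) or not 0xDC00 <= values[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = values[i + 1]
            code_points.append(0x10000 + ((value - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= value <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            code_points.append(value)
            i += 1
    return code_points


# The two supplementary examples from the text:
print([f"{v:04X}" for v in encode_utf16_code_values(0x10000)])
print([f"{v:04X}" for v in encode_utf16_code_values(0x10FFFD)])
```

Rejecting unpaired surrogates with an exception is a design choice of this sketch; real decoders may instead substitute a replacement character.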
In the UTF-16 encoding scheme, the byte order is declared by prepending a Byte Order Mark (BOM) before the first serialized character. This BOM is the encoded version of the Zero-Width No-Break Space character, Unicode number FEFF in hex, which appears as the byte sequence FE FF for big-endian, or FF FE for little-endian. RFC 2781 states that text labeled UTF-16 that does not begin with a BOM should be interpreted as big-endian. A BOM at the beginning of UTF-16 encoded data is considered to be a signature separate from the text itself; it is for the benefit of the decoder.
The UTF-16LE and UTF-16BE encoding schemes are identical to the UTF-16 encoding scheme, except that no BOM is used: the byte order is implicit in the name of the encoding (LE for little-endian, BE for big-endian). The code value FEFF at the beginning of UTF-16LE or UTF-16BE encoded data is not considered to be a BOM; it is part of the text itself. The IANA has approved UTF-16, UTF-16BE, and UTF-16LE for use on the Internet, by those exact names (case-insensitively). The aliases UTF_16 or UTF16 may be meaningful in some programming languages or software applications, but they are not standard names.
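As a rough illustration of how a decoder might treat the three labels, the Python sketch below strips a leading BOM only for the plain UTF-16 label and leaves a leading FEFF in place for UTF-16LE and UTF-16BE. The helpers `split_bom` and `decode_labeled` are hypothetical names introduced here, and falling back to big-endian when no BOM is present is an assumption that follows the recommendation in RFC 2781.

```python
import codecs


def split_bom(data: bytes, default_order: str = "big"):
    """Return (byte_order, payload) for data labeled plain 'UTF-16'."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little", data[2:]
    # Assumption: no BOM means big-endian, as RFC 2781 recommends.
    return default_order, data


def decode_labeled(data: bytes, label: str) -> str:
    """Decode bytes according to a case-insensitive encoding label."""
    label = label.lower()
    if label == "utf-16le":
        # Byte order comes from the name; a leading FEFF stays in the text.
        return data.decode("utf-16-le")
    if label == "utf-16be":
        return data.decode("utf-16-be")
    if label == "utf-16":
        order, payload = split_bom(data)
        codec = "utf-16-le" if order == "little" else "utf-16-be"
        return payload.decode(codec)
    raise ValueError(f"unsupported label: {label!r}")


# The same four bytes read under two labels:
data = b"\xff\xfe\x7a\x00"
print(repr(decode_labeled(data, "UTF-16")))    # BOM consumed as a signature
print(repr(decode_labeled(data, "UTF-16LE")))  # FEFF kept as part of the text
```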
UTF-16 is the native internal representation of text in the NT-based versions of Windows, in the Java and .NET bytecode environments, and in Mac OS X's Cocoa and Core Foundation frameworks.
UTF-16 examples
| code point | character | UTF-16 code value(s) | glyph* |
| 122 (hex 7A) | small Z (Latin) | 007A | z |
| 27700 (hex 6C34) | water (Chinese) | 6C34 | 水 |
| 119070 (hex 1D11E) | musical G clef | D834 DD1E | 𝄞 |

"水z𝄞" (water, z, G clef), UTF-16 encoded

| labeled encoding | byte order | byte sequence |
| UTF-16LE | little-endian | 34 6C, 7A 00, 34 D8 1E DD |
| UTF-16BE | big-endian | 6C 34, 00 7A, D8 34 DD 1E |
| UTF-16 | little-endian, with BOM | FF FE, 34 6C, 7A 00, 34 D8 1E DD |
| UTF-16 | big-endian, with BOM | FE FF, 6C 34, 00 7A, D8 34 DD 1E |

* Appropriate font and software are required to see the correct glyphs.
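The byte sequences in the table can be checked with Python's built-in codecs. This is only a quick sketch: the helper `hex_bytes` is a name invented for this example, and the BOM is prepended by hand because Python's plain `"utf-16"` codec chooses a byte order on its own when encoding.

```python
import codecs

# "水z𝄞": water (U+6C34), small z (U+007A), musical G clef (U+1D11E)
text = "\u6c34z\U0001D11E"


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)


le = text.encode("utf-16-le")             # no BOM, little-endian
be = text.encode("utf-16-be")             # no BOM, big-endian
le_with_bom = codecs.BOM_UTF16_LE + le    # FF FE prepended by hand
be_with_bom = codecs.BOM_UTF16_BE + be    # FE FF prepended by hand

for label, data in [
    ("UTF-16LE", le),
    ("UTF-16BE", be),
    ("UTF-16 (little-endian, with BOM)", le_with_bom),
    ("UTF-16 (big-endian, with BOM)", be_with_bom),
]:
    print(f"{label}: {hex_bytes(data)}")
```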
Example UTF-16 Encoding Procedure

The character at code point 64321 (hexadecimal) is to be encoded. Since it is above FFFF, it must be encoded with a surrogate pair, as follows:

    v  = 0x64321
    v′ = v - 0x10000 = 0x54321 = 0101 0100 0011 0010 0001
    vh = 0101010000   // higher 10 bits of v′
    vl = 1100100001   // lower 10 bits of v′
    w1 = 0xD800       // the 1st word is initialized with the lower bound of the high-surrogate range
    w2 = 0xDC00       // the 2nd word is initialized with the lower bound of the low-surrogate range
    w1 = w1 | vh = 1101 1000 0000 0000 | 01 0101 0000 = 1101 1001 0101 0000 = 0xD950
    w2 = w2 | vl = 1101 1100 0000 0000 | 11 0010 0001 = 1101 1111 0010 0001 = 0xDF21

The correct encoding for this character is thus the following word sequence: 0xD950 0xDF21

External links
- Unicode Technical Note #12: UTF-16 for Processing