This page compares Unicode encodings. Two situations are considered: eight-bit-clean environments, and environments such as Simple Mail Transfer Protocol (SMTP) that forbid byte values with the high bit set. Such prohibitions originally allowed for links that carried only seven data bits, but they remain in the standards, so software must generate messages that comply with them. Standard Compression Scheme for Unicode (SCSU) and Binary Ordered Compression for Unicode (BOCU-1) are excluded from the comparison tables because their size is difficult to quantify simply.
Summary of size issues
UTF-32 requires four bytes to encode any character. Since characters outside the Basic Multilingual Plane (BMP) are rare, a document encoded in UTF-32 will usually be nearly twice as large as its UTF-16-encoded equivalent. UTF-8, on the other hand, uses between one and four bytes per character; it may use fewer bytes than UTF-16, the same number, or more to encode the same character. UTF-EBCDIC is never smaller than UTF-8 for printable characters, and is often larger, because of a design decision to allow the C1 control codes to be encoded as single bytes.
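The size relationships described above can be checked on a few sample strings. The following is a minimal sketch using Python's built-in codecs; the sample texts and the helper name `encoded_sizes` are illustrative choices, and the little-endian codec variants are used only so that no byte order mark is counted.

```python
# Sample strings chosen for illustration only.
SAMPLES = {
    "English": "Hello, world",
    "Russian": "Привет",
    "Chinese": "你好世界",
    "Emoji": "😀",
}


def encoded_sizes(text):
    """Return byte counts for UTF-8, UTF-16 and UTF-32, excluding any BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


for name, text in SAMPLES.items():
    print(name, encoded_sizes(text))
```

In these samples UTF-8 is smaller than UTF-16 for the English text, the same size for the Russian text and the emoji, and larger for the Chinese text, while UTF-32 is always four bytes per code point.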
For seven-bit environments, UTF-7 is generally more space-efficient than any of the other Unicode encodings combined with quoted-printable or base64.
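As a rough illustration of the seven-bit comparison, the sketch below measures one piece of mostly-ASCII text in UTF-7 and in UTF-8 wrapped in quoted-printable or base64. It relies on Python's standard `utf-7` codec and the `quopri` and `base64` modules; the helper name `seven_bit_sizes` and the sample text are assumptions made for this example, and results will differ for other texts and encoder settings.

```python
import base64
import quopri


def seven_bit_sizes(text):
    """Return the encoded length of `text` in three seven-bit-safe forms."""
    raw = text.encode("utf-8")
    return {
        "utf-7": len(text.encode("utf-7")),
        "utf-8 + quoted-printable": len(quopri.encodestring(raw)),
        "utf-8 + base64": len(base64.b64encode(raw)),
    }


# Mostly ASCII with one accented letter (U+00E9).
sizes = seven_bit_sizes("Caf\u00e9 au lait")
for name, size in sizes.items():
    print(name, size)
```

For very short inputs the fixed cost of starting and ending a UTF-7 shifted run can outweigh its advantage, so the comparison is best read as a tendency over longer text.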
Considerations other than size
For processing
For processing, a format should be easy to search, truncate, and generally process safely. All normal Unicode encodings use some form of fixed-size code unit. Depending on the format and the code point to be encoded, one or more of these code units represent a Unicode code point. To allow easy searching and truncation, the code-unit sequence for one code point must not occur within a longer sequence or across the boundary between two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties, but UTF-7 and GB18030 do not.
Fixed-size characters can be helpful, but even if there is a fixed width per code point (as in UTF-32), there is not a fixed width per displayed character, because of combining characters. If you are working heavily with a particular API that has standardised on a particular Unicode encoding, it is generally a good idea to use that encoding, to avoid converting before every call to the API. Similarly, if you are writing server-side software, it may simplify matters to use the same format for processing that you use for communication.
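The truncation property can be sketched for UTF-8, where every continuation byte has the bit pattern `10xxxxxx` and so can be told apart from the start of a sequence. The function name `truncate_utf8` is hypothetical; the second half of the example shows that a single displayed character may still span two code points even in UTF-32.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut UTF-8 data to at most `limit` bytes without splitting a code point."""
    if limit >= len(data):
        return data
    end = limit
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a sequence,
    # so back up to the lead byte of that sequence.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


encoded = "a\u00e9".encode("utf-8")   # b"a\xc3\xa9"
print(truncate_utf8(encoded, 2))      # the two-byte sequence is dropped whole

# "e" followed by U+0301 COMBINING ACUTE ACCENT displays as one character
# but is two code points, so it occupies eight bytes in UTF-32.
combined = "e\u0301"
print(len(combined), len(combined.encode("utf-32-le")))
```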
UTF-16 is popular because many APIs date from the time when Unicode was a 16-bit fixed-width encoding. Unfortunately, using UTF-16 makes characters outside the BMP a special case, which increases the risk of oversights in their handling.
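The special case arises because a code point outside the BMP is stored in UTF-16 as a pair of surrogate code units. The sketch below shows the standard arithmetic; the function name `to_surrogates` is an illustrative choice.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))

# Code that counts 16-bit units rather than code points sees two "characters".
print(len("\U0001F600".encode("utf-16-le")) // 2)
```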
For communication and storage
Some protocols and file formats may be limited to a specific set of encodings, but even when they are not, some encodings may offer better compatibility with existing implementations than others. The cost of converting between your processing format and your communication format should also be considered, both in terms of program size (e.g. GB18030 requires a huge mapping table) and run-time requirements. In addition, UTF-16 and UTF-32 are not byte-orientated, so a byte order must be selected when transmitting them over a byte-orientated network or storing them in a byte-orientated file. This may be achieved by standardising on a single byte order, by specifying the endianness as part of external metadata (for example, the MIME charset registry has distinct UTF-16BE and UTF-16LE registrations as well as the plain UTF-16 one), or by using a Byte Order Mark (BOM) at the start of the text.
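A minimal sketch of the byte order options for UTF-16 follows, using the BOM constants from Python's `codecs` module. The helper `decode_utf16` and its `default` parameter are hypothetical; the fallback used when no BOM is present stands in for whatever byte order the protocol or external metadata specifies.

```python
import codecs

text = "A"
print(text.encode("utf-16-be"))   # high byte first
print(text.encode("utf-16-le"))   # low byte first


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16, using a leading BOM if present, else `default`."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: the byte order must come from elsewhere (assumed here).
    return data.decode(default)


print(decode_utf16(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
```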
Finally, if the byte stream is subject to corruption, some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard, as they can always resynchronise at the start of the next good character. UTF-16 and UTF-32 handle corrupt bytes well (again recovering at the next good character), but a lost byte will garble all following text. GB18030 may be thrown out of sync by a corrupt or missing byte and has no designed-in recovery mechanism.
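The difference in behaviour after a lost byte can be demonstrated with a short experiment. This is only a sketch: the helper `drop_byte` and the sample text are made up for the example, and Python's `errors="replace"` handler is used to stand in for a decoder that substitutes U+FFFD and carries on.

```python
def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate a lost byte by removing the byte at `index`."""
    return data[:index] + data[index + 1:]


text = "na\u00efve caf\u00e9"

# UTF-8: lose the second byte of the two-byte sequence for U+00EF.
utf8_result = drop_byte(text.encode("utf-8"), 3).decode("utf-8", errors="replace")

# UTF-16: lose one byte of the third code unit; every later unit is misaligned.
utf16_result = drop_byte(text.encode("utf-16-le"), 4).decode("utf-16-le", errors="replace")

print(repr(utf8_result))    # one replacement character, rest of the text intact
print(repr(utf16_result))   # everything after the loss is garbled
```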
In detail
The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the tables. The figures assume that overheads at the start and end of the block of text are negligible.
Eight-bit environments
| Code range (hexadecimal) | UTF-8 | UTF-16 | UTF-32 | UTF-EBCDIC | GB18030 |
| 000000 – 00007F | 1 | 2 | 4 | 1 | 1 |
| 000080 – 00009F | 2 | 2 | 4 | 1 | 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters); 4 for everything else |
| 0000A0 – 0003FF | 2 | 2 | 4 | 2 | as for 000080 – 00009F |
| 000400 – 0007FF | 2 | 2 | 4 | 3 | as for 000080 – 00009F |
| 000800 – 003FFF | 3 | 2 | 4 | 3 | as for 000080 – 00009F |
| 004000 – 00FFFF | 3 | 2 | 4 | 4 | as for 000080 – 00009F |
| 010000 – 03FFFF | 4 | 4 | 4 | 4 | 4 |
| 040000 – 10FFFF | 4 | 4 | 4 | 5 | 4 |
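The UTF-8, UTF-16 and UTF-32 columns follow directly from the range boundaries, which the sketch below expresses as functions and compares with Python's own encoders. The function names are illustrative; UTF-EBCDIC is omitted because Python's standard library has no codec for it, and the GB18030 line simply measures the output of the built-in `gb18030` codec.

```python
def utf8_len(cp):
    """Bytes needed for one code point in UTF-8."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def utf16_len(cp):
    """Bytes needed for one code point in UTF-16 (surrogate pair above the BMP)."""
    return 2 if cp < 0x10000 else 4


def utf32_len(cp):
    """UTF-32 always uses four bytes."""
    return 4


def codec_len(cp, codec):
    """Measure the encoded length with one of Python's built-in codecs."""
    return len(chr(cp).encode(codec))


# Range boundaries from the table (surrogate code points are not encodable).
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"{cp:06X}", utf8_len(cp), utf16_len(cp), utf32_len(cp))

print(codec_len(ord("中"), "gb18030"))
```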
Seven-bit environments
This table may not cover every special case and so should be used for estimation and comparison only. To determine the size of text in an encoding accurately, see the actual specifications.
| Code range (hexadecimal) | UTF-7 | UTF-8 quoted-printable | UTF-8 base64 | UTF-16 quoted-printable | UTF-16 base64 | UTF-32 quoted-printable | UTF-32 base64 | GB18030 quoted-printable | GB18030 base64 |
| 000000 – 000020 | same as 000080 – 00FFFF | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 000021 – 00003C | 1 for "direct characters" and possibly "optional direct characters" (depending on the encoder setting); 2 for +; otherwise same as 000080 – 00FFFF | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
| 00003D (equals sign) | as for 000021 – 00003C | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 00003E – 00007E | as for 000021 – 00003C | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
| 00007F | 5 for an isolated character inside a run of single-byte characters; for runs, 2⅔ per character, plus padding to make a whole number of bytes, plus two to start and finish the run | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 000080 – 0007FF | as for 00007F | 6 | 2⅔ | 2–6, depending on whether the byte values need to be escaped | 2⅔ | 8–12, depending on whether the final two byte values need to be escaped | 5⅓ | 4–6 for characters inherited from GB2312/GBK (e.g. most Chinese characters); 8 for everything else | 2⅔ for characters inherited from GB2312/GBK (e.g. most Chinese characters); 5⅓ for everything else |
| 000800 – 00FFFF | as for 00007F | 9 | 4 | as for 000080 – 0007FF | 2⅔ | as for 000080 – 0007FF | 5⅓ | as for 000080 – 0007FF | as for 000080 – 0007FF |
| 010000 – 10FFFF | same as two characters from above | 12 | 5⅓ | 8–12, depending on whether the low bytes of the surrogates need to be escaped | 5⅓ | as for 000080 – 0007FF | 5⅓ | 8 | 5⅓ |
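The fractional figures in the base64 columns come from base64 turning every three input bytes into four output characters, while quoted-printable spends three characters on each byte that must be escaped. The sketch below reproduces that arithmetic; the names `BASE64_RATIO`, `base64_cost` and `qp_cost` are illustrative, and line-break and padding overheads are ignored, as in the table.

```python
import base64
import quopri
from fractions import Fraction

# base64 emits 4 characters for every 3 input bytes.
BASE64_RATIO = Fraction(4, 3)


def base64_cost(bytes_per_code_point):
    """Average base64 characters per code point, ignoring padding."""
    return bytes_per_code_point * BASE64_RATIO


def qp_cost(raw: bytes) -> int:
    """Quoted-printable length of a short byte string, as measured by quopri."""
    return len(quopri.encodestring(raw))


print(base64_cost(1), base64_cost(2), base64_cost(4))   # 4/3, 8/3, 16/3
print(len(base64.b64encode(bytes(300))))                # 300 bytes become 400
print(qp_cost(b"A"), qp_cost(b"="), qp_cost(b"\x00"))   # literal vs escaped bytes
```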