BOCU-1 is a MIME-compatible Unicode compression scheme. BOCU stands for Binary Ordered Compression for Unicode. BOCU-1 combines the wide applicability of UTF-8 with the compactness of SCSU. The encoding is useful for compressing short strings, and it maintains code point order. For larger amounts of Unicode text, general-purpose algorithms such as zip and bzip2 usually compress more efficiently.
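The trade-off between short and long inputs can be illustrated with a minimal sketch. It does not implement BOCU-1 (which is not part of the Python standard library); it only measures the fixed overhead that general-purpose compressors add to a short UTF-8 string, compared with their gain on a longer text. The sample strings and the helper name `sizes` are assumptions chosen for illustration, and the long text is artificially repetitive, which exaggerates the gain.

```python
import bz2
import zlib


def sizes(text):
    """Return byte counts for UTF-8 and for its zlib/bz2 compressed forms."""
    raw = text.encode("utf-8")
    return {
        "utf-8": len(raw),
        "zlib": len(zlib.compress(raw)),
        "bz2": len(bz2.compress(raw)),
    }


# A short non-Latin string: each Cyrillic letter takes 2 bytes in UTF-8.
short = sizes("Привет, мир")

# A much longer (hypothetical, highly repetitive) text.
long_ = sizes("Привет, мир. " * 500)

print("short:", short)
print("long: ", long_)
```

On the short string, both compressors produce output larger than the uncompressed UTF-8 bytes because of their headers and checksums; on the long text they shrink it substantially. This gap on short strings is the niche that schemes such as BOCU-1 and SCSU target.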
SCSU was created as a Unicode compression scheme with a byte-to-code-point ratio similar to that of language-specific code pages. It has not been widely adopted, although it fulfills the criteria for an IANA charset and is registered with IANA. SCSU is not suitable for MIME “text” media types, so it cannot be used directly in email and similar protocols. SCSU also requires a complicated encoder design for good performance.
SCSU has been adopted as an official Unicode Technical Standard. BOCU-1 has not been officially adopted by the Unicode Consortium, but Unicode Technical Note #6 describes the encoding in more detail.
External links
- Unicode Technical Note #6 BOCU-1: MIME Compatible Unicode Compression
- International Components for Unicode: a library that can convert between BOCU-1 and other Unicode encodings