Standard Compression Scheme for Unicode

The Standard Compression Scheme for Unicode (SCSU) is a Unicode Technical Standard for reducing the number of bytes needed to represent Unicode text, especially text that draws most of its characters from a small number of Unicode blocks. It does so by dynamically mapping the byte values 128-255 to windows of 128 consecutive code points. Because most alphabets fit within 128 contiguous code points, many text files can be encoded at one byte per character, plus a small overhead for selecting windows. To handle non-alphabetic scripts, whose characters do not fit in such windows, SCSU can also switch to UTF-16 internally.
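
The window mapping described above can be illustrated with a minimal Python sketch. It is not a conforming SCSU codec: it handles only ASCII plus characters from a single 128-code-point window, and the function names are hypothetical. The tag byte 0x12 and the window base U+0400 are assumptions intended to match the default Cyrillic window in UTS #6; check them against the specification before relying on them.

```python
# Simplified illustration of SCSU-style window mapping; not a conforming codec.
SC2 = 0x12            # assumed tag byte: "switch to dynamic window 2"
WINDOW_BASE = 0x0400  # assumed default base of window 2 (start of Cyrillic)
WINDOW_SIZE = 128
# Bytes that stand for themselves in this sketch: tab, LF, CR and 0x20-0x7F.
PASS_THROUGH = {0x09, 0x0A, 0x0D} | set(range(0x20, 0x80))


def encode_single_window(text: str) -> bytes:
    """Encode ASCII plus characters from one 128-code-point window."""
    out = bytearray([SC2])  # one byte of overhead to select the window
    for ch in text:
        cp = ord(ch)
        if cp in PASS_THROUGH:
            out.append(cp)  # ASCII text is stored unchanged
        elif WINDOW_BASE <= cp < WINDOW_BASE + WINDOW_SIZE:
            out.append(0x80 + cp - WINDOW_BASE)  # map into bytes 128-255
        else:
            raise ValueError(f"U+{cp:04X} is outside the window handled by this sketch")
    return bytes(out)


def decode_single_window(data: bytes) -> str:
    """Invert encode_single_window."""
    if not data or data[0] != SC2:
        raise ValueError("expected the window-selection tag first")
    chars = []
    for b in data[1:]:
        if b >= 0x80:
            chars.append(chr(WINDOW_BASE + b - 0x80))  # byte 128-255 -> window
        elif b in PASS_THROUGH:
            chars.append(chr(b))
        else:
            raise ValueError(f"byte 0x{b:02X} is a control tag this sketch does not handle")
    return "".join(chars)


if __name__ == "__main__":
    word = "Москва"
    packed = encode_single_window(word)
    # Show the encoded bytes, their count, and the UTF-8 byte count.
    print(packed.hex(), len(packed), len(word.encode("utf-8")))
    assert decode_single_window(packed) == word
```

Under these assumptions, a six-letter Cyrillic word takes seven bytes (one tag byte plus one byte per letter), compared with twelve bytes in UTF-8 or UTF-16. A real encoder must also choose and redefine windows, quote individual characters, and fall back to the UTF-16 mode, none of which this sketch attempts.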


SCSU has not been widely adopted. Few applications need to compress enough Unicode text to justify using a poorly supported compression scheme. Treated purely as a compression format, it is inferior to most commonly used compression programs for texts longer than a few kilobytes. It can be used as a text encoding, but it is very hard to handle internally, and its size advantage over UTF-16 or UTF-8 shrinks once external compression is applied, dramatically so with bzip2 and other modern compression schemes. SCSU does have one advantage: it can compress texts that are only a few characters long, whereas most full-scale compressors need a few kilobytes of data to overcome their overhead.
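
The overhead of general-purpose compressors on very short texts can be checked with a small Python sketch using the standard `zlib` and `bz2` modules. The helper name `size_report` is hypothetical, and the long sample is an artificially repetitive string chosen only to show the opposite extreme; real documents will fall somewhere in between.

```python
import bz2
import zlib


def size_report(text: str) -> dict:
    """Return byte counts for plain encodings and externally compressed UTF-8."""
    utf8 = text.encode("utf-8")
    return {
        "utf-8": len(utf8),
        "utf-16": len(text.encode("utf-16-be")),  # big-endian, no byte order mark
        "zlib over utf-8": len(zlib.compress(utf8, 9)),
        "bz2 over utf-8": len(bz2.compress(utf8)),
    }


if __name__ == "__main__":
    short = "Москва"        # a few characters: container overhead dominates
    long_ = short * 2000    # highly repetitive, so compressors do very well
    for label, sample in (("short", short), ("long", long_)):
        print(label, size_report(sample))
```

For the six-letter sample, both compressors produce more bytes than the uncompressed UTF-8 input, because their headers and checksums outweigh any savings. A one-window encoding of the kind SCSU uses needs roughly one byte per letter plus a tag byte, which is why it can still help at this size.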


Reuters, the organization that floated the first draft of SCSU, is believed to use SCSU internally.


External links

  • UTS #6: Compression Scheme for Unicode
  • SIL's freeware fonts, editors and documentation

The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.