|
Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.
The structure of EUC is based on the ISO 2022 standard, which specifies a way to represent character sets containing at most 94 characters, 8836 (94²) characters, or 830584 (94³) characters as sequences of 7-bit codes. Only ISO 2022-compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3, or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. G0 is almost always an ISO 646-compliant coded character set (e.g. US-ASCII/KS X 1003/ISO 646:KR in EUC-KR and US-ASCII/the lower half of JIS X 0201 in EUC-JP) that is invoked on GL (i.e. with the most significant bit cleared).
To get the EUC form of an ISO 2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 code is set (by adding 128 to each of these 7-bit bytes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO 646 code or to the ISO 2022 (EUC) code.
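The high-bit rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper names `iso2022_to_euc` and `is_g0_byte` are hypothetical, and the sample assumes the GB2312 7-bit code 0x30 0x21 for the hanzi 啊.

```python
def iso2022_to_euc(code: bytes) -> bytes:
    """Set the most significant bit of every 7-bit byte (i.e. add 128)."""
    if any(b > 0x7F for b in code):
        raise ValueError("expected 7-bit bytes")
    return bytes(b | 0x80 for b in code)


def is_g0_byte(b: int) -> bool:
    """A byte with the high bit clear belongs to the ISO 646 (G0) set."""
    return b < 0x80


# GB2312 assigns the hanzi 啊 the 7-bit code 0x30 0x21 (assumed sample).
euc = iso2022_to_euc(b"\x30\x21")
print(euc.hex())  # b0a1
```

Python's built-in `gb2312` codec produces the same two bytes for this character, which gives a quick cross-check of the rule.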
The most commonly used EUC codes are variable-length encodings in which a character belonging to G0 (an ISO 646-compliant coded character set) takes one byte and a character belonging to G1 (a 94×94 coded character set) is represented in two bytes. The EUC-CN form of GB2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, whereas a single character in EUC-TW can take up to four bytes.
EUC-CN
EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters. Unlike the case of Japanese, the ISO 2022 form of GB2312 is not normally used, though a variant form called HZ was sometimes used on Usenet.
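Because EUC-CN is a two-byte EUC code (one byte for a G0 character, two bytes in the range 0xA1-0xFE for a G1 character), a byte string can be cut into characters by looking only at the high bit. The following is a rough sketch under that assumption; `split_two_byte_euc` is a hypothetical helper, and the sample text is encoded with Python's `gb2312` codec.

```python
def split_two_byte_euc(data: bytes) -> list[bytes]:
    """Split a two-byte EUC string (e.g. EUC-CN) into per-character slices."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # G0: one byte with the high bit clear.
            chars.append(data[i:i + 1])
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # G1: two bytes, both with the high bit set.
            chars.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"invalid byte 0x{b:02X} at offset {i}")
    return chars


sample = "EUC 编码".encode("gb2312")
print([len(c) for c in split_two_byte_euc(sample)])  # [1, 1, 1, 1, 2, 2]
```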
EUC-CN can also be used to encode the Unicode-based GB18030 character encoding, which includes traditional characters, although GB18030 is more frequently used without EUC encoding, since GB18030 is already a Unicode encoding. However, GB18030 encoded in EUC-CN is a variable-length character encoding, because GB18030 contains more than 8836 (94×94) characters.
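As a small illustration of the variable length mentioned above, Python's built-in `gb18030` codec can be used to compare encoded sizes. The three sample characters are arbitrary choices made for this sketch, not taken from the standard's own examples.

```python
# One ASCII letter, one GB2312 hanzi, and one code point outside the
# Basic Multilingual Plane (arbitrary samples for illustration).
samples = ["A", "中", "\U00010000"]

lengths = {}
for ch in samples:
    encoded = ch.encode("gb18030")
    lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")
```

The hanzi keeps the same two bytes it has under the `gb2312` codec, while the last sample needs four bytes.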
Related encoding systems
An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.
EUC-JP
EUC-JP is a variable-width encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201.
- A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1 - 0xFE.
- A character from JIS X 0212 (code set 3) is represented by three bytes, the first being 0x8F, the following two in the range 0xA1 - 0xFE.
- A character from the upper half of JIS X 0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second in the range 0xA1 - 0xDF.
- A character from the lower half of JIS X 0201 (ASCII, code set 0) is represented by one byte, in the range 0x21 - 0x7E.
This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape sequences employed by ISO-2022-JP, which is based on the same character set standards.
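The four byte patterns listed above mean that the first byte of an EUC-JP character is enough to tell how long it is. A minimal sketch follows, with hypothetical helper names `euc_jp_char_length` and `split_euc_jp`; it checks only lead bytes and does not validate the ranges of the trailing bytes.

```python
def euc_jp_char_length(lead: int) -> int:
    """Return the byte length of the EUC-JP character starting with `lead`."""
    if lead < 0x80:
        return 1  # code set 0: ASCII / lower half of JIS X 0201
    if lead == 0x8E:
        return 2  # code set 2: half-width kana
    if lead == 0x8F:
        return 3  # code set 3: JIS X 0212
    if 0xA1 <= lead <= 0xFE:
        return 2  # code set 1: JIS X 0208
    raise ValueError(f"invalid EUC-JP lead byte 0x{lead:02X}")


def split_euc_jp(data: bytes) -> list[bytes]:
    """Split EUC-JP bytes into per-character slices (lead bytes only checked)."""
    chars = []
    i = 0
    while i < len(data):
        n = euc_jp_char_length(data[i])
        if i + n > len(data):
            raise ValueError("truncated character at end of input")
        chars.append(data[i:i + n])
        i += n
    return chars


# "A", hiragana あ (0xA4 0xA2), half-width ｱ (0x8E 0xB1),
# and a three-byte code set 3 sequence (0x8F 0xB0 0xA1).
sample = b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
print([c.hex() for c in split_euc_jp(sample)])  # ['41', 'a4a2', '8eb1', '8fb0a1']
```

No escape sequence is needed anywhere in the sample: the ASCII letter and the Japanese characters sit side by side and are told apart by their byte values alone.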
In Japan, the EUC-JP encoding is heavily used by Unix or Unix-like operating systems (except for HP-UX), while Shift_JIS or its extensions (Windows code page 932 and MacJapanese) are used on other platforms. Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS they are using.
EUC-JISX0213 is similar to EUC-JP, but differs in that the two planes of JIS X 0213 take the place of JIS X 0208 and JIS X 0212. There is a similar relationship between Shift_JIS and Shift-JISX0213.
EUC-KR
EUC-KR is a variable-length character encoding that represents Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) and KS X 1003 (formerly KS C 5636)/ISO 646:KR/US-ASCII. KS X 2901 (formerly KS C 5861) stipulates the encoding, and RFC 1557 named it EUC-KR. A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1-0xFE), and a character from KS X 1003/US-ASCII (G0, code set 0) takes one byte in GL (0x21-0x7E). It is the most widely used legacy character encoding in Korea on all three major platforms (Unix-like OS, Windows and Mac), but its use has been very slowly decreasing as UTF-8 gains popularity, especially on Linux and Mac OS X. It is usually referred to as Wansung (완성) in South Korea. The default Korean code page for Windows (code page 949) is a proprietary but upward-compatible extension of EUC-KR referred to as Unified Hangul Code (통합 완성형, Tonghab Wansunghyeong). Mac Korean, used in classic Mac OS, is also compatible with EUC-KR.
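The upward compatibility of code page 949 can be illustrated with Python's built-in `euc_kr` and `cp949` codecs: bytes that are valid EUC-KR should decode to the same text under both. The sample bytes below (the word 한글 preceded by an ASCII label) are chosen for this sketch only.

```python
# ASCII label followed by 한글 encoded as two GR byte pairs.
data = b"EUC-KR: \xc7\xd1\xb1\xdb"

text_euc_kr = data.decode("euc_kr")
text_cp949 = data.decode("cp949")

print(text_euc_kr)                # EUC-KR: 한글
print(text_euc_kr == text_cp949)  # True
```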
EUC-TW
EUC-TW is a variable-length character encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94×94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan; Big5 is much more common. A character in US-ASCII (G0, code set 0) is encoded as a single byte in GL (0x21-0x7E), and a character in CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1-0xFE). A character in planes 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes, with the first byte always being 0x8E (Single Shift 2) and the second byte indicating the plane (the plane number is obtained by subtracting 0xA0 from the second byte). The third and fourth bytes are in GR (0xA1-0xFE). Note that plane 1 of CNS 11643 is therefore encoded twice: once as code set 1 and again as part of code set 2.
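The byte layout described above can be sketched as a small parser that reports, for each character, its CNS 11643 plane and its 7-bit two-byte code. This is an illustrative sketch only: `parse_euc_tw` is a hypothetical helper, plane 0 is used here as an ad hoc marker for US-ASCII, and no mapping to Unicode is attempted.

```python
def in_gr(b: int) -> bool:
    """True if the byte lies in the GR range 0xA1-0xFE."""
    return 0xA1 <= b <= 0xFE


def parse_euc_tw(data: bytes) -> list[tuple[int, int]]:
    """Return (plane, code) pairs; plane 0 marks a US-ASCII byte."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            out.append((0, b))  # code set 0
            i += 1
        elif b == 0x8E:
            # Code set 2: 0x8E, plane byte, then two GR bytes.
            if i + 3 >= n:
                raise ValueError("truncated four-byte sequence")
            plane = data[i + 1] - 0xA0
            b3, b4 = data[i + 2], data[i + 3]
            if not 1 <= plane <= 16 or not (in_gr(b3) and in_gr(b4)):
                raise ValueError(f"invalid four-byte sequence at offset {i}")
            out.append((plane, ((b3 & 0x7F) << 8) | (b4 & 0x7F)))
            i += 4
        elif in_gr(b) and i + 1 < n and in_gr(data[i + 1]):
            # Code set 1: plane 1 as two GR bytes.
            out.append((1, ((b & 0x7F) << 8) | (data[i + 1] & 0x7F)))
            i += 2
        else:
            raise ValueError(f"invalid byte 0x{b:02X} at offset {i}")
    return out


# The same plane 1 code written in both forms, then a plane 2 code.
sample = b"A\xa4\xa1\x8e\xa1\xa4\xa1\x8e\xa2\xa1\xa1"
result = parse_euc_tw(sample)
print(result == [(0, 0x41), (1, 0x2421), (1, 0x2421), (2, 0x2121)])  # True
```

The second and third entries come out identical, which reflects the double encoding of plane 1 noted above.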
External links
- EUC-JP codeset table (non-ASCII part)
- GB18030-2000 — The New Chinese National Standard
- The New Generation of Pre-Press Software in China — mentions the 748 code
- Description of the EUC-TW code (in Chinese)
- Manual page of EUC-JISX0213 in Perl Encode module
- EUC-JP code range chart at Opengroup Japan
- International Register of Coded Character Sets — The coded character sets of China, Japan, South Korea, North Korea and Taiwan (ISO/IEC)
- Chinese, Japanese, and Korean character set standards and encoding systems
|