Unicode's Universal Character Set potentially supports over 1 million (1,114,112 = 2^20 + 2^16, or 17 × 2^16, hexadecimal 110000) code points.
As of Unicode 5.0.0, 102,012 (9.2%) of these code points are assigned, with another 137,468 (12.3%) reserved for private use, 2,048 for surrogates, and 66 designated noncharacters, leaving 872,582 (78.3%) unassigned. The number of assigned code points is made up as follows:
(See the summary table for a more detailed breakdown.)
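The size of the code space quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the names `TOTAL` and `LAST_CODE_POINT` are made up for the example.

```python
# Size of the Unicode code space, written in the three equivalent
# ways the text gives. TOTAL is a hypothetical name for this sketch.
TOTAL = 0x110000

assert TOTAL == 1_114_112
assert TOTAL == 2**20 + 2**16
assert TOTAL == 17 * 2**16

# The highest code point is one less than the size of the space.
LAST_CODE_POINT = TOTAL - 1
print(hex(LAST_CODE_POINT))
```

Python's built-in `chr()` accepts exactly this range of values, so `chr(LAST_CODE_POINT)` succeeds while `chr(TOTAL)` raises `ValueError`.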
Unicode characters can be categorized in many ways. Every character is assigned a script (though many are assigned the common or inherited scripts, where they inherit the script from the adjacent character). In Unicode, a script is a coherent writing system that includes letters but may also include script-specific punctuation, diacritics and other marks, numerals, and symbols. A single script supports one or more languages. Characters are assigned in blocks. Block sizes are multiples of 16 code points: many, for example, contain 128 or 256 code points. Every character is also assigned a general category and subcategory. The general categories are: letter, mark, number, punctuation, symbol, or control (in other words, a formatting or non-graphical character). The blocks of characters are assigned according to various planes. Most characters are currently assigned to the first plane: the Basic Multilingual Plane. This eases the transition for legacy software, since the Basic Multilingual Plane is addressable with just two octets. The characters outside the first plane usually have very specialized or rare use.
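As a rough illustration of general categories, Python's standard `unicodedata` module reports a two-letter code whose first letter is the major class. The `MAJOR_CLASS` table and the `describe` helper below are hypothetical names for this sketch; note that `unicodedata` also reports a separator class (`Z`) that the list above does not mention, and results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Major classes keyed by the first letter of the two-letter category code.
# "Z" (separator) is reported by unicodedata but is not in the list above.
MAJOR_CLASS = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "control, format, or other",
}

def describe(ch):
    """Return (category code, major class) for a single character."""
    code = unicodedata.category(ch)
    return code, MAJOR_CLASS[code[0]]

for ch in ["A", "5", "\u0301", "!", "+"]:
    print(f"U+{ord(ch):04X}", describe(ch))
```

Here `"\u0301"` is the combining acute accent, an example of a mark.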
The first 256 code points correspond with those of ISO 8859-1, the most popular 8-bit character encoding in the Western world. As a result, the first 128 characters are also identical to ASCII. Though Unicode refers to these as a Latin script block, these two blocks contain many characters that are commonly useful outside of the Latin script.
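This correspondence can be observed directly in Python, whose `latin-1` codec maps each byte value to the code point with the same number. The variable names are illustrative only.

```python
# Decode every possible byte value as ISO 8859-1 (Latin-1).
raw = bytes(range(256))
text = raw.decode("latin-1")

# Each byte value n becomes the character with code point n.
assert [ord(c) for c in text] == list(range(256))

# The lower half agrees with ASCII.
ascii_text = bytes(range(128)).decode("ascii")
assert ascii_text == text[:128]
```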
Planes
Unicode code points can be logically divided into 17 planes, each with 65,536 (= 2^16) code points, although currently only a few planes are used:
- Plane 0 (0000–FFFF): Basic Multilingual Plane (BMP). This is the plane containing most of the character assignments so far. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing systems in current use.
- Plane 1 (10000–1FFFF): Supplementary Multilingual Plane (SMP)
- Plane 2 (20000–2FFFF): Supplementary Ideographic Plane (SIP)
- Planes 3 to 13 (30000–DFFFF): unassigned
- Plane 14 (E0000–EFFFF): Supplementary Special-purpose Plane (SSP)
- Plane 15 (F0000–FFFFF): reserved for the Private Use Area (PUA)
- Plane 16 (100000–10FFFF): reserved for the Private Use Area (PUA)
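Because each plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below uses hypothetical helper names (`plane_of`, `plane_name`) and labels planes 3 to 13 as unassigned, following the list above.

```python
# Abbreviations taken from the plane list above.
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP", 15: "PUA", 16: "PUA"}

def plane_of(code_point):
    """Return the plane number (0 to 16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # each plane spans 0x10000 code points

def plane_name(code_point):
    """Return the plane abbreviation, or 'unassigned' for planes 3 to 13."""
    return PLANE_NAMES.get(plane_of(code_point), "unassigned")

for cp in (0x0041, 0x1D11E, 0x20000, 0x30000, 0xE0001, 0x10FFFF):
    print(f"U+{cp:04X}: plane {plane_of(cp)} ({plane_name(cp)})")
```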
Currently, about ten percent of the potential space is used. Furthermore, ranges of characters have been tentatively blocked out for every current and ancient writing system (script) the Unicode Consortium has been able to identify (see [1]). While Unicode may eventually need to use another of the 11 spare planes for ideographic characters, other planes would remain even if previously unknown scripts with tens of thousands of characters were discovered. The limit of roughly 1.1 million code points is therefore unlikely to be reached in the near future.
Basic Multilingual Plane
The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters. Most of the allocated code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.
[Figure: Roadmap of the Unicode Basic Multilingual Plane. Each numbered box represents 256 code points.]
The graphic is a visual roadmap to the Basic Multilingual Plane. The colours in use are:
- Black = Latin scripts and symbols
- Light Blue = Linguistic scripts
- Blue = Other European scripts
- Orange = Middle Eastern and SW Asian scripts
- Light Orange = African scripts
- Green = South Asian scripts
- Purple = Southeast Asian scripts
- Red = East Asian scripts
- Light Red = Unified CJK Han
- Yellow = Canadian Aboriginal scripts
- Magenta = Symbols
- Dark Grey = Diacritics
- Light Grey = UTF-16 surrogates and private use
- Cyan = Miscellaneous characters
- White = Unused
As of Unicode 5.0, the BMP includes the following scripts:
- Supplemental Punctuation (2E00–2E7F)
- CJK Radicals Supplement (2E80–2EFF)
- Kangxi Radicals (2F00–2FDF)
- Ideographic Description Characters (2FF0–2FFF)
- CJK Symbols and Punctuation (3000–303F)
- Hiragana (3040–309F)
- Katakana (30A0–30FF)
- Bopomofo (3100–312F)
- Hangul Compatibility Jamo (3130–318F)
- Kanbun (3190–319F)
- Bopomofo Extended (31A0–31BF)
- CJK Strokes (31C0–31EF)
- Katakana Phonetic Extensions (31F0–31FF)
- Enclosed CJK Letters and Months (3200–32FF)
- CJK Compatibility (3300–33FF)
- CJK Unified Ideographs Extension A (3400–4DBF)
- Yijing Hexagram Symbols (4DC0–4DFF)
- CJK Unified Ideographs (4E00–9FFF)
- Yi Syllables (A000–A48F)
- Yi Radicals (A490–A4CF)
- Modifier Tone Letters (A700–A71F)
- Latin Extended-D (A720–A7FF)
- Syloti Nagri (A800–A82F)
- Phags-pa (A840–A87F)
- Hangul Syllables (AC00–D7AF)
- High Surrogates (D800–DB7F)
- High Private Use Surrogates (DB80–DBFF)
- Low Surrogates (DC00–DFFF)
- Private Use Area (E000–F8FF)
- CJK Compatibility Ideographs (F900–FAFF)
- Alphabetic Presentation Forms (FB00–FB4F)
- Arabic Presentation Forms-A (FB50–FDFF)
- Variation Selectors (FE00–FE0F)
- Vertical Forms (FE10–FE1F)
- Combining Half Marks (FE20–FE2F)
- CJK Compatibility Forms (FE30–FE4F)
- Small Form Variants (FE50–FE6F)
- Arabic Presentation Forms-B (FE70–FEFF)
- Halfwidth and Fullwidth Forms (FF00–FFEF)
- Specials (FFF0–FFFF)
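A block list like the one above lends itself to a simple range lookup. The following is a minimal sketch that copies only five of the listed ranges; `BLOCKS`, `STARTS`, and `block_of` are hypothetical names, and a real implementation would load the full block table rather than a hand-picked subset.

```python
from bisect import bisect_right

# A small, sorted subset of the block ranges listed above.
BLOCKS = [
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0xE000, 0xF8FF, "Private Use Area"),
]
STARTS = [start for start, _, _ in BLOCKS]

def block_of(ch):
    """Return the block name for a character, or None if it is not in the subset."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None

print(block_of("\u3042"))  # a Hiragana letter
print(block_of("\u4e2d"))  # a CJK unified ideograph
print(block_of("A"))       # outside the subset, so None
```

Binary search keeps the lookup logarithmic in the number of blocks, which matters little for five entries but scales to the full table.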
Future additions
Several scripts are expected to be included in the BMP in the next revision of Unicode. These scripts, and their proposed code point ranges, are the following:
Several other scripts are proposed for inclusion in the BMP, including:
Supplementary Multilingual Plane
Plane 1, the Supplementary Multilingual Plane (SMP), is mostly used for historic scripts such as Linear B, but is also used for musical and mathematical symbols.
As of Unicode 5.0, Plane One includes the following scripts:
Many other scripts are proposed for inclusion in Plane One, including:
Supplementary Ideographic Plane
Plane 2, the Supplementary Ideographic Plane (SIP), is used for about 40,000 Unified Han Ideographs that have seldom been used in daily written communication.
Unused planes
Unicode has not yet assigned any characters to Planes 3 through 13. The current study of written language has not identified any need for these planes yet. However, symbol characters that arise outside the script writing systems could have potentially limitless possibilities for characters. The UCS and Unicode take requests for symbols on a case-by-case basis.
Supplementary Special-purpose Plane
Plane 14 (E in hexadecimal), the Supplementary Special-purpose Plane (SSP), currently contains non-graphical characters in two blocks of 128 and 240 characters. The first block is for language tag characters for use when language cannot be indicated through other protocols (such as the xml:lang attribute in XML). The other block contains glyph variation selectors to indicate an alternate glyph for a character that cannot be determined by context.
Private use planes
Two planes (planes 15 and 16) have been set aside for character assignment by parties outside the ISO and the Unicode Consortium. Use of such characters will have limited interoperability. Software and fonts that support Unicode will not necessarily support character assignments made by other parties. If the characters have unusual properties, such as right-to-left directionality, other implementations are especially likely to treat them inappropriately.
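One way software can recognise private-use code points without knowing what they mean is to test their general category, which is `Co` for private use. The helper name `is_private_use` is hypothetical; the sketch relies on Python's standard `unicodedata` module and says nothing about how any particular vendor assigns these characters.

```python
import unicodedata

def is_private_use(ch):
    """True if the character's general category is Co (private use)."""
    return unicodedata.category(ch) == "Co"

samples = [
    "\ue000",        # start of the BMP Private Use Area
    chr(0xF0000),    # first code point of plane 15
    chr(0x100000),   # first code point of plane 16
    "A",             # an ordinary letter, for contrast
]
for ch in samples:
    print(f"U+{ord(ch):04X}", is_private_use(ch))
```

An application might use such a check to fall back to a placeholder glyph or to refuse interchange of text whose meaning depends on a private agreement.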
Plane mapping tables
Unicode mapping tables
| BMP | BMP | SMP | SIP | SIP | SSP |
| 0000–0FFF | 8000–8FFF | 10000–10FFF | 20000–20FFF | 28000–28FFF | E0000–E0FFF |
| 1000–1FFF | 9000–9FFF | | 21000–21FFF | 29000–29FFF | |
| 2000–2FFF | A000–AFFF | 12000–12FFF | 22000–22FFF | 2A000–2AFFF | |
| 3000–3FFF | B000–BFFF | | 23000–23FFF | | |
| 4000–4FFF | C000–CFFF | 1D000–1DFFF | 24000–24FFF | 2F000–2FFFF | |
| 5000–5FFF | D000–DFFF | | 25000–25FFF | | |
| 6000–6FFF | E000–EFFF | | 26000–26FFF | | |
| 7000–7FFF | F000–FFFF | | 27000–27FFF | | |
Graphical characters
By far the most common Unicode characters are graphical characters. Graphical characters all have some visual representation or glyphs associated with them. While Unicode does not specify the concrete glyphs for these characters, it does specify recommended or prototypical glyphs. The actual glyph used by textual display software will depend on the font files used and whether those fonts provide support for contextual and non-contextual glyph variations.
Script-specific characters
In Unicode, a script is an abstract, coherent and unified writing system supporting one or more concrete writing systems, which in turn support the written forms of one or more languages. Some scripts support one and only one language, for example Armenian. Other scripts, like Latin, support many different writing systems: English, French, German, Italian, and Latin, to name just a few. Some languages also make use of multiple alternate writing systems. Turkish, for example, used Arabic before the 20th century and transitioned to Latin in the early part of the 20th century. For a list of languages supported by each script, see the list of languages by writing system.
When multiple languages make use of the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However, Swedish includes the character ‘å’ (sometimes called a “Swedish O”) while English has no such character. Nor does English make use of the diacritic combining ring above for any character. In general, languages sharing the same script share many of the same characters. Despite these peripheral differences, the Swedish and English writing systems are said to use the same Latin script. The Unicode abstraction of scripts is thus a basic organizing technique. The differences between different alphabets or writing systems remain and are supported through Unicode’s flexible scripts, combining marks and collation algorithms.
While all characters have the property of belonging to a script, many characters, such as symbols, indicate “common” or “inherited” for their script property. The unified diacritical characters and unified punctuation characters frequently have the “common” or “inherited” script property. However, the individual scripts often have their own punctuation and diacritics. As a result, many scripts include not only letters, but also diacritics and other marks, punctuation, numerals and even their own idiosyncratic symbols and space characters.
Unicode already includes over 60 scripts supporting hundreds or even thousands of languages throughout the world. Unicode is actively working on many more, as indicated by its roadmap.
Unihan characters -
Han unification is the process used by the authors of Unicode and the Universal Character Set to map multiple character sets of the CJK languages into a single set of unified characters. The Chinese characters are common to Chinese (where they are called hanzi), Japanese (where they are called kanji), and Korean (where they are called hanja). Modern Korean, Chinese and Japanese typefaces may represent a given Han character as somewhat different glyphs. However, in the formulation of Unicode, these different glyphs were treated as the same character. This unification is referred to as "Han unification", with the resulting character repertoire sometimes referred to as Unihan.
Besides the Unihan ideographs, Han unification also provides Han unified punctuation, symbols, numerals, ideograph stroke characters and ideographic description characters.
Phonetic characters -
Unicode includes letters and marks from the International Phonetic Alphabet (IPA) and those supporting other phonetic writing systems too. Essentially these characters are used as graphemes for phonemes. In terms of script or writing system, these phonetic alphabets are basically one writing system. What distinguishes the various phonetic alphabets are their glyphs. However, as with numerals, the UCS often focuses more on the presentational forms or glyphs given to these phonemes by the various phonetic alphabets. This is in contrast to the alternate names of these characters provided by the Unicode NamesList property, which typically reflect the common phoneme semantics shared by those various writing systems regardless of the glyphs used. These differences manifest in the names given to these characters: the canonical UCS name and the NamesList property names. Similarly, Unicode assigns the value “Latin” to the script property of many of these characters. However, the primary purpose of these characters’ inclusion in the character set is to support the various phonetic writing systems. These phonetic writing systems, in many ways, constitute a single unified writing system of their own, despite borrowing glyphs from the Latin, Greek and Cyrillic scripts.
Numerals -
Numerals (often called numbers in Unicode) are characters that denote a number. The same Arabic-Indic numerals are used widely in various writing systems throughout the world and all share the same semantics for denoting numbers. However, the glyphs representing these numerals differ widely from one writing system to another. To support these glyph differences, Unicode includes duplicate encodings of these numerals within many of the script blocks. These digits are repeated in 23 separate blocks, including twice in Arabic. Six additional blocks contain the digits again as rich text or legacy software compatibility characters.
Unicode also includes several less common numerals: Roman numerals, counting rod numerals, Cuneiform numerals and ancient Greek numerals. Numerals invariably involve composition of glyphs as a limited number of characters are composed to make other numerals. For example the sequence 9 - 9 - 0 in Arabic-Indic numerals composes the numeral for nine hundred and ninety (990). In Roman numerals, the same number is expressed by the composed numeral Ⅹↀ or ⅩⅯ. Each of these is a distinct numeral for representing the same abstract number. The semantics of the numerals differ in particular in their composition. The Arabic-Indic decimal digits are positional-value compositions, while the Roman numerals are sign-value and they are additive and subtractive depending on their composition.
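The contrast between positional-value and sign-value composition can be sketched in Python. This is only an illustration: the helper name `sign_value` is hypothetical, and its subtractive rule (a smaller value standing before a larger one is subtracted) is a simplifying assumption rather than anything Unicode itself defines. The numeric values come from the standard `unicodedata` module.

```python
import unicodedata

# Positional-value composition: Python's int() accepts any Unicode
# decimal digits, so the Arabic-Indic sequence 9-9-0 parses directly.
arabic_indic = "\u0669\u0669\u0660"  # ARABIC-INDIC DIGIT NINE, NINE, ZERO
positional = int(arabic_indic)

def sign_value(numeral: str) -> int:
    """Evaluate a Roman-style sign-value numeral (hypothetical helper).

    Each character contributes its own Numeric_Value; a value that
    precedes a larger one is subtracted, otherwise it is added.
    """
    values = [int(unicodedata.numeric(ch)) for ch in numeral]
    total = 0
    for i, value in enumerate(values):
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total

# ROMAN NUMERAL TEN followed by ROMAN NUMERAL ONE THOUSAND
roman = "\u2169\u216F"
print(positional, sign_value(roman))
```

Both calls evaluate to the same abstract number even though the two character sequences compose it in entirely different ways.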
Punctuation and diacritics
Unicode includes several blocks for unified diacritics and other combining marks and also blocks for unified punctuation. However, when a mark or punctuation character is intended primarily for use within a particular script, the character is assigned to that particular script’s blocks. Therefore authors will find these types of characters throughout the Unicode character database. Unicode categorizes them as:
Punctuation:
- connector (Pc)
- dash (Pd)
- open (Ps)
- close (Pe)
- initial quote (Pi)
- final quote (Pf)
- other (Po)
Marks:
- non-spacing (Mn)
- spacing-combining (Mc)
- enclosing (Me)
Symbols
Unicode has dozens of blocks dedicated to symbols that are useful regardless of one’s writing system. Other script-specific symbols are often included within a particular script’s blocks. Symbols are categorized as:
- math (Sm)
- currency (Sc)
- modifier (Sk)
- other (So)
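The two-letter codes used for punctuation, marks and symbols are values of the General_Category property, which Python exposes through `unicodedata.category`. The sketch below looks up one sample character per category; the choice of sample characters is ours and purely illustrative.

```python
import unicodedata

# One illustrative character per punctuation, mark and symbol category.
SAMPLES = {
    "_": "Pc", "-": "Pd", "(": "Ps", ")": "Pe",
    "\u00ab": "Pi", "\u00bb": "Pf", "!": "Po",
    "\u0301": "Mn",  # COMBINING ACUTE ACCENT
    "\u0903": "Mc",  # DEVANAGARI SIGN VISARGA
    "\u20dd": "Me",  # COMBINING ENCLOSING CIRCLE
    "+": "Sm", "$": "Sc", "^": "Sk", "\u00a9": "So",
}

for ch, expected in SAMPLES.items():
    actual = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {actual}")
```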
Music notation
Unicode devotes a block of 256 characters to musical symbols. Since Unicode focuses on characters laid out in two dimensions, these characters do not encode pitch or other parts of Western music expressed in the vertical dimension. Therefore the music symbols are more suited for discussions of music symbols themselves or to discuss rhythm within the prose of a document. To encode more complex musical information some other data format is necessary, such as MusicXML or MIDI.
Compatibility characters -
In discussing Unicode and the UCS, many often refer to compatibility characters. Compatibility characters are graphical characters whose use is discouraged by the Unicode Consortium. The Unicode Consortium’s glossary defines a compatibility character as:
"A character that would not have been encoded except for compatibility and round-trip convertibility with other standards"
However, the definition is more complicated than the glossary reveals. One of the properties given to characters by the Unicode Consortium is the character’s decomposition or compatibility decomposition. Most characters have no value for this property, but over 5 thousand characters do have a compatibility decomposition mapping that compatibility character to one or more other characters. By setting a character’s decomposition property, Unicode establishes that character as a compatibility character. The reasons for these compatibility designations are varied and are discussed in further detail below. The term decomposition can sometimes confuse because a character’s decomposition can, in some cases, be a singleton. In these cases the decomposition of one character is simply another equivalent or approximately equivalent character.
Canonical and Non-canonical
The compatibility decomposition property for the 5,402 Unicode compatibility characters includes a keyword that divides the compatibility characters into 17 logical groups. Those without a keyword are termed canonical equivalent or canonical decomposable characters. These characters have the closest relationship. The keywords are: <initial>, <medial>, <final>, <isolated>, <wide>, <narrow>, <small>, <square>, <vertical>, <circle>, <noBreak>, <fraction>, <sub>, <super>, <font>, and <compat>. These keywords provide some indication of the relation between the compatibility character and its compatibility decomposition character sequence.
However, the compatibility characters, whether canonical or not, fall in three basic categories:
1) characters corresponding to multiple alternate glyph forms and precomposed diacritics to support software and font implementations that do not include complete Unicode text layout capabilities;
2) characters included from other character sets or otherwise added to the UCS that constitute rich text rather than the plain text goals of Unicode;
3) some other characters that are semantically distinct, but visually similar.
Because these semantically distinct characters may be displayed with glyphs similar to the glyphs of other characters, text processing software should try to address possible confusion for the sake of end users. When comparing and collating (sorting) text strings, different forms and rich text variants of characters should not alter the text processing results. For example, software users may be confused when they search a page for a capital Latin letter ‘I’ and their software application fails to find the visually similar Roman numeral ‘Ⅰ’.
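The decomposition property and its keyword can be inspected with Python's `unicodedata.decomposition`, which returns the raw mapping as a string of hexadecimal code points with an optional leading keyword. The helper `split_decomposition` below is a hypothetical name introduced only for this sketch.

```python
import unicodedata

def split_decomposition(ch: str):
    """Return (keyword, decomposed string) for one character.

    keyword is None for a canonical mapping or when the character has
    no decomposition at all (the decomposed string is then empty).
    """
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None, ""
    parts = raw.split()
    keyword = parts.pop(0) if parts[0].startswith("<") else None
    return keyword, "".join(chr(int(p, 16)) for p in parts)

for ch in ["A", "\u00e9", "\u212b", "\u2160", "\u00b2", "\ufb01"]:
    print(f"U+{ord(ch):04X}", split_decomposition(ch))
```

The Roman numeral one (U+2160) carries the `<compat>` keyword and maps to the Latin letter I, while the Angstrom sign (U+212B) is a canonical singleton: its decomposition is a single other character with no keyword.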
Compatibility Blocks
Several blocks of Unicode characters consist entirely or almost entirely of compatibility characters. These compatibility blocks contain none of the semantically distinct compatibility characters and so they fall unambiguously into the set of discouraged characters. Unicode recommends authors use the plain text compatibility decomposition equivalents instead and complement those characters with rich text markup. This approach is much more flexible and open-ended than using the finite set of circled or enclosed alphanumerics, to give just one example. Unfortunately, there are a small number of characters even within the compatibility blocks that themselves are not compatibility characters and therefore may confuse authors. The “Enclosed CJK Letters and Months” block contains a single non-compatibility character: the ‘Korean Standard Symbol’ (㉿ U+327F). This symbol and 12 other characters have been included in these blocks for no known reasons. The “CJK Compatibility Ideographs” block contains these non-compatibility unified Han ideographs:
- (U+FA0E): 﨎
- (U+FA0F): 﨏
- (U+FA11): 﨑
- (U+FA13): 﨓
- (U+FA14): 﨔
- (U+FA1F): 﨟
- (U+FA21): 﨡
- (U+FA23): 﨣
- (U+FA24): 﨤
- (U+FA27): 﨧
- (U+FA28): 﨨
- (U+FA29): 﨩
These thirteen characters are not compatibility characters, nor is their use discouraged in any way. Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support:
Alphabetic Presentation Forms (1)
- Hebrew Point Judeo-Spanish Varika (U+FB1E): ﬞ. This is a glyph variant of Hebrew Point Rafe (U+05BF): ֿ , though Unicode provides no compatibility mapping.
Arabic Presentation Forms (4)
- “Ornate Left Parenthesis” (U+FD3E): ﴾. A glyph variant for U+0029 ‘)’
- “Ornate Right Parenthesis” (U+FD3F): ﴿. A glyph variant for U+0028 ‘ (’
- “Ligature Bismillah Ar-Rahman Ar-Raheem” (U+FDFD): ﷽. Bismillah Ar-Rahman Ar-Raheem is a ligature for Teh Marbuta (U+0629), Lam (U+0644), Meem (U+0645), Seen (U+0633), Beh (U+0628), (بسملة)
- “Arabic Tail Fragment” (U+FE73): ﹳ for supporting text systems without contextual glyph handling
CJK Compatibility Forms (2 that are both related to CJK Unified Ideograph: U+4E36 丶)
- Sesame Dot (U+FE45): ﹅
- White Sesame Dot (U+FE46): ﹆
Enclosed Alphanumerics (21 rich text variants)
- 11 Negative Circled Numbers (0 and 11 through 20) (U+24FF and U+24EB through U+24F4): ⓿ and ⓫ – ⓴
- 10 Double Circled Numbers (1 through 10) (U+24F5 through U+24FE): ⓵ – ⓾
Compatibility characters and normalization -
Normalization is the process by which Unicode conforming software first performs compatibility decomposition before making comparisons or collating text strings. This is similar to other operations needed when, for example, a user performs a case-insensitive or diacritic-insensitive search within some text. In such cases software must equate or ignore characters it would not otherwise equate or ignore. Typically normalization is performed without altering the underlying stored text data (lossless). However, some software may make permanent changes to text that eliminate the canonical or even non-canonical compatibility character differences from text storage (lossy).
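A minimal sketch of lossless, comparison-time normalization in Python follows. It uses the standard `unicodedata.normalize` function with the NFKC form (compatibility decomposition followed by canonical composition); the helper name `compat_key` and the sample string are our own illustrative choices.

```python
import unicodedata

def compat_key(text: str) -> str:
    """Build a comparison key; the stored text itself is never modified."""
    return unicodedata.normalize("NFKC", text)

stored = "Chapter \u2160"   # ends with ROMAN NUMERAL ONE, not Latin 'I'

naive_hit = "I" in stored                   # raw code point comparison
normalized_hit = "I" in compat_key(stored)  # comparison after normalization

print(naive_hit, normalized_hit)
print(stored == "Chapter \u2160")  # original data is still intact
```

Writing `compat_key(stored)` back to storage instead of using it only as a key would be the lossy variant described above, since the distinction between the numeral and the letter could no longer be recovered.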
Non-graphical characters -
Many characters are used to control the interpretation or display of text, but these characters themselves have no visual or spatial representation. For example, the null character (U+0000) is used in C programming environments to indicate the end of a string of characters. In this way, these programs only require a single starting memory address for a string. The string ends once the program reads the null character.
Legacy control characters
The legacy control characters come from the ASCII and ISO 8859-1 character sets and are sometimes referred to as C0 and C1 respectively. Many of these characters play no explicit role in Unicode text handling, though they are still used in mainframe computing environments. Others, like the null character and many whitespace characters, are still used commonly in text processing. Other common control characters are tabulation or tab (U+0009), linefeed (U+000A), carriage return (U+000D) and next line (U+0085). These are included among whitespace characters because, though they have no visual glyph, they do insert vertical or horizontal spacing between the display of characters.
Unicode introduced separators
In an attempt to simplify the several new line characters used in legacy text, the UCS introduces its own new line characters to separate either lines or paragraphs: the line separator (U+2028) and paragraph separator (U+2029) characters.
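As one illustration of how software may treat the legacy and Unicode-introduced separators alike, Python's `str.splitlines` recognizes the legacy line endings as well as U+0085, U+2028 and U+2029. Other languages and libraries may draw the line boundary set differently, so this is a sketch of one implementation's behavior rather than a general rule.

```python
# Mixed legacy and Unicode-introduced line boundaries in one string.
text = "first\nsecond\r\nthird\u0085fourth\u2028fifth\u2029sixth"

lines = text.splitlines()
print(lines)
```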
Language tags
Unicode includes 128 characters as language tags. The characters essentially mirror the 128 ASCII characters except that, when used, they identify the subsequent text as belonging to a particular language according to BCP 47. For example, to mark subsequent text as the variant of English written in the United States, an author uses the initiating ‘Language Tag’ character (U+E0001) followed by the sequence ‘Tag Small Letter e’ (U+E0065), ‘Tag Small Letter n’ (U+E006E), ‘Tag Hyphen-minus’ (U+E002D), ‘Tag Small Letter u’ (U+E0075) and ‘Tag Small Letter s’ (U+E0073).
These language tag characters would not be displayed themselves. However, they would provide information for text processing or even for the display of other characters. For example, the display of Unihan ideographs might substitute different glyphs if the language tags indicated Korean than if the tags indicated Japanese. Similarly, the display of the decimal digits 0 through 9 might differ depending on the language in which they appear.
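Because the tag characters mirror ASCII at a fixed offset, the sequence described above can be generated mechanically. In this sketch the function name `to_language_tag` is hypothetical; the offset U+E0000 and the initiating character U+E0001 follow from the code points listed in the text.

```python
TAG_OFFSET = 0xE0000          # tag characters mirror ASCII at this offset
LANGUAGE_TAG = "\U000E0001"   # the initiating Language Tag character

def to_language_tag(code: str) -> str:
    """Map an ASCII language code such as 'en-us' to tag characters."""
    if not code.isascii():
        raise ValueError("language codes must be ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in code)

tag = to_language_tag("en-us")
print(" ".join(f"U+{ord(c):05X}" for c in tag))
```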
Interlinear annotation
Three formatting characters provide support for interlinear annotation (U+FFF9, U+FFFA, U+FFFB). This may be used for providing notes that would typically be displayed between the lines of other text. Unicode considers such annotation to be rich text and recommends using other protocols for such annotation. The W3C ruby markup recommendation is an example of an alternate protocol supporting more advanced interlinear annotation.
Bidirectional text control
Unicode supports standard bidirectional text without any special characters. In other words, Unicode conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسملة”) right alongside English and the Arabic letters will flow from right-to-left and the Latin letters left-to-right. However, support for bidirectional text becomes more complicated when text flowing in opposite directions is embedded hierarchically, for example if one quotes an Arabic phrase that in turn quotes an English phrase. Other situations may complicate this further, for example when an author wants left-to-right characters overridden so that they flow from right-to-left. While these situations are fairly rare, Unicode provides seven characters (U+200E, U+200F, U+202A, U+202B, U+202C, U+202D, U+202E) to help control these embedded bidirectional text levels up to 61 levels deep.
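The directional property that drives this behavior is the character's bidirectional class, available in Python as `unicodedata.bidirectional`. The sketch below prints the class of a few ordinary letters and of the seven control characters listed above; it only inspects properties and does not implement the bidirectional layout algorithm itself.

```python
import unicodedata

letters = {"A": "Latin", "\u05d0": "Hebrew", "\u0628": "Arabic"}
controls = [0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E]

for ch, script in letters.items():
    print(f"{script} letter U+{ord(ch):04X}: {unicodedata.bidirectional(ch)}")

# Bidirectional class of each explicit control character.
control_classes = {cp: unicodedata.bidirectional(chr(cp)) for cp in controls}
for cp, cls in control_classes.items():
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {cls}")
```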
Variation Selectors
Many characters map to alternate glyphs depending on the context. For example, Arabic and Latin cursive characters substitute different glyphs to connect glyphs together depending on whether the character is the initial character in a word, the final character, a medial character or an isolated character. These types of glyph substitution are easily handled by the context of the character with no other authoring input involved. Authors may also use special-purpose characters such as joiners and non-joiners to force an alternate form of glyph where it would not otherwise appear. Ligatures are similar instances where glyphs may be substituted simply by turning ligatures on or off as a rich text attribute.
However, for other glyph substitutions, the author’s intent may need to be encoded with the text and cannot be determined contextually. This is the case with the characters/glyphs referred to as gaiji, where different glyphs are used for the same character either historically or in ideographs for family names. This is one of the gray areas in distinguishing between a glyph and a character. If a family name differs slightly from the ideograph character it derives from, then is that a simple glyph variant or a character variant? As of Unicode 3.2 and 4.0, the character set includes 256 variation selectors so that these combining mark characters can select from 256 possible character/glyph variations for the preceding character. Unicode does not as yet provide any registry for these variations, so the issue of interoperable variation registration is left to other parties.
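The count of 256 can be checked against the character database. The two code point ranges used below (U+FE00 to U+FE0F in the BMP and U+E0100 to U+E01EF in Plane 14) are taken from the Unicode code charts rather than from the text above, so treat them as an added assumption of this sketch.

```python
import unicodedata

# Assumed ranges: 16 selectors in the BMP plus 240 in Plane 14.
VS_RANGES = [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]

selectors = [chr(cp) for lo, hi in VS_RANGES for cp in range(lo, hi + 1)]
print(len(selectors))
print(unicodedata.name(selectors[0]), "...", unicodedata.name(selectors[-1]))

def is_variation_selector(ch: str) -> bool:
    """True if ch falls in one of the assumed variation selector ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in VS_RANGES)
```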
Other Special-purpose characters
Several characters fall between the non-graphical control and formatting characters and full-fledged graphical characters.
Joiners and Non-joiners
Word Joiner (U+2060), Zero-width joiner (U+200D), Zero-width non-joiner (U+200C), Zero-width space (U+200B), Combining Grapheme Joiner (U+034F).
Invisible Separator
Primarily for mathematics, the Invisible Separator (U+2063) provides a separator between characters where punctuation or space may be omitted, such as in a two-dimensional index like ij.
Invisible Times and Function Application
Invisible Times (U+2062) and Function Application (U+2061) are useful in mathematics text where the multiplication of terms or the application of a function is implied without any glyph indicating the operation.
Spaces
The space character (U+0020), typically input by the space bar on a keyboard, serves semantically as a word separator in many languages. For legacy reasons, the UCS also includes spaces of varying sizes that are compatibility equivalents for the space character. These spaces include:
- Space (U+0020)
- En Quad (U+2000)
- Em Quad (U+2001)
- En Space (U+2002)
- Em Space (U+2003)
- Three-Per-Em Space (U+2004)
- Four-Per-Em Space (U+2005)
- Six-Per-Em Space (U+2006)
- Figure Space (U+2007)
- Punctuation Space (U+2008)
- Thin Space (U+2009)
- Hair Space (U+200A)
- Mathematical Space (U+205F)
Aside from the original ASCII space, the other spaces are all compatibility characters. In this context this means that they effectively add no semantic content to the text, but instead provide styling control. Within Unicode, this non-semantic styling control is often referred to as rich text and is outside the thrust of Unicode’s goals. Rather than using different spaces in different contexts, this styling could instead be handled through intelligent text layout software.
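That the sized spaces are compatibility equivalents of the ordinary space can be observed with Python's `unicodedata` module: compatibility normalization (NFKC) folds each of them to U+0020. The list below simply restates the code points from the list above.

```python
import unicodedata

# En Quad through Hair Space, plus the mathematical space at U+205F.
STYLING_SPACES = [chr(cp) for cp in range(0x2000, 0x200B)] + ["\u205f"]

for sp in STYLING_SPACES:
    mapping = unicodedata.decomposition(sp)
    print(f"U+{ord(sp):04X} {unicodedata.name(sp)}: {mapping}")

# Every styling space collapses to the plain space under NFKC.
folded = {unicodedata.normalize("NFKC", sp) for sp in STYLING_SPACES}
print(folded == {" "})
```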
Line-break control characters
Several characters are designed to help control line-breaks either by discouraging them (no-break characters) or by suggesting them, such as the soft or shy hyphen (U+00AD). Such characters, though designed for styling, are probably indispensable for the intricate types of line-breaking they make possible.
- Shy Hyphen (U+00AD)
- Non-breaking Hyphen (U+2011)
- No-break Space (U+00A0)
- Narrow No-break Space (U+202F)
- Zero-width space (U+200B)
Whitespace characters
Whitespace characters are not a separate group of characters; instead, Unicode provides a list of characters it deems whitespace characters for interoperability support. Software implementations and other standards may use the term to denote a slightly different set of characters. Whitespace characters are typically designated for programming environments, where they often have no syntactic meaning and are ignored by the machine interpreters. Unicode designates the legacy control characters U+0009 through U+000D and U+0085 as whitespace characters, as well as the Unicode-introduced line separator and paragraph separator. The core space character (U+0020) is also designated as a whitespace character, as are the styling spaces listed under Spaces; the zero-width space (U+200B) is not.
Private use characters The UCS includes over 100,000 code points for private use. This means these code points can be assigned characters with specific properties by individuals, organizations and software vendors outside the ISO and Unicode Consortium. A Private Use Area (PUA) is one of several ranges which are reserved for private use. For this range, the Unicode standard does not specify any characters. The Basic Multilingual Plane includes a PUA in the range from U+E000 to U+F8FF (57344–63743). Plane Fifteen (U+F0000 to U+FFFFD), and Plane Sixteen (U+100000 to U+10FFFD) are completely reserved for private use as well. The use of the PUA was a concept inherited from certain Asian encoding systems. These systems had private use areas to encode Japanese Gaiji (rare personal name characters) in application-specific ways. Similarly the ConScript Unicode Registry aims to coordinate the mapping of scripts not yet encoded in or rejected by Unicode in the PUAs. The Medieval Unicode Font Initiative uses the PUA to encode various ligatures, precomposed characters, and symbols found in medieval texts. Japanese writing Kanji Kana Hiragana Katakana Hentaigana ManyÅgana Uses Furigana Okurigana RÅmaji ) are the Chinese characters that are used in the modern Japanese logographic writing system along with hiragana (平仮å), katakana (çä»®å), and the Arabic numerals. ...
The ConScript Unicode Registry is a volunteer project to coordinate the assignment of code points in the Unicode Private Use Area for the encoding of artificial scripts. ...
In digital typography, the Medieval Unicode Font Initiative (MUFI) is a project which aims to coördinate the encoding and display of special characters in Medieval texts written in the Latin alphabet, which are not encoded as part of Unicode. ...
Precomposed character is a Unicode entity that can be decomposed into a canonically equivalent string of several other characters. ...
One example of usage of the Private Use Area is Apple's usage of U+F8FF for the Apple logo. Apple Inc. ...
Unicode code point U+F8FF is the last character in the Unicode private use area. ...
This article, Fonts on the Mac, describes current and historical practices regarding the Apple Macintoshâs approach to typefaces, including font management and fonts included with each system revision. ...
In Microsoft Windows, these characters can be created using Private Character Editor, a limited font editor that comes with Windows.
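As a rough illustration of the private use ranges listed above, the following Python sketch tests whether a code point falls in one of them and counts how many such code points there are. The names `PRIVATE_USE_RANGES`, `is_private_use` and `private_use_count` are invented for this example and are not part of any standard API.

```python
# Private use ranges as given in the text (inclusive bounds).
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Plane Fifteen
    (0x100000, 0x10FFFD),  # Plane Sixteen
]


def is_private_use(code_point: int) -> bool:
    """Return True if code_point lies in one of the private use ranges."""
    return any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES)


def private_use_count() -> int:
    """Total number of code points covered by the private use ranges."""
    return sum(high - low + 1 for low, high in PRIVATE_USE_RANGES)
```

For example, `is_private_use(0xF8FF)` is true for the code point Apple uses for its logo, and `private_use_count()` gives 6,400 + 65,534 + 65,534 = 137,468, which agrees with the Private Use Areas row of the summary table.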
Special code points

At the simplest level, each character in the UCS represents a code point and a particular semantic function. For graphical characters, the semantic function is often implied by the character's name and by the script or block in which it is included. A graphical character may also have a recommended glyph that helps define its meaning. Ideographs for the languages of China, Japan, Korea and Vietnam carry many other rich properties that help define a character's semantic role. However, the UCS and Unicode designate other code points for other purposes. Those code points may have few or no character properties associated with them.
Surrogates

The 2,048 surrogates are not characters, but are reserved for use in UTF-16 to specify code points outside the Basic Multilingual Plane. They are divided into "high surrogates" (D800–DBFF) and "low surrogates" (DC00–DFFF). In UTF-16, they must always appear in pairs, as a high surrogate followed by a low surrogate, thus using 32 bits to denote one code point.
A surrogate pair denotes the code point

$$10000_{16} + (H - \mathrm{D800}_{16}) \times 400_{16} + (L - \mathrm{DC00}_{16})$$
where H and L are the numeric values of the high and low surrogates respectively. Since high surrogate values in the range DB80 to DBFF always produce values in the Private Use planes, the high surrogate range can be further divided into (normal) high surrogates (D800–DB7F) and "high private use surrogates" (DB80–DBFF).
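The formula above can be sketched in Python as follows. The function names `decode_surrogate_pair` and `encode_surrogate_pair` are hypothetical, chosen for this illustration; the inverse mapping is derived from the same formula rather than taken from the text.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into the code point they denote."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate must be in D800..DBFF")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate must be in DC00..DFFF")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


def encode_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in 10000..10FFFF")
    offset = code_point - 0x10000
    # The upper 10 bits select the high surrogate, the lower 10 the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

The smallest "high private use surrogate", DB80, paired with DC00 decodes to U+F0000, the first code point of Plane Fifteen, which is why that sub-range always lands in the private use planes.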
Noncharacters

Unicode reserves several code points as noncharacters. These code points are guaranteed never to have a character assigned to them, so software implementations are free to use them internally. However, noncharacters should never be included in text interchanged between implementations. One inherently useful example of a noncharacter is the code point U+FFFE, whose two bytes are those of the byte order mark (U+FEFF) in reverse order. If a stream of text contains this noncharacter, this is a good indication that the text has been interpreted with the incorrect endianness.
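A minimal Python sketch of the noncharacter rules, assuming the 66 noncharacters are U+FDD0 to U+FDEF plus the last two code points of each of the 17 planes (this matches the counts of 32, 12 and 22 given in the summary table). The helper names `is_noncharacter` and `first_code_unit` are made up for this example.

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True for the 66 code points reserved as noncharacters."""
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # Last two code points of every plane: xxFFFE and xxFFFF.
    return 0 <= code_point <= 0x10FFFF and (code_point & 0xFFFE) == 0xFFFE


def first_code_unit(data: bytes, byteorder: str) -> int:
    """Read the first 16-bit code unit of data using the given byte order."""
    return int.from_bytes(data[:2], byteorder)


# A UTF-16 stream that starts with a little-endian byte order mark.
stream = b"\xff\xfe" + "A".encode("utf-16-le")
```

Reading `stream` with `first_code_unit(stream, "little")` yields 0xFEFF, the byte order mark, while reading it with `"big"` yields the noncharacter 0xFFFE, which signals that the wrong endianness was assumed.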
Summary table of UCS character assignments

Description of Table Columns and Rows

The following table lists all of the blocks assigned characters as of April 2007 (Unicode 5.0). Blocks are grouped according to their function.
- The first column lists the name of the group.

Working backwards from the right:
- The last four columns indicate the boundaries of the block, both its starting code point and its ending code point, in both hexadecimal and decimal notation.
- The prior column (labeled Seq, for sequence) indicates the position of the block in code point order. This sequence is just an ordering based on the current block assignments. As new blocks are assigned or split off from the existing unassigned blocks, the sequence numbers will change (though the order will remain the same).
- Unallocd indicates the number of unallocated code points represented by a potential block.
- Allocd indicates the number of code points allocated to the block, whether actually provisioned or reserved for potential future use of the block.
- Excl and Incl: Some blocks contain unrelated characters best treated within other categories. In this case all of the block's code points are tallied in one place as allocated and reserved; the unrelated characters are subtracted (Excl) from the present row and added (Incl) to another.
- Resrvd indicates the number of code points allocated to the block for related characters but not yet assigned.
- Provd indicates the characters provisioned in the block, that is, those actually assigned characters (Allocd − Excl + Incl − Resrvd = Provd).
- Compat indicates the number of characters in the block considered compatibility characters. The issue of compatibility characters is complicated, but they generally represent characters included for compatibility with legacy text processing systems or legacy character sets. Unicode’s separation of glyph from character implies that far fewer characters are required for text processing. The various variant glyphs are instead stored as font data, rather than as text data (see Unicode compatibility character section). Compatibility characters are typically ligatures such as ffi or precomposed diacritic letters such as å.
- Core indicates the number of provisioned characters in the block less the discouraged compatibility characters and deprecated (strongly discouraged) characters.
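The column arithmetic described in the list above can be checked with a short Python sketch. The function names `provisioned` and `core` are hypothetical; the sample values are taken from rows of the table, and treating Core as Provd minus Compat is an assumption that happens to hold for the rows shown here.

```python
def provisioned(allocd: int, excl: int = 0, incl: int = 0, resrvd: int = 0) -> int:
    """Provd = Allocd - Excl + Incl - Resrvd, as defined in the column list."""
    return allocd - excl + incl - resrvd


def core(provd: int, compat: int = 0) -> int:
    """Core = Provd - Compat (assumes deprecated characters are counted in Compat)."""
    return provd - compat


# Sample rows from the table: (name, Allocd, Excl, Incl, Resrvd, Compat).
rows = [
    ("Greek and Coptic", 144, 14, 0, 17, 34),
    ("Coptic, Greek and", 0, 0, 14, 0, 0),
    ("Latin, Basic", 128, 33, 0, 0, 0),
    ("Arabic", 256, 0, 0, 21, 12),
]

results = {
    name: (provisioned(allocd, excl, incl, resrvd),
           core(provisioned(allocd, excl, incl, resrvd), compat))
    for name, allocd, excl, incl, resrvd, compat in rows
}
```

Each entry of `results` holds the (Provd, Core) pair for a row. The Greek and Coptic block gives (113, 79), and the 14 Coptic letters excluded from it reappear as included characters in the Coptic group.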
Though the table names unallocated blocks, those blocks could potentially be allocated for any purpose. For example, unused code point blocks within the general area of the BMP dedicated to Unihan ideographs could instead be allocated to modern scripts. The names merely indicate the general region of the plane in which the blocks are situated.
Totals | Allocd | Unallocd | Nonchars | Potential Code Points | | | | | | Grand Totals | 241,498 + | 872,548 | + 66 | = 1,114,112 | | | Code Points in Unallocated Planes | | | 720,874 | | | Non-Characters (Unallocated Planes) | | | | 22 | Unicode designates 32 other non-characters in the Arabic Presentation Forms-A block from U+FDD0 to U+FDEF for a total of 66 noncharacters designated so far. | | Non-Characters (Planes in Current Use) | | | | 12 | | Non-Characters in BMP Arabic Presentation Forms-A block | | -32 | | 32 | | | | | Private Use Allocation | | 137,468 | | | | | | | Surrogates | | 2,048 | | | | | 93,978 | 5,349 | 99,327 | 2,684 | 155 | 155 | 102,014 | 151,674 | | | | | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | | | | | | Modern Scripts | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | A | Modern Scripts | 17,722 | 999 | 18,721 | 1,454 | 14 | 79 | 20,240 | 944 | (sequences 1-4, 8-55, 99-100, 102-103, 105, 111-112) | | 1 | Arabic Blocks | 253 | 12 | 265 | 39 | | | 304 | 0 | (sequences 13 & 15) | | 1.1 | Arabic | 223 | 12 | 235 | 21 | | | 256 | | 13 | 0600 | 06FF | 1536 | 1791 | | 1.2 | Arabic Supplement | 30 | | 30 | 18 | | | 48 | | 15 | 0750 | 077F | 1872 | 1919 | | 2 | Armenian | 85 | 1 | 86 | 10 | | | 96 | | 11 | 0530 | 058F | 1328 | 1423 | | 3 | Balinese | 110 | 11 | 121 | 7 | | | 128 | | 54 | 1B00 | 1B7F | 6912 | 7039 | | 4 | Bengali | 87 | 5 | 92 | 36 | | | 128 | | 20 | 0980 | 09FF | 2432 | 2559 | | 5 | Bopomofo Blocks | 64 | 0 | 64 | 16 | | | 80 | | (sequences 100, 103) | | 5.1 | Bopomofo | 40 | | 40 | 8 | | | 48 | | 100 | 3100 | 312F | 12544 | 12591 | | 5.2 | Bopomofo Extended | 24 | | 24 | 8 | | | 32 | | 103 | 31A0 | 31BF | 12704 | 12735 | | 6 | Buginese | 30 | | 30 | 2 | | | 32 | | 52 | 1A00 | 1A1F | 6656 | 6687 | | 7 | Buhid | 20 | | 20 | 12 | | | 32 | | 43 | 1740 | 175F | 5952 | 5983 | | 8 | Cherokee | 85 | | 85 | 11 | | | 96 
| | 37 | 13A0 | 13FF | 5024 | 5119 | | 9 | Coptic Blocks | 128 | 0 | 128 | 14 | 14 | | 128 | | (sequences 8, 87) | | 9.1 | Coptic | 114 | | 114 | 14 | | | 128 | | 87 | 2C80 | 2CFF | 11392 | 11519 | | 9.2 | Coptic, Greek and | 14 | | 14 | 0 | 14 | | 0 | | 8 | 0370 | 03FF | 880 | 1023 | | 10 | Cyrillic Blocks | 223 | 52 | 275 | 29 | | | 304 | | (sequences 9-10) | | 10.1 | Cyrillic | 203 | 52 | 255 | 1 | | | 256 | | 9 | 0400 | 04FF | 1024 | 1279 | | 10.2 | Cyrillic Supplement | 20 | | 20 | 28 | | | 48 | | 10 | 0500 | 052F | 1280 | 1327 | | 11 | Devanagari | 99 | 11 | 110 | 18 | | | 128 | | 19 | 0900 | 097F | 2304 | 2431 | | 12 | Ethiopic Blocks | 461 | 0 | 461 | 51 | | | 512 | | (sequences 33, 88) | | 13.1 | Ethiopic | 356 | | 356 | 28 | | | 384 | | 35 | 1200 | 137F | 4608 | 4991 | | 13.2 | Ethiopic Extended | 79 | | 79 | 17 | | | 96 | | 90 | 2D80 | 2DDF | 11648 | 11743 | | 13.3 | Ethiopic Supplement | 26 | | 26 | 6 | | | 32 | | 36 | 1380 | 139F | 4992 | 5023 | | 14 | Georgian Blocks | 121 | 0 | 121 | 23 | | | 144 | | | | | | | | 14.1 | Georgian | 83 | | 83 | 13 | | | 96 | | 33 | 10A0 | 10FF | 4256 | 4351 | | 14.2 | Georgian Supplement | 38 | | 38 | 10 | | | 48 | | 88 | 2D00 | 2D2F | 11520 | 11567 | | 15 | Glagolitic | 94 | | 94 | 2 | | | 96 | | 85 | 2C00 | 2C5F | 11264 | 11359 | | 16 | Greek Blocks | 79 | 267 | 346 | 40 | | 14 | 400 | | (sequences 33, 88) | | 16.1 | Greek and Coptic | 79 | 34 | 113 | 17 | | 14 | 144 | | 8 | 0370 | 03FF | 880 | 1023 | | 16.2 | Greek Extended | 0 | 233 | 233 | 23 | | | 256 | | 60 | 1F00 | 1FFF | 7936 | 8191 | | 17 | Gujarati | 84 | | 84 | 44 | | | 128 | | 22 | 0A80 | 0AFF | 2688 | 2815 | | 18 | Gurmukhi | 72 | 6 | 78 | 50 | | | 128 | | 21 | 0A00 | 0A7F | 2560 | 2687 | | 19 | Hangul Blocks | 11,412 | 0 | 11,412 | 28 | | | 11,440 | | (sequences 34, 120) | | 19.1 | Hangul Jamo | 240 | | 240 | 16 | | | 256 | | 34 | 1100 | 11FF | 4352 | 4607 | | 19.2 | Hangul Syllables | 11,172 | | 11,172 | 12 | | | 11,184 | | 120 | AC00 | D7AF | 44032 | 
55215 | | 20 | Hanunoo | 23 | | 23 | 9 | | | 32 | | 42 | 1720 | 173F | 5920 | 5951 | | 21 | Hebrew | 87 | | 87 | 25 | | | 112 | | 12 | 0590 | 05FF | 1424 | 1535 | | 22 | Japanese Blocks | 159 | 62 | 221 | 3 | | | 224 | | (sequences 98-99, 102, 105) | | 22.1 | Katakana | 64 | 32 | 96 | 0 | | | 96 | | 99 | 30A0 | 30FF | 12448 | 12543 | | 22.2 | Katakana Phonetic Extensions | 16 | | 16 | 0 | | | 16 | | 105 | 31F0 | 31FF | 12784 | 12799 | | 22.3 | Hiragana | 63 | 30 | 93 | 3 | | | 96 | | 98 | 3040 | 309F | 12352 | 12447 | | 22.4 | Kanbun | 16 | | 16 | 0 | | | 16 | | 102 | 3190 | 319F | 12688 | 12703 | | 23 | Kannada | 81 | 5 | 86 | 42 | | | 128 | | 26 | 0C80 | 0CFF | 3200 | 3327 | | 24 | Khmer Blocks | 144 | 2 | 146 | 14 | | | 160 | | (sequences 45, 51) | | 24.1 | Khmer | 112 | 2 | 114 | 14 | | | 128 | | 45 | 1780 | 17FF | 6016 | 6143 | | 24.2 | Khmer Symbols | 32 | | 32 | 0 | | | 32 | | 51 | 19E0 | 19FF | 6624 | 6655 | | 25 | Lao | 62 | 3 | 65 | 63 | | | 128 | | 30 | 0E80 | 0EFF | 3712 | 3839 | | 26 | Latin Blocks | 268 | 524 | 792 | 247 | | 65 | 1,104 | | (sequences 1-4, 59, 86, 115) | | 26.1 | Latin, Basic | 95 | | 95 | 0 | | 33 | 128 | | 1 | 0000 | 007F | 0 | 127 | | 26.2 | Latin-1 Supplement | 35 | 61 | 96 | 0 | | 32 | 128 | | 2 | 0080 | 00FF | 128 | 255 | | 26.3 | Latin Extended Additional | 0 | 246 | 246 | 10 | | | 256 | | 59 | 1E00 | 1EFF | 7680 | 7935 | | 26.4 | Latin Extended-A | 14 | 114 | 128 | 0 | | | 128 | | 3 | 0100 | 017F | 256 | 383 | | 26.5 | Latin Extended-B | 105 | 103 | 208 | 0 | | | 208 | | 4 | 0180 | 024F | 384 | 591 | | 26.6 | Latin Extended-C | 17 | | 17 | 15 | | | 32 | | 86 | 2C60 | 2C7F | 11360 | 11391 | | 26.7 | Latin Extended-D | 2 | | 2 | 222 | | | 224 | | 115 | A720 | A7FF | 42784 | 43007 | | 27 | Limbu | 66 | | 66 | 14 | | | 80 | | 48 | 1900 | 194F | 6400 | 6479 | | 28 | Malayalam | 75 | 3 | 78 | 50 | | | 128 | | 27 | 0D00 | 0D7F | 3328 | 3455 | | 29 | Mongolian | 155 | | 155 | 21 | | | 176 | | 46 | 1800 | 18AF | 6144 | 6319 | | 30 | 
Myanmar | 77 | 1 | 78 | 82 | | | 160 | | 32 | 1000 | 109F | 4096 | 4255 | | 31 | New Tai Lue | 80 | | 80 | 16 | | | 96 | | 50 | 1980 | 19DF | 6528 | 6623 | | 32 | NKo | 59 | | 59 | 5 | | | 64 | | 17 | 07C0 | 07FF | 1984 | 2047 | | 33 | Ogham | 29 | | 29 | 3 | | | 32 | | 39 | 1680 | 169F | 5760 | 5791 | | 34 | Oriya | 76 | 5 | 81 | 47 | | | 128 | | 23 | 0B00 | 0B7F | 2816 | 2943 | | 35 | Phags-pa | 56 | | 56 | 8 | | | 64 | | 118 | A840 | A87F | 43072 | 43135 | | 36 | Runic | 81 | | 81 | 15 | | | 96 | | 40 | 16A0 | 16FF | 5792 | 5887 | | 37 | Sinhala | 77 | 4 | 81 | 47 | | | 128 | | 28 | 0D80 | 0DFF | 3456 | 3583 | | 38 | Syloti Nagri | 44 | | 44 | 4 | | | 48 | | 116 | A800 | A82F | 43008 | 43055 | | 39 | Syriac | 77 | | 77 | 3 | | | 80 | | 14 | 0700 | 074F | 1792 | 1871 | | 40 | Tagalog | 20 | | 20 | 12 | | | 32 | | 41 | 1700 | 171F | 5888 | 5919 | | 41 | Tagbanwa | 18 | | 18 | 14 | | | 32 | | 44 | 1760 | 177F | 5984 | 6015 | | 42 | Tai Le | 35 | | 35 | 13 | | | 48 | | 49 | 1950 | 197F | 6480 | 6527 | | 43 | Tamil | 68 | 4 | 72 | 56 | | | 128 | | 24 | 0B80 | 0BFF | 2944 | 3071 | | 44 | Telugu | 81 | 1 | 82 | 46 | | | 128 | | 25 | 0C00 | 0C7F | 3072 | 3199 | | 45 | Thaana | 50 | | 50 | 14 | | | 64 | | 16 | 0780 | 07BF | 1920 | 1983 | | 46 | Thai | 86 | 1 | 87 | 41 | | | 128 | | 29 | 0E00 | 0E7F | 3584 | 3711 | | 47 | Tibetan | 176 | 19 | 195 | 61 | | | 256 | | 31 | 0F00 | 0FFF | 3840 | 4095 | | 48 | Tifinagh | 55 | | 55 | 25 | | | 80 | | 89 | 2D30 | 2D7F | 11568 | 11647 | | 49 | Unified Canadian Aboriginal Syllabics | 630 | | 630 | 10 | | | 640 | | 38 | 1400 | 167F | 5120 | 5759 | | 50 | Yi Blocks | 1,220 | 0 | 1,220 | 12 | | | 1,232 | 0 | (sequences 111 & 112) | | 50.1 | Yi Radicals | 55 | | 55 | 9 | | | 64 | | 112 | A490 | A4CF | 42128 | 42191 | | 50.2 | Yi Syllables | 1,165 | | 1,165 | 3 | | | 1,168 | | 111 | A000 | A48F | 40960 | 42127 | | 51 | Unallocated Script Blocks | | | | | | | | 944 | (sequences 18, 47, 53, 55) | | 51.1 | | | | | | | | | 256 | 18 | 0800 | 
08FF | 2048 | 2303 | | 51.2 | | | | | | | | | 80 | 47 | 18B0 | 18FF | 6320 | 6399 | | 51.3 | | | | | | | | | 224 | 53 | 1A20 | 1AFF | 6688 | 6911 | | 51.4 | | | | | | | | | 384 | 55 | 1B80 | 1CFF | 7040 | 7423 |
Ancient Scripts | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | B | Ancient Scripts | 1,783 | 0 | 1,783 | 313 | | | 2,096 | 51,152 | (sequences 138-162) | | 1 | Linear B Syllabary | 88 | | 88 | 40 | | | 128 | | 138 | 10000 | 1007F | 65536 | 65663 | | 2 | Linear B Ideograms | 123 | | 123 | 5 | | | 128 | | 139 | 10080 | 100FF | 65664 | 65791 | | 3 | Aegean Numbers | 57 | | 57 | 7 | | | 64 | | 140 | 10100 | 1013F | 65792 | 65855 | | 4 | Ancient Greek Numbers | 75 | | 75 | 5 | | | 80 | | 141 | 10140 | 1018F | 65856 | 65935 | | 5 | Old Italic | 35 | | 35 | 13 | | | 48 | | 143 | 10300 | 1032F | 66304 | 66351 | | 6 | Gothic | 27 | | 27 | 5 | | | 32 | | 144 | 10330 | 1034F | 66352 | 66383 | | 7 | Ugaritic | 31 | | 31 | 1 | | | 32 | | 146 | 10380 | 1039F | 66432 | 66463 | | 8 | Old Persian | 50 | | 50 | 14 | | | 64 | | 147 | 103A0 | 103DF | 66464 | 66527 | | 9 | Deseret | 80 | | 80 | 0 | | | 80 | | 149 | 10400 | 1044F | 66560 | 66639 | | 10 | Shavian | 48 | | 48 | 0 | | | 48 | | 150 | 10450 | 1047F | 66640 | 66687 | | 11 | Osmanya | 40 | | 40 | 8 | | | 48 | | 151 | 10480 | 104AF | 66688 | 66735 | | 12 | Cypriot Syllabary | 55 | | 55 | 9 | | | 64 | | 153 | 10800 | 1083F | 67584 | 67647 | | 13 | Phoenician | 27 | | 27 | 5 | | | 32 | | 155 | 10900 | 1091F | 67840 | 67871 | | 14 | Kharoshthi | 65 | | 65 | 31 | | | 96 | | 157 | 10A00 | 10A5F | 68096 | 68191 | | 15 | Cuneiform | 879 | | 879 | 145 | | | 1,024 | | 159 | 12000 | 123FF | 73728 | 74751 | | 16 | Cuneiform Numbers and Punctuation | 103 | | 103 | 25 | | | 128 | | 160 | 12400 | 1247F | 74752 | 74879 | | 17 | Ancient Script Unallocated | | | | | | | | 51,152 | (sequences 7, 58, 64, 131) | | 17.1 | | | | | | | | | 368 | 142 | 10190 | 102FF | 65936 | 66303 | | 17.2 | | | | | | | | | 48 | 145 | 10350 | 1037F | 66384 | 66431 | | 17.3 | | | | | | | | | 32 | 148 | 103E0 | 103FF | 66528 | 66559 | | 17.4 | | | | | | | | | 848 | 152 | 
104B0 | 107FF | 66736 | 67583 | | 17.5 | | | | | | | | | 192 | 154 | 10840 | 108FF | 67648 | 67839 | | 17.6 | | | | | | | | | 224 | 156 | 10920 | 109FF | 67872 | 68095 | | 17.7 | | | | | | | | | 5,536 | 158 | 10A60 | 11FFF | 68192 | 73727 | | 17.8 | | | | | | | | | 128 | 161 | 12480 | 124FF | 74880 | 75007 | | 17.9 | | | | | | | | | 43,776 | 162 | 12500 | 1CFFF | 75008 | 118783 |
Phonetics | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | C | Phonetic | 277 | 118 | 395 | 5 | | | 400 | 0 | (sequences 5-6, 56-57, 114) | | 1 | IPA Extensions | 96 | | 96 | 0 | | | 96 | | 5 | 0250 | 02AF | 592 | 687 | | 2 | Phonetic Extensions | 67 | 61 | 128 | 0 | | | 128 | | 56 | 1D00 | 1D7F | 7424 | 7551 | | 3 | Phonetic Extensions Supplement | 27 | 37 | 64 | 0 | | | 64 | | 57 | 1D80 | 1DBF | 7552 | 7615 | | 4 | Spacing Modifier Letters | 60 | 20 | 80 | 0 | | | 80 | | 6 | 02B0 | 02FF | 688 | 767 | | 5 | Modifier Tone Letters | 27 | | 27 | 5 | | | 32 | | 114 | A700 | A71F | 42752 | 42783 | Unified Diacritics | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | D | Unified Diacritics | 156 | 4 | 160 | 79 | | 1 | 240 | 0 | (sequences 7, 58, 64, 131) | | 1 | Combining Diacritical Marks | 107 | 4 | 111 | 0 | | 1 | 112 | | 7 | 0300 | 036F | 768 | 879 | | 2 | Combining Diacritical Marks Supplement | 13 | | 13 | 51 | | | 64 | | 58 | 1DC0 | 1DFF | 7616 | 7679 | | 3 | Combining Diacritical Marks for Symbols | 32 | | 32 | 16 | | | 48 | | 64 | 20D0 | 20FF | 8400 | 8447 | | 4 | Combining Half Marks | 4 | | 4 | 12 | | | 16 | | 131 | FE20 | FE2F | 65056 | 65071 | Unified Punctuation | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | E | Unified Punctuation | 88 | 25 | 113 | 108 | | 19 | 240 | 0 | (sequences 61 & 92) | | 1 | General Punctuation | 62 | 25 | 87 | 6 | | 19 | 112 | | 61 | 2000 | 206F | 8192 | 8303 | | 2 | Supplemental Punctuation | 26 | | 26 | 102 | | | 128 | | 92 | 2E00 | 2E7F | 11776 | 11903 | Unified Symbols | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. 
| Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | F | Unified Symbols | 2,528 | 90 | 2,618 | 241 | 18 | 55 | 2,896 | 10,414 | (sequences 63, 65-71, 73-84, 91, 95, 97, 109, 167-169, 171) | | 1 | Arrows Blocks | 268 | 6 | 274 | 0 | 18 | 0 | 256 | 0 | (sequences 67, 79, 81, 84) | | 1.1 | Arrows | 106 | 6 | 112 | 0 | | | 112 | | 67 | 2190 | 21FF | 8592 | 8703 | | 1.2 | Supplemental Arrows-A | 16 | | 16 | 0 | | | 16 | | 79 | 27F0 | 27FF | 10224 | 10239 | | 1.3 | Supplemental Arrows-B | 128 | | 128 | 0 | | | 128 | | 81 | 2900 | 297F | 10496 | 10623 | | 1.4 | Miscellaneous Symbols and Arrows | 18 | | 18 | 0 | 18 | | 0 | | 84 | 2B00-2B11 | 11008-11025 | | 2 | Braille Patterns | 256 | | 256 | 0 | | | 256 | | 80 | 2800 | 28FF | 10240 | 10495 | | 3 | Control Pictures | 39 | | 39 | 25 | | | 64 | | 70 | 2400 | 243F | 9216 | 9279 | | 4 | Counting Rod Numerals | 18 | | 18 | 14 | | | 32 | | 168 | 1D360 | 1D37F | 119648 | 119679 | | 5 | Currency Symbols | 22 | | 22 | 26 | | | 48 | | 63 | 20A0 | 20CF | 8352 | 8399 | | 6 | Geometrical Symbols | 256 | 0 | 256 | 0 | | | 256 | 0 | (sequences 73-75) | | 6.1 | Geometric Shapes | 96 | | 96 | 0 | | | 96 | | 75 | 25A0 | 25FF | 9632 | 9727 | | 6.2 | Box Drawing | 128 | | 128 | 0 | | | 128 | | 73 | 2500 | 257F | 9472 | 9599 | | 6.3 | Block Elements | 32 | | 32 | 0 | | | 32 | | 74 | 2580 | 259F | 9600 | 9631 | | 7 | Letterlike Symbols | 38 | 4 | 42 | 1 | | 37 | 80 | | 65 | 2100 | 214F | 8448 | 8527 | | 8 | Math | 632 | 47 | 679 | 9 | | | 688 | 0 | (sequences 68, 78, 82-83) | | 8.1 | Mathematical Operators | 214 | 42 | 256 | 0 | | | 256 | | 68 | 2200 | 22FF | 8704 | 8959 | | 8.2 | Supplemental Mathematical Operators | 251 | 5 | 256 | 0 | | | 256 | | 83 | 2A00 | 2AFF | 10752 | 11007 | | 8.3 | Miscellaneous Mathematical Symbols-A | 39 | | 39 | 9 | | | 48 | | 78 | 27C0 | 27EF | 10176 | 10223 | | 8.4 | Miscellaneous Mathematical Symbols-B | 128 | | 128 | 0 | | | 128 | | 82 | 2980 | 29FF | 10624 | 10751 | | 9 | Miscellaneous 
Symbols | 818 | 2 | 820 | 122 | | 18 | 960 | 0 | (sequences 69, 76-77, 84) | | 9.1 | Miscellaneous Symbols and Arrows | 238 | | 238 | 0 | | 18 | 256 | | 84 | 2B00 | 2BFF | 11008 | 11263 | | 9.2 | Miscellaneous Symbols | 176 | | 176 | 80 | | | 256 | | 76 | 2600 | 26FF | 9728 | 9983 | | 9.3 | Miscellaneous Technical | 230 | 2 | 232 | 24 | | | 256 | | 69 | 2300 | 23FF | 8960 | 9215 | | 9.4 | Dingbats | 174 | | 174 | 18 | | | 192 | | 77 | 2700 | 27BF | 9984 | 10175 | | 10 | Number Forms | 19 | 31 | 50 | 14 | | | 64 | | 66 | 2150 | 218F | 8528 | 8591 | | 11 | Optical Character Recognition | 11 | | 11 | 21 | | | 32 | | 71 | 2440 | 245F | 9280 | 9311 | | 12 | Tai Xuan Jing Symbols | 87 | | 87 | 9 | | | 96 | | 167 | 1D300 | 1D35F | 119552 | 119647 | | 13 | Yijing Hexagram Symbols | 64 | | 64 | 0 | | | 64 | | 109 | 4DC0 | 4DFF | 19904 | 19967 | | 14 | Unallocated Symbol Blocks | | | | | | | | 48 | (sequences 91 & 95) | | 14.1 | | | | | | | | | 32 | 91 | 2DE0 | 2DFF | 11744 | 11775 | | 14.2 | | | | | | | | | 16 | 95 | 2FE0 | 2FEF | 12256 | 12271 | | 15 | Unallocated Symbol Blocks | | | | | | | | 10,366 | (sequences 169 & 171) | | 15.1 | | | | | | | | | 128 | 169 | 1D380 | 1D3FF | 119680 | 119807 | | 15.2 | | | | | | | | | 10,238 | 171 | 1D800 | 1FFFD | 120832 | 131069 | Music Notation | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. 
| Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | G | Music Notation and Symbols | 522 | 13 | 535 | 57 | | | 592 | 176 | (sequences 163-166) | | 1 | Byzantine Musical Symbols | 246 | | 246 | 10 | | | 256 | | 163 | 1D000 | 1D0FF | 118784 | 119039 | | 2 | Musical Symbols | 206 | 13 | 219 | 37 | | | 256 | | 164 | 1D100 | 1D1FF | 119040 | 119295 | | 3 | Ancient Greek Musical Notation | 70 | | 70 | 10 | | | 80 | | 165 | 1D200 | 1D24F | 119296 | 119375 | | 4 | Unallocated Musical Blocks | | | | | | | | 176 | (sequence 166) | | 4.1 | | | | | | | | | 176 | 166 | 1D250 | 1D2FF | 119376 | 119551 | Unihan CJKV Blocks | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | H | Unified CJK Blocks | 70,426 | 7 | 70,433 | 127 | | | 70,560 | 22,320 | (sequences 93, 96, 104, 108, 110, 113-114, 117, 119, 120-121, 172-173) | | 1 | Unified CJK Support Blocks | 200 | 7 | 207 | 49 | | | 256 | 0 | (sequences 93, 96-97, 104) | | 1.1 | CJK Radicals Supplement | 113 | 2 | 115 | 13 | | | 128 | | 93 | 2E80 | 2EFF | 11904 | 12031 | | 1.2 | Ideographic Description Characters | 12 | | 12 | 4 | | | 16 | | 96 | 2FF0 | 2FFF | 12272 | 12287 | | 1.3 | CJK Strokes | 16 | | 16 | 32 | | | 48 | | 104 | 31C0 | 31EF | 12736 | 12783 | | 1.4 | CJK Symbols and Punctuation | 59 | 5 | 64 | 0 | | | 64 | | 97 | 3000 | 303F | 12288 | 12351 | | 2 | Unified Han Ideographs | 70,226 | | 70,226 | 78 | | | 70,304 | 0 | (sequences 108, 110, 172) | | 2.1 | CJK Unified Ideographs Extension A | 6,582 | | 6,582 | 10 | | | 6,592 | | 108 | 3400 | 4DBF | 13312 | 19903 | | 2.2 | CJK Unified Ideographs | 20,924 | | 20,924 | 68 | | | 20,992 | | 110 | 4E00 | 9FFF | 19968 | 40959 | | 2.3 | CJK Unified Ideographs Extension B | 42,720 | | 42,720 | | | | 42,720 | | 172 | 20000 | 2A6DF | 131072 | 173791 | | 3 | Unallocated Unihan | | | | | | | | 22,320 | (sequences 113, 117, 119, 121, 173) | | 3.1 | | | | | | | | | 20,768 | 
173 | 2A6E0 | 2F7FF | 173792 | 194559 | | 3.2 | | | | | | | | | 560 | 113 | A4D0 | A6FF | 42192 | 42751 | | 3.3 | | | | | | | | | 16 | 117 | A830 | A83F | 43056 | 43071 | | 3.4 | | | | | | | | | 896 | 119 | A880 | ABFF | 43136 | 44031 | | 3.5 | | | | | | | | | 80 | 121 | D7B0 | D7FF | 55216 | 55295 | Legacy Compatibility Blocks | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | I | Legacy Compatibility Blocks | 41 | 3,054 | 3,095 | 232 | 0 | 1 | 3,328 | 1,502 | (sequences 62, 72, 94, 101, 106-107, 126-128, 130, 132-135, 170, 174) | | 1 | Enclosed Alphanumerics | 21 | 139 | 160 | 0 | | | 160 | | 72 | 2460 | 24FF | 9312 | 9471 | | 2 | Superscripts and Subscripts | 0 | 34 | 34 | 14 | | | 48 | | 62 | 2070 | 209F | 8304 | 8351 | | 3 | Alphabetic Presentation Forms | 1 | 57 | 58 | 22 | | | 80 | | 127 | FB00 | FB4F | 64256 | 64335 | | 4 | Arabic Compatibility | 4 | 731 | 735 | 96 | | 1 | 832 | | | | | | | | 4.1 | Arabic Presentation Forms-A | 3 | 592 | 595 | 93 | | | 688 | | 128 | FB50 | FDFF | 64336 | 65023 | | 4.2 | Arabic Presentation Forms-B | 1 | 139 | 140 | 3 | | 1 | 144 | | 134 | FE70 | FEFF | 65136 | 65279 | | 5 | CJK and Ideograph Compatibility | 15 | 2,093 | 2,108 | 100 | | | 2,208 | | | | | | | | 5.1 | KangXi Radicals | 0 | 214 | 214 | 10 | | | 224 | | 94 | 2F00 | 2FDF | 12032 | 12255 | | 5.2 | Hangul Compatibility Jamo | 0 | 94 | 94 | 2 | | | 96 | | 101 | 3130 | 318F | 12592 | 12687 | | 5.3 | CJK Compatibility | 0 | 256 | 256 | 0 | | | 256 | | 107 | 3300 | 33FF | 13056 | 13311 | | 5.4 | CJK Compatibility Ideographs | 12 | 455 | 467 | 45 | | | 512 | | 126 | F900 | FAFF | 63744 | 64255 | | 5.5 | Vertical Forms | 0 | 10 | 10 | 6 | | | 16 | | 130 | FE10 | FE1F | 65040 | 65055 | | 5.6 | CJK Compatibility Forms | 2 | 30 | 32 | 0 | | | 32 | | 132 | FE30 | FE4F | 65072 | 65103 | | 5.7 | Small Form Variants | 0 | 26 | 26 | 6 | | | 32 | | 133 | FE50 | FE6F | 65104 | 65135 | | 
5.8 | Halfwidth and Fullwidth Forms | 0 | 225 | 225 | 15 | | | 240 | | 135 | FF00 | FFEF | 65280 | 65519 | | 5.9 | CJK Compatibility Ideographs Supplement | 0 | 542 | 542 | 2 | | | 544 | | 174 | 2F800 | 2FA1F | 194560 | 195103 | | 5.10 | Enclosed CJK Letters and Months | 1 | 241 | 242 | 14 | | | 256 | | 106 | 3200 | 32FF | 12800 | 13055 | | 6 | Unallocated Compatibility Blocks | | | | | | | | 1,502 | | | | | | | 6.1 | | | | | | | | | 1,502 | 175 | 2FA20 | 2FFFD | 195104 | 196605 | Other Compatibility Blocks | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | J | Other Compatibility Blocks | 0 | 1,033 | 1,033 | 28 | 37 | 0 | 1,024 | 0 | (sequences 65 & 170) | | 1 | Letterlike Symbols | 0 | 37 | 37 | 0 | 37 | | 0 | | 65 | 2100 | 214F | 8448 | 8527 | | 2 | Mathematical Alphanumeric Symbols | 0 | 996 | 996 | 28 | | | 1,024 | | 170 | 1D400 | 1D7FF | 119808 | 120831 | Special-purpose characters | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. 
| Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End | | K | Control Characters | 435 | 6 | 441 | | 86 | 0 | 396 | 65,166 | (sequences 1-2, 7, 61, 134, 136, 129, 177, 178) | | 1 | ASCII/8859-1 Controls | 65 | 0 | 65 | 0 | 65 | 0 | 0 | 0 | (sequences 1 & 2) | | 1.1 | C0 (Latin, Basic) | 33 | | 33 | | 33 | | 0 | | 1 | 0000 | 001F & 007F | 0 | 31 & 127 | | 1.2 | C1 (Latin-1 Supplement) | 32 | | 32 | | 32 | | 0 | | 2 | 0080 | 009F | 128 | 159 | | 2 | Byte Order Mark | 1 | | 1 | | 1 | | 0 | | 134 | FEFF | | 65279 | | | 3 | Combining Grapheme Joiner | 1 | | 1 | | 1 | | 0 | | 7 | 034F | | 847 | | | 4 | General Punctuation | 13 | 6 | 19 | 0 | 19 | 0 | 0 | 0 | (sequence 61) | | 4.1 | Bidi Characters | 7 | 0 | 7 | | 7 | | 0 | | 61 | 200E-200F, 202A-202E | | 4.2 | Other Formatting | 6 | 0 | 6 | 0 | 6 | | 0 | | 61 | 2000, 200D, 2028-2029, 2060, 2063 | | 4.3 | Deprecated | 0 | 6 | 6 | | 6 | | 0 | | 61 | 206A-206F | 8298-8303 | | 5 | Specials | 5 | | 5 | 7 | | | 12 | | 136 | FFF0 | FFFD | 65520 | 65533 | | 6 | Tags | 95 | | 95 | 33 | | | 128 | | 177 | E0000 | E007F | 917504 | 917631 | | 7 | Variation Selectors | 256 | | 256 | | | | 256 | | | | | | | | 7.1 | Variation Selectors | 16 | | 16 | | | | 16 | | 129 | FE00 | FE0F | 65024 | 65039 | | 7.2 | Variation Selectors Supplement | 240 | | 240 | | | | 240 | | 178 | E0100 | E01EF | 917760 | 917999 | | 8 | Unallocated Special-Purpose | 0 | | 0 | | | | 0 | 65,166 | (sequences 179-181) | | 8.1 | | | | | | | | | 128 | 179 | E0080 | E00FF | 917632 | 917759 | | 8.2 | | | | | | | | | 16 | 180 | E01F0 | E01FF | 918000 | 918015 | | 8.3 | | | | | | | | | 65,022 | 181 | E0200 | EFFFD | 918016 | 983037 |
Surrogates

| | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End |
| L | Surrogates | 0 | 0 | 0 | 0 | 0 | 0 | 2,048 | 0 | (sequences 122-124) | | | | |
| 1 | High Private Use Surrogates | | | | | | | 128 | | 123 | DB80 | DBFF | 56192 | 56319 |
| 2 | High Surrogates | | | | | | | 896 | | 122 | D800 | DB7F | 55296 | 56191 |
| 3 | Low Surrogates | | | | | | | 1,024 | | 124 | DC00 | DFFF | 56320 | 57343 |

Private use characters

| | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End |
| M | Private Use Areas | 0 | 0 | 0 | 0 | 0 | 0 | 137,468 | 0 | (sequences 125, 182-183) | | | | |
| 1 | Private Use Area | | | | | | | 6,400 | | 125 | E000 | F8FF | 57344 | 63743 |
| 2 | Supplementary Private Use Area-A | | | | | | | 65,534 | | 182 | F0000 | FFFFD | 983040 | 1048573 |
| 3 | Supplementary Private Use Area-B | | | | | | | 65,534 | | 183 | 100000 | 10FFFD | 1048576 | 1114109 |

Unused Planes

| | Script-Block Name | Core | Compat | Provd. | Resrvd | Incl. | Excl. | Allocd | Unallocd | Seq | Hex Start | Hex End | Dec Start | Dec End |
| N | Planes 3 through 13 | 0 | 0 | 0 | 0 | | 0 | 0 | 720,874 | 176 | | | | |
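The grand totals at the head of the summary table can be cross-checked against its own subtotals. The Python sketch below uses only figures that appear in the table; the variable names are invented for this illustration.

```python
# Allocated code points, from the totals rows of the table.
allocd_script_and_symbol_blocks = 102_014
allocd_private_use = 137_468
allocd_surrogates = 2_048
nonchars_in_arabic_forms_a = 32  # U+FDD0..U+FDEF, subtracted from Allocd

allocd_total = (allocd_script_and_symbol_blocks + allocd_private_use
                + allocd_surrogates - nonchars_in_arabic_forms_a)

# Unallocated code points: planes in current use plus planes 3 through 13.
unallocd_in_used_planes = 151_674
unallocd_in_unused_planes = 720_874
unallocd_total = unallocd_in_used_planes + unallocd_in_unused_planes

# Noncharacters: unallocated planes, planes in use, and the Arabic block.
nonchars_total = 22 + 12 + nonchars_in_arabic_forms_a

potential_code_points = allocd_total + unallocd_total + nonchars_total
```

With these figures `allocd_total` is 241,498, `unallocd_total` is 872,548 and `nonchars_total` is 66, so `potential_code_points` comes to 1,114,112, which equals 17 × 2^16 (hexadecimal 110000).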