Character Sets

Introduction

Character sets are standards established for two main purposes: education and computing. Educational character sets are not the focus here, but it is important to note that in both China and Taiwan, national educational standards have been incorporated into the character sets used for computing. An excellent place to begin learning about Chinese, Japanese, and Korean (CJK) character set standards is the introductory chapter in Ken Lunde's book CJKV Information Processing: [CJKVInfoProc.Chap1.pdf]

Encodings map the characters in a character set to integers, which are conventionally written in hexadecimal. Hexadecimal numbers are base-16 numbers, written using the digits 0-9 (for the values 0-9) and A-F (for the values 10-15). The encoding orders the characters in the set and assigns each one a value, known as a code point.
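As a rough illustration of code points, the following Python sketch prints the value assigned to one hanzi by two legacy encodings and by Unicode. The helper name code_point is hypothetical, and the codec names "big5" and "gb2312" are the ones used by Python's standard library, not names taken from this page.

```python
# Sketch: the same character has a different code point in each encoding.
ch = "中"

def code_point(char, encoding):
    """Return the character's encoded bytes as a single integer."""
    return int.from_bytes(char.encode(encoding), "big")

for name in ("big5", "gb2312"):
    # Print the code point as a four-digit hexadecimal number.
    print(f"{name}: {code_point(ch, name):04X}")

# Python strings are Unicode, so ord() gives the Unicode value directly.
print(f"unicode: {ord(ch):04X}")
```

The three values differ, which is why text must always be labeled with the encoding that produced it.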

Character encoding forms, or transformation formats, map the encoding's code points to units of data that your computer can process. This data is conceived as sequences of binary digits (known as "bits"), each with a value of either 0 or 1. Until recently, most systems used 8-bit sequences (known as "bytes" or "octets") to process text, for which there are 256 possible values. These are represented by two-digit hexadecimal code points (00-FF, a total of 256 values). This is not sufficient for the many thousands of Chinese characters, known as hanzi. "Double-byte" character encoding forms use two bytes for each character, represented as four-digit hexadecimal code points (0000-FFFF, a total of 65,536 values).
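The arithmetic behind these limits can be checked with a short Python sketch. The variable names are illustrative only, and Python's "big5" codec stands in here for a typical double-byte encoding.

```python
# Sketch: how many distinct values fit in one byte and in two bytes.
single_byte_values = 2 ** 8    # 8 bits, code points 00-FF
double_byte_values = 2 ** 16   # 16 bits, code points 0000-FFFF

print(single_byte_values)   # 256
print(double_byte_values)   # 65536

# In a double-byte encoding, ASCII stays one byte; a hanzi takes two.
ascii_length = len("A".encode("big5"))
hanzi_length = len("中".encode("big5"))
print(ascii_length, hanzi_length)
```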

Most recent systems use 16-bit sequences to process text. Mac OS X, for example, uses the Unicode 16-bit character encoding form, UTF-16. Nonetheless, double-byte encodings are a legacy of 8-bit data processing that will be with us for years to come, especially on the Internet. HTML and MIME, for example, are 8-bit protocols.

Speaking of protocols, most of the encodings discussed here have official charset names registered with the Internet Assigned Numbers Authority (IANA). The names are used to identify the encoding used in web pages and emails. Charset names are not case-sensitive. For example, here is a typical web-page header meta element that sets the encoding to Big Five:

<meta http-equiv="content-type" content="text/html; charset=big5">

For plain-text email, the encoding is set in the "content-type" header, as follows:

content-type: text/plain; format=flowed; charset=big5
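A program that receives such a message has to read the charset back out of the header. The sketch below uses the email package from Python's standard library to do so; the one-line message is made up for illustration.

```python
# Sketch: read the charset parameter from a content-type header.
from email import message_from_string

raw_message = (
    "content-type: text/plain; format=flowed; charset=big5\n"
    "\n"
    "message body\n"
)

msg = message_from_string(raw_message)

content_type = msg.get_content_type()    # the media type without parameters
charset = msg.get_content_charset()      # the charset parameter, lowercased
flow_format = msg.get_param("format")    # any other parameter, by name

print(content_type, charset, flow_format)
```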

Here is a list of the preferred charset names for the most common encodings used for Chinese text:

  • BIG5
  • BIG5-HKSCS
  • GB2312
  • GBK
  • GB18030
  • EUC-CN
  • HZ-GB-2312
  • ISO-2022-CN
  • UTF-8
  • UTF-16
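Whether a given charset name is usable depends on the software. As one example, this Python sketch asks the standard codecs module which of the names above it recognizes. The helper find_codec is hypothetical, and a name that Python does not recognize may still be perfectly valid elsewhere.

```python
# Sketch: check which charset names this Python build can handle.
import codecs

CHARSET_NAMES = [
    "BIG5", "BIG5-HKSCS", "GB2312", "GBK", "GB18030",
    "EUC-CN", "HZ-GB-2312", "ISO-2022-CN", "UTF-8", "UTF-16",
]

def find_codec(name):
    """Return Python's canonical codec name, or None if unsupported."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None

for name in CHARSET_NAMES:
    print(name, "->", find_codec(name) or "not available in this Python build")

# Lookups ignore case, matching the rule for charset names.
same_codec = find_codec("BIG5") == find_codec("big5")
```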

Chinese Standards

CNS 11643

CNS = Chinese National Standard

CNS 11643 is the official Taiwan national standard. The first two planes were adopted in 1986 as a corrected and reorganized version of Big Five. In practice, however, Big Five has remained the de facto national standard in Taiwan. CNS 11643 has been implemented for the Unix platform in EUC-TW.

In 1992, CNS 11643 was extended to seven planes and a total of 48,027 hanzi. The official specification is available here.

http://www.cns11643.gov.tw/

Big Five

Traditional-Chinese only. Big Five (1984, "Big-5") gets its name from the consortium of five companies in Taiwan that developed it. It contains 13,051 distinct hanzi, arranged in two levels and ordered within each level by total number of strokes, then by radical. The most common extension to Big Five is ETen, which includes additional punctuation and numerals, 25 radicals and radical-like elements, a full set of Japanese kana, and more. Most Big Five fonts contain this extension, including those distributed by Apple.

Big Five has an unofficial analog character set, developed by font vendors, in which simplified forms replace its traditional forms. This is known as GB Five, usually written as "GB5." Big Five fonts sometimes come in pairs, one with the standard Big-5 character set [a.k.a. "Big-5繁體"] and the other with the GB-5 character set [a.k.a. "Big-5簡體"].

Charset name: BIG5.

Microsoft code page 950 is based on Big Five: http://www.microsoft.com/globaldev/reference/dbcs/950.htm
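Python ships separate codecs named "big5" and "cp950", which makes it easy to compare the two for a given character. The sketch below checks a single common hanzi only; it says nothing about the areas where the code page and the original standard differ.

```python
# Sketch: a common hanzi has the same bytes in Big Five and code page 950.
ch = "中"

big5_bytes = ch.encode("big5")
cp950_bytes = ch.encode("cp950")

print(big5_bytes.hex().upper(), cp950_bytes.hex().upper())
```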

Big Five Plus

Big Five Plus (1997, "Big-5+") is an extension to Big Five that includes all 20,914 hanzi in the CJK Unified Ideographs block of Unicode. Big5+ has not been widely implemented, largely because it was forced to define code points outside of the original code space reserved by Big Five.

Big Five Extension

Traditional-Chinese only. Big Five Extension (1998, "Big-5E") adds a select group of 3,954 hanzi to Big Five. They appear in three blocks of code points: 8E40-A0FE, 8140-86DF, 86E0-875C, all of which are inside the original code space reserved by Big Five.
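The three blocks listed above can be expressed as a simple range test. This is only a sketch: the function name in_big5e_blocks is hypothetical, and it compares numeric ranges without checking that the second byte is a valid Big Five trail byte.

```python
# Sketch: test whether a two-byte code falls in one of the Big5E blocks.
BIG5E_BLOCKS = [
    (0x8E40, 0xA0FE),
    (0x8140, 0x86DF),
    (0x86E0, 0x875C),
]

def in_big5e_blocks(code):
    """Return True if the code lies inside any block listed above."""
    return any(low <= code <= high for low, high in BIG5E_BLOCKS)

print(in_big5e_blocks(0x8E40))   # first code of the first block
print(in_big5e_blocks(0xA4A4))   # an ordinary Big Five hanzi code
```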

The Traditional Chinese Input Method in Mac OS X 10.3 and above supports Big5E in the fonts LiHei 儷黑 Pro and LiSong 儷宋 Pro.

http://www.cmex.org.tw/info.htm#big5e

Hong Kong SCS

In 1995, the government of Hong Kong created its own extension to Big Five, calling it the Government Common Character Set (GCCS). In 1999, they revised it and renamed it the Hong Kong Supplementary Character Set (HKSCS or Hong Kong SCS). It was updated in 2001, 2004, and 2008, for a current total of 4,568 traditional-form hanzi.

Unicode 4.1 (2005) and HKSCS-2004 are fully coordinated, and Unicode 5.2 (2009) and HKSCS-2008 are fully coordinated. Thus, all HKSCS characters map to Unicode characters. HKSCS-2008 is the last version that will be published with Big Five code points.

The Traditional Chinese Input Method in Mac OS X 10.3 and above supports HKSCS-2001 in the fonts LiHei 儷黑 Pro and LiSong 儷宋 Pro.

Charset name: BIG5-HKSCS.

http://www.ogcio.gov.hk/ccli/eng/hkscs/

GB 2312

GB = Guójiā Biāozhǔn 国家标准, "National Standard"

Simplified-Chinese only. GB 2312 (1980) includes 6,763 hanzi on two levels (the first is arranged by reading, the second by radical then number of strokes), along with symbols and punctuation, Japanese kana, the Greek and Cyrillic alphabets, Zhuyin, and two sets of Pinyin letters with tone marks (full-width and half-width). Some of these were added with the first extension to GB 2312, GB 6345 (1986), which also contained two corrections. Most "GB 2312" fonts contain this extension, including those distributed by Apple. Two later extensions, in 1988 and 1992, were not as widely adopted. All of these extensions were incorporated into GBK in 1995, and GB 2312 and all of its extensions were replaced by GB 18030 in 2000.

GB 2312 has an official analog character set in which traditional forms replace its simplified forms, known as GB/T 12345 (1990). GB 2312 fonts sometimes come in pairs, one with the GB 2312 character set [a.k.a. "GB-2312简体/簡體"] and the other with the GB/T 12345 character set [a.k.a. "GB-2312繁体/繁體"].

Charset name: GB2312. In Windows, the charset name GB2312 includes all of its extensions, including GBK.

A PDF chart of GB 2312 is available at ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/AppE/

A PDF chart of GB/T 12345 is available at ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/AppF/

GBK

GBK = Guójiā Biāozhǔn Kuòzhǎn 国家标准扩展, "GB Extension"

GBK (1995) is an extension to GB 2312 that includes all 20,914 hanzi in the CJK Unified Ideographs block of Unicode, plus 101 additional hanzi.
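The relationship between the two sets can be seen with Python's "gb2312" and "gbk" codecs. In this sketch the helper can_encode is hypothetical; the simplified form 体 is in both sets, while its traditional form 體 is in GBK only.

```python
# Sketch: GBK keeps GB 2312 code points and adds characters beyond it.
def can_encode(text, encoding):
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

simplified = "体"
traditional = "體"

# A GB 2312 character has the same bytes in GBK.
same_bytes = simplified.encode("gb2312") == simplified.encode("gbk")

print(same_bytes)
print(can_encode(traditional, "gb2312"))
print(can_encode(traditional, "gbk"))
```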

In Windows 95 and later, the scope of the charset name GB2312 includes GBK. This makes sense, as GBK is an extension of GB 2312. The charset name GBK was not recognized until 2002. See: http://www.iana.org/assignments/charset-reg/GBK.

Microsoft code page 936 is based on GBK: http://www.microsoft.com/globaldev/reference/dbcs/936.htm

Note: GBK is not the same as GB 13000.1 (1993). GB 13000.1 was entirely compatible with Unicode 1.1. It was not an extension of GB 2312, and did not define GB-compatible code points.

GB 18030

GB 18030 (2000, revised 2005) is the current Chinese national standard coded character set. It replaces GB 2312 and its major extension, GBK. All characters in GB 2312 and GBK are at the same code points in GB 18030. As of Unicode 4.1 (2005), all GB 18030-2000 characters map to Unicode characters, with six in Extension B.
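The compatibility described above can be checked with Python's "gbk" and "gb18030" codecs. This is a sketch using two sample characters: a common hanzi that is in GBK, and U+20000, the first character of Extension B, which GB 18030 encodes in four bytes.

```python
# Sketch: GB 18030 keeps GBK code points and adds four-byte codes.
common = "中"
extension_b = "\U00020000"

# A GBK character has the same bytes in GB 18030.
same_bytes = common.encode("gbk") == common.encode("gb18030")

# A character outside GBK still encodes, using a four-byte sequence.
four_byte = extension_b.encode("gb18030")
round_trip = four_byte.decode("gb18030") == extension_b

print(same_bytes, len(four_byte), round_trip)
```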

GB 18030-2005 includes seven additional groups of characters: the remainder of Extension B, plus six regional scripts: Korean, Mongolian, Tai Le (Yunnan), Tibetan, Uighur, and Yi (Sichuan). None of these 2005 groups is currently required for GB 18030 compliance.

Charset name: GB18030.

http://www.iana.org/assignments/charset-reg/GB18030

EUC

Extended Unix Code (EUC) is the internal code processed by Unix software configured for a specific locale:

  • EUC-TW (Taiwan) encodes the CNS 11643 character set. Charset name: EUC-TW.
  • EUC-CN (China) is identical to GB 2312. Charset name: EUC-CN.

7-bit Encodings

7-bit encodings are "mail-safe" transformation formats used primarily for internet services like TELNET and USENET:

  • HZ (1989) encodes GB 2312. See RFC 1843. Charset name: HZ-GB-2312.
  • ISO 2022-CN (1996) encodes GB 2312 and CNS 11643 Planes 1 and 2 (the Big Five character set). See RFC 1922. Charset name: ISO-2022-CN.
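To see what "mail-safe" means in practice, the sketch below encodes a short string with Python's "hz" codec and confirms that every resulting byte fits in 7 bits. Python's standard library is used here only as a convenient HZ implementation.

```python
# Sketch: HZ output uses only 7-bit bytes and survives a round trip.
text = "中文"

hz_bytes = text.encode("hz")

# Every byte is below 0x80, so the data passes through 7-bit channels.
is_seven_bit = all(byte < 0x80 for byte in hz_bytes)
round_trip = hz_bytes.decode("hz") == text

print(hz_bytes, is_seven_bit, round_trip)
```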

Unicode

"Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. ... These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption. ... Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."

http://www.unicode.org/

ISO 10646 is fully coordinated with the Unicode Standard. Thus, ISO 10646-1:2000 has exactly the same character set and encoding as version 3.0 of the Unicode Standard, and so on. ISO 10646 is a character set. Unicode is an encoding.

The Unicode Standard defines three character encoding forms that allow the same data to be handled in 8, 16, or 32 bits per code unit, called UTF-8, UTF-16, and UTF-32.

  • UTF-8 (charset name: UTF-8) is designed for use with 8-bit protocols like HTML. It uses one to four bytes per character. Thus, a character's UTF-8 byte sequence is written with two, four, six, or eight hexadecimal digits. Mac OS 8 and above provide support for UTF-8.
  • UTF-16 (charset name: UTF-16) uses one or two 16-bit sequences per character. Thus, a character's UTF-16 sequence is written with either four or eight hexadecimal digits. Mac OS 9 and above provide support for UTF-16.
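The different lengths are easy to see by encoding a few sample characters. This Python sketch uses an ASCII letter, a common hanzi, and U+20000 from outside the Basic Multilingual Plane; the "utf-16-be" codec is chosen so that no byte order mark is added to the output.

```python
# Sketch: the same characters in UTF-8 and UTF-16, shown in hexadecimal.
samples = ["A", "中", "\U00020000"]

results = {}
for ch in samples:
    utf8_hex = ch.encode("utf-8").hex().upper()
    utf16_hex = ch.encode("utf-16-be").hex().upper()
    results[ch] = (utf8_hex, utf16_hex)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_hex}  UTF-16: {utf16_hex}")
```

The hanzi needs three bytes in UTF-8 but only one 16-bit unit in UTF-16, while the last character needs four bytes in UTF-8 and two 16-bit units in UTF-16.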

"U+" is the standard notation for a Unicode scalar value, a hexadecimal number defined for use by standards such as SGML, XML, and HTML. Unicode's Basic Multilingual Plane (BMP) has room for 65,536 characters (U+0-FFFF). The Unicode scalar values for characters in the BMP are the identical to their UTF-16 code points. As of Unicode 6.0, there are a total of 27,534 distinct Chinese, Japanese, and Korean (CJK) characters in the BMP, in two main blocks:

Unicode also provides space for over a million more characters in 16 additional "planes" (U+10000-10FFFF). As of Unicode 6.0, the Supplementary Ideographic Plane (SIP) contains 47,082 additional characters in three blocks:

For specific information about individual hanzi, see the Unihan Database, introduced in John Jenkins and Richard Cook's A User's Guide to the Unihan Database (Unicode Technical Report #38). The home page for the database is:

It contains extensive information on each CJK Unified Ideograph, with locations in standard print dictionaries and word lists based on CEDICT and EDICT (Japanese). Navigation tools include:

The Unihan.txt file contains all of the information in the database. The latest version of the file is available at http://www.unicode.org/Public/UNIDATA/
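Unihan data files are plain text, with one tab-separated record per line (code point, field name, value) and comment lines that begin with "#". The sketch below parses a small inline sample in that layout. The function name parse_unihan is hypothetical, and the sample records are illustrative rather than copied from any particular release of the file.

```python
# Sketch: parse tab-separated Unihan records into a nested dictionary.
SAMPLE = (
    "# illustrative sample, not copied from a real release\n"
    "U+4E2D\tkMandarin\tzhōng\n"
    "U+4E2D\tkTotalStrokes\t4\n"
    "U+6587\tkTotalStrokes\t4\n"
)

def parse_unihan(lines):
    """Map each character to a dictionary of its field values."""
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        char = chr(int(code[2:], 16))  # "U+4E2D" -> "中"
        entries.setdefault(char, {})[field] = value
    return entries

entries = parse_unihan(SAMPLE.splitlines())
print(entries["中"])
```

For a real file, the same function could be given an open file object instead of the sample lines.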

Richard Cook provides an online interface for this file:

More about Unicode

The ISO working group charged with the task of processing CJK characters proposed for inclusion in Unicode is called the Ideographic Rapporteur Group (IRG). They have a web site at: http://www.cse.cuhk.edu.hk/~irg/

The "unification" of the Han script in Unicode was not without controversy and confusion. Ken Whistler's On the Encoding of Latin, Greek, Cyrillic, and Han (Unicode Technical Note #26) provides an excellent review of the salient issues.

Andrew West's blog BabelStone is focused on Chinese and other scripts, like 'Phags-pa, Tibetan, Mongolian, Manchu, Khitan, Jurchen, and Tangut: http://babelstone.blogspot.com/

There is an active Unicode email discussion group:

Selected recent ISO documents related to Chinese and Unicode:

  • Summary & Resolutions from IRG Meeting 36: N4021 N4020 (April 2011)
    • Responses: N4075 (May 2011)
  • Proposal to encode Mongolian square script: N4041 (May 2011)
  • Proposal to encode Khitan small script: N3918 (October 2010)
    • Responses: N3925 (September 2010)
  • Proposal to encode Chinese chess symbols: N3910 (September 2010)
  • Request to disunify U+2F89F from U+5FF9: N3787 (March 2010)
  • Draft 5 of IRG Principles and Procedures: N3744 (March 2010)
    • On IRG Working Document Series: N3746 (March 2010)
  • Proposal to encode obsolete simplified Chinese characters: N3695 (October 2009)
  • Proposal to include Jurchen [女真] characters: N3628 (April 2009), N3688 (September 2009)
  • Proposal to include Tangut [西夏] characters: N3297, N3297A, N3297B (May 2007)
  • IRG draft agreement on how to approach the encoding of "Old Hanzi" characters: N2684 (November 2003)
    • Request for comments on fonts for oracle-bone scripts: N4048 (May 2011)
  • Five duplications in CJK Unified Ideographs Extension B: N2644 (October 2003)
  • Proposal for ideographic taboo variation indicator: N2475 (May 2002)
  • Taboo-character replacements encoded in Unicode: N2496 (May 2002)
  • IRG on ideographic variants: N2476 (May 2002)
  • Proposal for addition of monograms, digrams, and tetragrams: N2416 (February 2002)
    • Request for corrections to character names: N2988 (September 2005)
  • Proposal for additional grass radicals: N2326 [Note: April 1, 2001!]