101035 中文信息处理 Chinese NLP Lecture 2.

中文信息处理 Chinese NLP Lecture 2

字——中文编码 Chinese Character Encoding
中文字符集 Chinese Character Set; 中文编码集 Chinese Code Set; 基本编码方式 Basic Encoding Method; 中文编码方式 Chinese Encoding Method; 国际编码方式 International Encoding Method

中文字符集 Chinese Character Set
A character set is a collection of characters. For example, {a, b, c, …, z, A, B, C, …, Z, 0, 1, 2, …, 9} is an English character set and {啊, 阿, 唉, …, 作, 坐, 座} is a Chinese character set. Each character set has a name, such as ASCII or KANG XI (康熙). There is more than one Chinese character set; they differ over time and across regions.

Chinese Character Set: GB, Big5, ISO 10646-1 and Unicode
GB: The GB standards were developed in Mainland China and are based on simplified Chinese characters. GB is short for 国家标准 and means National Standard. Mainland China and Singapore use this standard.
Big5: Big5 is the most widely implemented character set standard used in Taiwan and is used for traditional Chinese characters. Regions such as Taiwan and Hong Kong use this standard.
ISO and Unicode: ISO and the Unicode Consortium jointly develop a multilingual character set that combines the majority of the world’s character sets into one large repertoire of characters. Simplified and Traditional Chinese, Korean and Japanese characters can be displayed on the same HTML page.

中文编码集 Chinese Code Set Code Set（编码集）
Code set means “coded character set”. Encoding of a character set is to represent its characters in bytes or bits. The complete set of numerical values is called code space (denoted by CODE). A value in code space is called a code (or a code point). Encoding is a mapping of a (unique) character (in a character set) to a (unique) code (in a code space).

Chinese Code Set
A coded character set, denoted by CC, is a set of tuples, $CC = \{(c_i, code_i) \mid c_i \in C,\ code_i \in CODE\}$, where $code_i \neq code_j$ if $c_i \neq c_j$. For example, let C = {中, 文, 计, 算}. C can be encoded with different code spaces.
If CODE1 = {00, 01, 10, 11}, CC1 = {(中, 00), (文, 01), (计, 10), (算, 11)}.
If CODE2 = {0000, 0001, 0010, 0011}, CC2 = {(中, 0000), (文, 0001), (计, 0010), (算, 0011)}.
If CODE3 = {1000, 1001, 1010, 1011}, CC3 = {(中, 1000), (文, 1001), (计, 1010), (算, 1011)}.
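The uniqueness condition in the definition can be checked mechanically. The following is a minimal sketch, assuming a coded character set is stored as a list of (character, code) pairs; the helper name `is_valid_coded_set` and the `reused` example are hypothetical and only serve as illustration.

```python
def is_valid_coded_set(pairs):
    """Return True if every character and every code appears only once."""
    chars = [c for c, _ in pairs]
    codes = [code for _, code in pairs]
    return len(set(chars)) == len(chars) and len(set(codes)) == len(codes)

C = "中文计算"

# Two of the code spaces from the slide.
CC1 = list(zip(C, ["00", "01", "10", "11"]))
CC2 = list(zip(C, ["0000", "0001", "0010", "0011"]))

# A hypothetical assignment that reuses the code 1001 for two characters.
reused = list(zip(C, ["1000", "1001", "1001", "1011"]))

print(is_valid_coded_set(CC1))     # True
print(is_valid_coded_set(CC2))     # True
print(is_valid_coded_set(reused))  # False
```

An assignment that gives two different characters the same code is not a valid coded character set, because decoding would be ambiguous.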

In-Class Exercise 啊阿唉作坐座
What binary values can be assigned to these 6 characters according to this code space? (Tip: first, how many bits do you need to encode 6 rows and 6 columns?) Two-dimensional code space (6×6), with rows and columns numbered 1 to 6, containing the characters 啊, 阿, 唉, 作, 坐, 座.

基本编码方式 Basic Encoding Method
The mapping of a character in a character set to a code point in code space is called code point assignment. An encoding method explains how a character is mapped to a code point and also how assignments are made to identify a mixture of different code sets (CC1, CC2, CC3, CC4 in the slide figure).

ASCII A popular encoding scheme for English characters is called ASCII (American Standard Code for Information Interchange). It defines 128 character code points (from 0x00 to 0x7F), of which the first 32 are control codes (non-printable) from 0x00 to 0x1F and the other 96 are graphic characters from 0x20 to 0x7F. Of these, only 94 are actually printable (0x21-0x7E); 0x20 is the space and 0x7F is delete. Values are represented with only 7 bits (the high-order bit of the byte is 0).
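The ASCII ranges above can be verified by counting. This is a small sketch; the function name `classify_ascii` and the category labels are assumptions made for illustration.

```python
def classify_ascii(code: int) -> str:
    """Classify a 7-bit ASCII code by the ranges given on the slide."""
    if not 0x00 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 0x1F:
        return "control"
    if code == 0x20:
        return "space"
    if code == 0x7F:
        return "delete"
    return "graphic"  # 0x21-0x7E

# Count how many of the 128 code points fall into each category.
counts = {}
for code in range(0x80):
    kind = classify_ascii(code)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)  # {'control': 32, 'space': 1, 'graphic': 94, 'delete': 1}
```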

ASCII code table (figure): each code is indexed by its high-order bits and its four low-order bits (0000 to 1111).

中文编码方式 Chinese Encoding Method
It takes 2 bytes, or 16 bits, to encode all the Chinese characters. However, not all of these 256×256 points are used for representing displayable characters. Generally, 94×94 is considered for a Chinese character encoding matrix.

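One-Byte English Characters vs. Two-Byte Chinese Characters
High-Bit-On Scheme
0xxxxxxx: English. The high bit is 0, so the code is below 128; the printable range is 0x21-0x7E.
1xxxxxxx: Chinese. The high bit of the first byte is 1, so the first byte is 128 or above; its range is 0xA1-0xFE.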

Byte stream: AB AC 41 42 43 A4 40. Chinese characters or English characters?
0xAB = 171 > 128, so AB AC is a Chinese character.
0x41 = 65 < 128, so 41 is an English character (A).
0x42 = 66 < 128, so 42 is an English character (B).
0x43 = 67 < 128, so 43 is an English character (C).
0xA4 = 164 > 128, so A4 40 is a Chinese character.
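The decision procedure in this example can be sketched as a short scanner. It assumes only the high-bit-on rule from the slide (a first byte of 128 or above starts a two-byte character); the function name `split_high_bit` and the labels are hypothetical.

```python
def split_high_bit(data: bytes):
    """Split a byte stream into one-byte and two-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            # High bit on: this byte and the next form one Chinese character.
            if i + 2 > len(data):
                raise ValueError("truncated two-byte character")
            chars.append(("chinese", data[i:i + 2].hex().upper()))
            i += 2
        else:
            # High bit off: a single-byte English character.
            chars.append(("english", data[i:i + 1].hex().upper()))
            i += 1
    return chars

sample = bytes.fromhex("AB AC 41 42 43 A4 40")
for kind, code in split_high_bit(sample):
    print(kind, code)
```

Note that only the first byte is tested. The second byte of a two-byte character may be below 128, as with A4 40.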

ISO-2022 and EUC
Two encoding methods are common to many character sets in China, Taiwan and other Asian countries: ISO-2022 and EUC. As methods, they are locale-independent. However, their exact definitions depend greatly on the locale. In other words, there are locale-specific instances of these encodings, e.g. ISO-2022-CN, ISO-2022-CN-EXT, … and EUC-CN, EUC-TW, …

Encoding Method and Supported Character Sets
ASCII: ASCII, GB-Roman, CNS-Roman, …
ISO-2022, EUC: ASCII, GB-Roman, CNS-Roman, GB, CNS, …
ASCII Encoding Range
00-1F (0-31): control characters
20 (32): space character
21-7E (33-126): graphic characters (94 printable characters)
7F (127): delete character

ISO-2022 ISO-2022 is a modal encoding, which uses escape sequences or other special characters to switch between different modes (one-byte vs. two-byte). It is used primarily as an information interchange code for moving text between computer systems. It is also often referred to as a seven-bit encoding method, because none of the bytes used to represent characters has its eighth bit enabled.

ISO-2022-CN (-EXT) ISO-2022-CN (-EXT) is a locale-specific implementation of ISO-2022. It works through designators and shifts. A designator specifies the character set associated with a particular shift. There are four shifts: SI, SO, SS2 and SS3. A shift specifies how to interpret the subsequent bytes. Each line starts in ASCII and ends in ASCII. A shifting character, SO (0x0E) or SI (0x0F), switches between one-byte and two-byte modes. SO (Shift Out) invokes two-byte mode (for GB and CNS Plane 1) for the following bytes until SI (Shift In) is encountered, which invokes one-byte mode. There must be a shift back to ASCII (by SI) before the end of the line.

ISO-2022-CN (-EXT)
A single shift sequence, indicated by SS2 (0x1B 0x4E) or SS3 (0x1B 0x4F), invokes two-byte mode only for the following two bytes, and is typically employed for rarely used character sets. A designator (escape) sequence indicates which character set is invoked in two-byte mode, e.g. 0x1B 0x24 0x29 0x41 (ESC $ ) A in ASCII) indicates GB.
Shifting type: invoked character sets
SO: GB, CNS Plane 1
SS2: CNS Plane 2
SS3: CNS Planes 3-7

Designator and shift: 1B 24 29 47 31 30 0E 45 4C 0F 31 38 0E 45 4A 0F
1B 24 29 47: designate CNS Plane 1
31 30: ASCII codes (one-byte mode)
0E: SO, shift to two-byte mode
45 4C: CNS Plane 1 code
0F: SI, shift to one-byte mode
31 38: ASCII codes
0E: SO, shift to two-byte mode
45 4A: CNS Plane 1 code
0F: SI, shift to one-byte mode
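The walk through the byte stream above can be reproduced with a small state machine. This is only a sketch under stated assumptions: it knows just the two SO designators that appear in the slides (ESC $ ) A for GB and ESC $ ) G for CNS Plane 1), it ignores SS2 and SS3, and the function name `tokenize_iso2022_cn` is hypothetical.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Only the two designators mentioned in the slides are recognised here.
DESIGNATORS = {
    b"\x1b$)A": "GB",
    b"\x1b$)G": "CNS Plane 1",
}

def tokenize_iso2022_cn(data: bytes):
    """Split one line of ISO-2022-CN bytes into labelled tokens."""
    tokens = []
    so_set = None      # character set currently designated for SO
    two_byte = False   # False: one-byte (ASCII) mode, True: two-byte mode
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            name = DESIGNATORS.get(data[i:i + 4])
            if name is None:
                raise ValueError("unsupported escape sequence")
            so_set = name
            tokens.append(("designate", name))
            i += 4
        elif byte == SO:
            if so_set is None:
                raise ValueError("SO before any designator")
            two_byte = True
            i += 1
        elif byte == SI:
            two_byte = False
            i += 1
        elif two_byte:
            if i + 2 > len(data):
                raise ValueError("truncated two-byte code")
            tokens.append((so_set, data[i:i + 2].hex().upper()))
            i += 2
        else:
            tokens.append(("ASCII", chr(byte)))
            i += 1
    if two_byte:
        raise ValueError("line must shift back to ASCII with SI")
    return tokens

sample = bytes.fromhex("1B242947 3130 0E 454C 0F 3138 0E 454A 0F")
for token in tokenize_iso2022_cn(sample):
    print(token)
```

The final check mirrors the rule that a line must return to ASCII before it ends.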

EUC EUC (Extended Unix Code) encoding is implemented as the internal code for most Unix software configured to support Japanese. Although U represents Unix, this encoding is commonly used on other platforms, such as Windows and Mac OS. The full definition of EUC encoding consists of four code sets. Code set 0 is always set to the ASCII character set or a country’s own version thereof. The remaining code sets are defined as a set of variants from which each country can select.

EUC-CN
EUC-CN is a locale-specific implementation of EUC.
Code set 0 (0xxxxxxx): byte range 21-7E (33-126), 94 characters.
Code set 1 (1xxxxxxx 1xxxxxxx), EUC-CN (GB): first byte range A1-FE, second byte range A1-FE. The code range of both the first byte and the second byte is above 128, i.e. 0xA1-0xFE.

EUC-TW EUC-TW is by far the most complex implementation of EUC encoding in terms of how many characters it encodes, i.e. close to 50,000 characters.

ISO-2022 vs EUC
EUC encoding is closely related to ISO-2022. In fact, every character that can be encoded by ISO-2022 can be converted to an EUC-encoded equivalent.
Locale (character set): text, ISO-2022-CN code, EUC-CN or EUC-TW code set 1 code
China (GB): 汉字, 3A3A 5756, BABA D7D6
Taiwan (CNS): 漢字, 6947 4773, E9C7 C7F3
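The codes in this comparison differ only in the high bit of each byte: setting it turns the seven-bit ISO-2022 code into the EUC code set 1 code. A minimal sketch follows, with hypothetical helper names; it uses Python's `gb2312` codec (which decodes EUC-CN bytes) to confirm the GB row.

```python
def iso2022_to_euc(code: bytes) -> bytes:
    """Set the high bit of every byte (7-bit code to EUC code set 1)."""
    return bytes(b | 0x80 for b in code)

def euc_to_iso2022(code: bytes) -> bytes:
    """Clear the high bit of every byte (EUC code set 1 to 7-bit code)."""
    return bytes(b & 0x7F for b in code)

gb_7bit = bytes.fromhex("3A3A 5756")    # 汉字 in ISO-2022-CN (GB)
cns_7bit = bytes.fromhex("6947 4773")   # CNS Plane 1 codes from the slide

gb_euc = iso2022_to_euc(gb_7bit)
cns_euc = iso2022_to_euc(cns_7bit)

print(gb_euc.hex().upper())    # BABAD7D6
print(cns_euc.hex().upper())   # E9C7C7F3
print(gb_euc.decode("gb2312"))  # 汉字
```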

GBK
GBK encoding is implemented as the internal code for the Chinese (PRC) version of Microsoft’s Windows and IBM’s OS/2. The GBK character set contains 21,886 symbols and Chinese characters.
ASCII or GB-Roman: byte range 21-7E (33-126)
GBK: first byte range 81-FE (129-254); second byte ranges 40-7E and 80-FE (64-126 and 128-254)

GBK One of the design principles of GBK is that it should be fully compatible with GB2312 and extend it to cover the Chinese characters in Unicode, which had 20,902 of them in its first version.

Big5
The Big5 encoding range has a lot in common with EUC-TW code sets 0 and 1; the main difference is an additional encoding block.
ASCII or CNS-Roman: byte range 21-7E (33-126)
Big5: first byte range A1-FE (161-254); second byte ranges 40-7E and A1-FE (64-126 and 161-254)
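The two-byte ranges given for GBK and Big5 can be written down as data and used to test whether a byte pair fits a layout. This is a sketch only: the table `LAYOUTS` and the helpers `in_ranges` and `fits` are hypothetical names, and fitting the ranges does not mean a character is actually assigned at that position.

```python
# First-byte and second-byte ranges as listed on the GBK and Big5 slides.
LAYOUTS = {
    "GBK": {"first": [(0x81, 0xFE)], "second": [(0x40, 0x7E), (0x80, 0xFE)]},
    "Big5": {"first": [(0xA1, 0xFE)], "second": [(0x40, 0x7E), (0xA1, 0xFE)]},
}

def in_ranges(byte: int, ranges) -> bool:
    """Return True if the byte lies inside any of the inclusive ranges."""
    return any(lo <= byte <= hi for lo, hi in ranges)

def fits(layout_name: str, pair: bytes) -> bool:
    """Return True if a two-byte code lies inside the layout's ranges."""
    layout = LAYOUTS[layout_name]
    return (
        len(pair) == 2
        and in_ranges(pair[0], layout["first"])
        and in_ranges(pair[1], layout["second"])
    )

print(fits("GBK", "汉".encode("gbk")))     # True
print(fits("Big5", "漢".encode("big5")))   # True
print(fits("GBK", bytes([0x81, 0x40])))   # True: inside the GBK ranges
print(fits("Big5", bytes([0x81, 0x40])))  # False: first byte is below A1
```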

Big5 Big5 is the most widely implemented character set standard used in Taiwan and is used for traditional Chinese characters.

国际编码方式 International Encoding Method
Unicode and ISO
We need a multilingual character set that combines the majority of the world’s writing systems and character sets into a Universal Character Set (UCS) or Unicode.
Character set: Unicode and ISO
Encoding methods: UCS-2, UCS-4, UTF-7, UTF-8, UTF-16

BMP The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.

UCS-2 A 16-bit representation can encode up to 65,536 ($=2^{16}$) unique code points. It allocates the entire encoding space for characters (0x0000 to 0xFFFF). UCS-2 and Unicode encodings are identical for most Chinese characters. UCS-2 byte ranges: first byte 00-FF (0-255), second byte 00-FF (0-255).

UCS-4 It is a four-byte (actually, 31-bit) representation which can encode up to 2,147,483,648 ($=2^{31}$) code points. It allocates the entire encoding space for characters (0x00000000 to 0x7FFFFFFF). UCS-4 byte ranges: first byte 00-7F (0-127); second, third and fourth bytes 00-FF (0-255).

UCS-2 vs UCS-4
UCS-2 (16-bit): 0000 … FFFF, 65,536 code points. Can only encode the BMP.
Unicode (17-plane): 17 × 256 × 256 = 1,114,112 characters.
UCS-4 (31-bit): 00000000 … 7FFFFFFF, 2,147,483,648 code points. Sufficient to encode all 17 planes of the Unicode set.
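The sizes quoted in this comparison follow directly from the bit widths. A short arithmetic sketch (the variable names are illustrative only):

```python
UCS2_BITS = 16
UCS4_BITS = 31
PLANES = 17

ucs2_points = 2 ** UCS2_BITS          # code points in UCS-2
ucs4_points = 2 ** UCS4_BITS          # code points in UCS-4
unicode_points = PLANES * 256 * 256   # 17 planes of 256 x 256 positions

print(ucs2_points)     # 65536
print(ucs4_points)     # 2147483648
print(unicode_points)  # 1114112

# UCS-2 covers only one plane; UCS-4 covers all 17 planes with room to spare.
print(ucs2_points < unicode_points < ucs4_points)  # True
```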

UTF-16 In essence, UTF-16 encodes the BMP according to UCS-2 (16 bits) encoding (compatible). But it also allows the next 16 planes, which are normally only accessible through UCS-4 (32 bits) encoding. The surrogates area is defined with UTF-16 to allow for expansion beyond the 16-bit code space.

UCS-2 vs UCS-4 vs UTF-16
UCS-2: 0000 … FFFF, 65,536 code points. Can only encode the BMP (Plane 0).
Unicode: Plane 0 to Plane 16. Scalar values are denoted by U+; Planes 1 to 16 cover U+10000 … U+10FFFF.
UTF-16 surrogates: D800 DC00 … DBFF DFFF represent U+10000 … U+10FFFF.
UCS-4: 00000000 … 7FFFFFFF, 2,147,483,648 code points, of which Unicode occupies 00000000 … 0010FFFF. Sufficient to encode all 17 planes of the Unicode set.
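The mapping from a scalar value beyond the BMP to a surrogate pair can be sketched as follows. The function name `to_surrogates` is hypothetical; the arithmetic is the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits), and the result is cross-checked against Python's own `utf-16-be` codec.

```python
def to_surrogates(scalar: int):
    """Return the (high, low) UTF-16 surrogate pair for U+10000..U+10FFFF."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("scalar value is not in planes 1-16")
    offset = scalar - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

for scalar in (0x10000, 0x10FFFF):
    high, low = to_surrogates(scalar)
    print(f"U+{scalar:X} -> {high:04X} {low:04X}")
    # Cross-check against the built-in UTF-16 encoder.
    assert chr(scalar).encode("utf-16-be") == bytes.fromhex(f"{high:04X}{low:04X}")
```

The two endpoints reproduce the surrogate range shown in the figure: D800 DC00 for U+10000 and DBFF DFFF for U+10FFFF.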

Base64 Sixty-four characters are used: the upper-case and lower-case Roman alphabet characters (i.e. A-Z, a-z), the numerals (0-9), and the "+" and "/" symbols.

Base64

Base64
Step 1: Base64 takes every three bytes (each consisting of eight bits) and converts them into four six-bit segments.
Step 2: Each six-bit segment is then converted into a character in the Base64 character set.
Step 3: If the size of the original data in bytes is not a multiple of three, we append enough bytes with a value of "0" to create a 3-byte group, and mark the appended bytes with the Base64 padding character "=".
Example: the six-bit segments 001001 011011 010001 101001 map to "JbRp"; an input that needs two bytes of padding ends in two padding characters, as in "QQ==".
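The three steps can be sketched in a few lines. This is an illustrative implementation, not a replacement for the standard library; the function name `b64_encode` is hypothetical, and the output is compared with Python's `base64.b64encode`.

```python
import base64
import string

# The 64-character alphabet: A-Z, a-z, 0-9, "+" and "/".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def b64_encode(data: bytes) -> str:
    """Encode bytes with Base64, following the three steps on the slide."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)
        # Step 3: fill an incomplete group with zero bytes.
        n = int.from_bytes(chunk + b"\x00" * pad, "big")
        # Steps 1 and 2: four six-bit segments, each mapped to a character.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if pad:
            # Characters that come only from padding bytes are shown as "=".
            chars[-pad:] = "=" * pad
        out.extend(chars)
    return "".join(out)

print(b64_encode(bytes.fromhex("25B469")))  # JbRp
print(b64_encode(b"A"))                     # QQ==
assert b64_encode(b"A") == base64.b64encode(b"A").decode("ascii")
```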

In-Class Exercise What is the result of applying Base64 to the three characters with hex codes BEAE, CED3 and B7F5 (a Japanese name, 小林剑)? (Please first convert the hex to binary.)

UTF-7
UTF-7 uses the same character set as Base64. UTF-7 differs from Base64 in that:
The "padding" character is not necessary.
The Base64-like transformation is applied only to specific characters.
Characters that require the Base64 transformation according to UTF-7 begin with a "plus" character (+, 0x2B) and end with a "hyphen" character (-, 0x2D).
Example:
Character string: M, y, 河豚
UCS-2 encoding: 004D, 0079, 6CB3 8C5A
UTF-7 encoding: M (4D), y (79), +bLOMWg- (ASCII codes)
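The shifted part of this example can be reproduced by Base64-encoding the 16-bit codes and dropping the padding. This is a simplified sketch: the helper `utf7_shift` is hypothetical and handles only one run of non-ASCII characters, whereas full UTF-7 also has rules for which ASCII characters may appear directly and for the "+" character itself. The result is checked with Python's built-in `utf-7` decoder.

```python
import base64

def utf7_shift(text: str) -> str:
    """Encode one run of characters as a UTF-7 shifted sequence."""
    raw = text.encode("utf-16-be")  # the 16-bit codes, big-endian
    b64 = base64.b64encode(raw).decode("ascii").rstrip("=")  # no padding
    return "+" + b64 + "-"

encoded = "My" + utf7_shift("河豚")
print(encoded)  # My+bLOMWg-

# The built-in decoder recovers the original string.
print(encoded.encode("ascii").decode("utf-7"))  # My河豚
```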

UTF-8 UTF-8 encoding was developed as a way to represent Unicode text as a stream of one or more eight-bit units, rather than as true 16-bit units. It converts UCS-2 into a mixed one- through three-byte encoding. It converts UCS-4 into a mixed one- through six-byte encoding. It converts UTF-16 into a mixed one- through four-byte encoding. It is therefore an eight-bit and variable-length encoding. UTF-8 is the de facto standard encoding for interchange of Unicode text.

UTF-8 Encoding Templates
For all but the ASCII-compatible range, the number of first-byte high-order bits set to "1" indicates the byte length. Fill the templates starting from the rightmost bits.
UCS-2 range: UTF-8 bit arrays
0000-007F: 0xxxxxxx
0080-07FF: 110xxxxx 10xxxxxx
0800-FFFF: 1110xxxx 10xxxxxx 10xxxxxx
Unicode range (+): UTF-8 bit arrays
0001 0000 to 001F FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
UCS-4 range (+): UTF-8 bit arrays
0020 0000 to 03FF FFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 to 7FFF FFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

UTF-8 Example: Convert the Unicode character "茶" into UTF-8 code.
Step 1: The Unicode value of 茶 (tea) is 8336, which falls in the range 0800-FFFF, so we need 3 bytes.
Step 2: The binary form of hexadecimal 8336 is 1000 0011 0011 0110.
Step 3: Fill the empty slots of the three-byte template 1110xxxx 10xxxxxx 10xxxxxx with the binary value, starting from the rightmost bits, and get 11101000 10001100 10110110.
Step 4: The UTF-8 code value is thus E8 8C B6.
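The template-filling procedure can be sketched as a function. The name `utf8_encode` is hypothetical, and the sketch covers only the one- to four-byte templates (up to U+10FFFF); the five- and six-byte UCS-4 templates are left out. The result for 茶 is compared with Python's built-in encoder.

```python
def utf8_encode(scalar: int) -> bytes:
    """Encode one code point by filling the UTF-8 bit templates."""
    if scalar <= 0x7F:
        # 0xxxxxxx
        return bytes([scalar])
    if scalar <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (scalar >> 6), 0x80 | (scalar & 0x3F)])
    if scalar <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (scalar >> 12),
            0x80 | ((scalar >> 6) & 0x3F),
            0x80 | (scalar & 0x3F),
        ])
    if scalar <= 0x10FFFF:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xF0 | (scalar >> 18),
            0x80 | ((scalar >> 12) & 0x3F),
            0x80 | ((scalar >> 6) & 0x3F),
            0x80 | (scalar & 0x3F),
        ])
    raise ValueError("code point beyond U+10FFFF is not handled in this sketch")

tea = utf8_encode(0x8336)
print(tea.hex(" ").upper())  # E8 8C B6
assert tea == "茶".encode("utf-8")
```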

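Wrap-Up
中文字符集 Chinese Character Set
中文编码集 Chinese Code Set
基本编码方式 Basic Encoding Method: ASCII
中文编码方式 Chinese Encoding Method: ISO-2022, EUC, GBK, Big5
国际编码方式 International Encoding Method: Unicode, ISO, UCS-2, UCS-4, UTF-16, Base64, UTF-7, UTF-8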
