Lecture 3 1 ISO/IEC 10646 and Unicode It is a coded character set(codeset) –Designed for text processing and exchange Features: –Universal: characters.

Lecture 3 1 ISO/IEC 10646 and Unicode
It is a coded character set (codeset)
–Designed for text processing and exchange
Features:
–Universal: covers the characters in almost all national standards
–Framework: the coding architecture is fixed, and code points can be filled in later
–Uniform and efficient: fixed-width encoding, so there is no need to determine the length of each code (as there is when mixing ASCII with Big5 or GB)
–Unambiguous: any given 16-bit (or 32-bit) value always represents the same character

Lecture 3 2 UCS-4 (canonical form of ISO 10646)
–Fixed 32-bit (actually 31-bit) code assignment: 00 00 00 00 to 7F FF FF FF
–Four octets: Group No. (total: 128), Plane No. (total: 256), High Byte (total: 256), Low Byte (total: 256)
–Each plane: $2^{16}$ = 65,536 code points
–BMP (the Basic Multilingual Plane): both Group No. and Plane No. are 00 (the first two bytes are zero)
–Before ISO 10646 Part 2 came out (end of 2001), only the BMP contained characters
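The four-octet layout on this slide can be illustrated with a few bit shifts. The following is a minimal sketch, not part of the lecture; the function names `split_ucs4` and `in_bmp` are made up for illustration.

```python
def split_ucs4(code_point):
    """Split a UCS-4 value into its (group, plane, high byte, low byte) octets."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 values occupy 31 bits: 00000000 to 7FFFFFFF")
    group = (code_point >> 24) & 0x7F   # 128 groups
    plane = (code_point >> 16) & 0xFF   # 256 planes per group
    high = (code_point >> 8) & 0xFF     # 256 rows per plane
    low = code_point & 0xFF             # 256 cells per row
    return group, plane, high, low


def in_bmp(code_point):
    """A code point is in the BMP when group and plane are both 00."""
    group, plane, _, _ = split_ucs4(code_point)
    return group == 0 and plane == 0


print(split_ucs4(0x4E2D))   # (0, 0, 78, 45), i.e. row 4E, cell 2D of the BMP
print(in_bmp(0x4E2D))       # True
print(in_bmp(0x20000))      # False: Plane 2
```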

Lecture 3 3 Code Architecture of UCS-4
[Figure: 128 groups (Group 0 to Group 127), 256 planes per group; Plane 00 of Group 0 is the BMP]

Lecture 3 4 UCS-2: 2-byte representation of UCS-4
–Basic Multilingual Plane (BMP)
–Switching mechanism that uses a code range of the BMP to access another 16 planes (surrogate pairs)
BMP zones:
–A-zone: alphabets, symbols, CJK miscellaneous
–I-zone: CJK ideographs
–O-zone: Hangul
–S-zone: surrogates
–R-zone: private use, compatibility, Arabic presentation forms

Lecture 3 5 Unicode
–Unicode is the implementation of ISO 10646 with a 16-bit representation using UCS-2
–It also defines the actions associated with certain characters:
  control character behavior
  rendering behavior, such as combining characters
Examples
–The control character BEL (bell) should cause the system to make a sound
–Typing U+0061 (a) followed by U+0300 (combining grave accent) is rendered as the single symbol à
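The combining-character example can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch; it uses NFC normalization to show that the two-code sequence and the single precomposed code U+00E0 denote the same rendered symbol, which is a related mechanism rather than something the slide itself discusses.

```python
import unicodedata

decomposed = "\u0061\u0300"          # 'a' followed by COMBINING GRAVE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))      # 2 1
print(composed == "\u00e0")                # True
print(unicodedata.name("\u0300"))          # COMBINING GRAVE ACCENT
# A non-zero combining class marks a character that attaches to the
# preceding base character instead of occupying its own position.
print(unicodedata.combining("\u0300"))     # 230
```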

Lecture 3 6 Extension of ISO 10646
–Extension A (in the BMP) has 6,582 characters, published in ISO/IEC 10646-1 Second Edition (2000)
–Extension B: all characters in 康熙字典 (Kangxi Dictionary) and 漢語大字典 (Hanyu Da Zidian), plus other characters such as those in the HK Supplementary Character Set; published in ISO/IEC 10646-2 (2001); a total of 43,253 characters in Plane 2 of UCS-4
–How would Extension B be supported in UCS-2? => By using some encoding scheme

Lecture 3 7 Surrogate Pairs
Two UCS-2 codes, H followed by L, where
–H is in the range D800 to DBFF
–L is in the range DC00 to DFFF
For a given UCS-2 code (or code pair) U, the corresponding UCS-4 code-point value N (scalar value) is
–$N = U$ if U is a single, non-surrogate value
–$N = (H - \mathrm{D800}_{16}) \times 400_{16} + (L - \mathrm{DC00}_{16}) + 10000_{16}$ if U is a surrogate pair (H, L)
–Undefined for any other U in UCS-2
N is in the range 0 to $\mathrm{10FFFF}_{16}$
Exercise: $N = 10000_{16}$ => ?
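The surrogate formula and its inverse can be sketched in a few lines. The function names `from_surrogates` and `to_surrogates` are hypothetical helpers for illustration; the arithmetic follows the formula on the slide.

```python
def from_surrogates(high, low):
    """N = (H - D800) * 400 + (L - DC00) + 10000, all in hexadecimal."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid (H, L) surrogate pair")
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


def to_surrogates(n):
    """Inverse mapping: split a scalar value above the BMP into (H, L)."""
    if not 0x10000 <= n <= 0x10FFFF:
        raise ValueError("only values 10000 to 10FFFF need a surrogate pair")
    offset = n - 0x10000                 # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


print([f"{u:04X}" for u in to_surrogates(0x10000)])    # ['D800', 'DC00']
print([f"{u:04X}" for u in to_surrogates(0x20000)])    # ['D840', 'DC00']
print(f"{from_surrogates(0xDBFF, 0xDFFF):X}")          # 10FFFF
```

The second call shows where Plane 2, the home of Extension B, begins in UCS-2 terms.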

Lecture 3 8 UTF: UCS Transformation Format
–Allows code values in UCS that correspond to some other coding standard (e.g. ASCII) to be transmitted exactly as they would be in that standard, a property known as transparency, while other code values are represented through an escape mechanism
–Variable-length encoding to achieve greater efficiency

Lecture 3 9 UTF-8: 8-bit encoding for the 8-bit UNIX environment
–ASCII transparent
–The first byte indicates the number of bytes used to encode the character
–Shortest-encoding principle, so that encoding/decoding is invertible (bijective)
–Saves storage space for ASCII and non-ideographic characters
–Example: Unicode A324 0430 0023 8A43 => UTF-8:
–Example: UTF-8 24 38 58 CE 82 => UCS-4:
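The two exercises can be worked through in code. The sketch below hand-rolls an encoder to show how the first byte signals the sequence length, then uses Python's built-in codec for decoding; `utf8_encode` is a hypothetical helper, and it covers only the range 0 to 10FFFF used by present-day UTF-8.

```python
def utf8_encode(cp):
    """Encode one code point using the shortest possible UTF-8 sequence."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code values are not encoded in UTF-8")
    if cp < 0x80:            # 0xxxxxxx: one byte, identical to ASCII
        return bytes([cp])
    if cp < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


encoded = b"".join(utf8_encode(cp) for cp in (0xA324, 0x0430, 0x0023, 0x8A43))
print(encoded.hex().upper())     # EA8CA4D0B023E8A983

decoded = bytes.fromhex("24 38 58 CE 82").decode("utf-8")
print([f"{ord(ch):04X}" for ch in decoded])   # ['0024', '0038', '0058', '0382']
```

Read byte by byte, the first result is EA 8C A4 (three bytes), D0 B0 (two bytes), 23 (one byte, plain ASCII) and E8 A9 83 (three bytes).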

Lecture 3 10 Character vs. glyph
–Character: the smallest component of written language that has semantic value
–Glyph: a shape that a character can have when it is rendered or displayed
–Example: the letter A set in different typefaces is the same character and has the same code
–Concrete shapes can be very different and are still given one code point
–Coding of variants

Lecture 3 11 ISO 10646/Unicode Features for Chinese
–Han unification (Chinese, Japanese and Korean)
–Unification problems: different sources, non-cognate characters
–Three-dimensional conceptual model: semantics (x), abstract shape (y), actual shape (z)
Examples

Lecture 3 12 Unification Rules (認同規則)
–R1, Source Separation Rule: if two ideographs are distinct in a primary source standard, then they are not unified. Why?
–R2, Non-cognate (非同源) Rule: in general, if two ideographs are unrelated in historical derivation (non-cognate characters), then they are not unified.
–R3: the abstract shape of each ideograph is determined by means of a two-level classification. Any two ideographs that possess the same abstract shape are unified unless disallowed by R1 or R2.

Lecture 3 13 Example: Component structure analysis

Lecture 3 14 Sources of Unified Han Characters

Lecture 3 15 Wide character vs. multi-byte characters
Text information needs to be represented by the right data types.
–Multi-byte characters: data are processed on a per-byte basis: Big5, GB, EUC, even UTF-8
–Wide characters: fixed-width encoding, so no testing of the high bit is needed
Processing representation for wide characters:
–Big Endian vs. Little Endian
  Data type dependent
  System architecture dependent
–Distinction: the byte order mark U+FEFF is stored as the bytes FE FF in Big Endian and as FF FE in Little Endian, so a reader that sees the value 0xFFFE knows the byte order is swapped
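The byte-order distinction can be observed directly with Python's built-in UTF-16 codecs. This is a minimal sketch; `detect_utf16_byte_order` is a hypothetical helper that only inspects the first two bytes and does not validate the rest of the data.

```python
BOM = "\ufeff"

print(BOM.encode("utf-16-be").hex().upper())   # FEFF
print(BOM.encode("utf-16-le").hex().upper())   # FFFE


def detect_utf16_byte_order(data):
    """Guess the byte order of 16-bit text from a leading byte order mark."""
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return "unknown"   # no BOM: the order must be known by other means


sample = BOM.encode("utf-16-le") + "A".encode("utf-16-le")
print(detect_utf16_byte_order(sample))          # little
# Reading the same two bytes with the wrong order yields 0xFFFE, which is
# not a character, so the mismatch is detectable.
print(hex(int.from_bytes(sample[:2], "big")))   # 0xfffe
```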
