Globalisation & Computer Systems, Week 5/6: Character representation, ASCII and code pages, UNICODE

1 Globalisation & Computer Systems, Week 5/6: Character representation, ASCII and code pages, UNICODE

2 Representation. Bits and bytes; characters; code points; glyphs; fonts; standardization.

3 Representation. What is a bit? ‘A binary digit’, i.e. either 0 or 1. What is a byte? ‘The fixed number of bits that can be treated as a unit by the computer hardware’. A byte can be used to express a character such as “A”.

4 Representation. ASCII: American Standard Code for Information Interchange, a standard character encoding system. ASCII codes were originally 7 bits wide. Given this, how many bit patterns are there? Each pattern maps onto a decimal code point, and that code point maps onto a character.
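The pattern-to-code-point-to-character mapping on this slide can be tried out with a minimal Python sketch. The helper name `bit_pattern` is hypothetical and only for illustration; `ord` and `chr` are Python built-ins.

```python
# A 7-bit code has 2**7 distinct bit patterns.
ASCII_BITS = 7
pattern_count = 2 ** ASCII_BITS


def bit_pattern(char: str) -> str:
    """Return the 7-bit pattern of a single ASCII character (hypothetical helper)."""
    code_point = ord(char)  # character -> decimal code point
    if code_point >= pattern_count:
        raise ValueError(f"{char!r} does not fit in {ASCII_BITS} bits")
    return format(code_point, "07b")  # code point -> bit pattern


print(pattern_count)     # 128
print(ord("A"))          # 65
print(bit_pattern("A"))  # 1000001
print(chr(0b1000001))    # A
```

Reading the chain backwards (bit pattern, then code point, then character) is the last line: `chr` turns the number 65 back into “A”.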

5 Representation. ASCII: the problem with 7-bit codes… What about French, la tête? What about Greek, κεφαλη? Solution: extend ASCII to 8-bit bytes, standardized by ISO (International Organization for Standardization). Now there are 256 bit patterns.
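A small Python sketch of the problem, assuming Python's standard codecs `ascii`, `latin-1` and `iso-8859-7` as stand-ins for the 7-bit code and two of the 8-bit ISO extensions; the helper `fits` is hypothetical.

```python
french = "la t\u00eate"                         # la tête
greek = "\u03ba\u03b5\u03c6\u03b1\u03bb\u03b7"  # κεφαλη


def fits(text: str, encoding: str) -> bool:
    """Return True if every character of text has a code in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(fits(french, "ascii"))       # False: ê has no 7-bit code
print(fits(french, "latin-1"))     # True: one 8-bit code per character
print(fits(greek, "latin-1"))      # False: a Western European table has no Greek
print(fits(greek, "iso-8859-7"))   # True: the Greek 8-bit table
```

Each 8-bit table solves the problem for one group of languages only, which is what leads to code pages.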

6 Representation. Extended ASCII: with 8-bit bytes you get 256 bit patterns. For consistency, the first 128 code points remain the same as in 7-bit ASCII. The next 128 are used for a range of languages. For each language, you need an interpretation of these 128 code points. The encoding is handled by a code page.

7 Representation. What about Chinese? Thousands of characters: 256 bit patterns are clearly not enough. Make the units bigger… 16-bit units give 65,536 bit patterns: UNICODE.

9 Representation. Glyphs: the pictures used to represent a given character; many to one. The character “A” -> A A A A A A A A A (the same letter drawn in different ways). Fonts: the collection, or ‘picture gallery’, of glyphs.

10 Representation. Extended ASCII: with 8-bit bytes you get 256 bit patterns. For consistency, the first 128 code points remain the same as in 7-bit ASCII. The next 128 are used for a range of languages. For each language, you need an interpretation of these 128 code points. The encoding is handled by a code page.

11 Representation. Extended ASCII: for code point 154, CP_EASTEUROPE (code page 1250) gives š and CP_RUSSIAN (code page 1251) gives љ. What about code point 65 for these two code pages? Now represent your names, with your own orthographies in mind, using the code pages.
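The code point 154 example can be reproduced with a short Python sketch, assuming Python's `cp1250` and `cp1251` codecs correspond to the two code pages named on the slide; `decode_byte` is a hypothetical helper.

```python
def decode_byte(value: int, code_page: str) -> str:
    """Interpret one 8-bit code point under the given code page."""
    return bytes([value]).decode(code_page)


for code_page in ("cp1250", "cp1251"):
    # Code point 154 is in the upper half, so it depends on the code page;
    # code point 65 is in the shared lower half.
    print(code_page, decode_byte(154, code_page), decode_byte(65, code_page))
```

Code point 65 comes out as “A” in both cases, because the first 128 code points are common to every code page.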

12 Representation. Code pages in VB:
    ' Character sets accepted by StdFont.Charset
    Public Enum ValidCharsets
        ANSI_CHARSET = 0
        GREEK_CHARSET = 161
        THAI_CHARSET = 222
    End Enum
    Private Sub Form_Load()
        Dim X As New StdFont
        X.Charset = GREEK_CHARSET   ' 161: show codes 128-255 as Greek
        X.Bold = True
        X.Size = 8
        X.Name = "Times New Roman"
        ' Apply the font to the form and to both controls
        Set frmTest.Font = X
        Set frmTest.Label1.Font = X
        Set frmTest.Text1.Font = X
        ' The same three code points, shown in a label and a text box
        frmTest.Label1.Caption = Chr(181) & Chr(225) & Chr(226)
        frmTest.Text1.Text = Chr(181) & Chr(225) & Chr(226)
    End Sub
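To see which characters those three code points stand for without running the form, here is a rough Python equivalent. It assumes that GREEK_CHARSET (161) selects Windows code page 1253 and that ANSI_CHARSET behaves like code page 1252 on a Western system; the helper `show` is hypothetical.

```python
CODES = [181, 225, 226]  # the arguments passed to Chr() on the slide


def show(codes, code_page: str) -> str:
    """Decode a list of 8-bit code points under one code page."""
    return bytes(codes).decode(code_page)


greek_text = show(CODES, "cp1253")    # assumed match for GREEK_CHARSET
western_text = show(CODES, "cp1252")  # assumed match for ANSI_CHARSET
print(greek_text)
print(western_text)
```

Only the charset changes between the two calls; the stored code points are identical.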

14 Representation and UNICODE. What about Chinese? Thousands of characters: 256 bit patterns are clearly not enough. Make the units bigger… 16-bit units give 65,536 bit patterns: UNICODE.
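A brief Python sketch of the jump from 8 to 16 bits, using the ideograph 中 (U+4E2D) as an arbitrary example and Python's `utf-16-be` codec as the 16-bit form:

```python
patterns_8_bit = 2 ** 8
patterns_16_bit = 2 ** 16
print(patterns_8_bit, patterns_16_bit)  # 256 65536

han = "\u4e2d"                       # 中, chosen only as an example
encoded = han.encode("utf-16-be")    # one 16-bit unit, big-endian
print(encoded.hex(), len(encoded))   # 4e2d 2
```

The character occupies two bytes, and those two bytes are simply its code point, 4E2D.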

15 UNICODE – design principles. Reference: The Unicode Standard, Version 3.0 (2000). Online: http://www.unicode.org/unicode/uni2book/

16 UNICODE – design principles. Principle 1: 16-bit character codes. With code pages, characters share 8-bit code points, and which character is meant is determined by interpretation. In UNICODE each character is assigned a unique code point. 65,536 code values are available (byte 1: 256 values × byte 2: 256 values). Of these, 63,486 are for character representation; a further 2,048 are reserved as surrogates, which combine in pairs into extended 32-bit codes. This gives 1,048,544 additional code values to cover all languages.
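The arithmetic behind these figures can be checked with a Python sketch. It assumes the surrogate ranges U+D800 to U+DBFF (high) and U+DC00 to U+DFFF (low), and that the published totals exclude the noncharacters ending in FFFE and FFFF (2 in the 16-bit range, 32 in the extended range); `to_surrogate_pair` is a hypothetical helper.

```python
HIGH = range(0xD800, 0xDC00)  # 1,024 high surrogates
LOW = range(0xDC00, 0xE000)   # 1,024 low surrogates

code_values = 2 ** 16                          # 65,536
surrogates = len(HIGH) + len(LOW)              # 2,048
single_unit = code_values - surrogates - 2     # 63,486 (assumes 2 noncharacters)
extended = len(HIGH) * len(LOW) - 32           # 1,048,544 (assumes 32 noncharacters)
print(surrogates, single_unit, extended)


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a high and a low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = code_point - 0x10000
    return HIGH.start + (offset >> 10), LOW.start + (offset & 0x3FF)


high, low = to_surrogate_pair(0x1D11E)   # U+1D11E, an arbitrary example
print(hex(high), hex(low))               # 0xd834 0xdd1e
```

Two 16-bit surrogates together occupy 32 bits, which is where the “extended 32-bit codes” come from.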

17 UNICODE – design principles. Principle 2: allocation of code space. General scripts area: alphabetic scripts. CJK ideographs: 27,484 ideographs. Hangul syllables: 11,172 Korean Hangul syllables. The first 128 code points are for Latin (as in ASCII). Punctuation symbols are grouped together.
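Two of these allocations can be verified with a Python sketch, assuming the Hangul syllables occupy the contiguous range U+AC00 to U+D7A3 and using Python's `unicodedata` module for character names:

```python
import unicodedata

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3   # assumed block boundaries
hangul_count = HANGUL_LAST - HANGUL_FIRST + 1
print(hangul_count)                          # 11172
print(19 * 21 * 28)                          # 11172: initials x vowels x finals
print(unicodedata.name(chr(HANGUL_FIRST)))   # HANGUL SYLLABLE GA

# The first 128 code points coincide with ASCII.
first_128_match_ascii = all(
    chr(cp).encode("ascii") == bytes([cp]) for cp in range(128)
)
print(first_128_match_ascii)                 # True
```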

18 UNICODE – design principles. Principle 3: efficiency. All characters have equal status, i.e. no escape characters. Characters of a common script are grouped together as far as possible. Common punctuation is shared.

19 Design principles. Principle 4: logical and display order. Logical order is how the code is ordered in memory; it follows the time sequence of input, and ‘logically’ that is left to right (L-R). Dynamically composed characters: the base character is ordered ‘before’, i.e. to the left of, the modifying character.

20 Design principles. Principle 5: plain text and rich text. Unicode encodes unformatted plain text, where the aim of rendering is legibility only. Formatting is extra data, which gives rich text. How to preserve the plain text requirements? Have layers of plain text representing the characters and how they are formatted: use mark-up languages (content + tags).

21 Design principles. Principle 6: unification. Share characters where you can: mixed writing systems; ideographs common to CJK; punctuation.

22 Character semantics. Character name; representative glyph; properties.

23 Property 1: Case. A letter in the alphabet has several variants: an UPPERCASE variant and a lowercase variant. Five scripts have case: Latin, Greek, Cyrillic, Armenian and archaic Georgian.
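A quick Python sketch of the case property for four of these scripts, using Python's built-in `str.upper` and `str.lower` (which follow the Unicode case mappings); the sample letters are arbitrary choices.

```python
lowercase_samples = {
    "Latin": "a",
    "Greek": "\u03b1",     # α
    "Cyrillic": "\u0436",  # ж
    "Armenian": "\u0561",  # ա
}

uppercase = {script: ch.upper() for script, ch in lowercase_samples.items()}
for script, ch in lowercase_samples.items():
    # Each lowercase letter has an uppercase partner that maps back to it.
    print(script, ch, uppercase[script], uppercase[script].lower() == ch)

# A script without case is left unchanged by case mapping.
han = "\u4e2d"
print(han.upper() == han)  # True
```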

24 Property 2: Decomposition. A character which is equivalent to one or more other characters. Š = S + ˇ, i.e. U+0160 (Latin Extended-A) = U+0053 (Basic Latin) + U+030C (Combining Diacritical Marks).
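The Š example can be checked with Python's `unicodedata` module; this is a minimal sketch that uses canonical normalization (NFD and NFC) to apply and undo the decomposition.

```python
import unicodedata

s_caron = "\u0160"  # Š as a single precomposed character

mapping = unicodedata.decomposition(s_caron)
print(mapping)  # 0053 030C

decomposed = unicodedata.normalize("NFD", s_caron)   # split into base + mark
print([f"U+{ord(ch):04X}" for ch in decomposed])     # ['U+0053', 'U+030C']

recomposed = unicodedata.normalize("NFC", decomposed)  # join again
print(recomposed == s_caron)                           # True
```

The base character S comes first in the decomposed sequence, followed by the combining caron.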

25 Property 3: Combining class. Base character: no special graphical combining behaviour when following another character. Combining character: some characters change shape or position when combining with other characters. Non-spacing combining character: does not take up space, e.g. diacritics. Spacing combining character: takes up space as though it were a base character.

26 Property 3: Combining class. Sequence is a convention: base character + combining character. Symbol: a dotted circle represents the space of the base character, and the combining character is positioned relative to the circle. Stacking of diacritics follows the convention: move from the base character outwards.
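Property 3 can be inspected with Python's `unicodedata` module. In this sketch the sample characters are arbitrary choices: U+030C (combining caron) as a non-spacing mark and U+093E (a Devanagari vowel sign) as a spacing mark.

```python
import unicodedata

base = "S"
non_spacing = "\u030c"  # combining caron
spacing = "\u093e"      # Devanagari vowel sign AA

# Canonical combining class: 0 for base characters, non-zero for most marks.
base_class = unicodedata.combining(base)
mark_class = unicodedata.combining(non_spacing)
print(base_class, mark_class)  # 0 230

# General category separates non-spacing (Mn) from spacing (Mc) marks.
print(unicodedata.category(non_spacing), unicodedata.category(spacing))  # Mn Mc

# Convention: the base character comes first, then the combining character.
sequence = base + non_spacing
print(len(sequence))  # 2 code points, displayed as one letter
```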

27 Property 4: Directionality. Two directionality types: left to right, and right to left (Arabic, Hebrew, Syriac, Thaana). Logical sequence: left to right.
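A Python sketch of the directionality property, using `unicodedata.bidirectional`. The sample letters are arbitrary, and the class names `L`, `R` and `AL` (left to right, right to left, and right-to-left Arabic-style) are the codes Python reports from the Unicode database.

```python
import unicodedata

samples = {
    "Latin": "A",
    "Hebrew": "\u05d0",
    "Arabic": "\u0627",
    "Syriac": "\u0710",
    "Thaana": "\u0780",
}
direction = {script: unicodedata.bidirectional(ch) for script, ch in samples.items()}
print(direction)

# Logical order: a right-to-left word is still stored in input order,
# so index 0 holds the letter that is typed and read first.
shalom = "\u05e9\u05dc\u05d5\u05dd"
first_letter = shalom[0]
print(f"U+{ord(first_letter):04X}")  # U+05E9
```

Reversing the letters for display is the renderer's job; the stored sequence stays in logical order.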

28 Property 5: General category. The full character space is partitioned into several major categories: letters, punctuation, symbols, numbers. Examples of general category codes: Lu (letter, uppercase); Ll (letter, lowercase); Nd (number, decimal digit); No (number, other).
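The category codes can be looked up with Python's `unicodedata.category`; the sample characters in this sketch are arbitrary.

```python
import unicodedata

samples = ["A", "a", "5", "\u00bd", ",", "+"]  # the fourth is the fraction one half
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, code in categories.items():
    # The first letter is the major category, the second the subcategory.
    print(repr(ch), code)
```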

29 Property 6: Numeric value. For characters that represent numbers: decimal digits; fractions; subscripts and superscripts; currency numerators; a portion of the CJK ideographs, e.g. U+4E94 (五, five).
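A Python sketch of the numeric value property using `unicodedata.numeric`, with one arbitrary sample per kind of character plus the ideograph U+4E94 from the slide:

```python
import unicodedata

samples = {
    "decimal digit": "5",
    "fraction": "\u00bd",       # one half
    "superscript": "\u00b2",    # superscript two
    "CJK ideograph": "\u4e94",  # U+4E94
}
numeric_values = {kind: unicodedata.numeric(ch) for kind, ch in samples.items()}
print(numeric_values)

# Characters with no numeric value return the supplied default instead.
print(unicodedata.numeric("A", None))  # None
```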

30 Property 7: Mirrored property. For characters that have equivalent mirror-image characters, e.g. ‘(’. Important for directionality.
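The mirrored property is exposed by Python's `unicodedata.mirrored`, which returns 1 for characters that should be displayed as their mirror image in right-to-left text and 0 otherwise. A minimal sketch with arbitrary sample characters:

```python
import unicodedata

samples = ["(", ")", "[", "A"]
mirrored = {ch: unicodedata.mirrored(ch) for ch in samples}
print(mirrored)  # brackets are mirrored, the letter is not
```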

31 Character properties Summary 1. Case 2. Decomposition 3. Combining class 4. Directionality 5. General category 6. Numeric value 7. Mirrored property

