Globalisation & Computer Systems, Week 5/6: Character representation, ASCII and code pages, UNICODE.



Representation: bits and bytes, characters, code points, glyphs, fonts, standardization.

Representation. What is a bit? ‘A binary digit’, i.e. either 0 or 1. What is a byte? ‘The fixed number of bits that can be treated as a unit by the computer hardware’. A byte can be used to express a character such as “A”.

Representation: ASCII (American Standard Code for Information Interchange). A standard character encoding system. The codes were originally 7 bits wide. Given this, how many bit patterns are there? Each pattern maps onto a decimal code point, and that code point maps onto a character.
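
The counting question and the pattern-to-character mapping on this slide can be checked with a few lines of Python. This is only an illustrative sketch (the deck itself uses VB); the variable names are made up for the example.

```python
# A 7-bit code has 2**7 distinct bit patterns.
ascii_patterns = 2 ** 7

# Bit pattern -> decimal code point -> character, using the letter "A".
bit_pattern = "1000001"
code_point = int(bit_pattern, 2)   # the pattern read as a binary number
character = chr(code_point)        # the character assigned to that code point

print(ascii_patterns, code_point, character)
```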

Representation: ASCII and the problem with 7-bit codes. What about French (la tête)? What about Greek (κεφαλή)? Extend ASCII to 8-bit bytes, standardized by ISO (International Organization for Standardization). Now there are 256 bit patterns.
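
A short Python sketch of the problem: the French and Greek words cannot be written in 7-bit ASCII, but each fits in an 8-bit ISO character set. The choice of ISO 8859-1 and ISO 8859-7 is an assumption made for the example (the slide only says "ISO"), and `fits` is a hypothetical helper.

```python
french = "la tête"
greek = "κεφαλή"

def fits(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

french_in_ascii = fits(french, "ascii")        # "ê" has no 7-bit code
french_in_latin1 = fits(french, "iso-8859-1")  # 8-bit Western European set
greek_in_latin1 = fits(greek, "iso-8859-1")    # no Greek letters in this set
greek_in_greek = fits(greek, "iso-8859-7")     # 8-bit Latin/Greek set

print(french_in_ascii, french_in_latin1, greek_in_latin1, greek_in_greek)
```

Note that no single 8-bit set covers both words, which is the motivation for the code pages that follow.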

Representation: Extended ASCII. With 8-bit bytes you get 256 bit patterns. For consistency, the first 128 code points remain the same as in the 7-bit code. The next 128 are used for a range of languages. For each language, you need an interpretation of these 128 code points. The encoding is handled by a code page.

Representation. What about Chinese? It has thousands of characters, so 256 bit patterns are clearly not enough. Make the codes bigger: two bytes (16 bits) give 65,536 bit patterns. This is the idea behind UNICODE.

Representation. Glyphs: the pictures used to represent a given character. The relation is many to one: the character “A” can be drawn as many different glyphs (the same letter in different typefaces and styles). Fonts: the collection, or ‘picture gallery’, of glyphs.


Representation: Extended ASCII. For code point 154: CP_EASTEUROPE (code page 1250) gives š; CP_RUSSIAN (code page 1251) gives љ. What about code point 65 for these two code pages? Now represent your names with your own orthographies in mind, using the code pages.
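
The two readings of code point 154, and the question about code point 65, can be tried out directly. The sketch below uses Python's built-in `cp1250` and `cp1251` codecs; the variable names are illustrative only.

```python
# The same byte value, interpreted through two different code pages.
byte_154 = bytes([154])
east_europe = byte_154.decode("cp1250")   # Central/Eastern European code page
russian = byte_154.decode("cp1251")       # Cyrillic code page

# Code point 65 lies in the shared lower half (0-127).
byte_65 = bytes([65])
readings_of_65 = {byte_65.decode("cp1250"), byte_65.decode("cp1251")}

print(east_europe, russian, readings_of_65)
```

Only the upper 128 values depend on the code page; values below 128 decode to the same character in both.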

Representation: code pages in VB.

    Public Enum ValidCharsets
        ANSI_CHARSET = 0
        GREEK_CHARSET = 161
        THAI_CHARSET = 222
    End Enum

    Private Sub Form_Load()
        ' Build a font that selects the Greek character set
        Dim X As New StdFont
        X.Charset = GREEK_CHARSET
        X.Bold = True
        X.Size = 8
        X.Name = "Times New Roman"

        ' Apply the font to the form and its controls
        Set frmTest.Font = X
        Set frmTest.Label1.Font = X
        Set frmTest.Text1.Font = X

        ' Codes 181, 225 and 226 are now interpreted as Greek-charset characters
        frmTest.Label1.Caption = Chr(181) & Chr(225) & Chr(226)
        frmTest.Text1.Text = Chr(181) & Chr(225) & Chr(226)
    End Sub
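
For comparison, a rough Python counterpart of the same idea, decoding the three byte values with a Greek code page instead of setting a font charset. It assumes that `GREEK_CHARSET` (161) corresponds to Windows code page 1253; that mapping is not stated on the slide.

```python
# The three codes passed to Chr() in the VB example.
codes = bytes([181, 225, 226])

# Assumption: the Greek charset corresponds to Windows code page 1253.
as_greek = codes.decode("cp1253")

# The same bytes read with the Western European code page give other characters.
as_western = codes.decode("cp1252")

print(as_greek, as_western)
```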

Representation and UNICODE. What about Chinese? It has thousands of characters, so 256 bit patterns are clearly not enough. Make the codes bigger: two bytes (16 bits) give 65,536 bit patterns. This is the idea behind UNICODE.

UNICODE – design principles Reference: The Unicode Standard, Version Online:

UNICODE – design principles. Principle 1: 16-bit codes. With code pages, characters share 8-bit code points, and which character is meant is determined by interpretation. In UNICODE each character is assigned a unique code point. 65,536 code values are available (byte 1: 256 values × byte 2: 256 values): 63,488 for character representation, with the remaining 2,048 reserved for extended 32-bit codes. The reserved values, used in pairs, give 1,048,544 further code values to cover all languages.
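
The arithmetic behind these figures can be reproduced in a few lines of Python. The sketch assumes that the 2,048 reserved values are the UTF-16 surrogates (1,024 high and 1,024 low) and that the slide's figure of 1,048,544 leaves out 32 noncharacter code points; neither assumption is spelled out on the slide.

```python
total_16_bit = 256 * 256                # two bytes of 256 values each
reserved = 2048                         # values set aside for pairing
single_codes = total_16_bit - reserved  # values that stand for a character alone

# Assumption: the reserved values split into 1,024 "high" and 1,024 "low" halves.
pair_combinations = 1024 * 1024

# Assumption: the slide's total excludes 2 noncharacters in each of 16 planes.
usable_pairs = pair_combinations - 16 * 2

# A code point beyond the 16-bit range is stored as one such pair (4 bytes).
encoded = chr(0x10000).encode("utf-16-be")

print(total_16_bit, single_codes, usable_pairs, encoded.hex())
```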

UNICODE – design principles. Principle 2: allocation of code space. General scripts area: alphabetic scripts. CJK ideographs area: ideographs. Hangul syllables area: Korean Hangul syllables. The first 128 code points are for Latin. Punctuation symbols are grouped together.
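
One way to see the allocation is to look up a few characters with Python's `unicodedata` module. The sample characters are chosen for illustration, and the block boundaries in the comments (U+4E00 to U+9FFF, U+AC00 to U+D7A3) come from the current Unicode code charts rather than from the slide.

```python
import unicodedata

samples = ["A", "中", "한", "!"]

# Map each sample to its code point and its official character name.
allocation = {ch: (ord(ch), unicodedata.name(ch)) for ch in samples}

latin_in_first_128 = ord("A") < 128
cjk_in_block = 0x4E00 <= ord("中") <= 0x9FFF      # CJK Unified Ideographs
hangul_in_block = 0xAC00 <= ord("한") <= 0xD7A3   # Hangul Syllables

for ch, (code, name) in allocation.items():
    print(f"U+{code:04X} {name}")
```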

UNICODE – design principles. Principle 3: efficiency. All characters have equal status, i.e. there are no escape characters. Characters of a common script are grouped together as far as possible. Common punctuation is shared.

Design principles. Principle 4: logical and display order. Logical order is how the codes are ordered in memory: it follows the time sequence of input, and ‘logically’ that is left to right. Dynamically composed characters: the base character is ordered ‘before’, i.e. to the left of, the modifying character.
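
A small Python sketch of logical order, using a Hebrew word and a composed Latin letter as illustrative examples (neither appears on the slide). In memory the first character is the first one typed, regardless of where it is drawn on screen.

```python
import unicodedata

# The Hebrew word "shalom", typed shin, lamed, vav, final mem.
# It is displayed right to left but stored in typing order.
word = "\u05e9\u05dc\u05d5\u05dd"
first_stored = unicodedata.name(word[0])

# Dynamically composed character: base letter first, then the modifier.
composed = "S" + "\u030c"
stored_order = [unicodedata.name(ch) for ch in composed]

print(first_stored, stored_order)
```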

Design principles. Principle 5: plain text and rich text. Unicode encodes unformatted plain text, where the aim of rendering is legibility only. Formatting is extra data, which gives rich text. How can the plain text requirements be preserved? Have layers of plain text representing the characters and how they are formatted. Use mark-up languages: content + tags.

Design principles. Principle 6: unification. Share characters where you can: mixed writing systems, ideographs common to CJK, punctuation.

Character semantics: character name, representative glyph, properties.

Property 1: Case. A letter in the alphabet has several variants: an UPPERCASE variant and a lowercase variant. Five scripts have case: Latin, Greek, Cyrillic, Armenian and archaic Georgian.
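
The case property can be observed through Python's `str.upper` and `str.lower`. The letters below are illustrative picks from four of the cased scripts, with a Hebrew letter added as an example of a script without case.

```python
# One uppercase/lowercase pair from each of four cased scripts.
case_pairs = {
    "Latin": ("A", "a"),
    "Greek": ("\u0394", "\u03b4"),      # Δ δ
    "Cyrillic": ("\u0416", "\u0436"),   # Ж ж
    "Armenian": ("\u0531", "\u0561"),   # Ա ա
}

# Each pair converts in both directions.
round_trips = all(
    upper.lower() == lower and lower.upper() == upper
    for upper, lower in case_pairs.values()
)

# Hebrew has no case, so conversion leaves the letter alone.
alef = "\u05d0"
alef_unchanged = alef.upper() == alef and alef.lower() == alef

print(round_trips, alef_unchanged)
```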

Property 2: Decomposition. A character which is equivalent to a sequence of one or more other characters: Š = S + ˇ, i.e. U+0160 (Latin Extended-A) = U+0053 (Basic Latin) + U+030C (Combining Diacritical Marks).
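
The decomposition of Š can be inspected with Python's `unicodedata` module. This is a sketch for illustration; the variable names are not from the slides.

```python
import unicodedata

s_caron = "\u0160"   # Š as a single precomposed character

# The decomposition property lists the equivalent code points in hex.
mapping = unicodedata.decomposition(s_caron)

# Normalization applies the mapping (NFD) or reverses it (NFC).
decomposed = unicodedata.normalize("NFD", s_caron)
recomposed = unicodedata.normalize("NFC", decomposed)

print(mapping, len(decomposed), recomposed == s_caron)
```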

Property 3: Combining class. Base character: no special graphical combining behaviour when following another character. Combining character: some characters change shape or position when combining with other characters. Non-spacing combining character: does not take up space, e.g. diacritics. Spacing combining character: takes up space as though it were a base character.

Property 3: Combining class. The sequence is a convention: base character + combining character. Symbol: a dotted circle represents the space of the base character, and the combining character is positioned relative to the circle. Stacking of diacritics follows the convention: move from the base character outwards.
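
A short Python sketch of the combining-class property. The characters are illustrative choices: a Latin base letter, two non-spacing diacritics, and a Devanagari vowel sign as an example of a spacing combining character.

```python
import unicodedata

base = "a"
diaeresis = "\u0308"     # non-spacing mark drawn above the base
macron = "\u0304"        # non-spacing mark drawn above the base
vowel_sign = "\u093e"    # Devanagari vowel sign AA, a spacing mark

# Base characters have combining class 0; marks placed above have class 230.
classes = [unicodedata.combining(ch) for ch in (base, diaeresis, macron)]

# General category separates non-spacing (Mn) from spacing (Mc) marks.
mark_kinds = (unicodedata.category(diaeresis), unicodedata.category(vowel_sign))

# Stacking order: the mark stored first sits closest to the base.
diaeresis_inside = base + diaeresis + macron
macron_inside = base + macron + diaeresis

print(classes, mark_kinds, diaeresis_inside != macron_inside)
```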

Property 4: Directionality. Two directionality types: left to right, and right to left (Arabic, Hebrew, Syriac, Thaana). Logical sequence: left to right.
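
Directionality can be looked up per character with `unicodedata.bidirectional`. The sample letters are illustrative. Python reports the finer-grained classes of the Unicode character database, so right-to-left letters appear as either `R` or `AL` (Arabic letter).

```python
import unicodedata

direction = {
    "Latin A": unicodedata.bidirectional("A"),
    "Hebrew alef": unicodedata.bidirectional("\u05d0"),
    "Arabic alef": unicodedata.bidirectional("\u0627"),
}

# Both R and AL mean the character is laid out right to left.
right_to_left = [name for name, code in direction.items() if code in ("R", "AL")]

print(direction, right_to_left)
```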

Property 5: General category. The full character space is partitioned into several major categories: letters, punctuation, symbols, numbers. Examples of general category codes: Lu (letter, uppercase); Ll (letter, lowercase); Nd (number, decimal digit); No (number, other).
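
The category codes listed on the slide are what `unicodedata.category` returns. A minimal sketch with illustrative sample characters:

```python
import unicodedata

samples = ["A", "a", "5", "\u00bd", "!", "+"]   # "\u00bd" is the fraction ½

# Two-letter general category code for each sample character.
categories = {ch: unicodedata.category(ch) for ch in samples}

print(categories)
```

The first letter of each code gives the major category (L letter, N number, P punctuation, S symbol) and the second letter the subcategory.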

Property 6: Numeric value. For characters that represent numbers: decimal digits, fractions, subscripts and superscripts, currency numerators, and a portion of the CJK ideographs, e.g. U+4E94.
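
Numeric values can be read with `unicodedata.numeric`. The sketch below covers a digit, a fraction, a superscript and the ideograph U+4E94 mentioned on the slide; the other sample characters are illustrative choices.

```python
import unicodedata

numeric_values = {
    "digit 7": unicodedata.numeric("7"),
    "fraction one half": unicodedata.numeric("\u00bd"),
    "superscript two": unicodedata.numeric("\u00b2"),
    "ideograph U+4E94": unicodedata.numeric("\u4e94"),
}

# Characters with no numeric value return the supplied default instead.
letter_value = unicodedata.numeric("A", None)

print(numeric_values, letter_value)
```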

Property 7: Mirrored property. For characters that have equivalent mirror-image characters, e.g. ‘(’. Important for directionality.
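
The mirrored property is exposed in Python as `unicodedata.mirrored`, which returns 1 for characters that should be drawn mirrored in right-to-left text and 0 otherwise. The sample characters are illustrative.

```python
import unicodedata

samples = ["(", ")", "<", "A"]

# 1 if the character is mirrored in right-to-left text, else 0.
mirrored = {ch: unicodedata.mirrored(ch) for ch in samples}

print(mirrored)
```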

Character properties: summary. 1. Case; 2. Decomposition; 3. Combining class; 4. Directionality; 5. General category; 6. Numeric value; 7. Mirrored property.