
Lecture 2, slide 1: Encoding Schemes
Encoding methods: a method of encoding at the binary level that ensures characters can be identified and allows a mixture of different character sets to be used
– Compatibility consideration: usually should be compatible with ASCII
– Save space and allow multiple codesets to be used on the same system
– Make clear whether the codeset is an internal code, an exchange code, or a processing code
High-bit on scheme
– The most significant bit of the first byte of a character is set to 1 to indicate the beginning of a Chinese character
– Examples: GB, Big5
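The high-bit on rule can be sketched as a small scanner. This is only an illustration: the function name `split_high_bit` is hypothetical, and the sketch assumes every high-bit lead byte starts exactly a two-byte character, which holds for GB and Big5 as described on this slide but not for every codeset.

```python
def split_high_bit(data: bytes) -> list:
    """Split a byte string into 1-byte ASCII units and 2-byte high-bit-on units."""
    units = []
    i = 0
    while i < len(data):
        if data[i] & 0x80:
            # Most significant bit is 1: this byte begins a 2-byte Chinese character.
            units.append(data[i:i + 2])
            i += 2
        else:
            # Most significant bit is 0: a plain ASCII byte.
            units.append(data[i:i + 1])
            i += 1
    return units


sample = "A中B".encode("gb2312")  # ASCII and Chinese mixed in one byte stream
print(split_high_bit(sample))
```

Because ASCII never sets the top bit, the scanner can tell the two kinds of unit apart without any switching sequence.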

Lecture 2, slide 2: GuoBiao (國標): GB Series
PRC standard (also used in Singapore)
– G0: GB 2312, 6,763 Han characters
– G1: the GB traditional-character counterpart of G0
– GBK: extension to G0 to support Unicode characters
GB 2312 is the most commonly used codeset
– Represents simplified characters; its byte values overlap with the internal codes of traditional-character codesets such as Big5, so the same bytes are ambiguous unless the codeset is known
– Code table: 94 rows x 94 columns, total 8,836 codepoints (code space)
– Code range shown in the code table: 0x2121 - 0x7E7E
– In the high-bit on scheme used by most systems (8-bit encoding), the code range is 0xA1A1 - 0xFEFE
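The relation between the two code ranges can be written out as arithmetic. The following is a minimal sketch, assuming 1-based row and column numbers; the function name `gb_codes` is hypothetical and not part of any standard library.

```python
def gb_codes(row: int, cell: int) -> tuple:
    """Return (code-table value, high-bit-on value) for a 1-based row and column."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    # Adding 0x20 to each coordinate gives the code-table bytes 0x21..0x7E.
    table_code = ((row + 0x20) << 8) | (cell + 0x20)
    # Setting the most significant bit of both bytes gives 0xA1..0xFE.
    high_bit_code = table_code | 0x8080
    return table_code, high_bit_code


print([hex(c) for c in gb_codes(1, 1)])    # ['0x2121', '0xa1a1']
print([hex(c) for c in gb_codes(94, 94)])  # ['0x7e7e', '0xfefe']
print(94 * 94)                             # 8836 codepoints in the code space
```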

Lecture 2, slide 3: Character subsets (rows)
– 1: Special symbols (math, etc., e.g. 【】)
– 2: Paragraph numbers (e.g. 15. (16).)
– 3: Full-width (全角) ASCII characters, the counterparts of the ordinary ASCII characters (e.g. A..Z)
– 4: Hiragana, 5: Katakana
– 6: Greek (48), 7: Cyrillic (Russian)
– 8: Pinyin (Romanized Pinyin vowels and Zhuyin symbols)
– 9: Graphics for box and table drawing
– 16-55: Level 1 (0xB0-0xD7), 3,755 Hanzi characters (ordered by pinyin)
– 56-87: Level 2 (0xD8-0xF7), 3,008 Hanzi characters (ordered by radical, then stroke number)
– 88-94: Undefined areas, for future extension (103 characters were later defined in rows 88-89, and 161 graphic symbols from row 90 onward); user-defined area
Full-width characters vs. half-width characters
Why are there some undefined codepoints (unlike ASCII, which is completely filled)?
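The row layout above can be checked against real data. This is a rough sketch that relies on Python's built-in `gb2312` codec; the helper names `gb2312_row` and `gb2312_subset` are hypothetical, and the function only accepts characters that encode to two bytes (plain ASCII input would fail at the unpacking step).

```python
def gb2312_row(ch: str) -> int:
    """Return the 1-based GB 2312 row of a single two-byte character."""
    first, _second = ch.encode("gb2312")  # unpacks the two byte values
    return first - 0xA0                   # 0xA1 is row 1 in the high-bit on form


def gb2312_subset(ch: str) -> str:
    """Name the subset a character falls into, following the row list on the slide."""
    row = gb2312_row(ch)
    if 1 <= row <= 9:
        return "symbols and non-Hanzi scripts"
    if 16 <= row <= 55:
        return "Level 1 Hanzi"
    if 56 <= row <= 87:
        return "Level 2 Hanzi"
    return "undefined or user-defined"


print(gb2312_row("啊"), gb2312_subset("啊"))  # 16 Level 1 Hanzi
print(gb2312_row("α"), gb2312_subset("α"))    # 6 symbols and non-Hanzi scripts
```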

Lecture 2, slide 4: GBK

Lecture 2, slide 5: Big5 (大五)
De facto standard in Taiwan and HK (commonly for PC)
High-bit on scheme
Row-cell defined range: first byte 0xA1-0xFE, second byte 0x40-0x7E and 0xA1-0xFE (two blocks)
Standard code space: 94 * (94 + 63) = 14,758 codepoints
Character subsets
– Punctuation symbols (A140-A24E)
– Units (A24F-A261)
– Graphic symbols for boxes and tables (A262-A2AE)
– Numerals (A2AF-A2CE)
– Latin letters (A2CF-A343)
– Greek letters (A344-A373)
– Zhuyin (A374-A3BF)
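The byte ranges and the size of the code space can be verified by counting. This is a small sketch; the predicate names `is_big5_lead` and `is_big5_trail` are hypothetical, and they cover only the standard range given on this slide, not later extensions.

```python
def is_big5_lead(b: int) -> bool:
    """First byte of a standard Big5 character: 0xA1-0xFE."""
    return 0xA1 <= b <= 0xFE


def is_big5_trail(b: int) -> bool:
    """Second byte of a Big5 character: 0x40-0x7E or 0xA1-0xFE (two blocks)."""
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


leads = sum(is_big5_lead(b) for b in range(256))
trails = sum(is_big5_trail(b) for b in range(256))
print(leads, trails, leads * trails)  # 94 157 14758

lead, trail = "中".encode("big5")
print(is_big5_lead(lead), is_big5_trail(trail))  # True True
```

Note that the second byte may fall in the ASCII range (0x40-0x7E), so only the first byte carries the high-bit signal.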

Lecture 2, slide 6: Big5 (continued)
– Hanzi
  Plane 1 (A440-C67E): frequently used (5,401)
  Plane 2 (C940-F9D5): less frequently used (7,652)
  Contains some simplified forms and variants, e.g. 台 (臺) 灣
  Contains some dialect-specific characters
– Hiragana (C6A1-C6F7) and Katakana (C6F8-C7B0)
– Cyrillic letters (C7B1-C7E8)
– Numbers (C7E9-C7FC)
Extension to Big5 (called ETen Big5): 8140-A0FE
– Additional 32 * 157 = 5,024 codepoints
– Total of 14,758 + 5,024 = 19,782
– User-defined areas (UDA): FA40-FEFE (UDA 1, 5 rows), 8E40-A0FE (UDA 2, 19 rows), 8140-8DFE (UDA 3, 13 rows)
– Vendor-defined areas (VDA): VDA 1: C6A1-C8FE, VDA 2: F9D6-F9FE

Lecture 2, slide 7: HKSCS (香港增補字符集)
Extension to ETen Big5 using UDAs
Big5 UDAs and VDAs:
  Area     Range          Codepoints
  UDA 3    8140 – 8DFE    2,041
  UDA 2    8E40 – A0FE    2,983
  VDA 1    C6A1 – C8FE    408
  VDA 2    F9D6 – F9FE    41
  UDA 1    FA40 – FEFE    785
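The codepoint counts for these areas follow from the Big5 layout of 157 valid second bytes per row. The sketch below recomputes them; `trail_index` and `big5_count` are hypothetical helper names, and the code assumes both ends of a range use valid second bytes.

```python
def trail_index(b: int) -> int:
    """Position (0..156) of a second byte within a Big5 row."""
    if 0x40 <= b <= 0x7E:
        return b - 0x40               # first block: positions 0..62
    if 0xA1 <= b <= 0xFE:
        return b - 0xA1 + 63          # second block: positions 63..156
    raise ValueError(f"invalid Big5 second byte: {b:#x}")


def big5_count(start: int, end: int) -> int:
    """Number of codepoints from start to end inclusive, e.g. 0x8140 to 0x8DFE."""
    s_lead, s_trail = divmod(start, 0x100)
    e_lead, e_trail = divmod(end, 0x100)
    return (e_lead - s_lead) * 157 + trail_index(e_trail) - trail_index(s_trail) + 1


areas = {
    "UDA 3": (0x8140, 0x8DFE),
    "UDA 2": (0x8E40, 0xA0FE),
    "VDA 1": (0xC6A1, 0xC8FE),
    "VDA 2": (0xF9D6, 0xF9FE),
    "UDA 1": (0xFA40, 0xFEFE),
}
for name, (start, end) in areas.items():
    print(name, big5_count(start, end))
```

The same function reproduces the 5,024 codepoints of the 8140-A0FE extension range.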

Lecture 2, slide 8: Principles
– Compatible with GCCS
– Distinct areas for Han characters and symbols
– Subdivision of UDAs: allow extension in the future; avoid unnecessary use of certain areas
UDA 3: 8140 – 8DFE (2,041 codepoints)
– 8140 – 84FE (628 codepoints): reserved for private use only
– 8540 – 8DFE (1,413 codepoints): reserved for HKSCS-E; 757 characters assigned already

Lecture 2, slide 9: Other Chinese codesets
– CNS (government standard in Taiwan, used in Chinese Solaris)
– Character sets for libraries: CCCII for Taiwan and ANSI Z for the Library of Congress
Character standards from other countries:
– JIS series for Japanese
– KS series for Korean, etc.

Lecture 2, slide 10: More on encoding schemes
– ISO-2022 series: uses designated escape sequences or switch characters
  Example: 1B (ESC) 24 ($) 29 ()) 41 (A) for GB2312, 1B for CNS Plane 1, and 1B 24 2A 48 for CNS Plane 2, etc.
– EUC (Extended Unix Code): code set 0 is ASCII; code set 1 is high-bit on; code set 2 is prefixed by SS2 (0x8E); code set 3 is prefixed by SS3 (0x8F)
Charset designation and registry
– European Computer Manufacturers Association (ECMA) Registry
– Internet Assigned Numbers Authority (IANA) Registry
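The two complete designation sequences quoted above can be spelled out as bytes to see how a reader might detect a codeset switch. This is only a sketch: `DESIGNATIONS` and `find_designations` are hypothetical names, and the table holds just the two sequences given in full on the slide, not the whole ISO-2022 repertoire.

```python
ESC_GB2312 = bytes.fromhex("1B242941")      # ESC $ ) A
ESC_CNS_PLANE2 = bytes.fromhex("1B242A48")  # ESC $ * H

# Illustrative lookup table, limited to the sequences quoted on the slide.
DESIGNATIONS = {
    ESC_GB2312: "GB2312",
    ESC_CNS_PLANE2: "CNS Plane 2",
}


def find_designations(data: bytes) -> list:
    """Return (offset, codeset name) for each known designation sequence in data."""
    found = []
    for i in range(len(data)):
        if data[i] == 0x1B:  # ESC starts a designation sequence
            name = DESIGNATIONS.get(bytes(data[i:i + 4]))
            if name is not None:
                found.append((i, name))
    return found


stream = b"abc" + ESC_GB2312 + b"VP" + ESC_CNS_PLANE2
print(ESC_GB2312, ESC_CNS_PLANE2)  # b'\x1b$)A' b'\x1b$*H'
print(find_designations(stream))   # [(3, 'GB2312'), (9, 'CNS Plane 2')]
```

Unlike the high-bit on scheme, the bytes after a designation stay in the 7-bit range, so the reader must track the most recent escape sequence to interpret them.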

Lecture 2, slide 11: Problems with Different Chinese Codesets
– Codeset incompatibility: a codepoint assigned to one character in one codeset is used for a different character in another codeset
– Problems with data exchange: data from non-conforming platforms is interpreted wrongly
– Codeset announcement and switching mechanisms are needed when multiple codesets must co-exist on the same platform
– Even different writing styles of the same language (simplified and traditional) cannot be presented in the same system
– Problems when using codeset conversion
  1-N mapping, example: 后 (GB) vs. 后 and 後 (Big5)
  1-0 mapping: some characters in Big5 are not in GB and are mapped to an undefined-character symbol => round-trip conversion problem
– Different software must be developed for different codesets