Unicode Normalization Mark Davis www.macchiato.com.

Presentation transcript:

Unicode Normalization Mark Davis

Normalization
Uniqueness: two equivalent strings have precisely the same normalized form
Fast binary comparison, accurate digital signatures
Recommended for XML, JavaScript and other standards

Canonical Equivalence
Fundamental equivalence
Indistinguishable to users, when correctly rendered
Includes:
  Combining sequences (C + ¸ = Ç)
  Hangul
  Singletons (Ω)

Compatibility Equivalence
Formatting differences:
  Font variants
  Breaking differences (-)
  Cursive forms
  Circled
  Width, size, rotated
  Super/subscripts
  Squared characters
  Fractions
  Others (dž)
fi kg

UTR #15: Unicode Normalization Forms
Form D:   Canonical Decomposition
Form KD:  Compatibility Decomposition
Form C:   Form D + Canonical Composition
Form KC:  Form KD + Canonical Composition

Normalization Requirement
Uniqueness: two equivalent strings will have precisely the same normalized form
If two strings x and y are canonical equivalents, then
  C(x) = C(y)
  D(x) = D(y)
If two strings x and y are compatibility equivalents, then
  KC(x) = KC(y)
  KD(x) = KD(y)
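
The uniqueness requirement can be checked directly with Python's standard `unicodedata` module. This is only an illustrative sketch: the helper name `same_under` and the sample strings are chosen here and are not from the slides.

```python
import unicodedata

def same_under(form, x, y):
    """Return True if x and y have the same normalized form (NFC, NFD, NFKC or NFKD)."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

# Three canonically equivalent spellings of A with ring above.
precomposed = "\u00C5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
sequence = "A\u030A"        # A + COMBINING RING ABOVE
angstrom = "\u212B"         # ANGSTROM SIGN (a singleton)

# A compatibility-equivalent pair: the fi ligature and the plain letters.
ligature = "\uFB01"
plain = "fi"

for form in ("NFC", "NFD"):
    print(form, same_under(form, precomposed, sequence), same_under(form, precomposed, angstrom))
for form in ("NFKC", "NFKD"):
    print(form, same_under(form, ligature, plain))
```

The ligature pair is only a compatibility equivalent, so it compares equal under KC and KD but stays distinct under C and D.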

Affected Characters
None of the forms affect text with only ASCII characters (U+0000 to U+007F)
None of the forms generate compatibility characters that were not in the source text
Both KD and KC replace compatibility characters
Both D and C maintain compatibility characters
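
A short check of these statements, again using the standard `unicodedata` module. The sample characters (circled digit one, superscript two) are picked here for illustration only.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalize_all(text):
    """Return a dict mapping each normalization form name to the normalized text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}

# Every ASCII character, U+0000 to U+007F, in one string.
ascii_text = "".join(chr(cp) for cp in range(0x80))

# Two compatibility characters: CIRCLED DIGIT ONE and SUPERSCRIPT TWO.
compat_text = "\u2460\u00B2"

ascii_forms = normalize_all(ascii_text)
compat_forms = normalize_all(compat_text)

print(all(result == ascii_text for result in ascii_forms.values()))
print({form: [f"U+{ord(ch):04X}" for ch in result] for form, result in compat_forms.items()})
```

Forms C and D leave the two compatibility characters alone, while KC and KD replace them with the plain digits.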

Cautions: Decomposition
Requires decomposition mappings from the Unicode Character Database
Those decomposition mappings must be applied recursively
The string must be put into canonical order
Applies to either Canonical or Compatibility decomposition
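
The three cautions can be sketched as a small canonical decomposition routine. This is a simplified illustration, not a reference implementation: it reads the mappings through `unicodedata.decomposition`, and it does not handle Hangul syllables, whose decomposition is algorithmic and not stored as a mapping. The function names are invented for this sketch.

```python
import unicodedata

def canonical_decompose(ch):
    """Recursively apply canonical decomposition mappings to one character."""
    mapping = unicodedata.decomposition(ch)
    # An empty string means no mapping; a leading "<tag>" marks a compatibility mapping.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(canonical_decompose(chr(int(cp, 16))) for cp in mapping.split())

def canonical_order(text):
    """Stable-sort each run of non-starters by canonical combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def nfd_sketch(text):
    """Form D for non-Hangul text: decompose recursively, then reorder."""
    return canonical_order("".join(canonical_decompose(ch) for ch in text))

# U+1EA5 needs two rounds of mapping: first to U+00E2 U+0301, then to a + U+0302 + U+0301.
print([f"U+{ord(ch):04X}" for ch in nfd_sketch("\u1EA5")])
```

Skipping the recursion would leave U+00E2 in the output, and skipping the reordering would leave sequences such as a + U+0300 + U+0330 out of canonical order.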

Cautions: Composition
Decomposition required first! Then canonical composition
Composition data: fixed at Unicode
Some characters are excluded from composition
Form C and Form KC can still have combining characters!
  Required for Indic, Arabic, Hebrew, etc.
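
To illustrate that Form C output can still contain combining characters, the sketch below normalizes a few sequences that have no usable precomposed form. The samples and the helper `has_combining_marks` are chosen here for illustration.

```python
import unicodedata

def has_combining_marks(text):
    """Return True if any character has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)

samples = {
    "Latin q + acute": "q\u0301",             # no precomposed character exists
    "Devanagari ka + nukta": "\u0915\u093C",  # U+0958 exists but is excluded from composition
    "Arabic beh + fatha": "\u0628\u064E",     # vowel marks stay separate
    "Hebrew bet + dagesh": "\u05D1\u05BC",    # U+FB31 exists but is excluded from composition
}

nfc_results = {name: unicodedata.normalize("NFC", text) for name, text in samples.items()}
for name, result in nfc_results.items():
    print(name, [f"U+{ord(ch):04X}" for ch in result], has_combining_marks(result))
```

Code that assumes one code point per user-perceived character after Form C would therefore still be wrong for these scripts.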

Caution: Both C & D
None of the normalization forms is closed under string concatenation.
Example:
  NFC/D: "…a̰ " + " ̀…"
  Not Norm.: "…à̰ …"
  NFC: "…à̰ …"
  NFD: "…a ̰̀ …"
Exceptions easy to test for
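
A minimal demonstration of the concatenation problem. The strings below are chosen for this sketch and are not necessarily the ones on the slide. The helper compares a string with its normalized form, which works on any Python 3 version.

```python
import unicodedata

def is_normalized(form, text):
    """Return True if text is already in the given normalization form."""
    return unicodedata.normalize(form, text) == text

# Form D: both pieces are normalized, but the grave accent (class 230) ends up
# before the tilde below (class 220), so the result is not in canonical order.
d_left, d_right = "a\u0300", "\u0330"
d_joined = d_left + d_right

# Form C: both pieces are normalized, but together they can compose to U+00E0.
c_left, c_right = "a", "\u0300"
c_joined = c_left + c_right

print(is_normalized("NFD", d_left), is_normalized("NFD", d_right), is_normalized("NFD", d_joined))
print(is_normalized("NFC", c_left), is_normalized("NFC", c_right), is_normalized("NFC", c_joined))
```

In both cases the problem sits at the boundary, where the right-hand string starts with a non-starter, so one simple remedy is to re-normalize the joined string.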

Composition Process
1. Decompose (D or KD)
2. Combine unblocked characters with the previous starter, if possible*
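
The two steps can be sketched as follows. Several assumptions apply: the pair table is derived from Python's `unicodedata` rather than from the Unicode data files, a character counts as excluded if Form C does not preserve it, and Hangul syllables (composed algorithmically) are not handled. All names are invented for this sketch.

```python
import sys
import unicodedata

def build_pair_table():
    """Map (first, second) character pairs to their primary composite."""
    pairs = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        if not mapping or mapping.startswith("<"):
            continue                      # no mapping, or compatibility mapping
        parts = mapping.split()
        if len(parts) != 2:
            continue                      # singleton: never recomposed
        if unicodedata.normalize("NFC", ch) != ch:
            continue                      # excluded from composition
        pairs[(chr(int(parts[0], 16)), chr(int(parts[1], 16)))] = ch
    return pairs

PAIRS = build_pair_table()

def compose_sketch(text):
    """Step 1: decompose (Form D). Step 2: combine unblocked characters with the previous starter."""
    out = list(unicodedata.normalize("NFD", text))
    if not out:
        return ""
    starter_pos, length = 0, 1
    starter = out[0]
    last_class = unicodedata.combining(starter)
    if last_class != 0:
        last_class = 256                  # text starts with a non-starter: nothing combines with it
    for ch in out[1:]:
        ch_class = unicodedata.combining(ch)
        composite = PAIRS.get((starter, ch))
        if composite is not None and (last_class < ch_class or last_class == 0):
            out[starter_pos] = composite  # ch is unblocked and combines
            starter = composite
        else:
            if ch_class == 0:
                starter_pos, starter = length, ch
            last_class = ch_class
            out[length] = ch
            length += 1
    return "".join(out[:length])

print(compose_sketch("a\u0302\u0301") == "\u1EA5")
```

Building the table scans every code point once, so a real implementation would use precomputed data instead.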

Composition Exclusions
Script Specifics: + ̣
Futures: G + ̣ G ̣
Singletons*: Ω Ω
Non-starter sequences*: ̈ + ́ ̈́
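
The exclusion categories can be observed from Python. This sketch assumes that a character is effectively excluded when it has a canonical decomposition mapping but Form C does not restore it; the example code points are chosen here and may differ from those on the slide.

```python
import unicodedata

def is_composition_excluded(ch):
    """True if ch has a canonical decomposition mapping but Form C does not produce it."""
    mapping = unicodedata.decomposition(ch)
    has_canonical_mapping = bool(mapping) and not mapping.startswith("<")
    return has_canonical_mapping and unicodedata.normalize("NFC", ch) != ch

examples = {
    "script specific": "\u0958",       # DEVANAGARI LETTER QA
    "singleton": "\u2126",             # OHM SIGN
    "non-starter sequence": "\u0344",  # COMBINING GREEK DIALYTIKA TONOS
    "ordinary composite": "\u00C5",    # LATIN CAPITAL LETTER A WITH RING ABOVE
}
for label, ch in examples.items():
    print(label, f"U+{ord(ch):04X}", is_composition_excluded(ch))

# "Futures": G + COMBINING DOT BELOW has no precomposed form and stays a sequence.
future_sequence = unicodedata.normalize("NFC", "G\u0323")
print(len(future_sequence))
```

The "Futures" case is what keeps normalized text stable across versions: a sequence like this one stays decomposed in Form C even if a precomposed character is encoded later.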

Legacy Encoding
Legacy text is normalized if it maps 1:1 to normalized Unicode text
Legacy sets:
  Prenormalized: e.g. ISO
  Normalizable: e.g. ISO 2022 (ISO 5426/ISO /…)
  Unnormalizable: e.g. ISO 5426

Programming Identifiers
Closed under all Normalization Forms, if minor changes incorporated
Modified syntax:
  identifier := start ( start | extend )*
  start := [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}] - irregulars - combining_like
  extend := [{Mn}{Mc}{Nd}{Pc}{Cf}] - irregulars + combining_like + mid_dot
(Almost) closed under Case Mappings
  see SpecialCasing.txt
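
A rough sketch of the modified identifier syntax as a checker. The slide does not list the contents of the `irregulars` and `combining_like` sets, so they are left as empty placeholders here, and `mid_dot` is assumed to be U+00B7 MIDDLE DOT. The function names are hypothetical.

```python
import unicodedata

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc", "Cf"}

# Placeholders: the actual members are not given on the slide.
IRREGULARS = frozenset()
COMBINING_LIKE = frozenset()
# Assumption: mid_dot is U+00B7 MIDDLE DOT.
MID_DOT = frozenset({"\u00B7"})

def is_start(ch):
    """start := [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}] - irregulars - combining_like"""
    return (unicodedata.category(ch) in START_CATEGORIES
            and ch not in IRREGULARS
            and ch not in COMBINING_LIKE)

def is_extend(ch):
    """extend := [{Mn}{Mc}{Nd}{Pc}{Cf}] - irregulars + combining_like + mid_dot"""
    if ch in COMBINING_LIKE or ch in MID_DOT:
        return True
    return unicodedata.category(ch) in EXTEND_CATEGORIES and ch not in IRREGULARS

def is_identifier(text):
    """identifier := start ( start | extend )*"""
    if not text or not is_start(text[0]):
        return False
    return all(is_start(ch) or is_extend(ch) for ch in text[1:])

print(is_identifier("re\u0301sume\u0301"), is_identifier("9lives"))
```

With the placeholder sets filled in, the same check could be run on an identifier before and after normalization to test the closure property.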

Resources
Reference version on Unicode Site
Production Version
  ICU: C/C++ and Java Versions
  Open Source, with IBM Public License
  Free commercial use and distribution: Not Viral!
  Panel Later today
Other companies also providing: ask!

Normalization
Uniqueness: two equivalent strings have precisely the same normalized form
Fast binary comparison, accurate digital signatures
Recommended for XML, JavaScript and other standards

Q & A

Backup Slides

Definition: Starter
S is a starter = Canonical class of zero in the Unicode Character Database
Can start a composition
Examples:
  Starters: Spacing marks, some non-spacing: a, ق, Θ
  Non-starters: most non-spacing marks: ̀, ̊ ̽ ̥
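
In Python the canonical combining class is exposed as `unicodedata.combining`, so the definition reduces to a one-line test. The helper name `is_starter` is chosen here.

```python
import unicodedata

def is_starter(ch):
    """A starter has canonical combining class zero."""
    return unicodedata.combining(ch) == 0

starters = ["a", "\u0642", "\u0398", "\u0903"]          # a, Arabic qaf, Greek theta, Devanagari visarga (a spacing mark)
non_starters = ["\u0300", "\u030A", "\u033D", "\u0325"]  # grave, ring above, x above, ring below

for ch in starters + non_starters:
    print(f"U+{ord(ch):04X}", unicodedata.combining(ch), is_starter(ch))
```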

Definition: Blocked
C is blocked from S =
  There is some character B between S and C, and either
  B is a starter or B has the same canonical class as C
Examples:
  ABC: B blocks C from A
  A ̀̊: ̀ blocks ̊ from A
  Ḁ̊: ̥ doesn't block ̊ from A
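
The definition translates almost word for word into code. This sketch follows the slide's wording ("same canonical class"); the current text of UAX #15 phrases the second condition as "same or higher", which gives the same answer for text in canonical order. The function name and index-based signature are choices made for this sketch.

```python
import unicodedata

def is_blocked(text, s_index, c_index):
    """Return True if text[c_index] is blocked from text[s_index].

    Blocked means some character B strictly between them is a starter
    or has the same canonical combining class as C.
    """
    c_class = unicodedata.combining(text[c_index])
    for b in text[s_index + 1:c_index]:
        b_class = unicodedata.combining(b)
        if b_class == 0 or b_class == c_class:
            return True
    return False

print(is_blocked("ABC", 0, 2))              # B is a starter between A and C
print(is_blocked("A\u0300\u030A", 0, 2))    # grave and ring above share class 230
print(is_blocked("A\u0325\u030A", 0, 2))    # ring below (220) differs from ring above (230)
```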

Testing Conformance: Canonical
For all Unicode characters X:
  C(X) = C(D(X))
  D(X), C(X) in canonical order
No CDM:
  X = D(X)
  X = C(X)
CDM:
  X ≠ D(X)
  No characters in D(X) have CDM
  X in Exclusions: X ≠ C(D(X))
  X not in Exclusions: X = C(D(X))
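
A few of these checks can be run against an existing implementation. The sketch below tests Python's `unicodedata` over a caller-supplied range of code points and covers only three of the conditions: C(X) = C(D(X)), canonical order of both forms, and no remaining canonical mappings in D(X). The function names are invented here.

```python
import unicodedata

def has_canonical_mapping(ch):
    """True if the character database lists a canonical decomposition mapping for ch."""
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")

def in_canonical_order(text):
    """True if no non-starter follows a character with a higher combining class."""
    previous = 0
    for ch in text:
        current = unicodedata.combining(ch)
        if current != 0 and previous > current:
            return False
        previous = current
    return True

def conformance_failures(code_points):
    """Return the code points that violate any of the three checks."""
    failures = []
    for cp in code_points:
        x = chr(cp)
        d = unicodedata.normalize("NFD", x)
        c = unicodedata.normalize("NFC", x)
        ok = (c == unicodedata.normalize("NFC", d)
              and in_canonical_order(d)
              and in_canonical_order(c)
              and not any(has_canonical_mapping(ch) for ch in d))
        if not ok:
            failures.append(cp)
    return failures

print(len(conformance_failures(range(0x0000, 0x3400))))
```

The range is a parameter so that a quick run can cover a few blocks and a full run can cover every code point.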

Unicode Normalization
Introduction
Normalization forms
Design goals
Specification
Excluded characters
Versions
Legacy encodings
Applications

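Characters and Encoding Forms
[Figure: abstract characters (Å, A + °), their encoded code points (e.g. 00C5, 212B), and their serialized forms in UTF-16BE and UTF-8 (e.g. C3 85, E2 84 AB, 61 CC 8A, plus a supplementary character: DB80 DC…, F3 B0 80…)]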
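
The relationship between encoded code points and serialized bytes can be reproduced with Python's built-in codecs. The byte values that survive in the transcript match the first three samples below. The fourth sample, U+F0000, is an assumption inferred from the surviving fragments and is not stated in the transcript.

```python
def encodings(text):
    """Return the code points and the UTF-16BE and UTF-8 bytes of text as hex strings."""
    return {
        "code points": " ".join(f"U+{ord(ch):04X}" for ch in text),
        "UTF-16BE": text.encode("utf-16-be").hex().upper(),
        "UTF-8": text.encode("utf-8").hex().upper(),
    }

samples = [
    "\u00C5",       # LATIN CAPITAL LETTER A WITH RING ABOVE
    "\u212B",       # ANGSTROM SIGN
    "a\u030A",      # a + COMBINING RING ABOVE
    "\U000F0000",   # assumed supplementary code point (needs a surrogate pair in UTF-16)
]
for text in samples:
    print(encodings(text))
```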