What's New in Globalization?
Mark Davis, President & Cofounder, The Unicode Consortium

2 The Unicode Standard, Version 5.0
"Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years." Donald E. Knuth
"For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users." Bill Gates
"The path W3C follows to making text on the Web truly global is Unicode." Sir Tim Berners-Lee, KBE
"Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world." James Gosling

3 The Unicode Standard, Version 5.0
- Obsoletes previous versions
- Basis for Microsoft's Vista; in upgrade plans for Google, Yahoo!, and ICU, to name but a few.
- Hundreds of pages of new information; thousands of revised pages; all Unicode Standard Annexes
- Systematic framework for improved text processing
- Improvements to the Unicode Encoding Model for UTF-8, …
- Rigorous stability of case folding and identifiers
- Improved interoperability and backward compatibility
- Enabling additional new ways to optimize code

4 U5.0 Unicode Character Database
- Unicode: far more than a list of characters
- Properties: key to how characters function
- Changes in 5.0
  - Scripts: Unassigned code points Zzzz
  - Casing Stability: Upper folded
  - BIDI: Consistent Bidi_Mirrored
  - Now Normative: kIICore
  - Line Break: SE Asian Complex_Context
  - New Properties: Normative_Name_Alias, Deprecated, 3 Unihan provisional properties

Code points by type:
  General        99,089
  Private Use   137,468
  Surrogate       2,048
  Noncharacter       66
  Reserved      875,441
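The counts on this slide can be checked against the size of the Unicode code space, and a few of the properties mentioned can be queried directly. The following is a minimal sketch using Python's standard `unicodedata` module; note that this module reflects whichever Unicode version ships with the interpreter, not necessarily 5.0, and the `describe` helper is a hypothetical name introduced here for illustration.

```python
import unicodedata

# Code point counts by type, copied from the slide (Unicode 5.0).
counts = {
    "General": 99_089,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 875_441,
}

# The code space runs from U+0000 to U+10FFFF: 17 planes of 65,536 code points.
total = sum(counts.values())
print(total == 0x110000 == 17 * 65_536)  # True


def describe(ch):
    """Return a few Unicode Character Database properties for one character.

    Hypothetical helper, not part of any Unicode or ICU API.
    """
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "bidi_mirrored": bool(unicodedata.mirrored(ch)),
    }


print(describe("("))
```

The five categories sum to exactly 1,114,112, which is every code point in the code space.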

5 U5.0 Conformance
- Stable Case-Folded Upper Lower
- Much clearer encoding / property model
- Stable Approved Named Character Sequences
- Bengali, Gurmukhi, Tamil changes
- Combining grapheme joiner clarified
- Disunification of Diacritics
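Stable case folding matters because caseless comparison built on it keeps giving the same answer across Unicode versions. As a rough illustration (not the conformance text itself), Python's built-in `str.casefold` applies Unicode full case folding; `caseless_equal` is a hypothetical helper name, and a complete caseless match would also normalize the strings.

```python
def caseless_equal(a, b):
    """Compare two strings using Unicode full case folding.

    Sketch only: full caseless matching would also apply normalization.
    """
    return a.casefold() == b.casefold()


# Lowercasing alone does not fold U+00DF (sharp s) to "ss"; case folding does.
print("Stra\u00dfe".lower() == "STRASSE".lower())  # False
print(caseless_equal("Stra\u00dfe", "STRASSE"))    # True
```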

6 U5.0 Annexes: Core
- UAX #9: Bidirectional Algorithm
  - Tightened conformance requirements
- UAX #15: Unicode Normalization Forms
  - New Stream-Safe Text Format
  - Appendix of characters requiring special handling
  - Expanded info on stability guarantees
  - Additional detailed figures, guidelines
- UAX #31: Identifier and Pattern Syntax
  - Added profiles & information on usage
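For readers who have not worked with the normalization forms of UAX #15, the sketch below shows why they are needed: the same visible text can be stored as different code point sequences. It uses Python's standard `unicodedata.normalize`; `same_text` is a hypothetical helper name, and the Stream-Safe Text Format is not covered here.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)                        # False: different code points
print(same_text(composed, decomposed))               # True: canonically equivalent
print(len(unicodedata.normalize("NFD", composed)))   # 2
print(unicodedata.normalize("NFKC", "\ufb01"))       # fi (compatibility mapping of the ligature)
```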

7 U5.0 Annexes: Boundaries
- UAX #14: Line Breaking Properties
  - Rules modified to improve behavior
  - Now Normative (conformance clauses reorganized)
- UAX #29: Text Boundaries
  - Edge cases improved
  - Tailorings for text boundaries now in Unicode CLDR
  - Format of the rules changed to ease implementation
  - Additional guidelines on regex, identifiers,…

8 U5.0 Characters by Script

9 Unicode Character Timeline

10 Unicode Guide for Programmers
- Adjunct to Standard
- Concise Guide for Software Globalization
- Crucial Concepts
- Key Gotchas: Recognize and Avoid
- Details on
  - Encoding & conversions: UTF-8, 16, 32 & BOM
  - Using character properties
  - Text Operations
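As a small illustration of the encoding topics listed above, the sketch below encodes the same three characters in UTF-8, UTF-16, and UTF-32 and shows how a UTF-8 byte order mark behaves on decoding. The sample string is an arbitrary choice made here; the codec names are those of Python's standard library.

```python
import codecs

# One ASCII letter, one BMP character, one supplementary character.
text = "A\u20ac\U0001d11e"  # "A", EURO SIGN, MUSICAL SYMBOL G CLEF

utf8 = text.encode("utf-8")       # 1 + 3 + 4 bytes
utf16 = text.encode("utf-16-be")  # 2 + 2 + 4 bytes (surrogate pair for U+1D11E)
utf32 = text.encode("utf-32-be")  # 4 bytes per code point

print(len(utf8), len(utf16), len(utf32))  # 8 8 12

# A UTF-8 stream may begin with a byte order mark (bytes EF BB BF).
with_bom = codecs.BOM_UTF8 + "abc".encode("utf-8")
print(repr(with_bom.decode("utf-8")))      # '\ufeffabc': the BOM survives as U+FEFF
print(repr(with_bom.decode("utf-8-sig")))  # 'abc': the BOM is stripped
```

A common gotcha is exactly the difference in the last two lines: decoding with plain `utf-8` leaves an invisible U+FEFF at the start of the string.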

11 Unicode Common Locale Data Repository: CLDR
- Key locale data for world languages
- Most extensive standard repository of locale data
- XML format
Δευτέρα, 05 Σεπτεμβρίου 2005
Montag, 5. September 2005
1,234.57
1 234,57руб.
Arabic – arabski, Bulgarian – bułgarski, Czech – czeski, …
Africa – Central America – Eastern Africa – Northern Africa – …
AED – د.إ., BHD – د.ب., DZD – د.ج., EGP – ج.م., EUR – …
Z < Å

12 Unicode CLDR 1.4
- 121 languages and 142 territories – 360 locales in all
- 25% more locale data; over 17,000 new/modified items
- Repository separated into language vs locale data
- Language-specific segmentation (word/line breaks…)
- Transliterations (e.g. Ελληνικά → Ellēniká)
- Data for lenient date/time formatting and parsing
  - Programmer asks for numeric day + abbreviated month
  - Best format pattern returned, e.g. dd.MMM
- Quarters in dates (e.g. 2006Q1)
- BCP 47 compatibility + extensions

13 BCP 47 Language Tags
- Usage: HTTP, HTML, XML; CLDR Locale IDs…
- RFC 4646; obsoletes RFCs 1766, 3066
- Addresses problems in RFC 3066
  - ISO standards: stability / accessibility / ambiguity
  - Parseability, Extensibility; Registration speed
  - Identification of script (where necessary): Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
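The parseability point can be illustrated with a deliberately simplified parser. The sketch below handles only the language, optional script, and optional region subtags; variants, extensions, private-use and grandfathered tags are out of scope, and it does not check subtags against the registry. `parse_simple_tag` and its pattern are hypothetical names introduced here, not part of any RFC 4646 reference implementation.

```python
import re

# Simplified shape: language, optional 4-letter script, optional region
# (2 letters or 3 digits). Assumption: everything else is rejected.
_SIMPLE_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_simple_tag(tag):
    """Split a simple language tag into (language, script, region).

    Applies the conventional casing: language lower, script title, region upper.
    """
    m = _SIMPLE_TAG.match(tag)
    if m is None:
        raise ValueError(f"not a simple language-script-region tag: {tag!r}")
    language = m.group("language").lower()
    script = m.group("script").title() if m.group("script") else None
    region = m.group("region").upper() if m.group("region") else None
    return language, script, region


print(parse_simple_tag("zh-Hant"))     # ('zh', 'Hant', None)
print(parse_simple_tag("sr-latn-rs"))  # ('sr', 'Latn', 'RS')
print(parse_simple_tag("az-Cyrl"))     # ('az', 'Cyrl', None)
```

Because each subtag type has a distinct length and character class, the parser can tell script from region without consulting a registry, which is the sense of "parseability" here.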

14 Unicode Security
- Examples:
  - Visual Confusables: with Cyrillic a…
  - Non-visual problems: buffer overflows, non-shortest form,…
- UTR #36: Unicode Security Considerations
  - Guidelines & Recommendations
- UTS #39: Unicode Security Mechanisms
  - Algorithms & Data
  - Limitations on Repertoire
  - Testing for Confusables
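Both kinds of problem on this slide can be demonstrated in a few lines. The sketch below is a rough illustration, not the UTS #39 algorithm: Python's standard library has no Script property, so `rough_scripts` (a hypothetical helper) approximates the script from the first word of each character's name, and `is_valid_utf8` (also hypothetical) relies on the fact that Python's UTF-8 decoder rejects non-shortest forms. The spoofed string is an example chosen here.

```python
import unicodedata


def rough_scripts(text):
    """Approximate the set of scripts used by the letters in text.

    Assumption: the first word of the character name stands in for the
    Script property, which the standard library does not expose.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_valid_utf8(data):
    """Return True if data decodes as well-formed (shortest-form) UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Visual confusable: the second letter is CYRILLIC SMALL LETTER A (U+0430).
print(rough_scripts("paypal"))        # {'LATIN'}
print(len(rough_scripts("p\u0430ypal")))  # 2 (Latin and Cyrillic mixed)

# Non-shortest form: C0 AF is an overlong encoding of "/" and must be rejected.
print(is_valid_utf8(b"/"))         # True
print(is_valid_utf8(b"\xc0\xaf"))  # False
```

Mixed-script detection of this kind flags the spoof but also flags legitimate mixed-script text, which is why UTS #39 provides dedicated data and restriction levels.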

15 Internationalized Domain Names
- One instance of broad problem
  - Many RFCs use Nameprep – limited to Unicode 3.2
- Unicode recommendations
  - Narrow the repertoire: exclude symbols, punctuation
  - Expand the coverage: currently only Unicode 3.2.
- IETF idn-nextsteps published
  - Some positive developments, but misreads Unicode, needs more work
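To make the Nameprep-based scheme concrete, the sketch below converts a domain name to and from its ASCII form using the `idna` codec in Python's standard library, which implements the original IDNA specification (RFC 3490) including Nameprep. The domain `bücher.example` is an illustrative choice made here, not one from the talk.

```python
# Assumption: an illustrative internationalized domain name.
unicode_name = "b\u00fccher.example"

# ToASCII: each non-ASCII label is Nameprep-processed and Punycode-encoded.
ascii_name = unicode_name.encode("idna")
print(ascii_name)  # b'xn--bcher-kva.example'

# ToUnicode: the reverse mapping recovers the original labels.
print(ascii_name.decode("idna") == unicode_name)  # True
```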

16 URL → IRI
- Internationalized Resource Identifier (IRI)
- UTF-8, %-escaped
- Example: JP /.html JP%E7%B4%8D... %E8%B1%86.html
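The "UTF-8, %-escaped" step can be sketched for a path component with Python's standard `urllib.parse`. This covers only the path; a full IRI-to-URI mapping also has to treat the host name and other components, which is not attempted here. The two helper names and the sample path are hypothetical; the escape sequence `%E8%B1%86` is the one visible on the slide.

```python
from urllib.parse import quote, unquote


def iri_path_to_uri_path(path):
    """Percent-escape the UTF-8 bytes of non-ASCII characters, keeping '/'."""
    return quote(path, safe="/", encoding="utf-8")


def uri_path_to_iri_path(path):
    """Decode %-escapes as UTF-8 to recover the readable form."""
    return unquote(path, encoding="utf-8")


# U+8C46 encodes to the three UTF-8 bytes E8 B1 86.
print(iri_path_to_uri_path("/\u8c46.html"))                        # /%E8%B1%86.html
print(uri_path_to_iri_path("/%E8%B1%86.html") == "/\u8c46.html")  # True
```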

17 Ideographic Variation Database
- U+82A6 ashi: multiple forms
- The first occurrence – any glyph
- Second occurrence is in the name of the town Ashiya – customarily displayed with form #4
- Registration for variants

18 Ideographic Variation Database
- Variation Selector
  - Identifies a restriction on the appearance of a character
  - Character + Variation Selector = Variation Sequence
- Han ideographs
  - Impossible to build a single collection for everyone: requirements from scholars, governments and publishers…
  - Instead, registration of multiple independent collections
- Unicode Ideographic Variation Database
  - A given variation sequence is used in at most one collection
  - Makes interchange of variation sequences reliable.
  - Registration, not Assessment
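The "Character + Variation Selector = Variation Sequence" rule can be sketched in code. Ideographic variation sequences use the supplementary variation selectors VS17 to VS256 (U+E0100 to U+E01EF). Which selector corresponds to a particular registered form of U+82A6 depends on the registered collection, so the use of VS17 below is purely an assumption for illustration, and both helper names are hypothetical.

```python
BASE = "\u82a6"            # the ideograph discussed in the talk
SELECTOR = "\U000e0100"    # VS17; assumption: stands in for some registered form


def is_ivs_selector(ch):
    """True for the variation selectors used in ideographic variation sequences."""
    return 0xE0100 <= ord(ch) <= 0xE01EF


def variation_sequences(text):
    """Pair each character with the variation selector that follows it, if any."""
    result = []
    for ch in text:
        if is_ivs_selector(ch) and result and result[-1][1] is None:
            base, _ = result.pop()
            result.append((base, ch))
        else:
            result.append((ch, None))
    return result


# First occurrence: any glyph will do. Second occurrence: a restricted form.
text = BASE + BASE + SELECTOR
pairs = variation_sequences(text)
print(len(text), len(pairs))                       # 3 2
print([sel is not None for _, sel in pairs])       # [False, True]
```

A renderer that does not know the sequence can ignore the selector and still show the base character, which is what makes interchange safe.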

19 ICU 3.6
- Mature, portable C/C++/Java intl libraries
- Unicode 5.0, UCA 5.0, CLDR 1.4
- ICU4C: Charset Detection
- Improved: Time Zones, Thai word break, UText (64 bit), Performance, Data Management,…
- ICU4J: Globalization Preferences
- Flexible date/time formats*, Charset conversion*

20 Near-Term Issues
- Unicode 5.0.1, Unicode 5.1
- CLDR / BCP 47bis
- LDAP
- Collation Registry
- IANA Charset Registry

21 Unicode 5.1 - possibilities
- Characters
  - CJK Unified Ideographs Extension C
  - Minority Scripts: Cham and Lanna
  - Malayalam chillu
  - …
- Properties/Behavior
  - Normalization process for stable strings
  - …

22 CLDR 1.5 / BCP 47bis
- CLDR 1.5
  - Data Submission Starting November
  - New structures / data
- BCP 47
  - Adding ~7,000 (!) new language subtags
  - Possibly other changes…

23 LDAP
- Now has definitive comparison (good)
- Stuck at Unicode 3.2 (bad)

24 Collation Registry
- Nearing approval
- Adds ability to register comparisons
- Workable for basic cases
- draft-newman-i18n-comparator-14.txt

25 IANA Charset registry
- Currently limited usefulness
  - Ill-defined
  - Missing mapping tables
  - Incomplete
  - Inaccurate
- Regime Change
  - Hope for future improvements!

