What's New in Globalization. Mark Davis.

2 Unicode Character Database: UCD 5.0. Schedule: currently in β2; due June 2006. Major part of the Unicode Standard 5.0; frozen and published to give implementers a head start. New character repertoire: +1,369. Total graphic + control: 99,089. Total PU/NC/SG (private use, noncharacter, surrogate): 139,582. U5.0 character properties; new characters; corrections.
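
As a rough illustration of what UCD character properties look like to a programmer, the sketch below reads a few of them through Python's standard `unicodedata` module. The Unicode version it reports depends on the Python build and is not necessarily 5.0; the helper name `describe` is made up for this example.

```python
import unicodedata

def describe(ch):
    """Return a few UCD properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
    }

if __name__ == "__main__":
    # The UCD version is a property of the Python build, not of this script.
    print("UCD version in this Python build:", unicodedata.unidata_version)
    for ch in ["A", "\u03A9", "\u05D0"]:  # Latin A, Greek Omega, Hebrew Alef
        print(describe(ch))
```

Newly encoded characters and corrected property values show up through the same calls once the runtime picks up a newer UCD release.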

3 Unicode Standard 5.0. Due 2006Q4; obsoletes previous versions. Years of implementation experience: encoding model; casing; writing systems; security; classification of code points; Unicode strings; variation selectors; new properties; linebreak; bidi; segmentation; … Increased interoperability for bidi, Indic, … Required basis for: regex, collation, segmentation, identifiers, security, … Planned for major software releases: Windows Vista, Solaris, Java, GNOME, …

4 Unicode Guide. Authoritative but lightweight. Introduction, overview, and quick reference. Main principles of the Unicode Standard. Best practices in software globalization. See Globalization Gotchas at this conference.

5 Language Tags. RFC3066 replacement approved 2005-11-15; not yet published, but registry now operating. Addresses problems in RFC3066: stability, accessibility, and ambiguity of the underlying ISO standards; parseability; extensibility; registration speed; identification of script (where necessary): Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
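
To make the parseability point concrete, here is a minimal sketch that splits a tag into language, optional script, and optional region by subtag length alone. It is a deliberately simplified assumption, not the full grammar of the RFC3066 replacement: variants, extensions, private-use subtags, and registry validation are all ignored, and the names `LanguageTag` and `parse_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]

# Simplified shape: language (2-3 letters), optional script (4 letters),
# optional region (2 letters or 3 digits). Subtag length alone tells them apart.
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)

def parse_tag(tag):
    """Parse a simple language tag and normalize the case of each subtag."""
    m = _TAG.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a simple language tag: {tag!r}")
    script = m.group("script")
    region = m.group("region")
    return LanguageTag(
        m.group("language").lower(),
        script.title() if script else None,   # e.g. 'hant' -> 'Hant'
        region.upper() if region else None,   # e.g. 'us' -> 'US'
    )

if __name__ == "__main__":
    for tag in ["zh-Hant", "sr-Latn", "az-Cyrl", "en-US"]:
        print(tag, "->", parse_tag(tag))
```

Because each subtag type has a distinct length, a parser can classify subtags without consulting the registry, which is one way to read the "parseability" goal.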

6 Common Locale Data Repository: CLDR. Common, necessary software locale data for world languages. XML format for effective interchange. Dates: Δευτέρα, 05 Σεπτεμβρίου 2005; Montag, 5. September 2005. Numbers and currency amounts: 1,234.57; 1 234,57руб. Language names: Arabic – arabski; Bulgarian – bułgarski; Czech – czeski; … Territory names: Africa – Central America – Eastern Africa – Northern Africa – … Currency symbols: AED – د.إ. BHD – د.ب. DZD – د.ج. EGP – ج.م. EUR – … Collation: Z < Å

7 CLDR 1.4 Features. Repository separated into language vs locale data. Language-specific segmentation (word/line breaks…). Transliterations (e.g. Ελληνικά → Ellēniká). Data for lenient date/time formatting and parsing: programmer asks for numeric day + abbreviated month; best format pattern returned, e.g. dd.MMM; algorithm and locale data for choosing, adjusting. Calendar usage data. Quarters in dates (e.g. 2006Q1).
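
The "choosing, adjusting" step for date patterns can be sketched as a toy matcher. This is not the CLDR algorithm: the table `AVAILABLE_FORMATS`, its contents, and the functions `fields` and `best_pattern` are assumptions made up for illustration, and the distance measure is the simplest one that works for this example.

```python
import re

# Hypothetical per-locale table: requested-field skeleton -> localized pattern.
AVAILABLE_FORMATS = {
    "Md": "d.M.",
    "MMMd": "d.MMM",
    "yMMMd": "d.MMM y",
}

_RUN = re.compile(r"([A-Za-z])\1*")  # a run of one repeated field letter

def fields(skeleton):
    """Map each field letter to its width, e.g. 'ddMMM' -> {'d': 2, 'M': 3}."""
    return {m.group(1): len(m.group(0)) for m in _RUN.finditer(skeleton)}

def best_pattern(requested, available=AVAILABLE_FORMATS):
    """Choose the closest skeleton, then adjust field widths to the request."""
    want = fields(requested)
    # Choosing: only skeletons with exactly the requested set of fields qualify.
    candidates = [s for s in available if fields(s).keys() == want.keys()]
    if not candidates:
        raise LookupError(f"no pattern for fields {sorted(want)}")
    chosen = min(
        candidates,
        key=lambda s: sum(abs(fields(s)[f] - want[f]) for f in want),
    )
    # Adjusting: widen or narrow each field in the pattern to the requested width.
    # (Toy assumption: patterns contain no quoted literal letters.)
    return _RUN.sub(lambda m: m.group(1) * want[m.group(1)], available[chosen])

if __name__ == "__main__":
    print(best_pattern("ddMMM"))  # numeric day + abbreviated month
```

The caller states which fields are wanted and how wide; the locale data decides order and punctuation, so the application never hard-codes a date layout.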

8 CLDR 1.4 Schedule. Gathering data phase: currently. Vetting phase start: March 15. Release: May 15. Aside from features: new data, corrections; metadata for parsing & validation; new tool for gathering/vetting data.

9 CLDR Survey Tool. New web tool for data submission by Unicode members and others. Submissions automatically incorporated into XML. Process for resolving differences; approval by committee.

10 CLDR Vetting Process. Vetters confirm or approve new translations, corrections. Errors and alerts for areas of concern. Data accepted when approved by multiple organizations (plus exception process).

11 Unicode Security Issues. Examples: visual confusables (with Cyrillic a…); non-visual problems (buffer overflows, non-shortest form, …). UTR #36, Unicode Security Considerations: process recommendations, best practices. UTS #39, Unicode Security Mechanisms: limitations on repertoire, testing for confusables. See Unicode Security at this conference.
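
Both kinds of problem can be illustrated with a short sketch. The mixed-script check below approximates a character's script by the first word of its Unicode name, which is only a rough proxy for the Script property and is not the mechanism defined in UTS #39; the sample string and all function names are assumptions for this example. The second check relies on Python's strict UTF-8 decoder, which rejects non-shortest-form sequences.

```python
import unicodedata

def scripts_used(text):
    """Approximate the scripts in `text` by the first word of each letter's name."""
    found = set()
    for ch in text:
        if ch.isalpha():
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

def is_mixed_script(text):
    """True if letters from more than one (approximate) script appear."""
    return len(scripts_used(text)) > 1

def decodes_as_strict_utf8(data):
    """True if `data` is well-formed UTF-8 (shortest form only)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

if __name__ == "__main__":
    spoof = "ex\u0430mple"  # U+0430 CYRILLIC SMALL LETTER A looks like Latin 'a'
    print(scripts_used(spoof), is_mixed_script(spoof))
    # 0xC0 0xAF is an overlong (non-shortest) encoding of '/', so it must be rejected.
    print(decodes_as_strict_utf8(b"\xc0\xaf"))
```

A production check would use the real Script property and the confusables data rather than character names, but the shape of the test is the same.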

12 Internationalized Domain Names. Unicode recommendations: narrow the repertoire (exclude symbols, punctuation); expand the coverage (currently only Unicode 3.2). Broader problem: many RFCs use Nameprep, but that is limited to Unicode 3.2. New ICANN Guidelines (2.0): improved, but need more work. IETF idn-nextsteps: positive developments, but misreads Unicode; needs more work.
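
The Unicode 3.2 limitation is visible in everyday tooling. As a sketch, Python's built-in `idna` codec follows the original IDNA RFCs with Nameprep, and the standard library ships a frozen Unicode 3.2 database (`unicodedata.ucd_3_2_0`) for that purpose. The domain name and the helper names below are illustrative assumptions.

```python
import unicodedata

def to_ascii(domain):
    """Convert a domain name to its ASCII (Punycode) form with the built-in codec."""
    return domain.encode("idna").decode("ascii")

def assigned_in_unicode_3_2(ch):
    """True if the character has an assigned category in the Unicode 3.2 data."""
    return unicodedata.ucd_3_2_0.category(ch) != "Cn"

if __name__ == "__main__":
    # The frozen database used for Nameprep-style processing.
    print(unicodedata.ucd_3_2_0.unidata_version)
    print(to_ascii("b\u00FCcher.example"))
    # U+1E9E was encoded after Unicode 3.2, so the frozen data does not know it.
    print(assigned_in_unicode_3_2("\u1E9E"))
```

Characters encoded after Unicode 3.2 fall outside what such a codec can process, which is the coverage gap the slide refers to.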

13 URL -> IRI. Internationalized Resource Identifier (IRI). iri/JP /.html = iri/JP%E7%B4%8D...%E8%B1%86.html (UTF-8, %-escaped). See
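
The mapping from an IRI path to its URI form (encode as UTF-8, then %-escape each non-ASCII byte) can be sketched with the standard library. The path used here is a made-up stand-in, since the original characters on the slide did not survive; `iri_path_to_uri_path` is a hypothetical helper name.

```python
from urllib.parse import quote, unquote

def iri_path_to_uri_path(path):
    """Percent-escape the UTF-8 bytes of non-ASCII characters in a path.

    '/' and '%' are left alone so that path separators and any existing
    escapes survive (a literal '%' is therefore not re-escaped).
    """
    return quote(path, safe="/%")

if __name__ == "__main__":
    iri_path = "iri/JP\u7D0D\u8C46.html"   # illustrative path, not from the slide
    uri_path = iri_path_to_uri_path(iri_path)
    print(uri_path)
    print(unquote(uri_path) == iri_path)    # the conversion round-trips
```

Each CJK character becomes three escaped bytes because its UTF-8 encoding is three bytes long.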

14 World Wide Web Consortium. Work areas: Web Services Internationalization; Language Tags and Locale Identifiers; Internationalization Tag Set; CSS WG on vertical text, etc. Many W3C specs being upgraded to include IRIs. Growing number of articles, tutorials and tests available. Find out more at

15 Ideographic Variation Database. U+82A6 ashi: multiple forms. The first occurrence: any glyph. The second occurrence is in the name of the town Ashiya, customarily displayed with form #4. Registration for variants.
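
In plain text, a specific registered form is requested by following the base ideograph with a variation selector. The sketch below builds such a sequence for U+82A6. Which selector corresponds to "form #4" is not stated in the source, so U+E0100 (VARIATION SELECTOR-17) is used purely as a placeholder; the function names are hypothetical.

```python
import unicodedata

BASE = "\u82A6"        # the ideograph 'ashi'
VS17 = "\U000E0100"    # placeholder selector; the real one comes from the registry

def is_variation_selector(ch):
    """True for VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def variation_sequence(base, selector):
    """Join a base character and a variation selector into one sequence."""
    return base + selector

def strip_variation_selectors(text):
    """Drop selectors, e.g. before comparing or searching text."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

if __name__ == "__main__":
    seq = variation_sequence(BASE, VS17)
    print([f"U+{ord(ch):04X}" for ch in seq])
    print(unicodedata.name(VS17))
    print(strip_variation_selectors(seq) == BASE)
```

A renderer without the registered glyph simply shows the default form of the base character, so the first occurrence ("any glyph") and the second ("form #4") remain the same character for searching and comparison.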

16 Unicode Members: Full

17 Unicode Members: Institutional & Supporting New membership levels, between Full and Associate

18 Unicode Members: Associate

19 Why Join? Support the technology that enables your success in international, technical, and emerging markets. Protect your investment: the stability you need, the extensions you require, the development you call for (security, …). Demonstrate your leadership in furthering the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.

