Globalization Gotchas

1 Globalization Gotchas
Mark Davis

2 Unicode Basics Unicode encodes characters, not glyphs:
U+0067 → g, whatever the font or style of the glyph. Unicode does not encode characters by language: French, German, and English j have the same code point even though all have different pronunciations; Chinese 大 (da) has the same code point as Japanese 大 (dai). UTF-8, UTF-16, and UTF-32 are all Unicode. The word character means different things to different people, so make clear which one you mean: glyphs, code points, bytes, code units, user-perceived characters (grapheme clusters),…
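The different meanings of "character" can be made concrete by measuring one string in several units. This is a minimal sketch in Python, used here only for illustration; `count_units` is a hypothetical helper, not something from the slides.

```python
def count_units(text: str) -> dict:
    """Return the length of `text` measured in several different units."""
    return {
        "code_points": len(text),  # a Python str is a sequence of code points
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf32_code_units": len(text.encode("utf-32-le")) // 4,
    }

x_diaeresis = "x\u0308"       # one user-perceived character: x + COMBINING DIAERESIS
musical_note = "\U0001D160"   # one code point above U+FFFF

print(count_units(x_diaeresis))   # 2 code points, 3 UTF-8 bytes, 2 UTF-16 units
print(count_units(musical_note))  # 1 code point, 4 UTF-8 bytes, 2 UTF-16 units
```

None of these counts equals the number of user-perceived characters in every case, which is why an API should say which unit it means.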

3 Unicode in APIs U+0000 to U+10FFFF: Be prepared to handle (at least not corrupt!) any incoming code points. A back-level system may get unassigned code points from later versions. Watch for "UCS-2" implementations: they use UTF-16 text but don't support characters above U+FFFF, and they may accidentally produce isolated surrogates. Some APIs/protocols count lengths in code points, and others in bytes (or other code units). Make sure you don't mix them up. Don't limit API parameters to a single character (and definitely not to a single code unit!). What users think of as a single character (e.g. ẍ, ch) may be a sequence in Unicode. Use the latest version of Unicode: it supports new characters, corrections, and more stability guarantees.
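One place where byte lengths and code points get mixed up is truncating text to fit a byte-limited field. The sketch below, in Python and with a hypothetical helper name `truncate_utf8`, cuts at a byte limit without leaving a partial code point behind. It assumes the input is a valid string and makes no attempt to keep grapheme clusters together.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut `text` to at most `max_bytes` of UTF-8 without splitting a code point."""
    cut = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops the incomplete trailing sequence left by the cut;
    # the rest of the data is valid because it came from a str.
    return cut.decode("utf-8", errors="ignore")

print(truncate_utf8("a\u00e9", 2))          # "a": the 2-byte é does not fit
print(truncate_utf8("a\U0001D160b", 4))     # "a": the 4-byte code point does not fit
print(truncate_utf8("abc", 10))             # "abc"
```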

4 Choice of Characters Character and block names may be misleading, e.g.,
U+034F COMBINING GRAPHEME JOINER doesn't join graphemes. Use U+2060 (word joiner) instead of U+FEFF (zero-width no-break space) for everything but the BOM function. Never use unassigned code points; those will be used in future versions of Unicode. Use only private use (PUA) code points or noncharacters instead (and only if necessary). If you do, minimize the opportunity for collision by picking an unusual range.

5 Character Conversion Always use "shortest form" UTF-8.
It's the Law. And if that isn’t enough, consider security attacks. If a protocol allows a choice of charsets, always tag correctly. Not all text is correctly tagged: character detection may be necessary. But remember, it's always a guess! Converting a database of mixed, untagged data is extremely painful. Bad assumptions: (1) length [bytes] = N × length [code points]; (2) 1 character [charset X] = 1 character [Unicode]. The ordering may also be different.
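The "shortest form" rule can be checked against a concrete decoder. This sketch uses Python's built-in UTF-8 codec, which rejects non-shortest ("overlong") sequences; `decode_strict_utf8` is a hypothetical wrapper name.

```python
def decode_strict_utf8(data: bytes):
    """Decode UTF-8, returning None for ill-formed input such as overlong forms."""
    try:
        return data.decode("utf-8")   # Python's decoder rejects non-shortest forms
    except UnicodeDecodeError:
        return None

overlong_slash = bytes([0xC0, 0xAF])  # non-shortest encoding of "/" (U+002F)

print(decode_strict_utf8(b"/"))             # /
print(decode_strict_utf8(overlong_slash))   # None
```

A decoder that accepted the second form would let "/" slip past any filter that only looks for the byte 0x2F, which is the kind of security attack the slide alludes to.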

6 Character Conversion II
IANA / MIME charset names are ill-defined: vendors often convert the same charset in different ways. Shift-JIS: 0x5C → U+005C (\) or U+00A5 (¥). Don’t simply omit unconvertible data; to reduce security problems, at least substitute: U+FFFD (when converting to Unicode) or 0x1A (when converting to bytes).
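The substitution advice can be sketched with Python's codec machinery. Decoding with `errors="replace"` yields U+FFFD; for encoding, Python's built-in `"replace"` handler emits `?`, so this sketch registers a custom handler to emit 0x1A instead. The handler name `"sub_0x1a"` and the function `substitute_sub` are made up for this example.

```python
import codecs

def substitute_sub(error):
    """Encode-error handler: emit one 0x1A (SUB) per unconvertible character."""
    if isinstance(error, UnicodeEncodeError):
        return ("\x1a" * (error.end - error.start), error.end)
    raise error

codecs.register_error("sub_0x1a", substitute_sub)  # hypothetical handler name

# Bytes -> Unicode: the ill-formed byte 0xFF becomes U+FFFD.
to_unicode = b"ab\xffcd".decode("utf-8", errors="replace")

# Unicode -> bytes: a character ASCII cannot represent becomes 0x1A.
to_bytes = "a\u65e5b".encode("ascii", errors="sub_0x1a")

print(repr(to_unicode))   # 'ab\ufffdcd'
print(to_bytes)           # b'a\x1ab'
```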

7 Properties Use properties such as Alphabetic, not hard-coded lists:
isAlphabetic(x); regex: \p{Alphabetic} or [:Alphabetic:]. Not (“A” ≤ x ≤ “Z” OR “a” ≤ x ≤ “z”). Some properties aren't what you think; use: White_Space, not General_Category=Zs; Alphabetic, not General_Category=L; Lowercase, not General_Category=Ll; Script=Greek, not Block=Greek. Characters may change property values between versions of Unicode.
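A short Python sketch of why hard-coded ranges fail. One caveat is itself an instance of the slide's warning: Python's `str.isalpha()` is documented as testing General_Category letters (L*), not the Unicode Alphabetic property, so it only approximates `isAlphabetic(x)`. The helper `is_ascii_letter` is hypothetical.

```python
import unicodedata

def is_ascii_letter(ch: str) -> bool:
    """The hard-coded range test the slide warns against."""
    return "A" <= ch <= "Z" or "a" <= ch <= "z"

for ch in ["k", "\u00e9", "\u0436", "\u5927"]:   # k, é, ж, 大
    print(ch, is_ascii_letter(ch), ch.isalpha())

# White_Space is broader than General_Category=Zs: TAB is whitespace but is Cc.
print(unicodedata.category("\t"), "\t".isspace())            # Cc True
print(unicodedata.category("\u00a0"), "\u00a0".isspace())    # Zs True
```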

8 Identifiers & Tokens When designing syntax, use as a base:
Pattern_Syntax for operators / relations; Pattern_White_Space for gaps; XID_Start and XID_Continue for identifiers. All are backwards compatible across versions. Profiles may expand or narrow from the base. Watch out for security attacks: “” with a Cyrillic “a”. See Unicode Security at this conference.
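Python's own identifier syntax is built on XID_Start and XID_Continue, so `str.isidentifier()` gives a quick way to try the identifier rule. The spoofing check below is only a crude sketch: it guesses a script from the first word of each character's Unicode name, which is an assumption made for brevity and not the Script property. The spoofed string is a made-up example.

```python
import unicodedata

def scripts_by_name(text: str) -> set:
    """Crude script guess: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

print("gr\u00f6\u00dfe".isidentifier())   # True: größe is a valid identifier
print("2nd".isidentifier())               # False: a digit is not XID_Start

spoofed = "p\u0430ypal"   # made-up example; second letter is CYRILLIC SMALL LETTER A
print(spoofed == "paypal")          # False, although the two usually render alike
print(scripts_by_name(spoofed))     # contains both LATIN and CYRILLIC
```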

9 Comparison (Collation): Searching, Sorting, Matching
There are two binary orders: code point order = UTF-8 order = UTF-32 order ≠ UTF-16 order. Don’t present users with binary order! No users expect A < Z < a < z < Ç < ä. Apply normalization to get a unique form, so Å = Å. Security issues: protocols must precisely define the comparison operations. E.g., LDAP doesn't, so lookup may fail (or falsely succeed!). Aside from giving wrong results, this is an opening for security attacks.
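The two binary orders, and the effect of normalization on comparison, can be demonstrated with a few lines of Python. The helper `sort_by` is hypothetical; big-endian UTF-16 bytes are used because they compare the same way as 16-bit code units.

```python
import unicodedata

bmp_char = "\uFFFD"            # near the top of the Basic Multilingual Plane
supplementary = "\U00010000"   # first code point above U+FFFF

def sort_by(strings, encoding=None):
    """Sort by code point (encoding=None) or by the bytes of a given encoding."""
    key = (lambda s: s.encode(encoding)) if encoding else None
    return sorted(strings, key=key)

pair = [supplementary, bmp_char]
code_point_order = sort_by(pair)
utf8_order = sort_by(pair, "utf-8")
utf16_order = sort_by(pair, "utf-16-be")

print(code_point_order == utf8_order)    # True
print(code_point_order == utf16_order)   # False: surrogates D800 DC00 sort before FFFD

# Normalization gives canonically equivalent spellings of Å a single form.
spellings = ["\u00C5", "A\u030A", "\u212B"]   # precomposed, A + combining ring, ANGSTROM SIGN
normalized = {unicodedata.normalize("NFC", s) for s in spellings}
print(len(normalized))                   # 1
```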

10 Language-Sensitive Comparison
Use UCA order as a base to meet user expectations: a < A < ä < Ç = C◌̧ < z < Z. Real language-sensitive order requires tailoring on top of UCA; ordering depends on context and language: china < China < chinas < danish; ae < æ < af; z < æ (Danish); c < d < ... h < ch < i (Slovak). Follow UCA for substring match offsets; there are some gotchas here. Don't mix up "stable" and "deterministic" sorting: they are very different.
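The stable/deterministic distinction can be sketched without a collation library. Here `str.casefold` stands in for a real UCA sort key, which is an assumption made to keep the example self-contained. As these terms are commonly used for collation, a stable sort keeps records with equal keys in their input order, while a deterministic comparison breaks ties so that distinct strings never compare equal.

```python
def stable_sort(words):
    """Equal keys keep their input order (Python's sort is stable)."""
    return sorted(words, key=str.casefold)

def deterministic_sort(words):
    """Ties on the primary key are broken by code point order."""
    return sorted(words, key=lambda w: (w.casefold(), w))

first = ["b", "a", "A"]
second = ["b", "A", "a"]

print(stable_sort(first), stable_sort(second))                # result depends on input order
print(deterministic_sort(first), deterministic_sort(second))  # result is always the same
```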

11 Normalization (NFC,…) Standardized normalized forms defined by Unicode. The ordering of accents in a normalization form may not be the typical type-in order. Fonts should handle both orders. Normalization is context independent Don't assume NFC(x + y) = NFC(x) + NFC(y) People assume that NFC always composes, but some characters decompose in NFC. Trivia: In Unicode 4.1 there are exactly 3 characters that are different in all 4 normalization forms: ϓ, ϔ, ẛ
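The normalization claims on this slide can be checked with Python's `unicodedata` module. Note that Python ships a later Unicode version than 4.1, so this sketch verifies only that the three trivia characters differ in all four forms, not that there are exactly three such characters. The helper names are hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def distinct_forms(ch: str) -> int:
    """Number of different results across the four normalization forms."""
    return len({unicodedata.normalize(form, ch) for form in FORMS})

# NFC is not closed under concatenation: a + combining acute composes to á.
x, y = "a", "\u0301"
print(nfc(x + y) == nfc(x) + nfc(y))      # False

# Some characters decompose even in NFC (composition exclusions), e.g. U+0958.
print(len(nfc("\u0958")))                 # 2

# The slide's trivia characters: ϓ U+03D3, ϔ U+03D4, ẛ U+1E9B.
print([distinct_forms(ch) for ch in "\u03D3\u03D4\u1E9B"])   # [4, 4, 4]
```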

12 Maximum Expansion (U4.1)
Operation      UTF      Factor   Sample
NFC            8        3X       𝅘𝅥𝅮 U+1D160
NFC            16, 32   3X       U+FB2C
NFD            8        3X       ΐ U+0390
NFD            16, 32   4X       U+1F82
NFKC / NFKD    8        11X      U+FDFA
NFKC / NFKD    16, 32   18X      U+FDFA
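The expansion factors for this slide's sample code points can be recomputed with a few lines of Python. `expansion` and `ENCODINGS` are hypothetical names, and the results reflect the Unicode version bundled with the Python runtime, not 4.1 specifically.

```python
import unicodedata

# Codec name and code unit size in bytes for each encoding form.
ENCODINGS = {"UTF-8": ("utf-8", 1), "UTF-16": ("utf-16-le", 2), "UTF-32": ("utf-32-le", 4)}

def expansion(ch: str, form: str, utf: str) -> float:
    """Code units after normalization divided by code units before."""
    codec, unit_size = ENCODINGS[utf]
    before = len(ch.encode(codec)) // unit_size
    after = len(unicodedata.normalize(form, ch).encode(codec)) // unit_size
    return after / before

print(expansion("\U0001D160", "NFC", "UTF-8"))   # 3.0
print(expansion("\u1F82", "NFD", "UTF-16"))      # 4.0
print(expansion("\uFDFA", "NFKC", "UTF-8"))      # 11.0
print(expansion("\uFDFA", "NFKD", "UTF-32"))     # 18.0
```

Factors like these give an upper bound for sizing an output buffer before normalizing.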

13 Case Conversion Not a simple 1:1 mapping
Title case: ǳ ↔ Ǳ ↔ ǲ (U+01F3, U+01F1, U+01F2). Expansion: heiß → HEISS → heiss. Context-dependent: ΌΣΟΣ → όσος. Language-dependent: istanbul ↔ İSTANBUL. Warning: never use language-dependent casing for language-independent structures, like file-system B-Trees.
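Most of these cases can be reproduced with Python's built-in string methods, which apply the language-independent full case mappings. Python's `str` methods take no locale, so the Turkish result İSTANBUL is not reachable this way; that would need a library such as ICU. The digraph code points below are an assumption about what the slide's title-case example intended.

```python
# Title case is a third case: the digraph ǳ (U+01F3) has distinct upper and title forms.
dz = "\u01F3"
print(dz.upper() == "\u01F1", dz.title() == "\u01F2")    # True True

# Expansion: one character becomes two, and the round trip does not restore it.
print("hei\u00df".upper())            # HEISS
print("HEISS".lower())                # heiss

# Context-dependent: capital sigma lowercases to σ inside a word and ς at the end.
osos_upper = "\u038C\u03A3\u039F\u03A3"                   # ΌΣΟΣ
print(osos_upper.lower() == "\u03CC\u03C3\u03BF\u03C2")   # True (όσος)

# Language-independent mappings only: Turkish would expect a dotted capital İ here.
print("istanbul".upper())             # ISTANBUL
print("\u0130".lower() == "i\u0307")  # True: İ lowercases to i + combining dot above
```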

14 Casing: Maximum Expansion
Operation              UTF         Factor   Sample
Lower                  8           1.5X     Ⱥ U+023A
Lower                  16, 32      1X       A U+0041
Upper / Title / Fold   8, 16, 32   3X       ΐ U+0390
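The two sample characters can be checked directly in Python. This is a small sketch, not a proof that these are the maxima; `utf8_len` is a hypothetical helper.

```python
def utf8_len(s: str) -> int:
    return len(s.encode("utf-8"))

a_stroke = "\u023A"              # Ⱥ lowercases to U+2C65, which needs 3 bytes instead of 2
print(utf8_len(a_stroke.lower()) / utf8_len(a_stroke))    # 1.5

iota_dialytika_tonos = "\u0390"  # ΐ has no single-code-point uppercase
print(len(iota_dialytika_tonos.upper()))      # 3
print(len(iota_dialytika_tonos.casefold()))   # 3
```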

15 Case Conversion II Case folding was not stable.
Different results from toCaseFold(S) between two versions. Stability is now guaranteed in Unicode 5.0. Don't use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) values of General_Category: these were constrained to be in a partition. Use the separate binary properties Lowercase and Uppercase instead.
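A brief Python sketch of both points: case folding, not lowercasing, is the operation for caseless matching, and General_Category alone does not tell you whether a character is lowercase. `caseless_equal` is a hypothetical helper that ignores normalization for brevity.

```python
import unicodedata

def caseless_equal(a: str, b: str) -> bool:
    """Compare using full case folding (no normalization step in this sketch)."""
    return a.casefold() == b.casefold()

print("hei\u00df".lower() == "HEISS".lower())   # False: ß survives lowercasing
print(caseless_equal("hei\u00df", "HEISS"))     # True: both fold to "heiss"

# ⅰ (U+2170) is lowercase in form and has an uppercase mapping, yet is not Ll.
small_roman_one = "\u2170"
print(unicodedata.category(small_roman_one))    # Nl
print(small_roman_one.upper() == "\u2160")      # True

# Lt is Titlecase_Letter, a separate value from Lu.
print(unicodedata.category("\u01C5"))           # Lt
```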

16 Lowercase / Uppercase: Form vs Function
Lowercase, the binary property: The character is lowercase in form, but not necessarily in function. Functionally Lowercase: isCased(x) & isLowercase(x). See Section 3.13 of TUS.

17 Lowercase: Form vs Function

18 Segmentation What a user thinks of as a character is often a sequence. Words are not just sequences of letters. Lines don’t just break at spaces. All may be language-dependent.
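To illustrate why a "character" is often a sequence, the sketch below groups each base character with the combining marks that follow it. This is a deliberately crude approximation and not the Unicode grapheme cluster algorithm: it ignores Hangul syllables, emoji sequences, and marks with combining class 0. `rough_clusters` is a hypothetical name; production code would use a segmentation library such as ICU.

```python
import unicodedata

def rough_clusters(text: str) -> list:
    """Group each base character with following marks of nonzero combining class."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

word = "x\u0308y"                   # ẍ (as x + combining diaeresis) followed by y
print(len(word))                    # 3 code points
print(len(rough_clusters(word)))    # 2 approximate user-perceived characters
```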

19 Transliteration Transliteration (Ελληνικά ↔ Ellēniká) ≠ translation (Ελληνικά ↔ Greek). Transliteration may vary by language: Путин ↔ Putin, Poutine, ...; Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov, Gorbatsov, Gorbatschow, ... Watch for terminology: “lossy” vs “lossless”. Lossy transliteration: Ελληνικά → Ellinika → Ελλινικα. In ISO terms: “transliteration” = lossless transliteration; “transcription” = lossy transliteration.

20 Rendering is Contextual
Processing character-by-character gives the wrong results! Glyphs may change shape Multiple characters → 1 glyph One character → multiple glyphs

21 Rendering II Good rendering systems will handle customary type-in order for text plus canonical order. Excellent ones will handle any canonically equivalent order, but those are rare. There may be differences in the customary glyphs for different languages; specify the font or the language where they have to be distinguished. Security issues: Never render a missing glyph as "?". Don't simply overlay diacritics: it can cause security problems.

22 Globalization Unicode ≠ Globalization (aka Internationalization, Localizability) Unicode provides the basis for software globalization, but there's more work to be done... Use globalization APIs: Formatting and parsing of dates, times, numbers, currencies; comparison of text; calendar systems; ... are locale-dependent. Where OS facilities are not adequate or cross-platform solutions are needed, use ICU (C, C++, Java) Don't put any translatable strings into your code; separate into resource files. Provide context to translators: is Mark a noun, a verb, or a name… Don’t use the same string in different contexts unless the meaning is identical (including references). Note: User-Interface language (menus, dialog, help-system,...) ≠ Data language (body text, spreadsheet cells). Programs need to handle, as data, more languages than in localized UI

23 Common Globalization Mistakes
Never compile Windows apps as “ANSI” (the default!). Don't simply concatenate strings to make messages: the order of components differs by language, so use Java MessageFormat, or structure the UI as separate fields. Don't assume icons and symbols mean the same around the world. Don't assume everyone can read the Latin alphabet. Allocate space flexibly: “OK” in English → “Aceptar” in Spanish. English is a relatively compact language; others may require more characters (e.g. in database fields) and more screen real estate (in UIs). Beware of discrepancies in “fallback” behavior: Java ResourceBundle (J2SE), JSP Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP,...
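The concatenation problem can be sketched with named placeholders, which let each translation put the components in its own order. This Python example stands in for the idea behind Java MessageFormat and is not that API; the `TEMPLATES` table and `render` function are hypothetical, and plural handling is deliberately ignored.

```python
# Each language supplies a whole-sentence template; translators control word order.
TEMPLATES = {
    "en": "{count} files in folder {folder}",
    "de": "Im Ordner {folder} befinden sich {count} Dateien",
}

def render(lang: str, **fields) -> str:
    """Fill the template for `lang`; in real code the templates live in resource files."""
    return TEMPLATES[lang].format(**fields)

print(render("en", count=3, folder="Photos"))
print(render("de", count=3, folder="Fotos"))
```

Building the same message as `str(count) + " files in folder " + folder` would hard-code English word order into the program.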

24 Neutral Formats Store and transmit neutral-format data wherever possible. Convert that data to the user's preferred formats as "close" to the user as possible.
Type               Example               Rec. Standard
Language/Locale*   en-US (en_US)         RFC 3066 bis / CLDR
Territory          AU                    RFC 3066 bis
Currency           EUR                   ISO 4217
Timezone           Australia/Melbourne   TZDB
Calendar           islamic-civil         CLDR Calendar ID
Custom Date        yyyy-MMM-dd           CLDR Pattern Format
Binary Time        8C80E9E3967A4B0       Windows File Time

25 Identification Locale IDs are extensions of language IDs; use CLDR. Don't assume that everyone in a country always uses that country’s currency. Always use an explicit currency ID (ISO 4217). <RUR, ×10³> ↔ 1 234,57р. in Russian, but Rub 1, in English. Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names. If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g., from browser settings), make sure the user can override that and pick an explicit value.

26 Unicode Guide Authoritative but lightweight
Introduction, overview, and quick reference Main principles of the Unicode Standard Best practices in Software Globalization

27 Other Resources Unicode Site; An Overview of ICU; Globalizing Software; W3C Internationalization; Microsoft Global Software Development

28 Q&A

29 Backup Slides

30 User Input  If you develop your own text editor, use the OS APIs to handle IMEs (Input Method Engines) for Chinese, Japanese, Korean,... If you are using "type-ahead" to get to a position in a list (eg typing "Jo" gets to the first element starting with those characters), allow arbitrary input. This is often easiest with visible fields. If your password field can contain characters that require an IME, a screen pop-up box may reveal the password to onlookers.

31 Dotted and Dotless I
[Diagram: uppercase ⇄ lowercase mappings, normal vs. Turkic, among I (U+0049), i (U+0069), İ (U+0130), ı (U+0131), and the sequences I + ˙, i + ˙, İ + ˙]

32 Java In MessageFormat, watch for words like can't, since ASCII ' has syntactic meaning. Use a real apostrophe (U+2019) where possible: can’t. In Date and Calendar, the months are numbered from 0 (February is month number 1!). However, weeks and days are numbered from 1. Java serialized text isn't UTF-8, though it's close: U+0000 and supplementary code points are encoded differently. Java globalization support is pretty outdated: use ICU to supplement it. Java ResourceBundle (J2SE), JSP Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc. all provide some locale determination mechanism and facility, but they all differ in details.

33 JavaScript Always encode characters above U+007F with escapes (\uxxxx). There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented. The JDK tool native2ascii can be used to convert the files to use escapes.
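The escaping step can be sketched in a few lines of Python. This is an illustration of the idea and not a replacement for native2ascii: it escapes every UTF-16 code unit above U+007F, so supplementary code points come out as surrogate pairs, which is what `\uxxxx` escapes in JavaScript expect. `to_js_escapes` is a hypothetical name.

```python
def to_js_escapes(source: str) -> str:
    """Replace every UTF-16 code unit above U+007F with a \\uxxxx escape."""
    units = source.encode("utf-16-be")
    out = []
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        out.append(chr(unit) if unit <= 0x7F else f"\\u{unit:04x}")
    return "".join(out)

print(to_js_escapes("caf\u00e9"))       # caf\u00e9
print(to_js_escapes("\U0001D160"))      # \ud834\udd60
```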
