1 Character Conversions and Mapping Tables
Presented By:
Markus Scherer markus.scherer@us.ibm.com
George Rhoten grhoten@us.ibm.com
Raghuram (Ram) Viswanadha ramv@us.ibm.com
Globalization Center of Competency, Cupertino, CA

2 Agenda
Introduction
Terminology & Concepts
Problems
Solutions & Tools
Summary

3 Introduction
Text data used to be contained on a single computer system
Now text data is exchanged among different systems
Each type of system used different ways to encode text
Exchanging this text data requires a conversion
Text data is increasingly machine processed
Main emphasis on Unicode text processing

4 Terminology
System
Character set
Code point
Encoding/Charset
Character mapping
Alias

5 Concept of Character Mapping
[Diagram: columns "Unicode" and "Character Set"; characters shown: A, Á, A; mappings labeled "fallback" and "roundtrip"]

6 Character Mapping (continued)
[Diagram: columns "Unicode" and "Character Set"; characters shown: V, V; mappings labeled "reverse fallback" and "roundtrip"]
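The roundtrip, fallback, and reverse fallback labels on the two mapping slides above can be sketched with a toy table. Everything in this sketch is hypothetical: the byte values are invented for illustration, and the numeric flags (0 roundtrip, 1 fallback, 3 reverse fallback) are borrowed from the precision indicators used in ICU .ucm mapping files. It is not a real code page.

```python
# Flag values follow the precision indicators of ICU .ucm files (assumption:
# only these three are modeled here).
ROUNDTRIP, FALLBACK, REVERSE_FALLBACK = 0, 1, 3

# Hypothetical toy table: (Unicode character, charset byte, flag).
TOY_TABLE = [
    ("A", 0x41, ROUNDTRIP),
    ("V", 0x56, ROUNDTRIP),
    ("\u00c1", 0x41, FALLBACK),       # Á -> A, Unicode-to-charset only
    ("V", 0x86, REVERSE_FALLBACK),    # invented byte, charset-to-Unicode only
]


def build_maps(table, use_fallbacks):
    """Build (to_bytes, to_unicode) dictionaries from the table."""
    to_bytes, to_unicode = {}, {}
    for ch, byte, flag in table:
        if flag == ROUNDTRIP:
            to_bytes[ch] = byte
            to_unicode[byte] = ch
        elif flag == FALLBACK and use_fallbacks:
            to_bytes[ch] = byte
        elif flag == REVERSE_FALLBACK and use_fallbacks:
            to_unicode[byte] = ch
    return to_bytes, to_unicode


def encode(text, to_bytes):
    """Raises KeyError for a character the table cannot map."""
    return bytes(to_bytes[ch] for ch in text)


def decode(data, to_unicode):
    """Raises KeyError for a byte the table cannot map."""
    return "".join(to_unicode[b] for b in data)


to_bytes, to_unicode = build_maps(TOY_TABLE, use_fallbacks=True)
lossy = decode(encode("\u00c1", to_bytes), to_unicode)  # the accent is lost
```

With fallbacks enabled, Á survives the conversion only as A, so the text no longer roundtrips. With `use_fallbacks=False` the same input raises an error instead of silently changing.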

7 Doing a Conversion
[Diagram: Unicode and ISO-8859-1]
Repertoires: superset/subset
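The superset/subset relationship between Unicode and ISO-8859-1 can be checked directly. This is a minimal sketch using Python's built-in codecs rather than any tool named in the presentation. The helper name `fits_in` is made up for this example.

```python
def fits_in(text, encoding):
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding, errors="strict")
    except UnicodeEncodeError:
        return False
    return True


# Subset -> superset: every ISO-8859-1 byte has a Unicode code point,
# and converting back reproduces the original bytes.
all_latin1_bytes = bytes(range(256))
as_unicode = all_latin1_bytes.decode("iso-8859-1")
lossless = as_unicode.encode("iso-8859-1") == all_latin1_bytes

# Superset -> subset: only works when the text stays inside the repertoire.
inside = fits_in("caf\u00e9", "iso-8859-1")    # é is in ISO-8859-1
outside = fits_in("\u20ac", "iso-8859-1")      # the euro sign is not
```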

8 Doing a Conversion (continued)
[Diagram: Unicode, IBM SJIS, Sun SJIS; label: 99.8% Same]
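Two nearly identical Shift-JIS tables can be compared byte pair by byte pair. The sketch below uses Python's `shift_jis` and `cp932` codecs as stand-ins, since they are readily available. They are not the IBM and Sun tables from the slide, so the differences found here will not match the 99.8% figure.

```python
def try_decode(raw, encoding):
    """Decode raw bytes, returning None when the table has no mapping."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def differing_double_bytes(enc_a, enc_b):
    """Map each double-byte sequence to (a, b) where the two tables disagree.

    A disagreement is either two different characters or a sequence that
    only one of the two tables defines.
    """
    lead_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))
    diffs = {}
    for lead in lead_bytes:
        for trail in range(0x40, 0xFD):
            if trail == 0x7F:          # never a valid trail byte
                continue
            raw = bytes([lead, trail])
            a = try_decode(raw, enc_a)
            b = try_decode(raw, enc_b)
            if a != b:
                diffs[raw] = (a, b)
    return diffs


diffs = differing_double_bytes("shift_jis", "cp932")
wave = diffs.get(b"\x81\x60")  # same bytes, two different Unicode characters
```

Most byte pairs agree, which is exactly why the remaining differences are easy to overlook until data is lost.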

9 Text Data Exchange Problems
Unable to read text from another system
Unable to write correct text for other processes
Loss of text data because of mistakes
  – Maybe partial loss of data due to rare and obscure details
  – Happens more often to multibyte and stateful encodings
New Unicode character added and mapping changes
  – Character was mapped to PUA
  – Character is now mapped to a new Unicode character

10 Problems (continued)
Support of different repertoires of characters
Different text encoding models
  – Different bidi text models
      Visual order
      Logical order
      Explicit embedding
  – Composed and decomposed characters
  – Shaping (Arabic)
  – Reordering (Indic, Thai, etc.)
  – Different ligatures
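The composed versus decomposed problem from the list above can be illustrated with Unicode normalization. This is a minimal sketch using Python's `unicodedata` module. The helper name `encode_composed` is invented for this example.

```python
import unicodedata

composed = "\u00c1"                                   # Á as one code point
decomposed = unicodedata.normalize("NFD", composed)   # A + U+0301 combining acute


def encode_composed(text, encoding):
    """Compose the text (NFC) first so precomposed characters can be mapped."""
    return unicodedata.normalize("NFC", text).encode(encoding)


# The composed form maps to a single ISO-8859-1 byte. The decomposed form
# contains U+0301, which ISO-8859-1 does not have, so a strict conversion
# of the raw decomposed text would fail.
as_latin1 = encode_composed(decomposed, "iso-8859-1")
```

Both strings display the same way, but only one of them converts unless the converter or the caller normalizes first.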

11 Examples
µ μ    Micro symbol (U+00B5) vs. Greek Mu (U+03BC)
- –    Hyphen-Minus (U+002D) vs. En Dash (U+2013)
\ ¥    Backslash (U+005C) vs. Yen symbol (U+00A5)
~ ¯    Tilde (U+007E) vs. Overline (U+00AF)
NUL    Graphical display of control characters
NL LF  Newline swapped with Linefeed
ISO Control rotation 0x1C 0x1A 0x7F
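The first pair above, the micro symbol and Greek mu, look identical but are distinct code points that convert differently. Here is a short sketch with Python's standard library. The helper name `describe` is made up for this example.

```python
import unicodedata

MICRO = "\u00b5"
MU = "\u03bc"


def describe(ch):
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def can_encode(ch, encoding):
    """Return True if the encoding has a mapping for this character."""
    try:
        ch.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


micro_in_latin1 = can_encode(MICRO, "iso-8859-1")   # the micro symbol is in ISO-8859-1
mu_in_latin1 = can_encode(MU, "iso-8859-1")         # the Greek letter is not

# Compatibility normalization (NFKC) folds the micro symbol into Greek mu,
# which is one way two systems can end up with different code points.
folded = unicodedata.normalize("NFKC", MICRO)
```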

12 Reasons For Problems
Different mapping tables
Fallback supported inconsistently
Mapping tables were not shared
Mapping tables were not published in machine-readable format
Aliases
  – Existing registries (IANA, MIME, …) do not specify precise mappings
  – Different mapping tables for the same name (CP943, SJIS)
  – Different names for the same character set
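The alias problem can be observed in any converter library that resolves names to tables. The sketch below uses Python's `codecs.lookup` purely as an illustration. How Python resolves these labels is specific to Python and says nothing about how ICU or a given platform resolves them.

```python
import codecs


def canonical_name(label):
    """Return the name of the codec that a label resolves to in Python."""
    return codecs.lookup(label).name


# Different names for the same character set: all resolve to one codec.
latin1_labels = ["latin1", "ISO-8859-1", "l1"]
latin1_names = {canonical_name(label) for label in latin1_labels}

# Similar-looking Shift-JIS labels resolve to two different mapping tables.
sjis_name = canonical_name("sjis")
ms932_name = canonical_name("ms932")
```

A label alone therefore does not say which mapping table the other system used.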

13 Solutions
Use precise names
Use precise mapping tables
Avoid fallbacks (controllable, e.g., with ICU)
Share the character set mappings
  – e.g. format: http://www.unicode.org/unicode/reports/tr22/

14 Solutions (continued)
Do safe conversions
  – Exact subsets and supersets
  – Use precise replacements for unavailable characters (NCRs and escapes)
  – Algorithmic
      JIS X 0208: SJIS, EUC-JP, ISO 2022-JP
      All Unicode encodings among each other
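Precise replacements and algorithmic conversions can both be sketched with Python's built-in error handlers and codecs. This is an illustration of the idea, not the mechanism used by ICU. The sample string is arbitrary.

```python
import html

text = "A\u00c1\u03bc"   # A, Á, Greek mu

# Numeric character references (NCRs) keep unavailable characters recoverable.
ncr = text.encode("ascii", errors="xmlcharrefreplace")
ncr_restored = html.unescape(ncr.decode("ascii"))

# Backslash escapes are another precise, reversible replacement.
escaped = text.encode("ascii", errors="backslashreplace")
escaped_restored = escaped.decode("unicode_escape")

# Conversions among Unicode encodings are algorithmic and lossless.
unicode_roundtrips = all(
    text.encode(enc).decode(enc) == text
    for enc in ("utf-8", "utf-16", "utf-32")
)
```

Unlike a `?` substitution, both replacement forms preserve the exact code point, so the receiver can restore the original text.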

15 Tools
ICU (International Components for Unicode)
  – Feature-rich converter API
  – Allows matching the conversion behavior of most other systems
  – http://oss.software.ibm.com/icu/
Unicode mapping table repositories
  – http://www.unicode.org/Public/MAPPINGS/
  – http://oss.software.ibm.com/icu/charset/
iconv() and other platform converters

16 Summary
Text data exchange can result in loss of data
Using Unicode is safe without a conversion
Conversion mapping tables are unsafe
Use ICU
Thank you for listening
Are there any questions?

