Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

License: This presentation and its associated materials are licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 License. You may use these materials without obtaining permission from the author. Any materials used or redistributed must contain this notice. [Derivative works may be permitted with permission of the author.] This work is copyright © 2008-2011 by Addison P. Phillips.

Presenter and Presentation Addison Phillips – Globalization Architect, Lab126 This Presentation – Part I of the Internationalization and Unicode Conference tutorial : “Internationalization: An Introduction” Character Encodings and Unicode

Who is this guy? Globalization Architect, Lab126 We make the technology behind the Kindle Chair, W3C Internationalization WG

Internationalization is: the design and development of a product that is enabled for target audiences that vary in culture, region, or language. [W3C] a fundamental architectural approach to software development

Mystic Numbering (M4C N7G) The abbreviation keeps the first and last letters and replaces the letters in between with their count: I + 18 letters (N T E R N A T I O N A L I Z A T I O) + N = I18N. Opinions differ on capitalization (C12N); choose from: i18N, I18n, I18N. Very geeky; not very internationalized (I19G?). Localization = L10N; Globalization = G11N; Canonicalization = C14N; Accessibility = A11Y.
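
The counting rule is easy to check mechanically. The following is a minimal sketch; the function name `numeronym` is made up for this illustration and is not part of any standard library.

```python
def numeronym(word: str) -> str:
    """Keep the first and last letters; replace the middle with its length."""
    if len(word) < 3:
        return word  # nothing in the middle to count
    return f"{word[0]}{len(word) - 2}{word[-1]}"


for term in ("internationalization", "localization", "globalization",
             "canonicalization", "accessibility"):
    print(term, "->", numeronym(term))
```

Capitalization is left to the caller, since opinions on it differ.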

CHARACTER ENCODINGS The basics of text processing in software.

The Biggest Source of Woe “Character encodings consume more than 80% of my work day. They are the source of more mis-information and confusion than any other single thing. And developers aren’t getting any better educated.” ~Glen Perkins Globalization Architect

A lot of jargon Real Jargon Multibyte Variable width Wide character Character encoding Coded character set Bidi or bidirectional Glyph, character, code unit Unicode Potentially Bogus Jargon kanji double-byte language extended ASCII ANSI, OEM encoding agnostic

How the computer sees the world “bits”: 010000010101101101101000 “byte” or “octet”: 01000001 (0x41) code unit: a unit of physical storage and information interchange. Code units represent numbers and come in various sizes (e.g. 7, 8, 16, 32, 64 bits). How do we map text to the numbers used by computers?

From text to bits Glyphs – A “glyph” is a screen unit of text: it’s a picture of what users think of as a character. – A “grapheme” is a single visual unit of text. Characters – A “character” is a single logical unit of text. – A “character set” is a set of characters. – A “code point” is a number assigned to a character in a character set. – A “coded character set” is a character set where each character has a code point. Bytes – A “character encoding form” maps a sequence of code points (“characters”) to a sequence of code units (such as bytes). – A “code unit” is a single logical unit of storage. Example: the character À has the code point U+00C0, which UTF-8 stores as the bytes 0xC3 0x80.
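
The character, code point, and code unit layers can be inspected directly. This is a small sketch using Python's built-in codecs; the variable names are only for illustration.

```python
ch = "\u00c0"                      # the character À

code_point = ord(ch)               # number assigned in the coded character set
utf8 = ch.encode("utf-8")          # code units (bytes) in the UTF-8 encoding form
latin1 = ch.encode("iso-8859-1")   # same character, different encoding

print(f"U+{code_point:04X}")       # code point in the usual hex notation
print(utf8.hex(" "), "|", latin1.hex(" "))
```

The same code point yields different byte sequences depending on the encoding form chosen.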

Coded Character Set Collection ( repertoire ) of characters, that is: a set. Organized so that each character has a unique numeric (typically integer) value ( code point ). Examples: – Unicode – ASCII (ANSI X3.4) – ISO 646 – JIS X 208 – Latin-1 (ISO 8859-1)

Character Encoding Form Maps a sequence of code points (characters) to a sequence of code units (e.g. bytes). – Some encoding forms use another code unit instead of the byte. For example, some encoding forms use a 16-bit, 32-bit, or 64-bit code unit. Example: U+00C0 maps to 0xC3 0x80. Often shortened as “character encoding”, “encoding form”, or, confusingly, “charset”

*(the most important slide in this presentation) All text has a character encoding. When things go wrong, start by asking what the encoding is, what encoding you expected it to be, and whether the bytes match the encoding. In memory, on disk, on the network, etc.

Common Encoding Problems Tofu: hollow boxes. Mojibake: garbage characters. Question marks: conversion not supported.

It can happen to anyone…

Tofu Can appear as either hollow boxes (empty glyph) or as question marks (Firefox, for example) Not usually a bug: it’s a display problem Can mask or masquerade as character corruption.

Mojibake When Good Characters Go Bad

Sources of Mojibake View text using the wrong encoding Apply a transfer encoding and forget to remove it Convert to an encoding twice Convert to or from the wrong encoding Overzealous escaping Conversion to entities ( “ entitization ” ) Multiple conversions
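
Two of these sources, viewing text with the wrong encoding and converting twice, can be reproduced in a few lines. The sketch below assumes the bytes are really UTF-8 and are mistakenly read as Windows code page 1252; the variable names are illustrative.

```python
original = "\u00c0"                              # À

# View UTF-8 bytes using the wrong encoding: one character becomes two.
wrong_view = original.encode("utf-8").decode("cp1252")

# Convert the already-garbled text to UTF-8 again: the damage is now stored.
double_encoded = wrong_view.encode("utf-8")

# If nothing was lost, reversing the wrong step recovers the text.
repaired = wrong_view.encode("cp1252").decode("utf-8")

print(repr(wrong_view), double_encoded.hex(" "), repr(repaired))
```

The repair only works when the wrong decoding was lossless; once question marks have replaced characters, the original bytes are gone.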

Character Encoding Forms Their theory, structure, and use

EBCDIC

ASCII 7 bits = 2^7 = 128 characters. Enough for “U.S. English”

Latin-1 (ISO 8859-1) ASCII for characters 0x00 through 0x7F Accented letters and other symbols 0x80 through 0xFF

One character, many character sets and many character encodings! The character È is 0xC8 in cp1252 and 0xD4 in cp850.
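
The same comparison can be made with Python's bundled codecs (`cp1252` and `cp850` are the codec names Python uses for these code pages). A minimal sketch:

```python
ch = "\u00c8"  # È

# One character, several byte values depending on the encoding.
for codec in ("cp1252", "cp850", "iso-8859-1", "utf-8"):
    print(f"{codec:>10}: {ch.encode(codec).hex(' ')}")

# The reverse also holds: one byte value, several characters.
same_byte = b"\xd4"
as_cp1252 = same_byte.decode("cp1252")  # a different letter
as_cp850 = same_byte.decode("cp850")    # È again
print(as_cp1252, as_cp850)
```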

Windows Code Pages Windows ’ s encodings (called “ code pages ” ) are generally based on standard encodings — plus some additional characters. Example: CP 1252 is based on ISO 8859-1, but includes 27 “ extra ” characters in the C1 control range (0x80- 0x9F)

Code Page Originally an IBM character encoding term. IBM numbered their character sets with “CCSIDs” (coded character set ids) and numbered the corresponding character encoding forms as “code pages”. Microsoft borrowed code pages to create PC-DOS. Microsoft defines two kinds of code pages: “ANSI” code pages are the ones used by Windows GUI programs. “OEM” code pages are the ones used by command shell/command line programs. Neither “ANSI” nor “OEM” refers to a particular encoding standard or standards body in this context. Avoid the use of ANSI and OEM when referring to encodings.

Beyond Single Byte Encodings So far we’ve been looking at single-byte encodings: – one byte per character – 1 byte = 1 character (= 1 glyph?) – 256 character maximum – Good enough for most alphabetic languages. Some languages need more characters. What about the “double-byte” languages? Don’t those take two bytes per character? 丏丣並 À

Methods of reaching beyond single-byte Escape sequences to select another character set – Example: ISO 2022 uses escape sequences to select various encodings. Use a larger code unit (“wide” character encoding) – Example: IBM DBCS code pages or Unicode UTF-16 – 2^16 = 64K characters – 2^32 = about 4.3 billion characters. Use a variable-width encoding: variable-width encodings use different numbers of code units to represent different types of characters within the same encoding form.

Multibyte Encodings One or more bytes per character – 1 byte != 1 character – May use 1, 2, 3, or 4 bytes per character -> maximum number of bytes per character varies by encoding form. – May use shift or escape sequences – May encode more than one character set Single-byte encodings are a special case of multibyte! Multibyte Encoding: Any “variable-width” encoding that uses the byte as its code unit.

JIS X 213: A Coded Character Set whose common encoding forms are multibyte JIS X 213 – 11,233 characters – two 94x94 character planes

あ 1-4-1 (code point) 6 1-3-22 (code point)

Simple Multibyte Encoding Forms Specific byte ranges encoding characters that take more than one byte. – A “ lead byte ” – One or more “ trailing bytes ” Code point != code unit あ 1-4-1 (code point) 0x82 0xA0 A 1-3-33 (code point) 0x41

Shift_JIS: A Multibyte Encoding In order to reach more characters, Shift_JIS characters start with a limited range of “lead bytes” These can be followed by a larger range of byte values (“trail byte”)

Shift_JIS

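Shift_JIS Lead byte values can also appear as trail bytes. Trail bytes include ASCII values. Trail bytes include special values such as 0x5C (“\”). A byte-oriented search can therefore match the second byte of a two-byte character instead of a real “@”: char *pos = strchr(mybuf, '@');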
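
The false match is easy to reproduce. The sketch below uses Python rather than C so that it stays self-contained; it relies on Python's `shift_jis` codec, and the sample text (two katakana, containing neither “@” nor a backslash) is chosen for illustration.

```python
text = "\u30a1\u30bd"             # ァソ: no '@' and no backslash in this text
raw = text.encode("shift_jis")    # two 2-byte characters

# A byte-wise search finds ASCII values inside the trail bytes.
false_at = raw.find(b"@")         # 0x40 is the trail byte of the first character
false_backslash = raw.find(b"\\") # 0x5C is the trail byte of the second

# Searching decoded characters gives the right answer: not found.
safe_at = text.find("@")

print(raw.hex(" "), false_at, false_backslash, safe_at)
```

Decoding to characters before searching, or using an encoding-aware search routine, avoids the problem.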

More Complex Multibyte Systems Stateful Encodings –ex. IBM “MBCS” code pages [SI/SO shift between 1- byte and 2-byte characters] –ISO 2022 [escape sequence changes character set being encoded]

Ad hoc Encodings

Transfer Encodings A transfer encoding syntax is a reversible transform of encoded data which may (or may not) include textual data represented in one or more character encoding schemes. Examples: email headers, URIs, IDN (domain names). For instance, the text Abcソース in an email header becomes =?UTF-8?B?QWJj44K944O844K5?= on the wire and is decoded back to Abcソース by the receiver.
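
Two common transfer encodings can be reproduced with the standard library. This sketch builds the Base64 form of an email encoded-word by hand and percent-encodes the same text as it would appear in a URI; it is an illustration, not a full implementation of either mail or URI rules.

```python
import base64
from urllib.parse import quote, unquote

text = "Abc\u30bd\u30fc\u30b9"    # Abcソース

# Email header "encoded-word": charset, 'B' for Base64, then the payload.
b64_payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
encoded_word = f"=?UTF-8?B?{b64_payload}?="

# URI percent-encoding of the UTF-8 bytes.
percent = quote(text)

# Both transforms are reversible.
from_b64 = base64.b64decode(b64_payload).decode("utf-8")
from_percent = unquote(percent)

print(encoded_word)
print(percent)
```

Forgetting to remove either layer before display is one of the mojibake sources listed earlier.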

Encoding Conversion Document formats often require a single character encoding be used for all parts of the document. Process Output (HTML, XML, etc.) Templates ISO 8859-1 Content UTF-8 Data Shift_JIS When data is merged, the same encoding form must be used or some of the data will be “mojibake”. Common Encoding Conversion Tools and Libraries iconv (Unix) ICU (C, C++, Java) perl Encode Java (native2ascii, IO/NIO) (etc.)

Encoding Conversion as Filter Encoding conversion acts as a “ filter ” – Replacement characters ( “ question marks ” ) replace characters from the source character set that are not present in the target character set. ISO 8859-1 ÀàÐ¡£ ?????? »èç????? ???? UTF-8 ÀàÐ¡£ ?????? »èç????? ???? ISO 8859-1 ÀàÐ¡£ UTF-8 детски »èçينس 文字 Shift_JIS 文字化け ? (0x3F) is the replacement character for ISO 8859-1
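
The filtering effect can be seen by converting mixed-script text to ISO 8859-1. In this sketch the sample string is invented for illustration; `errors="replace"` is Python's way of asking for replacement characters instead of an exception.

```python
source = "\u00c0 \u0434\u0435 \u6587\u5b57"   # À де 文字

# Characters missing from the target character set become '?' (0x3F).
latin1 = source.encode("iso-8859-1", errors="replace")

# Converting back cannot restore them: the information is gone.
round_trip = latin1.decode("iso-8859-1")

print(latin1, repr(round_trip))
```

With the default strict error handling the same call raises `UnicodeEncodeError`, which is often the safer choice because the loss is not silent.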

Too Many Fish in the Sea Need for more converters and conversion maps Difficulty of passing, storing, and processing data in multiple encodings Too many character sets… … leads to what we call “ code page hell ”

Unicode / ISO-10646

The Idea Behind Unicode Fights mojibake because: – characters are from the common repertoire; – characters are encoded according to one of the encoding forms; – characters are interpreted with Unicode semantics; – unknown characters are not corrupted Basic Principles – Universal repertoire – Logical order – Efficiency – Unification – Characters, not glyphs – Dynamic composition – Semantics – Stability – Plain Text – Convertibility

Unicode (ISO 10646) Unicode is a character set that supports all of the world’s languages and writing systems. – Code space runs from U+0000 to U+10FFFF (about 1.1 million code points) – Unicode and ISO 10646 are maintained in sync. – Unicode is maintained by an industry consortium. – ISO 10646 is maintained by the ISO.

What are “planes”? – Divide Unicode into equal-sized regions of code points. – 17 planes (0 through 0x10), each with 65,536 code points. – Plane 0 is called the Basic Multilingual Plane (BMP). – More than 99% of text in the wild lives in the BMP. – Planes 1 through 0x10 are called supplementary planes.

Unicode as the Universal Character Set An organized collection of characters. Each character has a code point aka Unicode Scalar Value (USV) U+0041 <= hex notation

Unicode Character Database Properties include: code point, name, character class, combining level, bidi class, case mappings, canonical decomposition, mirroring, default grapheme clustering. Example: ӑ (U+04D1) CYRILLIC SMALL LETTER A WITH BREVE – letter – non-combining – left-to-right – decomposes to U+0430 U+0306 – Ӑ (U+04D0) is uppercase (and titlecase)

Compatibility Characters Many characters were included in Unicode for round-trip conversion compatibility with legacy encodings: ①②③４５Ⅵ ¾ ǈ ¼ ǋ ½ ǆ ︴︷︻︽﹁﹄ｦｨｩｫｪｭﾞ ﺲ ﺳ ﻫ ﺽ ﵬ ﷺ fi fl ffi ffl ﬅ ﬔ Compatibility characters include presentation forms. Legacy encoding: a term for non-Unicode character encodings.

Byte Order Mark (BOM) U+FEFF Used to indicate the “ byte-order ” of UTF-16 code units – 0xFE FF; 0xFF FE Also used as a Unicode signature by some software (Windows ’ s Notepad editor, for example) for UTF-8 – 0xEF BB BF Appears as a character or renders as junk in some formats or on some systems. For example, older browsers render it as three bytes of mojibake.
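
The byte sequences above, and the difference between keeping and stripping a UTF-8 signature, can be checked with Python's codecs. A minimal sketch (`utf-8-sig` is Python's name for "UTF-8 with optional signature"):

```python
import codecs

bom = "\ufeff"

utf16_be = bom.encode("utf-16-be")   # FE FF
utf16_le = bom.encode("utf-16-le")   # FF FE
utf8_sig = bom.encode("utf-8")       # EF BB BF

# A file that starts with the UTF-8 signature.
data = codecs.BOM_UTF8 + "abc".encode("utf-8")

kept = data.decode("utf-8")          # BOM survives as a leading character
stripped = data.decode("utf-8-sig")  # BOM is removed if present

print(utf16_be.hex(" "), utf16_le.hex(" "), utf8_sig.hex(" "))
print(repr(kept), repr(stripped))
```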

The Replacement Character U+FFFD Indicates a bad byte sequence or a character that could not be converted. Equivalent to “ question marks ” in legacy encoding conversions � there was a character here, but it is gone now
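
A decoder typically produces U+FFFD when it meets bytes that are not valid in the declared encoding. In this sketch the input is ISO 8859-1 data mislabeled as UTF-8; the sample bytes are chosen for illustration.

```python
bad = b"caf\xe9"   # ISO 8859-1 bytes for 'café'; 0xE9 alone is not valid UTF-8

decoded = bad.decode("utf-8", errors="replace")

print(repr(decoded))   # the last character is U+FFFD
```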

Combining Marks Composition can create “new” characters Base + non-spacing (“combining”) characters A + ˚ = Å U+0041 + U+030A = U+00C5 a + ˆ +. = ậ U+0061 + U+0302 + U+0323 = U+1EAD a +. + ˆ = ậ U+0061 + U+0323 + U+0302 = U+1EAD
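
These equivalences can be confirmed with Python's `unicodedata` module. The sketch composes the sequences from the slide using Normalization Form C; the variable names are illustrative.

```python
import unicodedata

# A + combining ring above composes to the single code point U+00C5.
a_ring = unicodedata.normalize("NFC", "A\u030a")

# The two mark orders differ as raw code point sequences...
circumflex_first = "a\u0302\u0323"
dot_first = "a\u0323\u0302"

# ...but both compose to U+1EAD because the marks have different
# combining classes and are put into a canonical order first.
nfc_1 = unicodedata.normalize("NFC", circumflex_first)
nfc_2 = unicodedata.normalize("NFC", dot_first)

print(f"U+{ord(a_ring):04X}", f"U+{ord(nfc_1):04X}", f"U+{ord(nfc_2):04X}")
```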

Complex Scripts ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท ญั = ญ + ั glyph = consonant + vowel ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท (word boundaries)

Hindi What is Unicode? यूनिकोड क्या है ? यू नि को ड न + ि = नि

Tamil Example ‘ko’ U+0B95 U+0BCA Combining mark drawn to the “left” of the base character கொ

UNICODE'S ENCODING FORMS

Unicode Encoding Forms UTF-32 – Uses 32-bit code units. – All characters are the same width. UTF-16 – Uses 16-bit code units. – BMP characters use one 16-bit code unit. – Supplementary characters use two special 16-bit code units: a “surrogate pair”. UTF-8 – Uses 8-bit code units (bytes!) – It ’ s a multi-byte encoding! – Characters use between 1 and 4 bytes. – ASCII is ASCII in UTF-8

Unicode Encodings Compared U+1251: UTF-32: 0x00001251 UTF-16: 0x1251 UTF-8: 0xE1 0x89 0x91. U+10338: UTF-32: 0x00010338 UTF-16: 0xD800 0xDF38 UTF-8: 0xF0 0x90 0x8C 0xB8. A (U+0041): UTF-32: 0x00000041 UTF-16: 0x0041 UTF-8: 0x41. À (U+00C0): UTF-32: 0x000000C0 UTF-16: 0x00C0 UTF-8: 0xC3 0x80.
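
The comparison can be regenerated for any character. This sketch uses the big-endian codec variants so that no byte order mark is added; the helper name `encoded_forms` is made up for this example.

```python
def encoded_forms(ch: str) -> dict[str, str]:
    """Return the bytes of one character in each Unicode encoding form, as hex."""
    return {
        name: ch.encode(name).hex(" ").upper()
        for name in ("utf-32-be", "utf-16-be", "utf-8")
    }


for ch in ("\u1251", "\U00010338", "A", "\u00c0"):
    print(f"U+{ord(ch):04X}", encoded_forms(ch))
```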

UTF-32 Uses 32-bit code units (instead of the more-familiar 8-bit code unit, aka the “ byte ” ) Each character takes exactly one code unit. U+1251 0x00001251 U+10338 0x00010338

Advantages and Disadvantages of UTF-32 Easy to process – each logical character takes one code unit – can use pointer arithmetic Not commonly used – Not efficient for storage 11 bits are never used BMP characters are the most common — 16 bits wasted for each of these – Affected by processor architecture (Big-Endian vs. Little-Endian)

UTF-16 Uses 16-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”) – BMP characters use one unit – Supplementary characters use a “surrogate pair”, special code points that don’t do anything else. U+1251: 0x1251. U+10338: 0xD800 0xDF38. High surrogate: 0xD800-0xDBFF. Low surrogate: 0xDC00-0xDFFF. Unique ranges!
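
The surrogate arithmetic is short enough to write out. The following sketch follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two halves of 10); the function names are invented for this example.

```python
def to_utf16_units(cp: int) -> list[int]:
    """Return the UTF-16 code unit(s) for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]                        # BMP: one code unit
    offset = cp - 0x10000                  # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return [high, low]


def from_surrogates(high: int, low: int) -> int:
    """Combine a surrogate pair back into a supplementary code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([f"{u:#06x}" for u in to_utf16_units(0x10338)])
```

Because the high and low ranges do not overlap with each other or with anything else, a program can tell from any single code unit whether it is in the middle of a pair.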

Advantages and Disadvantages of UTF-16 Most common languages and scripts are encoded in the BMP. – Less wasteful than UTF-32 – Simpler to process (excepting surrogates) – Commonly supported in major operating environments, programming languages, and libraries May not be suitable for all applications – Affected by processor architecture (Big-Endian vs. Little-Endian) – Requires more storage, on average, for Western European scripts, ASCII, HTML/XML markup.

UTF-8 7-bit ASCII is itself. All other characters take 2, 3, or 4 bytes each – lead bytes have a special pattern – trailing bytes range from 0x80 to 0xBF. Lead byte and trail bytes by code point range: 0xxxxxxx (code points < 0x80); 110xxxxx 10xxxxxx (< 0x800); 1110xxxx 10xxxxxx 10xxxxxx (< 0x10000); 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (supplementary).
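
The bit patterns translate directly into code. This is a hand-written sketch for illustration (real programs should use the built-in codec); the function name `utf8_encode` is an assumption of this example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 by filling in the lead/trail bit patterns."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:                                   # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # 11110xxx + three trail bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xC0, 0x1251, 0x10338):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```

Always choosing the first range that fits is what guarantees the shortest form.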

Advantages and Disadvantages of UTF-8 ASCII-compatible Default or recommended encoding for many Internet standards Bit pattern highly detectable (over longer runs) Non-endian Streaming C char* friendly Easy to navigate Multibyte encoding requires additional processing awareness Non-shortest form checking needed Less efficient than UTF-16 for large runs of Asian text

HTML Set Web server to declare UTF-8 in HTTP Content-Type header Declare UTF-8 in META tag header Actually use UTF-8 as the encoding!! Вибір і застосування кодування

WORKING WITH UNICODE It’s more than just a character set and some encodings…

Unicode Properties, Annexes, and Standards Unicode provides additional information: – Character name – Character class – “ctype” information, such as if it’s a digit, number, alphabetic, etc. – Directionality (LTR, RTL, etc.) and the Bidi Algorithm – Case mappings (UPPER, lower, and Titlecase) – Default Collation and the Unicode Collation Algorithm (UCA) – Identifier names – Regular Expression syntaxes – Normalization – Compatibility information. Many of these items are in the form of Unicode Technical Reports http://www.unicode.org/reports

Normalization Abc ABC abc abC aBc abc Unicode Normalization has to deal with more issues: single or multiple combining marks; compatibility characters; presentation forms. Example: Ǻ may arrive as U+01FA; U+00C5 U+0301; U+00C1 U+030A; U+212B U+0301; U+0041 U+0301 U+030A; U+0041 U+030A U+0301

Four Normalization Forms Form D: canonical decomposition. Form C: canonical decomposition followed by composition. Form KD: kompatibility decomposition. Form KC: kompatibility decomposition followed by composition. Ways to represent Ǻ: U+01FA; U+00C5 U+0301; U+00C1 U+030A; U+212B U+0301; U+0041 U+0301 U+030A; U+0041 U+030A U+0301

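Normalization in Action (Ǻ)
Original | Form C | Form D | Form KC | Form KD
U+01FA | U+01FA | U+0041 U+030A U+0301 | U+01FA | U+0041 U+030A U+0301
U+00C5 U+0301 | U+01FA | U+0041 U+030A U+0301 | U+01FA | U+0041 U+030A U+0301
U+212B U+0301 | U+01FA | U+0041 U+030A U+0301 | U+01FA | U+0041 U+030A U+0301
U+0041 U+030A U+0301 | U+01FA | U+0041 U+030A U+0301 | U+01FA | U+0041 U+030A U+0301
U+00C1 U+030A | U+00C1 U+030A | U+0041 U+0301 U+030A | U+00C1 U+030A | U+0041 U+0301 U+030A
U+0041 U+0301 U+030A | U+00C1 U+030A | U+0041 U+0301 U+030A | U+00C1 U+030A | U+0041 U+0301 U+030A
The last two rows put the acute (U+0301) before the ring (U+030A). Both marks have the same combining class, so normalization does not reorder them, and these sequences are not canonically equivalent to U+01FA.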
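
A quick way to check a table like this is to run each sequence through a normalizer. The sketch below uses Python's `unicodedata.normalize`; the grouping into `ring_then_acute` and `acute_then_ring` is this example's own naming.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def all_forms(s: str) -> dict[str, str]:
    """Normalize one string to each of the four forms."""
    return {form: unicodedata.normalize(form, s) for form in FORMS}


# Sequences where the ring comes before the acute (or is precomposed that way).
ring_then_acute = ["\u01fa", "\u00c5\u0301", "\u212b\u0301", "A\u030a\u0301"]

# Sequences where the acute comes before the ring.
acute_then_ring = ["\u00c1\u030a", "A\u0301\u030a"]

for s in ring_then_acute + acute_then_ring:
    original = " ".join(f"U+{ord(c):04X}" for c in s)
    nfc = " ".join(f"U+{ord(c):04X}" for c in all_forms(s)["NFC"])
    print(f"{original:24} -> NFC {nfc}")
```

Marks with the same combining class keep their relative order, which is why the two groups normalize to different results.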

Normalization: Not a Panacea Not all compatibility characters have a compatibility decomposition. Not all characters that look alike or have similar semantics have a compatibility decomposition. For example, there are many ‘dots’ used as a period. Not all character variations are handled by normalization. For example, upper, title, and lowercase variations. Normalization can remove meaning

A Bit of Bidi

Bi-directional Scripts Some languages are written predominantly from left-to- right (LTR). Some languages are written predominantly from right- to-left (RTL). (A few can be written top- to-bottom or using other schemes) Unicode defines character “directionality” and a “Bidi” algorithm for rendering text. Uses logical, not visual, order. Uses levels of “embedding”. Requires markup changes (as in HTML) or special controls for certain cases.

Embedding and “Logical Order” Characters are encoded in logical order. Visual order is determined by the layout. – Override and bidi control characters – “Indeterminate” characters

Bidirectional Embedding Paste in Arabic

Unicode Controls and Markup

Natural Language Processing

Unicode Collation Algorithm Defines default collation algorithm and sequences (UTS#10) – Must be tailored by language and “locale” (culture) and other variations. Language Swedish:z < ö German:ö < z Usage German Dictionary:öf < of German Telephone: of < öf Customizations Upper-firstA < a Lower-Firsta < A

Text Segmentation (UAX#29) Find grapheme, word, and line-break boundaries in text. Tailored by language Provides good basic default handling

CLDR and Language Specific Processing… … is in the next section

SUMMARY

“ That ’ s great: I ’ ll just use Unicode ” Remember “all text has an encoding”? user input via forms email data feeds existing, legacy data database instances uploads Use UTF-8 for HTML and Web forms Use UTF-8 in your APIs Check that data really is UTF-8 Control encoding via code; avoid hard-coding the encoding Watch out for legacy encodings Convert to Unicode as soon as practical. Convert from Unicode as late as possible. Wrap Unicode-unfriendly technologies
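
"Check that data really is UTF-8" and "convert to Unicode as soon as practical" can be combined at the input boundary. The following is a minimal sketch; the function name and the choice of cp1252 as the fallback are assumptions for illustration, and a real system should use whatever legacy encoding its data sources actually declare.

```python
def decode_utf8_or_fallback(data: bytes, fallback: str = "cp1252") -> tuple[str, str]:
    """Decode bytes as strict UTF-8; otherwise fall back to a declared legacy encoding.

    Returns the text and the name of the encoding that was used, so the
    caller can store or log it.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Hypothetical fallback: replace undecodable bytes rather than fail.
        return data.decode(fallback, errors="replace"), fallback


print(decode_utf8_or_fallback("caf\u00e9".encode("utf-8")))
print(decode_utf8_or_fallback(b"caf\xe9"))
```

Strict decoding is a reasonable validity check because random legacy bytes rarely form valid UTF-8 over longer runs.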

Your System Map Your System APIs – use Unicode encoding – hide internal storage encoding. Data Stores, Local I/O – use Unicode encoding – consider an encoding conversion plan. Front Ends – use Unicode encoding. Back Ends, External Data – Uses Unicode? – If not, what encoding? – Store the encoding! [Diagram labels: API; Unicode; Legacy Encoding; Detect / Convert; Capture Encoding; Unicode Cloud; Unicode Interface; Convert to Legacy; Input]

Counting Things Be aware of whether you need to count glyphs, characters, or bytes: – Is the limit “ screen positions ”, “ characters ”, or “ bytes of storage ” ? – Should you be using a different limit? Which one are you actually counting? यूनिकोड (4 glyphs) य ू न ि क ो ड (7 characters) E0-A4-AF E0-A5-82 E0-A4-A8 E0-A4-BF E0-A4-95 E0-A5-8B E0-A4-A1 (21 bytes) varchar(110)
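
The three counts for the Hindi example can be computed as follows. Bytes and code points are exact; the glyph count here is only a rough approximation that skips combining marks, which happens to work for this word but is not the UAX #29 grapheme cluster algorithm.

```python
import unicodedata

word = "\u092f\u0942\u0928\u093f\u0915\u094b\u0921"   # यूनिकोड

n_bytes = len(word.encode("utf-8"))   # storage in UTF-8
n_code_points = len(word)             # Python strings count code points

# Approximation: count characters whose general category is not a Mark (M*).
n_clusters = sum(
    1 for ch in word if not unicodedata.category(ch).startswith("M")
)

print(n_bytes, n_code_points, n_clusters)
```

A column declared as varchar(110) may therefore hold far fewer user-visible characters than 110, depending on whether the database counts bytes or characters.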

Character Encodings Code unit Code point Character Glyph Multibyte encoding – Tofu – Mojibake – Question Marks “All text has an encoding”

Unicode 17 planes of goodness – 1.1 million potential code points – 150,000 assigned code points 3 encodings – UTF-32 – UTF-16 – UTF-8 Normalize Bidi Collation Case folding … and so much more

Q&A Would you write the code for I18N on the whiteboard before you go? #define UNICODE #import I18N.h
