
1 From UCS-2 to UTF-16
Discussion and practical example for the transition of a Unicode library from UCS-2 to UTF-16.

This talk discusses the need to support "surrogate characters", analyzes many of the implementation choices with their pros and cons, and presents a practical example. As the preparation of the second part of ISO 10646 and the next version of Unicode draws to an end, Unicode applications need to prepare to support assigned characters outside the BMP. Although the Unicode encoding range was formally extended via the "surrogate" mechanism with Unicode 2.0 in 1996, many implementations still assume that a code point fits into 16 bits. At the end of this year, the drafts for the new standard versions are expected to be stable, and the assignment of surrogate characters for use in the East Asian markets will soon require implementations that support the full Unicode range. For example, the International Components for Unicode (ICU), an open-source project, provides low-level Unicode support in C and C++ similar to the Java JDK 1.1. In order to support the full encoding range, some of the APIs and implementations had to be changed. Several alternatives were discussed for this project and are presented in this talk. The ICU APIs and implementation are now being adapted for UTF-16, with 32-bit code point values for single characters, and the lookup of character properties is extended to work with surrogate characters. This approach is compared with what other companies and organizations are doing, especially for Java, Linux and other Unixes, and Windows.

2 Why is this an issue?
- The concept of the Unicode standard changed during its first few years
- Unicode 2.0 (1996) expanded the code point range from 64k to 1.1M
- APIs and libraries need to follow this change and support the full range
- Upcoming character assignments (Unicode 3.1, 2001) fall into the added range

The Unicode standard was designed to encode fewer than 65,536 characters; in "The Unicode Standard, Version 1.0" on page 2 it says: "Completeness. The coded character set would be large enough to encompass all characters that were likely to be used in general text interchange." It was thought possible to stay below 65,536 characters because rarely used, ancient, obsolete, and precomposed characters were not to be encoded – they were not assumed to be "likely to be used in general text interchange". Unicode included a private use area of several thousand code points to accommodate such needs. The only original encoding form was a fixed-width 16-bit encoding. With the expansion of the coding space that became necessary later, the 16-bit encoding became variable-width. Byte-based and 32-bit fixed-width encodings were also added over time. These changes went hand in hand with the maturation and growing acceptance and use of Unicode.

3 “Unicode is a 16-bit character set”
- Concept: 16-bit, fixed-width character set
- Saving space by not including precomposed, rarely-used, obsolete, … characters
- Compatibility, transition strategies, and acceptance forced loosening of these principles
- Unicode 3.1: >90k assigned characters

Unicode 1.1 already included a number of precomposed and compatibility characters, largely to allow round-trip conversion with other important character sets. As Unicode became accepted as the universal character set standard, demand from user groups rose to include many more CJKV characters and other rarely used and specialty characters. This made it necessary to extend the code range: UTF-16 was adopted in 1994 and published as an amendment to ISO 10646 and as part of Unicode 2.0 in 1996. It adds 1M (1024x1024) code points and makes them available in the default 16-bit encoding form by means of “surrogate pairs”, pairs of special 16-bit values, making the 16-bit encoding form variable in the number of code units per character. The upcoming Unicode 3.1 is expected to add several ten thousand characters with code points above U+FFFF, which will require support for surrogate pairs in 16-bit Unicode implementations. In particular, there will be more than 40,000 CJKV characters in the range U+20000 to U+2FFFF.

4 16-bit APIs
- APIs developed for Unicode 1.1 used 16-bit characters and strings: UCS-2
- Assuming 1:1 character:code unit
- Examples: Win32, Java, COM, ICU, Qt/KDE
- Byte-based UTF-8 (1993) mostly for MBCS compatibility and transfer protocols

Programming libraries that were designed several years ago assumed for their APIs and implementations that Unicode text was stored in the original 16-bit fixed-width form. One 16-bit code unit in what the parallel ISO standard calls UCS-2 always encoded one Unicode character. This was true with Unicode 1.0 and 1.1. Examples of libraries that worked with this assumption include the Microsoft Windows Win32 API set and related APIs like COM; Java with its char type and String class; the International Components for Unicode (ICU); and the Qt library from Troll Technologies for user interface programming, on which the KDE desktop environment is built. Libraries, APIs, and protocols that were defined in terms of byte-based text processing use the byte-based UTF-8 encoding of Unicode, which encodes almost all characters in more than one byte – those systems always dealt with variable-width encodings when they were internationalized at all.

5 Extending the range
- Set aside two blocks of 1k 16-bit values, “surrogates”, for extension
- 1k x 1k = 1M additional code points using a pair of code units
- 16-bit form now variable-width UTF-16
- “Unicode scalar values” 0..0x10FFFF
- Proposed: 1994; part of Unicode 2.0 (1996)

The Unicode encoding range was extended by 1M additional code points. The extension was defined to fit into the standard 16-bit encoding: Two adjacent blocks of 1k code point values were set aside to be used only in pairs. The first code unit in such a pair has a value from 0xD800 to 0xDBFF, the second one a value from 0xDC00 to 0xDFFF. These code point values are called “surrogates”; they are not and will never be used to encode characters except in such pairs. There are 1k x 1k = 1M possible surrogate pairs. Each surrogate has 10 bits free for encoding part of a code point value. The 10+10=20 free bits in the surrogate pair are concatenated and form numeric values from 0 to 0xFFFFF. In order to avoid an overlap of the encoding range between the single-unit and the surrogate-pair sequences, the code point value is calculated by adding an offset of 0x10000 to this value, so that surrogate pairs represent code points from 0x10000 (= 65,536) to 0x10FFFF (= 0x10000 + 0xFFFFF). With this extension, the total coding range is from 0 to 0x10FFFF and is more than large enough for all known scripts and characters. The effect on the Unicode encoding is that the standard 16-bit form, now called UTF-16, became variable-width – starting with Unicode 2.0, one Unicode character (code point) is unambiguously encoded in either 1 or 2 16-bit code units.
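As a concrete illustration of this arithmetic, here is a minimal C++ sketch; the function names are invented for the example and are not any library's API:

    #include <cstdint>
    #include <cassert>

    // Combine a surrogate pair into a code point: each surrogate contributes
    // 10 bits, and the 0x10000 offset places the result above the BMP.
    uint32_t fromSurrogates(uint16_t lead, uint16_t trail) {
        assert(0xD800 <= lead  && lead  <= 0xDBFF);
        assert(0xDC00 <= trail && trail <= 0xDFFF);
        return 0x10000 + ((uint32_t(lead - 0xD800) << 10) | (trail - 0xDC00));
    }

    // Split a supplementary code point (0x10000..0x10FFFF) into a pair.
    void toSurrogates(uint32_t c, uint16_t &lead, uint16_t &trail) {
        assert(0x10000 <= c && c <= 0x10FFFF);
        c -= 0x10000;
        lead  = uint16_t(0xD800 + (c >> 10));
        trail = uint16_t(0xDC00 + (c & 0x3FF));
    }

For example, fromSurrogates(0xD800, 0xDC00) yields 0x10000, and toSurrogates(0x10FFFF, …) yields the pair 0xDBFF, 0xDFFF.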

6 Parallel with ISO 10646
- ISO 10646 uses 31-bit codes: UCS-4
- UCS-2: 16-bit codes for the subset 0..0xFFFF
- UTF-16: transformation of the subset 0..0x10FFFF
- UTF-8 covers all 31 bits
- Private Use areas above 0x10FFFF slated for removal from ISO 10646 for UTF interoperability and synchronization with Unicode

In 1993, the Unicode standard and ISO 10646 were merged so that they encode the same characters with the same numeric code points. ISO 10646 defines a 31-bit code space with values of 0 to 0x7FFFFFFF for 2G characters. The canonical encoding, called UCS-4, uses 32-bit integer values (4 bytes). The alternative encoding UCS-2 covers only the subset of values up to 0xFFFF with 16-bit (2-byte) values. No character was assigned a value higher than 0xFFFF, but several ranges above that value were set aside for private use. The Unicode standard originally also used single 16-bit code units, and the two standards have assigned the same numeric values since the merger that was completed in 1993. UTF-8, the byte-based and first variable-width encoding for the two standards, was created even before then (in 1992) to help transition byte-oriented systems; it allows up to 6 bytes per character for all of UCS-4. The definition (in 1994) of UTF-16 as the variable-width 16-bit encoding for both standards allowed the extension of the Unicode code point range. UTF-32 was defined in 1999 to clarify the use of 32-bit units for Unicode characters with the more limited range compared to UCS-4, but in an otherwise fully compatible way. In 2000, the workgroup for the ISO standard (JTC1/SC2/WG2) agreed to remove any allocations above the UTF-16-accessible range, i.e., above 0x10FFFF, in order to remove any interoperability problems between UCS-4, UTF-8, and UTF-16.

7 21-bit code points
- Code points (“Unicode scalar values”) up to 0x10FFFF use 21 bits
- 16-bit code units still good for strings: variable-width like MBCS
- Default string unit size not big enough for code points
- Dual types for programming?

In summary, Unicode assigns 21-bit code point values to characters. In strings, 16-bit integers are still used for the base units because multiple units can be used to express values that do not fit into a single one, just like with byte-based MBCS encodings including UTF-8. For programming, this means that a choice needs to be made about the data types for Unicode support. Unlike before, a single 16-bit data type is not sufficient for both string base units and integer values for single Unicode characters. Dual types – different types for different uses – may be necessary.

8 C: char/wchar_t dual types
- C/C++ standards: dual types
- Strings mostly with char units (8 bits)
- Code points: wchar_t, 16 or 32 bits
- Typical use in I18N-ed programs: (8-bit) char strings but (16/32-bit) wchar_t (or 32-bit int) characters; code point type is implementation-dependent

In C and C++, the char type is typically 8 bits wide, while there is a second type, wchar_t, that may (or may not) be wider than that. Most string APIs work with arrays of char units, while single characters are often either of type int or of type wchar_t. There is a small set of wchar_t string APIs, too. Essentially, internationalized C/C++ programs have long dealt with dual types: the smaller (narrower) one is used as the string base type, and the other one is wide enough to hold the code point value of any character. With large character sets like EUC-TW and Unicode, this wider type is a 32-bit integer. As usual, the exact types and the character set depend on the platform and the compiler.
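For illustration, a minimal example of this classic dual-type model using only the standard C library; note that whether wchar_t values are Unicode code points depends on the platform and locale:

    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>

    int main(void) {
        setlocale(LC_ALL, "");  /* use the environment's (multibyte) locale */
        const char *s = "text in the locale's encoding";
        wchar_t wc;             /* one decoded character, as a wide code point */
        int len;
        /* mbtowc() decodes one multibyte character and reports its length in
           bytes, so the char string advances by a variable number of units. */
        while ((len = mbtowc(&wc, s, MB_CUR_MAX)) > 0) {
            printf("%#06lx in %d byte(s)\n", (unsigned long)wc, len);
            s += len;
        }
        return 0;
    }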

9 Unicode: dual types, too?
- Strings could continue with 16-bit units
- Single code points could get a 32-bit data type
- Dual-type model like C/C++ MBCS

Unicode libraries and applications that are adapted to the larger code point range could use a dual-type model similar to C and C++: a 16-bit integer type for code units in strings as before, and a 32-bit integer for single code points. This is one of several choices, as discussed in the following.
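A minimal sketch of this dual-type model; the type and function names are invented for illustration, not a specific library's API:

    #include <cstdint>

    typedef uint16_t UnicodeUnit;  // one UTF-16 code unit in a string
    typedef int32_t  CodePoint;    // any Unicode scalar value, 0..0x10FFFF

    // String APIs keep taking arrays of 16-bit units ...
    int32_t indexOf(const UnicodeUnit *s, int32_t length, CodePoint c);

    // ... while single-character APIs use the wide type throughout.
    bool isSupplementary(CodePoint c) { return c > 0xFFFF; }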

10 Alternatives to dual 16/32 types
- UTF-32: all types 32 bits wide, fixed-width
- UTF-8: same complexity after the range extension beyond just the BMP, closer to the C/C++ model – byte-based
- Use pairs of 16-bit units
- Use strings for everything
- Make the string unit size flexible: 8/16/32 bits

This is a list of alternative ideas that were discussed for the ICU library (International Components for Unicode) for dealing with 21-bit code points. The string unit type could be changed to be 32 bits wide: strings would store text in UTF-32, with one code unit per Unicode character, and there would be only one data type for both strings and single characters. Since support of UTF-16 would mean handling a variable-width string encoding and dual types, the string encoding could also be changed to the byte-based UTF-8; UTF-8 sequences simply become longer for larger code point values: up to 4 bytes per character are used instead of up to 3 for the old range. Instead of using wider integer types for code points, one could use pairs of 16-bit integers for characters that also use surrogate pairs in UTF-16 strings. Instead of using any integer types for code points, one could pass strings around instead; such strings could be in any UTF. Another idea, on top of almost all of these choices, is to make the string unit size flexible and configurable, at least at compile time: strings would be encoded in either UTF-8, UTF-16, or UTF-32, and code points would be 32-bit integers or strings; the API would just take an abstract code unit type, and the implementations would have to take all variations into account. The following slides explore each of these ideas.

11 UCS-2 to UTF-32
- Fixed-width, single base type for strings and code points
- UCS-2 programming assumptions mostly intact
- Wastes at least 33% space, typically 50%
- Performance bottleneck between CPU and memory

Option: Changing the string base type from a 16-bit integer to a 32-bit integer. Advantage: Assumptions made in programming for UCS-2 stay intact: Each character is stored in one single code unit, and the same type can be used for both strings and code points. Disadvantage: Memory usage, and potentially a reduction in performance: Since Unicode code points only use 21 bits, 11 out of 32 bits – 33% – would never be used. In fact, since the most common characters were assigned small values that fit into 16 bits, typical applications would leave 50% of the memory that strings take up unused. In text processing, more memory needs to be moved between main and virtual memory and the CPU cache, which may cost more performance than is gained by the reduction in operations per character from the simpler fixed-width encoding.

12 UCS-2 to UTF-8
- UCS-2 programming assumes many characters in single code units
- Breaks a lot of code
- Same question of type for code points; follow the C model, 32-bit wchar_t?
- More difficult transition than the other choices

Option: Changing the string type to UTF-8. This alone does not affect the choice of a data type for code points. Advantage: UTF-8 is a popular variable-width encoding for Unicode; the memory consumption is higher or lower than with UTF-16 depending on the text. Disadvantage: Changing a UCS-2 library to use UTF-8 would break a lot of code even for code points below 0xFFFF because much of the implementation relies on special characters (digits, modifiers, controls for bidirectional text, etc.) being encoded in a single unit each. Note: Existing UTF-8 systems need to make sure that 4-byte sequences and 21-bit code points are handled; some may assume that UTF-8 never uses more than 3 bytes per character and that the scalar values fit into 16 bits, which was the case for Unicode 1.1.
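To make the sequence lengths concrete, here is a sketch of a UTF-8 encoder covering the extended range (illustrative code, not ICU's converter; it assumes a valid scalar value and a buffer of at least 4 bytes):

    #include <cstdint>

    // Returns the number of bytes written: 1..3 for the BMP,
    // 4 for the supplementary code points added by the range extension.
    int encodeUtf8(uint32_t c, uint8_t *buf) {
        if (c <= 0x7F) {                       // ASCII: single byte
            buf[0] = uint8_t(c);
            return 1;
        } else if (c <= 0x7FF) {               // two bytes
            buf[0] = uint8_t(0xC0 | (c >> 6));
            buf[1] = uint8_t(0x80 | (c & 0x3F));
            return 2;
        } else if (c <= 0xFFFF) {              // rest of the BMP: three bytes
            buf[0] = uint8_t(0xE0 | (c >> 12));
            buf[1] = uint8_t(0x80 | ((c >> 6) & 0x3F));
            buf[2] = uint8_t(0x80 | (c & 0x3F));
            return 3;
        } else {                               // 0x10000..0x10FFFF: four bytes
            buf[0] = uint8_t(0xF0 | (c >> 18));
            buf[1] = uint8_t(0x80 | ((c >> 12) & 0x3F));
            buf[2] = uint8_t(0x80 | ((c >> 6) & 0x3F));
            buf[3] = uint8_t(0x80 | (c & 0x3F));
            return 4;
        }
    }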

13 Surrogate pairs for single chars
- Caller avoids the code point calculation
- But: caller and callee need to detect and handle pairs: the caller when choosing argument values, the callee when checking for errors
- Harder to use with code point constants because they are published as scalar values
- Significant change for the caller compared to using scalars

Option: Duplicating code point APIs by adding surrogate-pair variants; strings are in UTF-16. A caller would check for surrogate pairs, call either function variant, and advance in the text by one or two units in case of an iteration. It is also possible to replace existing functions by the pair variant; the caller could always pass in the current and the following code unit. In this case, the function needs to return the number of units that it used, to allow forward iteration; for backward iteration, there may be additional provisions. Advantage: The API would still only work with 16-bit integers. Disadvantage: The usage model becomes significantly more complicated. Some of the work for detecting and dealing with surrogates would be done twice for robust interfaces, once by the caller and a second time by the API implementation. The API itself becomes more convoluted and harder to use. Also, character code points are typically published, discussed, and accessed as scalar values; forcing a programmer to calculate the surrogate values would be clumsy and error-prone.
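A hedged sketch of what such a duplicated API could look like – the names are invented for illustration and are not ICU's actual functions:

    #include <cstdint>

    bool isLetter(uint16_t c);                         // single-unit (BMP) variant
    bool isLetterPair(uint16_t lead, uint16_t trail);  // surrogate-pair variant

    // Every robust caller ends up repeating the pair detection that the
    // implementation must also perform for error checking:
    bool isLetterAt(const uint16_t *s, int32_t i, int32_t length) {
        if (0xD800 <= s[i] && s[i] <= 0xDBFF && i + 1 < length &&
            0xDC00 <= s[i + 1] && s[i + 1] <= 0xDFFF) {
            return isLetterPair(s[i], s[i + 1]);
        }
        return isLetter(s[i]);
    }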

14 Strings for single chars
- Always pass in a string (and offset)
- Most general, handles graphemes in addition to code points
- Harder to use with code point constants because they are published as scalar values
- Significant change for the caller compared to using scalars

Option: Use strings in any UTF even for single code points. The number of code units that was processed may need to be returned. Advantage: Only one data type is necessary for both general string operations and single-character functions. In addition, this approach is general enough to cover combining sequences for “user characters” (“graphemes”), and to provide context in some cases. Disadvantage: This is also a significant change in the use of single-character APIs that would initially break some code, more than changing the width of the integer type would. It is harder to use with code point constants, as in the case of using surrogate pairs. It may be more difficult to iterate backwards code point by code point through text with the same functions – the caller and the API function would need to exchange sufficient information to make this possible.
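A sketch of the building block behind such string-argument APIs (again with invented names): the function reads one code point starting at an in/out offset, so the caller learns how many code units were consumed:

    #include <cstdint>

    // Reads the code point at *offset and advances *offset by 1 or 2 units,
    // which directly supports forward iteration.
    int32_t codePointAt(const uint16_t *s, int32_t *offset, int32_t length) {
        uint16_t lead = s[(*offset)++];
        if (0xD800 <= lead && lead <= 0xDBFF && *offset < length &&
            0xDC00 <= s[*offset] && s[*offset] <= 0xDFFF) {
            uint16_t trail = s[(*offset)++];
            return 0x10000 + ((int32_t(lead - 0xD800) << 10) | (trail - 0xDC00));
        }
        return lead;  // single unit; unpaired surrogates pass through here
    }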

15 UTF-flexible
- In principle, if the implementation can handle variable-width, MBCS-style strings, could it handle any UTF size as a compile-time choice?
- Adds interoperability with UTF-8/32 APIs
- Almost no assumptions possible
- Complexity of transition even higher than that of a transition to pure UTF-8; performance?

Option: With conceptually dual types for strings and single code points, could the size of the string unit be flexible, to be set at compile time? This would somewhat mirror the C model, where the string type is typically fixed and the code point type depends on the compiler and platform; here, the variability of the types would be the opposite. Advantage: This offers customizable interoperability with other libraries and protocols, by choosing the same UTF for internal string operations as what the application is expected to process. Disadvantage: The “minimum-length problem” of UTF-8 may need to be taken into account – the same code point can be encoded with more bytes than necessary, so well-formedness checks must enforce the shortest form (UTF-16 does not have this problem). Many algorithms are faster if they can assume that special characters occupy exactly one code unit; in UTF-8, only the ASCII repertoire at 0..0x7F can be represented with single code units. Some code, in ICU especially the codepage conversion library, is written so specifically for the encoding definition that it would require different versions of almost all of that code.
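A minimal sketch of the compile-time choice (illustrative only, not ICU's actual build mechanism): the unit type follows a build flag, and every algorithm would have to be written against the abstract type:

    #include <cstdint>

    #define UNIT_BITS 16  // build-time choice: 8, 16, or 32

    #if UNIT_BITS == 8
        typedef uint8_t  TextUnit;   // UTF-8: 1..4 units per code point
    #elif UNIT_BITS == 16
        typedef uint16_t TextUnit;   // UTF-16: 1..2 units per code point
    #else
        typedef uint32_t TextUnit;   // UTF-32: always 1 unit per code point
    #endif

    // Every string API would take the abstract unit type:
    int32_t lengthInCodePoints(const TextUnit *s, int32_t lengthInUnits);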

16 Interoperability
- Break existing API users no more than necessary
- Interoperability with other APIs: Win32, Java, COM, now also XML DOM
- UTF-16 is the Unicode default: good compromise (speed/ease/space)
- String units should stay 16 bits wide

Further considerations for the support of “surrogate characters” include the question of migration and interoperability. It is desirable to not change the programming model more than necessary. It is also desirable to use the same string type that other popular and important systems use: The leading Unicode implementations in Windows Win32 and COM as well as in Java use 16-bit Unicode strings, as do the XML DOM specification and many other systems. Windows is an important platform, Java an important programming language, and specifically for ICU, the open-source IBM XML parser is one of the most important applications using ICU. UTF-16 is also the default encoding form of Unicode, and it provides a good compromise between memory usage, performance, and ease of use. This led to the decision to continue to use 16-bit strings in ICU.

17 Does everything need to change?
- String operations: search, substring, concatenation, … work with any UTF without change
- Character property lookup and similar: need to support the extended range
- Formatting: should handle more code points or even graphemes
- Careful evaluation of all public APIs

Many string operations work the same for UCS-2 and for UTF-16. In particular, searching for and extracting substrings, concatenating strings, and similar operations work without any change. In fact, these operations are safe and simple with any UTF because the UTFs are well designed: Unlike many common MBCS encodings, they have no overlap between the code unit values of single-unit, lead, and trail units, so string matches, forward and backward iteration in text, and random access are always efficient and unambiguous. The necessary changes affect all functions with single-character arguments. On the most basic level, lookups of Unicode character properties (Is lowercase? Get the bidirectional class, …) are defined in terms of code points, which 16-bit arguments no longer fully cover. In formatting operations in higher-level I18N functions, single characters are often used for padding or currency characters. These also need to support any Unicode character, but beyond that may need to support “user characters”, or graphemes.

18 ICU: some of all
- Strings: UTF-16, the UChar type remains 16-bit
- New UChar32 for code points
- Provide macros for C to deal with all UTFs: iteration, random access, …
- C++ CharacterIterator: many new functions
- Property lookup/low-level: UChar32
- Formatting: strings for graphemes

For ICU, we decided to not change the string base type: the UChar type is still an unsigned 16-bit integer. We also did not change the way offsets into and lengths of strings are calculated: For efficiency and interoperability, indexing into strings is done on a code unit basis rather than a code point basis. We introduced a new type, UChar32, for code point values in low-level operations. It is a 32-bit integer; it may be signed or unsigned because it is defined to be the same as wchar_t if that is 32 bits wide. Functions for Unicode character property lookup and similar were changed from taking and returning UChar to using UChar32. There is a set of new macros to handle all UTFs, and especially UTF-16, to make it easier for C programmers to iterate through UTF-16 strings and access code point values. For C++, the CharacterIterator class and its implementation subclasses have a number of new methods that also provide convenient code point access, and the UnicodeString class, which originates from the Java String class, has functions for random access to code points. Formatting classes that used single UChar values were changed to take strings instead, to prepare for graphemes.
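For example, iterating over code points with ICU's C API looks roughly like this; shown with the macro and function names of later ICU releases (U16_NEXT in unicode/utf16.h), since the exact macro names have changed across versions:

    #include <unicode/utf16.h>  // U16_NEXT: code point iteration over UTF-16
    #include <unicode/uchar.h>  // u_tolower: property/mapping lookup on UChar32

    void lowercaseCodePoints(const UChar *s, int32_t length) {
        int32_t i = 0;
        while (i < length) {
            UChar32 c;
            U16_NEXT(s, i, length, c);     // reads 1 or 2 code units, advances i
            UChar32 lower = u_tolower(c);  // single-character API takes UChar32
            (void)lower;                   // a real caller would write this out
        }
    }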

19 Scalar code points: property lookup
Old, 16-bit:
  UChar u_tolower(UChar c) { return u[v[c[15..7]] + c[6..0]]; }
New, 21-bit:
  UChar32 u_tolower(UChar32 c) { return u[v[w[c[20..10]] + c[9..4]] + c[3..0]]; }
(Here c[a..b] stands for bits a through b of the code point c.)

Efficient and space-saving lookups of properties and of character mappings for conversion used to be done with 2-stage “compact arrays”, where the most significant bits of the 16-bit code point were used to get an offset into a compacted array. This kind of access scales well to 21-bit input when a third stage is added on the same principles. An extension to larger values would probably make it necessary to change this to use hash tables or other efficient means of accessing very sparse structures.
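A sketch of the three-stage lookup with illustrative table names (the real tables are generated from the Unicode data files; identical blocks are shared between entries, which keeps sparse ranges cheap):

    #include <cstdint>

    extern const uint16_t w[];  // stage 1: indexed by bits 20..10
    extern const uint16_t v[];  // stage 2: block offset plus bits 9..4
    extern const uint16_t u[];  // stage 3: data, offset plus bits 3..0

    uint16_t lookup(uint32_t c) {  // c is any code point in 0..0x10FFFF
        return u[v[w[c >> 10] + ((c >> 4) & 0x3F)] + (c & 0x0F)];
    }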

20 Formatting: grapheme strings
Old: void setDecimalSymbol(UChar c);
New: void setDecimalSymbol(const UnicodeString &s);

This is an example of the change from using single characters to using a string for graphemes. It is a useful change regardless of the width of the UChar integer type and of the UTF, because it supports modifier letters, Ideographic Description Sequences, isolated shapes using ZWNJ, etc.
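For example, a caller can now pass a supplementary character, which necessarily occupies two UChar units; UnicodeString::append(UChar32) handles the surrogate-pair encoding, and the setter itself is the slide's illustrative API, not a specific ICU class method:

    #include <unicode/unistr.h>

    void configureFormatter(/* a formatter with the slide's setter */) {
        icu::UnicodeString symbol;        // namespace as in later ICU releases
        symbol.append((UChar32)0x1D7D8);  // mathematical double-struck zero,
                                          // a supplementary-plane character
        // formatter.setDecimalSymbol(symbol);
    }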

21 Codepage conversion
- To Unicode: results are one or two UTF-16 code units, surrogates stored directly in the conversion table
- From Unicode: triple-stage compact array access from 21-bit code points, like property lookup
- Single-character conversion to Unicode now returns UChar32 values

The ICU codepage conversion used to handle only 16-bit Unicode code point mappings for table-based conversions (not for UTFs, of course). With surrogate characters expected to be assigned early in 2001, we implemented a new MBCS converter that can convert to and from surrogate pairs. The conversion to Unicode results in either one or two 16-bit units for each byte sequence; surrogate pairs are stored directly as such. The byte sequence is processed with a state machine based on the codepage structure, and the final state for each valid sequence indicates whether the result is stored with one or two code units. The converter function that provides forward iteration through codepage input and returns one Unicode character per function call now returns a UChar32. The conversion from Unicode first assembles a 21-bit code point value and then performs a 3-stage “compact array” lookup as described before, as for Unicode properties.

22 API first…
- Tools and basic functions and classes are in place (property lookup, conversion, iterators, BiDi)
- Public APIs reviewed and changed (the “luxury” of an early project stage) or deprecated and superseded by new versions
- Higher-level implementations to follow before Unicode 3.1 is published

After analyzing and discussing the options for adding surrogate support to ICU, we modified the API to accommodate the changes. Many of the tools that process Unicode character properties and conversion tables into binary formats have been modified, and some of the implementation is updated for true UTF-16 support. We focused on changing the API first because we wanted to stabilize it as early as possible. An added incentive was that, since ICU is used by more and more projects, the earlier we change the API, the fewer projects need to adapt to the new version. Some APIs were deprecated and superseded rather than changed. Adding full support in the API first and in the implementation later gives more time for the implementation work and prepares users for the new API.

23 More implementations follow…
- Collation: need to prepare for >64k primary keys
- Normalization and Transliteration
- Word/Sentence break iteration
- Etc.
- No non-BMP data before Unicode 3.1 is stable

Examples of implementations that still need to be adapted for full UTF-16 support are the ones for collation, normalization, transliteration, and break iteration. The most interesting case is collation: In principle, it works with strings and can handle surrogate pairs like combining sequences. However, the efficiency suffers with large numbers of such sequences and of primary sort keys; with the expected addition of CJKV characters, this needs to be augmented. Until the repertoire extension with Unicode 3.1 is sufficiently stable, there is no real data that uses surrogate pairs. The only possible exception is the private use characters in U+F0000 to U+10FFFF, which appear to be used rarely – probably partly due to the lack of surrogate support in systems and libraries.

24 Other libraries
- Java: planning stage for the transition
- Win32: rendering and the UniScribe API largely UTF-16-ready
- Linux: standardizing on 32-bit Unicode wchar_t; has UTF-8 locales, like other Unixes, for char* APIs
- W3C: standards assume the full UTF-16 range

Other systems and libraries that use 16-bit Unicode are in various stages of supporting the full UTF-16 range: Java is only in the planning stage for surrogate support; it does not currently provide any classes or methods to explicitly deal with surrogate pairs or Unicode code points above 0xFFFF. Microsoft Windows provides some support for handling and rendering surrogate pairs, especially with the UniScribe library. For example, a surrogate pair is rendered by modern versions of Internet Explorer not as two but only as one “missing” glyph; there will of course not be any real glyph as long as there is no font containing one. Linux is using UTF-8 in many low-level APIs and with UTF-8 locales, and the C standard library (glibc) is moving towards UCS-4 support. The Li18nux2000 specification for internationalizing Linux recommends ICU as a component for Linux distributions; ICU uses UTF-16, while Qt and KDE still use UCS-2. More recent Internet standards like HTML 4.0 and later specify Unicode as the document character set and UTF-8 as the preferred charset. In particular, XML specifies the full UTF-16 range (with specific exclusions) as its document encoding, and the DOM specification defines the API to use UTF-16 strings.

25 Summary
- The transition from UCS-2 to UTF-16 gains importance four years after it entered the standard
- APIs for single characters need change or new versions
- String APIs: no change
- Implementations need to handle 21-bit code points
- Range of options

The transition of the default Unicode encoding form from UCS-2 to UTF-16, which started in 1994 with the definition of UTF-16 and was published as part of Unicode 2.0 in 1996, gains importance with “real” character assignments above 0xFFFF expected in 2001. Especially the new CJKV characters are expected to be important for East Asian text processing and are likely to accelerate the acceptance of Unicode in East Asia. Software that uses 16-bit Unicode needs to be modified to handle the extended encoding range. On the API level, 16-bit strings are syntactically compatible. Single-character APIs need to be modified, or a new version made available, if they used 16-bit integers for Unicode code points; the actual code point values take up 21 bits. There are several options for transition APIs for surrogate support, and some of them are discussed in this presentation.

26 Resources
- Unicode FAQ: http://www.unicode.org/unicode/faq/
- Unicode on IBM developerWorks
- ICU

