Supplementary Character Support in Microsoft Products


Supplementary Character Support in Microsoft Products Michael S. Kaplan Globalization Infrastructure and Font Technology Windows International Microsoft

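What are supplementary characters?
Characters with code points from U+10000 to U+10FFFF, outside the Basic Multilingual Plane (BMP).
The quoted definition describes the surrogate pair that represents one such character in UTF-16:
"a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high surrogate and the second is a low surrogate"
24-26 March 2003 Prague, Czech Republic (IUC23)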

High/low surrogate?
High: U+D800 - U+DBFF
Low: U+DC00 - U+DFFF
Terminology: "surrogate pair" preferred over "surrogate character"
See http://www.trigeminal.com/16to32AndBack.asp

Conversion example #1
The first character in the Surrogate range (D800, DC00) as UTF-32:
1. D800: binary 1101100000000000 (lower ten bits: 0000000000)
2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)
3. Concatenate 0000000000+0000000000 = x00000 (twenty bits)
4. Add x10000
Result: U+10000. This makes sense, since the first character in the Surrogate range follows immediately after the last code point in the 16-bit Unicode range (U+FFFF).
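The four steps above can be sketched in Python (a language chosen for this illustration, not one used in the talk). The function name decode_surrogate_pair is hypothetical.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    high_bits = high & 0x3FF  # steps 1 and 2: keep the lower ten bits of each unit
    low_bits = low & 0x3FF
    combined = (high_bits << 10) | low_bits  # step 3: concatenate into twenty bits
    return combined + 0x10000  # step 4: add x10000


print(f"U+{decode_surrogate_pair(0xD800, 0xDC00):04X}")  # U+10000
```

The range checks are an addition for safety; the slide assumes a well-formed pair.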

Conversion example #2
You have a Unicode character such as U+2040A (a CJK character in Plane 2) and wish to encode it in UTF-16:
1. Subtract x10000 - Result: 1040A
2. Split into two ten-bit pieces: 0001000001 0000001010
3. Add 1101100000000000 (D800) to the high ten-bit piece (0001000001) - Result: 1101100001000001 (D841)
4. Add 1101110000000000 (DC00) to the low ten-bit piece (0000001010) - Result: 1101110000001010 (DC0A)
Your surrogate pair: D841, DC0A
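The reverse direction can be sketched the same way. This is a minimal Python illustration; encode_surrogate_pair is a hypothetical name, and the final line cross-checks the result against Python's built-in UTF-16 encoder.

```python
from typing import Tuple


def encode_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a supplementary code point: {code_point:#x}")
    offset = code_point - 0x10000      # step 1: subtract x10000
    high_bits = offset >> 10           # step 2: upper ten bits
    low_bits = offset & 0x3FF          #         lower ten bits
    return 0xD800 + high_bits, 0xDC00 + low_bits  # steps 3 and 4


high, low = encode_surrogate_pair(0x2040A)
print(f"{high:04X}, {low:04X}")  # D841, DC0A

# Cross-check with the standard library's big-endian UTF-16 encoder.
assert chr(0x2040A).encode("utf-16-be") == bytes([0xD8, 0x41, 0xDC, 0x0A])
```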

UTF-8 conversions
Illegal conversion: six-byte UTF-8 (the two surrogate code units of UTF-16, each converted separately)
Legal conversion: four-byte UTF-8 (one UTF-32 code point)
CESU-8 is the inverse of the above: it uses the six-byte form for supplementary characters, not the four-byte form

UTF-8 example
Unicode surrogate pair:
  aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
becomes incorrect UTF-8 totaling 6 bytes:
  1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx
Instead, you should take a Unicode surrogate pair:
  110110wwwwzzzzyy, 110111yyyyxxxxxx
and convert it to UTF-8 totaling 4 bytes (below, uuuuu is defined as wwww+1):
  11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
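Both byte layouts can be reproduced in a short Python sketch, using the pair D841, DC0A from conversion example #2. The function names are hypothetical. The last lines show that Python's strict UTF-8 decoder rejects the six-byte form.

```python
def four_byte_utf8(high: int, low: int) -> bytes:
    """Correct form: combine the pair into one code point, then emit 4 bytes."""
    cp = (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000
    return bytes([
        0xF0 | (cp >> 18),           # 11110uuu
        0x80 | ((cp >> 12) & 0x3F),  # 10uuzzzz
        0x80 | ((cp >> 6) & 0x3F),   # 10yyyyyy
        0x80 | (cp & 0x3F),          # 10xxxxxx
    ])


def three_byte_form(unit: int) -> bytes:
    """Three-byte UTF-8 layout applied to a single 16-bit code unit."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def six_byte_form(high: int, low: int) -> bytes:
    """Incorrect for UTF-8: each surrogate converted separately (the CESU-8 layout)."""
    return three_byte_form(high) + three_byte_form(low)


good = four_byte_utf8(0xD841, 0xDC0A)
bad = six_byte_form(0xD841, 0xDC0A)
print(good.hex(" "))  # f0 a0 90 8a
print(bad.hex(" "))   # ed a1 81 ed b0 8a
assert good == chr(0x2040A).encode("utf-8")

try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    print("six-byte form rejected by a strict UTF-8 decoder")
```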

Encoding choices for MS
UTF-16, mostly
Occasionally UTF-8
Even more occasionally, UTF-32
Reasons:
There was already an existing, well-tested set of APIs that support UCS-2, which is a subset of UTF-16. A completely new API set was not required.
A move to UTF-32 would require twice as much space for every BMP character.
A move to UTF-8 would require more space in many cases: three bytes instead of two for BMP characters above U+07FF, which includes CJK text.
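The space trade-off can be checked directly. The following Python sketch counts the bytes each encoding needs for a few sample characters; the sample characters and the helper name encoded_sizes are chosen for illustration only.

```python
# The "-le" codec variants are used so that no byte order mark is counted.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes needed for ch in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


samples = {
    "Latin A (U+0041)": "A",
    "CJK ideograph (U+4E2D)": "\u4e2d",
    "Supplementary (U+2040A)": "\U0002040A",
}
for label, ch in samples.items():
    print(label, encoded_sizes(ch))
# Latin A (U+0041) {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# CJK ideograph (U+4E2D) {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# Supplementary (U+2040A) {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```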

The products...
Mostly the new generation of products:
  Windows 2000/XP
  Office XP (some support in Office 2000)
  Visual Studio .NET
  .NET's Common Language Runtime (CLR)
Most (all) of these products supported Unicode already
  A little bit of extra work was needed for supplementary characters
  Usually just UTF-8 changes were needed

Windows 2000
Uniscribe support for rendering
Each surrogate pair is a single grapheme
APIs like CharPrev/CharNext not changed
No specific surrogate font/IME
Must be turned on: http://msdn.microsoft.com/library/en-us/intl/unicode_192r.asp

Windows XP
*.* from Windows 2000
Turned on by default!
GDI+ support for rendering
Font CMAP extensions
Lots of UTF-8 issues fixed
No specific surrogate font/IME (yet)
Extensions to fallback fonts [limited]:
  HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane1
  HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane2
  HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane3
  (etc.)

Other system components
MLang
Internet Explorer
http://i18nWithVB.com/surrogate_ime/
IIS 5.0/6.0

The downlevel story
No good support for Unicode, let alone supplementary characters
Uniscribe/RichEdit does improve the downlevel story for display purposes
Officially, no support on Win9x

The Office suite
Word
FrontPage
Excel/Access
Outlook
RichEdit 4.0

Office - Specific Features
Insertion/Deletion of text - All
Cursor movement - All
Font linking/fallback - All (Word's is best)
UTF-8 issues fixed - All
Enhanced word breaking - All (Word/RichEdit)
Vertical text - Word/PowerPoint/Publisher/RichEdit
Direct entry (Alt+nnnnnn, hhhhh + Alt+x) - Word/RichEdit

CHS/CHT/CHP Office
The product and the langpacks support an extended Unicode IME that handles supplementary characters
An Extension B font is also included

.NET CLR/Visual Studio .NET
String class and globalization namespace
StringInfo
GetTextElementEnumerator
  Handles supplementary characters
  Also handles composite characters
GDI+
VS IDE support
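To illustrate what treating a surrogate pair or a composite character as one text element means, here is a rough Python sketch. It is not the .NET implementation and does not reproduce its exact rules: it simply walks UTF-16 code units, keeps each surrogate pair together, and attaches combining marks (as reported by Python's unicodedata module) to the preceding element. Both function names are hypothetical.

```python
import unicodedata


def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def text_element_starts(units: list) -> list:
    """Return the code-unit index at which each text element begins."""
    starts = []
    i = 0
    while i < len(units):
        unit = units[i]
        paired = (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                  and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if paired:
            code_point = (((unit & 0x3FF) << 10) | (units[i + 1] & 0x3FF)) + 0x10000
            width = 2  # a surrogate pair occupies two code units
        else:
            code_point = unit
            width = 1
        # A combining mark joins the previous element instead of starting a new one.
        is_combining = unicodedata.combining(chr(code_point)) != 0
        if not (starts and is_combining):
            starts.append(i)
        i += width
    return starts


# "a", then U+2040A (two code units), then "e" followed by a combining acute accent.
units = utf16_code_units("a\U0002040Ae\u0301")
print(len(units))                  # 5
print(text_element_starts(units))  # [0, 1, 3]
```

Five code units yield three text elements, which is the distinction the slide draws between code units and what a user perceives as characters.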

SQL Server
Past - no support (for Unicode, even!)
Present - surrogate "safe" (neutral)
Future - surrogate "aware"

Items not [currently] supported
Character Map
Graph 10
Outlook 10 mail headers
Fonts/IMEs
"Collations" for supplementary characters

Collation plan for supplementary characters in the UCA?
All Plane 1 (non-ideographic) characters sort after all the other non-ideographic scripts but before the ideographs.
All Plane 2 (ideographic) characters will be sorted after all the ideographs on the BMP.
All Plane 3-14 characters (currently not assigned) will be treated like any other unassigned characters.
Plane 14 language tags will be treated as if they were unassigned.
All characters encoded in Plane 15-16 (private use) will be sorted after all other characters.

Questions?

Supplementary Character Support in Microsoft Products
Don't forget to fill out your evals!