Murray Sargent III Microsoft Corporation Text Services Group, Word Tips & Tricks on Editing and Displaying Unicode Text.


What's RichEdit?
• RichEdit 3.0 is a set of plain/rich-text, single/multiline, Unicode/ANSI edit controls in a single worldwide binary
• Multilevel undo, message and COM interfaces, Word compatibility, pretty rich text
• Outline view, zoom, font binding, latest in IME support, and rich complex-script support (BiDi, Indic, and Thai)
• Next version: pagination, nested tables, tight wrap, 2D math (maybe!)…
• Clients: Office dialogs, WordPad, Outlook RTF editor, Pocket Word,…

Introduction
Discuss some problems in manipulating multilingual Unicode text:
• Multiple fonts to display Unicode plain text
• Neutral characters; deunifying characters that look different in different scripts
• Working with complex scripts, like Arabic
• Using keyboards to enter Unicode characters conveniently
• Maintaining backward compatibility with previous character sets
• Navigating through text that includes multicode characters
• Implementing glyph variants and surrogate pairs

Font Binding
• Most Unicode characters belong to scripts
• Associate a font bundle with each position in a document
• When inserting characters, assign each one to a script
• For CJK, check surrounding characters for Kana and Hangul as clues to use Japanese or Korean fonts instead of Chinese
• Assign scripts to neutrals and digits
• Keyboard language, especially with IMEs, provides strong binding clues
• Format inserted characters with fonts assigned to scripts. Check the current font to see if it supports the required script
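
The CJK clue described above can be sketched roughly as follows. This is a minimal illustration, not RichEdit's actual logic: the function names, the script labels, and the small set of Unicode ranges are assumptions chosen for the example.

```python
def char_script(ch: str) -> str:
    """Assign one character to a coarse script bucket (illustrative ranges only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:                         # Hiragana and Katakana
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:  # Hangul syllables, Jamo
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:                         # CJK Unified Ideographs
        return "han"
    if 0x0370 <= cp <= 0x03FF:
        return "greek"
    if 0x0400 <= cp <= 0x04FF:
        return "cyrillic"
    if 0x0600 <= cp <= 0x06FF:
        return "arabic"
    if 0x0E00 <= cp <= 0x0E7F:
        return "thai"
    if cp < 0x0250 and ch.isalpha():                   # Basic Latin through Latin Extended-B
        return "latin"
    return "neutral"                                   # blanks, digits, punctuation, ...


def han_font_family(text: str, default: str = "chinese") -> str:
    """Pick a (hypothetical) font family for Han characters from nearby clues."""
    scripts = {char_script(c) for c in text}
    if "kana" in scripts:
        return "japanese"
    if "hangul" in scripts:
        return "korean"
    return default
```

For instance, `han_font_family("日本語のテキスト")` picks the Japanese family because of the Kana, while a run of Han characters alone falls back to the default.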

Font Binding Problems
• Character not in any script, e.g., mathematical symbols, arrows, dingbats: use the current font or bind to a font with a font signature covering the appropriate Unicode range. Or invent a new script ID
• Font signature may be zero, i.e., unsupported. Call EnumFontFamiliesEx() to enumerate all charsets for the facename
• Font signature may claim support for Unicode ranges, but miss some characters. The cmap reveals support on a code-by-code basis (slow to access)
• Ironically, charset or codepage is a good script ID

Language Detection & Font Binding
• Korean and Japanese are often easy to spot because of Hangul and Kana characters, respectively
• For CJK, can convert back to a codepage and see if errors occur (Ken Lunde's suggestion)
• For proofing purposes, accurate language identification is needed. For font binding, script identification is usually sufficient
• Typically more than one language corresponds to a script, e.g., Latin script. Essentially only one uses the Korean script
• Natural language processing techniques allow good language identification if more than a few words are involved, e.g., a sentence
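
The round-trip suggestion above might look like this in practice. The sketch uses Python's built-in codecs as stand-ins for the Windows codepages; the candidate list and the function name are assumptions made for illustration.

```python
# Candidate legacy encodings, standing in for the CJK codepages.
CANDIDATES = [
    ("shift_jis", "Japanese"),
    ("gb2312", "Simplified Chinese"),
    ("big5", "Traditional Chinese"),
    ("euc_kr", "Korean"),
]


def codepages_that_fit(text: str) -> list[str]:
    """Return the labels of all candidate codepages that can represent text."""
    fits = []
    for codec, label in CANDIDATES:
        try:
            text.encode(codec)          # conversion back to the codepage
        except UnicodeEncodeError:
            continue                    # an error rules this codepage out
        fits.append(label)
    return fits
```

Text that survives only one conversion gives a strong hint; text that survives several (Kana, for example, is present in more than one of these character sets) still needs other clues such as the keyboard language.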

Big Fonts
• Bitstream Cyberbit has most Unicode characters (big font)
• Some big fonts have CJK glyph variants for Japanese vs Simplified Chinese vs Traditional Chinese vs Korean
• Font-binding code needs to avoid unnecessary (and unwanted) font binding with such fonts
• Recognize such fonts by using font signature Unicode ranges and script (codepage) information

Font Sizing
• In dialogs, 8-pt Latin characters are commonly used
• 8-pt Chinese characters are hard to read, so it is better to use 9-pt Chinese characters in combination with 8-pt Latin characters
• Latin characters have bigger descenders than Chinese characters, since the latter only need room for an underline
• Combining 8-pt Latin characters with 9-pt Chinese characters and keeping the same baseline increases line height to 9 pts plus extra height for the Latin descender
• Result is more like 10 points: shifts text too high in a dialog box originally designed to handle one language

Complex Scripts
• Unicode covers many complex scripts, e.g., Arabic, Thai
• Complex scripts require a layout engine that translates character codes to glyph indices (often referencing ligatures)
• A general Unicode text engine has to have access to a complex-script layout engine
• At the previous Unicode conference David Brown discussed such an engine, Uniscribe, which runs on all Windows platforms and is shipped with recent versions of Internet Explorer
• For performance: only use the complex-script engine if needed

Neutrals
• Many characters are neutral or multiscript and can be rendered with many different fonts
• E.g., blank, ASCII punctuation, ASCII in general, other punctuation, and decimal digits
• Some scripts render neutrals very differently than others, and Unicode's occasional over-unification has complicated the choice of font
• E.g., a Western ellipsis consists of three dots on the baseline, while a Japanese ellipsis has three raised dots
• Unicode Standard gives detailed rules for neutrals in BiDi text
• Simple rule: neutrals surrounded by nonneutral characters of the same kind should be rendered with the font of the nonneutrals
• Compatibility characters, such as ASCII fullwidth characters, reveal which script they belong to
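
The simple rule above can be sketched as a pass over per-character script labels. The label strings and the function name are hypothetical; runs of neutrals whose neighbors disagree (or that touch the start or end of the text) are left unresolved here, since the slides do not say how those cases are decided.

```python
def resolve_neutrals(scripts: list[str]) -> list[str]:
    """Give each run of 'neutral' the script of its neighbors when both sides agree."""
    out = list(scripts)
    n = len(out)
    i = 0
    while i < n:
        if out[i] != "neutral":
            i += 1
            continue
        j = i
        while j < n and out[j] == "neutral":    # find the end of the neutral run
            j += 1
        left = out[i - 1] if i > 0 else None
        right = out[j] if j < n else None
        if left is not None and left == right:  # surrounded by the same kind
            out[i:j] = [left] * (j - i)
        i = j
    return out
```

So a blank between two Latin letters is rendered with the Latin font, while a blank between a Latin letter and a Han character stays neutral until some other clue (keyboard language, current font) settles it.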

Backward Compatibility
• Unicode text engine has to be able to import and export text in other standards, which are defined by their codepages
• Given nonUnicode plain text, which codepage should one use to convert to/from Unicode?
• On localized systems, the system code page is a good bet
• In multilingual text, you can enter text using keyboards in a variety of languages that need either Unicode or multiple code pages
• For searching text, best choice seems to be to use the current keyboard code page
• If text begins with a UTF-8 BOM, use UTF-8 conversion
• If text begins with a rich-text header, e.g., {\rtf or <!doctype html, use the appropriate conversion routine
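
A minimal sketch of the sniffing order described above. The function name, the return labels, and the `cp1252` default are assumptions for illustration; a real engine would substitute the actual system or keyboard code page.

```python
import codecs


def sniff_plain_text(data: bytes, system_codepage: str = "cp1252") -> str:
    """Guess how to convert incoming bytes, checking signatures before codepages."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"                    # UTF-8 BOM: use UTF-8 conversion
    head = data[:64].lstrip().lower()         # headers are matched case-insensitively
    if head.startswith(b"{\\rtf"):
        return "rtf"                          # hand off to an RTF reader
    if head.startswith(b"<!doctype html"):
        return "html"                         # hand off to an HTML reader
    return system_codepage                    # otherwise fall back to the codepage


# Plain text with a BOM can then be decoded directly:
text = b"\xef\xbb\xbfcaf\xc3\xa9".decode(sniff_plain_text(b"\xef\xbb\xbfcaf\xc3\xa9"))
```

The "rtf" and "html" labels are not codec names; they only signal that a dedicated conversion routine is needed instead of a codepage decode.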

Backward Compatibility (cont)
• Need a little rich-text functionality (minimal language tagging) to display Unicode plain text unambiguously in some CJK scenarios
• This functionality handles font choices and language-dependent glyph variants
• There can be a disparity between typed text and set text
• When a user types in text using a keyboard charset, edit engine knows the charset and therefore can insert accurate Unicode text, including which CJK glyph variant to use
• Client gets text as pure ANSI (or Unicode) text without script clues
• Would be handy to have script tags. Language tags also work, but are a case of overkill unless proofing tools are to be supported

Unicode on Win95/98
• Win95/98 supports a limited subset of Unicode text functions
• ExtTextOutW() works in most cases. Not on Win95J or with metafiles, so convert back to ANSI whenever possible
• Device drivers may not handle Unicode text
• With TrueType it's possible to force downloading of fonts and use Unicode more reliably
• A number of GDI text APIs aren't implemented, e.g., GetGlyphOutlineW()
• GetStringTypeExW() is stubbed out, so all references to character property tables have to go through a codepage translation (WideCharToMultiByte())
• Text boxes, list boxes, comboboxes are all ANSI; use RichEdit for Unicode

Unicode Keyboard Input
• National keyboards provide ways to input many Unicode characters, e.g., Greek, Russian, and all ordinary European text
• IMEs (input method editors) let you type phonetic characters to get a partially composed character sequence. Then type a blank to request composition. If the composition is reasonably unique, you get a fully composed character; else you get a menu of possible resolutions
• Unicode hex input: type a Unicode hexadecimal code into the text, then type a special hot key, e.g., Alt+x, to convert the hex to a Unicode character
• Type Alt+X to replace a character by its hexadecimal number
• Input sequence checking: Vietnamese, Thai, and Indic languages don't allow all Unicode sequences to be valid and utilize special input-sequence-checking code to disallow illegal sequences. For example, Vietnamese only allows tone marks on vowels
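
The hex-input hot key can be sketched as two string transformations applied at the end of the text (standing in for the caret position). The function names and the choice of accepting two to six trailing hex digits are assumptions; the slides do not specify how many digits the real hot key examines.

```python
import re

_HEX_TAIL = re.compile(r"[0-9A-Fa-f]{2,6}$")   # assumed: 2 to 6 hex digits before the caret


def hex_to_char(text: str) -> str:
    """Replace a trailing hexadecimal code with the character it names."""
    m = _HEX_TAIL.search(text)
    if not m:
        return text
    cp = int(m.group(), 16)
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:  # not a valid scalar value
        return text
    return text[:m.start()] + chr(cp)


def char_to_hex(text: str) -> str:
    """Replace the trailing character with its hexadecimal code."""
    if not text:
        return text
    return text[:-1] + format(ord(text[-1]), "04X")
```

With these definitions, `hex_to_char("x = 3A9")` yields `"x = Ω"` and `char_to_hex("x = Ω")` yields `"x = 03A9"`.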

Unicode Surrogates
• Discuss 3 display models that could enable Win9x/WinNTx-based applications to display higher-plane characters (those in the 16 planes above the BMP). Ideas are still under development...
• First uses a plane index and a 16-bit offset
• Second uses a flat 32-bit index
• Third uses surrogate-pair ligatures
• Models aren't mutually exclusive, since they involve different cmaps (compressed tables used to convert codepoints to glyphs)
• All assume higher-plane characters are stored as standard Unicode surrogate pairs
• Alternative representations include straight 32-bit characters and UTF-8, but aren't as practical

Unicode Surrogates (cont)
• Using two 16-bit surrogates to represent a single character complicates more than measurement and display of characters:
• Arrow-key handlers and other methods that change character position must avoid ending up in between lead and trail surrogates
• Input methods need to map to a surrogate pair
• Case changes, line-breaking rules, sorting, file formats, and backing-store manipulations in general have to recognize and deal with pairs
• Surrogate code ranges make them easy to work with relative to multibyte encoding systems
• All three display models assume that GDI remains unchanged (need to be able to run on OSs already in the field)
• Also assume that 16-bit glyph indices are sufficient so that the TrueType rasterizer doesn't need to be revised
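
A sketch of arrow-key movement over a 16-bit backing store that never stops between a lead and a trail surrogate. The backing store is modeled as a list of UTF-16 code units; the helper names are hypothetical.

```python
def to_utf16_units(text: str) -> list[int]:
    """Model a 16-bit backing store as a list of UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_lead(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def move_right(units: list[int], pos: int) -> int:
    """Advance the caret one character, stepping over a whole surrogate pair."""
    if pos >= len(units):
        return len(units)
    pos += 1
    if pos < len(units) and is_trail(units[pos]) and is_lead(units[pos - 1]):
        pos += 1                      # landed between lead and trail: keep going
    return pos


def move_left(units: list[int], pos: int) -> int:
    """Move the caret back one character, stepping over a whole surrogate pair."""
    if pos <= 0:
        return 0
    pos -= 1
    if pos > 0 and is_trail(units[pos]) and is_lead(units[pos - 1]):
        pos -= 1                      # landed between lead and trail: keep going
    return pos
```

Because lead and trail surrogates occupy disjoint ranges, each check looks at only two adjacent code units; that is the property that makes pairs easier to handle than multibyte encodings.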

Surrogate Planar Model
• Characters in a font all belong to a particular plane
• No changes required to OS. Applications extend font-binding logic to handle font switches to appropriate planes
• Character indices remain 16-bit: allows ExtTextOutW family to be used directly
• Model easy for apps to use today in a platform-independent way if no complex scripts are involved
• Complex scripts need a layout engine. Then applications can ignore the model issue, since the layout engine handles OS/font interactions
• Truncated 16-bit code indices may map codes in higher planes to common control or neutral codes
• For surrogate-unaware text-processing code, some ranges would have to be reserved in upper planes
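
The planar model's split into a plane index and a 16-bit offset can be illustrated as follows. The function names are hypothetical; the arithmetic is the standard surrogate-pair decoding.

```python
def combine_surrogates(lead: int, trail: int) -> int:
    """Decode a standard surrogate pair into a code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a lead/trail surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


def plane_and_offset(code_point: int) -> tuple[int, int]:
    """Split a code point into (plane index, 16-bit offset within the plane)."""
    return code_point >> 16, code_point & 0xFFFF


# The plane selects the font; the offset is the 16-bit index passed to text output.
plane, offset = plane_and_offset(combine_surrogates(0xD801, 0xDC00))
```

The truncation hazard is visible here too: U+2000A splits into plane 2 and offset 0x000A, and 0x000A on its own is the line-feed control code, which is why surrogate-unaware code would need reserved ranges in the upper planes.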

Surrogate Flat and Ligature Models
• Flat 32-bit model uses a 32-bit code to index into a new 32-bit cmap in the font file to translate the codes to 16-bit glyph indices
• Glyph indices are used to access the TextOut family
• Method is too tricky for most applications to handle directly: need a surrogate-aware version of Uniscribe
• Font binding is done using the font signature
• Alternatively, application could use 32-bit character strings with a 32-bit TextOut family housed in a platform-independent component
• Ligature model requires use of a complex-script engine to access ligature tables

Comparison of Surrogate Models
• Ease of implementation: for simple scripts, planar model is easiest. In a worldwide-binary environment, need Uniscribe, which can handle OS/font interactions
• Performance: code-to-glyph mapping has to be done at some point. Uniscribe is slower and more RAM intensive than planar model or 32-bit TextOut component
• Flexibility: flat and ligature models can access chars in all 17 planes even in the same font; planar model allows one plane per font
• Backward compatibility: planar model only needs appropriate fonts and surrogate-aware apps to work on all Windows platforms
• Flat and ligature models require a complex-script engine or a 32-bit TextOut component to run on all Win9x/WinNTx platforms

Nonspacing Combining Marks
• Multicode characters (surrogate pairs, CRLFs, combining-mark and variant-tag sequences) require special display/navigation handling
• Render combining-mark sequences by standard system calls and fonts that support combining marks. Better display needs a layout engine that talks to OpenType
• Simple caret movement across combining-mark sequences prevents stopping inside a sequence. Backspace key deletes one mark at a time
• Mouse-cursor hit testing leaves selection at beginning/end of combining-mark sequence (more elegant model allows selection and editing of individual marks)
• Cool thing: if you can navigate past CRLF combinations, you can modify the corresponding code to handle surrogate pairs and combining-mark sequences quite easily
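
A rough sketch of caret stops that treat CRLF and combining-mark sequences as single units. It uses the Unicode general categories Mn and Me to recognize marks, which is a simplification of what a layout engine does; the function names are hypothetical. Python strings are indexed by code point, so surrogate pairs need no extra branch here, whereas a 16-bit backing store would add one in the same place as the CRLF check.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True for nonspacing and enclosing combining marks."""
    return unicodedata.category(ch) in ("Mn", "Me")


def cluster_starts(text: str) -> list[int]:
    """Return the indices at which the caret is allowed to stop."""
    starts = []
    i = 0
    while i < len(text):
        starts.append(i)
        if text.startswith("\r\n", i):
            i += 2                                  # CRLF is one multicode unit
        else:
            i += 1                                  # base character ...
            while i < len(text) and is_mark(text[i]):
                i += 1                              # ... plus its combining marks
    return starts


def next_caret(text: str, pos: int) -> int:
    """Move right to the next allowed caret position."""
    for start in cluster_starts(text):
        if start > pos:
            return start
    return len(text)
```

Backspace deleting one mark at a time, as described above, would simply remove `text[pos - 1]` instead of the whole cluster.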

Glyph Variants
• Character variant:
  1) Different character open to future coding
  2) Prescribed variant (Mongolian)
  3) Systematic semantic variation (different forms like italic, bold, script, Fraktur in math expressions)
• Glyph variant:
  1) Artistic variant: free variation (57 &s in Poetica font)
  2) Context preferred style (CJK language-based variants)
  3) Overloaded code points (U+005C: \ ¥ )
  4) Historical variant: glyph changed over time
• Identity variant: 2 external characters map to same Unicode character

Handling Glyph Variants
• Character variant is open to separate encoding. But if already used, complicates search algorithms (Ş vs Romanian S comma)
• Two approaches: inline variant marks and out-of-plane annotations
• Inline variant marks need to be ignored in some searches
• Out-of-plane annotation is invisible in plain text and requires more memory than an inline variant mark
• Semantically different characters, e.g., math italic b and math script b, need to be distinguishable in searches, so separate encoding or use of inline variant marks is desirable
• Current proposal for inline variant marks defines 256 standard variant codes in plane 14 as well as 256 codes for user-defined variant codes
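
A sketch of a search that ignores inline variant marks. As an assumption, it recognizes the variation selectors of the published Unicode Standard (U+FE00..U+FE0F and U+E0100..U+E01EF), which are not the same code assignments as the proposal described above; the function names are hypothetical.

```python
def is_variation_selector(ch: str) -> bool:
    """True for the variation selectors of the published standard (assumed ranges)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variants(text: str) -> str:
    """Drop inline variant marks so that variants compare equal to their base."""
    return "".join(c for c in text if not is_variation_selector(c))


def find_ignoring_variants(haystack: str, needle: str) -> int:
    """Find needle in haystack, ignoring variant marks on both sides.

    The returned index refers to the stripped haystack, not the original.
    """
    return strip_variants(haystack).find(strip_variants(needle))
```

A search that must distinguish variants would skip the stripping step, which is the trade-off the slide points to: the marks are cheap to ignore but have to be ignored deliberately.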

Conclusions
• Have addressed issues encountered in creating Unicode editors. Issues include:
• Automatic choice of fonts for Unicode plain text
• Handling nonUnicode documents in Unicode text engines
• Ways to input Unicode text
• Combining-mark sequences, surrogate pairs, navigation in multicode text, and glyph variants
• Some ideas have been implemented in RichEdit 3.0 control and other text engines
• Unicode surrogate pairs and glyph variants need decisions...