Hon Wah Chan Murray Sargent III Microsoft Corporation Text Services Group, Word Multilingual Editing using RichEdit 4+

1 Multilingual Editing using RichEdit 4+
Hon Wah Chan, Murray Sargent III
Microsoft Corporation, Text Services Group, Word

2 Introduction
- RichEdit is a text engine with a hierarchy of presentation formats
- Features such as automatic choice of fonts, rich text, 2D text objects
- Handling nonUnicode documents in Unicode text engines
- Describe interfaces and component usage
- Ways to input Unicode text using IMEs, speech
- Demo

3 What's RichEdit?
- RichEdit 4.x is a set of plain/rich-text, single/multiline, Unicode/ANSI edit controls and combo/listboxes in a single world-wide binary
- Multilevel undo, message & COM interfaces, Word compatibility, pretty rich text
- Outline view, zoom, font binding, latest in IME support, and rich complex-script support (BiDi, Indic, and Thai)

4 Clients include
- Handheld PC PocketWord
- eBooks
- OE (for mail header)
- Borland's Delphi
- SQL server dev tools, RAID
- MSN Companion chat
- Via Win2k Wrapper: cc:mail, WebEditPro, Eudora, Encarta, Money(US), Sibelius, Borland TRichedit class, apps created with VB, MFC…
- Outlook mail note, post-it
- Most Office dialogs
- All OSes since Win98
- Wordpad, Charmap
- Darwin installer
- WebCalc
- Project
- Visual Studio, DaVinci
- Publisher
- Front Page

5 Some Fancier Features
- Features added for ebooks: pagination, hyphenation, kerning, ClearType support, text wrap around embedded objects
- Multilevel tables
- Autocorrect
- AutoURL detection (improved from 3.0)

6 2D Text Objects
- RichEdit 4.5 (in development) supports WYSIWYG editing of many 2D objects
- Ruby, Tatenakayoko, Warichu, Kumimoji
- Math: fractions, autosizing brackets, boxes, matrices, integrals
- Demo will show some of these features

7 Backward Compatibility
- Unicode text engines need to import/export text in other character sets
- Given nonUnicode plain text, which codepage should one use to convert to/from Unicode?
- On localized systems, the system code page is a good bet
- In multilingual text, you can enter text using keyboards in a variety of languages that need either Unicode or multiple code pages
- For searching text, the best choice seems to be to use the current keyboard code page
- If text begins with a BOM, it's Unicode
- If text begins with a rich-text header, e.g., {\rtf or, use appropriate conversion routine
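
The import decision on this slide can be sketched roughly as follows. This is an illustrative Python sketch, not RichEdit's actual code: the names classify and to_text are made up for the example, cp1252 merely stands in for "the system code page", and the RTF branch is left unimplemented.

```python
import codecs

# BOM signatures, longest first so a UTF-32 LE BOM is not mistaken
# for the UTF-16 LE BOM it begins with.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def classify(data, fallback_codepage="cp1252"):
    """Return (kind, encoding); kind is 'unicode', 'rtf', or 'codepage'."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return "unicode", name
    if data.startswith(b"{\\rtf"):
        return "rtf", None
    # No signature at all: fall back to a code page chosen by the caller
    # (assumption: the system or current keyboard code page).
    return "codepage", fallback_codepage


def to_text(data, fallback_codepage="cp1252"):
    """Convert raw bytes to a Python str following classify()."""
    kind, enc = classify(data, fallback_codepage)
    if kind == "unicode":
        bom_len = next(len(b) for b, n in _BOMS if n == enc)
        return data[bom_len:].decode(enc)
    if kind == "rtf":
        raise NotImplementedError("RTF needs a real RTF reader")
    return data.decode(enc, errors="replace")
```

Passing the keyboard code page instead of the system one as fallback_codepage would model the multilingual case mentioned above.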

8 Backward Compatibility (cont)
- Need a little rich-text functionality to display Unicode plain text unambiguously in some CJK scenarios
- This functionality handles font choices and language-dependent glyph variants
- When a user types in text using a keyboard charset, edit engine knows charset and therefore can insert accurate Unicode text including which CJK glyph variant to use
- Client gets text as pure ANSI (or Unicode) text without script clues
- Would be handy to have script tags

9 Complex Scripts
- Unicode covers many complex scripts, e.g., Arabic, Indic, Thai, ancient Korean
- Complex scripts require a layout engine that translates character codes to glyph indices (often referencing ligatures)
- RichEdit uses Uniscribe and the MS line-layout component for complex scripts

10 Font Binding
- Most Unicode characters belong to scripts
- Associate with each position in a document a font bundle
- When inserting characters, assign each one to a script
- For CJK, check surrounding characters for Kana and Hangul as clues to use Japanese or Korean fonts instead of Chinese
- Assign scripts to neutrals and digits
- Keyboard language, especially with IMEs, provides strong binding clues
- Format inserted characters with fonts assigned to scripts. Check current font to see if it supports required script
- RichEdit 4.0 has 50 scripts for Unicode 3.1. Client can specify what default font to use for a given script.
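
A toy version of these binding steps, in Python, might look like the following. It is only a sketch under loud assumptions: it knows four scripts rather than 50, it scans the whole string rather than a neighborhood for Kana/Hangul clues, neutrals simply inherit from the preceding character, and the font names in FONTS are hypothetical placeholders.

```python
from itertools import groupby

# Hypothetical per-script default fonts (placeholders, not real font names).
FONTS = {
    "latin": "LatinFont",
    "japanese": "JapaneseFont",
    "korean": "KoreanFont",
    "chinese": "ChineseFont",
}


def char_script(ch):
    """Very coarse script classification for one character."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:                      # Hiragana, Katakana
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:                      # CJK Unified Ideographs
        return "han"
    if ch.isalpha() and cp < 0x0250:
        return "latin"
    return "neutral"                                # digits, spaces, punctuation


def assign_scripts(text, default_han="chinese"):
    """Assign a font-binding script to every character in text."""
    raw = [char_script(ch) for ch in text]
    # Kana or Hangul nearby is a clue for how to bind Han characters.
    if "kana" in raw:
        han = "japanese"
    elif "hangul" in raw:
        han = "korean"
    else:
        han = default_han
    mapping = {"kana": "japanese", "hangul": "korean", "han": han}
    resolved = [mapping.get(s, s) for s in raw]
    # Neutrals and digits inherit the script of the preceding character.
    last = None
    for i, s in enumerate(resolved):
        if s == "neutral":
            if last is not None:
                resolved[i] = last
        else:
            last = s
    # Leading neutrals take the first real script (Latin if there is none).
    first = next((s for s in resolved if s != "neutral"), "latin")
    return [first if s == "neutral" else s for s in resolved]


def font_runs(text, fonts):
    """Split text into (font, substring) runs according to script."""
    runs, i = [], 0
    for script, group in groupby(assign_scripts(text)):
        n = len(list(group))
        runs.append((fonts[script], text[i:i + n]))
        i += n
    return runs
```

A real engine would also consult the keyboard language and check whether the current font already covers the required script before switching fonts.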

11 Language Detection & Font Binding
- Korean and Japanese are often easy to spot because of Hangul and Kana characters, respectively
- For CJK, can convert back to codepage and see if errors occur (Ken Lunde's suggestion)
- For proofing purposes, accurate language identification is needed. For font binding, script identification is usually sufficient
- Typically more than one language corresponds to a script, e.g., Latin script. Essentially only one language uses the Korean script
- Natural language processing techniques allow good language identification if more than a few words are involved, e.g., a sentence
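
The convert-back-to-codepage test can be tried out with Python's built-in codecs. This is a rough sketch: the candidate list and the function name codepages_that_fit are assumptions for illustration, and Python's codec tables may differ in detail from the Windows code pages of the same number.

```python
# Candidate (language, code page) pairs to try; an illustrative list.
CANDIDATES = [
    ("japanese", "cp932"),
    ("korean", "cp949"),
    ("simplified chinese", "cp936"),
    ("traditional chinese", "cp950"),
]


def codepages_that_fit(text):
    """Return the languages whose code page can represent all of text."""
    fits = []
    for language, codepage in CANDIDATES:
        try:
            text.encode(codepage)       # conversion error => not this code page
        except UnicodeEncodeError:
            continue
        fits.append(language)
    return fits
```

The result is only a clue: plain ASCII fits every code page, and many Han characters fit several, so this narrows the choice rather than deciding it.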

12 Font Sizing
- In dialogs, 8-pt Latin characters are commonly used
- 8-pt Chinese characters are hard to read, so better to use 9 points in combination with 8-pt Latin characters
- Latin characters have bigger descenders than Chinese characters, since latter only need room for underline
- Combining 8-pt Latin characters with 9-pt Chinese characters and keeping same baseline increases line height to 9 pts plus extra height for Latin descender
- Result is more like 10 points: shifts text too high in dialog box originally designed to handle one language

13 Unicode Surrogate Pairs
- Using two 16-bit surrogates to represent a single character complicates more than measurement and display of characters:
  - Arrow-key handlers and other methods that change character position must avoid ending up in between lead and trail surrogates
  - Input methods need to map to surrogate pair
  - Case changes, line-breaking rules, sorting, file formats, and backing-store manipulations in general have to recognize and deal with pairs
- Surrogate code ranges make them easy to work with relative to multibyte encoding systems
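
To make the arrow-key point concrete, here is a small Python sketch that models the backing store as a list of 16-bit code units. The helper names (to_utf16, move_right, move_left) are illustrative, not RichEdit functions. Because lead and trail surrogates occupy disjoint ranges, a single range test on one code unit is enough to tell where a pair starts.

```python
def to_utf16(text):
    """Encode a str as a list of 16-bit code units."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))     # lead surrogate
            units.append(0xDC00 | (cp & 0x3FF))   # trail surrogate
        else:
            units.append(cp)
    return units


def is_lead(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    return 0xDC00 <= unit <= 0xDFFF


def move_right(units, pos):
    """Advance the caret one character, never stopping inside a pair."""
    if pos >= len(units):
        return len(units)
    pos += 1
    if pos < len(units) and is_trail(units[pos]) and is_lead(units[pos - 1]):
        pos += 1
    return pos


def move_left(units, pos):
    """Move the caret back one character, never stopping inside a pair."""
    if pos <= 0:
        return 0
    pos -= 1
    if pos > 0 and is_trail(units[pos]) and is_lead(units[pos - 1]):
        pos -= 1
    return pos
```

For the text "a", U+1D11E, "b" the caret positions visited are 0, 1, 3, 4; position 2 (between the two surrogates) is skipped in both directions.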

14 Nonspacing Combining Marks
- Multicode characters (surrogate pairs, CRLFs, combining-mark and variant-tag sequences) require special display/navigation handling
- Render combining-mark sequences by standard system calls and fonts that support combining marks. Better display needs layout engine that talks to OpenType
- Simple caret movement across combining-mark sequences prevents stopping inside a sequence. Backspace key deletes one mark at a time
- Mouse-cursor hit testing leaves selection at beginning/end of combining-mark sequence (more elegant model allows selection and editing of individual marks)
- Cool thing: if you can navigate past CRLF combinations, you can modify corresponding code to handle surrogate pairs and combining-mark sequences quite easily
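
The "cool thing" can be illustrated with a single predicate that says whether a caret position falls inside a multicode sequence. The Python sketch below is a simplified model, not RichEdit's code: it works on code points (so surrogate pairs do not arise), treats only general categories Mn and Me as combining marks, and the function names are made up.

```python
import unicodedata


def inside_sequence(text, pos):
    """True if a caret at pos would sit inside a multicode sequence."""
    if pos <= 0 or pos >= len(text):
        return False
    if text[pos - 1] == "\r" and text[pos] == "\n":    # inside a CRLF
        return True
    # A nonspacing or enclosing mark attaches to what precedes it.
    return unicodedata.category(text[pos]) in ("Mn", "Me")


def next_caret(text, pos):
    """Move right by one user-visible unit."""
    if pos >= len(text):
        return len(text)
    pos += 1
    while inside_sequence(text, pos):
        pos += 1
    return pos


def prev_caret(text, pos):
    """Move left by one user-visible unit."""
    if pos <= 0:
        return 0
    pos -= 1
    while inside_sequence(text, pos):
        pos -= 1
    return pos


def backspace(text, pos):
    """Delete a single code point before pos, i.e. one mark at a time."""
    if pos <= 0:
        return text, 0
    return text[:pos - 1] + text[pos:], pos - 1
```

Supporting another kind of sequence then means adding one more test to inside_sequence; the navigation functions stay unchanged.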

15 Interfaces
- Messages and keyboard
- File read/write (plain text or RTF)
- TOM (Text Object Model)
- ITextServices/ITextHost interfaces

16 RichEdit Message Interface
- System messages
- Keyboard messages
- Mouse messages
- Clipboard messages
- Edit messages: RichEdit supports all but four of the system edit messages
- RichEdit messages
- Character/paragraph formatting
- Text input/query
- Notification

17 File Formats
- Plain text can be saved/read encoded in any codepage, including Unicode and UTF-8
- RTF is the principal rich-text format
- UTF-8 RTF is used preferentially for cut/copy/paste. Can be used in stream operations
- Copying text to/from Word can be a handy way to get desired formatting into a RichEdit instance
- HTML is available via system converters

18 TOM (Text Object Model)
- A set of COM dual interfaces that allow Unicode rich/plain text to be manipulated by VB, C/C++, and Java clients
- Access for spelling/grammar checkers
- Accessibility
- Powerful and efficient text processing primitives. Embedded scripts

19 TOM (cont)
ITextDocument      Top-level editing object
ITextStoryRanges   Enumerator for stories in document
ITextRange         Primary text interface: range of text
ITextFont          Character-attribute interface
ITextPara          Paragraph-attribute interface
ITextTag           HTML tag interface
ITextAttributes    Tag-attribute enumerator
ITextSelection     Screen highlighted text range
TextRange: Selection inherits all range methods

20 ITextServices/ITextHost Interfaces
- Windowless interfaces that go beyond message interface
- In-place active state: use window of the container
- Fewer system resources
- Faster activation and deactivation

21 Other Components Used
- Uniscribe
- MS line-layout component
- Windows Text Services Framework
- Callbacks for access to word-break, autocorrect, hyphenation, and ClearType libraries

22 Input Methods
- Support for the latest IMEs
- Speech and handwriting input (Windows Text Services Framework)
- Alt+x Unicode input method
- Standard hot keys

23 IMEs
- Support Level 2 and Level 3 IMEs
- Support Active Input Method Manager (AIMM)
- Reconversion: user can convert final string back to composition mode, allowing easy selection of a different candidate string
- Document feed: provides IME with text for current paragraph to increase conversion accuracy during typing
- Mouse operation: gives user better control over candidate and UI windows
- Caret position: gets current caret and line info, which IME98 uses to position UI windows (e.g., candidate list)

24 Windows Text Services Framework
- Provides support for Far East input to aware applications on Win32 platforms of any language
- Provides consistent UI for different input methods
  - Speech, handwriting, IME
- Coordinated input
- Data persistence for dynamic text editing
- RichEdit supports both the native mode and Active Input Method Manager (AIMM) mode

25 Hex to Unicode Input Method
- Type Unicode character hexadecimal code
- Make corrections as needed
- Type Alt+x to convert to character
- Type Alt+x to convert back to hex (useful especially for missing glyph character)
- Resolve ambiguities by selection
- Input higher-plane chars using 5- or 6-digit code
- MS Word 2002 standard
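
The Alt+x toggle can be modeled in a few lines of Python. The rules below are assumptions chosen for this sketch, not a description of what RichEdit or Word actually do: up to six hex digits before the caret are read, the longest run that forms a valid code point wins, control codes below U+0020 and surrogate values are rejected, and an optional sel_start stands in for the user's selection.

```python
import string

HEX = set(string.hexdigits)


def _valid(code):
    """Accept printable-range code points, excluding surrogate values."""
    return 0x20 <= code <= 0x10FFFF and not (0xD800 <= code <= 0xDFFF)


def alt_x(text, caret, sel_start=None):
    """Toggle hex code <-> character before the caret; return (text, caret)."""
    if sel_start is not None:
        start = sel_start                    # user resolved the ambiguity
    else:
        start = caret
        while start > 0 and caret - start < 6 and text[start - 1] in HEX:
            start -= 1
    digits = text[start:caret]
    if digits and all(c in HEX for c in digits):
        while digits:
            code = int(digits, 16)
            if _valid(code):
                start = caret - len(digits)
                return text[:start] + chr(code) + text[caret:], start + 1
            digits = digits[1:]              # out of range: drop leading digit
    if caret > 0:
        # No usable hex code: convert the character back to its hex code.
        hexcode = format(ord(text[caret - 1]), "04X")
        return text[:caret - 1] + hexcode + text[caret:], caret - 1 + len(hexcode)
    return text, caret
```

For example, alt_x("b41", 3) reads all three digits as U+0B41, while alt_x("b41", 3, sel_start=1) converts only the selected "41" and yields "bA"; this is the kind of ambiguity that selection resolves.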

26 Unicode combobox/listbox
- Emulate the system combobox and listbox
- Unicode support on all Win32 platforms
- Allow mixed languages between items
- Modified EM_SETTEXTEX for inserting items
- Used in Office applications

27 Demo

28 Conclusions
- Have described RichEdit, an engine for text display and editing with a hierarchy of presentation formats
- Automatic choice of fonts for Unicode plain text including surrogate-pair characters, combining-mark sequences
- Handling nonUnicode documents in Unicode text engines
- Described interfaces and component usage
- Ways to input Unicode text using IMEs, speech
- Clients include many Office and Windows apps
- Able to display 2D Text Objects such as Ruby and Warichu

