Arabic, Hebrew, Hindi and Thai Support in IBM’s Java 2

Doug Felt, Eric Mader, John Raley

The Java2 platform provides the foundation for supporting a wide variety of languages and customs. Text is represented with the Unicode character set, which covers most scripts in use today. Additionally, Java2 provides a rich set of capabilities for displaying and interacting with visually complex text. The Java2 FCS release from Sun supported right-to-left scripts such as Arabic and Hebrew. This talk describes how IBM's release of Java2 supports Arabic and Hebrew, and adds support for Thai and the Devanagari writing system (used to write Hindi). This required no change to the public Java interfaces. Existing Java2 programs that take advantage of Java2 graphics and Swing (tm) lightweight components "just work."

Center for Java Technology, Cupertino, CA
M17N-2000 Conference Tsukuba, Japan March 25-27

Overview
- What is complex text?
- What does Java2 provide to support it?
- What has IBM added for Arabic, Hebrew, Hindi, and Thai?

First, we will briefly go over some aspects of these languages that make them more difficult to display and edit. Next, we will describe the new APIs in Java2 for working with text, and how they enable display and editing of these languages. Finally, we will outline how we implemented this support in IBM’s JVM ‘under the covers’ of the existing API.

In this presentation, we’re going to focus on the display and editing aspects of support for these languages. Our implementation also includes platform-independent support for number and date/time formatting, collation, code set conversion, and keyboard entry of these languages, but that will not be the focus.

What Is Complex Text?
- Unicode: not just a bigger character set
- Bidirectionality: mixed directions on a line
- Shaping: character shapes depend on context
- Ligatures: mandatory special forms, and no Unicode equivalent
- Positioning: vertical and horizontal adjustments
- Reordering: character positions depend on context
- Split characters: some characters appear in more than one position

Latin script, which is by far the most commonly used script among software developers, is also the simplest script (especially as it is used to write English). Other writing systems exhibit complications not found in Latin script. We call text with these characteristics ‘Complex Text.’

Arabic, Hebrew, Hindi, and Thai all exhibit many of these features:
- Bidirectional text runs both right to left and left to right (for numbers).
- Letters can change shape depending on context.
- Mandatory ligatures must be formed. A ligature is a combination of more than one character. Sometimes the original character shapes are not easily identifiable in the resulting ligature.
- Vertical and horizontal adjustments (a form of kerning) are necessary to properly align characters.
- Some characters can reorder forwards or backwards in the text based on context.
- Some characters can appear in more than one visual position at the same time, occupying multiple character cells, or surrounding other characters.

Bidirectional Text
- Visual order differs from storage order
- Arabic and Hebrew read right to left, but numbers still read left to right

[Figure: mixed English and Hebrew text, labeled in memory order and in reading order.]

Not all scripts are read from left to right; Arabic and Hebrew are read from right to left. Unicode text is stored in the order in which it is read, not the order in which it is displayed, so a conversion must be done before displaying the text. The Unicode bidi algorithm defines how this is to be done.

This example shows mixed English and Hebrew text. The text is stored in logical order, the order in which it is read. You can see that the Hebrew text is displayed from right to left, while the English reads left to right.

Mixed directions occur even when all the text is Hebrew or Arabic, since numbers in these languages are still written left to right. So this complication arises even when you don’t mix Hebrew or Arabic with other writing systems.

Since the Unicode bidi algorithm can require up to a paragraph of context, the minimal unit of analysis when laying out text is a paragraph. In a paragraph, each line of text is logically contiguous. If we were to wrap this example text over two lines, the Hebrew text would never display such that it started on the second line and finished on the first. So the final positions of characters also depend on where the line breaks are chosen.
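The conversion from storage order to display order can be sketched with java.text.Bidi. Note the assumptions: java.text.Bidi was added in a later Java release than the one discussed in this talk, the class and method names below (BidiOrderSketch, visualOrder) are hypothetical, and the sketch reorders whole chars only, ignoring mirroring, surrogate pairs, and line breaks.

```java
import java.text.Bidi;

final class BidiOrderSketch {
    /** Returns the chars of one paragraph rearranged into left-to-right display order. */
    static String visualOrder(String paragraph) {
        // Base direction comes from the first strong character, defaulting to left-to-right.
        Bidi bidi = new Bidi(paragraph, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        int n = paragraph.length();
        byte[] levels = new byte[n];
        Character[] chars = new Character[n];
        for (int i = 0; i < n; i++) {
            levels[i] = (byte) bidi.getLevelAt(i);
            chars[i] = paragraph.charAt(i);
        }
        // Rearranges the objects in place according to their resolved levels.
        Bidi.reorderVisually(levels, 0, chars, 0, n);
        StringBuilder out = new StringBuilder(n);
        for (Character c : chars) {
            out.append(c.charValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Logical (storage) order: English, three Hebrew letters, English.
        String stored = "abc \u05D0\u05D1\u05D2 def";
        Bidi bidi = new Bidi(stored, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        System.out.println("mixed directions: " + bidi.isMixed());
        for (int run = 0; run < bidi.getRunCount(); run++) {
            System.out.println("run " + bidi.getRunStart(run) + "-" + bidi.getRunLimit(run)
                    + " level " + bidi.getRunLevel(run));
        }
        System.out.println(visualOrder(stored));
    }
}
```

In the sample string the three Hebrew letters form one right-to-left run, so they come out reversed while the English on either side stays in place.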

Character Shaping
- Arabic character shapes change to connect adjacent characters

In some scripts characters take on different shapes depending on the characters around them. For example, Arabic text is ‘cursive’ in that letters change shape to connect with the characters on either side. This cursive shaping is required whenever Arabic text is displayed; it is not just fancy typography.

This graphic shows text containing the Arabic letter ‘noon’ three times in succession. When displaying this sequence, three different shapes are used for the same letter: one that connects to the left, one that connects to both the left and the right, and one that connects to the right.

Contextual shaping means that characters cannot be measured or rendered in isolation. For example, you cannot cache a character’s ‘width’ since the width can vary depending on context.

Ligatures
- Arabic and Devanagari represent some character sequences with ligatures

Other scripts display certain character sequences as a single shape, known as a ligature. In the top example, on the left you see the Arabic characters lam and alef. When rendered, this lam-alef sequence is displayed using the ligature on the right. In the bottom example, the three Devanagari characters on the left (KA, VIRAMA, SSA) are displayed using the single ligature on the right.

Many of these ligatures are required, not optional as they generally are in English. The ligatures used depend on the language, not just the script. For instance, Hindi and Sanskrit are both written in the Devanagari script, but use different ligatures.

Unlike the common typographic ligatures found in English, the resulting shape can be completely different from the original characters. Often there is no obvious place to put a caret, for example.

Character Positioning
- Thai (and other scripts) require characters to reposition

Thai can have multiple combining marks over a single base character. This requires repositioning some of the marks so they don't collide with other marks.

On the left is the Thai consonant KO KAI followed by the tone mark MAI THO. The tone mark positions directly above the consonant. On the right, the vowel mark SARA UEE has been added immediately after the consonant. It too positions above the consonant, and so the tone mark must be resized and moved up, so as not to collide with the vowel mark.

Hindi also positions characters, but positions horizontally so that the vowel marks lie over the vowel stem of the consonant or consonant cluster.

This technique is similar to kerning, which is used to control the amount of white space between characters. Here the consequences of not doing the adjustments are even more severe: characters collide or misalign.

Reordering
- Some Hindi characters reorder based on context

[Figure: the Hindi word ‘patidev’ shown in Logical Order and in Visual Order.]

In some writing systems, such as Indic, individual characters can reorder. This example shows the logical and visual order of the characters in the Hindi word ‘patidev.’ Notice how the Devanagari vowel sign I reorders to the left of the consonant it follows. (For illustration, only the visual reordering is shown; positioning and other adjustments have not been applied.) Also, the Devanagari letter sequence RA + VIRAMA can reorder to the right of a consonant cluster (not illustrated here).

This reordering is contextual and applies to individual characters, unlike, for example, the reordering that occurs in Arabic and Hebrew, which typically applies to a sequence. Since individual characters can move in opposite directions, Hindi text can sometimes get a bit ‘scrambled.’

Reordering differs from positioning in that it changes a character’s visual order: the character moves ‘before’ or ‘after’ neighboring characters.

Split Characters
- Thai and many Indic languages display a single character in multiple positions

[Figure: three panels labeled Logical Characters, Visual Glyphs, and Displayed Result.]

In Thai, the character SARA AM has two components: a vowel and a diacritic. In some cases another character can be written between these two components, ‘splitting’ the SARA AM character. This is illustrated in the example. The three original characters NO NU, MAI THO, SARA AM at the top left are converted to four glyphs, and reordered, as shown below left. After positioning, they display as shown on the right.

Many of the Indic scripts (although not Devanagari) have characters that display on both sides of a consonant cluster, such as the Tamil vowel sign OO.

Unicode provides different ways of encoding these situations. Conceptually this can be thought of as normalizing the Unicode text into additional characters, then contextually reordering them. It’s not a good idea to make the typist have to think too much about this!

‘Complex Text’ is Complex
- And drawing it is just part of the problem

As we’ve seen, rendering complex scripts poses a number of problems in terms of selecting and positioning the proper glyphs. Here we see a rather fanciful example of the complications that can occur. This sentence contains mixed directions, including nested bidi embeddings, and five different writing systems. We’ve left the translation of this sentence as an exercise for the reader!

Although rendering such text offers plenty to worry about, there is more: at some point most users want to edit text, not simply display it.

Interacting with Text
Relating text display to text storage
- Hit-testing: mapping from a graphical point to a text location
- Caret: showing positions between characters
- Arrow-key movement: traversing the text in visual order
- Selection: showing range(s) of characters
- Line Break: displaying a paragraph as lines within an area

Editing complex scripts requires maintaining the relationship between what is displayed and the original Unicode text:
- Each point on the text image should correspond to a unique position in the Unicode text. This should not be too different from what the user expects!
- Conversely, each position in the logical text should have a corresponding visual position on the screen. In some cases, it is useful to show special carets when the position is between characters that appear at different visual positions. Carets and hit-testing boundaries should agree.
- Arrow key movement must take account of the visual order of the characters, and of ligatures and other situations that don’t allow carets.
- A contiguous range of text can be discontiguous when displayed, and vice versa. When highlighting text, as when making a selection, many users will prefer that the highlighting include only the selected characters, and so the highlight region will also be discontiguous in order to match the text.
- Measuring text, as when positioning or breaking lines, must use metrics of the text as it appears in context.

Supporting Complex Text
When working with complex text:
- do not assume a uniform text direction
- do not measure or draw one character at a time
- do not rely on default character positions

Java2 provides support for complex text.

Common approaches to text display and user interaction that work for English text often won’t handle complexities like these. Usually these approaches have implicit limitations that arise from assuming a uniform left-to-right text direction, measuring character-by-character, and relying on default character positions.

As we have seen, supporting complex text can be daunting. It is difficult for many development teams to justify the effort. Additionally, architectural limitations such as these are often deeply ingrained in program code.

The Java 2 platform provides a number of enhancements for complex text. These enhancements serve a broad range of clients: from those who simply want to present static or editable complex text to users, to those who want to write software such as text editors or web browsers.

The next several slides describe in more detail how these new APIs in Java2 support user interaction with complex text.

Java2 and Complex Text
- Existing APIs are enhanced
  - Graphics.drawString()
  - Swing JTextComponent editor classes
- New Complex Text APIs support:
  - Styled text
  - User interaction
  - Paragraph breaking

The Java2 platform has been enhanced for complex text in a number of ways. For drawing a single line of text in a particular font, clients can continue to use Graphics.drawString. Graphics.drawString will perform the required layout for any script supported by the JDK release.

The Swing text components support editing and simple multiline display. These are convenient to use, and they also support complex text.

Graphics.drawString and the Swing edit components make it easy to present both static and editable text to the user, but don’t easily allow clients to handle text interaction directly. These clients can use the new Complex Text APIs in Java2.

Graphics.drawString()
- Advantages:
  - Existing clients get improvements
  - Easiest way to render complete pieces of unstyled text
- drawString lacks sufficient context for:
  - Paragraph segments
  - Segments of styled text

Graphics.drawString() has been available since the first release of Java. In Java2, it has been enhanced to correctly display complex text, so existing clients get increased functionality. For rendering text like labels and one-line messages, drawString() is very convenient.

But drawString() has a fundamental limitation: it does not have any context beyond the string it is rendering. This means that if you draw segments from a paragraph of text with drawString(), you may not get correct results. Characters outside of a segment can influence how the segment appears, particularly when the paragraph contains bidirectional text.

Similarly, drawing styled text using repeated calls to drawString() for each style run will fail when the text requires reordering. The Complex Text APIs handle these situations.

Swing JTextComponent
- Complete editor for fields, and unstyled and styled documents
- Enhanced to support complex text
- Some clients may need finer-grained control over presentation and interaction

Swing’s JTextComponent classes have also been enhanced to support Complex Text. Using JTextComponent is the most convenient way to present a styled text editor to clients.

However, JTextComponent limits clients to its architecture and user interface. Some clients may wish to use their own text frameworks rather than adopting Swing’s. These clients can still get access to the complex text support using the new Complex Text APIs.

New Java2 APIs
- New in Java2, the java.awt.font package
  - TextAttribute
  - TextLayout
  - LineBreakMeasurer
- General enough for variety of applications
- Allow for additional scripts and features

The Complex Text APIs are in the java.awt.font package. Their purpose is to provide a set of operations (drawing, styled text, hit-testing, caret and selection facilities, and line breaking) that clients can use to build advanced user interfaces such as text editors and web browsers.

These operations are at a level that is general enough to enable very sophisticated user interaction, and are free from assumptions that would prevent correct handling of complex text.

As implementations of Java are released with support for more scripts and languages, existing clients can take advantage of the new support without modification.

Styled Text
- Styled text is accessed with java.text.AttributedCharacterIterator
- Attributes include:
  - Font characteristics (bold, italic)
  - Color
  - Underline
  - Bidirectional behavior
  - Others - open to future extension
- All Complex Text APIs can use styled text

All of the Complex Text APIs use the new styled text model in Java 2. In Java 2 styled text is accessed through the AttributedCharacterIterator interface, which clients implement to provide access to their text. The interface is storage-neutral, just as CharacterIterator is storage-neutral for unstyled text.

Java2 currently supports a number of text attributes (or styles). There are styles for font characteristics such as weight and posture, text color, underlining, and customizing bidirectional behavior, to name a few. The set of styles is not closed; it can be extended in future versions of the JDK.

All of the Complex Text APIs can use styled text; additionally, some have convenience methods that use unstyled text.
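As a small illustration of the styled text model, the sketch below builds styled text with java.text.AttributedString (a ready-made implementation that hands out an AttributedCharacterIterator) and walks its style runs. The class name StyleRunSketch, the helper names, and the sample string are hypothetical; clients with their own text storage would implement the iterator interface themselves instead.

```java
import java.awt.font.TextAttribute;
import java.text.AttributedCharacterIterator;
import java.text.AttributedString;
import java.util.ArrayList;
import java.util.List;

final class StyleRunSketch {
    /** Builds "Hello world" with a bold first word and an underlined second word. */
    static AttributedString sample() {
        AttributedString text = new AttributedString("Hello world");
        text.addAttribute(TextAttribute.WEIGHT, TextAttribute.WEIGHT_BOLD, 0, 5);
        text.addAttribute(TextAttribute.UNDERLINE, TextAttribute.UNDERLINE_ON, 6, 11);
        return text;
    }

    /** Lists the style runs as "start-limit" strings, in logical order. */
    static List<String> styleRuns(AttributedString text) {
        AttributedCharacterIterator it = text.getIterator();
        List<String> runs = new ArrayList<>();
        int start = it.getBeginIndex();
        while (start < it.getEndIndex()) {
            it.setIndex(start);
            // getRunLimit() is the index just past the run that shares all attributes.
            int limit = it.getRunLimit();
            runs.add(start + "-" + limit);
            start = limit;
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(styleRuns(sample()));
    }
}
```

The runs are reported in logical order; as the later slides explain, a run that is contiguous here is not necessarily contiguous on screen.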

TextLayout
- TextLayout represents one line or segment of styled text
- TextLayout provides
  - Drawing
  - Hit-testing
  - Caret display
  - Caret movement
  - Selection

The central class in the Complex Text APIs is TextLayout. TextLayout represents a single line or tab-delimited segment of text, possibly with multiple character styles.

TextLayout supports a complete set of user-interface capabilities: drawing, hit-testing, caret handling, and selection.

TextLayout: Drawing
- TextLayout draws styled text, with correct ordering within style runs
- Draws segments of paragraphs if produced by LineBreakMeasurer

[Figure: styled bidirectional text, labeled in memory order and in reading order.]

TextLayout correctly renders styled, complex text. In bidirectional text, a style run that is contiguous in memory is not necessarily contiguous in display. In this example, the underlined text is a single style run, but the words are not visually adjacent. (This is a prime example of why using iterative calls to drawString() to display styled text will fail: this approach would render the underlined text together, which would place the Hebrew words in the wrong order.)

Drawing segments of a paragraph requires knowledge of the entire paragraph when the paragraph contains bidirectional text. TextLayout draws paragraph segments correctly when produced by a LineBreakMeasurer. LineBreakMeasurer is discussed later.

TextLayout: Hit-testing
- Hit-testing is mapping from a graphical point to a text location (index and offset)
- Used when responding to mouse clicks

[Figure: a hit on glyph 6 maps to Character 7, Offset 8.]

Hit-testing is mapping from a display point to a text location. It is typically used when responding to mouse clicks.

TextLayout’s hit-testing reports two values: the index of the character at the given point, and whether the point is on the leading or trailing side of the character. The leading side of a character is encountered before the trailing side when reading the character. So, the leading side of a left-to-right character appears to the left of the trailing side, and vice versa for a right-to-left character.

A character index and side determine a position between two characters, called an offset. Offset n is between character n-1 and character n. The trailing side of character n-1 and the leading side of character n are both associated with offset n.

The example shows hit-testing from a point on the left side of glyph 6 to the character at index 7. Since this character is read from right to left, the trailing side of the character was hit. The resulting offset is offset 8, between characters 7 and 8.
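The relationship between a hit (character index plus side) and an offset can be tried out with java.awt.font.TextHitInfo, the value type that TextLayout.hitTestChar(x, y) returns. This sketch constructs the hits directly rather than from a mouse point, so it needs no font or layout; the class name HitOffsetSketch and the helper offsetFor are hypothetical.

```java
import java.awt.font.TextHitInfo;

final class HitOffsetSketch {
    /** Maps a character index and side to the offset (insertion index) it is associated with. */
    static int offsetFor(int charIndex, boolean leadingSide) {
        TextHitInfo hit = leadingSide
                ? TextHitInfo.leading(charIndex)
                : TextHitInfo.trailing(charIndex);
        return hit.getInsertionIndex();
    }

    public static void main(String[] args) {
        // The slide's example: the trailing side of character 7 was hit.
        TextHitInfo hit = TextHitInfo.trailing(7);
        System.out.println("character " + hit.getCharIndex()
                + ", leading side: " + hit.isLeadingEdge()
                + ", offset " + hit.getInsertionIndex());
        // The leading side of character 8 is associated with the same offset.
        System.out.println("offset " + offsetFor(8, true));
    }
}
```

In an editor, the hit would typically come from something like layout.hitTestChar(mouseX - originX, mouseY - originY), with the coordinates made relative to the layout's origin.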

TextLayout: Caret Display
- Caret shows position between characters
- Can have “dual” carets in bidi text
- Single-caret only is also supported

[Figure: two carets in mixed Arabic and English text, both labeled Offset 16.]

A caret is a graphical representation of a text offset (a position between two characters). Visually, a caret approximates where text inserted at this offset will appear.

Some user interfaces will show two carets in bidirectional text at some offsets. In the example you can see one caret to the left of the Arabic text, and another caret to the left of the English text. These two carets represent the single insertion point at offset 16, between the two directional runs. Newly inserted Arabic characters will appear at the red caret, and English characters will appear at the black caret. TextLayout will generate carets; clients can just draw them without worrying about this issue.

TextLayout also supports single-caret interfaces. Single-caret interfaces can be harder to implement than dual-caret interfaces, because in a single-caret interface, the client must keep track of both the caret’s offset and the side of the offset that the caret is associated with.

TextLayout: Caret Movement
- Arrow-key response should be “visual”
- No predictable relationship to offset
- TextLayout calculates correct offset

[Figure: carets labeled Offset 25 and Offset 15 on either side of the red caret.]

In most user interfaces, the caret moves in response to pressing an arrow key. Arrow key response should be visual; that is, the left arrow should move the caret left, and the right arrow should move it right.

Often, arrow key response is implemented by simply incrementing or decrementing an offset. However, this approach does not always produce the correct results. In general, there is not a predictable relationship between visual caret direction and offset values. TextLayout will select the offset that is appropriate to the visual caret direction.

Sometimes moving by one unit changes the offset quite a bit. In the example, the red caret represents the insertion point at offset 16. Moving right from the red caret puts the insertion point at offset 15, while moving left puts it at offset 25.

TextLayout: Logical Selection
- Selection shows range(s) of characters
- Logical selection:
  - Single range in memory
  - Possibly discontinuous highlighting regions

[Figure: a logical selection with offsets 8, 16, 20, and 26 marked; the highlight regions are labeled Characters 16-19 and Characters 8-15.]

A selection region is a shape that indicates one or more ranges of characters. Typically, the selected range(s) will be the target of the next user operation.

In bidirectional text it is possible to have a single range of characters with multiple, discontinuous highlight regions. Always selecting single ranges, even when it results in discontinuous highlight regions, is called logical selection.

The graphic shows a TextLayout with a logical selection. Clients don’t have to do anything special to deal with this odd-looking selection; they don’t even have to be aware of it.
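To see why a single logical range can highlight as several regions, the sketch below maps a logical character range to the visual character slots it occupies. It is only an approximation under stated assumptions: it uses java.text.Bidi (from a later Java release than the one discussed here) and counts character slots rather than pixels, and the names SelectionRangesSketch and visualRanges are hypothetical. In a real client, TextLayout.getLogicalHighlightShape produces the actual highlight shape.

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

final class SelectionRangesSketch {
    /** Returns the visual [start, limit) slot ranges covered by logical chars [selStart, selLimit). */
    static List<int[]> visualRanges(String paragraph, int selStart, int selLimit) {
        Bidi bidi = new Bidi(paragraph, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        int n = paragraph.length();
        byte[] levels = new byte[n];
        Integer[] logicalAt = new Integer[n];
        for (int i = 0; i < n; i++) {
            levels[i] = (byte) bidi.getLevelAt(i);
            logicalAt[i] = i;
        }
        // After this call, logicalAt[v] is the logical index shown in visual slot v.
        Bidi.reorderVisually(levels, 0, logicalAt, 0, n);

        List<int[]> ranges = new ArrayList<>();
        int open = -1; // start of the highlight region being built, or -1
        for (int v = 0; v <= n; v++) {
            boolean selected = v < n && logicalAt[v] >= selStart && logicalAt[v] < selLimit;
            if (selected && open < 0) {
                open = v;
            } else if (!selected && open >= 0) {
                ranges.add(new int[] {open, v});
                open = -1;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Select logical characters 2..5: "c", a space, and the first two Hebrew letters.
        for (int[] r : visualRanges("abc \u05D0\u05D1\u05D2 def", 2, 6)) {
            System.out.println("highlight slots " + r[0] + "-" + r[1]);
        }
    }
}
```

The sample selection spans the boundary between a left-to-right run and a right-to-left run, so it comes back as two separate visual ranges even though it is one range in memory.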

TextLayout: Visual Selection
- Visual Selection:
  - One continuous highlight region
  - Possibly more than one range of characters
- Clients must be prepared to operate on multiple character ranges

[Figure: a visual selection with offsets 8, 16, 20, and 26 marked; the highlight region is labeled Characters 8-15, and 20-25.]

The converse of logical selection is visual selection. With visual selection, the selection region remains contiguous, even if it covers multiple ranges in the text.

The example shows a TextLayout with a visual selection covering characters 8-15 and 20-25.

Visual selection is typically considered more difficult for clients to support. In particular, clients must define operations such as copy and paste for multiple ranges of text.

LineBreakMeasurer
- LineBreakMeasurer formats paragraphs into lines of text
- Lines are TextLayout instances

For formatting a paragraph of text into lines, there is the LineBreakMeasurer class. LineBreakMeasurer produces lines which fit into specified widths. The lines are TextLayout instances, so you can draw them, hit-test them, and perform any other operation TextLayout supports. Additionally, they will retain enough paragraph information to render correctly.
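A minimal sketch of breaking one paragraph into lines with LineBreakMeasurer follows. The class name ParagraphWrapSketch, the 12-point logical Serif font, the wrapping width, and the off-screen FontRenderContext are all assumptions made for illustration; the sketch also assumes a non-empty paragraph and a Java runtime with at least one usable font installed.

```java
import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.LineBreakMeasurer;
import java.awt.font.TextAttribute;
import java.awt.font.TextLayout;
import java.text.AttributedString;
import java.util.ArrayList;
import java.util.List;

final class ParagraphWrapSketch {
    /** Breaks a non-empty paragraph into TextLayout lines no wider than the given width. */
    static List<TextLayout> wrap(String paragraph, float width) {
        AttributedString styled = new AttributedString(paragraph);
        styled.addAttribute(TextAttribute.FONT, new Font(Font.SERIF, Font.PLAIN, 12));
        // No transform, anti-aliased, fractional metrics; a real client would
        // normally ask its Graphics2D for the FontRenderContext instead.
        FontRenderContext frc = new FontRenderContext(null, true, true);
        LineBreakMeasurer measurer = new LineBreakMeasurer(styled.getIterator(), frc);

        List<TextLayout> lines = new ArrayList<>();
        while (measurer.getPosition() < paragraph.length()) {
            lines.add(measurer.nextLayout(width));
        }
        return lines;
    }

    public static void main(String[] args) {
        String paragraph = "Text is stored in logical order and must be broken "
                + "into lines before it is displayed.";
        float y = 0f;
        for (TextLayout line : wrap(paragraph, 100f)) {
            y += line.getAscent(); // baseline of this line
            System.out.println(line.getCharacterCount() + " chars, baseline y=" + y);
            // With a Graphics2D g2d at hand, the line would be drawn here:
            // line.draw(g2d, x, y);
            y += line.getDescent() + line.getLeading();
        }
    }
}
```

Because the measurer sees the whole paragraph, each returned line keeps the bidirectional context it needs, which is what a line-by-line drawString() loop lacks.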

Complex Text Support
- Easy things are easy
  - Static text: Graphics.drawString()
  - Standard editors: Swing JTextComponent classes
- Advanced text interaction is enabled
  - Complex Text APIs do standard text interaction tasks
  - They are general enough for wide variety of clients
- All APIs allow adding support for new features and scripts

So, Java2 has a wealth of APIs for supporting complex text. Some, such as Graphics.drawString() and the Swing editor classes, are geared toward ease of use, for clients performing certain well-defined tasks.

Clients who want to implement custom text interaction can use the Complex Text APIs. These APIs perform standard, general-purpose operations.

Most importantly, these APIs don’t preclude future enhancement.

IBM’s Implementation of Java2
- Improved Swing support of complex text
- Hindi and Thai localization
- Complex text architecture

The next several slides describe IBM’s upcoming release of Java2. There are three main areas in which we’ve worked to enhance IBM’s version of Java2.

We’ve done our best to fully implement the new APIs. Although the APIs are there, some of the implementations in the initial reference release of Java2 from Sun were incomplete. We’ve also worked with Sun to improve this situation in future releases.

We’ve gathered localization information and enhanced parts of the internationalization framework to support Hindi and Thai.

And we’ve developed an internal architecture to more flexibly support complex text display and editing, which will make it easier to add new languages in the future.

Improved Swing Support
- All Swing text components now properly handle complex text
- ComponentOrientation is implemented
- Text editing components support Hindi and Thai

The initial Java2 release from Sun made great progress towards supporting complex text, but the implementation still had limitations. In IBM’s Java2, all Swing components can now work in a right-to-left orientation.

Swing text components now support editing of complex text. We’ve modified some of the Swing text internal classes to use the new architecture that we’ve developed to support complex text. These components now not only support bidirectional text, such as Arabic and Hebrew, but also text requiring reordering and positioning, such as Hindi and Thai.

Hindi and Thai Localization
- Hindi
  - Input Method
  - Conversion between Unicode and ISCII
  - Character and word break rules
  - New font
- Thai
  - Dictionary-based word break
  - Collation

Support for Hindi and Thai required additional work, such as input methods, code converters, and locale data. Some of this work came from other IBM sites.

For Hindi, IBM India supplied an input method, and a codeset converter for ISCII, the single-byte Indian encoding standard. Hindi required new character and word break rules to deal with consonant clusters and the DANDA character, which indicates a full stop (similar to a period in English). We also added support for Hindi to the fonts that IBM makes available with its version of Java2.

For Thai, since words are not delimited by spaces or other characters, we added a dictionary-based word-break implementation. Rich Gillam developed this technology and has presented it at past Unicode conferences. Also, we updated the Collation framework to correctly compare Thai strings. Thai comparison requires swapping leading consonants and vowels.

Complex Text Architecture
- Apply Unicode Bidi algorithm
- Identify script and language
- Assign physical fonts
- Use layout engines to select and position glyphs
- Use rasterizer to generate metrics, outlines, and rasters
- Convert glyph info back to character info

Finally, we’ve internally factored support for complex text into a number of discrete functions, in order to make future enhancements easier. Here we list the operations in their rough order of application.

The Unicode bidi algorithm is used to identify directional runs.

Runs of text with a common script (and language) are identified. This information is used to control the selection of fonts, layout engines, and glyphs.

Physical fonts are assigned to runs of text where the user has not explicitly selected a specific font. This can happen, for example, when clients request one of the logical typefaces such as ‘serif.’

The layout engine code determines what layout processes the font supports and applies one to generate glyphs and positions. It uses the rasterizer to acquire the glyph metric information it needs to position the glyphs.

Character metric code uses backmapping and other information from the layout engine to synthesize character metrics from the glyph metrics. These metrics represent the results of layout in terms of the original character data, so that code for selection, hit-testing, caret movement, and so on can remain independent of the details about how glyphs were used to represent the text.

Bidi Analysis
- Implements Unicode Bidi algorithm
- Applies to entire paragraph
- Can control using styles or embedding codes
- Subdivides font runs into directional runs

The bidirectional analysis implements the standard Unicode Bidi algorithm. This requires, in the general case, a complete paragraph of context. In Java2 this means that TextLayout instances created directly from text are considered complete paragraphs. LineBreakMeasurer can be used to generate TextLayouts that represent multiple lines or segments from the same paragraph.

The Java2 APIs include TextAttributes that can be used to control the Bidi algorithm by defining the overall paragraph direction, defining a run of characters as being in a left-to-right or right-to-left context, or overriding the inherent directionality of characters. When these attributes are not present, the Bidi implementation uses the defined Unicode directional formatting characters.

The algorithm outputs ‘levels’ that represent nested directional runs, and determine the visual ordering of the text. This further segments the text into runs of characters at the same bidirectional level. The run direction is input to the layout engine, which uses it to visually order the characters in the run. It is also used to select mirrored versions of glyphs for mirrored characters.
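The effect of a paragraph-direction attribute on the resolved levels can be observed with the sketch below. It applies TextAttribute.RUN_DIRECTION to an AttributedString and analyzes it with java.text.Bidi, which is a public class from a later Java release and not the internal IBM implementation described here; the class name RunDirectionSketch and the helper analyze are hypothetical.

```java
import java.awt.font.TextAttribute;
import java.text.AttributedString;
import java.text.Bidi;

final class RunDirectionSketch {
    /**
     * Analyzes a non-empty paragraph. A null runDirection leaves the base
     * direction to be derived from the text itself.
     */
    static Bidi analyze(String paragraph, Boolean runDirection) {
        AttributedString styled = new AttributedString(paragraph);
        if (runDirection != null) {
            // Applies to the whole paragraph and sets its base direction.
            styled.addAttribute(TextAttribute.RUN_DIRECTION, runDirection);
        }
        return new Bidi(styled.getIterator());
    }

    public static void main(String[] args) {
        Bidi plain = analyze("abc", null);
        Bidi forced = analyze("abc", TextAttribute.RUN_DIRECTION_RTL);
        System.out.println("default base level: " + plain.getBaseLevel()
                + ", level of 'a': " + plain.getLevelAt(0));
        System.out.println("forced RTL base level: " + forced.getBaseLevel()
                + ", level of 'a': " + forced.getLevelAt(0));
    }
}
```

Latin letters in a right-to-left paragraph end up one level above the paragraph level, which is how the levels encode a left-to-right run nested inside right-to-left text.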

Script and Language
- Use script, language to drive layout
  - identify special requirements
  - use the appropriate font
  - perform the proper layout
  - select the right glyphs

[Figure: a run of text divided into Devanagari, Hebrew, and Thai script runs.]

One of the first steps is to identify the script and language. This can help to identify special requirements for text processing early on.

Different fonts can support the same Unicode range with different glyph shapes that are appropriate to a particular language or region, for example, Chinese or Japanese. The script and language can affect decisions about which font to use.

Different scripts can require different layout engines. For example, there is as yet (as far as we know) no OpenType layout specification for Thai. If we encounter Thai we apply other layout engines based on known glyph encodings for Thai.

Some layout engines use the language and script to select appropriate ligatures or other special forms.

The example shows how a run of text is divided into script runs in the absence of higher-level specification. Some characters are ‘script neutral’ and are folded into the adjacent runs.

Language analysis can also be performed. Ideally this information would be supplied externally, but there is no current API for tagging text with this information.
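Dividing text into script runs, with script-neutral characters folded into an adjacent run, can be sketched as follows. This is not the IBM implementation: it relies on Character.UnicodeScript, which was added to the platform in a later Java release, it treats the COMMON and INHERITED scripts as neutral, and it folds neutrals into the preceding run (or the first run, for leading neutrals). The names ScriptRunSketch and scriptRuns are hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

final class ScriptRunSketch {
    /** Returns script runs as "SCRIPT:start-limit" strings, in logical order. */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        UnicodeScript current = null; // script of the run being built; null while only neutrals were seen
        int runStart = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            boolean neutral = script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    // A different real script starts here: close the previous run.
                    runs.add(current + ":" + runStart + "-" + i);
                    runStart = i;
                }
                current = script;
            }
            i += Character.charCount(cp);
        }
        if (!text.isEmpty()) {
            UnicodeScript last = (current == null) ? UnicodeScript.COMMON : current;
            runs.add(last + ":" + runStart + "-" + text.length());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Devanagari KA + vowel sign AA, Hebrew ALEF BET, Thai KO KAI + SARA AA.
        System.out.println(scriptRuns("\u0915\u093E \u05D0\u05D1 \u0E01\u0E32"));
    }
}
```

Each space in the sample is script neutral, so it is absorbed into the run before it rather than forming a run of its own.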

Font Assignment

Logical fonts versus physical fonts
Users can define physical font
Incorporates some style information
Font/style instance for each run of characters

Figure labels: Logical: “default”; Physical: Mangal; Physical: WorldType

The original Java fonts are ‘logical fonts’ like “serif”. These are mapped to some physical font family like TimesNewRoman. Java implementations often map the single logical font to multiple physical fonts in order to support different ranges of Unicode. It is the physical fonts that actually define the metrics and appearance of glyphs.

In Java2, clients can identify a physical font by name without using a logical font description. They can instantiate a Font object and use it directly, or style runs of text using a TextAttribute together with a Font object. The text pipeline will then use this as an absolute directive to use the specified physical font.

Users can alternatively specify only the attributes they are interested in, like ‘Bold’ or ‘12pt.’ This is turned into a logical font specification, which is used to select the best matching physical font for the given text. Priority is given to fonts that can actually display the text (rather than the ‘missing character’) with the goal of always displaying something meaningful to the end user.

The physical font plus additional style information determines a font instance for each run of text. The style information selects among related fonts in a family, such as bold or italic faces, and may also specify options like underline.
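The attribute-only style of specification can be sketched with AttributedString and TextAttribute. The example below sets a logical family and a size on the whole text and bold weight on one range, then lists the resulting style runs. StyleRunSketch and its methods are illustrative names, and the sketch stops before font selection, which is left to the text pipeline.

```java
import java.awt.font.TextAttribute;
import java.text.AttributedCharacterIterator;
import java.text.AttributedString;

final class StyleRunSketch {
    /** Applies a logical family and size everywhere, and bold on [boldStart, boldLimit). */
    static AttributedString style(String text, int boldStart, int boldLimit) {
        AttributedString styled = new AttributedString(text);
        styled.addAttribute(TextAttribute.FAMILY, "Serif"); // logical font name
        styled.addAttribute(TextAttribute.SIZE, Float.valueOf(12f));
        styled.addAttribute(TextAttribute.WEIGHT, TextAttribute.WEIGHT_BOLD,
                boldStart, boldLimit);
        return styled;
    }

    /** Lists runs of uniform style as "start-limit:bold" or "start-limit:regular". */
    static String describeRuns(AttributedString styled) {
        AttributedCharacterIterator it = styled.getIterator();
        StringBuilder sb = new StringBuilder();
        it.first();
        while (it.getIndex() < it.getEndIndex()) {
            int start = it.getRunStart();
            int limit = it.getRunLimit();
            boolean bold = TextAttribute.WEIGHT_BOLD.equals(
                    it.getAttribute(TextAttribute.WEIGHT));
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(start).append('-').append(limit)
              .append(bold ? ":bold" : ":regular");
            it.setIndex(limit); // jump to the start of the next style run
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeRuns(style("plain bold", 6, 10)));
    }
}
```

To pin a run to one physical font instead, a client would put a Font object on the run under TextAttribute.FONT, which corresponds to the absolute directive described above.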

Layout

Distinct from rasterization
Input is characters, output is glyphs and positions
Gets metrics from rasterizer when needed
Keeps track of relationships between characters and glyphs

Our architecture distinguishes a ‘layout engine’ from a ‘rasterizer.’ The layout engine is responsible for converting the logical character data into glyphs and positions. The rasterizer is responsible for generating metric data, outlines, and rasters from a font instance, a glyph id, and a device context. The layout engine is a client of the rasterizer.

The distinction is important because there can be many ways to lay out a line, and this is (in the main) independent of how the glyphs are rasterized. We use both platform and platform-independent rasterizers, and multiple layout engines depending on the script and font support.

The layout engines identify a character for each glyph. Every character has at least one glyph, and so each character has a nominal position and advance. The layout engines also make available the ‘visual order’ of the glyphs, which allows clients to distinguish true reordering from positioning.
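The division of labor between the two components can be sketched as a pair of interfaces. Every type here (Rasterizer, GlyphRun, LayoutEngine, OneToOneEngine) is hypothetical and invented for illustration; the talk does not show IBM's internal interfaces. The trivial engine maps one character to one glyph, which a real engine for complex text would not do.

```java
final class LayoutSketch {
    /** Hypothetical rasterizer view: reports metrics for a glyph id and never sees characters. */
    interface Rasterizer {
        float advance(int glyphId);
    }

    /** Hypothetical layout result: parallel arrays with one entry per glyph. */
    static final class GlyphRun {
        final int[] glyphIds;
        final float[] xPositions;
        final int[] charIndices; // index of the character each glyph came from

        GlyphRun(int[] glyphIds, float[] xPositions, int[] charIndices) {
            this.glyphIds = glyphIds;
            this.xPositions = xPositions;
            this.charIndices = charIndices;
        }
    }

    /** Hypothetical layout engine contract: characters in, glyphs and positions out. */
    interface LayoutEngine {
        GlyphRun layout(char[] text, Rasterizer rasterizer);
    }

    /** Trivial engine: one glyph per character, placed left to right with no shaping. */
    static final class OneToOneEngine implements LayoutEngine {
        @Override
        public GlyphRun layout(char[] text, Rasterizer rasterizer) {
            int n = text.length;
            int[] glyphs = new int[n];
            float[] xs = new float[n];
            int[] chars = new int[n];
            float pen = 0f;
            for (int i = 0; i < n; i++) {
                glyphs[i] = text[i]; // placeholder mapping, not a real font lookup
                xs[i] = pen;
                chars[i] = i;
                pen += rasterizer.advance(glyphs[i]); // the engine is a client of the rasterizer
            }
            return new GlyphRun(glyphs, xs, chars);
        }
    }

    public static void main(String[] args) {
        Rasterizer fixedWidth = glyphId -> 5f; // stand-in metrics source
        GlyphRun run = new OneToOneEngine().layout("abc".toCharArray(), fixedWidth);
        System.out.println(java.util.Arrays.toString(run.xPositions));
    }
}
```

Keeping the character index alongside each glyph is what later allows glyph data to be converted back into character metrics.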

Layout Engines

Multiple engines
OpenType layout for Arabic and Hindi
Unicode presentation forms for Arabic
Windows character set for Thai
GX layout for Hindi
Other custom glyphset support
Automatic fallback to select engine based on font and script

Currently, we perform these types of layout: Hindi and Arabic layout using OpenType, Arabic layout using Unicode Presentation Forms, Hindi using GX, and Thai using both the Windows Thai glyph set and internal IBM glyph sets. We’ve also investigated Hindi layout based on custom Devanagari glyph sets used in fonts in India.

We use a fallback strategy to select a layout engine to use given a particular font and situation. If the font is a TrueType font, we look for OpenType or GX tables in the font, and check if they are appropriate for the script and language we are rendering. Otherwise, if we are rendering Arabic, we use Unicode presentation forms to do shaping and ligatures. Otherwise we probe the fonts for known glyph sets.

In the future we plan to register a layout engine to use with a particular font and script/language, opening the possibility for more client control over how layout will be done.
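The fallback strategy reads naturally as a chain of checks. The sketch below is one reading of it under stated assumptions: FontInfo, Script, Engine, and select are hypothetical names, the font probing results are passed in as plain booleans, and trying OpenType before GX is a guess, since the talk does not give an order between the two.

```java
final class EngineSelectionSketch {
    enum Script { ARABIC, DEVANAGARI, THAI, OTHER }

    enum Engine { OPENTYPE, GX, ARABIC_PRESENTATION_FORMS, KNOWN_GLYPH_SET, NONE }

    /** Hypothetical summary of what probing a font found for the script being rendered. */
    static final class FontInfo {
        final boolean trueType;
        final boolean openTypeTablesForScript;
        final boolean gxTablesForScript;
        final boolean knownGlyphSet;

        FontInfo(boolean trueType, boolean openTypeTablesForScript,
                 boolean gxTablesForScript, boolean knownGlyphSet) {
            this.trueType = trueType;
            this.openTypeTablesForScript = openTypeTablesForScript;
            this.gxTablesForScript = gxTablesForScript;
            this.knownGlyphSet = knownGlyphSet;
        }
    }

    /** Picks a layout engine by falling through the checks in order. */
    static Engine select(FontInfo font, Script script) {
        if (font.trueType) {
            if (font.openTypeTablesForScript) {
                return Engine.OPENTYPE; // assumed to take priority over GX
            }
            if (font.gxTablesForScript) {
                return Engine.GX;
            }
        }
        if (script == Script.ARABIC) {
            return Engine.ARABIC_PRESENTATION_FORMS;
        }
        if (font.knownGlyphSet) {
            return Engine.KNOWN_GLYPH_SET;
        }
        return Engine.NONE; // no special layout available for this font
    }

    public static void main(String[] args) {
        FontInfo arabicWithoutTables = new FontInfo(true, false, false, false);
        System.out.println(select(arabicWithoutTables, Script.ARABIC));
    }
}
```

A TrueType font without suitable tables falls through to the later checks here, so an Arabic font with no OpenType support still gets presentation-form layout.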

Rasterizer

Deals with individual glyphs
    font specific glyph ids, not characters and code pages
    generates metrics, outlines and rasters
    caching separate from rasterizer
Multiple rasterizers
Get control point information from JDK rasterizer
Used in some TT glyph positioning

The rasterizer provides information about individual glyphs. Since the layout engine selects the glyphs, the rasterizer itself does not deal with character data, and is not involved with code set conversion. The rasterizer also does not position glyphs on a device, as this is also the layout engine’s responsibility.

The rasterizer is, however, responsible for generating metrics and outlines. The rasterizer is also responsible for generating rasters, either by scan-conversion of the outlines, or by using bitmap data associated with the font. Input to the rasterizer is a glyph id, a font instance, and device information such as the transform, resolution, and bit depth.

Our implementation can use multiple rasterizers, either the platform rasterizer, or our own platform-independent one. Since we need to access the raster using glyph ids, we are sometimes limited by legacy platform APIs that use characters and expect the rasterizer to do the character-to-glyph conversion. This is a problem when layout has generated glyph ids that have no corresponding Unicode character value. In these situations we must use our own rasterizer.

Since OpenType can use control points to position glyphs, we have enhanced the JDK TrueType rasterizer to return control point information.

Character Metrics

Convert glyph data back to character data
Synthesize character positions and metrics
Model how the user will interact with the text

Once layout has generated glyphs and positions for a run of text, the text is ready to render. But when editing, the raw glyph data is complex to work with. So for editing we convert the glyph metric data into character metric data for the original Unicode text. This conversion process shields higher-level objects from the character-to-glyph transformation performed during layout.

The character model we implement is simple. We chose it because it also presents a simple model to the user. In our model, we synthesize character data that makes split characters, local reorderings, and ligatures ‘disappear.’ The characters are made to have a consistent visual order.

For instance, in the example, the Hindi text at the top displays as shown at the bottom. The leading character in each group at the top is assigned the advance between the red marks; the following character(s) in each group are assigned a zero advance. The hit test, caret, and selection behavior is based on these metric assignments, leading to the caret positions shown.

More complex character models could be implemented without changing existing APIs. We may provide other character models in the future.
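The advance assignment in this character model is simple enough to sketch directly. The code below assumes the groups have already been identified and the characters already put in a consistent left-to-right visual order; CharacterMetricsSketch, characterAdvances, and caretOffsets are illustrative names, and the numbers in main are made up.

```java
import java.util.Arrays;

final class CharacterMetricsSketch {
    /**
     * Gives each group's whole advance to its leading character and zero to
     * the characters that follow it in the group.
     *
     * @param charCount     number of characters in the run
     * @param groupStarts   index of the leading character of each group
     * @param groupAdvances visual width of each group, in the same order
     */
    static float[] characterAdvances(int charCount, int[] groupStarts, float[] groupAdvances) {
        float[] advances = new float[charCount]; // zero-filled by default
        for (int g = 0; g < groupStarts.length; g++) {
            advances[groupStarts[g]] = groupAdvances[g];
        }
        return advances;
    }

    /** Caret x offsets: caret i sits before character i, and the last caret ends the run. */
    static float[] caretOffsets(float[] advances) {
        float[] carets = new float[advances.length + 1];
        for (int i = 0; i < advances.length; i++) {
            carets[i + 1] = carets[i] + advances[i];
        }
        return carets;
    }

    public static void main(String[] args) {
        // Five characters in two groups: characters 0-1 and characters 2-4.
        float[] advances = characterAdvances(5, new int[] {0, 2}, new float[] {10f, 7f});
        System.out.println(Arrays.toString(advances));
        System.out.println(Arrays.toString(caretOffsets(advances)));
    }
}
```

Because the trailing characters of a group have zero advance, several logical caret positions coincide visually, so a caret can only appear at group boundaries.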

Example: Arabic

Figure labels: Characters; Reordered Characters; Shaped Glyphs; Positioned Glyphs; Character Metrics

This illustrates how rendering might work for Arabic. Bidirectional analysis determines a right-to-left run direction for the entire range of text shown. The second row in the sample illustrates the characters with right-to-left ordering. Arabic is detected by script analysis, and a physical font is chosen for the text. The layout system determines that the Arabic font has no OpenType support for Arabic, so Unicode-based layout is performed. Characters are assigned one of the Arabic Presentation Forms using the shaping rules. Ligature substitution is also performed in the same manner. Metrics are obtained from the platform rasterizer.

Example: Hindi

Figure labels: Characters; Reordered Glyphs; Positioned Glyphs; Character Metrics

This illustrates how rendering might work for Hindi. Bidi analysis reveals no bidi styles, formatting codes, or character types, so the characters are assigned a left-to-right run. Devanagari is detected by script analysis, and Hindi is assigned. A physical font is selected if not already specified. The layout system determines that the font has OpenType support for Hindi, so it performs OpenType layout. Tables in the font determine the initial glyphs by mapping directly from the characters (the top line), reorder the glyphs (the second line), and position the glyphs (the third line). Metrics are obtained from the rasterizer.

The glyph reordering and metric data from the layout engine are used to assign positions to the characters in the original text. This information is passed back to TextLayout. TextLayout draws carets as shown.

Future Work

API enhancements where needed
Pluggable layout engines and rasterizers
Logical versus physical fonts
More languages!
More layout control (vertical)

Here are some thoughts about future plans.

We’d like to work with Sun to enhance the APIs so that more control over complex text is available to developers. Primarily we’d like to see more language support, and a distinction between logical and physical fonts. We can also imagine new APIs that provide more direct control over the lower-level stages of processing complex text.

We’d like to formalize the layout engine and rasterizer APIs so that new ones can be added to the system independent of our releases. This will allow language and font support to be extended more easily.

We’d like to support more languages, in particular more Indic languages. We’d like to provide more layout options, such as vertical layout and other CJK-related layout.

If you have other suggestions, let us know!

Conclusion

Arabic, Hebrew, Hindi, and Thai are complex
    not something you do in an afternoon
Java2 APIs are sufficient to handle them
IBM’s Java2 supports Complex Text
    makes support for more languages a reality
    architected internally for future enhancement

Complex text is, well, complex. This is not something you do in an afternoon. (If you can do this in an afternoon, please come talk to us!) APIs and implementations for complex text will continue to evolve for quite a while longer.

The Java2 APIs are, in the main, sufficient to handle complex text. Existing functionality for static and editable text is easy to use. New functionality allows developers to work more directly with complex text, while still providing a fairly high level of abstraction.

IBM’s Java2 provides enhancements over what initial releases of Java2 supported. We have added support for two new languages, Hindi and Thai. We have also taken a fairly ambitious tack in architecting our support, so that we can continue to enhance it in the future.

Resources

Java Internationalization: java.sun.com/products/jdk/1.2/docs/guide/internat
Java2D (includes Complex Text): java.sun.com/products/jdk/1.2/docs/guide/2d
Article “International Text in JDK 1.2”:
IBM RichEdit Control (bidi-enabled)
This paper online

Here are some places to look for more information on these topics:

Sun’s internationalization site focuses on classes for adapting to language-specific and country-specific conventions, and discusses internationalization issues not mentioned in this talk.

TextLayout, LineBreakMeasurer, and other Complex Text APIs are part of Java2D. Sun’s site is a good place to learn more about Java2D.

On IBM's site there is an article on the text APIs that we developed in cooperation with Sun for Java2.

Finally, the IBM alphaWorks website has a pure-Java bidi-enabled editor available for download.
