Unicode Support for Mathematics

Murray Sargent III, Microsoft
17th International Unicode Conference

In a number of previous Unicode conferences, I’ve presented ways in which Unicode can be used to encode mathematics, both in plain and in marked-up text. The present talk borrows from those presentations and focuses on the full support Unicode now promises to offer mathematical disciplines.

Overview
Unicode math characters
Semantics of math characters
Unicode and markup
Multiple ways of encoding math characters
Not yet standardized math characters
Inputting math symbols

This talk describes the Unicode math repertoire and ways to enter and use it in TeX, MathML, and plain text.

Unicode Math Characters
340 math characters already exist in Unicode 3.0: ASCII, U+2200 – U+22FF, arrows, and combining marks
996 math alphanumeric characters are proposed for plane 1, as requested by the STIX project
951 new math symbols and operators are proposed for the BMP
One math variant code
One new combining character (reverse solidus)

The next version of Unicode 3.x is likely to have a complete set of standard math characters to support math publications on and off the web. MathML is a major beneficiary of this support. Specifically, the plan is to include 951 new symbols and operators and 996 new alphanumeric symbols in addition to the 340 symbols currently encoded, for a total of 2287 math symbols. This repertoire is the result of input from many sources, notably the STIX project, and enables one to display virtually all standard mathematical symbols. In addition, this math support lends itself to a remarkably successful plain-text encoding that’s much more compact than MathML or TeX.

Math Alphanumeric Characters
Math needs various Latin and Greek alphabets like normal, bold, italic, script, Fraktur, and open-face
They may appear to be font variations, but they have distinct semantics
Without these distinctions, you get gibberish, violating the Unicode rule: plain text must contain enough info to permit the text to be rendered legibly, and nothing more
Plain-text searches should distinguish between alphabets, e.g., a search for script H shouldn’t match H, etc.
Reduces markup verbosity

Mathematics has need for a number of Latin and Greek alphabets that on first thought appear to be just font variations of one another, e.g., normal, bold, italic and script H. However, in any given document, these characters have distinct mathematical semantics. For example, a normal H represents a different variable from a bold H, etc. If one drops these distinctions in plain text, one gets gibberish. The next slide shows that instead of the well-known Hamiltonian formula ℋ = ∫ dτ (εE² + μH²), you’d get the integral equation H = ∫ dτ (εE² + μH²). Accordingly, the STIX project requests adding normal, bold, italic, script, etc., Latin and Greek alphabets. Straight encoding leads to 996 characters. Some useful common information is lost: for example, the variants of H might not all be trivially recognizable as H’s. But straight encoding does allow plain text to retain the proper character semantics, and it allows simple (nonrich) search methods to work. For example, when you want to search for a script upper-case H, you generally don’t want to find any other kind of H.

Generally the math alphanumerics substantially reduce the verbosity of markup, although one can construct cases in which markup isn’t so verbose. For example, if you had a sequence of bold italic characters, say abcd in bold italic, you could define markup to express this as <mbi>abcd</mbi>. This is 15 characters, whereas using the math alphanumerics you need 8 UTF-16 codes, since the math bold italic letters are in plane 1 and are represented in UTF-16 by surrogate pairs. This is only about half as many codes as in the markup, although in UCS-4 (ISO 10646) it’s about a quarter as many. But I’d argue that this markup representation is poor for several reasons: 1) it complicates a search for a bold italic a, since the search engine needs to understand the tags and dissect the tag contents, 2) it doesn’t tag the characters individually as math identifiers, which is a MathML requirement, and 3) it introduces complexity into the tag model by introducing multiple variable identifier tags. The last of these disadvantages can be overcome by representing the nature of the variables with attributes, e.g., <mi style=bolditalic>, but this approach is indeed quite verbose for items as small as math characters. Admittedly this approach is necessary to handle (quite rare) alphanumeric math symbols that aren’t included in the math alphanumeric block. Searching for such symbols requires a sophisticated attribute-aware search engine, since simple plain-text search engines would yield many undesired search hits.
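The count of 8 UTF-16 codes for four bold italic letters can be checked with a minimal C sketch. The function names utf16_encode and utf16_length are illustrative only, and the code point U+1D482 for math bold italic small a is taken from the Mathematical Alphanumeric Symbols block as later published, which is an assumption relative to the proposal discussed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
   Returns the number of 16-bit code units written (1 or 2), or 0 if invalid. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a lone surrogate */
    if (cp < 0x10000) {                 /* BMP: one code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                      /* planes 1-16: surrogate pair */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}

/* Total number of UTF-16 code units needed for n code points. */
size_t utf16_length(const uint32_t *cps, size_t n)
{
    uint16_t tmp[2];
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += utf16_encode(cps[i], tmp);
    return total;
}
```

Calling utf16_length on the four code points U+1D482 through U+1D485 (bold italic abcd) gives 8, compared with the 15 characters of the <mbi> markup.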

Legibility Loss
Without math alphabets, the Hamiltonian formula ℋ = ∫ dτ [εE² + μH²] becomes an integral equation: H = ∫ dτ [εE² + μH²]

A normal H represents a different variable from a bold H, etc. If one drops these distinctions in plain text, one gets gibberish. For example, instead of the well-known Hamiltonian formula ℋ = ∫ dτ (εE² + μH²), you’d get the integral equation H = ∫ dτ (εE² + μH²). One could conclude that representing mathematical expressions in plain text is hopeless, but with a little effort most mathematical expressions can be represented in plain text. In addition, the resulting symbol set is useful in markup languages like MathML.

Math Alphanumeric Chars (cont)
Bold: a-z, A-Z, 0-9, α-ω, Α-Ω
Italic: a-z, A-Z, α-ω, Α-Ω
Bold italic: a-z, A-Z, α-ω, Α-Ω
Script: a-z, A-Z
Bold script: a-z, A-Z
Fraktur: a-z, A-Z
Bold Fraktur: a-z, A-Z
Open-face: a-z, A-Z, 0-9
Sans-serif: a-z, A-Z, 0-9
Sans-serif bold: a-z, A-Z, 0-9, α-ω, Α-Ω
Sans-serif italic: a-z, A-Z
Sans-serif bold italic: a-z, A-Z, α-ω, Α-Ω
Monospace: a-z, A-Z, 0-9

Note that which fonts are used for these characters is beyond the scope of plain text. The upper-case Greek letters Α-Ω are defined by the Unicode Greek character range U+0391 through U+03A9 plus the nabla (U+2207). The lower-case Greek letters α-ω are defined by the Unicode Greek character range U+03B1 through U+03C9 plus the partial differential sign (U+2202) and the seven glyph variants of ε, θ, κ, φ, ρ, π, and ω given by (new BMP code that resembles U+220A), U+03D1, U+03F0, U+03D5, U+03F1, U+03D6, and U+ (since both glyphs for each of these can appear in the same document with different semantics). The upper-case position U+03A2 corresponding to the final sigma ς is used for the upper-case Θ variant, which looks like the usual Θ except that the “H” in the middle is replaced by a “-”. This gives 25+1 upper-case Greek characters and 25+8 lower-case characters. In addition, corresponding characters in the BMP are used for upright serifed characters when they occur in mathematical expressions.

How to Display Math Alphabets?
Can use the Unicode surrogate pair mechanisms available on the OS
Alternatively, bind to standard fonts and use corresponding BMP characters
The second approach is probably faster, and to display Unicode one needs font binding in any event
A single math font may look more consistent

The question arises as to how to implement the mathematical alphanumeric characters. One approach uses a dedicated math font along with Unicode surrogate pair support from the operating system. This approach has a couple of advantages: 1) the characters can be designed as a group to look good with one another, and 2) relatively little effort is needed in the math display engine. Alternatively, the math display engine can bind the math alphanumerics to standard fonts and use corresponding BMP characters. This approach may have faster performance (depending on how efficiently the operating system handles surrogate pairs). Font binding is an extra step, but it may already be available, since in general to display Unicode it’s very useful to have a font binding facility of some kind.

Multiple Character Encodings
As with nonmath characters, math symbols can often be encoded in multiple ways, composed and decomposed
E.g., ≠ can be U+003D, U+0338 or U+2260
Recommendation: use the fully composed symbol, e.g., U+2260 for ≠
For alphabetic characters, use the fully decomposed sequence, e.g., use U+0061, U+0308 for ä, not U+00E4
Some representations use markup for the alphabetic cases. This allows multicharacter combining marks.

There have been many discussions as to various normalization forms for Unicode characters, and Unicode Technical Report 15 discusses the subject in detail. Math characters are no exception: there are multiple ways of expressing various math characters. It would be nice to have a single way to represent any given character, since this would simplify recognizing the character in searches and other manipulations. Accordingly it’s worthwhile to give some guidelines. The first idea is to use the shortest form of a math operator symbol wherever possible. So U+2260 should be used for the not equal sign instead of the combining sequence U+003D U+0338. On the other hand, for alphabetic characters, use the fully decomposed sequence, e.g., use U+0061, U+0308 for ä, not U+00E4. Mathematics uses a multitude of combining-mark combinations that greatly exceeds the set of predefined composed characters in Unicode. It’s better to have the math display facility handle all of these cases uniformly to give a consistent look between characters that happen to have a fully composed Unicode character and those that don’t. The combining character sequences also typically have semantics as a group, so it’s handy to be able to manipulate and search for them individually without having to have special tables to decompose characters for this purpose. MathML uses markup for some alphabetic cases. This allows multicharacter combining marks, such as a tilde over two or more characters.
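The operator half of this recommendation can be sketched in C as follows. This is a minimal illustration, not a normalization implementation: the names negated_form and compose_negations are hypothetical, the table covers only five operators whose canonical decompositions use U+0338, and letters followed by combining marks are deliberately left decomposed.

```c
#include <stddef.h>
#include <stdint.h>

/* Precomposed negated form of an operator, or 0 if this sketch has none.
   The table is a small illustrative subset. */
static uint32_t negated_form(uint32_t base)
{
    switch (base) {
    case 0x003D: return 0x2260;   /* =  ->  not equal to       */
    case 0x003C: return 0x226E;   /* <  ->  not less-than      */
    case 0x003E: return 0x226F;   /* >  ->  not greater-than   */
    case 0x2208: return 0x2209;   /* element of -> not element */
    case 0x2261: return 0x2262;   /* identical -> not identical */
    default:     return 0;
    }
}

/* Replace "operator + U+0338" by the composed operator, in place.
   Everything else (e.g. a + U+0308) is copied through unchanged.
   Returns the new length. */
size_t compose_negations(uint32_t *text, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t ch = text[i];
        if (i + 1 < n && text[i + 1] == 0x0338) {
            uint32_t composed = negated_form(ch);
            if (composed != 0) {
                ch = composed;
                i++;              /* skip the combining solidus */
            }
        }
        text[out++] = ch;
    }
    return out;
}
```

For the input a, U+003D, U+0338, b the result is a, U+2260, b, while a, U+0308 stays as two code points.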

Compatibility Holes
Compatibility holes (reserved positions) exist in some Unicode sequences to avoid duplicate encodings (ugh!)
E.g., U+2071-U+2073 are holes for ¹²³, which are U+00B9, U+00B2, and U+00B3, respectively
Math alphanumerics have holes corresponding to Letterlike Symbols
Recommendation: you can use the hole codes internally, but should import and export the standard codes

Character standards defined before Unicode included some of the most common math alphabetics. To be interoperable with these standards, Unicode added them in the Letterlike Symbols block U+2100 – U+214F. It’s undesirable to have two code points for the same character, and so the math alphabetics that are already in Unicode are not defined in the math alphanumerics in plane 1. However, to aid implementations, holes occur in the math alphanumerics block at the positions where these characters would have been if the Letterlike Symbols hadn’t already been defined. The recommendation is that you can use the hole codes internally, but should import and export the standard Letterlike Symbol codes if they exist. To know if a character is a math alphanumeric character you can check for inclusion in the two ranges U+2100 – U+214F and U+1D400 – U+1D7FF. In C, you can test whether the character ch is in these ranges using the statement

if (IN_RANGE(0x2100, ch, 0x214F) || IN_RANGE(0x1D400, ch, 0x1D7FF)) {}

where the macro IN_RANGE(n1, b, n2) is defined by

#define IN_RANGE(n1, b, n2) ((unsigned)((b) - (n1)) <= (unsigned)((n2) - (n1)))

This macro effectively has only one conditional branch and is almost as fast as a single compare.
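The import/export recommendation and the range test can be put together in a short C sketch. The function names and the table layout are illustrative assumptions. The hole positions listed are those of the Mathematical Alphanumeric Symbols block as later published (italic h and the script capitals B, E, F, H, I, L, M, R), which is an assumption relative to the proposal described in this talk, and the table is only a partial list.

```c
#include <stddef.h>
#include <stdint.h>

/* Unsigned-wraparound range test: one comparison instead of two. */
#define IN_RANGE(lo, ch, hi) \
    ((uint32_t)((ch) - (lo)) <= (uint32_t)((hi) - (lo)))

int is_math_alnum(uint32_t ch)
{
    return IN_RANGE(0x2100, ch, 0x214F) || IN_RANGE(0x1D400, ch, 0x1D7FF);
}

/* Partial table: hole position in plane 1 -> standard Letterlike Symbol. */
static const struct { uint32_t hole, standard; } hole_map[] = {
    {0x1D455, 0x210E},  /* italic small h   -> PLANCK CONSTANT  */
    {0x1D49D, 0x212C},  /* script capital B -> SCRIPT CAPITAL B */
    {0x1D4A0, 0x2130},  /* script capital E */
    {0x1D4A1, 0x2131},  /* script capital F */
    {0x1D4A3, 0x210B},  /* script capital H */
    {0x1D4A4, 0x2110},  /* script capital I */
    {0x1D4A7, 0x2112},  /* script capital L */
    {0x1D4A8, 0x2133},  /* script capital M */
    {0x1D4AD, 0x211B},  /* script capital R */
};

/* On export, replace an internally used hole code by its standard code. */
uint32_t export_math_char(uint32_t ch)
{
    for (size_t i = 0; i < sizeof hole_map / sizeof hole_map[0]; i++)
        if (hole_map[i].hole == ch)
            return hole_map[i].standard;
    return ch;
}
```

An import routine would apply the same table in the opposite direction, so that internal code can index each alphabet contiguously while interchanged text carries only standard codes.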

Math Glyph Variants
One approach to the math alphanumerics was to use a set of math glyph variant tags
Such a tag follows a base character, imparting a math style
The approach was dropped since it seemed likely to be abused
One math variant tag does exist for purposes of offering a different line slant for some composite symbols

Another way to represent math alphanumerics in plain text would be to use math variant tags that follow the appropriate base characters. This approach is more general than outright encoding since the variant tags could follow any characters in the BMP. However, only certain characters should be eligible for these math styles, so one would have to have tables defining which combinations are legal and which should be discarded or ignored. The approach was dropped because it was felt that it could be abused too easily for nonmath, rich-text purposes that would be better handled using markup. One math variant tag was introduced to get a different line slant for some composite symbols, most notably in the context of negation. Ordinarily negated math symbols have a forward slash overlay, but a vertical slash overlay is used as well and may have different semantics from the forward slash.

Nonstandard Characters
People will always invent new math characters that aren’t yet standardized
Use the private use area for these, with a higher-level marking that these are for math
This approach can lead to collisions in the math community (unless a standard is maintained)
Cut/copy in plain text can have collisions with other uses of the private use area

Mathematicians are by their natures inventive people and will continue to invent new symbols to express their theories. Until these symbols are used by a number of people, they shouldn’t be standardized. Nevertheless, one needs a way to handle these symbols in their initial nonstandard usage. The private use area (U+E000 – U+F8FF) can be used for such nonstandard symbols. It’s a tricky business, since the PUA is used for many purposes. For example, it’s used on Microsoft operating systems to round-trip codes that aren’t currently in Unicode, most notably many Chinese characters. The precise usage may well change, since many such symbols may be assigned to plane 2 (Extension B) and hence be standardized. When using the PUA, it’s a good idea to have higher-level backup to define what kind of characters are involved. If they are used as math symbols, it would be good to assign them a math attribute that’s maintained in a rich-text layer parallel to the plain text. Such layers are used by rich-text programs such as Microsoft Word and Internet Explorer.

Unicode and Markup
Unicode was never intended to represent all aspects of text:
Language attribute: sort order, word breaks
Rich (fancy) text formatting: built-up fractions
Content tags: headings, abstract, author, figure
Glyph variants: Poetica font: 58 ampersands; Mantinia font: novel ligatures (TT, TE, etc.)
MathML adds XML tags for math constructs, but seems awfully wordy

There is a gray zone between rich (fancy) and plain text: embedded codes. In fact, general rich text can be represented using plain text with embedded fields, as illustrated by Hewlett-Packard’s PCL5 print format and various markup languages. A problem with embedded rich text is that it’s hard to edit, since cursor movement involves skipping over embedded fields, and the text can confuse various text scanning programs, such as spelling and grammar checkers. Unicode defines a BiDi (bidirectional) algorithm for mixing left-to-right and right-to-left text that does use a few embedded codes, such as U+200E (left-to-right mark) and U+200F (right-to-left mark). In this talk, we discuss the addition of a few characters that let most mathematical expressions be represented using plain text with a couple of embedded symbols.

Unicode Plain Text
Can do a lot with plain text, e.g., BiDi
Gray zone: use of embedded codes
Unicode ascribes semantics to characters, e.g., paragraph mark, right-to-left mark
Lots of interesting punctuation characters in the range U+2000 to U+204F
Extensive character semantics/properties tables, including mathematical and numerical properties

In Unicode, many characters have default semantics which help in displaying them in mathematical formulae. For example, the characters 0-9 are integers, by default. In principle, one could treat them as variables, and a general math language needs to be able to declare them as such. But such a need is rare, and it arises only in algebraic (symbolic) usage, not in display. One unfortunate thing about MathML is that these default properties of mathematical characters are ignored. You still have to declare mathematical alphabetic characters as variables using an <mi> tag, a numerical digit as a digit using an <mn> tag, and an operator as an operator using an <mo> tag. In our plain-text encoding of Unicode, such tags aren’t used. Accordingly, cases that depend on overriding the default properties, e.g., using a digit as a variable in a symbolic manipulation program, cannot be handled easily.

Unicode Character Semantics
Math characters have a math property
Math characters are numeric, variable, or operator, but not a combination
Properties are useful in parsing math plain text
MathML doesn’t use these properties: every quantity is explicitly tagged
Properties still can be useful for inputting text for MathML (no one wants to type all those tags!)
Sometimes default properties need to be overruled
Might be useful to have more math properties

Unicode assigns a math property to characters that are typically used for mathematics. Math characters are numeric, variable, or operator, but not a combination. So if you want to use a digit as a variable, you need a higher-level protocol. Unicode’s character properties are useful in parsing math plain text. MathML doesn’t use these properties: every quantity is explicitly tagged. This leads to markup that’s substantially more verbose than it would be if these tags were only used to overrule the default Unicode semantics. But the consensus of the MathML committee is that problems would occur if the tags were omitted even when the Unicode semantics are valid. Properties still can be useful for inputting text for MathML (no one wants to type all those tags!), as with the plain-text notation discussed later in this talk. It’s probably useful to have more math properties, such as operator types like relational, binary, unary, and n-ary. Presumably if such properties are defined, they would be informative rather than normative, especially since the characters might be used in other ways on occasion.

Plain Text Encoding
A TeX fraction numerator is what follows a { up to the keyword \over
The denominator is what follows the \over up to the matching }
The { } are not printed
Simple rules give unambiguous “plain text”, but the results don’t look like math
How to make a plain text that looks like math?

A TeX fraction numerator consists of the expression that follows a { up to the keyword \over, and the denominator consists of what follows the \over up to the matching }. In both the fraction and subscript/superscript cases, the { } are not printed. These simple rules immediately give a “plain text” that is unambiguous, but looks quite different from the corresponding mathematical notation, thereby making it hard to read. It’s more appropriate to refer to TeX as a markup language, rather than plain text.

Simple plain text encoding
A simple operand is a span of non-operator characters
E.g., a simple numerator or denominator is terminated by any operator
Operators include arithmetic operators, the whitespace character, all U+22xx, an argument “break” operator (displayed as a small raised dot), and sub/superscript operators
The fraction operator is given by the Unicode fraction slash operator U+2044

It’s possible to define a “plain text” encoding that looks much more like mathematics. Strictly speaking, some constructs require some simplified markup, but many expressions are literally plain (Unicode) text. The notation is handy as a math input language for more elaborate markup languages like TeX and MathML and can be used in its own right. We define a simple operand to consist of all consecutive non-operator characters. We call this sequence of one or more characters a span of non-operators. As such, a simple numerator or denominator is terminated by any operator, including, for example, arithmetic operators, the blank operator, all Unicode characters with codes U+22xx, and a special argument “break” operator consisting of a small raised dot. The fraction operator is given by the Unicode fraction slash operator U+2044.
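The operand rule can be sketched in C as a scan outward from the fraction slash. The function names are hypothetical, and the operator set below is a deliberately small assumption: a handful of ASCII operators, the space, U+2044, and the U+22xx block. The argument “break” operator and the sub/superscript operators are omitted because this talk does not give their code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Operator test for this sketch only (a small, assumed subset). */
static int is_operator(uint32_t ch)
{
    switch (ch) {
    case ' ': case '+': case '-': case '*': case '/': case '=':
    case 0x2044:                            /* FRACTION SLASH */
        return 1;
    default:
        return ch >= 0x2200 && ch <= 0x22FF; /* all U+22xx */
    }
}

/* Index of the first character of the simple operand that ends
   just before the operator at text[pos]. */
size_t operand_start(const uint32_t *text, size_t pos)
{
    size_t i = pos;
    while (i > 0 && !is_operator(text[i - 1]))
        i--;
    return i;
}

/* Index one past the last character of the simple operand that
   begins just after the operator at text[pos]. */
size_t operand_end(const uint32_t *text, size_t n, size_t pos)
{
    size_t i = pos + 1;
    while (i < n && !is_operator(text[i]))
        i++;
    return i;
}
```

For the text x=abc⁄d+e, with the fraction slash at index 5, the numerator spans indices 2 through 4 (abc) and the denominator is the single character at index 6 (d).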

Fractions
abc/d gives the built-up fraction with numerator abc and denominator d
More complicated operands use parentheses ( ), brackets [ ], or { }
Outermost parens aren’t displayed in built-up form
E.g., plain text (a + c)/d displays as the built-up fraction with numerator a + c and denominator d
Easier to read than TeX’s, e.g., {a + c \over d}
MathML: <mfrac><mrow><mi>a</mi><mo>+</mo><mi>c</mi></mrow><mrow><mi>d</mi></mrow></mfrac>
Neat feature: plain text usually looks like math

For more complicated operands, such as those that include operators, parentheses ( ), brackets [ ], or { } can be used to enclose the desired character combinations. If parentheses are used and the outermost parenthesis set is preceded and followed by operators, that set is not displayed in built-up form, since usually one doesn’t want to see such parentheses. So the plain text (a + c)/d displays as shown in the slide. In practice, this approach leads to plain text that is significantly easier to read than TeX’s, e.g., {a + c \over d}, since in many cases outermost parentheses are not needed, while TeX requires { }’s. To force the display of an outermost parenthesis set, one encloses the set, in turn, within parentheses, which then become the outermost set. A really neat feature of this notation is that the plain text is, in fact, a legitimate mathematical notation in its own right, so it’s relatively easy to read.
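The rule about outermost parentheses can be illustrated by a small C sketch that turns an already-separated numerator and denominator into the TeX form quoted on the slide. The function names are hypothetical, the sketch handles only parentheses (not brackets or braces), and it assumes the operands have already been found by some earlier scan.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if s[0..n) is enclosed by one matching pair of parentheses,
   i.e. the opening '(' at s[0] is closed only by the ')' at s[n-1]. */
static int wrapped_in_parens(const char *s, size_t n)
{
    if (n < 2 || s[0] != '(' || s[n - 1] != ')')
        return 0;
    int depth = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] == '(') depth++;
        else if (s[i] == ')') depth--;
        if (depth == 0 && i + 1 < n)
            return 0;             /* first '(' closed early, as in (a)+(b) */
    }
    return depth == 0;
}

/* Write "{num \over den}" into out, dropping one outermost parenthesis
   pair from each operand. Returns the value of snprintf. */
int fraction_to_tex(const char *num, const char *den, char *out, size_t cap)
{
    size_t nn = strlen(num), nd = strlen(den);
    if (wrapped_in_parens(num, nn)) { num++; nn -= 2; }
    if (wrapped_in_parens(den, nd)) { den++; nd -= 2; }
    return snprintf(out, cap, "{%.*s \\over %.*s}",
                    (int)nn, num, (int)nd, den);
}
```

With numerator (a + c) and denominator d this produces {a + c \over d}. Doubling the parentheses, ((a + c)), keeps one visible pair, matching the rule for forcing the display of an outermost set.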

Subscripts and Superscripts
Unicode has numeric subscripts and superscripts along with some operators (U+2070-U+208E)
Others need some kind of markup like <msup>…</msup>
With special subscript and superscript operators (not yet in Unicode), these scripts can be encoded in a nestable way
Use parentheses as for fractions to overrule the built-in precedence order

Nature isn’t so kind with subscripts and superscripts, but they’re still quite readable. Specifically, we introduce a subscript by a subscript operator with its own special glyph that resembles a subscripted down arrow. Similarly we introduce a superscript with a superscript operator, which has a glyph resembling a superscripted up arrow. The subscript itself can be any operand as defined above. These sub/superscript operators aren’t currently part of Unicode. Another compound subscript is a subscripted subscript, which works using right-to-left associativity. This associativity can be overruled using parentheses as described for fractions. The slide shows examples of these subscript and superscript operators along with the corresponding TeX.

Unicode TeX Example

TeX is an incredibly successful math representation and will continue to play an important role for many years to come. It’s interesting that Unicode can make TeX more readable and efficient. The same is true for MathML, although MathML’s XML syntax involves so many tags that it’s not readable without a computer-generated display.

Symbol Entry GUI
PCs can display myriad glyphs, mathematics symbols, and international characters
Hard to input special symbols. Menu methods are slow. Hot keys are great but hard to learn
Reexamine and improve symbol-input and storage methods
With left/right Ctrl/Alt keys, the PC keyboard gives direct access to 600 symbols. Maximum possible = 2^100 ≈ 10^30
Use on-screen, customizable keyboards and symbol boxes
Drag & drop any symbol into apps or onto keyboards

This leads to the important problem of input ease. The ASCII math symbols are easy to find, e.g., + - / * [ ] ( ) { }, but often need to be used as themselves. Similarly it’s easier to type ASCII letters than italic letters, but when used as mathematical variables, such letters are traditionally italicized in print. Intelligent input algorithms can dramatically simplify the entry of mathematical symbols. A math shift facility for keyboard entry can bring up proper math symbols. The values chosen can be displayed on an on-screen keyboard. For example, the left Alt key can access the most common mathematical characters and Greek letters, the right Alt key could access italic characters plus a variety of arrows, and the right Ctrl key could access script characters and other mathematical symbols. The numeric key pad offers locations for a variety of symbols, such as sub/superscript digits using the left Alt key. Other possibilities involve the CapsLock, NumLock and ScrollLock keys in combination with the left/right Ctrl/Alt keys. This scheme rapidly reaches millions of combinations, i.e., more than Unicode can handle! The autocorrect feature of MS Word 97 (and later) offers another way of entering characters for people familiar with TeX. For example, type \alpha and you get α. The symbol box is an array of symbols chosen by the user or by displaying the characters in a font. Symbols in symbol boxes can be dragged & dropped onto keys on the on-screen keyboard(s), or directly into applications.

Hex to Unicode Input Method
Type the Unicode character’s hexadecimal code
Make corrections as need be
Type Alt+x to convert to the character
Type Alt+x to convert back to hex (useful especially for the “missing glyph” character)
Resolve ambiguities by selection
Input higher-plane chars using a 5- or 6-digit code
New MS Office standard

A handy hex-to-Unicode entry method works with WordPad 2000, Office 2000 edit boxes, RichEdit controls in general, and in the next version of Microsoft Word. Basically you type a character’s hexadecimal code (in ASCII), making corrections as need be, and then type Alt+x. Presto! The hexadecimal code is replaced by the corresponding Unicode character. The Alt+x can be a toggle (as in the next version of Microsoft Office). That is, type it once to convert the hex code to a character and type it again to convert the character back to a hex code. If the hex code is preceded by one or more hexadecimal digits, you need to “select” the code so that the preceding hexadecimal characters aren’t included in the code. The code can range up to the value 0x10FFFF, which is the highest character in the 17 planes of Unicode.
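The conversion step behind such a toggle might look like the following C sketch. It is not the RichEdit implementation: the function names are hypothetical, the input is assumed to be exactly the selected hex digits, and padding the reverse direction to at least four digits is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Parse 1 to 6 hex digits into a code point no larger than 0x10FFFF.
   Returns 1 on success, 0 on any other input. */
int hex_to_codepoint(const char *s, uint32_t *cp)
{
    uint32_t value = 0;
    size_t n = 0;
    for (; s[n] != '\0'; n++) {
        char c = s[n];
        uint32_t digit;
        if (c >= '0' && c <= '9')      digit = (uint32_t)(c - '0');
        else if (c >= 'a' && c <= 'f') digit = (uint32_t)(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') digit = (uint32_t)(c - 'A' + 10);
        else return 0;                 /* not a hex digit */
        if (n >= 6) return 0;          /* more than six digits */
        value = (value << 4) | digit;
    }
    if (n == 0 || value > 0x10FFFF)
        return 0;
    *cp = value;
    return 1;
}

/* Reverse direction: write the code point as upper-case hex,
   padded to at least four digits (an assumption of this sketch). */
int codepoint_to_hex(uint32_t cp, char *out, size_t cap)
{
    return snprintf(out, cap, "%04lX", (unsigned long)cp);
}
```

Typing 2260 and converting yields U+2260 (≠), and the five-digit code 1D400 reaches plane 1. Codes above 10FFFF are rejected.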

Built-Up Formula Heuristics
Math characters identify themselves and their neighbors as math
E.g., the fraction slash (U+2044), ASCII operators, U+2200 – U+22FF, and U+20D0 – U+20FF identify neighbors as mathematical
Math characters include various English and Greek alphabets
When heuristics fail, the user can select math mode: WYSIWYG instead of visible math on/off codes

Unicode plain-text encoded mathematical expressions can be used “as is” for simple documentation purposes. Use in more elegant documentation and in programming languages requires knowledge of the underlying mathematical structure. This section describes some of the heuristics that can distill the structure out of the plain text. Many mathematical expressions patently identify themselves as mathematical, obviating the need to declare them explicitly as such. One of TeX’s greatest limitations is its inability to detect expressions that are obviously mathematical, but that are not enclosed within $’s. An advantage of recognizing mathematical expressions without math-on/math-off syntax is that it is much more tolerant of user errors involving $’s. Resyncing is automatic, while in TeX you have to start up again from the omission in question. This approach might also be useful in converting the mathematical literature that’s not yet available in an object-oriented machine-readable form into that form. The basic idea is that math characters identify themselves as such and potentially identify their neighbors as math characters as well. For example, the myriad Unicode math operator symbols and symbol combining marks (U+20D0 – U+20FF) identify the characters immediately surrounding them as parts of math expressions. The Unicode math alphabets contain the vast majority of alphabetic math characters used in print and automatically characterize themselves and their neighbors as members of mathematical expressions.
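The neighbor heuristic can be sketched in C as a single pass that flags each character that is math-identifying or sits immediately next to one. The function names are hypothetical. The sketch uses only the ranges named on the slide plus the plane-1 block U+1D400 – U+1D7FF for the math alphabets, and it leaves out the ASCII operators because deciding when, say, a hyphen is mathematical needs more context than this illustration has.

```c
#include <stddef.h>
#include <stdint.h>

/* Characters that declare themselves (and their neighbors) to be math. */
static int identifies_math(uint32_t ch)
{
    return ch == 0x2044                        /* fraction slash          */
        || (ch >= 0x2200 && ch <= 0x22FF)      /* math operators          */
        || (ch >= 0x20D0 && ch <= 0x20FF)      /* symbol combining marks  */
        || (ch >= 0x1D400 && ch <= 0x1D7FF);   /* math alphanumerics      */
}

/* Set is_math[i] to 1 if text[i], or the character immediately before
   or after it, is math-identifying; otherwise set it to 0. */
void flag_math(const uint32_t *text, size_t n, unsigned char *is_math)
{
    for (size_t i = 0; i < n; i++) {
        is_math[i] = identifies_math(text[i])
            || (i > 0 && identifies_math(text[i - 1]))
            || (i + 1 < n && identifies_math(text[i + 1]));
    }
}
```

In the text a≤b the, the three characters a, ≤, b are flagged and the following space and word are not. A fuller implementation would presumably extend the flagged run to whole operands rather than single neighbors.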

Operator Precedence
Everyone knows that multiply takes precedence over add, e.g., 3+5×3 = 18, not 24
C-language precedence is too intricate for most programmers to use extensively
TeX doesn’t use precedence; it relies on { } to define operator scope
In general, ( ) can be used to clarify or overrule precedence
Precedence reduces clutter, so some precedence is desirable (else things look like LISP!)
But keep it simple enough to remember easily

Operands in subscripts, superscripts, fractions, roots, boxes, etc. are defined in part in terms of operators and operator precedence. While such notions are very familiar to mathematically oriented people, some of the symbols that we define as operators might surprise one at first. Most notably, the space (ASCII 32) is an important operator in the plain-text encoding of mathematics.

Layout Operator Precedence
Subscript, superscript: ↓ ↑
Integral, sum: ∫ Σ Π
Functions: √
Times, divide: / * × · •
Other operators: Space " . , = - + LF Tab
Right brackets: ) ] } |
Left brackets: ( [ {
End of paragraph: FF CR EOP

A minimal list of operators is given above, where LF = U+000A, FF = U+000C, and CR = U+000D. As in arithmetic, operators have precedence, which streamlines the interpretation of operands. The operators are listed above in order of decreasing precedence, with equal precedence values on the same line. For example, in arithmetic, 3+1/2 = 3.5, not 2.
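A precedence lookup for this operator list might be sketched in C as below. The numbering is an assumption of the sketch: rows are numbered so that a larger value binds more tightly, which puts times/divide above plus and so agrees with the 3+1/2 = 3.5 example. The function name is hypothetical, U+2211 and U+220F are assumed for the sum and product signs, and the sub/superscript operators and EOP are left out because no code points are given for them.

```c
#include <stdint.h>

/* Precedence level of a layout operator; larger binds more tightly.
   Returns -1 for characters this sketch does not treat as operators. */
int layout_precedence(uint32_t ch)
{
    switch (ch) {
    case 0x000C: case 0x000D:                 /* FF, CR: end of paragraph */
        return 0;
    case '(': case '[': case '{':             /* left brackets */
        return 1;
    case ')': case ']': case '}': case '|':   /* right brackets */
        return 2;
    case ' ': case '"': case '.': case ',':
    case '=': case '-': case '+':
    case 0x000A: case 0x0009:                 /* LF, Tab: other operators */
        return 3;
    case '/': case '*':
    case 0x00D7: case 0x00B7: case 0x2022:    /* × · • : times, divide */
        return 4;
    case 0x221A:                              /* √ : functions */
        return 5;
    case 0x222B: case 0x2211: case 0x220F:    /* ∫ ∑ ∏ : integral, sum */
        return 6;
    default:
        return -1;
    }
}
```

A parser built on this table would close an operand whenever it meets an operator whose level is not higher than that of the operator that opened it, which is why the + in 3+1/2 does not capture the 1 before the division is formed.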

Mathematics as a Programming Language
Unicode Support for Mathematics Mathematics as a Programming Language Fortran made great steps in getting computers to understand mathematics Java accepts Unicode variable names C++ has preprocessor and operator overloading, but needs extensions to be really powerful Use Unicode characters including math alphanumerics Use plain-text encoding of mathematical expressions Can’t use all mathematical expressions as code, but can go much further than current languages go When to to multiply? In abstract, multiplication is infinitely fast and precise, but not on a computer There has been substantial discussion recently concerning what Unicode characters should be considered as potential program identifiers. Accordingly I include some slides that show how nice it is to have typical mathematical symbols available in computer programs. Java has made an important step in this direction by allowing Unicode variable names. The math alphanumerics allow this approach to go further with relatively little effort for compilers. A key point is that the compiler should display the desired characters in both edit and debug windows. A preprocessor can translate MathML, for example, into C++, but it won’t be able to make the debug windows use the math-oriented characters unless it can handle the underlying Unicode characters. The advantages of using the Unicode plain text in computer programs are at least threefold: 1) many formulas in document files can be programmed simply by copying them into a program file and inserting appropriate multiplication dots. This dramatically reduces coding time and errors. 2) The use of the same notation in programs and the associated journal articles and books leads to an unprecedented level of self documentation. 3) In addition to providing useful tools for the present, these proposed initial steps should help us figure out how to accomplish the ultimate goal of teaching computers to understand and use arbitrary mathematical expressions. 

Unicode Support for Mathematics
void IHBMWM(void)
{
    gammap = gamma*sqrt(1 + I2);
    upsilon = cmplx(gamma+gamma1, Delta);
    alphainc = alpha0*(1-(gamma*gamma*I2/gammap)/(gammap + upsilon));
    if (!gamma1 && fabs(Delta*T1) < 0.01)
        alphacoh = -half*alpha0*I2*pow(gamma/gammap, 3);
    else
    {
        Gamma = 1/T1 + gamma1;
        I2sF = (I2/T1)/cmplx(Gamma, Delta);
        betap2 = upsilon*(upsilon + gamma*I2sF);
        beta = sqrt(betap2);
        alphacoh = 0.5*gamma*alpha0*(I2sF*(gamma + upsilon)
            /(gammap*gammap - betap2))
            *((1+gamma/beta)*(beta - upsilon)/(beta + upsilon)
            - (1+gamma/gammap)*(gammap - upsilon)/
            (gammap + upsilon));
    }
    alpha1 = alphainc + alphacoh;
}

To get an idea of the differences between the standard way of programming mathematical formulas and the proposed way, compare the version of a C++ routine entitled IHBMWM (inhomogeneously broadened multiwave mixing) on this slide to that on the next two. This code was written in 1987.

Unicode Support for Mathematics
The above function runs fine with current C++ compilers, but C++ imposes some serious restrictions because of its limited operator table. For example, vectors can be multiplied together using dot, cross, and outer products, but there’s only one asterisk to overload in C++.
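
The operator-table restriction can be illustrated with a minimal sketch. The Vec3 and Mat3 types and the cross and outer functions below are hypothetical names introduced only for this illustration; they are not from the original slides. Assuming the one available asterisk is given to the dot product, the other two products have to fall back on ordinary named functions:

```cpp
#include <array>

// Hypothetical 3-vector type, used only for illustration.
struct Vec3 {
    double x, y, z;
};

// Hypothetical 3x3 matrix type for the outer product.
using Mat3 = std::array<std::array<double, 3>, 3>;

// The single overloadable asterisk: here it is assigned to the dot product.
inline double operator*(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// The cross product has no operator of its own, so it becomes a named function.
inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// Likewise the outer product: m[i][j] = a_i * b_j.
inline Mat3 outer(const Vec3& a, const Vec3& b) {
    const double u[3] = {a.x, a.y, a.z};
    const double v[3] = {b.x, b.y, b.z};
    Mat3 m{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            m[i][j] = u[i] * v[j];
    return m;
}
```

With these definitions, a * b reads like mathematics, but cross(a, b) and outer(a, b) do not, which is the kind of gap that dedicated Unicode operator characters could close.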

Unicode Support for Mathematics
In built-up form, the function looks even more like mathematics, as shown in this slide. The ability to use the second and third versions of the program is built into the PS Technical Word Processor. With it we already come much closer to true formula translation on input, and the output is displayed in standard mathematical notation. Lines of code can be previewed in built-up format, complete with fraction bars, square roots, and large parentheses. To code a formula, you copy (cut and paste) it from a technical document into a program file, insert appropriate raised dots for multiplication, and compile. No change of variable names is needed. Call that 70% of true formula translation. In this way, the C++ function on the preceding page compiles without modification. The code appears nearly the same as the formulas in print [see Chaps. 5 and 8 of P. Meystre and M. Sargent III (1991), Elements of Quantum Optics, Springer-Verlag].
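
For compilers that cannot consume such text directly, one possible fallback is a translation pass that rewrites the Unicode formula into ASCII C++. The following is only a rough sketch of that idea, not the mechanism used by the PS Technical Word Processor: the function name toAsciiCpp and its four-entry symbol table are assumptions made for illustration, and the replacement is naive substring substitution with no real tokenizing.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: rewrite a UTF-8 formula into ASCII C++ by simple
// substring replacement. The symbol table is an assumption for illustration.
std::string toAsciiCpp(std::string text) {
    static const std::vector<std::pair<std::string, std::string>> table = {
        {"\xE2\x8B\x85", "*"},       // U+22C5 DOT OPERATOR (raised dot)
        {"\xCE\xB3", "gamma"},       // U+03B3 GREEK SMALL LETTER GAMMA
        {"\xCE\x94", "Delta"},       // U+0394 GREEK CAPITAL LETTER DELTA
        {"\xCF\x85", "upsilon"},     // U+03C5 GREEK SMALL LETTER UPSILON
    };
    for (const auto& entry : table) {
        std::string::size_type pos = 0;
        // Replace every occurrence of the UTF-8 sequence with its ASCII name.
        while ((pos = text.find(entry.first, pos)) != std::string::npos) {
            text.replace(pos, entry.first.size(), entry.second);
            pos += entry.second.size();
        }
    }
    return text;
}
```

Under these assumptions, the text γ⋅Δ becomes gamma*Delta. As the speaker notes point out elsewhere in the talk, a translation of this kind cannot make debug windows show the mathematical characters, so it is a stopgap rather than a substitute for compiler support.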

Unicode Support for Mathematics
Conclusions
Unicode provides great support for math in both marked up and plain text
Unicode character properties facilitate plain-text encoding of mathematics but aren’t used in MathML
Heuristics allow plain text to be built up
Need two more Unicode assignments: subscript and superscript operators
On-screen keyboards and symbol boxes aid formula entry
Unicode math characters could be useful for programming languages

Unicode 3.0 has many useful math characters, but the STIX ensemble of mathematical communities (which includes MathML) identified many more math characters that they feel are needed for manipulating and displaying mathematical expressions. Most of the characters requested by STIX are in the final stages of being approved for inclusion in Unicode and ISO. It turns out that many of these characters could be represented in other ways, either in plain text via variant tags or in marked-up text via tags. For purposes of data interchange, it’s important to standardize on one encoding that works for math text in a variety of formats, plain and marked up. The encoding chosen does work very well in these formats.

With a few additions to Unicode, mathematical expressions can be represented with a remarkably readable Unicode plain-text format. The text consists of combinations of operators and operands. A simple operand consists of a span of non-operators, a definition that dramatically reduces the number of parenthesis-override pairs and thereby increases the readability of the plain text. Heuristics can be applied to the Unicode plain text to recognize what parts of a document are mathematical expressions. This allows the Unicode plain text to be used in a variety of ways, including in technical document preparation, symbolic manipulation, and numerical computation. Export to MathML, compilers, and other consumers of mathematical expressions is straightforward, so the approach can be used for handy math input methods as well as a notation in its own right.
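
The operand rule stated in the notes (a simple operand is a span of non-operators) can be sketched as a small splitter. This is a minimal illustration under stated assumptions: the Token type, the splitOperands function, and the ASCII-only operator set are all hypothetical, whereas the scheme described in the talk relies on Unicode character properties to decide what counts as an operator.

```cpp
#include <string>
#include <vector>

// Hypothetical token type, introduced only for this sketch.
struct Token {
    bool isOperator;
    std::string text;
};

// Split plain text into operators and operands. Any run of characters that
// are not in the (assumed, ASCII-only) operator set forms one operand.
std::vector<Token> splitOperands(const std::string& s) {
    const std::string operators = "+-*/^_=()";
    std::vector<Token> tokens;
    std::string operand;
    for (char c : s) {
        if (operators.find(c) != std::string::npos) {
            if (!operand.empty()) {
                tokens.push_back({false, operand});  // close the pending operand
                operand.clear();
            }
            tokens.push_back({true, std::string(1, c)});
        } else {
            operand += c;  // any non-operator extends the current operand
        }
    }
    if (!operand.empty()) tokens.push_back({false, operand});
    return tokens;
}
```

For the input ab+c12/d this yields the operands ab, c12, and d separated by the operators + and /, with no parentheses needed to delimit the multi-character operands.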
