Unicode Support in ICU for Java Doug Felt Globalization Center of Competency, San Jose, CA.

Presentation transcript:

Unicode Support in ICU for Java Doug Felt Globalization Center of Competency, San Jose, CA

2 Overview
What is ICU4J?
ICU and the JDK, a brief history
Benefits and tradeoffs of ICU4J
Features of ICU4J
Performance of ICU4J
Using ICU4J
Conclusion and References

3 What is ICU4J?
Internationalization Library
  – Sister project of ICU (C/C++)
  – Open-source, non-viral license
  – Sponsored by IBM
Unicode Standard compliant, up-to-date
100% Pure Java
Enhances and extends JDK functionality
Over five years of continuous development

4 ICU and Java, a History
Started with Java 1.1 internationalization
  – Much code contributed by IBM/Taligent
  – IBM provided support, bug fixes, enhancements
Became open-source project in 2000
  – ICU4C code started with port from Java
Continued contributions to Java since then
  – TextLayout, OpenType layout, Normalization

5 Collaboration with Java Teams
We continue to work with the Java internationalization and Graphics2D teams
We participate in Java expert groups (e.g. JSR 204, Supplementary Support)
Differences
  – perspectives (conformance, features versus size)
  – processes (open source versus corporate/JSR)
  – timetable (twice a year versus every two years)

6 Benefits
Fully implements current standards
  – Unicode collation, normalization, break iteration
  – Updated more frequently than Java
Full CLDR data
Improved performance
Open source, open license, customizable
Compatible with ICU C/C++ libraries and data
Runs on JDK 1.4
  – Get supplementary support without moving to 1.5

7 Tradeoffs
Not built-in, unlike Java i18n support
Some API differences
  – But generally a superset of the Java API
  – Some differences unavoidable due to class restrictions
  – Rule syntax differs to varying degrees
Data differences
  – ICU4J uses its own CLDR data, not the JVM’s data
Size
  – Can trim ICU4J, but it will always be larger than 0K

8 Features of ICU4J
Collation
Normalization
Break Iteration
UnicodeSet and Transforms
Character Properties
Locale data
Other
  – Calendars, Formatters, IDNA, StringPrep, IMEs

9 Collation
Full UCA (Unicode Collation Algorithm)
  – Java does not implement UCA collation
Locale data
  – Over 60 tailorings for locale-specific collation
  – Variants: Pinyin, stroke, traditional, etc.
Performance
  – sorting: 2 to 20 times faster
  – sort key generation: 1.5 to 4 times faster
  – sort key length: 2/3 to 1/4 the length of Java sort keys
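The two sorting styles the performance bullets compare (direct comparison versus precomputed sort keys) can be sketched as follows. This is a minimal illustration, not ICU4J code: it uses the JDK's java.text.Collator so that it runs without extra jars, and the class name SortKeySketch is made up for the example. The JDK collator is not a UCA implementation, so treat the ordering as illustrative only.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch class; uses the JDK collator as a stand-in for ICU4J's.
final class SortKeySketch {

    /** Sorts a copy of the input by comparing strings directly with the collator. */
    static String[] sortWithCollator(String[] input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        String[] copy = input.clone();
        Arrays.sort(copy, collator); // Collator is a Comparator
        return copy;
    }

    /** Sorts by building one sort key per string, then comparing the keys. */
    static String[] sortWithKeys(String[] input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        CollationKey[] keys = new CollationKey[input.length];
        for (int i = 0; i < input.length; i++) {
            keys[i] = collator.getCollationKey(input[i]);
        }
        Arrays.sort(keys); // CollationKey is Comparable
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] words = {"banana", "apple", "Cherry"};
        System.out.println(Arrays.toString(sortWithCollator(words, Locale.ENGLISH)));
        System.out.println(Arrays.toString(sortWithKeys(words, Locale.ENGLISH)));
    }
}
```

Sort keys tend to pay off when the same strings are compared many times, which is why key generation speed and key length appear as separate measurements on the slide.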

10 Normalization
Java does not provide normalization APIs
  – Java uses ICU’s implementation internally
  – Useful for searching, string equivalence, simplifying processing of text
Full implementation of Unicode standard
  – NFC, NFD, NFKC, NFKD
  – Also provides FCD ‘quick check’ for optimization
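A small sketch of why normalization matters for string equivalence. The slide predates it, but JDK 6 and later expose java.text.Normalizer, which is used here so the example runs without ICU4J on the class path. The class and method names are invented for illustration, and the isNormalized call is a plain "already normalized?" test, not the FCD check the slide mentions.

```java
import java.text.Normalizer;

// Hypothetical sketch class built on the JDK's java.text.Normalizer (Java 6+).
final class NormalizationSketch {

    /** True if the two strings are canonically equivalent (equal after NFC). */
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    /** Normalizes to NFC, skipping the work when the text is already in NFC. */
    static String toNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC)
                ? s
                : Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";  // e with acute, one code point
        String decomposed = "e\u0301";  // e followed by combining acute
        System.out.println(precomposed.equals(decomposed));           // raw comparison
        System.out.println(canonicallyEqual(precomposed, decomposed)); // after NFC
    }
}
```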

11 Break Iteration
Fully conforms to Unicode specifications
  – supplementary characters, Hangul
Tags
  – e.g., “what kind of word was this”
Title case iteration
Rule-based, dictionary-based for Thai

12 Unicode Set and Transforms
UnicodeSet
  – collections of characters based on properties
  – logical set operations, flexible
  – “[[:mark:]&[\u0600-\u067f]]”
Transliterator
  – general transformations, with chaining and editing
  – converts between scripts, e.g. Greek/Latin, Devanagari/Gujarati
  – rule-based, rules for common conversions supplied
UScriptRun
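The set pattern on this slide reads as "marks, intersected with the range U+0600 to U+067F". As a rough analogue only, the same idea can be written with java.util.regex, which spells the mark category as \p{M} and intersection as &&. This is not UnicodeSet and does not use its syntax; the class name MarkSetSketch is invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: a java.util.regex analogue of a property-based character set.
final class MarkSetSketch {

    // Marks (category M) intersected with the range U+0600..U+067F.
    private static final Pattern ARABIC_MARKS =
            Pattern.compile("[\\p{M}&&[\\u0600-\\u067F]]");

    /** True if the code point is in the intersection of the two sets. */
    static boolean contains(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return ARABIC_MARKS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(0x064B)); // ARABIC FATHATAN, a mark inside the range
        System.out.println(contains(0x0627)); // ARABIC LETTER ALEF, a letter inside the range
        System.out.println(contains(0x0301)); // COMBINING ACUTE ACCENT, a mark outside the range
    }
}
```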

13 Character Properties
All Unicode character properties
  – over 80, Java provides access to about 10
All defined code points
Current with latest Unicode release
  – ICU4J 3.0 uses Unicode data
Fast access to character data

14 Locale Data
Standard data, included with ICU4J
  – CLDR (Common Locale Data Repository)
  – Ensures same data is available everywhere
  – Can share resource data with ICU4C applications
More locales, more kinds of data
  – ~230 locales, compared to ~130 for Java
  – Can modularize to include only the data you need
RFC3066bis support (language_script_region)
  – e.g., zh_Hans, zh_Hant
  – keywords (orthogonal variants)
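To make the language_script_region structure concrete, here is a sketch using the JDK's java.util.Locale (Java 7 and later), which postdates this talk. It is a stand-in, not ULocale: JDK language tags use hyphens where the slide's identifiers use underscores, and the Unicode locale extension shown for the collation type is only assumed to play a role similar to the keywords mentioned above. The class name LocaleTagSketch is invented.

```java
import java.util.Locale;

// Hypothetical sketch: script-aware locale identifiers with the JDK (Java 7+).
final class LocaleTagSketch {

    /** Splits a language tag into "language|script|region" (empty if absent). */
    static String describe(String languageTag) {
        Locale loc = Locale.forLanguageTag(languageTag);
        return loc.getLanguage() + "|" + loc.getScript() + "|" + loc.getCountry();
    }

    /** Returns the collation type carried by the tag's Unicode extension, or null. */
    static String collationType(String languageTag) {
        return Locale.forLanguageTag(languageTag).getUnicodeLocaleType("co");
    }

    public static void main(String[] args) {
        System.out.println(describe("zh-Hans-CN"));
        System.out.println(describe("zh-Hant"));
        System.out.println(collationType("de-u-co-phonebk"));
    }
}
```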

15 Performance of ICU4J
Instantiation times are comparable
  – Common instantiate and reuse model
  – ICU4J and Java both use caches to limit impact
Collation performance faster
  – faster sorting, smaller sort keys
Performance is difficult to measure
  – JVM makes a difference
  – ICU4J performs well in spot tests
  – Use a scenario that matters to you to test
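For the advice to test with a scenario that matters to you, a spot test can be as simple as the sketch below. It is an assumption-laden illustration, not the harness behind the talk's numbers: it times the JDK's Character lookups, counts one "op" as one code point visited (three property calls), and the class name PropertyTimingSketch is invented. Results will vary with the JVM, as the slide warns.

```java
// Hypothetical spot-test sketch; not the measurement harness used for the talk.
final class PropertyTimingSketch {

    // Written to after timing so the JIT cannot discard the measured work.
    static volatile long sink;

    /** Average nanoseconds per code point visited, after one warm-up round. */
    static double nanosPerOp(String text, int passes) {
        long ops = (long) passes * text.codePointCount(0, text.length());
        if (ops <= 0) {
            throw new IllegalArgumentException("need non-empty text and passes > 0");
        }
        long checksum = 0;
        for (int p = 0; p < passes; p++) {   // warm-up, not timed
            checksum += scan(text);
        }
        long start = System.nanoTime();
        for (int p = 0; p < passes; p++) {   // timed round
            checksum += scan(text);
        }
        long elapsed = System.nanoTime() - start;
        sink = checksum;
        return (double) elapsed / ops;
    }

    /** Looks up three properties for every code point and sums the results. */
    private static long scan(String text) {
        long sum = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            sum += Character.getType(cp);
            sum += Character.toLowerCase(cp);
            sum += Character.getDirectionality(cp);
            i += Character.charCount(cp);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(nanosPerOp("Hello, \u4e16\u754c", 100000) + " ns/op");
    }
}
```

Swapping the Character calls for the parallel ICU4J class and running both on the target JVM gives the kind of side-by-side figure the next slide reports.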

16 Property Data Timings
Nanoseconds/operation for character property access (getType, toLowerCase, getDirectionality) on three JVMs. 1.13 GHz PIII, Win2K.

JVM            ICU4J        Java         (J-I)/I
Sun            ns/op        101 ns/op    13%
Sun 1.5.0b2    117 ns/op    102 ns/op    -13%
IBM            ns/op        66 ns/op     32%
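The last column, (J-I)/I, is the Java time minus the ICU4J time, divided by the ICU4J time, so a positive value means ICU4J was faster. As a worked check, reading the Sun 1.5.0b2 row as 117 ns/op for ICU4J and 102 ns/op for Java (the split that makes the percentage come out), the sketch below reproduces the -13% shown. The class and method names are invented for the example.

```java
// Hypothetical sketch of the (J-I)/I column from the timing table.
final class RelativeDifference {

    /** (J - I) / I: positive when Java is slower than ICU4J. */
    static double relative(double javaNs, double icuNs) {
        return (javaNs - icuNs) / icuNs;
    }

    /** The same value as a whole percentage, rounded to the nearest integer. */
    static long percent(double javaNs, double icuNs) {
        return Math.round(100.0 * relative(javaNs, icuNs));
    }

    public static void main(String[] args) {
        // (102 - 117) / 117 is about -0.128
        System.out.println(percent(102, 117) + "%");
    }
}
```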

17 Sizes of ICU4J
Full jar file: 2,700K
Modular builds for common subsets
  – normalizer: 420K
  – collator: 1,400K
  – calendar: 1,300K
  – break iterator: 1,300K
  – basic properties: 500K
  – full properties: 1,200K
  – formatting: 2,200K
  – transforms: 1,500K

18 Using ICU4J
Jar file, just add to class path
  – Or roll into your distribution, it’s Open Source!
  – Modular builds help you to trim ICU4J’s code
  – Data can be trimmed to further reduce size
Parallel APIs
  – APIs on parallel classes are generally a superset
  – Change import (one line change) or change class name
  – Some differences unavoidable (our supplementary support for Java 1.4 can’t add API to String)

19 Code Examples (1)
    import com.ibm.icu.text.BreakIterator;

    // Visit every word boundary in text.
    BreakIterator b = BreakIterator.getWordInstance();
    b.setText(text);
    for (int pos = b.first(); pos != BreakIterator.DONE; pos = b.next()) {
        doSomething(pos);
    }
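A self-contained variant of the loop above, for readers who want something to run. It is a sketch under assumptions: it imports the JDK's java.text.BreakIterator rather than the ICU4J class, so it needs no extra jar, and because the JDK class has no word-type tags it decides what counts as a word by looking at the first character of each segment. The class name WordBreakSketch is invented.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: collect the words between successive word boundaries.
final class WordBreakSketch {

    /** Returns the segments that start with a letter or digit, in order. */
    static List<String> words(String text, Locale locale) {
        BreakIterator b = BreakIterator.getWordInstance(locale);
        b.setText(text);
        List<String> out = new ArrayList<>();
        int start = b.first();
        for (int end = b.next(); end != BreakIterator.DONE; start = end, end = b.next()) {
            // Skip segments that are only punctuation or white space.
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                out.add(text.substring(start, end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH));
    }
}
```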

20 Code Examples (2)
    import com.ibm.icu.lang.UCharacter;

    // Step through text one code point at a time.
    int cp, pos = 0;
    while (pos < text.length()) {
        cp = UCharacter.codePointAt(text, pos);
        if (UCharacter.getType(cp) == UCharacter.SURROGATE)
            return true;
        pos += UCharacter.charCount(cp);
    }
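The fragment above reads as a check for unpaired surrogates: codePointAt combines a well-formed surrogate pair into one supplementary code point, so a code point whose type is still SURROGATE must be a stray half. Below is a runnable sketch of that reading, written against java.lang.Character (Java 5 and later) so that it has no dependencies. The method name and the final return false are additions for the example, not part of the slide.

```java
// Hypothetical sketch: detect unpaired surrogates by walking code points.
final class LoneSurrogateSketch {

    /** True if the text contains a surrogate code unit that is not part of a pair. */
    static boolean hasUnpairedSurrogate(String text) {
        int pos = 0;
        while (pos < text.length()) {
            int cp = Character.codePointAt(text, pos);
            if (Character.getType(cp) == Character.SURROGATE) {
                return true;
            }
            pos += Character.charCount(cp); // 2 for supplementary code points, else 1
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasUnpairedSurrogate("a\uD83D\uDE00b")); // well-formed pair
        System.out.println(hasUnpairedSurrogate("a\uD800b"));       // stray high surrogate
    }
}
```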

21 Code Examples (3)
    import com.ibm.icu.util.ULocale;
    import com.ibm.icu.text.Collator;
    import java.util.Arrays;

    // Sort an array of strings with a locale-specific collator.
    ULocale ulocale = new ULocale(...);
    Collator col = Collator.getInstance(ulocale);
    String[] list = ...
    Arrays.sort(list, col);

22 Conclusion
ICU4J is not for you if
  – you have tight size constraints
  – you require the Java runtime behavior
ICU4J is for you if
  – you need full compliance with current standards
  – you need current or additional locale and property data
  – you need customizability
  – you need features missing from Java (normalization)
  – you need additional performance

23 References ICU4J – Java – – Unicode, CLDR – –