26th Internationalization and Unicode Conference, San Jose, September 2004. Getting Started with ICU. Vladimir Weinstein, Eric Mader.

Presentation transcript:

26th Internationalization and Unicode Conference, San Jose, September 2004
Getting Started with ICU
Vladimir Weinstein, Eric Mader

Agenda
• Getting and setting up ICU4C
• Using the conversion engine
• Using the break iterator engine
• Getting and setting up ICU4J
• Using the collation engine
• Using message formats

Getting ICU4C
• Get the latest release
• Get the binary package
• Source download for modifying build options
• CVS for bleeding edge:

Setting up ICU4C
• Unpack the binaries
• If you need to build from source:
  – Windows: MSVC .NET 2003 project, Cygwin + MSVC 6, or just Cygwin
  – Unix: runConfigureICU, make install, make check

Testing ICU4C
• Windows: run cintltst, intltest, iotest
• Unix: make check (again)
• See it for yourself:

    #include <stdio.h>
    #include "unicode/utypes.h"
    #include "unicode/ures.h"

    int main(void) {
        UErrorCode status = U_ZERO_ERROR;
        /* Try to open a resource bundle from ICU's own data. */
        UResourceBundle *res = ures_open(NULL, "", &status);
        if (U_SUCCESS(status)) {
            printf("everything is OK\n");
        } else {
            printf("error %s opening resource\n", u_errorName(status));
        }
        ures_close(res);
        return 0;
    }

Conversion Engine - Opening
• ICU4C uses an open/use/close paradigm
• Open a converter:

    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open(encoding, &status);
    if (U_FAILURE(status)) {
        /* process the error situation, die gracefully */
    }

• Almost all APIs use UErrorCode for status
• Check the error code!

What Converters are Available
• ucnv_countAvailable() – get the number of available converters
• ucnv_getAvailableName() – get the name of a particular converter
• Many ICU frameworks allow this kind of examination
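The same count-then-name enumeration can be sketched in Java. This is only an analogue: it uses the JDK's java.nio.charset.Charset rather than ICU's converter API, and the class and method names (ListCharsets, countAvailable, isAvailable) are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

// Hypothetical demo class: enumerates the JDK's charsets, not ICU converters.
class ListCharsets {
    // Number of charsets the running JVM can convert to and from.
    static int countAvailable() {
        return Charset.availableCharsets().size();
    }

    // True if the named charset (or one of its aliases) is supported.
    static boolean isAvailable(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        SortedMap<String, Charset> all = Charset.availableCharsets();
        System.out.println(all.size() + " charsets available");
        for (String name : all.keySet()) {
            System.out.println(name);
        }
    }
}
```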

Converting Text Chunk by Chunk

    char buffer[DEFAULT_BUFFER_SIZE];
    char *bufP = buffer;
    int32_t len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE,
                                  source, sourceLen, &status);
    if (U_FAILURE(status)) {
        if (status == U_BUFFER_OVERFLOW_ERROR) {
            /* len is the required size: allocate that much and retry */
            status = U_ZERO_ERROR;
            bufP = (char *)malloc((len + 1) * sizeof(char));
            len = ucnv_fromUChars(cnv, bufP, len + 1,
                                  source, sourceLen, &status);
        } else {
            /* other error, die gracefully */
        }
    }
    /* do interesting stuff with the converted text */
    /* (free bufP afterwards if it was malloc'ed) */

Converting Text Character by Character
• Works only from code page to Unicode

    UChar32 result;
    const char *source = start;
    const char *sourceLimit = start + len;
    while (source < sourceLimit) {
        /* Converts one code point and advances source past it */
        result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
        if (U_FAILURE(status)) {
            /* die gracefully */
        }
        /* do interesting stuff with the converted text */
    }

Converting Text Piece by Piece

    while ((!feof(f)) &&
           ((count = fread(inBuf, 1, BUFFER_SIZE, f)) > 0)) {
        source = inBuf;
        sourceLimit = inBuf + count;
        do {
            target = uBuf;
            targetLimit = uBuf + uBufSize;
            ucnv_toUnicode(conv, &target, targetLimit,
                           &source, sourceLimit, NULL,
                           feof(f) ? TRUE : FALSE, /* pass 'flush' when eof is true */
                                                   /* (when no more data will come) */
                           &status);
            if (status == U_BUFFER_OVERFLOW_ERROR) {
                // simply ran out of space – we'll reset the
                // target ptr the next time through the loop.
                status = U_ZERO_ERROR;
            } else {
                // Check other errors here and act appropriately
            }
            text.append(uBuf, target - uBuf);
            count += target - uBuf;
        } while (source < sourceLimit); // while simply out of space
    }
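The two ideas in the loop above, resetting the target buffer when it overflows and passing a flush flag with the last piece, can be sketched in Java with the JDK's CharsetDecoder. This is an analogue under assumptions, not ICU's ucnv_toUnicode: UTF-8 input is assumed, and the class name, method name and buffer sizes are invented for the sketch.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; a JDK analogue of the piece-by-piece loop.
class ChunkedDecode {
    // Decodes UTF-8 bytes in pieces of chunkSize bytes (chunkSize must be positive).
    static String decodeInPieces(byte[] data, int chunkSize) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        StringBuilder text = new StringBuilder();
        // Extra room holds the bytes of a character split across two pieces.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 8);
        // Deliberately tiny so that the overflow branch is exercised.
        CharBuffer out = CharBuffer.allocate(4);
        int pos = 0;
        boolean endOfInput;
        do {
            int n = Math.min(chunkSize, data.length - pos);
            in.put(data, pos, n);
            pos += n;
            endOfInput = pos >= data.length; // plays the role of 'flush'
            in.flip();
            CoderResult result;
            do {
                result = decoder.decode(in, out, endOfInput);
                if (result.isError()) {
                    throw new IllegalArgumentException("bad input: " + result);
                }
                // Drain the output; after an overflow we go around again.
                out.flip();
                text.append(out);
                out.clear();
            } while (result.isOverflow());
            in.compact(); // keep any incomplete trailing bytes for the next piece
        } while (!endOfInput);
        decoder.flush(out); // emit anything the decoder still holds
        out.flip();
        text.append(out);
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo \u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeInPieces(bytes, 3));
    }
}
```

A piece size of 3 is used in main so that multi-byte characters are split across pieces, which is the case the leftover-bytes handling exists for.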

Clean up!
• Whatever is opened needs to be closed
• Converters use ucnv_close
• The sample uses conversion to convert code page data from a file

Break Iteration - Introduction
• Four types of boundaries:
  – Character, word, line, sentence
• The iterator points to a boundary between two characters
• A boundary is reported as the index of the character following it
• Use current() to get the boundary
• Use first() to set the iterator to the start of the text
• Use last() to set the iterator to the end of the text

Break Iteration - Navigation
• Use next() to move to the next boundary
• Use previous() to move to the previous boundary
• Both return DONE if the iterator cannot move any further

Break Iteration – Checking a position
• Use isBoundary() to see if a position is a boundary
• Use preceding() to find the last boundary before a position
• Use following() to find the first boundary after a position
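A small Java sketch of these position checks follows. It uses the JDK's java.text.BreakIterator as a stand-in, on the assumption that ICU4J's com.ibm.icu.text.BreakIterator behaves the same way for these calls. The class and helper names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical demo class using the JDK's BreakIterator, not ICU's.
class BoundaryCheck {
    private static BreakIterator wordsOf(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        return it;
    }

    // True if offset is a word boundary in text.
    static boolean isBoundary(String text, int offset) {
        return wordsOf(text).isBoundary(offset);
    }

    // First boundary strictly after offset, or BreakIterator.DONE.
    static int following(String text, int offset) {
        return wordsOf(text).following(offset);
    }

    // Last boundary strictly before offset, or BreakIterator.DONE.
    static int preceding(String text, int offset) {
        return wordsOf(text).preceding(offset);
    }

    public static void main(String[] args) {
        String text = "Hello world"; // word boundaries at 0, 5, 6 and 11
        System.out.println(isBoundary(text, 5));
        System.out.println(following(text, 5));
        System.out.println(preceding(text, 5));
    }
}
```

Note that asking for the boundary after or before an offset that is itself a boundary moves on to the neighbouring one; use isBoundary() first if the offset itself should count.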

Break Iteration - Opening
• Use the factory methods:

    Locale locale = …; // locale to use for break iterators
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator *characterIterator =
        BreakIterator::createCharacterInstance(locale, status);
    BreakIterator *wordIterator =
        BreakIterator::createWordInstance(locale, status);
    BreakIterator *lineIterator =
        BreakIterator::createLineInstance(locale, status);
    BreakIterator *sentenceIterator =
        BreakIterator::createSentenceInstance(locale, status);

• Don't forget to check the status!

Set the text
• We need to tell the iterator what text to use:

    UnicodeString text;
    readFile(file, text);
    wordIterator->setText(text);

• Reuse iterators by calling setText() again.

Break Iteration - Counting words in a file:

    int32_t countWords(BreakIterator *wordIterator, UnicodeString &text) {
        UErrorCode status = U_ZERO_ERROR;
        UnicodeString word;
        UnicodeSet letters(UnicodeString("[:letter:]"), status);
        int32_t wordCount = 0;
        int32_t start = wordIterator->first();
        for (int32_t end = wordIterator->next();
             end != BreakIterator::DONE;
             start = end, end = wordIterator->next()) {
            text.extractBetween(start, end, word);
            // Count only segments that contain at least one letter
            if (letters.containsSome(word)) {
                wordCount += 1;
            }
        }
        return wordCount;
    }
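For comparison, the same counting loop can be written in Java. This sketch uses the JDK's java.text.BreakIterator and Character.isLetter in place of ICU's BreakIterator and UnicodeSet, which is an assumption made so that it runs without ICU4J. The class name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical demo class; JDK analogue of the countWords() function above.
class WordCount {
    static int countWords(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        int count = 0;
        int start = words.first();
        for (int end = words.next();
             end != BreakIterator.DONE;
             start = end, end = words.next()) {
            // Spaces and punctuation are segments too; count only
            // segments that contain at least one letter.
            if (text.substring(start, end).codePoints().anyMatch(Character::isLetter)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello, world! 123", Locale.US));
    }
}
```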

Break Iteration – Breaking lines

    int32_t previousBreak(BreakIterator *breakIterator,
                          UnicodeString &text, int32_t location) {
        int32_t len = text.length();
        // Skip over whitespace and control characters at the location
        while (location < len) {
            UChar c = text[location];
            if (!u_isWhitespace(c) && !u_iscntrl(c)) {
                break;
            }
            location += 1;
        }
        // preceding() returns the last boundary before the given offset
        return breakIterator->preceding(location + 1);
    }

Break Iteration – Cleaning up
• Use delete to delete the iterators

    delete characterIterator;
    delete wordIterator;
    delete lineIterator;
    delete sentenceIterator;

Useful Links
• Homepage:
• API documents:
• User guide:

Getting ICU4J
• Easiest – pick a .jar file off download section on
• Use the latest version if possible
• For sources, download the source .jar
• For bleeding edge, use the latest CVS
• sr/cvs/icu4j

Setting up ICU4J
• Check that you have the appropriate JDK version
• Try the test code (ICU4J 3.0 or later):

    import com.ibm.icu.util.ULocale;
    import com.ibm.icu.util.UResourceBundle;

    public class TestICU {
        public static void main(String[] args) {
            // Loading a bundle fails if ICU4J or its data cannot be found
            UResourceBundle resourceBundle =
                UResourceBundle.getBundleInstance(null, ULocale.getDefault());
        }
    }

• Add ICU's jar to the classpath on the command line
• Run the test suite

Building ICU4J
• Need ant in addition to the JDK
• Use ant to build
• We also like Eclipse

Collation Engine
• More on collation in a couple of hours!
• Used for comparing strings
• Instantiation:

    ULocale locale = new ULocale("fr");
    Collator coll = Collator.getInstance(locale);
    // do useful things with the collator

• Lives in com.ibm.icu.text.Collator

String Comparison
• Works fast
• You get the result as soon as it is ready
• Use when you don't need to compare the same strings many times

    int compare(String source, String target);
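A minimal sketch of using compare() for sorting is shown here. It uses the JDK's java.text.Collator, on the assumption that ICU4J's com.ibm.icu.text.Collator can be dropped in the same way. The class name and sample words are made up.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class using the JDK's Collator rather than ICU4J's.
class CollatorSort {
    // Returns a copy of words sorted by the locale's collation rules.
    static List<String> sort(List<String> words, Locale locale) {
        Collator coll = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        // A Collator is a Comparator; each comparison calls compare().
        sorted.sort(coll);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("zebra", "apple", "Banana");
        System.out.println(sort(words, Locale.FRENCH));
    }
}
```

Unlike String.compareTo, which orders by code unit value and so puts "Banana" before "apple", the collator orders by letter first and by case only as a tie-breaker.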

Sort Keys
• Used when multiple comparisons are required
• Useful for indexes in databases
• ICU4J has two sort key classes
• Compare only sort keys generated by the same type of collator

CollationKey class
• JDK compatible
• Saves the original string
• Compare keys with the compareTo method
• Get the bytes with the toByteArray method
• We used CollationKey as a key for a TreeMap structure

RawCollationKey class
• Does not store the original string
• Get it by using the getRawCollationKey method
• Mutable class, can be reused
• Simple and lightweight
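The TreeMap use of CollationKey mentioned above might look roughly like the sketch below. It uses the JDK's java.text.CollationKey, which the slides describe ICU4J's CollationKey as compatible with; RawCollationKey has no JDK counterpart and is not shown. The class and method names are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Hypothetical demo class: a TreeMap ordered by sort keys.
class SortKeyIndex {
    // Returns the words in the order the sort-key index stores them.
    static List<String> indexOrder(List<String> words, Locale locale) {
        Collator coll = Collator.getInstance(locale);
        // Each string is turned into a key once; the map then orders
        // entries with CollationKey.compareTo, without re-collating.
        TreeMap<CollationKey, String> index = new TreeMap<>();
        for (String word : words) {
            index.put(coll.getCollationKey(word), word);
        }
        List<String> ordered = new ArrayList<>();
        for (CollationKey key : index.keySet()) {
            ordered.add(key.getSourceString()); // the key keeps the original string
        }
        return ordered;
    }

    public static void main(String[] args) {
        System.out.println(indexOrder(Arrays.asList("zebra", "apple", "Banana"), Locale.FRENCH));
    }
}
```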

Message Format - Introduction
• Assembles a user message from parts
• Some parts are fixed, some are supplied at runtime
• The order differs between languages:
  – English: My Aunt's pen is on the table.
  – French: The pen of my Aunt is on the table.
• A pattern string defines how to assemble the parts:
  – English: {0}''s {2} is {1}.
  – French: {2} of {0} is {1}.
• Get the pattern string from a resource bundle

Message Format - Example

    String person = …; // e.g. "My Aunt"
    String place = …;  // e.g. "on the table"
    String thing = …;  // e.g. "pen"
    String pattern = resourceBundle.getString("personPlaceThing");
    MessageFormat msgFmt = new MessageFormat(pattern);
    Object arguments[] = {person, place, thing};
    String message = msgFmt.format(arguments);
    System.out.println(message);
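To see the reordering at work without a resource bundle, the two patterns from the introduction can be passed in directly. This sketch uses the JDK's java.text.MessageFormat, assuming ICU4J's MessageFormat treats these simple patterns the same way. The class and method names are hypothetical.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical demo class: the same arguments with two different patterns.
class AuntMessage {
    static String build(String pattern, String person, String place, String thing) {
        MessageFormat fmt = new MessageFormat(pattern, Locale.US);
        // Argument indexes: {0} = person, {1} = place, {2} = thing
        return fmt.format(new Object[] {person, place, thing});
    }

    public static void main(String[] args) {
        // '' in a pattern stands for one literal apostrophe.
        System.out.println(build("{0}''s {2} is {1}.", "My Aunt", "on the table", "pen"));
        System.out.println(build("{2} of {0} is {1}.", "My Aunt", "on the table", "pen"));
    }
}
```

Only the pattern changes between the two calls; the code and the argument order stay the same, which is why the pattern belongs in a resource bundle.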

Message Format – Different data types
• We can also format other data types, like dates
• We do this by adding a format type:

    String pattern = "On {0, date} at {0, time} there was {1}.";
    MessageFormat fmt = new MessageFormat(pattern);
    Object args[] = {
        new Date(System.currentTimeMillis()), // 0
        "a power failure"                     // 1
    };
    System.out.println(fmt.format(args));

• This will output:

    On Jul 17, 2004 at 2:15:08 PM there was a power failure.

Message Format – Format styles
• Add a format style:

    String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
    MessageFormat fmt = new MessageFormat(pattern);
    Object args[] = {
        new Date(System.currentTimeMillis()), // 0
        "a power failure"                     // 1
    };
    System.out.println(fmt.format(args));

• This will output:

    On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.

Message Format – Format style details

    Format Type   Format Style   Sample Output
    number        (none)         123,
                  integer        123,457
                  currency       $123,
                  percent        12%
    date          (none)         Jul 17, 2004
                  short          7/17/04
                  medium         Jul 17, 2004
                  long           July 17, 2004
                  full           Saturday, July 17, 2004
    time          (none)         2:15:08 PM
                  short          2:15 PM
                  medium         2:15:08 PM
                  long           2:15:08 PM PDT
                  full           2:15:08 PM PDT

Message Format – No format type
• If no format type, data formatted like this:

    Data Type   Sample Output
    Number      123,
    Date        7/17/04 2:15 PM
    String      on the table
    others      output of toString() method

Message Format – Counting files
• Pattern to display number of files:

    There are {1, number, integer} files in {0}.

• Code to use the pattern:

    String pattern = resourceBundle.getString("fileCount");
    MessageFormat fmt = new MessageFormat(pattern);
    String directoryName = … ;
    int fileCount = … ;
    Object args[] = {directoryName, new Integer(fileCount)};
    System.out.println(fmt.format(args));

• This will output messages like:

    There are 1,234 files in myDirectory.

Message Format – Problems counting files
• If there's only one file, we get:

    There are 1 files in myDirectory.

• Could fix by testing for the special case of one file
• But, some languages need other special cases:
  – Dual forms
  – Different form for no files
  – Etc.

Message Format – Choice format
• Choice format handles all of this
• Use a special format element:

    There {1, choice, 0#are no files|
                      1#is one file|
                      1<are {1, number, integer} files} in {0}.

• Using this pattern with the same code we get:

    There are no files in thisDirectory.
    There is one file in thatDirectory.
    There are 1,234 files in myDirectory.

Message Format – Choice format patterns
• Selects a string based on a number
• If the string is a format element, process it
• Splits the real line into two or more ranges
• Range specifiers are separated by a vertical bar ("|")
• Each specifier is: lower limit, separator, string
• The separator indicates the type of lower limit:

    Separator   Lower Limit
    #           inclusive
    ≤           inclusive
    <           exclusive

Message Format – Choice pattern details
• Here's our pattern again:

    There {1, choice, 0#are no files|
                      1#is one file|
                      1<are {1, number, integer} files} in {0}.

• First range is [0..1)
  – Really [-∞..1)
• Second range is [1..1]
• Third range is (1..∞]
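The three ranges can be exercised with a short Java sketch. It uses the JDK's java.text.MessageFormat with the pattern written on one line and without spaces around the choice segments, and it fixes the locale to US so the grouping separator is a comma; both are choices made for this sketch. The class and method names are hypothetical.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical demo class for the file-count choice pattern.
class FileCountMessage {
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.";

    static String describe(String directoryName, int fileCount) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        // {0} = directory name, {1} = file count (used by both choice and number)
        return fmt.format(new Object[] {directoryName, fileCount});
    }

    public static void main(String[] args) {
        System.out.println(describe("thisDirectory", 0));
        System.out.println(describe("thatDirectory", 1));
        System.out.println(describe("myDirectory", 1234));
    }
}
```

The third alternative itself contains a format element, so after the choice is made the selected string is formatted again with the same arguments.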

Message Format – Other details
• Format style can be a pattern string
  – Format type number: use a DecimalFormat pattern
  – Format type date, time: use a SimpleDateFormat pattern
• Quoting in patterns
  – Enclose special characters in single quotes
  – Use two consecutive single quotes to represent one

    The '{' character, the '#' character and the '' character.
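Both points can be tried out in a few lines of Java. The sketch uses the JDK's java.text.MessageFormat with a US locale; the DecimalFormat pattern "#,##0.00" and the "Total:" message are examples invented here, not taken from the slides, and the class name is hypothetical.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical demo class for custom style patterns and quoting.
class PatternDetails {
    static String format(String pattern, Object... args) {
        return new MessageFormat(pattern, Locale.US).format(args);
    }

    public static void main(String[] args) {
        // A DecimalFormat pattern used as the style of a number element.
        System.out.println(format("Total: {0,number,#,##0.00}", 1234.5));
        // Quoted special characters come out as literals; '' becomes one quote.
        System.out.println(format("The '{' character, the '#' character and the '' character."));
    }
}
```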

Useful Links
• Homepage:
• API documents:
• User guide: