Bits of Unicode: Data structures for a large character set. Mark Davis, IBM Emerging Technologies.

Caution
"Characters" is ambiguous; it can mean:
  – Graphemes: x̣ (also ch, …)
  – Code points: 0078 0323
  – Code units: 0078 0323 (or UTF-8: 78 CC A3)
For programmers
  – Unicode associates codepoints (or sequences of codepoints) with properties
  – See UTR#17

The Problem
Programs often have to do lookups
  – Look up properties by codepoint
  – Map codepoints to values
  – Test codepoints for inclusion in a set, e.g. value == true/false
Easy with 256 codepoints: just use an array

Size Matters
Not so easy with Unicode!
Unicode 3.0
  – subset (except PUA)
  – up to FFFF₁₆ = 65,535₁₀
Unicode 3.1
  – full range
  – up to 10FFFF₁₆ = 1,114,111₁₀

Array Lookup
With ASCII: simple, fast, compact
  – codepoint → bit: 32 bytes
  – codepoint → short: ½ K
With Unicode: simple, fast, huge (esp. v3.1)
  – codepoint → bit: 136 K
  – codepoint → short: 2.2 M
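The sizes on this slide follow from simple arithmetic, which can be checked with a short sketch. The helper name `table_bytes` is made up for illustration; a "short" is assumed to be 16 bits.

```python
ASCII_ENTRIES = 256                # one entry per 8-bit codepoint
UNICODE_3_1_ENTRIES = 0x10FFFF + 1 # full range, 1,114,112 codepoints


def table_bytes(entries, bits_per_entry):
    """Bytes needed for a flat array holding one fixed-width entry per codepoint."""
    return (entries * bits_per_entry + 7) // 8


for label, entries in [("256 codepoints", ASCII_ENTRIES),
                       ("Unicode 3.1", UNICODE_3_1_ENTRIES)]:
    print(label,
          "| bit table:", table_bytes(entries, 1), "bytes",
          "| short table:", table_bytes(entries, 16), "bytes")
```

A bit table over the full range comes to 139,264 bytes (136 × 1024), and a table of 16-bit values to 2,228,224 bytes, which is the slide's "2.2 M".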

Further complications
Mappings, tests, properties often must be for sequences of codepoints.
  – Human languages don't just use single codepoints.
  – "ch" in Spanish, Slovak; etc.

First step: Avoidance
Properties from libraries often suffice
  – Test for (Character.getType(c) == Nd) instead of a long list of codepoints
      Easier
      Automatically updated with new versions
Data structures from libraries often suffice
  – Java Hashtable
  – ICU (Java or C++) CompactArray
  – JavaScript properties
Consult
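The same avoidance strategy works in other languages. As a sketch, Python's standard `unicodedata` module exposes the general category, so a decimal-digit test needs no hand-maintained list; the function name `is_decimal_digit` is an illustrative choice.

```python
import unicodedata


def is_decimal_digit(ch):
    """True if the character's Unicode general category is Nd (decimal digit)."""
    return unicodedata.category(ch) == "Nd"


print(is_decimal_digit("7"))        # ASCII digit
print(is_decimal_digit("\u0663"))   # ARABIC-INDIC DIGIT THREE
print(is_decimal_digit("\u00b2"))   # SUPERSCRIPT TWO is category No, not Nd
```

The result tracks whatever Unicode version the runtime ships, which is the "automatically updated" benefit the slide mentions.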

Data structures: criteria
Speed
  – Read (static)
  – Write (dynamic)
  – Startup
Memory footprint
  – RAM
  – Disk
Multi-threading

Hashtables
Advantages
  – Easy to use out-of-the-box
  – Reasonably fast
  – General
Disadvantages
  – High overhead
  – Discrete (no range lookup)
  – Much slower than array lookup

Overhead:
[Figure: memory layout of a Hashtable entry. The entry holds hash, key, value and next, plus object overhead; the key is a separate object (char1, char2, …) with its own overhead.]

Trie
Advantages
  – Nearly as fast as array lookup
  – Much smaller than arrays or Hashtables
  – Take advantage of repetition
Disadvantages
  – Not suited for rapidly changing data
  – Best for static, preformed data

Trie structure
[Figure: the codepoint is split into bit fields M1 and M2; an Index array points into a Data array.]

Trie code
5 operations
  – Shift, Lookup, Mask, Add, Lookup
      v = data[index[c>>S1]+(c&M2)]
[Figure: the codepoint split into fields M1 and M2, with the shift S1 marking the boundary.]
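A minimal sketch of how such an index/data pair could be built and read. The block size (`S1 = 8`), the helper names, and the sample property (two runs of decimal digits) are assumptions for illustration, not ICU's actual layout.

```python
S1 = 8                  # assumed shift: the low 8 bits address a slot inside a block
BLOCK = 1 << S1
M2 = BLOCK - 1          # mask for the low bits


def build_trie(values):
    """Split a flat table into index + data, storing each distinct block only once."""
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])      # offset of this block inside data
    return index, data


def trie_lookup(index, data, c):
    """Shift, lookup, mask, add, lookup."""
    return data[index[c >> S1] + (c & M2)]


# Sample flat table: 1 for ASCII digits and Arabic-Indic digits, 0 elsewhere.
flat = [0] * 0x110000
for c in list(range(0x30, 0x3A)) + list(range(0x660, 0x66A)):
    flat[c] = 1

index, data = build_trie(flat)
print(len(flat), "flat entries ->", len(index), "index +", len(data), "data entries")
```

Because all-zero blocks repeat, only three distinct blocks survive in `data`, which is the "take advantage of repetition" point from the earlier slide.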

Trie: double indexed
Double, for more compaction:
  – Slightly slower than single index
  – Smaller chunks of data, so more compaction

Trie: double indexed
[Figure: the codepoint is split into bit fields M1, M2 and M3; Index1 points into Index2, which points into Data.]

Trie code: double indexed
    b1 = index1[ c >> S1 ]
    b2 = index2[ b1 + ((c >> S2) & M2)]
    v = data[ b2 + (c & M3) ]
[Figure: the codepoint split into fields M1, M2 and M3, with shifts S1 and S2 marking the boundaries.]
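The double-indexed lookup can be sketched by applying the same block deduplication twice. The field widths (`S1 = 11`, `S2 = 5`), the helper names, and the sample data are assumptions chosen only so that the full range divides evenly.

```python
S1, S2 = 11, 5                    # assumed field boundaries
M2 = (1 << (S1 - S2)) - 1         # mask for the middle field
M3 = (1 << S2) - 1                # mask for the low field


def compact(values, block_size):
    """Deduplicate fixed-size blocks; return (offset of each block, pooled storage)."""
    offsets, pool, seen = [], [], {}
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(pool)
            pool.extend(block)
        offsets.append(seen[block])
    return offsets, pool


def build_trie2(values):
    block_offsets, data = compact(values, 1 << S2)            # data blocks
    index1, index2 = compact(block_offsets, 1 << (S1 - S2))   # blocks of offsets
    return index1, index2, data


def lookup2(index1, index2, data, c):
    b1 = index1[c >> S1]
    b2 = index2[b1 + ((c >> S2) & M2)]
    return data[b2 + (c & M3)]


# Sample flat table: 1 for ASCII digits and Arabic-Indic digits, 0 elsewhere.
flat = [0] * 0x110000
for c in list(range(0x30, 0x3A)) + list(range(0x660, 0x66A)):
    flat[c] = 1

index1, index2, data = build_trie2(flat)
print(len(index1), len(index2), len(data))
```

The smaller data blocks leave less padding around each run of non-default values, at the cost of one extra array access per lookup.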

Inversion List
Compaction of set of codepoints
Advantages
  – Simple
  – Very compact
  – Faster write than trie
  – Very fast boolean operations
Disadvantages
  – Slower read than trie or hashtable

Inversion List Structure
Structure
  – Index (optional)
  – List of codepoints in ascending order
Example
  Set [ …, 0135, 19A3-201B ]
  Index:  0: …    1: …    2: 0135  3: 0136  4: 19A3  5: 201C
          in      out     in       out      in       out

Inversion List Example
Find smallest i such that c < data[i]
  – If no i, i = length
Then c ∈ List ⇔ odd(i)
Examples:
  – In: 0023, 0135
  – Out: 001A, 0136, A…
  Index:  0: …    1: …    2: 0135  3: 0136  4: 19A3  5: 201C
          in      out     in       out      in       out
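The membership rule maps directly onto a binary search. In this sketch the first range of the slide's set did not survive in the transcript, so `0020-0061` is a stand-in chosen only to agree with the In/Out examples; the function name `contains` is likewise illustrative.

```python
from bisect import bisect_right

# Range starts and limits in ascending order.
# ASSUMPTION: 0x0020, 0x0062 stand in for the slide's lost first range.
INVERSION_LIST = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]


def contains(inv, c):
    """c is in the set iff the smallest i with c < inv[i] is odd."""
    i = bisect_right(inv, c)   # equals len(inv) when no such i exists
    return i % 2 == 1


for c in (0x0023, 0x0135, 0x001A, 0x0136, 0xA000):
    print(f"{c:04X}", "in" if contains(INVERSION_LIST, c) else "out")
```

`bisect_right` returns the number of entries less than or equal to `c`, which is exactly the "smallest i such that c < data[i]" from the slide.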

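Inversion List Operations
Fast Boolean Operations
Example: Negation
[Figure: the original list at indices 0 to 5 and the negated list, in which the same entries sit one index higher after a new first entry, so every "in" range becomes "out" and vice versa.]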
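Negation is cheap because it only shifts the parity of every index. A hedged sketch, assuming the complement is taken over all codepoints from 0 upward and reusing a stand-in first range (`0020-0061`) that is not from the slide:

```python
from bisect import bisect_right


def contains(inv, c):
    """Odd insertion index means c falls inside a range."""
    return bisect_right(inv, c) % 2 == 1


def negate(inv):
    """Complement the set by toggling a leading 0 entry."""
    if inv and inv[0] == 0:
        return inv[1:]         # set started at 0, so the complement does not
    return [0] + inv           # otherwise the complement starts at 0


# ASSUMPTION: the first range is illustrative, not taken from the slide.
original = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]
negated = negate(original)
print([f"{c:04X}" for c in negated])
```

A production set would probably also cap the final open range at 110000 so that the complement stays within the Unicode codespace.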

Inversion List: Binary Search
From Programming Pearls
Completely unrolled, precalculated parameters
    int index = startIndex;
    if (x >= data[auxStart]) {
        index += auxStart;
    }
    switch (power) {
    case 21: if (x < data[t = index-0x10000]) index = t;
    case 20: if (x < data[t = index-0x8000]) index = t;
    …

Inversion Map
Inversion List plus Associated Values
  – Lookup index just as in Inversion List
  – Take corresponding value
[Figure: an inversion list with indices 0 to 5 above a parallel array of values with indices 0 to 6.]
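A minimal sketch of the lookup, assuming one value per gap between boundaries (hence one more value than boundaries, as the figure's index ranges suggest). The boundaries and the values are invented for illustration.

```python
from bisect import bisect_right

# ASSUMPTION: boundaries and values are illustrative, not taken from the slide.
BOUNDARIES = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]
VALUES = [0, 5, 0, 7, 0, 9, 0]      # VALUES[i] applies when the lookup index is i


def map_lookup(c):
    """Find the index exactly as for an inversion list, then take that slot's value."""
    return VALUES[bisect_right(BOUNDARIES, c)]


for c in (0x001A, 0x0023, 0x0135, 0x2000, 0x201C):
    print(f"{c:04X} -> {map_lookup(c)}")
```

Every codepoint inside the same range shares one stored value, so long runs cost two array entries regardless of their length.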

Key → String Value
Problem
  – Often almost all values are 1 codepoint
  – But, must map to strings in a few cases
  – Don't want overhead for strings always
Solution
  – Exception values indicate extra processing
  – Can use same solution for UTF-16 code units

Example
Get a character ch
Find its value v
If v is in [D800..E000], may be string
  – check v2 = valueException[v - D800]
  – if v2 not null, process it, continue
Process v
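The steps above can be sketched as follows. The mapping (a toy uppercasing table held in a dict rather than a trie), the names, and the reading of the range as D800 up to but excluding E000 are all assumptions for illustration.

```python
EXC_START, EXC_LIMIT = 0xD800, 0xE000   # assumed half-open range of exception values

# ASSUMPTION: toy data. Most values are single codepoints; one is an exception slot.
VALUE_OF = {ord("a"): ord("A"), ord("b"): ord("B"), ord("ß"): 0xD800}
VALUE_EXCEPTION = ["SS"]                # string results, indexed by v - 0xD800


def map_char(ch):
    v = VALUE_OF.get(ord(ch), ord(ch))  # find its value v
    if EXC_START <= v < EXC_LIMIT:      # may be a string
        slot = v - EXC_START
        v2 = VALUE_EXCEPTION[slot] if slot < len(VALUE_EXCEPTION) else None
        if v2 is not None:
            return v2                   # process the string and continue
    return chr(v)                       # process v as an ordinary codepoint


print("".join(map_char(ch) for ch in "abß"))
```

Surrogate code points never occur as real mapping results, so reusing that range for exception markers costs nothing in the common single-codepoint case.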

String Key → Value
Problem
  – Often almost all keys are 1 codepoint
  – Must have string keys in a few cases
  – Don't want overhead for strings always
Solution
  – Exception values indicate possible follow-on codepoints
  – Can use same solution for UTF-16 code units
  – Use key closure!

Closure
If (X + Y) is a key, then X is a key
  Before          After
  s    → x        s    → x
  sh   → y        sh   → y
                  shc  → yw
  shch → z        shch → z
  c    → w        c    → w
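One way to compute such a closure is to add every missing prefix of every key, giving it the value that the unclosed table would have produced for that prefix. This is a sketch under that assumption; the slide does not say how the closure is built, and the function names are illustrative.

```python
def translate(text, table):
    """Greedy longest-match replacement; unmatched characters pass through."""
    longest = max(map(len, table), default=0)
    out, i = [], 0
    while i < len(text):
        for n in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + n] in table:
                out.append(table[text[i:i + n]])
                i += n
                break
        else:                       # no key starts here
            out.append(text[i])
            i += 1
    return "".join(out)


def close_keys(table):
    """Ensure every proper prefix of every key is itself a key."""
    closed = dict(table)
    for key in table:
        for n in range(1, len(key)):
            prefix = key[:n]
            if prefix not in closed:
                closed[prefix] = translate(prefix, table)
    return closed


before = {"s": "x", "sh": "y", "shch": "z", "c": "w"}
after = close_keys(before)
print(after["shc"])
```

Here "shc" is the only missing prefix, and its value is the translation of "sh" followed by "c".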

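Why Closure?
[Figure: matching the input "shcha" one character at a time. The successive keys s, sh, shc, shch yield x, y, yw, z; "shcha" is not found, so the last value found is used.]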
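Closure lets a matcher extend the key one character at a time and stop at the first miss. A sketch of that loop, using plain dict lookups where a real implementation would more likely walk trie nodes; the function name is illustrative.

```python
BEFORE = {"s": "x", "sh": "y", "shch": "z", "c": "w"}
CLOSED = {"s": "x", "sh": "y", "shc": "yw", "shch": "z", "c": "w"}


def match_prefix(text, start, table):
    """Grow the key while it is found; return (last value found, end offset)."""
    last_value, end = None, start
    while end < len(text) and text[start:end + 1] in table:
        end += 1
        last_value = table[text[start:end]]
    return last_value, end


print(match_prefix("shcha", 0, CLOSED))   # reaches the full key "shch"
print(match_prefix("shcha", 0, BEFORE))   # gives up at "shc" and misses "shch"
```

Without the closure entry for "shc", the incremental walk stops after "sh" and never sees the longer key.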

Bitpacking
Squeeze information into value
Example: Character Properties
  – category: 5 bits
  – bidi: 4 bits (+ exceptions)
  – canonical category: 6 bits + expansion
      compressCanon = (bits >> SHIFT) & MASK;
      canon = expansionArray[compressCanon];
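A sketch of packing the three fields into one integer and reading them back. The field order, the shift constants, and the contents of `EXPANSION_ARRAY` are assumptions; only the bit widths come from the slide.

```python
CATEGORY_BITS, BIDI_BITS, CANON_BITS = 5, 4, 6

BIDI_SHIFT = CATEGORY_BITS                    # assumed layout: category in the low bits
CANON_SHIFT = CATEGORY_BITS + BIDI_BITS
CATEGORY_MASK = (1 << CATEGORY_BITS) - 1
BIDI_MASK = (1 << BIDI_BITS) - 1
CANON_MASK = (1 << CANON_BITS) - 1

# ASSUMPTION: illustrative table turning the 6-bit index back into a full value.
EXPANSION_ARRAY = [0, 1, 7, 8, 9, 202, 220, 230, 240]


def pack(category, bidi, canon_index):
    assert category <= CATEGORY_MASK and bidi <= BIDI_MASK and canon_index <= CANON_MASK
    return category | (bidi << BIDI_SHIFT) | (canon_index << CANON_SHIFT)


def unpack(bits):
    category = bits & CATEGORY_MASK
    bidi = (bits >> BIDI_SHIFT) & BIDI_MASK
    compress_canon = (bits >> CANON_SHIFT) & CANON_MASK
    return category, bidi, EXPANSION_ARRAY[compress_canon]


bits = pack(category=9, bidi=3, canon_index=7)
print(bin(bits), unpack(bits))
```

The expansion step is what lets a property with values up to 240 fit in 6 bits: only the distinct values in use need an index.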

Statetables
Classic:
  – entry = stateTable[ state, ch ];
  – state = entry.state;
  – doSomethingWith( entry.action );
  – until (state < 0);

Statetables
Unicode:
  – type = trie[ch];
  – entry = stateTable[ state, type ];
  – state = entry.state;
  – doSomethingWith( entry.action );
  – until (state < 0);
Also, String Key → Value
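The Unicode variant keeps the state table small by indexing it with a character type instead of the raw codepoint. A sketch with an invented three-type, two-state machine that scans a leading identifier; `char_type` uses `unicodedata` as a stand-in for the trie lookup.

```python
import unicodedata

LETTER, DIGIT, OTHER = 0, 1, 2          # character types (table columns)
START, IN_IDENT, STOP = 0, 1, -1        # states; a negative state ends the loop


def char_type(ch):
    """Stand-in for type = trie[ch]."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return LETTER
    if category == "Nd":
        return DIGIT
    return OTHER


# STATE_TABLE[state][type] = (next state, action: consume this character?)
STATE_TABLE = [
    [(IN_IDENT, True), (STOP, False), (STOP, False)],        # START
    [(IN_IDENT, True), (IN_IDENT, True), (STOP, False)],     # IN_IDENT
]


def scan_identifier(text):
    state, length = START, 0
    for ch in text:
        state, consume = STATE_TABLE[state][char_type(ch)]
        if consume:
            length += 1
        if state < 0:
            break
    return text[:length]


print(scan_identifier("x9 = 1"))
```

The table has three columns instead of more than a million, and it handles non-ASCII letters and digits without change.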

Sample Data Structures: ICU
Trie: CompactArray
  – Customized for each datatype
  – Automatic expansion
  – Compact after setting
Character Properties
  – use CompactArray, Bitpacking
Inversion List: UnicodeSet
  – Boolean Operations

Sample Usage #1: ICU
Collation
  – Trie lookup
  – Expanding character: Key → String Value
  – Contracting character: String Key → Value
Break Iterators
  – For grapheme, word, line, sentence break
  – Statetable

Sample Usage #2: ICU
Transliteration
  – Requires
      Mapping codepoints in context to others
      Rearranging codepoints
      Controlling the choice of mapping
  – Character Properties
  – Inversion List
  – Exception values

Sample Usage #3: ICU
Character Conversion
  – From Unicode to bytes
      Trie
  – From bytes to Unicode
      Arrays for simple maps
      Statetables for complex maps
        recognizes valid / invalid mappings
        provides compaction
Complications
  – Invalid vs. Valid mapped vs. Valid unmapped
  – Fallbacks

References
Unicode
Open Source ICU
  –
  – ICU4J: Java API
  – ICU4C: C and C++ APIs
Other references: see Mark's website:
  –

Q & A