Bits of Unicode Data structures for a large character set Mark Davis IBM Emerging Technologies
Caution "Characters" is ambiguous; sometimes it means: –Graphemes: x̣ (also ch, …) –Code points: 0078 0323 –Code units: 0078 0323 (or UTF-8: 78 CC A3) For programmers –Unicode associates codepoints (or sequences of codepoints) with properties –See UTR#17
The Problem Programs often have to do lookups –Look up properties by codepoint –Map codepoints to values –Test codepoints for inclusion in set e.g. value == true/false Easy with 256 codepoints: just use array
Size Matters Not so easy with Unicode! Unicode 3.0 –subset (except PUA) –up to FFFF (hex) = 65,535 (decimal) Unicode 3.1 –full range –up to 10FFFF (hex) = 1,114,111 (decimal)
Array Lookup With ASCII Simple Fast Compact –codepoint → bit: 32 bytes –codepoint → short: ½ K With Unicode Simple Fast Huge (esp. v3.1) –codepoint → bit: 136 K –codepoint → short: 2.2 M
Further complications Mappings, tests, and properties often must apply to sequences of codepoints. –Human languages don't just use single codepoints. –"ch" in Spanish, Slovak; etc.
First step: Avoidance Properties from libraries often suffice –Test for (Character.getType(c) == Nd) instead of long list of codepoints Easier Automatically updated with new versions Data structures from libraries often suffice –Java Hashtable –ICU (Java or C++) CompactArray –JavaScript properties Consult
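The library-property test mentioned on this slide can be sketched in Java as follows. The class and method names are hypothetical; `Character.getType` and `Character.DECIMAL_DIGIT_NUMBER` (general category Nd) are standard Java APIs.

```java
final class DigitCheck {
    // True if the code point has general category Nd (decimal digit number).
    // The library's property data does the work, so no hand-maintained
    // list of digit code points is needed.
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }
}
```

For instance, `DigitCheck.isDecimalDigit(0x0663)` (ARABIC-INDIC DIGIT THREE) is answered from the runtime's own Unicode tables rather than from a list maintained by the caller.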
Data structures: criteria Speed –Read (static) –Write (dynamic) –Startup Memory footprint –RAM –Disk Multi-threading
Hashtables Advantages –Easy to use out-of-the-box –Reasonably fast –General Disadvantages –High overhead –Discrete (no range lookup) –Much slower than array lookup
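Hashtables: overhead [Diagram: memory layout of a hashtable entry with fields hash, key, value, next; the key's characters (char1, char2, …) are stored separately, and every object carries its own overhead]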
Trie Advantages –Nearly as fast as array lookup –Much smaller than arrays or Hashtables –Take advantage of repetition Disadvantages –Not suited for rapidly changing data –Best for static, preformed data
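Trie structure [Diagram: the codepoint is split into bit fields M1 and M2; M1 selects an entry in Index, which points to a block in Data; M2 selects the entry within that block]
Trie code 5 Operations –Shift, Lookup, Mask, Add, Lookup v = data[index[c>>S1]+(c&M2)] [Diagram: codepoint split into fields M1 and M2; S1 is the shift that isolates M1]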
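The single-index lookup above can be tried out with a small sketch like the following. It is not ICU's CompactArray: the class name, the block size (S1 = 8) and the way blocks are deduplicated are assumptions made for illustration. Only the `get` method mirrors the slide's formula.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SingleIndexTrie {
    static final int S1 = 8;              // assumed shift: 256 codepoints per block
    static final int M2 = (1 << S1) - 1;  // mask for the offset inside a block
    final int[] index;                    // block number -> start offset in data
    final byte[] data;                    // unique blocks, concatenated

    // Builds the trie from a flat array whose length is a multiple of 256.
    SingleIndexTrie(byte[] flat) {
        int blockSize = 1 << S1;
        index = new int[flat.length >> S1];
        byte[] buf = new byte[flat.length];
        int used = 0;
        Map<String, Integer> seen = new HashMap<>();
        for (int b = 0; b < index.length; b++) {
            byte[] block = Arrays.copyOfRange(flat, b << S1, (b + 1) << S1);
            String key = Arrays.toString(block);
            Integer start = seen.get(key);
            if (start == null) {          // first time this block content appears
                start = used;
                System.arraycopy(block, 0, buf, used, blockSize);
                used += blockSize;
                seen.put(key, start);
            }
            index[b] = start;             // identical blocks share one copy
        }
        data = Arrays.copyOf(buf, used);
    }

    // Shift, lookup, mask, add, lookup.
    byte get(int c) {
        return data[index[c >> S1] + (c & M2)];
    }
}
```

With a property that is zero almost everywhere, nearly all index entries point at the same all-zero block, which is where the space saving comes from.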
Trie: double indexed Double, for more compaction: –Slightly slower than single index –Smaller chunks of data, so more compaction
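Trie: double indexed [Diagram: the codepoint is split into bit fields M1, M2, M3; M1 selects an entry in Index1, which points to a block in Index2; M2 selects an entry in that block, which points to a block in Data; M3 selects the value]
Trie code: double indexed
b1 = index1[ c >> S1 ]
b2 = index2[ b1 + ((c >> S2) & M2) ]
v = data[ b2 + (c & M3) ]
[Diagram: codepoint split into bit fields M1, M2, M3 from high to low; S1 and S2 are the shifts that isolate M1 and M2]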
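A rough sketch of the double-indexed variant follows. The shift values (S1 = 10, S2 = 4), the class name and the `compact` helper are assumptions for illustration, not taken from ICU. The same block-deduplication step is applied twice: once to the values, then to the resulting table of block offsets.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class DoubleIndexTrie {
    static final int S1 = 10, S2 = 4;                 // assumed field widths
    static final int M2 = (1 << (S1 - S2)) - 1;       // middle field mask
    static final int M3 = (1 << S2) - 1;              // low field mask
    final int[] index1, index2, data;

    // flat.length must be a multiple of 1 << S1.
    DoubleIndexTrie(int[] flat) {
        int[][] stage = compact(flat, 1 << S2);       // {offsets, unique blocks}
        data = stage[1];
        stage = compact(stage[0], 1 << (S1 - S2));    // compact the offsets too
        index1 = stage[0];
        index2 = stage[1];
    }

    // Splits flat into blocks, stores each distinct block once, and
    // returns the per-block start offsets plus the concatenated blocks.
    static int[][] compact(int[] flat, int blockSize) {
        int[] offsets = new int[flat.length / blockSize];
        int[] blocks = new int[flat.length];
        int used = 0;
        Map<String, Integer> seen = new HashMap<>();
        for (int b = 0; b < offsets.length; b++) {
            int[] block = Arrays.copyOfRange(flat, b * blockSize, (b + 1) * blockSize);
            String key = Arrays.toString(block);
            Integer start = seen.get(key);
            if (start == null) {
                start = used;
                System.arraycopy(block, 0, blocks, used, blockSize);
                used += blockSize;
                seen.put(key, start);
            }
            offsets[b] = start;
        }
        return new int[][] { offsets, Arrays.copyOf(blocks, used) };
    }

    int get(int c) {
        int b1 = index1[c >> S1];
        int b2 = index2[b1 + ((c >> S2) & M2)];
        return data[b2 + (c & M3)];
    }
}
```

Smaller data blocks repeat more often, so the data array shrinks further than in the single-index case, at the cost of one extra lookup per read.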
Inversion List Compaction of set of codepoints Advantages –Simple –Very compact –Faster write than trie –Very fast boolean operations Disadvantages –Slower read than trie or hashtable
Inversion List Structure Structure –Index (optional) –List of codepoints in ascending order Example Set [ , 0135, 19A3-201B ] [Diagram: the set stored as six ascending range boundaries at indices 0 to 5, the last four being 0135, 0136, 19A3, 201C; boundaries at even indices start a range ("in"), boundaries at odd indices end it ("out")]
Inversion List Example Find smallest i such that c < data[i] –If no i, i = length Then c ∈ List ⇔ odd(i) Examples: –In: 0023, 0135 –Out: 001A, 0136, A [Diagram: the same six-entry list, indices 0 to 5, alternating in/out]
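The membership rule above (find the smallest i with c < data[i], then test whether i is odd) can be sketched with an ordinary binary search. The class name is hypothetical, and the usage example assumes a first range of 0020-0024, since the slide's first range did not survive; the other boundaries come from the slide's example set.

```java
final class InversionList {
    // Ascending boundaries: a range starts at each even index and
    // ends (exclusive) at the following odd index.
    final int[] data;

    InversionList(int... boundaries) {
        data = boundaries;
    }

    boolean contains(int c) {
        int lo = 0, hi = data.length;
        while (lo < hi) {                  // find smallest i with c < data[i]
            int mid = (lo + hi) >>> 1;
            if (c < data[mid]) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return (lo & 1) == 1;              // if no such i, lo == data.length
    }
}
```

For example, `new InversionList(0x20, 0x25, 0x135, 0x136, 0x19A3, 0x201C)` represents [0020-0024, 0135, 19A3-201B]: the single codepoint 0135 costs two entries, and so does the long range 19A3-201B.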
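Inversion List Operations Fast Boolean Operations Example: Negation [Diagram: the six-entry list (indices 0 to 5) before negation, and the seven-entry list (indices 0 to 6) after negation, where a new boundary has been inserted at index 0 and every original boundary has moved up by one]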
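Negation is cheap because it only toggles whether the list starts with a boundary at 0; every other boundary keeps its value and just changes parity. A minimal sketch under that reading of the slide, with hypothetical names:

```java
import java.util.Arrays;

final class InversionListNegation {
    // Complements the set: drop a leading 0 if present, otherwise prepend one.
    static int[] negate(int[] list) {
        if (list.length > 0 && list[0] == 0) {
            return Arrays.copyOfRange(list, 1, list.length);
        }
        int[] result = new int[list.length + 1];   // result[0] is already 0
        System.arraycopy(list, 0, result, 1, list.length);
        return result;
    }

    // Same membership rule as for any inversion list: odd index means "in".
    static boolean contains(int[] list, int c) {
        int lo = 0, hi = list.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < list[mid]) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return (lo & 1) == 1;
    }
}
```

An odd-length result simply means the last range is open-ended, which the "if no i, i = length" rule already handles.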
Inversion List: Binary Search from Programming Pearls Completely unrolled, precalculated parameters
int index = startIndex;
if (x >= data[auxStart]) {
    index += auxStart;
}
switch (power) {
    case 21: if (x < data[t = index-0x10000]) index = t;
    case 20: if (x < data[t = index-0x8000]) index = t;
    …
Inversion Map Inversion List plus Associated Values –Lookup index just as in Inversion List –Take corresponding value [Diagram: the inversion list (indices 0 to 5) above an array of associated values (indices 0 to 6), one value for each interval between boundaries]
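An inversion map can be sketched as two parallel arrays: the boundaries, and one value per interval (so one more value than boundaries). The class name and the ASCII letter-case sample data are made up for illustration.

```java
final class InversionMap {
    final int[] boundaries;   // ascending interval boundaries
    final String[] values;    // values[i] applies when the lookup index is i

    InversionMap(int[] boundaries, String[] values) {
        if (values.length != boundaries.length + 1) {
            throw new IllegalArgumentException("need one value per interval");
        }
        this.boundaries = boundaries;
        this.values = values;
    }

    String get(int c) {
        int lo = 0, hi = boundaries.length;
        while (lo < hi) {                       // same search as an inversion list
            int mid = (lo + hi) >>> 1;
            if (c < boundaries[mid]) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return values[lo];                      // take the corresponding value
    }

    // Hypothetical sample: classify ASCII letters by case.
    static InversionMap asciiCase() {
        return new InversionMap(
            new int[] {0x41, 0x5B, 0x61, 0x7B},
            new String[] {"other", "upper", "other", "lower", "other"});
    }
}
```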
Key → String Value Problem –Often almost all values are 1 codepoint –But, must map to strings in a few cases –Don't want overhead for strings always Solution –Exception values indicate extra processing –Can use same solution for UTF-16 code units
Example Get a character ch Find its value v If v is in [D800..E000], may be string –check v2 = valueException[v - D800] –if v2 not null, process it, continue Process v
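The exception-value steps above might look like this in Java. This is a sketch under assumptions: a 256-entry array stands in for the real lookup structure, the mapping is a made-up uppercase table, and the range is treated as half-open (D800 up to but not including E000).

```java
final class ExceptionValueMap {
    static final int EXC_START = 0xD800, EXC_LIMIT = 0xE000;
    final char[] values = new char[256];          // stand-in for a trie lookup
    final String[] valueException = { "SS" };     // string values, by exception index

    ExceptionValueMap() {
        for (int c = 0; c < values.length; c++) {
            values[c] = Character.toUpperCase((char) c);
        }
        values[0xDF] = (char) (EXC_START + 0);    // sharp s maps to exception 0
    }

    String map(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            char v = ch < values.length ? values[ch] : ch;
            if (v >= EXC_START && v < EXC_LIMIT) {            // may be a string
                int e = v - EXC_START;
                String v2 = e < valueException.length ? valueException[e] : null;
                if (v2 != null) {
                    out.append(v2);                           // process it, continue
                    continue;
                }
            }
            out.append(v);                                    // ordinary single value
        }
        return out.toString();
    }
}
```

Only the rare string-valued entries pay for a second lookup; the common single-codepoint case stays a plain array read.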
String Key → Value Problem –Often almost all keys are 1 codepoint –Must have string keys in a few cases –Don't want overhead for strings always Solution –Exception values indicate possible follow-on codepoints –Can use same solution for UTF-16 code units –Use key closure!
Closure If (X + Y) is a key, then X is a key Before: s → x, sh → y, shch → z, c → w After: s → x, sh → y, shc → yw, shch → z, c → w
Why Closure? [Diagram: matching the input "shcha…" one codepoint at a time: s → x, sh → y, shc → yw, shch → z; "shcha" is not found, so use the last match]
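The point of closure is that a matcher can extend its candidate key one codepoint at a time and stop at the first miss. The sketch below uses the slide's sample keys; the class, the method names and the use of a `HashMap` are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class ClosedKeyLookup {
    // Extends the candidate key one char at a time; on the first miss,
    // the last key found is the answer ("not found, use last").
    static String longestKey(Map<String, String> table, String text, int start) {
        String last = null;
        for (int end = start + 1; end <= text.length(); end++) {
            String candidate = text.substring(start, end);
            if (!table.containsKey(candidate)) {
                break;
            }
            last = candidate;
        }
        return last;
    }

    // Keys before closure: "shch" is a key but its prefix "shc" is not.
    static Map<String, String> before() {
        Map<String, String> t = new HashMap<>();
        t.put("s", "x");
        t.put("sh", "y");
        t.put("shch", "z");
        t.put("c", "w");
        return t;
    }

    // Keys after closure: the missing prefix "shc" is added with value "yw".
    static Map<String, String> after() {
        Map<String, String> t = before();
        t.put("shc", "yw");
        return t;
    }
}
```

On the input "shcha", the unclosed table stops at "sh" because "shc" is missing, so the longer key "shch" is never reached; the closed table walks through "shc" and finds "shch".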
Bitpacking Squeeze information into value Example: Character Properties –category: 5 bits –bidi: 4 bits (+ exceptions) –canonical category: 6 bits + expansion
compressCanon = (bits >> SHIFT) & MASK;
canon = expansionArray[compressCanon];
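A small sketch of the packing scheme, using the field widths from the slide (5, 4 and 6 bits). The field order, the names and the contents of the expansion table are assumptions; the table holds a few sample combining-class values only.

```java
final class PackedProperties {
    static final int CATEGORY_MASK = (1 << 5) - 1;   // 5 bits
    static final int BIDI_MASK = (1 << 4) - 1;       // 4 bits
    static final int CANON_MASK = (1 << 6) - 1;      // 6 bits
    static final int BIDI_SHIFT = 5;                 // bidi sits above category
    static final int CANON_SHIFT = 9;                // canon sits above bidi

    // Hypothetical expansion table: compressed index -> full canonical class.
    static final int[] EXPANSION = {0, 1, 7, 8, 9, 202, 216, 220, 230, 240};

    static int pack(int category, int bidi, int compressedCanon) {
        return (category & CATEGORY_MASK)
            | ((bidi & BIDI_MASK) << BIDI_SHIFT)
            | ((compressedCanon & CANON_MASK) << CANON_SHIFT);
    }

    static int category(int bits) {
        return bits & CATEGORY_MASK;
    }

    static int bidi(int bits) {
        return (bits >> BIDI_SHIFT) & BIDI_MASK;
    }

    static int canon(int bits) {
        int compressCanon = (bits >> CANON_SHIFT) & CANON_MASK;
        return EXPANSION[compressCanon];             // expand to the real value
    }
}
```

All three properties fit in 15 bits, so one 16-bit trie value per codepoint is enough; the expansion array is what lets values as large as 240 fit in a 6-bit field.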
Statetables Classic: –entry = stateTable[ state, ch ]; –state = entry.state; –doSomethingWith( entry.action ); –until (state < 0);
Statetables Unicode: –type = trie[ch]; –entry = stateTable[ state, type ]; –state = entry.state; –doSomethingWith( entry.action ); –until (state < 0); Also, String Key Value
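The Unicode variant keeps the state table small by indexing it with a character type instead of the character itself. Here is a minimal sketch, assuming a made-up task (matching a decimal number at the start of a string) and using `Character.getType` as a stand-in for the `trie[ch]` lookup; all names are hypothetical.

```java
final class TypeStateTable {
    static final int DIGIT = 0, DOT = 1, OTHER = 2;   // character types
    static final int STOP = -1;

    // Rows are states, columns are character types.
    static final int[][] STATE_TABLE = {
        /* 0 start    */ { 1, STOP, STOP },
        /* 1 integer  */ { 1, 2, STOP },
        /* 2 saw dot  */ { 3, STOP, STOP },
        /* 3 fraction */ { 3, STOP, STOP },
    };
    static final boolean[] ACCEPTING = { false, true, false, true };

    // Stand-in for type = trie[ch]: any Nd codepoint counts as a digit.
    static int typeOf(int ch) {
        if (Character.getType(ch) == Character.DECIMAL_DIGIT_NUMBER) {
            return DIGIT;
        }
        return ch == '.' ? DOT : OTHER;
    }

    // Returns the length (in code units) of the longest number prefix.
    static int matchNumber(String text) {
        int state = 0, accepted = 0, i = 0;
        while (i < text.length()) {
            int ch = text.codePointAt(i);
            state = STATE_TABLE[state][typeOf(ch)];
            if (state < 0) {
                break;                                // until (state < 0)
            }
            i += Character.charCount(ch);
            if (ACCEPTING[state]) {
                accepted = i;                         // the "action" in this sketch
            }
        }
        return accepted;
    }
}
```

The table has three columns regardless of how many codepoints are digits, which is the benefit of classifying through a property lookup first.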
Sample Data Structures: ICU Trie: CompactArray –Customized for each datatype –Automatic expansion –Compact after setting Character Properties –use CompactArray, Bitpacking Inversion List: UnicodeSet –Boolean Operations
Sample Usage #1: ICU Collation –Trie lookup –Expanding character: String Key Value –Contracting character: Key String Value Break Iterators –For grapheme, word, line, sentence break –Statetable
Sample Usage #2: ICU Transliteration –Requires Mapping codepoints in context to others Rearranging codepoints Controlling the choice of mapping –Character Properties –Inversion List –Exception values
Sample Usage #3: ICU Character Conversion –From Unicode to bytes Trie –From bytes to Unicode Arrays for simple maps Statetables for complex maps –recognizes valid / invalid mappings –provides compaction Complications –Invalid vs. Valid mapped vs. Valid unmapped –Fallbacks
References Unicode Open Source ICU – –ICU4j: Java API –ICU4c: C and C++ APIs Other references: see Mark's website: –
Q & A