Bits of Unicode: Data structures for a large character set. Mark Davis, IBM Emerging Technologies.

1 Bits of Unicode: Data structures for a large character set. Mark Davis, IBM Emerging Technologies

2 Caution “Characters” are ambiguous; sometimes they mean: –Graphemes: x̣ (also ch, …) –Code points: 0078 0323 –Code units: 0078 0323 (or UTF-8: 78 CC A3) For programmers –Unicode associates codepoints (or sequences of codepoints) with properties –See UTR#17

3 The Problem Programs often have to do lookups –Look up properties by codepoint –Map codepoints to values –Test codepoints for inclusion in set e.g. value == true/false Easy with 256 codepoints: just use array

4 Size Matters Not so easy with Unicode! Unicode 3.0 –subset (except PUA) –up to FFFF (hex) = 65,535 (decimal) Unicode 3.1 –full range –up to 10FFFF (hex) = 1,114,111 (decimal)

5 Array Lookup With ASCII: Simple, Fast, Compact –codepoint → bit: 32 bytes –codepoint → short: ½ K With Unicode: Simple, Fast, Huge (esp. v3.1) –codepoint → bit: 136 K –codepoint → short: 2.2 M
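The sizes on this slide follow from simple arithmetic, sketched below. The class and method names are illustrative only, and the "ASCII" case is taken to mean a 256-entry table, as the 32-byte and ½ K figures imply.

```java
// Back-of-the-envelope sizes for a flat lookup array (illustrative names).
final class ArraySizes {
    /** Bytes needed for one bit per code point. */
    static long bitArrayBytes(long codePoints) {
        return (codePoints + 7) / 8;
    }

    /** Bytes needed for one 16-bit short per code point. */
    static long shortArrayBytes(long codePoints) {
        return codePoints * 2;
    }

    public static void main(String[] args) {
        long small = 256;             // one-byte character set
        long unicode = 0x10FFFF + 1;  // 1,114,112 code points
        System.out.println(bitArrayBytes(small));      // 32
        System.out.println(shortArrayBytes(small));    // 512
        System.out.println(bitArrayBytes(unicode));    // 139264, i.e. 136 * 1024
        System.out.println(shortArrayBytes(unicode));  // 2228224, about 2.2 million
    }
}
```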

6 Further complications Mappings, tests, properties often must be for sequences of codepoints. –Human languages don't just use single codepoints. –“ch” in Spanish, Slovak; etc.

7 First step: Avoidance Properties from libraries often suffice –Test for (Character.getType(c) == Nd) instead of long list of codepoints Easier Automatically updated with new versions Data structures from libraries often suffice –Java Hashtable –ICU (Java or C++) CompactArray –JavaScript properties Consult
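As a small illustration of the "avoidance" advice, the property test on this slide can be written against the standard Java library. `Nd` on the slide is the Unicode general category; in Java the corresponding constant is `Character.DECIMAL_DIGIT_NUMBER`. The class name `DigitCheck` is made up for this sketch.

```java
final class DigitCheck {
    /** True if cp has general category Nd (decimal digit number). */
    static boolean isDecimalDigit(int cp) {
        return Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        System.out.println(isDecimalDigit('7'));     // ASCII digit
        System.out.println(isDecimalDigit(0x0663));  // ARABIC-INDIC DIGIT THREE
        System.out.println(isDecimalDigit('x'));     // a letter
    }
}
```

The library's tables are updated with each Unicode version, so no hand-maintained list of digit codepoints is needed.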

8 Data structures: criteria Speed –Read (static) –Write (dynamic) –Startup Memory footprint –Ram –Disk Multi-threading

9 Hashtables Advantages –Easy to use out-of-the-box –Reasonably fast –General Disadvantages –High overhead –Discrete (no range lookup) –Much slower than array lookup

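10 Overhead [Diagram: memory layout of a hashtable entry for a two-character key (char1, char2): hash, key, value, and next fields, with per-object overhead on the entry, the key, and each character.]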

11 Trie Advantages –Nearly as fast as array lookup –Much smaller than arrays or Hashtables –Take advantage of repetition Disadvantages –Not suited for rapidly changing data –Best for static, preformed data

12 Trie structure [Diagram: the codepoint is split into bit fields M1 and M2; the Index array points into the Data array.]

13 Trie code 5 Operations –Shift, Lookup, Mask, Add, Lookup v = data[index[c >> S1] + (c & M2)] [Diagram: codepoint fields M1 and M2, with S1 the shift that isolates M1.]
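A minimal sketch of a single-index trie follows. It assumes a block size of 256 codepoints (`S1 = 8`) and builds the trie from a flat array by storing each distinct block once. The class name, the byte-valued data, and the builder are assumptions made for illustration, not ICU's implementation. Only `get` corresponds directly to the expression on the slide.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SingleIndexTrie {
    static final int S1 = 8;              // assumed: 256 code points per block
    static final int M2 = (1 << S1) - 1;  // mask for the offset inside a block

    final int[] index;  // block number -> start offset in data
    final byte[] data;  // distinct blocks, concatenated

    /** Builds from a flat array whose length is a multiple of the block size. */
    SingleIndexTrie(byte[] flat) {
        int blockSize = 1 << S1;
        index = new int[flat.length / blockSize];
        byte[] out = new byte[flat.length];
        int used = 0;
        Map<ByteBuffer, Integer> seen = new HashMap<>();
        for (int b = 0; b < index.length; b++) {
            // ByteBuffer compares by content, so equal blocks share one key.
            ByteBuffer key = ByteBuffer.wrap(flat, b * blockSize, blockSize);
            Integer start = seen.get(key);
            if (start == null) {
                start = used;
                System.arraycopy(flat, b * blockSize, out, used, blockSize);
                used += blockSize;
                seen.put(key, start);
            }
            index[b] = start;
        }
        data = Arrays.copyOf(out, used);
    }

    /** Shift, lookup, mask, add, lookup. */
    byte get(int c) {
        return data[index[c >> S1] + (c & M2)];
    }
}
```

With only a few non-zero values in a 0x110000-entry flat array, almost every block is the shared all-zero block, which is the repetition the trie takes advantage of.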

14 Trie: double indexed Double, for more compaction: –Slightly slower than single index –Smaller chunks of data, so more compaction

15 Trie: double indexed [Diagram: the codepoint is split into three bit fields; Index1 points into Index2, which points into Data.]

16 Trie code: double indexed b1 = index1[c >> S1]; b2 = index2[b1 + ((c >> S2) & M2)]; v = data[b2 + (c & M3)]; [Diagram: codepoint bit fields, with shifts S1 and S2 marking the field boundaries.]
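The double-indexed lookup can be sketched the same way. The field widths (`S1 = 10`, `S2 = 4`), the int-valued data, and the `compress` helper are assumptions chosen for illustration. The builder deduplicates blocks twice: once for the data and once for the second-level index. `get` mirrors the three lines on the slide.

```java
import java.nio.IntBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class DoubleIndexTrie {
    static final int S1 = 10, S2 = 4;            // assumed field boundaries
    static final int M2 = (1 << (S1 - S2)) - 1;  // middle bits
    static final int M3 = (1 << S2) - 1;         // low bits

    final int[] index1, index2, data;

    /** Builds from a flat array whose length is a multiple of 1 << S1. */
    DoubleIndexTrie(int[] flat) {
        int[][] low = compress(flat, 1 << S2);            // {offsets into data, data}
        int[][] high = compress(low[0], 1 << (S1 - S2));  // {index1, index2}
        index1 = high[0];
        index2 = high[1];
        data = low[1];
    }

    /** Splits flat into blocks and stores each distinct block once; returns {index, blocks}. */
    static int[][] compress(int[] flat, int blockSize) {
        int[] index = new int[flat.length / blockSize];
        int[] out = new int[flat.length];
        int used = 0;
        Map<IntBuffer, Integer> seen = new HashMap<>();
        for (int b = 0; b < index.length; b++) {
            IntBuffer key = IntBuffer.wrap(flat, b * blockSize, blockSize);
            Integer start = seen.get(key);
            if (start == null) {
                start = used;
                System.arraycopy(flat, b * blockSize, out, used, blockSize);
                used += blockSize;
                seen.put(key, start);
            }
            index[b] = start;
        }
        return new int[][] { index, Arrays.copyOf(out, used) };
    }

    int get(int c) {
        int b1 = index1[c >> S1];
        int b2 = index2[b1 + ((c >> S2) & M2)];
        return data[b2 + (c & M3)];
    }
}
```

The smaller data blocks (16 entries here) are more likely to repeat, at the cost of one extra array access per lookup.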

17 Inversion List Compaction of set of codepoints Advantages –Simple –Very compact –Faster write than trie –Very fast boolean operations Disadvantages –Slower read than trie or hashtable

18 Inversion List Structure –Index (optional) –List of codepoints in ascending order Example Set [ …, 0135, 19A3-201B ] Index 0: … (in) 1: … (out) 2: 0135 (in) 3: 0136 (out) 4: 19A3 (in) 5: 201C (out)

19 Inversion List Example Find smallest i such that c < data[i] –If no i, i = length Then c ∈ List ⇔ odd(i) Examples: –In: 0023, 0135 –Out: 001A, 0136, A… [Table: the same list as on the previous slide, entries …, …, 0135, 0136, 19A3, 201C at indices 0-5, alternating in/out.]
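The lookup rule can be sketched in a few lines. The class name is made up, and the search is an ordinary binary search rather than the unrolled one shown later in the deck.

```java
final class InversionList {
    private final int[] data;  // ascending boundaries: in, out, in, out, ...

    InversionList(int... boundaries) {
        data = boundaries;
    }

    /** Smallest i such that c < data[i], or data.length if there is none. */
    int findIndex(int c) {
        int lo = 0, hi = data.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < data[mid]) hi = mid; else lo = mid + 1;
        }
        return lo;
    }

    /** c is in the set exactly when the index found is odd. */
    boolean contains(int c) {
        return (findIndex(c) & 1) == 1;
    }
}
```

For instance, `new InversionList(0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C)` represents a set that ends in 0135 and 19A3-201B like the slide's. Its first range, 0020-0061, is assumed here only to make the example complete, though it agrees with the slide's examples: 0023 and 0135 are in, 001A and 0136 are out.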

20 Inversion List Operations Fast Boolean Operations Example: Negation [Diagram: the list at indices 0-5 next to its negation at indices 0-6; every boundary shifts by one position, so in and out swap.]
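Negation is cheap because only the parity of the positions matters. A minimal sketch, assuming the list is a plain sorted `int[]` of boundaries and the universe starts at codepoint 0 (the class and method names are illustrative):

```java
import java.util.Arrays;

final class InversionListOps {
    /** Complement of an inversion list: toggle a boundary at code point 0. */
    static int[] negate(int[] list) {
        if (list.length > 0 && list[0] == 0) {
            // The set started at 0, so its complement does not: drop the leading 0.
            return Arrays.copyOfRange(list, 1, list.length);
        }
        int[] out = new int[list.length + 1];  // out[0] is 0: the complement starts at 0
        System.arraycopy(list, 0, out, 1, list.length);
        return out;
    }
}
```

Every other boundary is copied unchanged. Each one simply moves from an odd position to an even one or the reverse, which swaps "in" and "out".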

21 Inversion List: Binary Search from Programming Pearls Completely unrolled, precalculated parameters int index = startIndex; if (x >= data[auxStart]) { index += auxStart; } switch (power) { case 21: if (x < data[t = index-0x10000]) index = t; case 20: if (x < data[t = index-0x8000]) index = t; …
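For comparison, the same power-of-two stepping can be written as a loop. This is a sketch of the general technique, not the unrolled code from the slide. It tracks how many entries are known to be less than or equal to `x`, which equals the smallest `i` with `x < data[i]`. The class and method names are assumptions.

```java
final class PowerOfTwoSearch {
    /** Smallest i such that x < data[i], or data.length if none; data must be ascending. */
    static int findIndex(int[] data, int x) {
        int pos = 0;  // number of leading entries known to be <= x
        for (int step = Integer.highestOneBit(data.length); step > 0; step >>= 1) {
            int probe = pos + step;
            if (probe <= data.length && data[probe - 1] <= x) {
                pos = probe;
            }
        }
        return pos;
    }
}
```

Unrolling this loop into a `switch` with fall-through, as on the slide, removes the loop overhead and turns each step size into a constant.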

22 Inversion Map Inversion List plus Associated Values –Lookup index just as in Inversion List –Take corresponding value [Diagram: the inversion list at indices 0-5 paired with a value array at indices 0-6.]
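An inversion map can be sketched by pairing the boundary list with a value array that has one more entry than there are boundaries. The class name and the use of `String` values are assumptions for illustration.

```java
final class InversionMap {
    private final int[] boundaries;  // ascending code points where the value changes
    private final String[] values;   // values[i] applies below boundaries[i]; the last one applies above all

    InversionMap(int[] boundaries, String[] values) {
        if (values.length != boundaries.length + 1) {
            throw new IllegalArgumentException("need one more value than boundaries");
        }
        this.boundaries = boundaries;
        this.values = values;
    }

    /** Finds the index exactly as in an inversion list, then takes the value at that index. */
    String get(int c) {
        int lo = 0, hi = boundaries.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < boundaries[mid]) hi = mid; else lo = mid + 1;
        }
        return values[lo];
    }
}
```

For example, boundaries `{0x30, 0x3A, 0x41, 0x5B}` with values `{"other", "digit", "other", "upper", "other"}` classify ASCII digits and uppercase letters with four stored codepoints.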

23 Key → String Value Problem –Often almost all values are 1 codepoint –But, must map to strings in a few cases –Don't want overhead for strings always Solution –Exception values indicate extra processing –Can use same solution for UTF-16 code units

24 Example Get a character ch Find its value v If v is in [D800..E000], may be string –check v2 = valueException[v - D800] –if v2 not null, process it, continue Process v
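The steps above can be sketched as follows. To stay short, the sketch uses a flat array limited to keys U+0000..U+00FF where a real table would use a trie, and it reads the slide's range as half-open, D800 up to but not including E000. It also assumes ordinary values never fall in that range. All names other than `valueException` are made up.

```java
final class ExceptionValueMap {
    static final int EX_START = 0xD800;  // exception range from the slide, read as [D800, E000)
    static final int EX_LIMIT = 0xE000;

    // Flat array for brevity; keys are limited to U+0000..U+00FF in this sketch.
    private final char[] values = new char[256];
    private final String[] valueException = new String[EX_LIMIT - EX_START];
    private int nextSlot = 0;

    ExceptionValueMap() {
        for (int i = 0; i < values.length; i++) values[i] = (char) i;  // identity by default
    }

    /** Common case: the value is a single code unit stored directly. */
    void put(char key, char value) {
        values[key] = value;
    }

    /** Rare case: store the string on the side and point at it with an exception value. */
    void put(char key, String value) {
        valueException[nextSlot] = value;
        values[key] = (char) (EX_START + nextSlot);
        nextSlot++;
    }

    String get(char ch) {
        char v = values[ch];
        if (v >= EX_START && v < EX_LIMIT) {
            String v2 = valueException[v - EX_START];
            if (v2 != null) return v2;  // exceptional: a whole string
        }
        return String.valueOf(v);       // ordinary single-unit value
    }
}
```

An uppercasing table, for example, could store `'a' → 'A'` directly and `'ß' → "SS"` through the exception array, so only the rare string-valued entries pay for a `String`.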

25 String Key → Value Problem –Often almost all keys are 1 codepoint –Must have string keys in a few cases –Don't want overhead for strings always Solution –Exception values indicate possible follow-on codepoints –Can use same solution for UTF-16 code units –Use key closure!

26 Closure If (X + Y) is a key, then X is a key. Before: s → x, sh → y, shch → z, c → w. After: s → x, sh → y, shc → yw, shch → z, c → w.

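27 Why Closure? [Diagram: matching the input “shcha…” one unit at a time against the closed keys s → x, sh → y, shc → yw, shch → z; when the next extension is not found, use the last value found.]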
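A sketch of why closure helps: because every prefix of a key is itself a key, the matcher can extend the candidate one unit at a time and stop at the first miss, keeping the last value it saw. The table below is the closed key set from the Closure slide. Using a `HashMap` of strings is an assumption made to keep the example short, and the class name is made up.

```java
import java.util.HashMap;
import java.util.Map;

final class ClosedKeyMatcher {
    private final Map<String, String> table = new HashMap<>();

    ClosedKeyMatcher() {
        table.put("s", "x");
        table.put("sh", "y");
        table.put("shc", "yw");  // added by closure: the values of "sh" and "c" joined
        table.put("shch", "z");
        table.put("c", "w");
    }

    /** Replaces the longest key found at each position; unmatched units pass through. */
    String transform(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int end = i;
            String last = null;
            // Extend one unit at a time; closure means the first miss ends the search.
            while (end < text.length() && table.containsKey(text.substring(i, end + 1))) {
                end++;
                last = table.get(text.substring(i, end));
            }
            if (last == null) {
                out.append(text.charAt(i));  // no key starts here
                i++;
            } else {
                out.append(last);            // not found further: use the last value
                i = end;
            }
        }
        return out.toString();
    }
}
```

Without the `shc` entry, the matcher would have to try longer keys after a miss at `shc`, or risk never finding `shch`.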

28 Bitpacking Squeeze information into value Example: Character Properties –category: 5 bits –bidi: 4 bits (+ exceptions) –canonical category: 6 bits + expansion compressCanon = (bits >> SHIFT) & MASK; canon = expansionArray[compressCanon];
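A minimal sketch of the packing, using the bit widths from the slide. The field order, the shift values, and the contents of the expansion array are assumptions for illustration. The array holds a small illustrative subset of canonical combining class values, not a complete table.

```java
final class PackedProps {
    static final int CAT_SHIFT = 0,   CAT_MASK = 0x1F;    // category: 5 bits
    static final int BIDI_SHIFT = 5,  BIDI_MASK = 0x0F;   // bidi: 4 bits
    static final int CANON_SHIFT = 9, CANON_MASK = 0x3F;  // compressed canonical class: 6 bits

    // Maps the 6-bit compressed index back to the real value (illustrative subset).
    static final int[] EXPANSION = {0, 1, 7, 8, 9, 202, 216, 220, 230};

    static int pack(int category, int bidi, int canonIndex) {
        return (category & CAT_MASK) << CAT_SHIFT
             | (bidi & BIDI_MASK) << BIDI_SHIFT
             | (canonIndex & CANON_MASK) << CANON_SHIFT;
    }

    static int category(int bits) {
        return (bits >> CAT_SHIFT) & CAT_MASK;
    }

    static int bidi(int bits) {
        return (bits >> BIDI_SHIFT) & BIDI_MASK;
    }

    static int canon(int bits) {
        int compressCanon = (bits >> CANON_SHIFT) & CANON_MASK;
        return EXPANSION[compressCanon];
    }
}
```

The three fields use 15 bits in total, so all of them fit in a single 16-bit trie value.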

29 Statetables Classic: –entry = stateTable[ state, ch ]; –state = entry.state; –doSomethingWith( entry.action ); –until (state < 0);

30 Statetables Unicode: –type = trie[ch]; –entry = stateTable[ state, type ]; –state = entry.state; –doSomethingWith( entry.action ); –until (state < 0); Also, String Key → Value
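The Unicode variant can be sketched with a tiny state machine that measures an identifier-like run (a letter followed by letters or digits). The three character classes, the table, and the use of `Character.isLetter` and `Character.isDigit` as a stand-in for the `trie[ch]` lookup are all assumptions for illustration. The entry here holds only the next state, with no separate action.

```java
final class TypeStateTable {
    static final int OTHER = 0, LETTER = 1, DIGIT = 2;  // assumed character classes
    static final int STOP = -1;

    // STATE_TABLE[state][type] = next state; a negative state ends the scan.
    static final int[][] STATE_TABLE = {
        /* 0: start   */ { STOP, 1, STOP },
        /* 1: in word */ { STOP, 1, 1 },
    };

    /** Stand-in for type = trie[ch]: maps a code point to a small class number. */
    static int type(int cp) {
        if (Character.isLetter(cp)) return LETTER;
        if (Character.isDigit(cp)) return DIGIT;
        return OTHER;
    }

    /** Length in UTF-16 units of the identifier at the start of text (0 if none). */
    static int identifierLength(String text) {
        int state = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            state = STATE_TABLE[state][type(cp)];
            if (state < 0) break;
            i += Character.charCount(cp);
        }
        return i;
    }
}
```

Indexing the state table by class rather than by codepoint keeps each row at three columns instead of over a million.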

31 Sample Data Structures: ICU Trie: CompactArray –Customized for each datatype –Automatic expansion –Compact after setting Character Properties –use CompactArray, Bitpacking Inversion List: UnicodeSet –Boolean Operations

32 Sample Usage #1: ICU Collation –Trie lookup –Expanding character: String Key → Value –Contracting character: Key → String Value Break Iterators –For grapheme, word, line, sentence break –Statetable

33 Sample Usage #2: ICU Transliteration –Requires Mapping codepoints in context to others Rearranging codepoints Controlling the choice of mapping –Character Properties –Inversion List –Exception values

34 Sample Usage #3: ICU Character Conversion –From Unicode to bytes Trie –From bytes to Unicode Arrays for simple maps Statetables for complex maps –recognizes valid / invalid mappings –provides compaction Complications –Invalid vs. Valid mapped vs. Valid unmapped –Fallbacks

35 References Unicode Open Source ICU –http://oss.software.ibm.com/icu –ICU4j: Java API –ICU4c: C and C++ APIs Other references: see Mark's website: –http://www.macchiato.com

36 Q & A

