Data Structures and Algorithms Course slides: Radix Search, Radix sort, Bucket sort, Huffman compression.

1 Data Structures and Algorithms Course slides: Radix Search, Radix sort, Bucket sort, Huffman compression

2 Lecture 10: Searching Radix Searching  For many applications, keys can be thought of as numbers  Searching methods that take advantage of digital properties of these keys are called radix searches  Radix searches treat keys as numbers in base M (the radix) and work with individual digits

3 Lecture 10: Searching Radix Searching  Provide reasonable worst-case performance without the complication of balanced trees.  Provide a way to handle variable-length keys.  Biased data can lead to degenerate data structures with bad performance.

4 Lecture 10: Searching The Simplest Radix Search  Digital Search Trees — like BSTs but branch according to the key’s bits.  Key comparison replaced by function that accesses the key’s next bit.

5 Lecture 10: Searching Digital Search Example  Keys and their 5-bit codes: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000 (tree diagram not reproduced in this transcript)

6 Digital Search Trees  Consider BST search for key K  For each node T in the tree we have 4 possible results: 1) T is empty (or a sentinel node), indicating the item is not found; 2) K matches T.key and the item is found; 3) K < T.key and we go to the left child; 4) K > T.key and we go to the right child  Consider now the same basic technique, but proceeding left or right based on the current bit within the key

7 Digital Search Trees  Call this tree a Digital Search Tree (DST)  DST search for key K  For each node T in the tree we have 4 possible results: 1) T is empty (or a sentinel node), indicating the item is not found; 2) K matches T.key and the item is found; 3) the current bit of K is a 0 and we go to the left child; 4) the current bit of K is a 1 and we go to the right child  Look at example on board
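
The branching rule above can be sketched in Python. This is a minimal sketch, not code from the slides; it assumes fixed 5-bit keys so it can replay the A/S/E/R/C/H example, and branches on the current bit while still comparing the full key at every node:

```python
BITS = 5  # assumed fixed key width, matching the slide example

def bit(key, i):
    """Return the i-th most significant bit of a BITS-bit key."""
    return (key >> (BITS - 1 - i)) & 1

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def dst_insert(root, key, depth=0):
    if root is None:
        return Node(key)
    if key == root.key:
        return root                      # duplicate: ignore
    if bit(key, depth) == 0:             # branch on current bit, not on <
        root.left = dst_insert(root.left, key, depth + 1)
    else:
        root.right = dst_insert(root.right, key, depth + 1)
    return root

def dst_search(root, key, depth=0):
    if root is None:
        return False                     # empty subtree: item not found
    if key == root.key:
        return True                      # full-key comparison at each node
    child = root.left if bit(key, depth) == 0 else root.right
    return dst_search(child, key, depth + 1)

root = None
for k in [0b00001, 0b10011, 0b00101, 0b10010, 0b00011, 0b01000]:  # A S E R C H
    root = dst_insert(root, k)
print(dst_search(root, 0b10010))  # R -> True
print(dst_search(root, 0b11111))  # never inserted -> False
```

Note that the shape of the tree depends on insertion order, not on key order, which is why random bits tend to keep it balanced.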

8 Digital Search Trees  Run-times?  Given N random keys, the height of a DST should average O(log₂ N)  Think of it this way: if the keys are random, at each branch it should be equally likely that a key will have a 0 bit or a 1 bit  Thus the tree should be well balanced  In the worst case, we are bound by the number of bits in the key (say it is b)  So in a sense we can say that this tree has a constant run-time, if the number of bits in the key is a constant  This is an improvement over the BST

9 Digital Search Trees  But DSTs have drawbacks  Bitwise operations are not always easy  Some languages do not provide for them at all, and for others it is costly  Handling duplicates is problematic  Where would we put a duplicate object?  Follow bits to new position?  Will work but Find will always find first one  Actually this problem exists with BST as well  Could have nodes store a collection of objects rather than a single object

10 Digital Search Trees  Similar problem with keys of different lengths  What if a key is a prefix of another key that is already present?  Data is not sorted  If we want sorted data, we would need to extract all of the data from the tree and sort it  May do b comparisons (of entire key) to find a key  If a key is long and comparisons are costly, this can be inefficient

11 Lecture 10: Searching Digital Search  Requires O(log N) comparisons on average  Requires b comparisons in the worst case for a tree built with N random b-bit keys

12 Lecture 10: Searching Digital Search  Problem: At each node we make a full key comparison — this may be expensive, e.g. very long keys  Solution: store keys only at the leaves, use radix expansion to do intermediate key comparisons

13 Lecture 10: Searching Radix Tries  Used for retrieval (the name "trie" comes from reTRIEval)  Internal nodes are used for branching; external nodes are used for the final key comparison and to store data

14 Lecture 10: Searching Radix Trie Example  Keys and their 5-bit codes: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000 (tree diagram not reproduced in this transcript)

15 Lecture 10: Searching Radix Tries  Left subtree has all keys which have 0 for the leading bit, right subtree has all keys which have 1 for the leading bit  An insert or search requires O(log N) bit comparisons in the average case, and b bit comparisons in the worst case
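
The leaf/internal split described above can be sketched in Python. This is a hypothetical sketch (not from the slides), again assuming fixed 5-bit keys: internal nodes only branch, keys live in external (leaf) nodes, and inserting a key that collides with a leaf pushes that leaf down until the two keys diverge:

```python
BITS = 5  # assumed fixed key width, matching the slide example

def bit(key, i):
    return (key >> (BITS - 1 - i)) & 1

class Internal:
    def __init__(self):
        self.left = None
        self.right = None

class Leaf:
    def __init__(self, key):
        self.key = key                   # keys stored only at leaves

def trie_insert(node, key, depth=0):
    if node is None:
        return Leaf(key)
    if isinstance(node, Leaf):
        if node.key == key:
            return node                  # already present
        new = Internal()                 # split: push existing leaf down,
        if bit(node.key, depth) == 0:    # then retry the insert here
            new.left = node
        else:
            new.right = node
        return trie_insert(new, key, depth)
    if bit(key, depth) == 0:
        node.left = trie_insert(node.left, key, depth + 1)
    else:
        node.right = trie_insert(node.right, key, depth + 1)
    return node

def trie_search(node, key, depth=0):
    if node is None:
        return False
    if isinstance(node, Leaf):
        return node.key == key           # the single full-key comparison
    child = node.left if bit(key, depth) == 0 else node.right
    return trie_search(child, key, depth + 1)

root = None
for k in [0b00001, 0b10011, 0b00101, 0b10010, 0b00011, 0b01000]:  # A S E R C H
    root = trie_insert(root, k)
print(trie_search(root, 0b10010))  # R -> True
```

Inserting R (10010) next to S (10011) forces a long chain of one-way internal nodes, which is exactly the waste the next slide complains about.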

16 Lecture 10: Searching Radix Tries  Problem: lots of extra nodes for keys that differ only in low-order bits (see the R and S nodes in the example above)  This is addressed by Patricia trees, which allow "lookahead" to the next relevant bit  Practical Algorithm To Retrieve Information Coded In Alphanumeric (Patricia)  In the slides that follow, the entire alphabet would be included in the indexes

17 Radix Search Tries  Benefit of simple Radix Search Tries  Fewer comparisons of entire key than DSTs  Drawbacks  The tree will have more overall nodes than a DST  Each external node with a key needs a unique bit-path to it  Internal and External nodes are of different types  Insert is somewhat more complicated  Some insert situations require new internal as well as external nodes to be created  We need to create new internal nodes to ensure that each object has a unique path to it  See example

18 Radix Search Tries  Run-time is similar to DST  Since the tree is binary, average tree height for N keys is O(log₂ N)  However, paths for nodes with many bits in common will tend to be longer  Worst case path length is again b  However, now at worst b bit comparisons are required  We only need one comparison of the entire key  So, again, the benefit of the RST is that the entire key must be compared only one time

19 Improving Tries  How can we improve tries?  Can we reduce the heights somehow?  Average height now is O(log₂ N)  Can we simplify the data structures needed (so different node types are not required)?  Can we simplify the Insert?  We will examine a couple of variations that improve over the basic Trie

20 Bucket-Sort and Radix-Sort: Bucket-Sort  Let S be a sequence of n (key, element) entries with keys in the range [0, N − 1]  Bucket-sort uses the keys as indices into an auxiliary array B of sequences (buckets)  Phase 1: Empty sequence S by moving each entry (k, o) into its bucket B[k]  Phase 2: For i = 0, …, N − 1, move the entries of bucket B[i] to the end of sequence S  Analysis:  Phase 1 takes O(n) time  Phase 2 takes O(n + N) time  Bucket-sort takes O(n + N) time

Algorithm bucketSort(S, N)
    Input: sequence S of (key, element) items with keys in the range [0, N − 1]
    Output: sequence S sorted by increasing keys
    B ← array of N empty sequences
    while ¬S.isEmpty()
        f ← S.first()
        (k, o) ← S.remove(f)
        B[k].insertLast((k, o))
    for i ← 0 to N − 1
        while ¬B[i].isEmpty()
            f ← B[i].first()
            (k, o) ← B[i].remove(f)
            S.insertLast((k, o))
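
A direct Python rendering of the bucketSort pseudocode above (a sketch using lists in place of the sequence ADT):

```python
def bucket_sort(S, N):
    """Stable sort of (key, element) pairs with integer keys in [0, N-1]."""
    B = [[] for _ in range(N)]       # N empty buckets
    for k, o in S:                   # Phase 1: distribute into buckets
        B[k].append((k, o))
    out = []
    for i in range(N):               # Phase 2: collect in key order
        out.extend(B[i])
    return out

pairs = [(7, 'd'), (1, 'c'), (3, 'a'), (7, 'g'), (3, 'b'), (7, 'e')]
print(bucket_sort(pairs, 10))
# [(1, 'c'), (3, 'a'), (3, 'b'), (7, 'd'), (7, 'g'), (7, 'e')]
```

Because each bucket is appended to in input order, ties keep their relative order, which is the stability property the later slides rely on.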

21 Bucket Sort  Each element of the array is put in one of the N "buckets"

22 Bucket Sort  Now, pull the elements from the buckets back into the array  At last, the array is sorted (in a stable way)

23 Example  Sorting a sequence of 4-bit integers, one bit per pass, least significant bit first:
    Input:       1001 0010 1101 0001 1110
    After bit 0: 0010 1110 1001 1101 0001
    After bit 1: 1001 1101 0001 0010 1110
    After bit 2: 1001 0001 0010 1101 1110
    After bit 3: 0001 0010 1001 1101 1110

24 Example  Key range [0, 9]; input sequence: (7, d), (1, c), (3, a), (7, g), (3, b), (7, e)
    Phase 1 (distribute into buckets B[0..9]): B[1] = (1, c); B[3] = (3, a), (3, b); B[7] = (7, d), (7, g), (7, e)
    Phase 2 (collect in index order): (1, c), (3, a), (3, b), (7, d), (7, g), (7, e)

25 Properties and Extensions  Key-type Property  The keys are used as indices into an array and cannot be arbitrary objects  No external comparator  Stable Sort Property  The relative order of any two items with the same key is preserved after the execution of the algorithm  Extensions  Integer keys in the range [a, b]  Put entry (k, o) into bucket B[k − a]  String keys from a set D of possible strings, where D has constant size (e.g., names of the 50 U.S. states)  Sort D and compute the rank r(k) of each string k of D in the sorted sequence  Put entry (k, o) into bucket B[r(k)]

26 Lexicographic Order  A d-tuple is a sequence of d keys (k1, k2, …, kd), where key ki is said to be the i-th dimension of the tuple  Example:  The Cartesian coordinates of a point in space are a 3-tuple  The lexicographic order of two d-tuples is recursively defined as follows:  (x1, x2, …, xd) < (y1, y2, …, yd) ⇔ x1 < y1 ∨ (x1 = y1 ∧ (x2, …, xd) < (y2, …, yd))  I.e., the tuples are compared by the first dimension, then by the second dimension, etc.

27 Lexicographic-Sort  Let Ci be the comparator that compares two tuples by their i-th dimension  Let stableSort(S, C) be a stable sorting algorithm that uses comparator C  Lexicographic-sort sorts a sequence of d-tuples in lexicographic order by executing algorithm stableSort d times, once per dimension  Lexicographic-sort runs in O(d T(n)) time, where T(n) is the running time of stableSort

Algorithm lexicographicSort(S)
    Input: sequence S of d-tuples
    Output: sequence S sorted in lexicographic order
    for i ← d downto 1
        stableSort(S, Ci)

Example (sort by the 3rd, then 2nd, then 1st dimension):
    Input:    (7,4,6) (5,1,5) (2,4,6) (2,1,4) (3,2,4)
    After C3: (2,1,4) (3,2,4) (5,1,5) (7,4,6) (2,4,6)
    After C2: (2,1,4) (5,1,5) (3,2,4) (7,4,6) (2,4,6)
    After C1: (2,1,4) (2,4,6) (3,2,4) (5,1,5) (7,4,6)
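
The same loop can be sketched in Python, with the built-in stable sorted() standing in for stableSort:

```python
def lexicographic_sort(S, d):
    """Sort d-tuples lexicographically via d stable sorts, last dim first."""
    for i in reversed(range(d)):             # dimension d down to 1
        S = sorted(S, key=lambda t: t[i])    # sorted() is guaranteed stable
    return S

tuples = [(7, 4, 6), (5, 1, 5), (2, 4, 6), (2, 1, 4), (3, 2, 4)]
print(lexicographic_sort(tuples, 3))
# [(2, 1, 4), (2, 4, 6), (3, 2, 4), (5, 1, 5), (7, 4, 6)]
```

Stability is what makes the earlier passes survive the later ones: after the final pass on dimension 1, ties in dimension 1 are still ordered by dimensions 2, …, d.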

28 Radix-Sort  Radix-sort is a specialization of lexicographic-sort that uses bucket-sort as the stable sorting algorithm in each dimension  Radix-sort is applicable to tuples where the keys in each dimension i are integers in the range [0, N − 1]  Radix-sort runs in O(d(n + N)) time

Algorithm radixSort(S, N)
    Input: sequence S of d-tuples such that (0, …, 0) ≤ (x1, …, xd) ≤ (N − 1, …, N − 1) for each tuple (x1, …, xd) in S
    Output: sequence S sorted in lexicographic order
    for i ← d downto 1
        bucketSort(S, N)

29 Radix-Sort for Binary Numbers  Consider a sequence of n b-bit integers x = x_{b−1} … x1 x0  We represent each element as a b-tuple of integers in the range [0, 1] and apply radix-sort with N = 2  This application of the radix-sort algorithm runs in O(bn) time  For example, we can sort a sequence of 32-bit integers in linear time

Algorithm binaryRadixSort(S)
    Input: sequence S of b-bit integers
    Output: sequence S sorted
    replace each element x of S with the item (0, x)
    for i ← 0 to b − 1
        replace the key k of each item (k, x) of S with bit xi of x
        bucketSort(S, 2)
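
A compact Python sketch of binaryRadixSort: each pass is a stable two-bucket sort on one bit, least significant first:

```python
def binary_radix_sort(S, b):
    """Sort b-bit non-negative integers: one stable 2-bucket pass per bit."""
    for i in range(b):                        # bit 0 (LSB) up to bit b-1
        zeros = [x for x in S if not (x >> i) & 1]
        ones  = [x for x in S if (x >> i) & 1]
        S = zeros + ones                      # stable: order kept per bucket
    return S

print(binary_radix_sort([0b1001, 0b0010, 0b1101, 0b0001, 0b1110], 4))
# [1, 2, 9, 13, 14]
```

These are exactly the passes shown in the 4-bit example on slide 23.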

30 Does it Work for Real Numbers?  What if keys are not integers?  Assumption: input is n reals from [0, 1)  Basic idea:  Create N linked lists (buckets) to divide the interval [0, 1) into subintervals of size 1/N  Add each input element to its appropriate bucket and sort the buckets with insertion sort  Uniform input distribution gives O(1) expected bucket size  Therefore the expected total time is O(n)  Distribution of keys in buckets similar with …?
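
The idea above, sketched in Python with N = n buckets of width 1/n (sorted() stands in here for the per-bucket insertion sort):

```python
def bucket_sort_reals(A):
    """Sort n reals from [0, 1): n buckets of width 1/n, then sort each."""
    n = len(A)
    B = [[] for _ in range(n)]
    for x in A:
        B[int(n * x)].append(x)          # x in [0, 1) => index in [0, n-1]
    out = []
    for bucket in B:
        out.extend(sorted(bucket))       # stands in for insertion sort
    return out

print(bucket_sort_reals([0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21]))
# [0.17, 0.21, 0.26, 0.39, 0.72, 0.78, 0.94]
```

With uniformly distributed keys each bucket holds O(1) elements in expectation, so the per-bucket sorts cost O(n) overall.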

31 Radix Sort  What sort will we use to sort on digits?  Bucket sort is a good choice:  Sort n numbers on digits that range from 1..N  Time: O(n + N)  Each pass over n numbers with d digits takes time O(n + k), so the total time is O(dn + dk)  When d is constant and k = O(n), this takes O(n) time

32 Radix Sort Example  Problem: sort 1 million 64-bit numbers  Treat them as four-digit radix-2^16 numbers  We can sort in just four passes with radix sort!  Running time: 4(1 million + 2^16) ≈ 4 million operations  Compare with a typical O(n lg n) comparison sort  Requires approx lg n ≈ 20 operations per number being sorted  Total running time ≈ 20 million operations
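
The four-pass idea can be sketched directly (a toy rendering, not tuned code): sort 64-bit non-negative integers as four radix-2^16 digits, least significant first, with a stable bucket pass per digit:

```python
RADIX_BITS = 16
MASK = (1 << RADIX_BITS) - 1

def radix_sort_64(S):
    """Sort 64-bit non-negative integers in four stable radix-2^16 passes."""
    for p in range(4):                           # digit 0 (LSB) .. digit 3
        shift = p * RADIX_BITS
        B = [[] for _ in range(1 << RADIX_BITS)] # 2^16 buckets
        for x in S:
            B[(x >> shift) & MASK].append(x)
        S = [x for bucket in B for x in bucket]  # stable collect
    return S

nums = [2**63 + 5, 17, 2**32, 0, 2**16 - 1]
print(radix_sort_64(nums) == sorted(nums))  # True
```

Each pass touches all n numbers plus all 2^16 buckets, which is where the 4(n + 2^16) count on the slide comes from.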

33 Radix Sort  In general, radix sort based on bucket sort is  Asymptotically fast (i.e., O(n))  Simple to code  A good choice  Can radix sort be used on floating-point numbers?

34 Summary: Radix Sort  Radix sort:  Assumption: input has d digits ranging from 0 to k  Basic idea:  Sort elements by digit, starting with the least significant  Use a stable sort (like bucket sort) for each stage  Each pass over n numbers with 1 digit takes time O(n + k), so total time is O(dn + dk)  When d is constant and k = O(n), takes O(n) time  Fast, Stable, Simple  Doesn't sort in place

35 Multiway Tries  The RST we have seen considers the key 1 bit at a time  This causes a maximum height in the tree of up to b, and gives an average height of O(log₂ N) for N keys  If we considered m bits at a time, then we could reduce the worst and average heights  Maximum height is now b/m since m bits are consumed at each level  Let M = 2^m  Average height for N keys is now O(log_M N), since we branch in M directions at each node

36 Multiway Tries  Let's look at an example  Consider 2^20 (1 meg) keys of length 32 bits  Simple RST will have  Worst Case height = 32  Ave Case height = O(log₂ 2^20) ≈ 20  Multiway Trie using 8 bits would have  Worst Case height = 32/8 = 4  Ave Case height = O(log₂₅₆ 2^20) ≈ 2.5  This is a considerable improvement  Let's look at an example using character data  We will consider a single character (8 bits) at each level  Go over on board
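
The height arithmetic on this slide can be checked numerically (a quick sketch using the slide's assumed values: 2^20 keys, 32-bit keys, 8 bits per level):

```python
import math

keys = 2**20          # 1 meg keys
b, m = 32, 8          # key length in bits, bits consumed per level
M = 2**m              # branching factor, 256

print(b)                      # simple RST worst-case height: 32
print(math.log2(keys))        # RST average height: 20.0
print(b // m)                 # multiway trie worst-case height: 4
print(math.log(keys, M))      # multiway average height: ~2.5
```

The general pattern: consuming m bits per level divides both the worst-case height (b becomes b/m) and the average height (log₂ N becomes log₂ N / m).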

37 Multiway Tries  So what is the catch (or cost)?  Memory  Multiway Tries use considerably more memory than simple tries  Each node in the multiway trie contains M pointers/references  In example with ASCII characters, M = 256  Many of these are unused, especially  During common paths (prefixes), where there is no branching (or "one-way" branching)  Ex: through and throughout  At the lower levels of the tree, where previous branching has likely separated keys already

38 Patricia Trees  Idea:  Save memory and height by eliminating all nodes in which no branching occurs  See example on board  Note now that since some nodes are missing, level i does not necessarily correspond to bit (or character) i  So to do a search we need to store in each node which bit (character) the node corresponds to  However, the savings from the removed nodes is still considerable

39 Patricia Trees  Also, keep in mind that a key can match at every character that is checked, but still not be actually in the tree  Example for tree on board:  If we search for TWEEDLE, we will only compare the T**E**E  However, the next node after the E is at index 8. This is past the end of TWEEDLE so it is not found  Run-time?  Similar to those of RST and Multiway Trie, depending on how many bits are used per node

40 Patricia Trees  So Patricia trees  Reduce tree height by removing "one-way" branching nodes  The text also shows how "upwards" links enable us to use only one node type  The text's version makes the nodes homogeneous by storing keys within the nodes and using "upwards" links from the leaves to access the nodes  So every node contains a valid key. However, the keys are not checked on the way "down" the tree; only after an upwards link is followed  Thus Patricia saves memory but makes the insert rather tricky, since new nodes may have to be inserted between other nodes  See text
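
The search side of this idea can be sketched in Python. This is a hypothetical sketch (the on-board examples are not reproduced in the transcript) using the two-node-type layout of the later PATRICIA slides: internal nodes carry a check-bit index, leaves carry the full key, and the one full comparison happens only at the leaf. The tree below is hand-built for the keys '010', '011', '101' used on slide 41:

```python
class Internal:
    def __init__(self, bit_index, left, right):
        self.bit_index = bit_index   # which bit to test at this node
        self.left = left
        self.right = right

class Leaf:
    def __init__(self, key):
        self.key = key               # full key, e.g. '010'

def patricia_search(node, key):
    while isinstance(node, Internal):
        # Skipped bits are never examined on the way down
        node = node.left if key[node.bit_index] == '0' else node.right
    return node.key == key           # single full-key comparison

# Keys '010' and '011' share bits 0-1, so bit 1 is skipped:
# bit 0 splits 0xx from 1xx; among 0xx, bit 2 splits 010 from 011.
tree = Internal(0,
                Internal(2, Leaf('010'), Leaf('011')),
                Leaf('101'))
print(patricia_search(tree, '011'))  # True
print(patricia_search(tree, '001'))  # False (differs at a skipped bit)
```

The '001' case illustrates the slide's warning: every checked bit matches, yet the final full comparison at the leaf fails.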

41 PATRICIA TREE  A particular type of "trie"  Example: a trie and a PATRICIA tree containing the keys '010', '011', and '101'.

42 PATRICIA TREE  Therefore, a PATRICIA tree will have the following attributes in its internal nodes:  Index bit (check bit)  Child pointers (each internal node must contain exactly 2 children)  On the other hand, leaf nodes store the actual content for the final comparison

43 SISTRING  Sistring is the short form of 'Semi-Infinite String'  A string, whatever it actually represents, is a binary bit pattern (e.g. 11001)  One of the sistrings in the above example is 11001000…  There are 5 sistrings in total in this example

44 SISTRING  Sistrings are theoretically of infinite length  110010000…  10010000…  0010000…  010000…  10000…  In practice, we cannot store infinite strings. For the above example, we only need to store each sistring up to 5 bits long; that is descriptive enough to distinguish each from the others.

45 SISTRING  The bit level is too abstract; depending on the application, we rarely work at the bit level. The character level is a better idea!  e.g. CUHK  The corresponding sistrings would be  CUHK000…  UHK000…  HK000…  K000…  We require that each be at least 4 characters long.  (Why do we pad 0/NULL at the end of a sistring?)

46 SISTRING (USAGE)  SISTRINGs are efficient in storing substring information.  A string with n characters has n(n+1)/2 substrings, and the longest has size n, so storing all substrings explicitly would require O(n^3) space.  e.g. 'CUHK' is 4 characters long, which gives 4(5)/2 = 10 different substrings: C, U, …, CU, UH, …, CUH, UHK, CUHK.  Storage requirement: O(n^2) substrings × max length n → O(n^3)

47 SISTRING (USAGE)  We may instead store the sistrings of 'CUHK', which require only O(n^2) storage.  CUHK ← represents C, CU, CUH, CUHK at the same time  UHK0 ← represents U, UH, UHK at the same time  HK00 ← represents H, HK at the same time  K000 ← represents K only  A prefix match on the sistrings is equivalent to an exact match on the substrings.  Conclusion: sistrings are a better representation for storing substring information.
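
The character-level sistrings and the prefix-matching claim can be sketched in a few lines (a toy illustration; NUL is used as the padding character the slides call 0/NULL):

```python
def sistrings(s):
    """All character-level sistrings of s, NUL-padded to length n."""
    n = len(s)
    return [s[i:].ljust(n, '\0') for i in range(n)]

def has_substring(s, pattern):
    """Substring search as a prefix match over the n sistrings."""
    return any(sis.startswith(pattern) for sis in sistrings(s))

print(sistrings('CUHK'))           # ['CUHK', 'UHK\x00', 'HK\x00\x00', 'K\x00\x00\x00']
print(has_substring('CUHK', 'UH'))  # True
print(has_substring('CUHK', 'HU'))  # False
```

Only n padded strings of length n are stored (O(n^2)), yet every one of the n(n+1)/2 substrings is reachable as a prefix of some sistring.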

48 PAT Tree  Now it is time for the PAT Tree again  A PAT Tree is a PATRICIA tree storing every sistring of a document  What if the document simply contains 'CUHK'?  We prefer characters at this moment, but PATRICIA works on bits; therefore, we have to know the bit pattern of each sistring in order to see the actual shape of the resulting PAT tree  It looks frustrating even for a small example, but it is how the PAT tree works!

49 PAT Tree (Example)  By digitizing the string, we can manually visualize what the PAT Tree would look like.  Following is the actual bit pattern of the four sistrings  Once we understand how the PAT tree works, we won't detail it in later examples.

50 PAT Tree  We don't view a document as a packed string of characters. A document consists of words, e.g. "Hello. This is a simple document."  In this case, sistrings can be applied at the 'document level': the document is treated as one big string, and we may tokenize it word-by-word instead of character-by-character.

51 PAT Tree (Example)  This works! BUT…  We still need O(n^2) memory for storing those sistrings  We may reduce the memory to O(n) by making use of pointers.

52 PAT Tree (Actual Structure)  We need to maintain only the document itself  The PAT Tree acts as an index structure  Memory requirement  Document, O(n)  PAT Tree index, O(n)  Leaf pointers, O(n)  Therefore, the PAT Tree is a linear data structure that nevertheless captures the O(n^3) substring information

53 Structure modification  We can see that the node structures for internal nodes and leaf nodes are not the same  The tree will be more flexible if its nodes are generic (have a universal node structure)  Trade-off: a generic node structure will enlarge the individual node size  But…  Memory is cheap now  Even a low-end computer can support hundreds of MB of RAM  The modified tree is still an O(n) structure

54 Structure of the modified node  1. Check Bit  2. Frequency Count  3. Link to a sistring  4. Pointers to the child nodes

55 Conclusion  The PAT tree is an O(n) data structure for document indexing  The PAT tree is good for solving substring matching problems  The Chinese PAT tree has sistrings at sentence level; a frequency count is introduced to overcome the duplicate-sistrings problem  By generalizing the node structure, the modified version increases the PAT tree's capability for various applications

56 Huffman Compression  Background:  Huffman works with arbitrary bytes, but the ideas are most easily explained using character data, so we will discuss it in those terms  Consider the extended ASCII character set:  8 bits per character  A BLOCK code, since all codewords are the same length  8 bits yield 256 characters  In general, block codes give:  For K bits, 2^K characters  For N characters, ⌈log₂ N⌉ bits are required  Easy to encode and decode

57 Huffman Compression  What if we could use variable-length codewords, could we do better than ASCII?  The idea is that different characters would use different numbers of bits  If all characters have the same frequency of occurrence we cannot improve over ASCII  What if characters had different frequencies of occurrence?  Ex: In English text, letters like E, A, I, S appear much more frequently than letters like Q, Z, X  Can we somehow take advantage of these differences in our encoding?

58 Huffman Compression  First we need to make sure that variable length coding is feasible  Decoding a block code is easy: take the next 8 bits  Decoding a variable length code is not so obvious  In order to decode unambiguously, variable length codes must meet the prefix property  No codeword is a prefix of any other  See example on board showing ambiguity if PP is not met  Ok, so now how do we compress?  Let's use fewer bits for our more common characters, and more bits for our less common characters

61 Huffman Compression  Huffman Algorithm:  Assume we have K characters and that each uncompressed character has some weight associated with it (i.e. frequency)  Initialize a forest, F, to have K single-node trees in it, one tree per character, also storing the character's weight  while (|F| > 1)  Find the two trees, T1 and T2, with the smallest weights  Create a new tree, T, whose weight is the sum of T1 and T2  Remove T1 and T2 from F, and add them as left and right children of T  Add T to F
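
The forest-merging loop above can be sketched with a binary heap standing in for the forest F. This is a hypothetical rendering (the slides give no code): trees are plain (char, left, right) tuples with char set only at leaves, and the insertion counter exists only so the heap never has to compare tree nodes when weights tie:

```python
import heapq
from itertools import count

def build_huffman(freqs):
    """freqs: dict char -> weight. Returns the root tree tuple."""
    tick = count()
    F = [(w, next(tick), (c, None, None)) for c, w in freqs.items()]
    heapq.heapify(F)
    while len(F) > 1:
        w1, _, t1 = heapq.heappop(F)     # two smallest-weight trees
        w2, _, t2 = heapq.heappop(F)
        heapq.heappush(F, (w1 + w2, next(tick), (None, t1, t2)))
    return F[0][2]

def code_table(node, prefix='', table=None):
    """Walk the tree once, recording the root-to-leaf path per character."""
    table = {} if table is None else table
    char, left, right = node
    if char is not None:
        table[char] = prefix or '0'      # lone-character edge case
    else:
        code_table(left, prefix + '0', table)
        code_table(right, prefix + '1', table)
    return table

codes = code_table(build_huffman({'a': 45, 'b': 13, 'c': 12,
                                  'd': 16, 'e': 9, 'f': 5}))
print(codes)  # e.g. 'a' gets a 1-bit code, 'e' and 'f' get 4-bit codes
```

Exact codewords depend on tie-breaking, but the codeword lengths (here 1, 3, 3, 3, 4, 4 bits) are determined by the weights.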

62 Huffman Compression  See example on board  Huffman Issues: 1) Is the code correct? Does it satisfy the prefix property? 2) Does it give good compression? 3) How to decode? 4) How to encode? 5) How to determine weights/frequencies?

63 Huffman Compression 1) Is the code correct?  Based on the way the tree is formed, it is clear that the codewords are valid  The Prefix Property is assured, since each codeword ends at a leaf  All original nodes corresponding to the characters end up as leaves 2) Does it give good compression?  For a block code of N different characters, ⌈log₂ N⌉ bits are needed per character  Thus, for a file containing M ASCII characters, 8M bits are needed

64 Huffman Compression  Given Huffman codes {C0, C1, …, C_{N−1}} for the N characters in the alphabet, each of length |Ci|  Given frequencies {F0, F1, …, F_{N−1}} in the file  Where the sum of all frequencies = M  The total bits required for the file is:  Σ (i = 0 to N−1) of |Ci| · Fi  Overall total bits depends on differences in frequencies  The more extreme the differences, the better the compression  If frequencies are all the same, no compression  See example from board

65 Huffman Compression 3) How to decode?  This is fairly straightforward, given that we have the Huffman tree available:

start at root of tree and first bit of file
while not at end of file
    if current bit is a 0, go left in tree
    else go right in tree        // bit is a 1
    if we are at a leaf
        output character
        go to root
    read next bit of file

 Each character is a path from the root to a leaf  If we are not at the root when end of file is reached, there was an error in the file
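
The decoding loop renders directly in Python. This sketch assumes a hypothetical tuple-based tree layout, (char, left, right) with char set only at leaves, which is not specified by the slides:

```python
def huffman_decode(bits, root):
    """Walk the tree per bit; emit a character and restart at each leaf."""
    out, node = [], root
    for b in bits:
        node = node[1] if b == '0' else node[2]   # go left on 0, right on 1
        if node[0] is not None:                   # at a leaf
            out.append(node[0])
            node = root
    return ''.join(out)

# Tiny hand-built tree: 'a' = 0, 'b' = 10, 'c' = 11
tree = (None, ('a', None, None),
              (None, ('b', None, None), ('c', None, None)))
print(huffman_decode('010110', tree))  # 'abca'
```

A real decoder would also verify it finishes back at the root, matching the slide's end-of-file error check.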

66 Huffman Compression 4) How to encode?  This is trickier, since we are starting with characters and outputting codewords  Using the tree we would have to start at a leaf (first finding the correct leaf), then move up to the root, finally reversing the resulting bit pattern  Instead, let's process the tree once (using a traversal) to build an encoding TABLE  Demonstrate inorder traversal on board

67 Huffman Compression 5) How to determine weights/frequencies?  2-pass algorithm  Process the original file once to count the frequencies, then build the tree/code and process the file again, this time compressing  Ensures that each Huffman tree will be optimal for each file  However, to decode, the tree/frequency information must be stored in the file  Likely at the front of the file, so the decompressor first reads the tree info, then uses that to decompress the rest of the file  Adds extra space to the file, reducing overall compression quality

68 Huffman Compression  Overhead especially reduces quality for smaller files, since the tree/frequency info may add a significant percentage to the file size  Thus larger files have a higher potential for compression with Huffman than do smaller ones  However, just because a file is large does NOT mean it will compress well  The most important factor in the compression remains the relative frequencies of the characters  Using a static Huffman tree  Process a lot of "sample" files, and build a single tree that will be used for all files  Saves overhead of tree information, but generally is NOT a very good approach

69 Huffman Compression  There are many different file types that have very different frequency characteristics  Ex: a .cpp file vs. a .txt file containing an English essay  The .cpp file will have many ;, {, }, (, )  The .txt file will have many a, e, i, o, u, ., etc.  A tree that works well for one file may work poorly for another (perhaps even expanding it)  Adaptive single-pass algorithm  Builds the tree as it is encoding the file, thereby not requiring tree information to be separately stored  Processes the file only one time  We will not look at the details of this algorithm, but the LZW algorithm and the self-organizing search algorithm we will discuss next are also adaptive

70 Huffman Shortcomings  What is Huffman missing?  Although OPTIMAL for single-character (word) compression, Huffman does not take into account patterns / repeated sequences in a file  Ex: A file with 1000 As followed by 1000 Bs, etc. for every ASCII character will not compress AT ALL with Huffman  Yet it seems like this file should be compressible  We can use run-length encoding in this case (see text)  However run-length encoding is very specific and not generally effective for most files (since they do not typically have long runs of each character)
