# Data Structures and Algorithms


Radix Searching

- For many applications, keys can be thought of as numbers.
- Searching methods that take advantage of the digital properties of these keys are called radix searches.
- Radix searches treat keys as numbers in base M (the radix) and work with individual digits.

Lecture 10: Searching

Radix Searching

- Provide reasonable worst-case performance without the complication of balanced trees.
- Provide a way to handle variable-length keys.
- Biased data can lead to degenerate data structures with bad performance.

Digital Search Trees

Like BSTs, but branch according to the key's bits: key comparison is replaced by a function that accesses the key's next bit.

Digital Search Example

Digital Search Trees

Consider BST search for key K. For each node T in the tree there are four possible outcomes:

- T is empty (or a sentinel node), indicating the item is not found
- K matches T.key, and the item is found
- K < T.key, and we go to the left child
- K > T.key, and we go to the right child

Now consider the same basic technique, but proceeding left or right based on the current bit within the key.

Digital Search Trees

Call this tree a Digital Search Tree (DST). A DST search for key K has four possible outcomes at each node T:

- T is empty (or a sentinel node), indicating the item is not found
- K matches T.key, and the item is found
- The current bit of K is 0, and we go to the left child
- The current bit of K is 1, and we go to the right child

Look at example on board.

Digital Search Trees

Run-times? Given N random keys, the height of a DST averages O(log2 N):

- If the keys are random, at each branch it is equally likely that a key has a 0 bit or a 1 bit, so the tree should be well balanced.
- In the worst case, we are bounded by the number of bits in the key (say b).
- So in a sense we can say this tree has constant run-time, if the number of bits in the key is a constant. This is an improvement over the BST.

Digital Search Trees

But DSTs have drawbacks:

- Bitwise operations are not always easy. Some languages do not provide them at all, and in others they are costly.
- Handling duplicates is problematic. Where would we put a duplicate object? Following the bits to a new position will work, but Find will always find the first one. (Actually this problem exists with BSTs as well.) Nodes could instead store a collection of objects rather than a single object.

Digital Search Trees

- A similar problem arises with keys of different lengths: what if a key is a prefix of another key that is already present?
- Data is not sorted. If we want sorted data, we would need to extract all of the data from the tree and sort it.
- We may do b comparisons (of the entire key) to find a key. If a key is long and comparisons are costly, this can be inefficient.

Digital Search

- Requires O(log N) comparisons on average.
- Requires b comparisons in the worst case for a tree built with N random b-bit keys.

Digital Search

Problem: at each node we make a full key comparison; this may be expensive, e.g. for very long keys.

Solution: store keys only at the leaves, and use radix expansion to do the intermediate comparisons.

Radix Tries

- Used for Retrieval [sic].
- Internal nodes are used for branching; external nodes are used for the final key comparison and to store the data.

Radix Trie Example

Keys and their 5-bit patterns: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000.

Radix Tries

- The left subtree holds all keys with a leading 0 bit; the right subtree holds all keys with a leading 1 bit.
- An insert or search requires O(log N) bit comparisons in the average case, and b bit comparisons in the worst case.

Radix Tries

Problem: lots of extra nodes for keys that differ only in low-order bits (see the R and S nodes in the example above). This is addressed by Patricia trees, which allow "lookahead" to the next relevant bit.

Patricia: Practical Algorithm To Retrieve Information Coded In Alphanumeric. In the slides that follow, the entire alphabet would be included in the indexes.

Radix Search Tries

- Fewer comparisons of the entire key than DSTs.
- Drawbacks:
  - The tree will have more nodes overall than a DST: each external node with a key needs a unique bit-path to it.
  - Internal and external nodes are of different types.
  - Insert is somewhat more complicated: some insert situations require new internal as well as external nodes to be created, so that each object has a unique path to it. See example.

Radix Search Tries

Run-time is similar to the DST:

- Since the tree is binary, average tree height for N keys is O(log2 N). However, paths for keys with many bits in common will tend to be longer.
- Worst-case path length is again b, so at worst b bit comparisons are required, but only one comparison of the entire key.
- So, again, the benefit of an RST is that the entire key must be compared only once.

Improving Tries

How can we improve tries?

- Can we reduce the heights somehow? The average height is now O(log2 N).
- Can we simplify the data structures needed (so that different node types are not required)?
- Can we simplify the insert?

We will examine a couple of variations that improve over the basic trie.

Bucket-Sort

Let S be a sequence of n (key, element) entries with keys in the range [0, N - 1]. Bucket-sort uses the keys as indices into an auxiliary array B of sequences (buckets).

- Phase 1: Empty sequence S by moving each entry (k, o) into its bucket B[k].
- Phase 2: For i = 0, ..., N - 1, move the entries of bucket B[i] to the end of sequence S.

Analysis: Phase 1 takes O(n) time; Phase 2 takes O(n + N) time; bucket-sort takes O(n + N) time.

    Algorithm bucketSort(S, N)
      Input: sequence S of (key, element) items with keys in the range [0, N - 1]
      Output: sequence S sorted by increasing keys
      B ← array of N empty sequences
      while ¬S.isEmpty()
        f ← S.first()
        (k, o) ← S.remove(f)
        B[k].insertLast((k, o))
      for i ← 0 to N - 1
        while ¬B[i].isEmpty()
          f ← B[i].first()
          (k, o) ← B[i].remove(f)
          S.insertLast((k, o))

Bucket-Sort and Radix-Sort
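The two phases of bucketSort translate almost line-for-line into Python; this is an illustrative rendering (the function name is my own), using lists for both the sequence and the buckets.

```python
def bucket_sort(S, N):
    """Stable-sort a list S of (key, element) pairs with keys in [0, N-1]."""
    B = [[] for _ in range(N)]     # Phase 1: distribute entries into buckets
    for k, o in S:
        B[k].append((k, o))
    S.clear()                      # Phase 2: collect buckets in key order
    for bucket in B:
        S.extend(bucket)
    return S
```

Because each bucket preserves arrival order, equal keys keep their relative order: the stable-sort property the later slides rely on.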

Bucket Sort

Each element of the array is put in one of the N "buckets".

Bucket Sort

Now, pull the elements from the buckets back into the array. At last, the array is sorted (in a stable way).

Example

Sorting a sequence of 4-bit integers: 1001, 0010, 1101, 0001, 1110.

Example

Key range [0, 9]; input sequence: (7, d), (1, c), (3, a), (7, g), (3, b), (7, e).

- Phase 1 distributes the entries into buckets: B[1] = (1, c); B[3] = (3, a), (3, b); B[7] = (7, d), (7, g), (7, e).
- Phase 2 collects them in order: (1, c), (3, a), (3, b), (7, d), (7, g), (7, e).

Properties and Extensions

- Key-type property: the keys are used as indices into an array and cannot be arbitrary objects. No external comparator.
- Stable-sort property: the relative order of any two items with the same key is preserved after execution of the algorithm.

Extensions:

- Integer keys in the range [a, b]: put entry (k, o) into bucket B[k - a].
- String keys from a set D of possible strings, where D has constant size (e.g., names of the 50 U.S. states): sort D and compute the rank r(k) of each string k of D in the sorted sequence, then put entry (k, o) into bucket B[r(k)].

Lexicographic Order

A d-tuple is a sequence of d keys (k1, k2, ..., kd), where key ki is said to be the i-th dimension of the tuple. Example: the Cartesian coordinates of a point in space form a 3-tuple.

The lexicographic order of two d-tuples is recursively defined as follows:

    (x1, x2, ..., xd) < (y1, y2, ..., yd)
      ⇔  x1 < y1  ∨  (x1 = y1  ∧  (x2, ..., xd) < (y2, ..., yd))

I.e., the tuples are compared by the first dimension, then by the second dimension, etc.

Lexicographic-Sort

    Algorithm lexicographicSort(S)
      Input: sequence S of d-tuples
      Output: sequence S sorted in lexicographic order
      for i ← d downto 1
        stableSort(S, Ci)

Let Ci be the comparator that compares two tuples by their i-th dimension, and let stableSort(S, C) be a stable sorting algorithm that uses comparator C. Lexicographic-sort sorts a sequence of d-tuples in lexicographic order by executing stableSort d times, once per dimension. It runs in O(d T(n)) time, where T(n) is the running time of stableSort.

Example:

    (7,4,6) (5,1,5) (2,4,6) (2,1,4) (3,2,4)
    (2,1,4) (3,2,4) (5,1,5) (7,4,6) (2,4,6)    [after sorting on dimension 3]
    (2,1,4) (5,1,5) (3,2,4) (7,4,6) (2,4,6)    [after sorting on dimension 2]
    (2,1,4) (2,4,6) (3,2,4) (5,1,5) (7,4,6)    [after sorting on dimension 1]
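The loop above can be sketched directly in Python, since the built-in sorted() is guaranteed stable and its key argument plays the role of the comparator Ci (the function name is my own):

```python
def lexicographic_sort(S):
    """Sort a list S of d-tuples lexicographically using d stable sorts."""
    d = len(S[0])
    for i in range(d - 1, -1, -1):          # dimension d down to 1
        S = sorted(S, key=lambda t: t[i])   # stable sort on dimension i
    return S
```

Sorting from the last dimension to the first is essential: stability guarantees that ties in an earlier dimension preserve the ordering established by the later dimensions.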

Radix-Sort

Radix-sort is a specialization of lexicographic-sort that uses bucket-sort as the stable sorting algorithm in each dimension. It is applicable to tuples where the keys in each dimension i are integers in the range [0, N - 1]. Radix-sort runs in O(d(n + N)) time.

    Algorithm radixSort(S, N)
      Input: sequence S of d-tuples such that (0, ..., 0) ≤ (x1, ..., xd) ≤ (N - 1, ..., N - 1)
             for each tuple (x1, ..., xd) in S
      Output: sequence S sorted in lexicographic order
      for i ← d downto 1
        bucketSort(S, N)    { keyed on dimension i }

Radix-Sort for Binary Numbers

Consider a sequence of n b-bit integers x = x(b-1) ... x1 x0. We represent each element as a b-tuple of integers in the range [0, 1] and apply radix-sort with N = 2. This application of the radix-sort algorithm runs in O(bn) time. For example, we can sort a sequence of 32-bit integers in linear time.

    Algorithm binaryRadixSort(S)
      Input: sequence S of b-bit integers
      Output: sequence S sorted
      replace each element x of S with the item (0, x)
      for i ← 0 to b - 1
        replace the key k of each item (k, x) of S with bit xi of x
        bucketSort(S, 2)
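A compact sketch of binaryRadixSort: b passes of a two-bucket stable distribution, keyed on bit i in pass i, least significant bit first (the function name is my own):

```python
def binary_radix_sort(S, b):
    """Sort a list S of b-bit non-negative integers in O(b*n) time."""
    for i in range(b):
        buckets = [[], []]
        for x in S:
            buckets[(x >> i) & 1].append(x)   # bucket-sort on bit i
        S = buckets[0] + buckets[1]           # stable collection: 0s before 1s
    return S
```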

Does it Work for Real Numbers?

What if the keys are not integers?

- Assumption: the input is n reals from [0, 1).
- Basic idea: create N linked lists (buckets) that divide the interval [0, 1) into subintervals of size 1/N. Add each input element to the appropriate bucket and sort the buckets with insertion sort.
- A uniform input distribution gives O(1) expected bucket size, so the expected total time is O(n).
- The distribution of keys into buckets is similar to ... ?

Radix Sort

What sort will we use to sort on digits? Bucket sort is a good choice:

- Sort n numbers on digits that range over 1..N. Time: O(n + N).
- Each pass over n numbers with d digits takes O(n + k) time, so the total time is O(dn + dk).
- When d is constant and k = O(n), radix sort takes O(n) time.

Radix Sort Example

Problem: sort 1 million 64-bit numbers.

- Treat them as four-digit radix-2^16 numbers: radix sort can sort them in just four passes!
- Running time: 4 × (1 million) ≈ 4 million operations.
- Compare with a typical O(n lg n) comparison sort, which requires approximately lg n = 20 operations per number being sorted: total running time ≈ 20 million operations.
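The four-pass idea can be sketched as follows; this is an illustrative implementation (name is my own) that treats each 64-bit integer as four base-2^16 digits and bucket-sorts on each digit, least significant first:

```python
def radix_sort_64(S):
    """Sort a list of non-negative 64-bit integers in four passes."""
    MASK = 0xFFFF                              # one base-2^16 digit
    for shift in (0, 16, 32, 48):              # four passes, LSD first
        buckets = [[] for _ in range(1 << 16)]
        for x in S:
            buckets[(x >> shift) & MASK].append(x)
        S = [x for bucket in buckets for x in bucket]  # stable collection
    return S
```

Each pass allocates 2^16 buckets, so in practice a counting-based pass over a fixed array is usually preferred, but the structure is the same.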

Radix Sort

- Asymptotically fast (i.e., O(n))
- Simple to code
- A good choice

Can radix sort be used on floating-point numbers?

Radix Sort

- Assumption: the input has d digits, each ranging from 0 to k.
- Basic idea: sort elements by digit, starting with the least significant, using a stable sort (like bucket sort) at each stage.
- Each pass over n numbers with one digit takes O(n + k) time, so the total time is O(dn + dk). When d is constant and k = O(n), radix sort takes O(n) time.
- Fast, stable, simple. Doesn't sort in place.

Multiway Tries

The RST we have seen considers the key one bit at a time. This causes a maximum tree height of up to b, and gives an average height of O(log2 N) for N keys. If we considered m bits at a time, we could reduce both the worst-case and average heights:

- The maximum height is now b/m, since m bits are consumed at each level.
- Let M = 2^m. The average height for N keys is now O(logM N), since we branch in M directions at each node.

Multiway Tries

Let's look at an example. Consider 2^20 (1 meg) keys of length 32 bits.

- A simple RST will have: worst-case height = 32; average-case height = O(log2[2^20]) ≈ 20.
- A multiway trie using 8 bits would have: worst-case height = 32/8 = 4; average-case height = O(log256[2^20]) = 2.5.

This is a considerable improvement. Let's look at an example using character data, considering a single character (8 bits) at each level. Go over on board.
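A multiway trie over character data can be sketched in Python as below. This is an illustrative sketch (names are my own, not from the lecture): each node branches on one full character, and a dict per node stands in for the M = 256 child slots. A sentinel key marks where a stored key ends.

```python
END = '$'   # illustrative end-of-key marker, assumed absent from keys

def mwt_insert(root, key):
    """Insert a string key into a dict-of-dicts multiway trie."""
    node = root
    for ch in key:                     # consume one character per level
        node = node.setdefault(ch, {})
    node[END] = True                   # mark that a key ends here

def mwt_search(root, key):
    node = root
    for ch in key:
        if ch not in node:
            return False
        node = node[ch]
    return END in node                 # prefix alone is not a match
```

Using a dict rather than a fixed 256-entry array trades constant-time array indexing for hashing, but it avoids storing the many unused child pointers along one-way paths such as the shared prefix of "through" and "throughout".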

Multiway Tries

So what is the catch (or cost)? Memory. Multiway tries use considerably more memory than simple tries: each node contains M pointers/references (in the ASCII example, M = 256). Many of these are unused, especially:

- Along common paths (prefixes), where there is no branching (or "one-way" branching). Ex: through and throughout.
- At the lower levels of the tree, where previous branching has likely separated the keys already.

Patricia Trees

Idea: save memory and height by eliminating all nodes in which no branching occurs. See example on board.

- Note that since some nodes are missing, level i no longer necessarily corresponds to bit (or character) i. So to do a search, each node must store which bit (character) it corresponds to.
- However, the savings from the removed nodes is still considerable.

Patricia Trees

Also keep in mind that a key can match at every character that is checked, yet still not actually be in the tree. Example for the tree on board: if we search for TWEEDLE, we will only compare the T**E**E. However, the next node after the E is at index 8, which is past the end of TWEEDLE, so it is not found.

Run-time? Similar to those of the RST and the multiway trie, depending on how many bits are used per node.

Patricia Trees

So Patricia trees reduce tree height by removing "one-way" branching nodes. The text also shows how "upwards" links enable us to use only one node type: it makes the nodes homogeneous by storing keys within the nodes and using "upwards" links from the leaves to access them. So every node contains a valid key; however, the keys are not checked on the way "down" the tree, only after an upwards link is followed. Thus Patricia saves memory but makes the insert rather tricky, since new nodes may have to be inserted between other nodes. See text.

PATRICIA Tree

A particular type of "trie". Example: a trie and a PATRICIA tree with contents '010', '011', and '101'.

PATRICIA Tree

A PATRICIA tree therefore has the following attributes in its internal nodes:

- Index bit (check bit)
- Child pointers (each internal node must contain exactly 2 children)

Leaf nodes, on the other hand, store the actual content for the final comparison.

Sistrings

Sistring is the short form of 'semi-infinite string'. A string, whatever it actually represents, is at bottom a binary bit pattern (e.g. 01000). One of the sistrings in this example is 1000..., and there are 5 sistrings in total.

Sistrings

Sistrings are theoretically of infinite length:

    010000...
    10000...

Practically, we cannot store them as infinite strings. For the above example, we only need to store each sistring up to 5 bits long; that is descriptive enough to distinguish each from the others.

Sistrings

The bit level is too abstract; depending on the application, we rarely work at the bit level. The character level is a better idea. E.g., for CUHK the corresponding sistrings would be:

    CUHK000...
    UHK000...
    HK000...
    K000...

We require that each be at least 4 characters long. (Why do we pad 0/NULL at the end of each sistring?)

Sistrings (Usage)

Sistrings are efficient for storing substring information. A string with n characters has n(n+1)/2 substrings, the longest of size n, so the storage requirement for all substrings is O(n^2) × max(length) → O(n^3). E.g. 'CUHK' is 4 characters long and has 4(5)/2 = 10 different substrings: C, U, ..., CU, UH, ..., CUH, UHK, CUHK.

Sistrings (Usage)

We may instead store the sistrings of 'CUHK', which require only O(n^2) storage:

    CUHK <- represents C, CU, CUH, CUHK at the same time
    UHK0 <- represents U, UH, UHK at the same time
    HK00 <- represents H, HK at the same time
    K000 <- represents K only

A prefix match on the sistrings is equivalent to an exact match on the substrings. In conclusion, sistrings are a better representation for storing substring information.
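Generating the fixed-length sistrings of a string is a one-liner; this illustrative sketch (name is my own) pads with NUL so that every sistring has the same length, which also guarantees no sistring is a prefix of another:

```python
def sistrings(s):
    """Return the n character-level sistrings of s, NUL-padded to length n."""
    n = len(s)
    return [s[i:] + '\0' * i for i in range(n)]
```

Every substring of s is then a prefix of exactly one of these sistrings, which is why prefix matching on sistrings covers all substring queries.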

PAT Tree

Now it is time for the PAT tree again. A PAT tree is a PATRICIA tree that stores every sistring of a document. What if the document simply contains 'CUHK'? We prefer characters at this point, but PATRICIA works on bits; therefore we have to know the bit pattern of each sistring in order to work out the actual shape of the resulting PAT tree. It looks frustrating for even a small example, but it is how the PAT tree works!

PAT Tree (Example)

By digitizing the string, we can manually visualize what the PAT tree will look like. (The figure shows the actual bit patterns of the four sistrings.) Once we understand how the PAT tree works, we won't detail it in later examples.

PAT Tree

We don't view a document as a packed string of characters: a document consists of words, e.g. "Hello. This is a simple document." In this case, sistrings can be applied at the 'document level': the document is treated as one big string, but we tokenize it word-by-word instead of character-by-character.

PAT Tree (Example)

This works! BUT... we still need O(n^2) memory to store those sistrings. We can reduce the memory to O(n) by using pointers instead.

PAT Tree (Actual Structure)

We need to maintain only the document itself; the PAT tree acts as an index structure. Memory requirements:

- Document: O(n)
- PAT tree index: O(n)
- Leaf pointers: O(n)

Therefore the PAT tree is a linear, O(n) data structure that nevertheless captures the O(n^3) of substring information.

Structure Modification

The node structures for internal nodes and leaf nodes are not the same. The tree would be more flexible if its nodes were generic (a universal node structure). Trade-off: a generic node structure enlarges the individual node size. But memory is cheap now; even low-end computers support hundreds of MB of RAM. The modified tree is still an O(n) structure.

Structure of the Modified Node

- Check bit
- Frequency count
- Link to a sistring
- Pointers to the child nodes

Conclusion

- The PAT tree is an O(n) data structure for document indexing, and is good for solving substring-matching problems.
- The Chinese PAT tree has sistrings at the sentence level. A frequency count is introduced to overcome the duplicate-sistrings problem.
- By generalizing the node structure, the modified version increases the PAT tree's capability for various applications.

Huffman Compression

Background: Huffman works with arbitrary bytes, but the ideas are most easily explained using character data, so we will discuss it in those terms.

Consider the extended ASCII character set: 8 bits per character. This is a BLOCK code, since all codewords are the same length. 8 bits yield 256 characters. In general, block codes give:

- For K bits, 2^K characters
- For N characters, log2 N bits are required

Block codes are easy to encode and decode.

Huffman Compression

What if we could use variable-length codewords; could we do better than ASCII? The idea is that different characters would use different numbers of bits.

- If all characters have the same frequency of occurrence, we cannot improve over ASCII.
- But what if characters have different frequencies of occurrence? Ex: in English text, letters like E, A, I, S appear much more frequently than letters like Q, Z, X. Can we somehow take advantage of these differences in our encoding?

Huffman Compression

First we need to make sure that variable-length coding is feasible. Decoding a block code is easy: take the next 8 bits. Decoding a variable-length code is not so obvious. In order to decode unambiguously, variable-length codes must meet the prefix property: no codeword is a prefix of any other. (See example on board showing the ambiguity if the prefix property is not met.)

Ok, so now how do we compress? Let's use fewer bits for our more common characters, and more bits for our less common characters.


Huffman Compression

Huffman Algorithm: assume we have K characters, each with some weight (i.e. frequency) associated with it.

- Initialize a forest, F, to have K single-node trees in it, one tree per character, with each tree storing the character's weight.
- while (|F| > 1)
  - Find the two trees, T1 and T2, with the smallest weights
  - Create a new tree, T, whose weight is the sum of the weights of T1 and T2
  - Remove T1 and T2 from F, and add them as left and right children of T
  - Add T to F
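The forest-merging loop above can be sketched with a min-heap keyed on weight. This is an illustrative Python sketch (names and the tuple-based tree representation are my own): leaves are one-tuples `(char,)` and internal nodes are pairs `(left, right)`; a counter breaks weight ties so the heap never compares tree structures directly.

```python
import heapq
from itertools import count

def huffman_tree(weights):
    """Build a Huffman tree from a dict of character -> frequency.

    Returns the root; leaves are (char,) tuples, internal nodes are
    (left_subtree, right_subtree) pairs.
    """
    tick = count()                                   # tie-breaker for the heap
    F = [(w, next(tick), (ch,)) for ch, w in weights.items()]
    heapq.heapify(F)                                 # the forest F
    while len(F) > 1:
        w1, _, t1 = heapq.heappop(F)                 # two smallest-weight trees
        w2, _, t2 = heapq.heappop(F)
        heapq.heappush(F, (w1 + w2, next(tick), (t1, t2)))
    return F[0][2]
```

With a heap, each of the K - 1 merges costs O(log K), so building the tree takes O(K log K) time.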

Huffman Compression

Huffman issues (see example on board):

- Is the code correct? Does it satisfy the prefix property?
- Does it give good compression?
- How to decode?
- How to encode?
- How to determine the weights/frequencies?

Huffman Compression

Is the code correct?

- Based on the way the tree is formed, it is clear that the codewords are valid.
- The prefix property is assured, since each codeword ends at a leaf: all original nodes corresponding to the characters end up as leaves.

Does it give good compression?

- For a block code of N different characters, log2 N bits are needed per character.
- Thus for a file containing M ASCII characters, 8M bits are needed.

Huffman Compression

Given Huffman codes {C0, C1, ..., CN-1} for the N characters in the alphabet, each of length |Ci|, and given frequencies {F0, F1, ..., FN-1} in the file (where the sum of all frequencies is M), the total bits required for the file is:

    sum over i = 0 to N-1 of (|Ci| × Fi)

The overall total depends on the differences in frequencies: the more extreme the differences, the better the compression. If the frequencies are all the same, there is no compression. See example from board.

Huffman Compression

How to decode? This is fairly straightforward, given that we have the Huffman tree available:

    start at root of tree and first bit of file
    while not at end of file
      if current bit is a 0, go left in tree
      else go right in tree        // bit is a 1
      if we are at a leaf
        output character
        go to root
      read next bit of file

Each character is a path from the root to a leaf. If we are not at the root when the end of file is reached, there was an error in the file.
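The decode loop sketched in Python, assuming the illustrative tuple representation of the tree (leaves are `(char,)` one-tuples, internal nodes are `(left, right)` pairs; the function name is my own) and the input bits as a string of '0'/'1':

```python
def huffman_decode(tree, bits):
    """Walk the tree bit by bit, emitting a character at each leaf."""
    out, node = [], tree
    for b in bits:
        node = node[0] if b == '0' else node[1]   # 0 = left, 1 = right
        if len(node) == 1:                        # reached a leaf
            out.append(node[0])
            node = tree                           # go back to the root
    if node is not tree:
        raise ValueError("file error: ended mid-codeword")
    return ''.join(out)
```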

Huffman Compression

How to encode? This is trickier, since we are starting with characters and outputting codewords. Using the tree, we would have to start at a leaf (first finding the correct leaf), then move up to the root, and finally reverse the resulting bit pattern. Instead, let's process the tree once (using a traversal) to build an encoding TABLE. Demonstrate inorder traversal on board.
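Building the encoding table with one traversal can be sketched as below, again assuming the illustrative tuple representation (leaves are `(char,)` one-tuples, internal nodes are `(left, right)` pairs; the function name is my own). The root-to-leaf path, recorded as 0 for left and 1 for right, becomes each character's codeword:

```python
def code_table(tree, prefix=''):
    """Traverse the Huffman tree once, returning {char: codeword}."""
    if len(tree) == 1:                    # leaf: (char,)
        return {tree[0]: prefix or '0'}   # degenerate one-leaf tree gets '0'
    table = {}
    table.update(code_table(tree[0], prefix + '0'))   # left subtree
    table.update(code_table(tree[1], prefix + '1'))   # right subtree
    return table
```

Encoding a file is then a single table lookup per character instead of a leaf-to-root walk plus reversal.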

Huffman Compression

How to determine the weights/frequencies? Use a 2-pass algorithm:

- Process the original file once to count the frequencies, then build the tree/code and process the file again, this time compressing.
- This ensures that each Huffman tree is optimal for its file.
- However, to decode, the tree/frequency information must be stored in the file, likely at the front, so that decompression first reads the tree info and then uses it to decompress the rest of the file. This adds extra space to the file, reducing the overall compression quality.

Huffman Compression

The overhead especially reduces quality for smaller files, since the tree/frequency info may add a significant percentage to the file size. Thus larger files have a higher potential for compression with Huffman than smaller ones do. However, just because a file is large does NOT mean it will compress well: the most important factor in the compression remains the relative frequencies of the characters.

Using a static Huffman tree:

- Process a lot of "sample" files, and build a single tree that will be used for all files.
- This saves the overhead of the tree information, but generally is NOT a very good approach.