Discrete Methods in Mathematical Informatics


Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo http://researchmap.jp/sada/resources/

How to Evaluate
Evaluation is by a report on a paper explained in the course: either explain a paper in detail, or implement a data structure.
Send the report by e-mail to sada@mist.i.u-tokyo.ac.jp by July 29th, 2016.

References
R. F. Geary, N. Rahman, R. Raman, and V. Raman. A simple optimal representation for balanced parentheses. In Proc. CPM, pages 159-172, 2004.
R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2):378-407, 2005.
R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets. ACM Transactions on Algorithms (TALG), 3(4), Article No. 43, November 2007.
K. Sadakane. New Text Indexing Functionalities of the Compressed Suffix Arrays. Journal of Algorithms, 48(2):294-313, 2003.
K. Sadakane. Compressed Suffix Trees with Full Functionality. Theory of Computing Systems, 41(4):589-607, 2007.
P. Ferragina and R. Venturini. A Simple Storage Scheme for Strings Achieving Entropy Bounds. Theor. Comput. Sci. 372(1):115-121, 2007.
J. Jansson, K. Sadakane, and W.-K. Sung. Ultra-succinct representation of ordered trees with applications. J. Comput. Syst. Sci. 78(2):619-631, 2012.
K. Sadakane and R. Grossi. Squeezing Succinct Data Structures into Entropy Bounds. In Proc. ACM-SIAM SODA, pages 1230-1239, 2006.
G. Navarro and K. Sadakane. Fully-Functional Static and Dynamic Succinct Trees. ACM Transactions on Algorithms, 10(3), Article No. 16, 2014.
A. Arasu, G. Manku. Approximate Frequency Counts over Data Streams. VLDB 2002.
R. M. Karp, S. Shenker, C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems, 2003.
A. Metwally, D. Agrawal, A. E. Abbadi. Efficient computation of frequent and top-k elements in data streams. Proc. ICDT, 2005.
M. Datar, S. Muthukrishnan. Estimating rarity and similarity on data stream windows. Proc. ESA, pp. 323-334, 2002.
G. Cormode, S. Muthukrishnan, I. Rozenbaum. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. VLDB 2005.
G. Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch and its applications. J. Algorithms 55(1):58-75, 2005.
N. Shrivastava, C. Buragohain, D. Agrawal, S. Suri. New aggregation techniques for sensor networks. Proc. ACM SenSys, 2004.

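Another Representation of Succinct Ordered Trees
DFUDS (depth-first unary degree sequence) [Benoit et al. 05]
encode node degrees by unary codes in DFS (preorder) order
a node of degree d ⇒ d ('s followed by one )
add one ( at the head
2n bits in total
[Figure: an 8-node ordered tree with nodes numbered 1 to 8 in preorder (root 1 has children 2 and 6; node 2 has children 3, 4, 5; node 6 has children 7, 8) and its DFUDS U = ((()((())))(()))]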
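The encoding rule can be sketched in a few lines of Python. This is only an illustration: the function name `dfuds` and the dictionary-of-children input format are assumptions made for this sketch, not part of the slides.

```python
def dfuds(children, root):
    """Return the DFUDS string of an ordered tree.

    children: dict mapping a node to the ordered list of its children
              (hypothetical input format chosen for this sketch).
    """
    out = ["("]                      # dummy open parenthesis at the head
    stack = [root]                   # explicit stack yields preorder (DFS) order
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        out.append("(" * len(kids) + ")")   # degree d => d '(' and one ')'
        stack.extend(reversed(kids))        # so that the leftmost child is visited first
    return "".join(out)


# The 8-node example tree: 1 -> (2, 6), 2 -> (3, 4, 5), 6 -> (7, 8).
example = {1: [2, 6], 2: [3, 4, 5], 6: [7, 8]}
U = dfuds(example, 1)
```

For the example tree, `U` is a string of 2n = 16 parentheses, one `)` per node.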

Lemma: The DFUDS of an n-node ordered tree is a balanced parentheses sequence of length 2n.
Proof: If n = 1, the root has no children (degree 0), and the DFUDS is (). Assume that the claim holds for any tree with at most n−1 nodes. Consider p trees whose DFUDS are U1, U2, ..., Up. Their total number of nodes is n−1, so the total length of their DFUDS is 2n−2. Consider a tree whose root node has these trees as its children. The DFUDS U of such a tree is
U = ( (^p ) U1' U2' ... Up'
where the first ( is the dummy open parenthesis, (^p ) (p open parentheses and one close parenthesis) encodes the root of degree p, and Ui' is Ui without its head dummy open parenthesis.

By the induction hypothesis, each Ui is balanced. Because its head open parenthesis is removed, each Ui is one ( short, so the p subtree sequences together lack p open parentheses. The head dummy parenthesis of U and the sequence (^p ) for the root node contribute p+1 open parentheses and one close parenthesis, so p open parentheses remain unmatched, which is exactly the number missing. Therefore U is balanced. Its length is 1 + (p+1) + (2n−2−p) = 2n. So, the claim holds for a tree with n nodes.

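Operations on DFUDS
A node of degree d is represented by the head position of its encoding (^d ) (d open parentheses followed by one close parenthesis).
The position v in the parentheses sequence and the preorder number i are converted by preorder-rank(v) = rank_)(v−1) + 1 and preorder-select(i) = select_)(i−1) + 1.
Degree: the number of ( symbols from position v up to the next ).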

i-th child
[Figure: a node v whose child subtrees have the DFUDS U1, U2, U3, shown on a 9-node example tree with DFUDS U = (((()(())))((())))]

Parent
[Figure: a node and its parent p on the 9-node example tree with DFUDS U = (((()(())))((())))]

Number of descendants (Subtree size)
The size of the subtree rooted at v is
subtreesize(v) = (findclose(enclose(v)) − v)/2 + 1
[Figure: the 9-node example tree with DFUDS U = (((()(())))((())))]
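The navigation operations can be tried out with naive linear-time scans in place of the constant-time rank/select/findclose structures. The sketch below is an illustration under stated assumptions: positions are 1-based, the root starts at position 2 (right after the dummy parenthesis), `enclose(v)` is taken to mean the nearest ( to the left of v that is still open just before v, and the formulas used for `child` and `parent` are the usual DFUDS ones, which are not spelled out in this transcript.

```python
def findclose(u, p):
    """Position of the ')' matching the '(' at 1-based position p."""
    depth = 0
    for q in range(p, len(u) + 1):
        depth += 1 if u[q - 1] == "(" else -1
        if depth == 0:
            return q
    raise ValueError("unmatched parenthesis")


def findopen(u, p):
    """Position of the '(' matching the ')' at 1-based position p."""
    depth = 0
    for q in range(p, 0, -1):
        depth += 1 if u[q - 1] == ")" else -1
        if depth == 0:
            return q
    raise ValueError("unmatched parenthesis")


def enclose(u, v):
    """Nearest '(' left of v that is still unmatched just before v (assumed meaning)."""
    depth = 0
    for q in range(v - 1, 0, -1):
        if u[q - 1] == ")":
            depth += 1
        elif depth == 0:
            return q
        else:
            depth -= 1
    raise ValueError("no enclosing parenthesis")


def degree(u, v):
    return u.index(")", v - 1) - (v - 1)      # number of '(' before the next ')'


def child(u, v, i):
    return findclose(u, v + degree(u, v) - i) + 1   # i-th child, 1 <= i <= degree


def parent(u, v):
    if v == 2:
        return None                            # the root has no parent
    p = findopen(u, v - 1)                     # lands inside the parent's encoding
    q = u.rfind(")", 0, p - 1)                 # last ')' before p (0-based), or -1
    return 2 if q < 0 else q + 2               # head position of that encoding


def subtree_size(u, v):
    return (findclose(u, enclose(u, v)) - v) // 2 + 1


U9 = "(((()(())))((())))"   # 9-node example; node heads at 2, 6, 9, 10, 11, 12, 16, 17, 18
```

On `U9`, the root (position 2) has degree 3 and its children start at positions 6, 11 and 12.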

LCA on DFUDS
lca can be computed by almost the same algorithm as that for BP:
lca(x,y) = parent(RMQ_E(x, y−1) + 1)
We use the leftmost one if there is more than one minimum.
[Figure: the 8-node example tree with its DFUDS U = ((()((())))(())) and excess array E = 1232345432123210, and its BP P = ((()()())(()())) with excess array E = 1232323212323210]

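Let E[i] = rank_((U,i) − rank_)(U,i), that is, the number of ( minus the number of ) in U[1..i]. Let the subtrees of v be T1, T2, ..., Tk, the DFUDS of v be U[l_0..r_0] with E[r_0] = d, and the DFUDS of Ti be U[l_i..r_i].
Lemma:
E[r_i] = E[r_{i−1}] − 1 = d − i (1 ≤ i ≤ k)
E[j] > E[r_i] (l_i ≤ j < r_i)
[Figure: the 9-node example with U = (((()(())))((()))) and E = 123434543212343210, with the positions r_0, r_1, r_2, r_3 for the root v marked]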

Proof: The DFUDS U[l_i..r_i] of each subtree becomes balanced when one ( is added at its head. ⇒ Within U[l_i..r_i] the excess ends exactly one below its value just before l_i, and it reaches that level only at r_i. Hence
E[r_i] = E[r_{i−1}] − 1 = d − i
E[j] > E[r_i] (l_i ≤ j < r_i)

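Lemma: lca(x,y) = parent(RMQ_E(x, y−1) + 1)
Proof: Let v = lca(x,y). Among the subtrees of v, say T1, T2, ..., Tk, let T_α and T_β be those containing x and y, respectively. Let E[r_β] = d, so that E[r_{β−1}] = d+1.
Case 1: y < r_β (y is not the rightmost leaf of T_β). Every position j of T_β before r_β has E[j] ≥ d+1, and every position of the range before r_{β−1} has a value larger than d+1. Because RMQ returns the leftmost minimum, RMQ_E(x, y−1) = r_{β−1}.
[Figure: excess values at the ends of the subtrees around T_α and T_β: d+1 at r_{β−1}, d at r_β, d−1 at r_{β+1}]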

Case 2: y = r_β. Because E[y−1] = d+1, RMQ_E(x, y−1) takes the minimum value d+1 at r_{β−1} and at y−1. Because RMQ returns the leftmost one, r_{β−1} is obtained.
In either case, RMQ_E(x, y−1) + 1 = l_β holds and parent(l_β) = lca(x,y) holds.
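The lemma can be checked on the 8-node example with a naive sketch. Everything here is an assumption made for illustration: positions are 1-based, the RMQ is a linear scan rather than a constant-time structure, and the case where x is an ancestor of y (which the proof above does not cover, since it places x and y in different child subtrees) is handled by an explicit test.

```python
def excess(u):
    """E[p] = number of '(' minus number of ')' in u[1..p]; E[0] = 0."""
    e = [0]
    for c in u:
        e.append(e[-1] + (1 if c == "(" else -1))
    return e


def findopen(u, p):
    """Position of the '(' matching the ')' at 1-based position p."""
    depth = 0
    for q in range(p, 0, -1):
        depth += 1 if u[q - 1] == ")" else -1
        if depth == 0:
            return q
    raise ValueError("unmatched parenthesis")


def parent(u, v):
    """Head position of the parent of the non-root node at position v."""
    p = findopen(u, v - 1)          # lands inside the parent's encoding
    q = u.rfind(")", 0, p - 1)      # last ')' before p (0-based), or -1
    return 2 if q < 0 else q + 2    # the root's encoding starts at position 2


def lca(u, x, y):
    if x > y:
        x, y = y, x
    e = excess(u)
    # Subtree of x ends at the first position where the excess drops below E[x-1].
    end = next(p for p in range(x, len(u) + 1) if e[p] < e[x - 1])
    if y <= end:                    # x is an ancestor of y (or x == y)
        return x
    m = min(range(x, y), key=lambda p: (e[p], p))   # leftmost minimum of E[x..y-1]
    return parent(u, m + 1)


U8 = "((()((())))(()))"             # node heads: 1->2, 2->5, 3->9, 4->10, 5->11, 6->12, 7->15, 8->16
```

For instance, `lca(U8, 9, 11)` (nodes 3 and 5) scans E[9..10], finds the minimum at position 10, and returns `parent(U8, 11)`, the head of node 2.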

Other Operations 1
leaf-rank(v) = rank_))(v)
leaf-select(i) = select_))(i)
preorder-rank(v) = rank_)(v−1) + 1
preorder-select(i) = select_)(i−1) + 1
[Figure: the 9-node example with U = (((()(())))((()))) and E = 123434543212343210]

Other Operations 2
inorder-rank(v) = leaf-rank(child(v,2) − 1)
inorder-select(i) = parent(leaf-select(i) + 1)
leftmost-leaf(v) = leaf-select(leaf-rank(v−1) + 1)
rightmost-leaf(v) = findclose(enclose(v))
[Figure: the 9-node example with U = (((()(())))((()))) and E = 123434543212343210]

Depths and Level Ancestors
Depths and level ancestors can be computed in O(1) time, but the auxiliary data structures are complicated. The data structures are based on pioneers.
[Figure: the 8-node example tree with DFUDS U = ((()((())))(()))]

Compressing DFUDS sequences
DFUDS can be compressed into a kind of entropy of the tree, using 2n+o(n) bits in the worst case.
Def: Tree degree entropy H*(T) = Σ_i (n_i/n) log(n/n_i), where n_i is the number of nodes with degree i and n is the number of nodes.
Ex. full binary tree (every internal node has exactly two children)
BP, LOUDS, original DFUDS: 2n bits
New DFUDS: n bits
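As a small numeric illustration, the tree degree entropy can be computed directly from a degree sequence. The sketch assumes the definition H*(T) = Σ_i (n_i/n) log(n/n_i) with logarithms to base 2 (so that n·H*(T) is measured in bits); the function name is hypothetical.

```python
import math
from collections import Counter


def tree_degree_entropy(degrees):
    """H*(T) for a tree given by the list of its node degrees (log base 2 assumed)."""
    n = len(degrees)
    counts = Counter(degrees)                  # n_i = number of nodes with degree i
    return sum(c / n * math.log2(n / c) for c in counts.values())


S = [2, 3, 0, 0, 0, 2, 0, 0]                   # degree sequence of the 8-node example
full_binary = [2, 2, 0, 0, 2, 0, 0]            # a full binary tree with 7 nodes

bits_example = len(S) * tree_degree_entropy(S)
bits_full_binary = len(full_binary) * tree_degree_entropy(full_binary)
```

For a full binary tree only the degrees 0 and 2 occur, each for about half of the nodes, so n·H*(T) stays below n bits, in line with the "n bits" figure on the slide.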

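Lower Bound
The number of ordered trees having n_i nodes with degree i (i = 0, 1, ..., n−1) is
L = (1/n) · n! / (n_0! n_1! ··· n_{n−1}!) [Rote 96]
Since log L equals nH*(T) up to lower-order terms, the tree degree entropy matches the information-theoretic lower bound for trees with given node degrees.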

New Representation
S: degree sequence of the nodes (cf. in DFUDS, it is encoded by unary codes)
Store S in a compressed form, and recover substrings of U when necessary.
Because the alphabet size σ = n, we cannot compress S directly ⇒ partition S into two: one for degrees at least log n, and one for the others.
Example: DFUDS U = ((()((())))(())) for nodes 1 to 8 gives the new sequence S = 2 3 0 0 0 2 0 0

Compressing Large Degrees
Store the start and end positions of the unary codes in DFUDS.
The number of nodes with degree at least log n is at most n/log n ⇒ encoded in O(n log log n/log n) bits
[Figure: a DFUDS containing long unary codes (degrees 9, 3 and 10) whose start and end positions are stored]

Compressing Small Degrees
σ = log n ⇒ nH0(S) = nH*(T)
It is easy to merge the two sequences: in a substring of length log n bits, there are at most two nodes with large degrees.

Range Minimum Query Problem (RMQ)
Input: an array A[1,n] (can be preprocessed), range [i,j] ⊆ [1,n]
Output: the position of a minimum value in the sub-array A[i,j]
i:    1 2 3 4 5 6 7 8 9
A[i]: 1 4 3 5 0 4 5 3 7
RMQ_A(3,6) = 5
RMQ_A(6,8) = 8
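As a baseline for the problem statement, here is a sketch of the classical sparse table, which answers queries in O(1) time after O(n log n) preprocessing. It is a swapped-in textbook technique shown only to make the query concrete, not the succinct structure developed in these slides; the function names are hypothetical and positions are 1-based as on the slide.

```python
def build_sparse_table(a):
    """table[k][i] = 0-based position of the leftmost minimum of a[i : i + 2**k]."""
    n = len(a)
    table = [list(range(n))]                    # windows of length 1
    k = 1
    while (1 << k) <= n:
        prev, half = table[-1], 1 << (k - 1)
        cur = []
        for i in range(n - (1 << k) + 1):
            l, r = prev[i], prev[i + half]
            cur.append(l if a[l] <= a[r] else r)
        table.append(cur)
        k += 1
    return table


def rmq(a, table, i, j):
    """1-based position of the leftmost minimum in A[i..j] (1 <= i <= j <= n)."""
    i, j = i - 1, j - 1
    k = (j - i + 1).bit_length() - 1            # largest k with 2**k <= range length
    l, r = table[k][i], table[k][j - (1 << k) + 1]
    return (l if a[l] <= a[r] else r) + 1


A = [1, 4, 3, 5, 0, 4, 5, 3, 7]
T = build_sparse_table(A)
```

The two overlapping windows of length 2^k cover the whole query range, so comparing their two minima is enough.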

Data Structure for RMQ Theorem: Let s(n) and t(n) denote the size and the query time of an RMQ data structure for an array of length n, respectively. A data structure for RMQ satisfying the following can be computed in O(n) time. Note: the input array is not necessary after preprocess.

Cartesian Tree
The Cartesian tree for an array A[1,n]
stores the minimum value A[i] of A[1,n] in the root node,
has the Cartesian tree for A[1,i−1] as the left subtree,
has the Cartesian tree for A[i+1,n] as the right subtree.
[Figure: the Cartesian tree of A = 1 4 3 5 0 4 5 3 7]

Relation between Cartesian Tree and RMQ
RMQ(i,j) = lca(i,j)
[Figure: the Cartesian tree of A = 1 4 3 5 0 4 5 3 7]

Property of Cartesian Tree
Lemma: If we add A[n] to the Cartesian tree for A[1,n−1], A[n] is stored on the path from the root to the rightmost leaf.
Proof: Because A[n] is the rightmost element in the array, it cannot be a left child.
[Figure: examples of adding a new last element to a Cartesian tree]

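Construction of Cartesian Tree
To add A[n] to the Cartesian tree for A[1,n−1]:
compare A[n] with the elements on the path from A[n−1] up to the root, in this order;
when an element x smaller than A[n] appears, insert A[n] as the right child of x;
the former right child of x becomes the left child of A[n].
If no such element exists, A[n] becomes the new root and the old root becomes its left child.
[Figure: inserting a new element on the rightmost path of a Cartesian tree]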

Time Complexity
Lemma: The Cartesian tree is constructed in O(n) time.
Proof: Let c_i be the number of comparisons made to insert A[i]. Then the total time complexity is O(n + Σ_i c_i). Every comparison except the last one of an insertion removes a node from the rightmost path of the Cartesian tree, because that node ends up in the left subtree of A[i]. A removed node never returns to the rightmost path, so each node is removed at most once. This means that the total number of comparisons is at most 2n and the construction time is O(n).
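The incremental construction can be sketched with a stack that holds the current rightmost path. This is a minimal sketch with 0-based indices and hypothetical function names; ties are resolved so that the leftmost of equal elements ends up higher in the tree (an assumption, since the slides do not discuss ties), and the LCA is computed naively by walking up parent pointers.

```python
def cartesian_tree(a):
    """Build the Cartesian tree of a in O(n); returns (root, parent) with 0-based indices."""
    n = len(a)
    parent = [-1] * n
    stack = []                           # rightmost path, root at the bottom
    for i in range(n):
        last = -1
        while stack and a[stack[-1]] > a[i]:
            last = stack.pop()           # each node is popped at most once overall
        if last != -1:
            parent[last] = i             # popped chain becomes the left subtree of i
        if stack:
            parent[i] = stack[-1]        # i becomes the right child of a smaller element
        stack.append(i)
    return stack[0], parent


def lca(parent, u, v):
    """Naive LCA by walking up parent pointers (for checking only)."""
    seen = set()
    while u != -1:
        seen.add(u)
        u = parent[u]
    while v not in seen:
        v = parent[v]
    return v


A = [1, 4, 3, 5, 0, 4, 5, 3, 7]
root, parent = cartesian_tree(A)
```

With these parent pointers, `lca(parent, i, j)` is the position of the minimum of `A[i..j]`, which is the relation RMQ(i,j) = lca(i,j) stated earlier.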

BP Representation of Cartesian Tree
A = 1 4 3 5 0 4 5 3 7
P = ((()((())()(())))()((()(()))()(())))
P' = 123234543434543212123434543232343210 (excess values of P)
[Figure: the Cartesian tree of A and its BP sequence P]

Algorithm for RMQ
Preprocess:
Convert A[1,n] to the Cartesian tree.
Convert the Cartesian tree to the BP sequence P.
Query: to compute the position m of the minimum in A[i,j],
i' = select_()(P,i), j' = select_()(P,j)
let m' be the position of the minimum in P'[i', j']
m = rank_()(P,m') + 1

Complexities Length of P: 4n RMQ on BP sequence is computed in O(1) time using the range min-max tree and the sparse table.

Optimal-Space Data Structure for RMQ Theorem: For an array of length n, there exists a data structure of 2n+o(n) bits supporting range minimum query in constant time.

2D-Min-Heap
Def: The 2D-Min-Heap M_A of an array A is a labeled and ordered tree with vertices v_0, …, v_n, where v_i is labeled with i (0 ≤ i ≤ n). For 1 ≤ i ≤ n, the parent node of v_i is v_j iff j < i, A[j] < A[i], and A[k] ≥ A[i] for all j < k ≤ i. The order of the children is chosen such that their labels are increasing from left to right.
Assume A[0] = −∞.
[Figure: the array A = 1 4 3 5 0 4 5 3 7 with positions 1 to 9]

Lemma 1: Let M_A be the 2D-Min-Heap of A. The node labels correspond to the preorder numbers of M_A (starting at 0). Let i be a node with children x_1, …, x_k. Then A[i] < A[x_j] for all 1 ≤ j ≤ k, and A[x_j] ≤ A[x_{j−1}] for all 1 < j ≤ k.
[Figure: the 2D-Min-Heap of A = 1 4 3 5 0 4 5 3 7, with the root labeled 0 holding −∞]

Lemma 2: Let M_A be the 2D-Min-Heap of A. For arbitrary nodes i and j, 1 ≤ i < j ≤ n, let z denote the LCA of i and j in M_A. Then if z = i, RMQ(i, j) is given by i, and otherwise, RMQ(i, j) is given by the child of z that is on the path from z to j.
[Figure: the 2D-Min-Heap of A = 1 4 3 5 0 4 5 3 7]

Proof: Let T_x denote the subtree of M_A rooted at x.
Case z = i: This means that j is a descendant of i. Then all nodes i, i+1, …, j are in T_i and A[i] is the minimum in the query range [i, j].
Case z ≠ i: Let x_1, …, x_k be the children of z. Let α and β (1 ≤ α < β ≤ k) be defined such that T_{x_α} contains i, and T_{x_β} contains j. Because z ≠ i, z < i. In other words, the LCA is not in the query range. Every node in [i, j] is in T_{x_γ} for some α ≤ γ ≤ β, and in particular x_γ ∈ [i, j] for all α < γ ≤ β. We see that {x_γ : α < γ ≤ β} are the only candidate positions of the minimum in A[i,j]. By Lemma 1, x_β (the child of z on the path to j) is the position of the minimum.

We can compute RMQ by using preorder-select, lca and level-ancestor on the BP representing M_A. This can be simplified by using the DFUDS representation of M_A.
Lemma: Let U be the DFUDS of M_A. Then RMQ_A(i, j) for i < j can be answered in O(1) time by
x := select_)(U, i+1), y := select_)(U, j)
w := RMQ_E(x, y)
if rank_)(U, findopen(U, w)) = i, return i
else return rank_)(U, w)

Example: A = 1 4 3 5 0 4 5 3 7 (positions 1 to 9; the root, labeled 0, holds −∞)
U = ((()(())())(()())())
E = 12323432321232321210
For RMQ_A(3, 6): x = 10, y = 14, w = 11, findopen(U, 11) = 2 and rank_)(U, 2) = 0 ≠ 3, so the answer is rank_)(U, 11) = 5.
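The whole pipeline (array, 2D-Min-Heap, DFUDS, query) can be sketched with naive linear scans in place of the o(n)-bit rank/select/findopen/RMQ structures. The function names are hypothetical, positions in U are 1-based, RMQ_E is taken to return the leftmost minimum (as on the LCA slide), and the query uses y = select_)(U, j), so the case i = j is answered directly.

```python
def heap_dfuds(a):
    """DFUDS of the 2D-Min-Heap of a (values A[1..n] passed as a 0-based list)."""
    n = len(a)
    degrees = [0] * (n + 1)             # node 0 is the root with A[0] = -infinity
    stack = [0]                         # path from the root to the previous node
    for i in range(1, n + 1):
        while stack[-1] != 0 and a[stack[-1] - 1] >= a[i - 1]:
            stack.pop()                 # parent must hold a strictly smaller value
        degrees[stack[-1]] += 1
        stack.append(i)
    # labels are preorder numbers, so degrees are already in DFUDS order
    return "(" + "".join("(" * d + ")" for d in degrees)


def rank_close(u, p):
    return u[:p].count(")")             # number of ')' in u[1..p]


def select_close(u, k):
    seen = 0
    for p, c in enumerate(u, start=1):
        seen += c == ")"
        if seen == k:
            return p
    raise ValueError("fewer than k close parentheses")


def findopen(u, p):
    depth = 0
    for q in range(p, 0, -1):
        depth += 1 if u[q - 1] == ")" else -1
        if depth == 0:
            return q
    raise ValueError("unmatched parenthesis")


def rmq_dfuds(u, i, j):
    """Position (1-based) of the minimum of A[i..j], using only the DFUDS u."""
    if i == j:
        return i
    e, excess = 0, [0]
    for c in u:
        e += 1 if c == "(" else -1
        excess.append(e)
    x, y = select_close(u, i + 1), select_close(u, j)
    w = min(range(x, y + 1), key=lambda p: (excess[p], p))   # leftmost minimum
    if rank_close(u, findopen(u, w)) == i:
        return i
    return rank_close(u, w)


A = [1, 4, 3, 5, 0, 4, 5, 3, 7]
U = heap_dfuds(A)
```

Note that `rmq_dfuds` never reads `A`: after preprocessing, the 2n+2 parentheses of `U` are enough to answer queries.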

Application of RMQ: Document Listing Query
Input: search pattern p, document collection d1, d2, ..., dk
Output: IDs of all documents containing p at least once
The document collection can be preprocessed (to construct some index).
Applications: Web search engines, full-text databases
[Figure: a query for p over documents d1 to d5 returns the IDs 1, 4, 5]

Inverted File
For each word, store a list of IDs of the documents containing the word.
Doc. 1: she plays tennis.
Doc. 2: he plays tennis.
word     ID
she      1
plays    1,2
he       2
tennis   1,2
Cannot search arbitrary patterns.

Generalized Suffix Trees
Suffix tree of a string made by concatenating all documents.
Any pattern is represented by an interval [l,r] of the suffix array.
The interval is obtained in O(|p| log σ) time, where σ is the alphabet size.
[Figure: the generalized suffix tree and suffix array SA of T = acb$1bcb$2aba$3 (d1 = acb, d2 = bcb, d3 = aba)]

Naive Algorithm for Document Listing Query
D: array of document IDs in lexicographic order of the suffixes
Obtain the interval [l, r] corresponding to p.
Obtain all elements in D[l, r] and remove duplicates.
Example for T = acb$1bcb$2aba$3: a: 1,3; b: 1,2,3; bc: 2; c: 1,2
[Figure: the suffix array SA and the array D for T]

Muthukrishnan’s Algorithm Optimal time algorithm O(n) time preprocess, O(|p|+output) time query (for constant size alphabet)

Preprocess
Concatenate the documents, each followed by its own terminator ($1, $2, $3, $4).
[Figure: example documents d1, d2, ... and their concatenation T]

Construct the suffix tree and the suffix array.
For each node, store the interval [s,b] of the suffix array covered by its subtree.
[Figure: the suffix tree of T with 16 leaves, its suffix array SA, and the intervals stored at internal nodes, e.g. [1,6], [7,9], [10,12]]

Compute the array D
D[i]: the ID of the document containing the suffix SA[i]
[Figure: the arrays SA and D for the example; e.g., suffix 6 lies in document 2, so D[i] = 2 for the i with SA[i] = 6]

Compute an array C
C[i] = max{ j | j < i, D[j] = D[i] } (if no such j exists, C[i] = −D[i])
[Figure: the arrays i, SA, D and C for the example; e.g., C[11] = 7 and C[14] = 11]

Query Algorithm
Lemma: If a document k contains a pattern p, there exists a unique i ∈ [s,b] such that D[i] = k and C[i] < s.
If we output D[i] only for i ∈ [s,b] such that C[i] < s, we can enumerate all document IDs without duplication.
[Figure: the arrays i, SA, D and C for the example]

Query Algorithm
1. Obtain the interval [s,b] of the suffix array corresponding to p.
2. Find the position m of the minimum value of C[s,b].
3. If C[m] < s, output D[m] and repeat step 2 for the intervals [s,m−1] and [m+1,b].
4. If C[m] ≥ s, terminate.
[Figure: the arrays i, SA, D and C for the example, with the order in which positions are visited]
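The query steps can be sketched end to end on a toy collection. This is a sketch under stated assumptions, not the optimal-time structure: the suffix array is built by plain sorting, the pattern interval is found by a linear scan instead of a suffix tree search, the range minimum is a linear scan instead of an O(1) RMQ structure, indices are 0-based, and −1 is used as the "no previous occurrence" value of C instead of −D[i]. All function names are hypothetical.

```python
def build_index(docs):
    """Return (text, SA, D, C) for a list of documents (toy construction)."""
    text, doc_of = "", []
    for k, d in enumerate(docs, start=1):
        text += d + chr(k)                    # chr(k) plays the role of the terminator $k
        doc_of += [k] * (len(d) + 1)
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    D = [doc_of[p] for p in sa]               # document ID of each suffix
    C, last = [], {}
    for i, k in enumerate(D):
        C.append(last.get(k, -1))             # previous position with the same document
        last[k] = i
    return text, sa, D, C


def pattern_interval(text, sa, p):
    """Interval [s, b] of suffixes starting with p, or None (linear scan stand-in)."""
    hits = [i for i, start in enumerate(sa) if text.startswith(p, start)]
    return (hits[0], hits[-1]) if hits else None


def list_documents(D, C, s, b):
    """Report each document in D[s..b] exactly once via range minima of C."""
    out, todo = [], [(s, b)]
    while todo:
        lo, hi = todo.pop()
        if lo > hi:
            continue
        m = min(range(lo, hi + 1), key=lambda i: C[i])   # naive range minimum
        if C[m] < s:                          # first occurrence of D[m] inside [s, b]
            out.append(D[m])
            todo.append((lo, m - 1))
            todo.append((m + 1, hi))
    return out


def document_listing(docs, p):
    text, sa, D, C = build_index(docs)
    interval = pattern_interval(text, sa, p)
    return sorted(list_documents(D, C, *interval)) if interval else []
```

With the documents acb, bcb and aba from the generalized suffix tree slide, `document_listing(["acb", "bcb", "aba"], "b")` reports each of the three documents once, even though b occurs four times.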

Time Complexity
Time to compute the interval [s,b] for p: O(|p|)
Time to output: O(output)
Lemma: The number of range minimum queries is at most 2·output + 1.
Proof: Consider the search tree of the recursion. At an internal node, a new document is found and output. At a leaf, the search fails.
#internal nodes = output
#leaves = #internal nodes + 1