Discrete Methods in Mathematical Informatics

Kunihiko Sadakane, The University of Tokyo

How to Evaluate
A report on a paper explained in the course: explain a paper in detail, or implement a data structure.
Send the report by July 29th, 2016.

References
R. F. Geary, N. Rahman, R. Raman, and V. Raman. A simple optimal representation for balanced parentheses. In Proc. CPM, 2004.
R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2), 2005.
R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets. ACM Transactions on Algorithms, 3(4), Article No. 43, 2007.
K. Sadakane. New Text Indexing Functionalities of the Compressed Suffix Arrays. Journal of Algorithms, 48(2), 2003.
K. Sadakane. Compressed Suffix Trees with Full Functionality. Theory of Computing Systems, 41(4), 2007.
P. Ferragina and R. Venturini. A Simple Storage Scheme for Strings Achieving Entropy Bounds. Theor. Comput. Sci. 372(1), 2007.
J. Jansson, K. Sadakane, and W.-K. Sung. Ultra-succinct representation of ordered trees with applications. J. Comput. Syst. Sci. 78(2), 2012.
K. Sadakane and R. Grossi. Squeezing Succinct Data Structures into Entropy Bounds. In Proc. ACM-SIAM SODA, 2006.
G. Navarro and K. Sadakane. Fully-Functional Static and Dynamic Succinct Trees. ACM Transactions on Algorithms, 10(3), Article No. 16, 2014.
A. Arasu, G. Manku. Approximate Frequency Counts over Data Streams. VLDB 2002.
R. M. Karp, S. Shenker, C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems, 2003.
A. Metwally, D. Agrawal, A. E. Abbadi. Efficient computation of frequent and top-k elements in data streams. Proc. ICDT, 2005.
M. Datar, S. Muthukrishnan. Estimating rarity and similarity on data stream windows. Proc. ESA, 2002.
G. Cormode, S. Muthukrishnan, I. Rozenbaum. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. VLDB 2005.
G. Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch and its applications. J. Algorithms 55(1), 2005.
N. Shrivastava, C. Buragohain, D. Agrawal, S. Suri. New aggregation techniques for sensor networks. Proc. ACM SenSys, 2004.

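Another Representation of Succinct Ordered Trees
DFUDS (depth-first unary degree sequence) [Benoit et al. 05]
Encode the node degrees by unary codes in DFS (preorder) order: a node of degree d ⇒ d ('s followed by one ).
Add one ( at the head of the sequence.
Total length: 2n bits.
Example (8-node tree: node 1 has children 2 and 6, node 2 has children 3, 4, 5, node 6 has children 7, 8):
DFUDS U = ((()((())))(()))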
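The encoding rule can be tried out with a few lines of Python. This is only a minimal sketch: the dictionary-of-children input format and the function name `dfuds` are assumptions made for illustration, not part of the slides.

```python
def dfuds(children, root):
    """Return the DFUDS string of an ordered tree.

    `children` maps a node to the ordered list of its children
    (hypothetical input format; leaves may be omitted from the dict).
    """
    parts = ["("]                      # dummy open parenthesis at the head
    stack = [root]
    while stack:                       # iterative preorder (DFS) traversal
        v = stack.pop()
        kids = children.get(v, [])
        parts.append("(" * len(kids) + ")")   # degree d -> d '(' and one ')'
        stack.extend(reversed(kids))   # leftmost child is visited first
    return "".join(parts)


# The 8-node example tree of the slide.
tree = {1: [2, 6], 2: [3, 4, 5], 6: [7, 8]}
print(dfuds(tree, 1))
```

For the example tree this reproduces the sequence U shown on the slide, and a single-node tree gives ().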

Lemma: The DFUDS of an n-node ordered tree is a balanced parentheses sequence of length 2n.
Proof: If n = 1, the root has no children (degree 0), and the DFUDS is ().
Assume that the claim holds for any tree with at most n−1 nodes. Consider p trees whose DFUDS are U1, U2, ..., Up. Their total number of nodes is n−1, and the total length of their DFUDS is 2n−2.
Consider the tree whose root has these p trees as its children. The DFUDS U of this tree consists of
a dummy open parenthesis,
the encoding of the degree p of the root (p open parentheses followed by one close parenthesis), and
each Ui without its head dummy open parenthesis, for i = 1, ..., p.
Its length is 1 + (p+1) + (2n−2−p) = 2n.

By the induction hypothesis, each Ui is balanced. Because its head open parenthesis is removed, each Ui is missing exactly one ( inside U.
The dummy head parenthesis of U and the encoding of the root (p open parentheses and one close parenthesis) together leave p open parentheses unmatched, one for each Ui. Therefore U is balanced, and the claim holds for a tree with n nodes.

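Operations on DFUDS
A node of degree d is represented by the head position of its encoding (d open parentheses followed by one close parenthesis).
The position v in the parentheses sequence and the preorder number i can be converted by
preorder-rank(v) = rank_)(v−1) + 1, preorder-select(i) = select_)(i−1) + 1
Degree: degree(v) = select_)(rank_)(v−1) + 1) − v, the distance from v to the next ).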

i-th child
[Figure: node v and the encodings U1, U2, U3 of the subtrees of its children in the DFUDS (((()(())))((()))) of a 9-node tree.]

Parent
[Figure: in the DFUDS (((()(())))((()))) of the 9-node tree, nodes 2, 5 and 6 have the same parent p.]

Number of descendants (Subtree size)
The size of the subtree rooted at v is
subtreesize(v) = (findclose(enclose(v)) − v)/2 + 1
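The navigation operations can be checked on the 9-node example with naive linear scans in place of the O(1)-time auxiliary structures. This is a sketch under assumptions: positions are 0-based Python indices (the root starts at index 1, after the dummy head), `enclose(v)` is read as "the nearest ( to the left of v that is still open at v", and the formulas used for `child` and `parent` are the usual DFUDS ones, since the slide's own formulas for them did not survive in this transcript.

```python
def find_close(u, i):
    """Index of the ')' matching the '(' at index i (naive scan)."""
    depth = 0
    for j in range(i, len(u)):
        depth += 1 if u[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def find_open(u, i):
    """Index of the '(' matching the ')' at index i (naive scan)."""
    depth = 0
    for j in range(i, -1, -1):
        depth += 1 if u[j] == ")" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def enclose(u, v):
    """Nearest '(' to the left of v that is still unmatched at v."""
    depth = 0
    for j in range(v - 1, -1, -1):
        if u[j] == ")":
            depth += 1
        elif depth == 0:
            return j
        else:
            depth -= 1
    raise ValueError("no enclosing parenthesis")


def degree(u, v):
    """Number of '(' between the head position v and the next ')'."""
    return u.index(")", v) - v


def child(u, v, i):
    """Head index of the i-th child of v (i = 1 is the leftmost child)."""
    return find_close(u, v + degree(u, v) - i) + 1


def parent(u, v):
    """Head index of the parent of a non-root node v."""
    q = find_open(u, v - 1)
    j = u.rfind(")", 0, q)          # last ')' before q, or -1 if none
    return j + 1 if j >= 0 else 1   # no ')' before q: the parent is the root


def subtree_size(u, v):
    return (find_close(u, enclose(u, v)) - v) // 2 + 1


U = "(((()(())))((())))"            # 9-node example; root at index 1
print([child(U, 1, i) for i in (1, 2, 3)], subtree_size(U, 5))
```

In this example the three children of the root start at indices 5, 10 and 11, which are the nodes with preorder numbers 2, 5 and 6.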

LCA on DFUDS
lca can be computed by almost the same algorithm as that for BP:
lca(x,y) = parent(RMQ_E(x, y−1) + 1)
If there is more than one minimum, we use the leftmost one.
Example (8-node tree):
DFUDS U = ((()((())))(()))
E = 1 2 3 2 3 4 5 4 3 2 1 2 3 2 1 0
BP P = ((()()())(()()))

Let E[i] = rank_((U,i) − rank_)(U,i). Let the subtrees of v be T1, T2, ..., Tk, let the DFUDS of v be U[l_0..r_0] with E[r_0] = d, and let the DFUDS of Ti be U[l_i..r_i].
Lemma: E[r_i] = E[r_{i−1}] − 1 = d − i (1 ≤ i ≤ k), and E[j] > E[r_i] (l_i ≤ j < r_i).

Proof: The DFUDS U[l_i..r_i] of each subtree becomes balanced by adding ( at its head. Hence U[l_i..r_i] contains exactly one more ) than (, and no proper prefix of it contains more ) than (.
⇒ E[r_i] = E[r_{i−1}] − 1 = d − i, and E[j] > E[r_i] (l_i ≤ j < r_i)

Lemma: lca(x,y) = parent(RMQ_E(x, y−1) + 1)
Proof: Let v = lca(x,y). Among the subtrees T1, T2, ..., Tk of v, let T_α and T_β (α < β) be those containing x and y, respectively. Let E[r_β] = d, so that E[r_{β−1}] = d+1.
Case 1: y < r_β (y is not the rightmost leaf of T_β). Every position j of the range with j < r_{β−1} has E[j] ≥ d+2, and every position j with l_β ≤ j ≤ y−1 has E[j] ≥ d+1. Because RMQ returns the leftmost minimum, RMQ_E(x, y−1) = r_{β−1}.
Case 2: y = r_β. Because E[y−1] = d+1, RMQ_E(x, y−1) takes the minimum value d+1 at r_{β−1} and at y−1. Because RMQ returns the leftmost one, r_{β−1} is obtained.
In either case, RMQ_E(x, y−1) + 1 = l_β holds, and parent(l_β) = lca(x,y) holds.
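The lemma can be exercised on the 8-node example with a naive leftmost-minimum scan instead of a constant-time RMQ structure. This is a sketch under assumptions: indices are 0-based, nodes are given by the head index of their encoding, x < y, neither node is an ancestor of the other (as in the proof), and `parent` uses the usual DFUDS formula, which is not spelled out in this transcript.

```python
from itertools import accumulate


def find_open(u, i):
    """Index of the '(' matching the ')' at index i (naive scan)."""
    depth = 0
    for j in range(i, -1, -1):
        depth += 1 if u[j] == ")" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def parent(u, v):
    """Head index of the parent of a non-root node v."""
    q = find_open(u, v - 1)
    j = u.rfind(")", 0, q)          # last ')' before q, or -1 if none
    return j + 1 if j >= 0 else 1   # index 1 is the root's head position


def lca(u, x, y):
    """lca(x, y) = parent(RMQ_E(x, y-1) + 1) for x < y in different subtrees."""
    excess = list(accumulate(1 if c == "(" else -1 for c in u))
    # min() returns the first minimal index, i.e. the leftmost minimum.
    w = min(range(x, y), key=excess.__getitem__)
    return parent(u, w + 1)


U = "((()((())))(()))"   # heads: node 1 at 1, node 2 at 4, node 6 at 11
print(lca(U, 8, 10), lca(U, 8, 14))   # nodes 3 and 5, then nodes 3 and 7
```

Here nodes 3 and 5 (indices 8 and 10) meet at node 2 (index 4), while nodes 3 and 7 meet at the root (index 1).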

Other Operations 1
leaf-rank(v) = rank_))(v)
leaf-select(i) = select_))(i)
preorder-rank(v) = rank_)(v−1) + 1
preorder-select(i) = select_)(i−1) + 1
Here rank_)) and select_)) count occurrences of the pattern )): a leaf is encoded by a single ) that directly follows the ) ending the previous node.
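These four formulas can be tried on the 9-node example with naive pattern counting. A minimal sketch, with assumptions stated loudly: positions are 1-based as on the slide, an occurrence of a pattern is identified by its end position, and the root (preorder number 1) is mapped to position 2, right after the dummy head, which is a convention chosen here rather than something the slides state.

```python
def rank(u, pattern, i):
    """Number of occurrences of `pattern` ending at a 1-based position <= i."""
    k = len(pattern)
    return sum(1 for e in range(k, i + 1) if u[e - k:e] == pattern)


def select(u, pattern, j):
    """1-based end position of the j-th occurrence of `pattern` (j >= 1)."""
    k = len(pattern)
    count = 0
    for e in range(k, len(u) + 1):
        if u[e - k:e] == pattern:
            count += 1
            if count == j:
                return e
    raise ValueError("fewer than j occurrences")


def leaf_rank(u, v):
    return rank(u, "))", v)


def leaf_select(u, i):
    return select(u, "))", i)


def preorder_rank(u, v):
    return rank(u, ")", v - 1) + 1


def preorder_select(u, i):
    # Assumption: the root's encoding starts right after the dummy head.
    return select(u, ")", i - 1) + 1 if i > 1 else 2


U = "(((()(())))((())))"      # 9-node example, 1-based positions 1..18
print(preorder_select(U, 6), leaf_select(U, 4))
```

In this example the node with preorder number 6 starts at position 12 and the fourth leaf is the ) at position 16.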

Other Operations 2
inorder-rank(v) = leaf-rank(child(v,2) − 1)
inorder-select(i) = parent(leaf-select(i) + 1)
leftmost-leaf(v) = leaf-select(leaf-rank(v−1) + 1)
rightmost-leaf(v) = findclose(enclose(v))

Depths and Level Ancestors
Depths and level ancestors are computed in O(1) time, but the auxiliary data structures are complicated. The data structures are based on pioneers.

Compressing DFUDS sequences
DFUDS can be compressed into a kind of entropy of the tree, using 2n+o(n) bits in the worst case.
Def: Tree degree entropy
$H^*(T) = \sum_i \frac{n_i}{n} \log \frac{n}{n_i}$
n_i: number of nodes with degree i
Ex. full binary tree (any internal node has exactly two children)
BP, LOUDS, original DFUDS: 2n bits
New DFUDS: n bits
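The full-binary-tree example can be checked numerically. The sketch below assumes the usual definition of tree degree entropy, H*(T) = Σ_i (n_i/n) log(n/n_i) with logarithms to base 2; the function name and the list-of-degrees input are choices made for illustration.

```python
from collections import Counter
from math import log2


def tree_degree_entropy(degrees):
    """H*(T) for a tree given by the list of its node degrees."""
    n = len(degrees)
    # Counter gives n_i, the number of nodes with degree i.
    return sum(c / n * log2(n / c) for c in Counter(degrees).values())


# Full binary tree with 7 nodes: 3 internal nodes of degree 2 and 4 leaves.
degrees = [2, 2, 0, 0, 2, 0, 0]
h = tree_degree_entropy(degrees)
print(len(degrees) * h)   # slightly below n = 7 bits, versus 2n = 14 bits
```

Since about half of the nodes are leaves and half have degree 2, H*(T) is close to 1 bit per node, which is where the "n bits" figure for full binary trees comes from.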

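Lower Bound
The number of ordered trees having n_i nodes with degree i is
$L = \frac{1}{n} \binom{n}{n_0 \; n_1 \; \cdots \; n_{n-1}}$
Since $\log L$ and $nH^*(T)$ differ only in lower-order terms, the tree degree entropy matches the information-theoretic lower bound for trees with given node degrees. [Rote 96]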

New Representation
S: degree sequence of nodes (c.f. in DFUDS, it is encoded by unary codes.)
Store S in a compressed form, and recover substrings of U when necessary.
Because the alphabet size is σ = n (degrees range over 0, ..., n−1), we cannot compress S directly
⇒ partition S into two: one for degrees at least log n, and one for the others.
Example: DFUDS U = ((()((())))(())) corresponds to S = 2 3 0 0 0 2 0 0.

Compressing Large Degrees
Store the start and end positions of their unary codes in DFUDS.
The number of nodes with degree at least log n is at most n/log n
⇒ encoded in O(n log log n/log n) bits
[Figure: example DFUDS with the unary codes of the large-degree nodes marked.]

Compressing Small degrees
σ = log n ⇒ nH0(S) = nH*(T)
It is easy to merge the two sequences: in a substring of length log n bits, there are at most two nodes with large degrees.

Range Minimum Query Problem (RMQ)
Input: an array A[1,n] (can be preprocessed), range [i,j] ⊆ [1,n]
Output: the position of a minimum value in the sub-array A[i,j]
[Figure: example array A with RMQ_A(3,6) = 5 and RMQ_A(6,8) = 8.]

Data Structure for RMQ
Theorem: Let s(n) and t(n) denote the size and the query time of an RMQ data structure for an array of length n, respectively. A data structure for RMQ satisfying the following can be computed in O(n) time.
Note: the input array is not necessary after preprocessing.

Cartesian Tree
The Cartesian tree for an array A[1,n]
stores the minimum value A[i] of A[1,n] in the root node,
has the Cartesian tree for A[1,i−1] as the left subtree, and
has the Cartesian tree for A[i+1,n] as the right subtree.

Relation between Cartesian Tree and RMQ
RMQ(i,j) = lca(i,j), where lca is taken in the Cartesian tree between the nodes storing A[i] and A[j].

Property of Cartesian Tree
Lemma: If we add A[n] to the Cartesian tree for A[1,n−1], A[n] is stored on the path from the root to the rightmost leaf.
Proof: Because A[n] is the rightmost element in the array, it cannot be a left child.

Construction of Cartesian Tree
To add A[n] to the Cartesian tree for A[1,n−1]:
compare it with the elements on the path from A[n−1] up to the root, in this order;
if an element x smaller than A[n] appears, insert A[n] as the right child of x;
the former right child of x becomes the left child of A[n].
If no such x appears, A[n] becomes the new root and the old root becomes its left child.

Time Complexity
Lemma: The Cartesian tree is constructed in O(n) time.
Proof: Let c_i be the number of comparisons to insert A[i]. Then the total time complexity is $O\left(n + \sum_{i=1}^{n} c_i\right)$.
Each node on the rightmost path of the Cartesian tree that is found to be larger than A[i] moves into the left subtree of A[i] after the comparison with it, and never returns to the rightmost path. Therefore each node is compared in this way at most once, and each insertion makes at most one further comparison. This means that the total number of comparisons is O(n) and the construction time is O(n).
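The incremental construction and the relation RMQ(i,j) = lca(i,j) can be sketched as follows. Assumptions: 0-based indices, the rightmost path is kept on an explicit stack, and ties are broken so that the leftmost minimum ends up closest to the root (only strictly larger elements are popped); the slides do not fix a tie rule, and the names are illustrative.

```python
def cartesian_tree(a):
    """Build the Cartesian tree of a; returns (root, left, right) as indices."""
    n = len(a)
    left, right = [-1] * n, [-1] * n
    stack = []                       # rightmost path, root at the bottom
    for i in range(n):
        last = -1
        while stack and a[stack[-1]] > a[i]:
            last = stack.pop()       # each index is popped at most once: O(n)
        left[i] = last               # popped part becomes the left subtree
        if stack:
            right[stack[-1]] = i     # A[i] becomes the new rightmost node
        stack.append(i)
    return stack[0], left, right


def rmq_by_descent(root, left, right, i, j):
    """lca of nodes i and j: the first node on the root path inside [i, j]."""
    v = root
    while not (i <= v <= j):
        v = right[v] if v < i else left[v]
    return v


A = [1, 4, 3, 5, 4, 5, 3, 7]
root, left, right = cartesian_tree(A)
print(root, rmq_by_descent(root, left, right, 3, 5))
```

The descent in `rmq_by_descent` is only a way to check the lca relation; it takes time proportional to the depth of the tree, not O(1).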

BP Representation of Cartesian Tree
[Figure: the example array A, its Cartesian tree, the BP sequence P = ((()((())()(())))()((()(()))()(()))) and the sequence P'.]

Algorithm for RMQ
Preprocess
Convert A[1,n] to the Cartesian tree.
Convert the Cartesian tree to the BP sequence P.
Query
To compute the position m of the minimum in A[i,j]:
i' = select_()(P,i), j' = select_()(P,j)
Let m' be the position of the minimum in P'[i', j'].
m = rank_()(P,m') + 1

Complexities
Length of P: 4n
RMQ on BP sequence is computed in O(1) time using the range min-max tree and the sparse table.

Optimal-Space Data Structure for RMQ
Theorem: For an array of length n, there exists a data structure of 2n+o(n) bits supporting range minimum query in constant time.

2D-Min-Heap
Def: The 2D-Min-Heap M_A of an array A is a labeled and ordered tree with vertices v_0, …, v_n, where v_i is labeled with i (0 ≤ i ≤ n). For 1 ≤ i ≤ n, the parent node of v_i is v_j iff j < i, A[j] < A[i], and A[k] ≥ A[i] for all j < k ≤ i. The order of the children is chosen such that their labels are increasing from left to right.
Assume A[0] = −∞.

Lemma 1: Let M_A be the 2D-Min-Heap of A.
The node labels correspond to the preorder numbers of M_A (starting at 0).
Let i be a node with children x_1, …, x_k. Then A[i] < A[x_j] for all 1 ≤ j ≤ k, and A[x_j] ≤ A[x_{j−1}] for all 1 < j ≤ k.
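By the definition, the parent of v_i is the nearest position to the left of i holding a strictly smaller value, so all parents can be found with one stack scan. A minimal sketch, assuming the array is passed as a Python list of A[1..n] and A[0] = −∞ is added internally; the function name is hypothetical.

```python
import math


def two_d_min_heap_parents(values):
    """parent[i] for i = 1..n in the 2D-Min-Heap of A[1..n] = values."""
    a = [-math.inf] + list(values)   # a[0] = -infinity labels the root v_0
    parent = [None] * len(a)         # the root has no parent
    stack = [0]                      # labels whose values strictly increase
    for i in range(1, len(a)):
        while a[stack[-1]] >= a[i]:  # skip positions with A[k] >= A[i]
            stack.pop()
        parent[i] = stack[-1]        # nearest j < i with A[j] < A[i]
        stack.append(i)
    return parent


print(two_d_min_heap_parents([3, 1, 2]))
```

For A = 3, 1, 2 the nodes 1 and 2 are children of the root and node 3 is a child of node 2, in line with Lemma 1 (A[2] ≤ A[1] for consecutive children of the root).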

Lemma 2: Let M_A be the 2D-Min-Heap of A. For arbitrary nodes i and j, 1 ≤ i < j ≤ n, let z denote the LCA of i and j in M_A. Then if z = i, RMQ(i, j) is given by i, and otherwise, RMQ(i, j) is given by the child of z that is on the path from z to j.

Proof: Let T_x denote the subtree of M_A rooted at x.
Case z = i: This means that j is a descendant of i. Then all nodes i, i+1, …, j are in T_i, and A[i] is the minimum in the query range [i, j].
Case z ≠ i: Let x_1, …, x_k be the children of z. Let α and β (1 ≤ α < β ≤ k) be defined such that T_{x_α} contains i, and T_{x_β} contains j. Because z ≠ i, z < i. In other words, the LCA is not in the query range. Every node in [i, j] is in T_{x_γ} for some α ≤ γ ≤ β, and in particular x_γ ∈ [i, j] for all α < γ ≤ β. By Lemma 1, {x_γ : α < γ ≤ β} are the only candidate positions of the minimum in A[i,j], and x_β (the child of z on the path to j) is the position of the minimum.

We can compute RMQ by using preorder-select, lca, and level-ancestor on the BP representing M_A.
This can be simplified by using the DFUDS representation of M_A.
Lemma: Let U be the DFUDS of M_A. Then RMQ_A(i, j) (i < j) can be answered in O(1) time by
x := select_)(U, i+1), y := select_)(U, j)
w := RMQ_E(x, y)
if rank_)(U, findopen(U, w)) = i, return i
else return rank_)(U, w)
Here E is the excess sequence of U, and RMQ_E returns the leftmost minimum.
[Figure: E and U = ((()(())())(()())()) for the example array A.]
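The query procedure can be tested end to end with naive rank, select, findopen and range-minimum scans. This is a sketch, not the O(1)-time structure: positions in U are 1-based as on the slide, y is taken as select_)(U, j), the leftmost minimum of E is used, i = j is handled separately, and all function names are illustrative.

```python
import math


def dfuds_of_2d_min_heap(values):
    """DFUDS of the 2D-Min-Heap of A[1..n] = values (A[0] = -infinity)."""
    a = [-math.inf] + list(values)
    degree = [0] * len(a)
    stack = [0]
    for i in range(1, len(a)):
        while a[stack[-1]] >= a[i]:
            stack.pop()
        degree[stack[-1]] += 1       # stack top is the parent of node i
        stack.append(i)
    # Node labels are preorder numbers, so encoding in label order is DFUDS.
    return "(" + "".join("(" * d + ")" for d in degree)


def rank_close(u, p):
    """Number of ')' in U[1..p]."""
    return u[:p].count(")")


def select_close(u, k):
    """1-based position of the k-th ')'."""
    count = 0
    for p, c in enumerate(u, start=1):
        count += c == ")"
        if count == k:
            return p
    raise ValueError("fewer than k close parentheses")


def find_open(u, p):
    """1-based position of the '(' matching the ')' at position p."""
    depth = 0
    for q in range(p, 0, -1):
        depth += 1 if u[q - 1] == ")" else -1
        if depth == 0:
            return q
    raise ValueError("unbalanced sequence")


def rmq(u, i, j):
    """Position of a minimum of A[i..j], computed from U alone."""
    if i == j:
        return i
    x, y = select_close(u, i + 1), select_close(u, j)
    excess = [0]                                  # excess[p] = E[p]
    for c in u:
        excess.append(excess[-1] + (1 if c == "(" else -1))
    w = min(range(x, y + 1), key=excess.__getitem__)   # leftmost minimum
    if rank_close(u, find_open(u, w)) == i:
        return i
    return rank_close(u, w)


A = [1, 4, 3, 5, 4, 5, 3, 7]
U = dfuds_of_2d_min_heap(A)
print(U, rmq(U, 2, 6))
```

Note that the array A is not consulted by `rmq`; only U is used, which is the point of the 2n+o(n)-bit representation.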

Application of RMQ
Document Listing Query
Input: search pattern p, document collection d1, d2, ..., dk
Output: IDs of all documents containing p at least once
The document collection can be preprocessed (to construct some index).
Applications: Web search engines, full-text databases
[Figure: a pattern p occurring in documents d1, d4 and d5 out of d1, ..., d5; the query outputs 1, 4, 5.]

Inverted File
For each word, store a list of IDs of documents containing the word.
Doc. 1: she plays tennis.
Doc. 2: he plays tennis.

word     ID
she      1
plays    1,2
he       2
tennis   1,2

Cannot search arbitrary patterns

Generalized Suffix Trees
Suffix tree of a string made by concatenating all documents
Any pattern is represented by an interval [l,r] of the suffix array
The interval is obtained in O(|p| log σ) time
Example: T = acb$1bcb$2aba$3 (d1 = acb, d2 = bcb, d3 = aba)

Naive Algorithm for Document Listing Query
D: array of document IDs in lex. order of the suffixes
Obtain the interval [l, r] corresponding to p
Obtain all elements in D[l, r] and remove duplicates
Example for T = acb$1bcb$2aba$3: a: 1,3   b: 1,2,3   bc: 2   c: 1,2
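The naive algorithm is easy to reproduce for the three example documents. This is a sketch with swapped-in techniques: the suffixes of each document are sorted directly instead of building a generalized suffix tree, the interval is found by binary search on the sorted suffixes, and the sentinel "\uffff" assumes that no document contains that character.

```python
from bisect import bisect_left, bisect_right


def build_index(docs):
    """Sorted suffixes of all documents and the array D of document IDs."""
    suffixes = sorted(
        (doc[k:], d)
        for d, doc in enumerate(docs, start=1)
        for k in range(len(doc))
    )
    keys = [s for s, _ in suffixes]
    D = [d for _, d in suffixes]     # D[i]: document of the i-th suffix
    return keys, D


def interval(keys, p):
    """0-based inclusive interval [l, r] of suffixes having p as a prefix."""
    l = bisect_left(keys, p)
    r = bisect_right(keys, p + "\uffff") - 1
    return l, r                      # l > r means p does not occur


def list_documents_naive(keys, D, p):
    l, r = interval(keys, p)
    return sorted(set(D[l:r + 1]))   # remove duplicates: time O(r - l + 1)


keys, D = build_index(["acb", "bcb", "aba"])
for p in ("a", "b", "bc", "c"):
    print(p, list_documents_naive(keys, D, p))
```

The cost of the last step is proportional to the number of occurrences of p, not to the number of distinct documents.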

Muthukrishnan’s Algorithm
Optimal time algorithm: O(n) time preprocessing, O(|p| + output) time query (for constant size alphabet)

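Preprocess
Concatenate the documents into one text T, ending each document with its own end marker ($1, $2, ...).
[Figure: example documents d1, d2, ... and their concatenation T with end markers $1, ..., $4.]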

Construct the suffix tree and suffix array
For each node, store the interval [s,b] of the suffix array covered by its subtree.
[Figure: suffix tree and suffix array SA of T, with intervals such as [1,6], [7,9] and [10,12] stored at the nodes.]

Compute the array D
D[i]: the ID of the document containing the suffix SA[i]
Example: if SA[i] = 6 and suffix 6 lies in document d2, then D[i] = 2.

Compute an array C
C[i] = max{ j | j < i, D[j] = D[i] }
(if no such j exists, C[i] = −D[i])
Example: C[11] = 7, C[14] = 11
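The array C can be filled in one left-to-right pass that remembers the last position of each document ID. A minimal sketch; the 1-based layout with an unused entry at index 0 and the small array D below are made up for illustration and are not the example of the slide.

```python
def previous_occurrence(D):
    """C[i] = max{ j < i : D[j] = D[i] }, or -D[i] if no such j exists.

    D is 1-based: D[0] is unused. The returned C uses the same layout.
    """
    last = {}                        # document ID -> last position seen
    C = [None] * len(D)
    for i in range(1, len(D)):
        C[i] = last.get(D[i], -D[i])
        last[D[i]] = i
    return C


D = [None, 1, 3, 1, 2, 3, 1]         # hypothetical document array
print(previous_occurrence(D)[1:])
```

Each position is visited once, so the pass takes O(n) time with a hash table, or with a plain array indexed by document ID.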

Query Algorithm
Lemma: If a document k contains a pattern p, there exists a unique i ∈ [s,b] such that D[i] = k and C[i] < s.
If we output D[i] only for i ∈ [s,b] such that C[i] < s, we can enumerate all document IDs without duplication.

Query Algorithm
1. Obtain the interval [s,b] of the suffix array corresponding to p.
2. Find the position m of the minimum value of C in the current interval (initially [s,b]).
3. If C[m] < s, output D[m] and repeat step 2 for the two sub-intervals to the left and right of m ([s,m−1] and [m+1,b] at the first level).
4. If C[m] ≥ s, stop searching the current interval.
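The recursion of steps 2 to 4 can be sketched as below. Assumptions: D and C are 1-based lists with an unused entry at index 0, the example arrays are made up for illustration, and a linear scan stands in for the constant-time RMQ structure, so the running time of this sketch is not O(output).

```python
def list_documents(D, C, s, b):
    """IDs of the distinct documents in D[s..b], each reported once."""
    out = []

    def visit(lo, hi):
        if lo > hi:                  # empty sub-interval
            return
        # Stand-in for RMQ_C(lo, hi): position of the minimum of C[lo..hi].
        m = min(range(lo, hi + 1), key=C.__getitem__)
        if C[m] >= s:                # every ID here already occurs in [s, lo)
            return
        out.append(D[m])             # first occurrence of D[m] within [s, b]
        visit(lo, m - 1)
        visit(m + 1, hi)

    visit(s, b)
    return out


D = [None, 1, 3, 1, 2, 3, 1]         # hypothetical document array
C = [None, -1, -3, 1, -2, 2, 3]      # its previous-occurrence array
print(list_documents(D, C, 3, 6))
```

The comparison is always against s, the left end of the original interval, even inside the recursive calls; comparing against the left end of the sub-interval would report duplicates.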

Time Complexity
Time to compute the interval [s,b] for p: O(|p|)
Time to output: O(output)
Lemma: The number of range minimum queries is at most 2·output + 1.
Proof: Consider the search tree of the recursion. At an internal node, a new document is found and output. At a leaf, the search fails.
#internal nodes = output
#leaves = #internal nodes + 1
