Presentation on theme: "5th July 2004CPM 20041 A Simple Optimal Representation for Balanced Parentheses Richard Geary, Naila Rahman, Rajeev Raman (University of Leicester, UK)"— Presentation transcript:
5th July 2004CPM A Simple Optimal Representation for Balanced Parentheses Richard Geary, Naila Rahman, Rajeev Raman (University of Leicester, UK) and Venkatesh Raman (Institute for Mathematical Sciences, Chennai, India)
5th July 2004CPM A Parentheses Data Structure Given: Balanced string of 2n parentheses. ( ( ( ( ) ) ) ( ) ( ) ) Support operations: –ENCLOSE ( i ) –FINDCLOSE ( i ), FINDOPEN( i ) –EXCESS ( i ) Applications to suffix tree, ordinal trees and stack- sortable permutations.
5th July 2004CPM Parentheses Representation 2n bits, O(n) time. Θ(n lg n ) bits, O(1) time. O(n) bits, O(1) time. [Jacobson, `89] 2n+o(n) bits, O(1) time. [Munro, Raman, `01] 2n+o(n) bits, O(1) time. New data structure. Our new DS – is simpler (no perfect hash tables), – smaller o(n) term, – uniform o(n) time and space construction algorithm. Implemented and shown to be quite practical – far more compact than D/S using naïve representation, – speed comparable to D/S using naïve representation.
5th July 2004CPM XML XML: eXtensible Markup Language –de facto standard for electronic data interchange. Document Object Model (DOM) standard API for manipulating XML documents –holds all data in memory, –large memory usage.
5th July 2004CPM Example XML document Bill Bloggs 1 April 1961 DOM NODE interface has methods PARENT(x), NEXTSIB(x), PREVSIB(x), LASTCHILD(x),FIRSTCHILD(x) person name firstname surname day month year dob
5th July 2004CPM Succinct DOM Succinct DOM: –uses far less space than standard DOM, –performance competitive with DOM. Node interface implemented by natural parentheses ops. Operations supported by parentheses data structures –Jacobson `89, –Munro and Raman `01, –Our new data structure.
5th July 2004CPM Our new D/S Input: balanced string of 2n parentheses. Assume recursive data structure to store balanced string of 2N 2n parentheses. If N is O(n / lg 2 n) store answers explicitly for every pair of parentheses. Otherwise Divide into blocks of size Number of blocks
5th July 2004CPM FINDCLOSE(x) ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ) FINDCLOSE(3)? Matching parenthesis inside block – near parenthesis. Pre-computed table stores position of matching parentheses for all near parentheses. –O(1) time if near parenthesis. –Table size is
5th July 2004CPM Pioneer Parentheses FINDCLOSE(5)? Matching parenthesis outside block – far parenthesis. b(p) = block# of parenthesis at position p = position of match of p q is 1 st far parenthesis before p p is pioneer if At most 2β-3 open pioneers. Similarly at most 2β-3 close pioneers. ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
5th July 2004CPM Pioneer Family Pioneer family: set of all opening and closing pioneers along with their matching parentheses. Balanced string of size at most 4β-6. ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ) ( ( ) )
5th July 2004CPM Our D/S ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ) ( ( ) ) NND 2N O(N / lg N) Two levels of recursion. When pioneer family is O(N/lg2N) we store explicit answers.
5th July 2004CPM Space usage NND uses O(N lg lg N / lg N) bits. Tables use O( N lg lg N / lg N) bits. S(n) = 2n+ O(n lg lg n / lg n) = 2n +o(n) bits.
5th July 2004CPM Pseudo-pioneers Near blocks: blocks which have no pioneers. Insert pseudo-pioneers at start and end of every near block. –Pseudo-pioneers do not effect FINDOPEN(x), FINDCLOSE(x), ENCLOSE(x) Gap between pioneers now at most 2B = O(lg N).
5th July 2004CPM NND 2n-bit vector used to find the pioneer for a far parenthesis. If pioneer at pos i in parentheses string then 1 at i in NND. Operations we need: –Find address of most recent 1 at position i r = Rank(i) p = Select(r) –Find i th 1in bit vector p = Select(i) We want succinct representation. D/S should be simple and fast. ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
5th July 2004CPM NND Bit vector of length M with N 1s. Gap between 1s at most (lg M) c. t = lg M / 2 c lg lg M.
5th July 2004CPM Select(i) Find i th 1 in bit vector. Array A1 stores position of every t th 1 –Space is Array A2 stores gaps between consecutive 1s –Space is O( N lg lg M ) or O( M lg lg M / lg M ) bits. Table T1 allows us to lookup sum of upto t gap. –Space is SELECT(i) i’ = i’’ = (i+1)/mod t y = concat of A2[i’+1],..,A2[i’+i’’] return A1[x] + T1[y]
5th July 2004CPM Rank(i) Prefix sum at position i. Need two more arrays and tables of size at most O(M lg lg M / lg M) bits.
5th July 2004CPM Implementation Details C++ on Sun UltraSparc-III and Pentium 4. Implemented new and optimised Jacobson D/S. CenterPoint XML for DOM. Sample of 12 XML documents of varying sizes and node counts. Blocksizes 32, 64, 128 and 256. Test was depth first tree walk, counting nodes of a given XML type.
5th July 2004CPM Space usage and performance Space usage for tree structure –Std DOM: 96 bits per node. –Jacobson: 3.3 – 16 bits per node. –New D/S: 2.9 – 12.8 bits per node. Avg performance for succinct D/S relative to std DOM –UltraSparc: 1 to 2.5 times slower. –Pentium 4: 1.7 to 4 times slower.
5th July 2004CPM Conclusions and Future work Conceptually simple succinct representation for balanced parentheses with O(1) time ops. o(n) time and space construction algorithm. Improved lower bound term for space bound. Relative performance very good on UltraSparc but poorer on Pentium 4, which has small cache –Cache optimisation is an interesting problem. Complete set of D/S for succinct DOM.