1 FiST: Scalable XML Document Filtering by Sequencing Twig Patterns
VLDB 2005, Trondheim, Norway
Joonho Kwon, Praveen Rao, Bongki Moon, Sukho Lee
School of Electrical Engineering and Computer Science, Seoul National University
Department of Computer Science, University of Arizona

2 Roadmap
- Introduction
  - Background and Motivations
- Index Structure
  - Profile Sequences
  - Sequence Index
- Filtering Algorithm
  - Progressive Subsequence Matching
  - Refinement for Branch Node Verification
- Experimental Results
- Conclusions

3 Introduction
- Publish-subscribe systems
  - Selective dissemination of information (SDI)
  - User profiles (or standing queries)
  - New content is matched against the user profiles and delivered to interested users
- XML document filtering
  - User profiles (twig patterns) are specified in the XPath language
  - An incoming XML document is delivered to the users whose profiles have a match in the document
  - The roles of twig patterns and XML documents are reversed: the profiles are indexed and each document is matched against them
- Challenges
  - Minimize the filtering cost by effectively organizing a large number of user profiles
  - Achieve good scalability

4 Introduction
- XFilter (VLDB'00) and YFilter (TODS'03)
  - XFilter: each path expression is mapped to an FSM
  - YFilter: a single NFA for all XPath expressions, with shared processing
- Motivations
  - Develop a scalable XML filtering system that supports processing of twig patterns
  - Support holistic matching of twig patterns, without first matching the linear paths in the patterns and then merging these matches during post-processing
  - Inherently support ordered matching, where the nodes in the twig pattern follow the document order in XML

5 Tree to Sequence Transformation
- Extended Prüfer Sequences (PRIX, ICDE'04)
  - Extend the leaf nodes of the tree with dummy child nodes
[Figure: tree T with numbered nodes (B,2), (D,4), (B,5), (C,7), (C,8), (A,9), extended with dummy child nodes (d,1), (d,3), (d,6). LPS(T): B B D B A C C A]
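The transformation on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes nodes are numbered in postorder (as the node numbers on the slide suggest), so that deleting the smallest-numbered leaf first is the same as visiting nodes in postorder and emitting each deleted node's parent label. The `Node` class and the `lps` function are hypothetical names, and the shape of tree T is inferred from the node numbers and from LPS(T).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def lps(root: Node) -> List[str]:
    """Labeled Pruefer sequence of a tree whose leaves are extended with dummy children."""
    out: List[str] = []

    def visit(node: Node, parent: Optional[Node]) -> None:
        if not node.children:
            # Deleting the dummy child of a leaf emits the leaf's own label.
            out.append(node.label)
        for child in node.children:
            visit(child, node)
        if parent is not None:
            # Deleting this node emits the label of its parent.
            out.append(parent.label)

    visit(root, None)
    return out


# Tree T from the slide (shape inferred): A(B(B, D), C(C))
tree_t = Node("A", [Node("B", [Node("B"), Node("D")]), Node("C", [Node("C")])])
print(" ".join(lps(tree_t)))  # B B D B A C C A
```

Because the dummy children are handled implicitly, the input tree does not need to be modified before the sequence is computed.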

6 Sequence Representation
- User profile (twig pattern) Q1: /A[B//D]//E[G]/F
  - LPS(Q1): D B A G E F E A
- XML document (tree T)
  - LPS(T): B B D B A C C A
[Figure: twig pattern Q1 and tree T]

7 FiST
[Figure: system components: user profiles, profile sequences, Sequence Index, filtering algorithm, XML documents]

8 Index Structure: Profile Sequence
- Each twig pattern is mapped to a profile sequence
  - A profile sequence is an ordered list of nodes
  - Each node has four attributes: Label, Qid, Pos, Sym
Q1: /A[B//D]//E[G]/F, LPS(Q1) = D B A G E F E A

Label  D   B   A   G   E   F   E    A
Qid    1   1   1   1   1   1   1    1
Pos    1   2   3   4   5   6   7    8
Sym    //  /   $   /   $   /   $//  $#

Sym legend: // = ancestor-descendant, / = parent-child, $ = branch, $// = branch + ancestor-descendant, $# = branch + root node

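9 Index Structure: Sequence Index
- Dynamic hash-based index with one bucket per label (A, B, C, D, E in the figure)
- Buckets hold pointers to nodes in the profile sequences
Q1: /A[B//D]//E[G]/F
Q2: //B[E]/C
[Figure: bucket D points to node Q1,1 (Label D, Qid 1, Pos 1, Sym //); bucket E points to node Q2,1 (Label E, Qid 2, Pos 1, Sym /)]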
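The two index structures can be pictured with a small sketch. It assumes a plain Python dictionary stands in for the dynamic hash index and that only the first node of each profile sequence is registered initially, as the figure suggests. `ProfileNode`, `make_profile` and `build_sequence_index` are hypothetical names, and the Sym values beyond the first node of each sequence are a reading of the flattened tables on the slides.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ProfileNode:
    label: str  # element name
    qid: int    # id of the twig pattern this node belongs to
    pos: int    # 1-based position in the profile sequence
    sym: str    # '/', '//', '$', '$//' or '$#'


def make_profile(qid: int, labels: str, syms: str) -> List[ProfileNode]:
    """Build a profile sequence from space-separated labels and Sym values."""
    pairs = zip(labels.split(), syms.split())
    return [ProfileNode(l, qid, i, s) for i, (l, s) in enumerate(pairs, start=1)]


def build_sequence_index(
    profiles: Dict[int, List[ProfileNode]]
) -> Dict[str, List[Tuple[int, int]]]:
    """Hash each profile's first node by label; entries are (Qid, Pos) pointers."""
    index: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for seq in profiles.values():
        first = seq[0]
        index[first.label].append((first.qid, first.pos))
    return index


profiles = {
    1: make_profile(1, "D B A G E F E A", "// / $ / $ / $// $#"),  # Q1
    2: make_profile(2, "E B C B", "/ $ / $#"),                     # Q2
}
index = build_sequence_index(profiles)
print(dict(index))  # {'D': [(1, 1)], 'E': [(2, 1)]}
```

Storing (Qid, Pos) pairs rather than the nodes themselves keeps the index small; the node attributes are looked up in the profile sequence when a bucket is hit.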

10 Our Filtering Algorithm
- Progressive Subsequence Matching
  - Property: if tree Q is a subtree of tree T, then LPS(Q) is a subsequence of LPS(T)
    (Praveen Rao and Bongki Moon. PRIX: Indexing and Querying XML using Prüfer sequences, ICDE'04)
  - Identify the profile sequences that have a matching subsequence in the document sequence
  - This is a necessary but not a sufficient condition
- Refinement for Branch Node Verification
  - The progressive subsequence matching phase is followed by a refinement phase that discards false matches
  - The connectivity of the branch nodes in the candidates (twig patterns) is verified
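The subsequence property is easy to check in isolation. The sketch below is only the necessary-condition test, not the full filter: `is_subsequence` is a hypothetical helper, the two sequences are LPS(T) and LPS(Q3) as they appear on a later slide, and the third sequence is made up to show a rejection.

```python
from typing import Iterable, Sequence


def is_subsequence(needle: Sequence[str], haystack: Iterable[str]) -> bool:
    """True if the labels of `needle` occur in `haystack` in order (gaps allowed)."""
    it = iter(haystack)
    # `label in it` consumes the iterator up to the first hit, so order is enforced.
    return all(label in it for label in needle)


doc_lps = "D B E B A C B A G E F E F E A".split()
print(is_subsequence("E B C B".split(), doc_lps))  # True: a candidate
print(is_subsequence("G B A".split(), doc_lps))    # False: made-up sequence, discarded
```

A profile that passes this test is only a candidate; the refinement phase still has to confirm that its branch nodes are connected in the document.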

11 Progressive Subsequence Matching
- The sequence representation of the document can be constructed as the document is parsed (e.g., by a SAX parser)
- The subsequence matching phase is progressive
  - The sequence representation of the document is generated incrementally, and the profile sequences (of twig patterns) that are subsequences of it are identified in steps
- Runtime global stack
  - The stack stores the node labels on the path from the document node currently being processed up to the root
  - Elements are pushed onto and popped from the stack in document traversal order
  - The stack size is bounded above by the depth of the document

12 Incremental Generation of LPS
- Start of an element: push
- End of a leaf: output, pop, output
- End of a non-leaf: pop, output
LPS(T): D B E B A C B A G E F E F E A
[Figure: document tree T and the contents of the runtime stack while T is parsed]
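The push/pop rules on this slide map directly onto SAX callbacks. The following sketch uses Python's standard `xml.sax` module and is an illustration under assumptions, not the authors' code: it reads "output" as "emit the node's own label" for a leaf before the pop and "emit the label at the new top of the stack" after the pop. `LPSHandler` and `document_lps` are hypothetical names, and the XML string is a guess at the document behind the slide's LPS(T).

```python
import xml.sax
from typing import List


class LPSHandler(xml.sax.ContentHandler):
    """Emits the document's LPS incrementally while the document is parsed."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[list] = []  # [label, has_child] entries, root first
        self.lps: List[str] = []

    def startElement(self, name, attrs):
        if self.stack:
            self.stack[-1][1] = True  # the enclosing element is not a leaf
        self.stack.append([name, False])

    def endElement(self, name):
        label, has_child = self.stack.pop()
        if not has_child:
            self.lps.append(label)              # leaf: output its own label
        if self.stack:
            self.lps.append(self.stack[-1][0])  # output the parent's label


def document_lps(xml_text: str) -> List[str]:
    handler = LPSHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lps


doc = "<A><B><D/><E/></B><B><C/></B><E><G/><F/><F/></E></A>"
print(" ".join(document_lps(doc)))  # D B E B A C B A G E F E F E A
```

The stack never holds more entries than the depth of the document, which matches the bound stated on the previous slide.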

13 Progressive Subsequence Matching
- The Sequence Index is used to find all matching profiles simultaneously while parsing the document only once
- Each label of the document's Prüfer sequence is used as the hash key into the Sequence Index
- Additional tasks are performed based on the Sym attribute value (e.g., '/', '//', '$') of profile sequence nodes, using the runtime stack, to eliminate most false matches
  - The remaining false matches are removed during the refinement phase

14 Conceptual View
- The matching process progresses by copying nodes of the profile sequences into the Sequence Index (each copy denotes a transition in a state machine)
- When the last node of a profile sequence is reached, the profile is a match
[Figure: Sequence Index with buckets A, B, C, D, G and entries Q1,1, Q1,2, Q1,3, next to the profile sequence of Q1]

Profile sequence of Q1:
Label  D   B   A   G   E   F   E    A
Qid    1   1   1   1   1   1   1    1
Pos    1   2   3   4   5   6   7    8
Sym    //  /   $   /   $   /   $//  $#
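The copy-on-match idea can be sketched without the Sym checks. This simplified version assumes each profile is just a list of labels, and it ignores the runtime stack and the refinement phase, so it reports candidates rather than verified matches. `progressive_match` is a hypothetical name, and profile 99 is made up to show a profile that never completes.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def progressive_match(profiles: Dict[int, List[str]], doc_lps: List[str]) -> Set[int]:
    """Return the ids of profiles whose sequence is a subsequence of doc_lps."""
    # index[label] holds (qid, pos): profile qid is waiting for `label` at 0-based pos.
    index: Dict[str, Set[Tuple[int, int]]] = defaultdict(set)
    for qid, seq in profiles.items():
        index[seq[0]].add((qid, 0))

    candidates: Set[int] = set()
    for label in doc_lps:
        # Iterate over a snapshot so a node copied in this step is not matched
        # against the same document label.
        for qid, pos in list(index.get(label, ())):
            seq = profiles[qid]
            if pos + 1 == len(seq):
                candidates.add(qid)  # last node reached
            else:
                nxt = seq[pos + 1]
                index[nxt].add((qid, pos + 1))  # copy the next node into the index
    return candidates


profiles = {
    1: "D B A G E F E A".split(),  # LPS(Q1)
    3: "E B C B".split(),          # LPS(Q3)
    99: "G B A".split(),           # made-up profile with no match
}
doc_lps = "D B E B A C B A G E F E F E A".split()
print(sorted(progressive_match(profiles, doc_lps)))  # [1, 3]
```

Each document label triggers one hash lookup, so all profiles advance together in a single pass over the document sequence.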

15 Progressive Subsequence Matching
- The runtime stack contains a section of the document LPS up to the nearest branch node
LPS(T): ... E D C B A ... C B A ...
[Figure: XML document T and the runtime stack holding E, D, C, B, A (top to bottom)]

16 Progressive Subsequence Matching
- Benefits of the runtime stack
  - Testing relationships during subsequence matching based on the Sym attribute values '/' and '//'. Let q and q' denote two consecutive nodes in the profile sequence:
    - TestPC(q, q'): tests the parent-child relationship (/) between q' and q in the document
    - TestAD(q, q'): tests the ancestor-descendant relationship (//) between q' and q in the document
  - Avoiding frequent node copies to the Sequence Index
  - Limiting the range of subsequence matching
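The two relationship tests can be illustrated on a root-to-current path, which is what the runtime stack holds. This is a simplified sketch under assumptions: the slides do not give the signatures of TestPC and TestAD, so the hypothetical helpers `has_parent` and `has_ancestor` below take the stack as a list of labels (root first) and the depth of the document node matched to q.

```python
from typing import List


def has_parent(path: List[str], depth: int, parent_label: str) -> bool:
    """Parent-child test ('/'): the entry directly above `depth` carries parent_label."""
    return depth > 0 and path[depth - 1] == parent_label


def has_ancestor(path: List[str], depth: int, ancestor_label: str) -> bool:
    """Ancestor-descendant test ('//'): some entry above `depth` carries ancestor_label."""
    return ancestor_label in path[:depth]


# Root-to-current path A/B/C/D/E; the current node E sits at depth 4.
path = ["A", "B", "C", "D", "E"]
print(has_parent(path, 4, "D"))    # True
print(has_parent(path, 4, "C"))    # False
print(has_ancestor(path, 4, "C"))  # True
```

Both tests only inspect the stack, so they need no access to parts of the document that have already been parsed and discarded.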

17 Testing Relationships between Nodes
- TestAD and TestPC are applied recursively up to the nearest branch node without a '/' or '//'
[Figure: XML document T, the runtime stack holding E, D, C, B, A (top to bottom), twig pattern Q2, and the Sequence Index with entry Q2,1]

Profile sequence of Q2:
Label  E   C   B   F   B
Qid    2   2   2   2   2
Pos    1   2   3   4   5
Sym    //  /   $   /   $#

18 Avoiding Frequent Node Copies
[Figure: XML document T, the runtime stack holding E, D, C, B, A (top to bottom), twig pattern Q2 with profile sequence E C B F B, and the Sequence Index with entries Q2,1 and Q2,4; one profile sequence node is marked "Not copied"]

19 Limiting the Range of Subsequence Matching
- LPS(T): ... E D C B A ... C B ...
- LPS(Q2): E C B F B
- C and E do not share an ancestor-descendant relationship
[Figure: XML document T, the runtime stack holding E, D, C, B, A (top to bottom), and twig pattern Q2]

20 Refinement Phase
- The connectedness property of the branch nodes in the candidates (twig patterns) must be tested to identify true matches
- To enable the refinement process
  - Branch node processing: performed on the branch nodes of the profile sequences during subsequence matching
- The refinement phase
  - Root node processing: performed on the last node of the profile sequence
  - Uses the information collected during branch node processing

21 Branch and Root Node Processing
- Root node processing: the intersection of the BranchID sets of each branch node in the candidate twig pattern is tested
- BranchID sets store node ids
Twig pattern Q3: //B[E]/C, LPS(Q3): E B C B
XML document T: nodes (A,1), (B,2), (D,3), (E,4), (B,5), (C,6), (E,7), (G,8), (F,9), (F,10)
LPS(T): D B E B A C B A G E F E F E A
[Figure: XML document T, the runtime stack, twig pattern Q3, and BranchID sets labeled 5 and 2]

Profile sequence of Q3:
Label  E   B   C   B
Qid    3   3   3   3
Pos    1   2   3   4
Sym    /   $   /   $#
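The intersection test can be sketched as follows. This is an interpretation, not the authors' procedure: it assumes that, for one branch node of the twig, each branch records the ids of the document nodes that matched the branch node, and that the branches are connected only if some document node id appears in every set. `branch_is_connected` is a hypothetical name, and the example sets are derived from the node numbering of T above (E4 lies under B2, C6 lies under B5).

```python
from typing import Iterable, Set


def branch_is_connected(branch_id_sets: Iterable[Set[int]]) -> bool:
    """True if at least one document node id is shared by all BranchID sets."""
    sets = [set(s) for s in branch_id_sets]
    if not sets:
        return False
    return bool(set.intersection(*sets))


# Q3 = //B[E]/C on document T: the E branch was seen under B2, the C branch under B5.
print(branch_is_connected([{2}, {5}]))  # False: Q3 is a false match
# A branch node whose two branches both hang off document node 7.
print(branch_is_connected([{7}, {7}]))  # True
```

Under this reading, Q3 passes subsequence matching on T but is discarded here, which is the kind of false match the refinement phase exists to remove.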

22 FiST: Architecture Overview
[Figure: XML filtering engine. Components: XPath twig patterns (user profiles), XPath parser, sequence transformation, Sequence Index + profile sequences, XML document, SAX parser, filtering algorithm, users. The filtered document is sent to the users.]

23 Experimental Results
- We measured the performance of FiST and YFilter for a variety of XML document sizes and twig patterns
- Experimental setup
  - 2.4 GHz Pentium IV with 512 MB RAM running Linux
- Datasets
  - Synthetic Treebank data generated with IBM's XML data generator
  - 1000 documents were generated using the Treebank DTD
  - Elements are recursive; the maximum document depth was 36
  - Dataset sizes
    - [1KB, 10KB): 1k
    - [10KB, 20KB): 10k
    - [20KB, 30KB): 20k
    - [30KB, 123KB): 30k

24 Experimental Results
- User profiles (twig patterns) were generated using the XPath Generator in YFilter
  - Uniform: z = 0
  - Skewed: z = 0.9
  - Maximum depth: 10
  - Number of branches: 3 to 7
  - Number of twig patterns: 50000 to 150000
- For each twig set and document set, we measured the average filtering cost per document
  - Filtering time + document parsing time

25 Experimental Results
- We compared YFilter and FiST by observing the trends in filtering cost for three different aspects of scalability
  - Number of twig patterns
  - Number of branches
  - Size of input documents

26 Experimental Results
- FiST was implemented in C++ and YFilter was developed in Java
- For fairness of comparison, we chose the following evaluation metric
  - scaleup = (observed - base) / base, where observed and base are wall clock times (document parsing + filtering)
- We observed that FiST scaled better than YFilter under various situations
  - FiST's filtering cost decreased as the number of matching user profiles decreased
  - YFilter's filtering cost increased as the size of the twig patterns increased

27 Varying XML Document Sizes
- We measured the scaleup for FiST and YFilter
- FiST's filtering cost grew more slowly than YFilter's
[Figure: two panels, uniform and skewed]

28 Varying Number of Branches
- We measured the scaleup for FiST and YFilter
- Increasing the number of branches reduced the number of matched profiles
[Figure: two panels, FiST (uniform, 20k) and YFilter (uniform, 20k)]

29 Varying Number of Twig Patterns
- We measured the wall clock time for FiST and YFilter
- FiST was significantly faster than YFilter for the 20k and 30k datasets
[Figure: two panels, uniform and skewed]

30 Conclusions
- We have developed an XML filtering system called FiST that supports holistic matching of twig patterns
  - Avoids first matching the linear paths in the twigs and then merging the matches during post-processing
  - Transforms twig patterns into profile sequences
- Inherent support for ordered matching of twig patterns
- Runtime stack
  - The stack size is bounded above by the depth of the document
- Holistic matching yielded good scalability for our filtering system under various situations

31 Questions?
For more information: www.cs.arizona.edu/~bkmoon

