S EQUENCE I NDEXING S CHEMES Roman Čížek Erasmus 2687, Nelly Vouzoukidou MET601.

Slides:

Advertisements

Similar presentations

XML: Extensible Markup Language

Advertisements

February 12, 2007 WALCOM '2007 1/22 DiskTrie: An Efficient Data Structure Using Flash Memory for Mobile Devices N. M. Mosharaf Kabir Chowdhury Md. Mostofa.

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.

Twig 2 Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents Songting Chen, Hua-Gang Li *, Junichi Tatemura Wang-Pin Hsiung,

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna

I/O-Algorithms Lars Arge Fall 2014 September 25, 2014.

B+-Trees (PART 1) What is a B+ tree? Why B+ trees? Searching a B+ tree

Binary Trees, Binary Search Trees CMPS 2133 Spring 2008.

Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.

1 CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Ming Li.

QUANZHONG LI BONGKI MOON Indexing & Querying XML Data for../Regular Path Expressions/* SUNDAR SUPRIYA.

Binary Trees, Binary Search Trees COMP171 Fall 2006.

ViST: a dynamic index method for querying XML data by tree structures Authors: Haixun Wang, Sanghyun Park, Wei Fan, Philip Yu Presenter: Elena Zheleva,

Presentation for Cmpe-521 VIST – Virtual Suffix Tree Prepared by: Evren CEYLAN – Aslı UYAR

CS 171: Introduction to Computer Science II

Modern Information Retrieval

BTrees & Bitmap Indexes

On Demand String Sorting over Unbounded Alphabets Carmel Kent Moshe Lewenstein Dafna Sheinwald.

Design a Data Structure Suppose you wanted to build a web search engine, a la Alta Vista (so you can search for “banana slugs” or “zyzzyvas”) index say.

Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.

Lec 15 April 9 Topics: l binary Trees l expression trees Binary Search Trees (Chapter 5 of text)

1 Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Amnon Shochot.

B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.

XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.

B + -Trees COMP171 Fall AVL Trees / Slide 2 Dictionary for Secondary storage * The AVL tree is an excellent dictionary structure when the entire.

E.G.M. PetrakisB-trees1 Multiway Search Tree (MST)  Generalization of BSTs  Suitable for disk  MST of order n:  Each node has n or fewer sub-trees.

1 Efficiently Mining Frequent Trees in a Forest Mohammed J. Zaki.

Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.

Data Structures Arrays both single and multiple dimensions Stacks Queues Trees Linked Lists.

Chapter 19: Binary Trees. Objectives In this chapter, you will: – Learn about binary trees – Explore various binary tree traversal algorithms – Organize.

Database Management 9. course. Execution of queries.

CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina

Oct 29, 2001CSE 373, Autumn External Storage For large data sets, the computer will have to access the disk. Disk access can take 200,000 times longer.

A Summary of XISS and Index Fabric Ho Wai Shing. Contents Definition of Terms XISS (Li and Moon, VLDB2001) Numbering Scheme Indices Stored Join Algorithms.

Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.

Lecture 10 Trees –Definiton of trees –Uses of trees –Operations on a tree.

1 Trees A tree is a data structure used to represent different kinds of data and help solve a number of algorithmic problems Game trees (i.e., chess ),

CSCI 115 Chapter 7 Trees. CSCI 115 §7.1 Trees §7.1 – Trees TREE –Let T be a relation on a set A. T is a tree if there exists a vertex v 0 in A s.t. there.

Binary Trees, Binary Search Trees RIZWAN REHMAN CENTRE FOR COMPUTER STUDIES DIBRUGARH UNIVERSITY.

Web Data Management Indexes. In this lecture Indexes –XSet –Region algebras –Indexes for Arbitrary Semistructured Data –Dataguides –T-indexes –Index Fabric.

B + -Trees. Motivation An AVL tree with N nodes is an excellent data structure for searching, indexing, etc. The Big-Oh analysis shows that most operations.

Starting at Binary Trees

1 Tree Indexing (1) Linear index is poor for insertion/deletion. Tree index can efficiently support all desired operations: –Insert/delete –Multiple search.

QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates Changqing Li,Tok Wang Ling.

Tree-Pattern Queries on a Lightweight XML Processor MIRELLA M. MORO Zografoula Vagena Vassilis J. Tsotras Research partially supported by CAPES, NSF grant.

Sets of Digital Data CSCI 2720 Fall 2005 Kraemer.

CS4432: Database Systems II Query Processing- Part 2.

Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Attila Barta Mariano P. Consens Alberto O. Mendelzon University of Toronto.

R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman.

Rooted Tree a b d ef i j g h c k root parent node (self) child descendent leaf (no children) e, i, k, g, h are leaves internal node (not a leaf) sibling.

Author: Haoyu Song, Murali Kodialam, Fang Hao and T.V. Lakshman Publisher/Conf. : IEEE International Conference on Network Protocols (ICNP), 2009 Speaker:

XPath --XML Path Language Motivation of XPath Data Model and Data Types Node Types Location Steps Functions XPath 2.0 Additional Functionality and its.

Tree-Pattern Queries on a Lightweight XML Processor MIRELLA M. MORO Zografoula Vagena Vassilis J. Tsotras Research partially supported by CAPES, NSF grant.

Query Caching and View Selection for XML Databases Bhushan Mandhani Dan Suciu University of Washington Seattle, USA.

BINARY TREES Objectives Define trees as data structures Define the terms associated with trees Discuss tree traversal algorithms Discuss a binary.

Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From

1 Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach Presenter: Qi He.

CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007.

Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.

Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)

Indexing and Querying XML Data for Regular Path Expressions Quanzhong Li and Bongki Moon Dept. of Computer Science University of Arizona VLDB 2001.

Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.

Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi

Tries 07/28/16 11:04 Text Compression

Lecture 2- Query Processing (continued)

Early Profile Pruning on XML-aware Publish-Subscribe Systems

Multiway Search Tree (MST)

Lecture 11: B+ Trees and Query Execution

Presentation transcript:

S EQUENCE I NDEXING S CHEMES Roman Čížek Erasmus 2687, Nelly Vouzoukidou MET601

I NTRODUCTION Graph indexes precise Path, (twig only few methods) Sequence indexing schemes Top-down or bottom-up XML document and XML queries in structure-encoded sequences Path and twig

T OP -D OWN S EQUENCE I NDEXES : V I ST

V I ST – V IRTUAL S UFFIX T REE Top-down Sequence Indexes Represent XML documents and XML queries in structure-encoded sequences Querying XML data is equivalent to finding subsequence matching Avoid to expensive join operations Provides unified index on both content and structure Support dynamic index update B + Trees which are supported in DBMSs

DTD OF PURCHASE RECORDS

A S INGLE P URCHASE R ECORD

P REORDER S EQUENCE OF XML Use capital letters to represent names of elements/attributes Use hash function h(), to encode attribute values into integers v 1 = h(“dell”) v 2 =h(“ibm”) Preorder sequence of XML purchase record example PSNv 1 IMv 2 Nv 3 IMv 4 INv 5 Lv 6 BLv 7 Nv 8 Isomorphic trees may produce different preorder seq. DTD schema embodies linear order of all elements/attributes Without DTD – use lexicographical order

S TRUCTURE -E NCODED S EQUENCE Definition: A Structure-Encoded Sequence, derived from a prefix traversal of semi-structured XML document, is a sequence of (symbol, prefix) pairs: D = (a 1,p 1 ), (a 2,p 2 ),…, (a n,p n ) Where a i represents a node in the XML document tree, (of which a 1, …,a n is the preorder sequence), and p i is the path from the root node to node a i.

S TRUCTURE -E NCODED S EQUENCE D= (P,ϵ),(S,P),(N,PS),(v 1,PSN),(I,PS),(M,PSI),(v 2,PSIM),(N,PSI), (v 3,PSIN),(I,PSI),(M,PSII),(v 4,PSIIM),(I,PS),(N,PSI),(v 5,PSIN), (L,PS),(v 6,PSL),(B,P),(L,PB),(v 7,PBL),(N,PB),(v 8,PBN)

XML Q UERIES IN G RAPH F ORM

XML Q UERIES IN P ATH E XPRESSION AND S EQUENCE F ORM Query: Path Expression Structure-Encoded Sequence Q1 : /Purchase/Seller/Item/Manufacturer (P, ϵ)(S,P)(I,PS)(M,PSI) Q2 : /Purchase[Seller[Loc = v 5 ]]/Buyer[Loc = v 7 ] (P, ϵ)(S,P)(L,PS)(v 5,PSL)(B,P)(L,PB)(v 7,PBL) Q3 : /Purchase/*[Loc = v 5 ] (P, ϵ)(L, P)(v 5,P*L) Q4 : /Purchase//Item[Manufacturer = v 3 ] (P, ϵ)(I,P//)(M, P//I)(v 3,P//IM)

Q UERYING XML THROUGH S TRUCTURE -E NCODED S EQUENCE M ATCHING Querying XML is equivalent to finding (non-contiguous) subsequence matches Most structural XML queries can be performed through direct subsequence matching Exception: branch has multiple identical child nodes Q 5 =/A[B/C]/B/D Two different sequences (A, ϵ)(B,A)(C,AB)(B,A)(D,AB) (A, ϵ)(B,A)(D,AB)(B,A)(C,AB) Find matches separately and union their result We may find false matches if the indexed documents contain branches with identical child nodes, then we ask multiple queries and compute set difference on result If the query contains a large number of same child nodes under the branch, we can choose disassemble the tree into multiple trees and use join operations to combine their results

A LGORITHMS Naïve algorithm RIST – Relationships Indexed Suffix Tree ViST – Virtual Suffix Tree

N AÏVE ALGORITHM : S UFFIX -T REE -L IKE STRUCTURE Doc1 : (P, ϵ)( S, P)(N, PS)(v 1, PSN)(L, PS)(v 2, PSL) Doc2 : (P, ϵ)(B, P)(L, PB)(v 2, PBL) Q1 : (P, ϵ)(B, P)(L,PB)(v 2, PBL) Q2 : (P, ϵ)(L, P*)(v 2,P*L)

D-A NCESTORSHIP AND S-A NCESTORSHIP D-Ancestorship Ancestor-descendant relationships in original XML tree Element (S,P) is a D-Ancestorship of (L,PS) S-Ancestorship Ancestor-descendant relationships in suffix tree Element (v 1, PSN) is an S-Ancestorship of (L, PS)

N AÏVE SEARCH :A NAÏVE ALGORITHM BASED ON SUFFIX TREES

RIST – I NDEXING C ONSTRUCTION S-Ancestorship requires additional information Label each suffix tree node x by pair n x prefix traversal order of x in suffix tree size x is total number of descendants of x in suffix tree x …, y … x is S-Ancestor of node y if n y ϵ (n x, n x + size x ] Construct the B + Trees: Tree nodes into the D-Ancestorship B + Tree using (Symbol, Prefix) as keys For all nodes x inserted with the same (Symbol, Prefix) we index them by S-Ancestorship B + Tree, using the n x values of their labels as keys.

T HE RIST INDEX STRUCTURE

S EARCH : NON - CONTIGUOUS SUBSEQUENCE MATCHING USING B + T REE

V I ST – V IRTUAL S UFFIX T REE Dynamic Virtual suffix tree labeling Semantic and statistical clues Dynamic scope allocation without clues

D YNAMIC SCOPE ALLOCATION Number of child nodes of x is λ. We allocate 1/ λ of the remaining scope to x’s first child Dynamic scope allocation with λ=2

D YNAMIC S COPE OF A S UFFIX T REE N ODE

SUB S COPE ( PARENT, E ): CREATE A SUB SCOPE WITHIN THE PARENT SCOPE FOR E

I NSERTION INDEX Doc 1 = (P, ϵ)(S,P)(N,PS)(v 1,PSN)(L,PS)(v 2,PSL) Doc 2 = (P, ϵ)(S,P)(L,PS)(v 2,PSL)

I NDEX AN XML DOCUMENT

EXPERIMENTS - S AMPLE QUERIES Path Expression Dataset Q1 /inproceedings/title DBLP Q2 /book/author[text=‘David’] DBLP Q3 /*/author[text= ‘David’] DBLP Q4 //author[text= ‘David’] DBLP Q5 /book[key=‘books/bc/MaierW88’]/author DBLP Q6 /site//item[location=‘US’]/mail/date[text=‘12/15/1999’] XMARK Q7 /site//person/*/city[text=‘Pocatello’] XMARK Q8 //closed_auction[*[person=‘person1’]]/date[text=‘12/15/1999’] XMARK

C OMPARING INDEXING METHODS time in seconds

I NDEX STRUCTURE DBLP (301 MB of data) XMARK (52MB of data)

C ONCLUSION structure-encoded sequences Sequence matching Avoid expensive join operations Top-down scope allocation method Index structure – B + Tree

PRIX: P RUFER S EQUENCES FOR I NDEXING XML

Rao & Moon (2006) proposed a new method for indexing XML documents using sequences It uses the same idea as in ViST index: The XML tree is transformed into a sequence and saved in the database Each query is also transformed into a sequence The answer of the query is acquired by performing subsequence matching

PRIX: PR UFER SEQUENCES FOR I NDEXING X ML Motivation PRIX architecture Indexing XML documents Querying

PRIX: PR UFER SEQUENCES FOR I NDEXING X ML Motivation PRIX architecture Indexing XML documents Querying

M OTIVATION : T WIG Q UERIES AND W ILDCARDS Like in ViST, PRIX also tries to efficiently answer twig queries as well as queries containing wildcards (‘*’ any and ‘//’ self or descendant queries) P Q TS Twig query XPath: P/Q[T]/S Query with wildcards XPath: P//Q/S P Q S

M OTIVATION : P ROBLEMS IN V I ST I NDEX Memory requirements: In the worst case, ViST requires O(N 2 ) space to index the document A B C D D = (A, ε), (B, A), (C, AB), (D, ABC), (E, ABCD) E Elements in height k appear k times

M OTIVATION : P ROBLEMS IN V I ST I NDEX Memory requirements: In the worst case, ViST requires O(N 2 ) space to index the document False positives In many cases, query processing in Vist results in false alarms P Q T R TUS Doc1 = (P, e) (Q, P) (T, PQ) (S, PQ) (R, P) (U, PR) (T, PR) P Q T Q S Doc2 = (P, e) (Q, P) (T, PQ) (Q, P) (S, PQ) P Q TS XPath: P/Q[T]/S Q = (P, e) (Q, P) (T, PQ) (S, PQ)

M OTIVATION : P ROBLEMS IN V I ST I NDEX Memory requirements: In the worst case, ViST requires O(N 2 ) space to index the document False positives In many cases, query processing in Vist results in false alarms False negatives Correctly answering a twig query depends on the order the branches are created P F T N G Doc = (P, e) (F, P) (T, PF) (N, P) (G, PN) P NF Xpath: P[N]/F Q = (P, e) (N, P) (F, P) ???

M OTIVATION : P ROBLEMS IN V I ST I NDEX Memory requirements: In the worst case, ViST requires O(N 2 ) space to index the document False positives In many cases, query processing in Vist results in false alarms False negatives Correctly answering a twig query depends on the order the branches are created

PRIX: PR UFER SEQUENCES FOR I NDEXING X ML Motivation PRIX architecture Indexing XML documents Querying

PRIX A RCHITECTURE

I NDEXING AND Q UERYING IN PRIX Indexing: The first step is to take as input an XML document and convert it into a sequence This is achieved using Prufer Sequences The sequence is saved in the database in a way equivalent to the one used in ViST It is a Virtual Trie implemented as B+ Trees XML document

I NDEXING AND Q UERYING IN PRIX Querying Queries are also transformed to trees and then to Prufer Sequences The query sequence looked up in the document sequence and all matching subsequences are retrieved After this initial filtering, three refinement phases follow XPath Query

PRIX: PR UFER SEQUENCES FOR I NDEXING X ML Motivation PRIX architecture Indexing XML documents Querying

I NDEXING XML D OCUMENTS The first step is to transform the XML document to the equivalent XML tree Notice that both elements and text values are represented as nodes (the same stands for attributes) The tree is not saved in the database D A BB CC D F E

I NDEXING XML D OCUMENTS Then the Prufer Sequence is created from the XML tree A Prufer Sequence is a method proposed by Prufer (1918) that constructs a one-to-one correspondence between a labeled tree and a sequence 8,A 1,B7,B 6,C3,C 2,D 5,E 4,F 8, 3, 7, 6, 6, 7, 8

I NDEXING XML D OCUMENTS Prufer Sequences can only be created from trees with numerical labeling, with each node having a unique number Since the XML tree contains string labels (the names of elements etc.) we add an additional label to each node We will use the post-order traversal to name the nodes The prufer sequence can be extracted for any labeling of the tree, but using post-order numbering has some properties that makes the querying process easier

I NDEXING XML D OCUMENTS Initial labeling A BB CC D F E 8,A 1,B7,B 6,C3,C 2,D 5,E 4,F

I NDEXING XML D OCUMENTS Finding the Prufer Sequence The algorithm to find the Prufer sequence is the following: Find the leaf with the smallest value and delete it. Add the label of its parent to the sequence Repeat until only one node is left In PRIX index, two sequences are held: The actual Prufer Sequence holding the numbers of the labels called Numbered Prufer Sequence: NPS The corresponding sequence holding the actual labels of the nodes of the XML Tree called Labeled Prufer Sequence: LPS

I NDEXING XML D OCUMENTS Finding the Prufer Sequence The algorithm to find the Prufer sequence is the following: Find the leaf with the smallest value and delete it. Add the label of its parent to the sequence Repeat until only one node is left 8,A 1,B7,B 6,C3,C 2,D 5,E 4,F NPS : 8, LPS : A,

I NDEXING XML D OCUMENTS Finding the Prufer Sequence The algorithm to find the Prufer sequence is the following: Find the leaf with the smallest value and delete it. Add the label of its parent to the sequence Repeat until only one node is left 8,A 7,B 6,C3,C 2,D 5,E 4,F NPS : 8, 3 LPS : A, C 1,B

I NDEXING XML D OCUMENTS Finding the Prufer Sequence The algorithm to find the Prufer sequence is the following: Find the leaf with the smallest value and delete it. Add the label of its parent to the sequence Repeat until only one node is left 8,A 7,B 6,C3,C 2,D 5,E 4,F NPS : 8, 3, 7, 6, 6, 7, 8 LPS : A, C, B, C, C, B, A 1,B

I NDEXING XML D OCUMENTS Properties Both NPS and LPS have length N-1 (where N is the total number of nodes Due to the fact that we delete one node at a time until only one node is left NPS : 8, 3, 7, 6, 6, 7, 8 LPS : A, C, B, C, C, B, A 8,A 1,B7,B 6,C3,C 2,D 5,E 4,F

I NDEXING XML D OCUMENTS Properties Both NPS and LPS have length N-1 (where N is the total number of nodes The i-th element deleted is always the node with label i This helps us find the edges of the tree! (that is the mapping from the NPS to the tree) NPS : 8, 3, 7, 6, 6, 7, 8 LPS : A, C, B, C, C, B, A 8,A 1,B7,B 6,C3,C 2,D 5,E 4,F Deleted node: 1, 2, 3, 4, 5, 6, 7

I NDEXING XML D OCUMENTS Properties Both NPS and LPS have length N-1 (where N is the total number of nodes The i-th element deleted is always the node with label i LPS does not contain any leaves NPS : 8, 3, 7, 6, 6, 7, 8 LPS : A, C, B, C, C, B, A 8,A 1,B7,B 6,C3,C 2,D 5,E 4,F Deleted node: 1, 2, 3, 4, 5, 6, 7

I NDEXING XML D OCUMENTS Indexes held in the database are The LPS (label prufer sequence) The NPS (numbered prufer sequence) The mapping between the number and the xml label of the leaves of the tree NPS : 8, 3, 7, 6, 6, 7, 8 LPS : A, C, B, C, C, B, A Leaves mapping: 1  B, 2  D, 4  F, 5  E 8,A 1,B7,B 6,C3,C 2,D 5,E 4,F

PRIX: PR UFER SEQUENCES FOR I NDEXING X ML Motivation PRIX architecture Indexing XML documents Querying

Q UERYING When a query arrives it is also transformed to a prufer sequence Then, an initial filtering is performed The results of the initial filtering are sorted out in order to acquire the correct answer to the query after three more refinement phases. XPath Query

Q UERYING : T RANSFORMING A QUERY TO A PRUFER SEQUENCE The same process as in documents is followed For instance if we have the XPath query A[B/C]/D/E/F The query tree is: The NPS and LPS are: NPS(Q) = 2, 6, 4, 5, 6 LPS(Q) = B, A, E, D, A

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING Suppose we have the following XML tree (T) of the document: NPS(T) = 15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15 LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the correct results for the given query we find the subsequences of LPS(Q) inside LPS(T) “A subsequence is any string that can be obtained by deleting zero or more symbols from a given string”

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the correct results for the given query we find the subsequences of LPS(Q) inside LPS(T) Each subsequence represents a possible solution in the tree LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A LPS(Q) = B, A, E, D, A T Q

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the correct results for the given query we find the subsequences of LPS(Q) inside LPS(T) Each subsequence represents a possible solution in the tree LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A LPS(Q) = B, A, E, D, A T Q

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the correct results for the given query we find the subsequences of LPS(Q) inside LPS(T) Each subsequence represents a possible solution in the tree LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A LPS(Q) = B, A, E, D, A 12 subsequences are found in total, while only 4 are correct T Q

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the correct results for the given query we find the subsequences of LPS(Q) inside LPS(T) Each subsequence represents a possible solution in the tree LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A LPS(Q) = B, A, E, D, A 12 subsequences are found in total, while only 4 are correct T Q

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the correct results for the given query we find the subsequences of LPS(Q) inside LPS(T) Each subsequence represents a possible solution in the tree LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A LPS(Q) = B, A, E, D, A 12 subsequences are found in total, while only 4 are correct T Q

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the correct results for the given query we find the subsequences of LPS(Q) inside LPS(T) Each subsequence represents a possible solution in the tree LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A LPS(Q) = B, A, E, D, A 12 subsequences are found in total, while only 4 are correct T Q

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the correct results for the given query we find the subsequences of LPS(Q) inside LPS(T) Each subsequence represents a possible solution in the tree LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A LPS(Q) = B, A, E, D, A 12 subsequences are found in total, while only 4 are correct T Q

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the path in the tree that is represented by the sequence found while filtering we use the NPS(T) Recall that the edges can be retrieved using the index in the NPS(T) NPS(T) = 15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15 LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A T Q

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the path in the tree that is represented by the sequence found while filtering we use the NPS(T) Recall that the edges can be retrieved using the index in the NPS(T) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 NPS(T) = 15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15 LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A T Q

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the path in the tree that is represented by the sequence found while filtering we use the NPS(T) Recall that the edges can be retrieved using the index in the NPS(T) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 NPS(T) = 15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15 LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A T Q

Q UERYING : F ILTERING B Y S EQUENCE M ATCHING To find the path in the tree that is represented by the sequence found while filtering we use the NPS(T) Recall that the edges can be retrieved using the index in the NPS(T) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 NPS(T) = 15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15 LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A T Q

Q UERYING : R EFINEMENT S TEPS Despite the filtering, some false positives are in the results. To find these false positives we have 3 refinement steps, namely: Refinement by connectedness Refinement by structure Refinement by matching leaf nodes

Q UERYING : F ALSE N EGATIVES A false negative can appear in the same case as in ViST index The subsequence filtering relies on the assumption that the query branches come in the “correct” order P F T N G Document P NF Query

Q UERYING : F ALSE N EGATIVES The solution proposed by Rao and Moon is to test the query in all possible permutations of the branches and then return the union as the answer of the query N branches  N! permutations Their main argument is that queries usually have a small number of branches

Q UERYING : F ALSE N EGATIVES The solution proposed by Rao and Moon is to test the query in all possible permutations of the branches and then return the union as the answer of the query N branches  N! permutations Their main argument is that queries usually have a small number of branches P NF D S P NFD S P NF D S … (three more permutations)

E XPERIMENTS

V I ST VS PRIX: E XPERIMENTS 1.8GHz Pentium IV processor 512 MB RAM running Solaris 8 40GB EIDE disk drive (store data and indexes) Compiled by GNU g++ compiler version Buffer pool size: 2000 pages of size 8K

V I ST VS PRIX: E XPERIMENTS

DBLP dataset

V I ST VS PRIX: E XPERIMENTS SWISSPROT dataset

V I ST VS PRIX: E XPERIMENTS TREEBANK dataset

V I ST VS PRIX O(N 2 )

?????? Q UESTIONS ?

T HANK YOU !!