A General Algorithm for Subtree Similarity-Search. Sara Cohen and Nerya Or, The Hebrew University of Jerusalem. ICDE 2014, Chicago, USA. 1

The Setting: Huge labeled tree data arises in computational biology, image analysis, automatic theorem proving, compiler optimization, and XML databases. 2

Subtree Similarity-Search. Goal: given a (small) query tree Q, a database tree T, and a number k, find the k subtrees S of T most similar to Q. A tree with n nodes has n subtrees, one rooted at each node. Similarity is defined by a function that takes two trees and returns a real value. 4

The Bottom Line: An algorithm for subtree similarity-search that is compatible with a wide family of tree distance functions. Runtime is linear (depending on the distance function used; see the paper for the exact analysis). Experimental results show near-invariance to query size and to the number of results fetched. 5

Defining Distance: We introduce profile distance functions for determining the similarity between two given trees. Several previously proposed distance measures can be shown to be profile distance functions: pq-gram distance (Augsten et al.), windowed pq-gram distance (Augsten et al.), binary branch distance (Yang et al.), and other multiset-based distance measures. 6

Profile Distance Functions. Main idea: (1) associate each tree T with a multiset of small objects that represent the tree's structure and contents; (2) use a multiset comparison method to determine the similarity between two trees. 7

Profile Distance Functions: Summarize the interesting features of each tree using a multiset, then compare the multisets to obtain a distance value between the two trees. 8

Profile Distance: A Simple Example. Summarize the interesting features of each tree using a multiset; for example, take the bag of the tree's labels. One tree yields {“cluck”, “meow”, “purr”, “purr”} and the other yields {“cluck”, “meow”, “meow”, “ribbit”, “woof”}. Then compare the multisets, for example with the Dice coefficient. 9
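The bag-of-labels example can be checked numerically. The sketch below assumes the common multiset form of the Dice coefficient, 2|A ∩ B| / (|A| + |B|), where the intersection keeps the minimum multiplicity of each element; the name `dice_similarity` is illustrative and does not come from the slides.

```python
from collections import Counter


def dice_similarity(a: Counter, b: Counter) -> float:
    """Dice coefficient for multisets (assumed form: 2|A ∩ B| / (|A| + |B|))."""
    shared = sum((a & b).values())          # Counter '&' keeps minimum counts
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 1.0


# The two label bags from the slide.
left = Counter(["cluck", "meow", "purr", "purr"])
right = Counter(["cluck", "meow", "meow", "ribbit", "woof"])

similarity = dice_similarity(left, right)   # shared = {"cluck", "meow"}, so 2*2 / (4+5)
distance = 1 - similarity
print(similarity, distance)
```

Only "cluck" and one "meow" are shared, so the similarity is 4/9 and the corresponding distance is 5/9.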

Profile Distance: pq-grams (Augsten et al.). Summarize the interesting features of each tree using a multiset; for example, take its pq-grams, which combine node labels such as “meow”, “ribbit”, and “cluck” with the placeholder * (and many more). Then compare the multisets, for example with the normalized Dice measure for multisets (Augsten et al.). This profile function respects the tree's structure as well as its content. 10
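To make the pq-gram profile concrete, here is a minimal sketch of the usual shift-register construction: each pq-gram is a stem of p ancestors followed by a base of q consecutive children, padded with the placeholder `*`. The tuple-based tree encoding `(label, [children])`, the function names, and the default values of p and q are assumptions made for illustration, and the distance uses the form 1 − 2|P1 ∩ P2| / (|P1| + |P2|).

```python
from collections import Counter


def pq_grams(tree, p=2, q=3):
    """Return the multiset of pq-grams of a tree given as (label, [children])."""
    grams = Counter()

    def visit(node, stem):
        label, children = node
        stem = stem[1:] + (label,)            # shift the node into the ancestor register
        base = ("*",) * q
        if not children:                      # a leaf contributes one all-placeholder base
            grams[stem + base] += 1
            return
        for child in children:
            base = base[1:] + (child[0],)     # slide the window over the children
            grams[stem + base] += 1
            visit(child, stem)
        for _ in range(q - 1):                # flush the window with trailing placeholders
            base = base[1:] + ("*",)
            grams[stem + base] += 1

    visit(tree, ("*",) * p)
    return grams


def pq_distance(t1, t2, p=2, q=3):
    g1, g2 = pq_grams(t1, p, q), pq_grams(t2, p, q)
    shared = sum((g1 & g2).values())
    return 1 - 2 * shared / (sum(g1.values()) + sum(g2.values()))


t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", [])])
print(sorted(pq_grams(t1, p=2, q=2)))
print(pq_distance(t1, t2, p=2, q=2))
```

With p = 2 and q = 2, the tree a(b, c) yields five pq-grams and a(b) yields three; two of them coincide, which gives a distance of 0.5.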

Profile Distance Functions. Main idea: (1) associate each tree T with a multiset of small objects that represent the tree's structure and contents; (2) use a multiset comparison method to determine the similarity between two trees. In practice, the multiset for a tree is determined by multisets associated with its nodes, and comparison functions are based on the intersection, union, and sizes of multisets. 11

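Multisets Associated with Trees: Each node u is associated with two multisets: a local multiset, written here as $\mathcal{L}(u)$, which contains elements that describe the subtree rooted at u, and a global multiset, written here as $\mathcal{G}(u)$, which contains elements that describe the node u and its surroundings. A tree T, rooted at node r, is then associated with the multiset $M(T) = \mathcal{L}(r) \uplus \biguplus_{v \in T,\, v \neq r} \mathcal{G}(v)$. 12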

Example: Subtree Rooted at Node r. For a subtree with root r and descendants v1, v2, v3, …, take the local multiset from the root node r and the global multiset from each non-root node. 13
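The rule "local multiset from the root, global multisets from all non-root nodes" can be sketched in a few lines. The `Node` class, the helper `labelled`, and the particular profile used here (pairs of parent label and own label, with `*` standing in for the parent of a subtree root) are hypothetical choices for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    loc: Counter                      # local multiset: used when this node is the subtree root
    glob: Counter                     # global multiset: used when this node is a non-root member
    children: list = field(default_factory=list)


def subtree_multiset(root: Node) -> Counter:
    """Multiset of the subtree rooted at `root`: loc(root) plus glob(v) for every other node v."""
    bag = Counter(root.loc)
    stack = list(root.children)
    while stack:
        v = stack.pop()
        bag.update(v.glob)
        stack.extend(v.children)
    return bag


def labelled(label, parent_label, children=()):
    # Hypothetical profile: a node contributes the pair (parent label, own label);
    # when it acts as a subtree root, the parent is replaced by the placeholder "*".
    return Node(Counter([("*", label)]), Counter([(parent_label, label)]), list(children))


v2 = labelled("c", "b")
v1 = labelled("b", "a", [v2])
v3 = labelled("b", "a")
r = labelled("a", "root", [v1, v3])

print(subtree_multiset(r))    # root's local pair plus three global pairs
print(subtree_multiset(v1))   # v1 now acts as the root, so its own pair is ("*", "b")
```

The same node contributes different elements depending on whether it is the root of the subtree under consideration, which is why two multisets per node are kept.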

Subtree Similarity Search. A friendly reminder of our mission: find the top-k subtrees of a database tree T most similar to a query tree Q. This problem can trivially be solved in polynomial time. The challenge is the huge size of the data, and computing distances for all subtrees efficiently. 14

Subtree Similarity Search. Our algorithm's basic strategy, given a number k, a query Q, and a tree T: go over T in post-order; for the subtree S rooted at the current node of T, calculate the sizes of the multiset intersection and union between Q and S; derive a distance value between Q and S; if S is one of the top-k subtrees seen so far, keep it in the result set. 15
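The strategy above can be sketched as a naive baseline: a post-order walk that builds each subtree's multiset from its children, scores it against the query, and keeps the best k in a bounded heap. This is not the linear-time method of the talk, since it re-merges whole bags at every node. The bag-of-labels profile, the Dice-style distance, and the tuple tree encoding `(label, [children])` are assumptions for illustration.

```python
import heapq
from collections import Counter


def top_k_subtrees(tree, query_bag, k):
    """Return the k best (distance, post-order index, root label) triples, best first."""
    heap = []            # min-heap on negated distance, so heap[0] is the worst kept result
    next_idx = [0]
    query_size = sum(query_bag.values())

    def visit(node):
        label, children = node
        bag = Counter([label])
        for child in children:
            bag += visit(child)                      # naive merge of the child bags
        shared = sum((bag & query_bag).values())
        dist = 1 - 2 * shared / (sum(bag.values()) + query_size)
        idx = next_idx[0]
        next_idx[0] += 1
        item = (-dist, idx, label)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:                   # strictly better than the worst kept
            heapq.heapreplace(heap, item)
        return bag

    visit(tree)
    return sorted((-neg, idx, label) for neg, idx, label in heap)


tree = ("r", [("a", [("b", []), ("a", [])]), ("b", [])])
query = Counter({"a": 2, "b": 1})
top = top_k_subtrees(tree, query, k=2)
print(top)
```

Keeping only k entries in the heap means each node costs at most O(log k) for result maintenance, which is where a |T| log(k) term would come from in a single-pass search.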

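Calculating the Multiset Unions. Note: the union size follows from the sizes of the two multisets and of their intersection, $|A \cup B| = |A| + |B| - |A \cap B|$, so calculating the multiset union size for each subtree S while iterating over T in post-order is easy. 16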

Calculating the Multiset Intersections: $|A \cap B| = \sum_x \min(c_A(x), c_B(x))$. Notation: $c_A(x)$ is the number of times x appears in A. We sum over each x exactly once, even if it appears several times in the multisets. For example, for A = {α, α, α, β} and B = {α, α, β, γ}, the intersection size is min(3, 2) + min(1, 1) + min(0, 1) = 3. 17
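The intersection size for the two example multisets, and the union size derived from it, can be verified with Python's `Counter`, whose `&` and `|` operators take per-element minimum and maximum counts. The helper name `size` is illustrative.

```python
from collections import Counter

A = Counter({"α": 3, "β": 1})            # {α, α, α, β}
B = Counter({"α": 2, "β": 1, "γ": 1})    # {α, α, β, γ}


def size(m: Counter) -> int:
    return sum(m.values())


# Sum over each distinct element once, taking the smaller multiplicity.
intersection_size = sum(min(A[x], B[x]) for x in A.keys() | B.keys())

# Union size from the two sizes and the intersection size.
union_size = size(A) + size(B) - intersection_size

print(intersection_size, union_size)
```

Only the intersection needs real work: once it is known, the union size costs a single subtraction.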

Calculating the Multiset Intersections: We begin by describing a simple algorithm for calculating the intersection sizes; this method is used within the DynamicSearch algorithm in the paper. Later, we describe an improved algorithm, which is what the ProfileSimSearch algorithm in the paper uses. 18

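Multiset Intersections; Simple Version: We want to find the intersection size $|M(Q) \cap M(S)|$ for each subtree S, where $M(\cdot)$ denotes the multiset associated with a tree. Q always stays constant, so we calculate its multiset $M(Q)$ once. Any element $x \notin M(Q)$ contributes 0 to this sum, so we only calculate for $x \in M(Q)$. 19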

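Multiset Intersections; Simple Version: For each distinct x in M(Q), the multiset of Q, define a queue $q_x$. This queue initially contains as many elements as x has occurrences in M(Q), all of which are null placeholders. For example, if M(Q) = {a, a, a, b}, we have two queues: $q_a$ = [null, null, null] and $q_b$ = [null]. 20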

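Multiset Intersections; Simple Version: We iterate over T in post-order. For each node v, and for each x that appears both in v's global multiset and in the query's multiset, we perform the following action once per occurrence of x in v's global multiset: pop an element from the front of the queue for x, and insert v at the back of that queue. 21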

Multiset Intersections; Simple Version: In the queue for x, we have a prefix of nulls and nodes from outside the current subtree, followed by a suffix of the nodes from the current subtree that have x in their global profile. [Figure: a queue holding null, v2, v3, v5, v7, v8, shown next to a tree with nodes v1 to v9 in which the current iteration's node in T is marked.] 27

Multiset Intersections; Simple Version: The length of the queue for x is always exactly the number of times x appears in the query's multiset. We can count the sizes of the suffix and the prefix in order to obtain the intersection size with respect to x. Note: “local” multiset elements can fit in any slot of the prefix and contribute to the intersection size; we use this fact to account for the local multiset of the current node. 28
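The simple queue-based method can be sketched as follows. This is an illustrative reading of the slides, not the paper's DynamicSearch code: the `Node` class and function names are hypothetical, queues store post-order indices in place of node references, and the suffix is recognised as the entries whose index falls inside the current subtree's post-order range. Query multiplicities are assumed to be positive.

```python
from collections import Counter, deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    glob: Counter                     # global multiset of the node
    loc: Counter                      # local multiset of the node
    children: list = field(default_factory=list)


def simple_intersections(root, query):
    """Map node name -> intersection size between the query multiset and that node's subtree."""
    queues = {x: deque([None] * c) for x, c in query.items()}   # null placeholders
    result = {}
    next_idx = [0]

    def visit(v):
        start = next_idx[0]            # first post-order index inside v's subtree
        for child in v.children:
            visit(child)
        total = 0
        for x, q in queues.items():
            # Suffix: entries enqueued by nodes of the current subtree (scanned in full).
            suffix = sum(1 for e in q if e is not None and e >= start)
            free = len(q) - suffix     # prefix slots that local elements may occupy
            total += suffix + min(free, v.loc.get(x, 0))
        result[v.name] = total
        idx = next_idx[0]
        next_idx[0] += 1
        for x, c in v.glob.items():    # enqueue v once per occurrence of x
            if x in queues:
                for _ in range(c):
                    queues[x].popleft()
                    queues[x].append(idx)

    visit(root)
    return result


def n(name, label, *children):
    bag = Counter([label])             # bag-of-labels profile: local and global coincide
    return Node(name, bag, bag, list(children))


tree = n("r", "r", n("a1", "a", n("b1", "b"), n("a2", "a")), n("b2", "b"))
result = simple_intersections(tree, Counter({"a": 2, "b": 1}))
print(result)
```

Scanning every queue at every node is what makes this version expensive on a large tree; the recursion is also only suitable for small examples because of Python's recursion limit.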

Is that all? The tree T is huge! Runtime of the simple algorithm is too high. 29

Making it Scalable: By careful bookkeeping, we can avoid the need to count the size of each queue suffix; this reduces the runtime from quadratic to linear. Calculating the intersection with local multiset elements is still needed, but the runtime of this operation is bounded by the local multiset sizes, so it is linear overall in the input size. 30

Calculating the Suffix Size On-The-Fly, 1st attempt: Each node in T keeps a counter, initialized to 0 (however, we never use more than O(height(T)) memory). During the post-order iteration over T: increase counter(v) whenever v is enqueued in some queue; at the end of the iteration over v, add counter(v) to counter(v.parent). This is not good enough! What happens when a node is evicted from the queue? 31

Calculating the Suffix Size On-The-Fly, fixed: Each node in T keeps a counter, initialized to 0 (however, we never use more than O(height(T)) memory). During the post-order iteration over T: increase counter(v) whenever v is enqueued in some queue; at the end of the iteration over v, add counter(v) to counter(v.parent); whenever a node u is evicted from a queue and node v is inserted instead, decrement counter(LCA(u,v)). Now counter(w) contains the size of the suffix during the iteration over w. 32

Calculating the Suffix Size On-The-Fly: The queue for x contains the last nodes that we have seen with which x was associated, as many as x has occurrences in the query's multiset; the queue length is always that number, so each x cannot contribute more than that to the intersection size. When u is dequeued and v is enqueued, we decrement counter(w), where w is the lowest common ancestor (LCA) of u and v. 33
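The counter bookkeeping can be sketched as below. This is a hedged illustration, not the paper's ProfileSimSearch implementation: names are hypothetical, queues store post-order indices, and counters are kept only for nodes on the current root-to-node path. The LCA of an evicted node u and the current node v is found by binary search over the path (the deepest ancestor whose subtree starts at or before u), which costs O(log height) per eviction where the slides assume O(1).

```python
from bisect import bisect_right
from collections import Counter, deque
from dataclasses import dataclass, field
from itertools import islice


@dataclass
class Node:
    name: str
    glob: Counter
    loc: Counter
    children: list = field(default_factory=list)


def fast_intersections(root, query):
    """Map node name -> intersection size; query multiplicities are assumed positive."""
    queues = {x: deque([None] * c) for x, c in query.items()}
    result = {}
    starts = []       # per node on the current path: first post-order index of its subtree
    counters = []     # per node on the current path: total suffix size over all queues
    next_idx = 0

    def visit(v):
        nonlocal next_idx
        start = next_idx
        starts.append(start)
        counters.append(0)
        for child in v.children:
            visit(child)
        total = counters[-1]                       # suffix entries from v's descendants
        for x, c in v.loc.items():                 # local elements fill free prefix slots
            if x in queues:
                total += sum(1 for e in islice(queues[x], c) if e is None or e < start)
        result[v.name] = total
        idx = next_idx
        next_idx += 1
        for x, c in v.glob.items():
            if x not in queues:
                continue
            for _ in range(c):
                evicted = queues[x].popleft()
                queues[x].append(idx)
                counters[-1] += 1                  # v was enqueued
                if evicted is not None:            # decrement counter(LCA(evicted, v))
                    counters[bisect_right(starts, evicted) - 1] -= 1
        starts.pop()
        done = counters.pop()
        if counters:
            counters[-1] += done                   # fold v's counter into its parent

    visit(root)
    return result


def n(name, label, *children):
    bag = Counter([label])                         # bag-of-labels profile for the example
    return Node(name, bag, bag, list(children))


tree = n("r", "r", n("a1", "a", n("b1", "b"), n("a2", "a")), n("b2", "b"))
result = fast_intersections(tree, Counter({"a": 2, "b": 1}))
print(result)
```

Reading the local contribution only inspects as many front entries of a queue as the local multiset has occurrences of x, so the per-node work is bounded by the sizes of that node's local and global multisets.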

The ProfileSimSearch Algorithm. Runtime: linear in the multiset sizes for Q and T, plus a term of |T| log(k), assuming O(1) calculation time for lowest common ancestors. Memory use: linear in the query's multiset size, in k, and in height(T). The algorithm runs in a single post-order pass over T. Multisets of T's nodes can be indexed in advance for a quick implementation; if all multiset elements can be generated on-the-fly easily, no such preprocessing is necessary. 34

Experimentation 35

Setup. State of the art for subtree similarity search: TASM-postorder [Augsten et al.] and StructureSearch [Cohen]. Both algorithms use tree edit distance, not profile distance functions. We also compare performance with the implementation of tree-to-tree distance using pq-grams by Augsten et al. 36

Setup. Data sets: DBLP (17.6 million nodes), XMark100 to XMark1800 (3.6 to 57.8 million nodes), and Sprot (9.4 million nodes). Queries: random subtrees from the data. Extensive experimentation appears in the paper. In the next slides, all times are in seconds. 37

Varying |Q| (Dataset: 14.5 million nodes) Similar results were observed on all other datasets that were tested 38

Varying k (Dataset: 14.5 million nodes) Similar results were observed on all other datasets that were tested 39

Varying Dataset Size Different multiset-generating functions are compared here 40

Comparison with tree-to-tree pq-gram distance: A MySQL-based implementation of the pq-gram distance calculation routine given by Augsten et al. is compared to ProfileSimSearch. Note: ProfileSimSearch may output top-k results, while the other algorithm is designed to calculate the pq-gram distance between two given trees Q and T. Both algorithms use an indexing stage over the database tree T, which is not measured in the following results. 41

Comparison with tree-to-tree pq-gram distance 42

Conclusion and Future Work: We presented a definition capable of expressing a large, general family of tree distance functions, and an efficient and scalable algorithm for subtree search using this definition, which can also be used for tree search over a large set of trees. Future work: use upper bounds on subtree sizes or other attributes to prune the search space; use a profile distance function to obtain bounds on tree edit distance, and modify the algorithm to calculate top-k using tree edit distance. 43

Thanks! Questions? 44
