
Approximate XML Joins
Huang-Chun Yu, Li Xu

Introduction
XML is widely used to integrate data from different sources.
Performing a join operation on XML documents raises two issues:
– Two documents may convey approximately or exactly the same information yet differ in structure.
– Even when two documents share the same DTD, their structures may differ because of optional elements or attributes.

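Introduction
Example: [figure: three XML trees describing the paper "XML for the masses" (conference: VLDB). The first is rooted at "paper" with children conference, title and authors, and lists the authors Alice and Bob; the second has the same structure but lists only Alice; the third is rooted at "publication" and uses a different structure, with elements such as type, title, name and author.]
We need approximate matching of XML documents!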

Introduction
We also need a distance metric to quantify the differences between XML documents.
Tree edit distance is used in this paper for its generality and simplicity in quantifying the distance between trees. Other distance metrics can be used as well.

Tree Edit Distance
Tree edit distance: the minimum number of tree edit operations (node insertion, node deletion, label substitution) required to transform one tree into another.
Given two trees $T_1$ and $T_2$, there is a well-known algorithm that computes the tree edit distance in $O(|T_1|\,|T_2|\,h(T_1)\,h(T_2))$ time, where $h(T)$ is the height of $T$.

Tree Edit Distance
Find a mapping $M$ between $T_1$ and $T_2$ such that the editing cost is minimized. The mapping consists of pairs of integers $(i, j)$ such that:
– $1 \le i \le |T_1|$ and $1 \le j \le |T_2|$
– For any $(i_1, j_1), (i_2, j_2)$ in $M$:
  – $i_1 = i_2$ iff $j_1 = j_2$ (one-to-one)
  – $t_1[i_1]$ is to the left of $t_1[i_2]$ iff $t_2[j_1]$ is to the left of $t_2[j_2]$ (sibling order preserving)
  – $t_1[i_1]$ is an ancestor of $t_1[i_2]$ iff $t_2[j_1]$ is an ancestor of $t_2[j_2]$ (ancestor order preserving)

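Tree Edit Distance
Example: the tree edit distance is 3 (delete B, insert H, relabel C to I).
[figure: the two trees, written here in parenthesized form, where X(Y, Z) means X has children Y and Z]
$T_1$: A(B(D, E), C(F, G))
$T_2$: A(D, H(E, I(F, G)))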
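The definition can be checked on small trees with a short sketch. It assumes unit costs and that the example trees are $T_1$ = A(B(D, E), C(F, G)) and $T_2$ = A(D, H(E, I(F, G))). It is a plain memoized recursion over forests, not the well-known algorithm with the running time quoted earlier, and the names `node`, `forest_dist` and `tdist` are illustrative only.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples are hashable, so results can be cached.

def node(label, *children):
    return (label, tuple(children))

def forest_size(forest):
    return sum(1 + forest_size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return forest_size(g)          # insert every node of g
    if not g:
        return forest_size(f)          # delete every node of f
    (lv, cv), (lw, cw) = f[-1], g[-1]  # rightmost trees of each forest
    return min(
        forest_dist(f[:-1] + cv, g) + 1,    # delete the rightmost root of f
        forest_dist(f, g[:-1] + cw) + 1,    # insert the rightmost root of g
        forest_dist(f[:-1], g[:-1])         # map the two rightmost roots:
        + forest_dist(cv, cw) + (lv != lw)  # their children, plus a relabel
    )

def tdist(t1, t2):
    return forest_dist((t1,), (t2,))

T1 = node("A", node("B", node("D"), node("E")), node("C", node("F"), node("G")))
T2 = node("A", node("D"), node("H", node("E"), node("I", node("F"), node("G"))))
print(tdist(T1, T2))  # 3
```

Deleting a root promotes its children into the forest in place, which is why the first two cases splice `cv` or `cw` back in.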

Problem Definition
Given two XML data sources $S_1$ and $S_2$ and a distance threshold $\tau$.
$\mathrm{TDist}(d_1, d_2)$: a function that assesses the tree edit distance between two documents $d_1 \in S_1$ and $d_2 \in S_2$.
Approximate join: return all pairs of documents $(d_1, d_2) \in S_1 \times S_2$ such that $\mathrm{TDist}(d_1, d_2) \le \tau$.

Challenges
Evaluating the TDist function between two documents is very expensive (worst case $O(n^4)$ for trees of size $O(n)$).
Traditional join techniques (sort-merge join, hash join, etc.) cannot be used, because the join condition is a distance threshold rather than an equality predicate.

Lower Bounds
Let $T$ be an ordered labeled tree. Let $pre(T)$ denote the preorder traversal of $T$ and $post(T)$ the postorder traversal of $T$, and let $ed$ be the string edit distance.
Let $T_1$, $T_2$ be ordered labeled trees. Then
$\max\{ed(pre(T_1), pre(T_2)),\ ed(post(T_1), post(T_2))\} \le \mathrm{TDist}(T_1, T_2)$
This lower bound can be computed in $O(n^2)$ time.
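A minimal sketch of this lower bound, assuming trees are stored as `(label, children)` tuples and unit edit costs. The helper names (`node`, `preorder`, `postorder`, `edit_distance`, `lb_dist`) are illustrative and do not come from the paper.

```python
def node(label, *children):
    return (label, tuple(children))

def preorder(t):
    label, children = t
    return [label] + [x for c in children for x in preorder(c)]

def postorder(t):
    label, children = t
    return [x for c in children for x in postorder(c)] + [label]

def edit_distance(a, b):
    """Classic O(len(a) * len(b)) string edit distance over label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def lb_dist(t1, t2):
    """max of the preorder and postorder string edit distances."""
    return max(edit_distance(preorder(t1), preorder(t2)),
               edit_distance(postorder(t1), postorder(t2)))

T1 = node("A", node("B", node("D"), node("E")), node("C", node("F"), node("G")))
T2 = node("A", node("D"), node("H", node("E"), node("I", node("F"), node("G"))))
print(preorder(T1), preorder(T2))
print(lb_dist(T1, T2))  # 3
```

For these two trees the bound is tight: it equals a tree edit distance of 3 (delete B, insert H, relabel C to I).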

Upper Bounds
An additional constraint is imposed on the mappings considered by the original TDist algorithm. This reduces the search space and allows a faster algorithm.
Let $lca(\cdot)$ be the lowest common ancestor function. For any triple $(t_1[i_1], t_2[j_1]), (t_1[i_2], t_2[j_2]), (t_1[i_3], t_2[j_3]) \in M$:
– $t_1[lca(t_1[i_1], t_1[i_2])]$ is a proper ancestor of $t_1[i_3]$ iff $t_2[lca(t_2[j_1], t_2[j_2])]$ is a proper ancestor of $t_2[j_3]$
Two distinct subtrees of $T_1$ will be mapped to two distinct subtrees of $T_2$.
The constrained distance can be calculated in $O(|T_1||T_2|)$ time. Because every constrained mapping is also a valid unconstrained mapping, its cost is an upper bound on TDist.

Upper Bounds
Example: the upper bound is 5 (delete B, delete E, insert H, insert E, relabel C to I).
$T_1$: A(B(D, E), C(F, G))
$T_2$: A(D, H(E, I(F, G)))

Upper Bounds
Algorithm for Upper Bound: [pseudocode figure not included in the transcript]
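Since the slide's own pseudocode is not available, the following is only a sketch of one known way to compute a distance under an LCA-preserving constraint of this kind: a recursion in the style of Zhang's constrained edit distance for ordered trees, with unit costs. Whether it matches the slide's algorithm line for line is an assumption; the names `ub_tree` and `ub_forest` are illustrative.

```python
from functools import lru_cache

def node(label, *children):
    return (label, tuple(children))

def size(t):
    return 1 + sum(size(c) for c in t[1])

def fsize(forest):
    return sum(size(t) for t in forest)

@lru_cache(maxsize=None)
def ub_tree(a, b):
    """Constrained edit distance between trees a and b (unit costs)."""
    best = ub_forest(a[1], b[1]) + (a[0] != b[0])  # root maps to root
    for c in a[1]:  # b maps into one child subtree of a, the rest of a is deleted
        best = min(best, size(a) - size(c) + ub_tree(c, b))
    for c in b[1]:  # a maps into one child subtree of b, the rest of b is inserted
        best = min(best, size(b) - size(c) + ub_tree(a, c))
    return best

@lru_cache(maxsize=None)
def ub_forest(f, g):
    """Constrained edit distance between two ordered forests."""
    if not f or not g:
        return fsize(f) + fsize(g)
    m, n = len(f), len(g)
    # Align the two sequences of subtrees like a string edit distance,
    # where deleting or inserting a whole subtree costs its size.
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        E[i][0] = E[i - 1][0] + size(f[i - 1])
    for j in range(1, n + 1):
        E[0][j] = E[0][j - 1] + size(g[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            E[i][j] = min(E[i - 1][j] + size(f[i - 1]),
                          E[i][j - 1] + size(g[j - 1]),
                          E[i - 1][j - 1] + ub_tree(f[i - 1], g[j - 1]))
    best = E[m][n]
    for t in f:  # all of g maps below the root of one tree t of f
        best = min(best, fsize(f) - fsize(t[1]) + ub_forest(t[1], g))
    for t in g:  # all of f maps below the root of one tree t of g
        best = min(best, fsize(g) - fsize(t[1]) + ub_forest(f, t[1]))
    return best

T1 = node("A", node("B", node("D"), node("E")), node("C", node("F"), node("G")))
T2 = node("A", node("D"), node("H", node("E"), node("I", node("F"), node("G"))))
print(ub_tree(T1, T2))  # 5
```

On the example trees the sketch returns 5, which agrees with the edit script delete B, delete E, insert H, insert E, relabel C to I.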

Outline
– Reference set
– Choosing reference set
– Approximate join algorithms

Reference Set
$S_1$, $S_2$: two sets of XML documents
Reference set $K \subseteq S_1 \cup S_2$: a chosen set of XML documents
$v_i$: a vector for document $d_i \in S_1 \cup S_2$
– dimensionality = $|K|$
– $v_{it} = \mathrm{TDist}(d_i, k_t)$, $k_t \in K$, $1 \le t \le |K|$

Reference Set
By the triangle inequality, $|v_{it} - v_{jt}| \le \mathrm{TDist}(d_i, d_j) \le v_{it} + v_{jt}$ for $1 \le t \le |K|$
– Essentially, this procedure "projects" documents $d_i$, $d_j$ onto the reference set $K$
$\tau$: distance threshold
$u_{ij} = \min_{1 \le t \le |K|} (v_{it} + v_{jt})$
– $u_{ij} \le \tau$: the pair is certainly within distance $\tau$
$l_{ij} = \max_{1 \le t \le |K|} |v_{it} - v_{jt}|$
– $l_{ij} > \tau$: the pair cannot be within distance $\tau$
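The projection and the two bounds can be sketched as follows. To keep the example self-contained, documents are stood in for by integers and TDist by the absolute difference, which is also a metric; `project` and `bounds` are illustrative names.

```python
def project(docs, refs, dist):
    """Map each document to its vector of distances to the reference set."""
    return {d: [dist(d, k) for k in refs] for d in docs}

def bounds(vi, vj):
    """Return (l_ij, u_ij) for two reference vectors."""
    lower = max(abs(a - b) for a, b in zip(vi, vj))  # l_ij
    upper = min(a + b for a, b in zip(vi, vj))       # u_ij
    return lower, upper

# Stand-in data: integers as "documents", |a - b| as the metric.
dist = lambda a, b: abs(a - b)
docs, refs = [1, 4, 10], [0, 5]
v = project(docs, refs, dist)

print(bounds(v[1], v[4]))   # (3, 5), the true distance is 3
print(bounds(v[1], v[10]))  # (9, 9), the true distance is 9
```

With a threshold of 4, the second pair would be pruned without ever calling the distance function on it, while the first would still need an exact check.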

Choosing Reference Set
$S = S_1 \cup S_2$
$S$ is well separated if it can be divided into $k$ clusters such that:
– Documents within a cluster have a small distance (say, less than $\tau/2$)
– Documents in different clusters have a large distance (say, larger than $3\tau/2$)

Choosing Reference Set
If $S$ is well separated:
– Choose a single point from each of the $k$ largest clusters to be in the reference set, where $k$ is the size of the reference set
– If $k$ is not known:
  – $f_i$: the fraction of points in the first $i$ clusters
  – Choose $k \ge i \ge 2$ such that [formula missing from the transcript]

Choosing Reference Set
Choose $d \in C_1$ to be in the reference set $K$.
– Every pair $(d_i \in C_1, d_j \in C_1)$ should be in the output:
  $\mathrm{TDist}(d_i, d_j) \le \mathrm{TDist}(d_i, d) + \mathrm{TDist}(d_j, d) \le \tau/2 + \tau/2 = \tau$
  – With $C_1$ containing $n_1$ documents, this saves $n_1(n_1 - 1)/2$ evaluations of TDist()
– Every pair $(d_i \in C_1, d_j \in C_2)$ should not be in the output:
  $\mathrm{TDist}(d_i, d_j) \ge |\mathrm{TDist}(d_i, d) - \mathrm{TDist}(d_j, d)| > 3\tau/2 - \tau/2 = \tau$
  – This saves $n_1(|S| - n_1)$ evaluations of TDist()

Algorithm
1. do {
   1.1 randomly pick a point d from the data set S
   1.2 put all the points within distance τ/2 of d in one cluster
   } until (all documents in S belong to some cluster)
2. choose the k largest clusters
3. pick a random point from each of these clusters to be in the reference set K
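The three steps can be sketched as below. The sketch assumes that each round picks only among documents not yet assigned to a cluster (the slide does not say), and it again uses integers with absolute difference as a stand-in for documents and TDist. The function name and the `seed` parameter are illustrative.

```python
import random

def choose_reference_set(docs, k, tau, dist, seed=0):
    rng = random.Random(seed)
    remaining = list(docs)
    clusters = []
    while remaining:                       # step 1: until every doc is clustered
        d = rng.choice(remaining)          # 1.1 pick a random unclustered point
        cluster = [x for x in remaining if dist(d, x) <= tau / 2]   # 1.2
        remaining = [x for x in remaining if dist(d, x) > tau / 2]
        clusters.append(cluster)
    clusters.sort(key=len, reverse=True)   # step 2: the k largest clusters
    return [rng.choice(c) for c in clusters[:k]]  # step 3: one point from each

# Stand-in data: three well-separated groups for tau = 10.
docs = [0, 1, 2, 100, 101, 500]
refs = choose_reference_set(docs, k=2, tau=10, dist=lambda a, b: abs(a - b))
print(refs)  # one element of {0, 1, 2} followed by one element of {100, 101}
```

Because the picked point is always within distance 0 of itself, every round removes at least one document, so the loop terminates.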

Bounds Algorithm
Naïve algorithm
– Nested loop join + TDist algorithm
Bounds algorithm
for each d_i ∈ S_1 {
  for each d_j ∈ S_2 {
    if (UBDist(d_i, d_j) ≤ τ)
      output (d_i, d_j);
    else if (LBDist(d_i, d_j) ≤ τ)
      if (TDist(d_i, d_j) ≤ τ)
        output (d_i, d_j);
  }
}
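A runnable sketch of the Bounds join, with the three distance functions passed in as parameters. The stand-in functions in the usage part (absolute difference as TDist, half and double of it as lower and upper bounds) are assumptions chosen only so the example runs on its own.

```python
def bounds_join(s1, s2, tau, tdist, lbdist, ubdist):
    out = []
    for di in s1:
        for dj in s2:
            if ubdist(di, dj) <= tau:
                out.append((di, dj))   # certainly within tau, no TDist needed
            elif lbdist(di, dj) <= tau and tdist(di, dj) <= tau:
                out.append((di, dj))   # not pruned, verified with TDist
    return out

# Stand-in distances on integers; calls to the expensive function are recorded.
calls = []

def tdist(a, b):
    calls.append((a, b))
    return abs(a - b)

lbdist = lambda a, b: abs(a - b) // 2   # never exceeds the true distance
ubdist = lambda a, b: 2 * abs(a - b)    # never falls below the true distance

result = bounds_join([0, 10], [1, 4, 30], 3, tdist, lbdist, ubdist)
print(result, len(calls))  # [(0, 1)] 2
```

The naïve nested loop would call TDist on all 6 pairs; here only the 2 pairs that neither bound can decide reach it.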

Pruning with a Reference Set
For each pair $(d_i \in S_1, d_j \in S_2)$:
– $u_{ij} = \min_{1 \le t \le |K|} (v_{it} + v_{jt})$
– $l_{ij} = \max_{1 \le t \le |K|} |v_{it} - v_{jt}|$
– $u_{ij} \le \tau$: the pair belongs to the output
– $l_{ij} > \tau$: the pair can be pruned away
– $l_{ij} \le \tau < u_{ij}$: apply $\mathrm{TDist}(d_i, d_j)$ to determine the distance between $d_i$ and $d_j$
We refer to this algorithm as RS (ReferenceSets).
Drawback
– It needs $(|S_1| + |S_2|) \cdot |K|$ invocations of TDist() to compute the vectors
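The RS join might look like the sketch below. It assumes documents are hashable (so they can key a dictionary), that the reference set is non-empty, and that the distance is a metric; integers with absolute difference again stand in for documents and TDist.

```python
def rs_join(s1, s2, refs, tau, tdist):
    # (|S1| + |S2|) * |K| distance computations to build the vectors
    v1 = {d: [tdist(d, k) for k in refs] for d in s1}
    v2 = {d: [tdist(d, k) for k in refs] for d in s2}
    out = []
    for di in s1:
        for dj in s2:
            u = min(a + b for a, b in zip(v1[di], v2[dj]))       # u_ij
            l = max(abs(a - b) for a, b in zip(v1[di], v2[dj]))  # l_ij
            if u <= tau:
                out.append((di, dj))          # certainly in the output
            elif l <= tau and tdist(di, dj) <= tau:
                out.append((di, dj))          # undecided, so check exactly
            # otherwise l > tau or the exact check failed: pair is dropped
    return out

dist = lambda a, b: abs(a - b)
s1, s2, refs = [0, 1, 100], [2, 101, 50], [0, 100]
print(rs_join(s1, s2, refs, 4, dist))  # [(0, 2), (1, 2), (100, 101)]
```

The result is the same set of pairs a nested loop over TDist would return; only the number of exact distance computations differs.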

Applying Both Optimizations
If the RS algorithm indicates that TDist() should be invoked for a pair:
– the invocation can possibly be avoided by first applying the computationally cheaper LBDist() and UBDist()
We refer to this algorithm as RSB (RSBounds).

RSC Algorithm
The reference set approach potentially requires more evaluations of TDist()
– because of the construction of the vectors
RSC instead keeps two vectors for each document $d_i \in S_1 \cup S_2$:
– vector $v^l_i$: $v^l_{it} = \mathrm{LBDist}(d_i, k_t)$, $k_t \in K$, $1 \le t \le |K|$
– vector $v^u_i$: $v^u_{it} = \mathrm{UBDist}(d_i, k_t)$, $k_t \in K$, $1 \le t \le |K|$
$\max(v^l_{it} - v^u_{jt},\ v^l_{jt} - v^u_{it}) \le \mathrm{TDist}(d_i, d_j) \le v^u_{it} + v^u_{jt}$, $1 \le t \le |K|$
$u_{ij} = \min_{1 \le t \le |K|} (v^u_{it} + v^u_{jt})$
$l_{ij} = \max_{1 \le t \le |K|} \max(v^l_{it} - v^u_{jt},\ v^l_{jt} - v^u_{it})$
We refer to this algorithm as RSCombined (RSC).
Drawback: the vectors double in size
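When only LBDist and UBDist values to the reference documents are known, a lower bound that is safe under the triangle inequality is the larger of $v^l_{it} - v^u_{jt}$ and $v^l_{jt} - v^u_{it}$, clamped at 0. The sketch below uses that form; the function name and the sample vectors are illustrative.

```python
def rsc_bounds(lb_i, ub_i, lb_j, ub_j):
    """Return (l_ij, u_ij) from per-document lower/upper bound vectors."""
    upper = min(ui + uj for ui, uj in zip(ub_i, ub_j))
    lower = max(max(li - uj, lj - ui, 0)
                for li, ui, lj, uj in zip(lb_i, ub_i, lb_j, ub_j))
    return lower, upper

# One reference document. d_i is known to be at distance 4..6 from it,
# d_j at distance 0..10, so the two documents could even be identical.
print(rsc_bounds([4], [6], [0], [10]))   # (0, 16)

# d_i at distance 9..12 and d_j at distance 1..3: they are at least 6 apart.
print(rsc_bounds([9], [12], [1], [3]))   # (6, 15)
```

The first case shows why the two terms are not wrapped in an absolute value: $|4 - 10| = 6$ would overstate the distance between two documents that may coincide.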

Performance Evaluation
[figure: run time vs. number of nodes]

Performance Evaluation
[figure: run time vs. distance threshold (XMark)]
[figure: run time vs. distance threshold (DBLP)]

Performance Evaluation
[figure: run time vs. distance threshold (XMark)]
[figure: number of TDist calculations vs. distance threshold (XMark)]

Conclusion & Future Work
– The algorithms do not scale to huge data sets.
– The performance of these algorithms correlates strongly with the data itself.
– The performance of the reference set depends on the clustering algorithm chosen.
– Future work: incorporate other distance metrics into the algorithms.
– Future work: explore indexing schemes that can be used in the algorithms.

References
S. Guha, H. V. Jagadish, N. Koudas, D. Srivastava, and T. Yu, Approximate XML Joins, Proceedings of ACM SIGMOD.
K. Zhang and D. Shasha, Tree Pattern Matching, Oxford University Press.
S. Guha, R. Rastogi, and K. Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proceedings of ACM SIGMOD.
T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proceedings of ACM SIGMOD, 1996.