Finding Syntactic Similarities Between XML Documents. Davood Rafiei, University of Alberta. Joint work with Daniel Moise, University of Alberta, and Dabo Sun, University of Alberta.


1 Finding Syntactic Similarities Between XML Documents
Davood Rafiei, University of Alberta
Joint work with Daniel Moise, University of Alberta, and Dabo Sun, University of Alberta

2 Motivations
Ranked retrievals
–e.g. query: book[author=‘Abiteboul’ and year=‘2000’]
DTD extraction
–useful for query processing
Clustering
–for efficient storage and indexing
–for efficient retrievals (similar documents are expected to match the same queries more often)

3 Problem Statement
How to measure similarity (or distance) between XML documents
Desired properties
–The distance must be a metric
–Documents generated by the same DTD are expected to be closer to each other
–Documents with more tags in common are expected to be more similar
–Interested in syntactic similarity only

4 Examples
Similar documents and non-similar documents
[Example XML documents; markup not preserved. Surviving element values: Abiteboul 2000 John 20 Abiteboul 2000 John 1994 George Animal Farm]

5 Related Work
Structural Similarity
–Edit distance between ordered trees (Nierman and Jagadish [11], Zhang et al. [21, 23], Chawathe et al. [96])
–Edit distance between unordered trees: NP-complete (Zhang et al. [22])
Specialized Solutions (Flesca et al. [5], Zaki and Aggarwal [20])

6 Related Work (Cont.)
More Syntactic Similarity
–Based on common parent-child tags (Lian et al. [10]); e.g. of non-similar documents: [two example documents; markup not preserved. Surviving values: A T 2006 and A T 2006]
–Use parent-child tags, twigs, content terms, semantic relationships (Theobald et al. [18])

7 Structural Sketch
For every path in d, there is a path in t and vice versa, and t is minimal.
[Figure: document tree d (values John, Mary, u200) and its sketch tree t]

8 Sketch Similarity
[Figure: document tree d (values John, Mary, u200) and sketch tree t]
Problems of matching trees
Sketch tree is not unique

9 Path Sets
Root paths
–user/person/name
–user/person/id
Path set
–user/person/name, user/person/id
–user/person, person/name, person/id
–user, person, name, id
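The slide's example can be reproduced with a short sketch. It assumes that a root path runs from the document root to a leaf element and that the path set holds every contiguous sub-path of every root path, which is what the nine paths listed above suggest. The function names and the sample document are hypothetical, not taken from the talk.

```python
import xml.etree.ElementTree as ET

def root_paths(xml_text):
    """Return the set of tag paths (as tuples) from the root to each leaf element."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + [node.tag]
        children = list(node)
        if not children:              # leaf element: record the full root path
            paths.add(tuple(path))
        for child in children:
            walk(child, path)

    walk(root, [])
    return paths

def path_set(xml_text):
    """Return every contiguous sub-path of every root path, as 'a/b/c' strings."""
    result = set()
    for path in root_paths(xml_text):
        for i in range(len(path)):
            for j in range(i + 1, len(path) + 1):
                result.add("/".join(path[i:j]))
    return result

# Hypothetical document shaped like the slide's example.
doc = "<user><person><name>John</name><id>u200</id></person></user>"
print(sorted(path_set(doc)))
```

Because the result is a set, repeated root paths (for example, a second person element with the same children) contribute nothing new.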

10 Similar Path Sets
Standard set comparisons apply
–E.g. Cosine, Jaccard, Dice
Path set size: at most nl(l+1)/2
–for n root paths, each of length l (a path of length l has l(l+1)/2 contiguous sub-paths)
Fast similarity comparison
–Cost: linear in the size of the path set
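As a minimal sketch of the three set comparisons named above, the functions below use the standard set-based definitions of the Jaccard, Dice, and cosine coefficients. The two path sets are hypothetical, and the handling of empty sets is an assumption rather than something the slides specify.

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)"""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0

def cosine(a, b):
    """|A ∩ B| / sqrt(|A| |B|), i.e. cosine of the two binary membership vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical path sets: p2 lacks every path that mentions "id".
p1 = {"user", "person", "name", "id", "user/person", "person/name",
      "person/id", "user/person/name", "user/person/id"}
p2 = {"user", "person", "name", "user/person", "person/name",
      "user/person/name"}

print(jaccard(p1, p2), dice(p1, p2), cosine(p1, p2))
```

With hash-based sets, each intersection or union takes time roughly proportional to the sizes of the two sets, which matches the linear cost stated on the slide.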

11 Evaluation
Effectiveness in clustering documents generated by the same DTD
–Count the mis-clusterings
For result comparison
–Used the same dataset and setting as some earlier work
Also used a larger dataset

12 Real Data
XML files of ACM SIGMOD Record since March 1999
Four DTDs (total of 989 XML files)
–ProceedingsPage: 17 XML files
–IndexTermsPage: 920 XML files
–OrdinaryIssuePage: 51 XML files
–SigmodRecord: 1 XML file

13 Synthetic Data
Generated using the IBM XML Generator
DTDs
–Set A: the set used by Nierman and Jagadish
–Set B: set A plus 5 more DTDs
Parameters
–M: max repeat for + or *
–P: probability of an optional attribute

14 Example Clusters

15 Mis-Clusterings
Cosine was used for similarity measurements
–Also tried Jaccard and Dice coefficients, but the results weren’t better.
[Table; values not preserved. Columns: Real data, DS1-DS4, DS5, DS6, DS7, DS8. Rows: Binary vector, Freq. vec, N. freq. vec]
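To illustrate the difference between the binary and frequency rows, here is a hedged sketch that assumes each document is represented as a vector indexed by path, holding either 0/1 membership or occurrence counts. That reading is an assumption. The slides do not define "N. freq. vec", so it is not sketched here. The sample path lists are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical path occurrences for two documents (repeats kept on purpose).
paths_a = ["user/person/name", "user/person/name", "user/person/id"]
paths_b = ["user/person/name", "user/person/id", "user/person/id", "user/person/id"]

# Frequency vectors count each path; binary vectors only record presence.
freq_a, freq_b = Counter(paths_a), Counter(paths_b)
bin_a = {path: 1 for path in freq_a}
bin_b = {path: 1 for path in freq_b}

print("binary:   ", cosine(bin_a, bin_b))
print("frequency:", cosine(freq_a, freq_b))
```

The two documents contain the same paths, so the binary vectors are identical, while the frequency vectors differ because the repetition counts differ.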

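16 Comparison
Our results
[Table; values not preserved. Columns: Real data, DS1-DS4, DS5, DS6, DS7, DS8. Rows: Binary vector, Freq. vec, N. freq. vec]
Earlier results
[Table; values not preserved. Columns: Real data, DS1, DS2, DS3, DS4. Rows: Nierman, Chawathe, Shasha, Tag Freq]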

17 Tag Frequency
[Table; values not preserved. Columns: Real data, DS1, DS2, DS3, DS4. Rows: City block, Euclidean, Cosine]
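The tag-frequency baseline and the three measures in the table can be sketched as follows, assuming each document is reduced to a vector of tag counts and that the cosine row refers to one minus the cosine similarity. Both are assumptions, and the helper names and sample documents are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def tag_frequencies(xml_text):
    """Count how often each element tag occurs in the document."""
    return Counter(el.tag for el in ET.fromstring(xml_text).iter())

def city_block(u, v):
    """Sum of absolute differences (L1 distance); missing tags count as 0."""
    return sum(abs(u[k] - v[k]) for k in u.keys() | v.keys())

def euclidean(u, v):
    """Square root of summed squared differences (L2 distance)."""
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in u.keys() | v.keys()))

def cosine_distance(u, v):
    """One minus the cosine similarity of the two count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / norm if norm else 1.0

# Hypothetical documents for illustration only.
doc_a = "<book><author>Abiteboul</author><year>2000</year></book>"
doc_b = ("<book><author>John</author><author>George</author>"
         "<title>Animal Farm</title></book>")

fa, fb = tag_frequencies(doc_a), tag_frequencies(doc_b)
print(city_block(fa, fb), euclidean(fa, fb), cosine_distance(fa, fb))
```

Tag counts ignore nesting, so two documents with the same tags arranged differently get distance zero under all three measures; path sets retain some of that nesting information.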

18 Conclusions
Presented a method for clustering documents generated by the same DTD
Compared to tree-edit-distance-based methods, our method is
–more effective (based on our evaluations)
–and also much more efficient

19 Future Work
Detecting documents with similar structures and related tag names, e.g. [example document; markup not preserved. Surviving values: Abiteboul 2000]
Possible solutions:
–allow users to specify relabeling rules
–learn relabeling rules from training data

