Presentation is loading. Please wait.

Presentation is loading. Please wait.

Measuring the Structural Similarity of Semistructured Documents Using Entropy Sven Helmer University of London, Birkbeck VLDB’07, September 23-28, 2007,

Similar presentations


Presentation on theme: "Measuring the Structural Similarity of Semistructured Documents Using Entropy Sven Helmer University of London, Birkbeck VLDB’07, September 23-28, 2007,"— Presentation transcript:

1 Measuring the Structural Similarity of Semistructured Documents Using Entropy Sven Helmer University of London, Birkbeck VLDB’07, September 23-28, 2007, Vienna, Austria 2008. 2. 5 Summarized and Presented by Seungseok Kang IDS Lab., Seoul National University

2 Copyright  2007 by CEBT Outline  Introduction  Measuring Entropy  Information Distance  Measuring Structural Similarity Extract structural information Ziv-Lempel Encoding Ziv-Merhav Crossparsing  Competitors Tree-editing distance Discrete Fourier Transformation Path shingles  Experiments Clustering Quality Document Collections Results  Conclusion IDS Lab. Seminar - 2Center for E-Business Technology

3 Copyright  2007 by CEBT Introduction  Semistructured document HTML, XHTML, XML, …  In traditional IR detecting similarities used widely For querying For clustering Lots of similarity measures for text documents  New challenges with semistructured documents Measuring structural similarity Semistructured documents show great structural diversity Measuring structural similarity can integrate heterogeneous data sources IDS Lab. Seminar - 3Center for E-Business Technology

4 Copyright  2007 by CEBT Measuring Entropy  Bennet et al. Introduced concept of universal information metric Based on Kolmogorov complexity – Given data object x, Kolmogorov complexity K(x) means the length of shortest program that outputs x Conditional Kolmogorov complexity K(x|y) – Generalized form of Kolmogorov complexity – Length of the shortest program with input y that outputs x IDS Lab. Seminar - 4Center for E-Business Technology

5 Copyright  2007 by CEBT Information Distance  Similarity of two data objects can be measured by normalized information distance Based on Conditional Kolmogorov complexity Has some nice property – Almost a metric property – Lower bound for admissible distance Unfortunately, Kolmogorov complexity is not computable in general – Can be approximated by compression – C(x) : length of compressed x IDS Lab. Seminar - 5Center for E-Business Technology

6 Copyright  2007 by CEBT Measuring Structural Similarity  Just Compressing XML files does not get the job done  Extract structural information Tags – List element/attribute names in document order Pairwise – Like tags, but with names of parents Path – Like tags, but with full path to root Family order – Family-order traversal of document  Except path, all extractions can be done in linear time IDS Lab. Seminar - 6Center for E-Business Technology

7 Copyright  2007 by CEBT Measuring Structural Similarity (cont.)  After extracting structural information Come up with similarity measure – NCD with GZIP (Zip-Lempel Encoding) – Ziv-Merhav crossparsing Can be done in linear time  We can calcurate the actual value for the NCD by encoding or crossparsing Crossparsing is alternative way to measure relative entropy between sequences in files – Instead of compressing IDS Lab. Seminar - 7Center for E-Business Technology

8 Copyright  2007 by CEBT Measuring Structural Similarity (cont.)  Ziv-Lempel Encoding Build a directory for encoding by utilizing substring that have already appeared in the previous encoded part of the document – ex) String to encode “aababbccccd”,,,,  Ziv-merhav Crossparsing Find the longest prefix of string x that appears as a string in y until we have parsed the whole document x – ex) x=“abbbbaaabba”, y=“baababaabba” c(x|y) = 3, the three phrases being “abb”,”bba”,”aabba” IDS Lab. Seminar - 8Center for E-Business Technology

9 Copyright  2007 by CEBT Competitors  Tree-editing distance Nierman and Jagadish  Discrete Fourier Transformation Flesca et al.  Path shingles Buttler IDS Lab. Seminar - 9Center for E-Business Technology

10 Copyright  2007 by CEBT Tree-editing distance  Measuring the minimum editing distance between string Levenshtein distance  Five different edit operations Relabel Insert Delete Insert Tree Delete Tree  Quadratic runtime O(|d 1 ||d 2 |) – |d n | : size of document d IDS Lab. Seminar - 10Center for E-Business Technology

11 Copyright  2007 by CEBT Discrete Fourier Transformation  Called DFT technique  The feature of an XML document relevant to its structure are described via a numerical encoding  Numerical values are interpreted as a time series Rotate document by 90 ◦, interpret indentations as time series Use DFT transform to compute similarity – Tag encoding – Document encoding – Document similarity  Runtime O(NlogN) – where N = max((|d 1 ||d 2 |) IDS Lab. Seminar - 11Center for E-Business Technology

12 Copyright  2007 by CEBT Path Shingles  Extract structural information using Full Path variant  Compute a hash value h j for each path  A shingle of width w is the combination of w consecutive hash values  Compute similarity between two documents using Dice coefficient on the two sets of shingles  Not linear run time IDS Lab. Seminar - 12Center for E-Business Technology

13 Copyright  2007 by CEBT Clustering quality  Measure quality of similarity measure by clustering  Hierarchical agglomerative clustering  Quality expressed in number of misclustering in dendrogram IDS Lab. Seminar - 13Center for E-Business Technology

14 Copyright  2007 by CEBT Document Collections  Use three different document collections for experiment Real data set – SIGMOD record, INEX 2005, music sheets encoded in XML Synthetically generated data sets from DFT paper Own synthetically generated various data sets – Element names – Element frequencies – Element positions – Element depths IDS Lab. Seminar - 14Center for E-Business Technology

15 Copyright  2007 by CEBT Overall Results  Number of misclustering for different methods IDS Lab. Seminar - 15Center for E-Business Technology

16 Copyright  2007 by CEBT Detailed results  Different method have different strengths and weaknesses Tree-edit – Generally good, has problems with largely varying document sizes DFT – Good at frequencies, bad at element names, position, and depth GZIP/Ziv-Merhav – Bad at frequencies, good at element names, position, and depth  DFT and GZIP/Ziv-Merhav are complementary to each other Performance of hybrid becomes even better Hybrid approach does not have linear run time IDS Lab. Seminar - 16Center for E-Business Technology

17 Copyright  2007 by CEBT Runtime  Shallow documents Crossparsing/pairwise, shingle path algorithm  Deep documents Crossparsing/pairwise IDS Lab. Seminar - 17Center for E-Business Technology

18 Copyright  2007 by CEBT Conclusion  Determining the similarity of document Based on measuring entropy  Encoding/Crossparsing approach totally different to previous approaches  Can be done in linear time Important for large document collection  Future work More sophisticated ways of encoding document structure Different entropy measure better suited for structural information IDS Lab. Seminar - 18Center for E-Business Technology


Download ppt "Measuring the Structural Similarity of Semistructured Documents Using Entropy Sven Helmer University of London, Birkbeck VLDB’07, September 23-28, 2007,"

Similar presentations


Ads by Google