
1 Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. A novel document similarity measure based on earth mover’s distance. Presenter: Shao-Wei Cheng. Author: Xiaojun Wan. InfSci 2007.

2 Outline: Motivation, Objective, Methodology, Experiments, Conclusion, Comments

3 Motivation: Measuring pair-wise document similarity is crucial for various text applications, including document clustering, document filtering, and nearest neighbor search. Many measures already exist: VSM-based (Cosine, Dice, Jaccard, Overlap); information-theoretic; retrieval-model-based (BM25, NVSM, LM); and OM-based, which uses document structure information but allows only one-to-one matching.

4 Objectives: Go beyond one-to-one matching to many-to-many matching. More information, more natural.

5 Methodology: Framework. Step 1, document decomposition: TextTiling or sentence clustering. Step 2, similarity measure: the proposed EMD-based (earth mover’s distance) measure, which improves the OM-based measure to allow many-to-many matching.

6 Methodology: TextTiling: tokenization, lexical score determination, boundary identification. Sentence clustering: a hierarchical agglomerative clustering algorithm that uses the average-link method to compute similarity. The merging threshold can be determined through cross-validation.
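The sentence clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words cosine similarity, the whitespace tokenizer, and the names `cosine` and `average_link_clusters` are all assumptions made for the example, and the threshold is passed in by hand rather than tuned by cross-validation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_link_clusters(sentences, threshold):
    """Agglomerative clustering of sentences with average-link similarity.

    Returns clusters as lists of sentence indices. Merging stops once the
    best average similarity between any two clusters drops below `threshold`.
    """
    # Assumption: naive lowercase whitespace tokenization.
    vectors = [Counter(s.lower().split()) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]

    def avg_sim(c1, c2):
        # Average-link: mean pairwise similarity across the two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, (a, b) = max(
            (avg_sim(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


sentences = [
    "the cat sat",
    "the cat slept",
    "stocks fell sharply",
    "stocks rose sharply",
]
clusters = average_link_clusters(sentences, threshold=0.5)
print(clusters)
```

Each resulting cluster plays the role of one subtopic of the document; a lower threshold yields fewer, larger subtopics.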

7 Methodology: OM-based measure. Recast the similarity measure as an optimal matching problem. Constraint of the optimal matching problem: no two edges share the same node. Find the matching M (the best subset of the edge set E) that has the largest total weight. One-to-one matching might lose information.
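To make the one-to-one constraint concrete, here is a small brute-force sketch of optimal matching on a subtopic similarity matrix. It enumerates every one-to-one assignment, which is practical only for a handful of subtopics; a real system would more likely use the Kuhn-Munkres (Hungarian) algorithm. The function name `optimal_matching` and the example weights are hypothetical.

```python
from itertools import permutations


def optimal_matching(sim):
    """Maximum-weight one-to-one matching by exhaustive search.

    sim[i][j] is the similarity between subtopic i of document X and
    subtopic j of document Y. Returns (total_weight, [(i, j), ...]).
    """
    rows, cols = len(sim), len(sim[0])
    transposed = rows > cols
    if transposed:
        # Work with the smaller side as rows so every row gets a partner.
        sim = [list(col) for col in zip(*sim)]
        rows, cols = cols, rows

    best_total, best_pairs = float("-inf"), []
    for perm in permutations(range(cols), rows):
        # perm[i] is the column matched to row i; no column is reused,
        # so no two edges share a node.
        total = sum(sim[i][perm[i]] for i in range(rows))
        if total > best_total:
            best_total = total
            best_pairs = [(i, perm[i]) for i in range(rows)]

    if transposed:
        best_pairs = [(j, i) for i, j in best_pairs]
    return best_total, best_pairs


# Hypothetical similarities between two documents with two subtopics each.
sim = [
    [0.9, 0.8],
    [0.8, 0.1],
]
total, pairs = optimal_matching(sim)
print(total, pairs)
```

In this example subtopic 0 of the first document is similar to both subtopics of the second, but the matching can credit only one of those edges. That is the information loss the slide refers to.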

8 Methodology: EMD-based measure. Recast the similarity measure as a transportation problem. The earth mover’s distance: find a flow F = [f_ij] that minimizes the overall cost, subject to constraints (the constraint formula is not preserved in this transcript).
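As a rough sketch of the transportation view, the code below computes the earth mover's distance under the standard textbook formulation, which may differ in detail from the formula on the original slide: flows are non-negative, the flow leaving source i is at most its weight, the flow entering sink j is at most its weight, the total flow equals the smaller of the two total weights, and the EMD is the total cost divided by the total flow. The solver is a plain successive-shortest-path min-cost flow in pure Python, chosen only to keep the example dependency-free; the name `emd` and the example numbers are hypothetical.

```python
def emd(weights_p, weights_q, dist, eps=1e-12):
    """Earth mover's distance between two weighted sets of subtopics.

    weights_p[i], weights_q[j]: non-negative weights of the subtopics.
    dist[i][j]: non-negative ground distance between subtopic i and j.
    """
    m, n = len(weights_p), len(weights_q)
    source, sink, size = m + n, m + n + 1, m + n + 2
    inf = float("inf")
    # Residual graph: edge k and its reverse edge k ^ 1 are stored in pairs.
    to, cap, cost = [], [], []
    adj = [[] for _ in range(size)]

    def add_edge(u, v, capacity, unit_cost):
        for a, b, c, w in ((u, v, capacity, unit_cost), (v, u, 0.0, -unit_cost)):
            adj[a].append(len(to))
            to.append(b)
            cap.append(c)
            cost.append(w)

    for i in range(m):
        add_edge(source, i, weights_p[i], 0.0)      # supply limit
        for j in range(n):
            add_edge(i, m + j, inf, dist[i][j])      # many-to-many flow
    for j in range(n):
        add_edge(m + j, sink, weights_q[j], 0.0)     # demand limit

    target = min(sum(weights_p), sum(weights_q))
    total_flow = total_cost = 0.0
    while total_flow < target - eps:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        best = [inf] * size
        prev_edge = [-1] * size
        best[source] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if best[u] == inf:
                    continue
                for e in adj[u]:
                    if cap[e] > eps and best[u] + cost[e] < best[to[e]] - eps:
                        best[to[e]] = best[u] + cost[e]
                        prev_edge[to[e]] = e
                        changed = True
            if not changed:
                break
        if best[sink] == inf:
            break
        # Find the bottleneck capacity along the path, then push flow.
        push, v = target - total_flow, sink
        while v != source:
            push = min(push, cap[prev_edge[v]])
            v = to[prev_edge[v] ^ 1]
        v = sink
        while v != source:
            e = prev_edge[v]
            cap[e] -= push
            cap[e ^ 1] += push
            v = to[e ^ 1]
        total_flow += push
        total_cost += push * best[sink]
    return total_cost / total_flow if total_flow > 0 else 0.0


# One subtopic of document X is spread over two subtopics of document Y.
print(emd([1.0], [0.5, 0.5], [[0.2, 0.6]]))
```

Unlike the one-to-one matching, a single subtopic here sends part of its weight to each of several subtopics in the other document. How the ground distances and weights are derived from the subtopics is not shown on the slide, so the inputs above are purely illustrative.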

9 Experiments: Performance comparison for different similarity measures. MAP: non-interpolated mean average precision.
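For readers unfamiliar with the metric, non-interpolated mean average precision can be sketched as below. The function names and the toy rankings are hypothetical and are not taken from the experiments in the paper.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query.

    Precision is sampled at the rank of each relevant document retrieved
    and averaged over the total number of relevant documents.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data: two queries with their ranked results and relevant sets.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(mean_average_precision(runs))
```

For the first query the relevant documents appear at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2; the second query contributes 1/2.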

10 Experiments: Influence of the document decomposition algorithm: sentence clustering algorithm vs. TextTiling.

11 Conclusion: The proposed measure can overcome the one-to-one matching problem, and the experimental results show the effectiveness and robustness of the EMD-based measure. Future work: combine the Cosine measure and the EMD-based measure in a re-ranking process; try other document decomposition algorithms.

12 Comments: Advantage: recasts document similarity measurement as another mathematical problem. Drawback: none listed. Applications: clustering, classification, search engines, …

