Differential Analysis on Deep Web Data Sources Tantan Liu, Fan Wang, Jiedan Zhu, Gagan Agrawal December.


1 Differential Analysis on Deep Web Data Sources Tantan Liu, Fan Wang, Jiedan Zhu, Gagan Agrawal {liut,wangfa,zhujie,agrawal}@cse.ohio-state.edu December 14, 2010

2 Outline Introduction Problem Definition Differential Analysis and Approaches Experiment Result Conclusion

3 Introduction Deep web –Query forms vs. backend databases –Similar information from multiple data sources –What’s their difference? –Application: guiding users’ search process Higher-level knowledge summary –Patterns of values with respect to the same entity

4 Problem definition Goal –Difference between multiple data sources in the same domain Patterns of values of the same entity –Different values for the same data entity For example: prices of commodities –How different is the data, under what conditions? –Differential Rules Capturing the difference of values

5 Differential Analysis and Approaches Summarizing difference between two data sources Data queried from the deep web –A relational table Attributes –Assumption: data sources have same attributes –Identical attributes Same values for the same data object –Differential attributes Different values for the same data object –Quantitative attributes Differences in values of quantitative attributes

6 Differential Analysis and Approaches- Useful Identifiers Two data sources A and B –Identical attributes –Differential attributes: each is an attribute in one data source –Combining relation tables of A and B –Differential rule, where profile X is the left-hand side of the rule
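The step of combining the relation tables of A and B can be illustrated with a minimal sketch. It assumes each source is a list of row dictionaries, that rows are matched on the identical attributes, and that each differential attribute is kept once per source under hypothetical `_A` and `_B` suffixes; none of these names come from the slides.

```python
def combine_sources(table_a, table_b, identical_attrs, differential_attrs):
    """Join two lists of row dicts on their identical attributes.

    Each differential attribute appears twice in the result, once per
    source, using the (assumed) suffixes "_A" and "_B".
    """
    def key(row):
        return tuple(row[a] for a in identical_attrs)

    # Index source B by the identical attributes for constant-time lookup.
    index_b = {key(row): row for row in table_b}

    combined = []
    for row_a in table_a:
        row_b = index_b.get(key(row_a))
        if row_b is None:
            continue  # object appears only in A, so there is nothing to compare
        merged = {a: row_a[a] for a in identical_attrs}
        for d in differential_attrs:
            merged[f"{d}_A"] = row_a[d]
            merged[f"{d}_B"] = row_b[d]
        combined.append(merged)
    return combined


# Made-up rows, for illustration only.
site_a = [
    {"hotel_id": 1, "city": "Paris", "star": 4, "price": 180.0},
    {"hotel_id": 2, "city": "Rome", "star": 3, "price": 95.0},
]
site_b = [
    {"hotel_id": 1, "city": "Paris", "star": 4, "price": 172.0},
    {"hotel_id": 3, "city": "Oslo", "star": 5, "price": 240.0},
]
rows = combine_sources(site_a, site_b, ["hotel_id", "city", "star"], ["price"])
```

Only objects present in both sources survive the join, since a difference is defined only for values of the same data object.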

7 Differential Analysis and Approaches- Differential Rule Mining Frequent Item Set Mining –Apriori algorithm –A concept hierarchy Identifying patterns for target attributes –For each frequent itemset X Decide –Paired Z-test : difference between two random variables Hypothesis test vs. if >, then – if >0, then
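The hypothesis test on this slide can be sketched as follows. The formulas did not survive in the transcript, so this is a standard paired Z-test under stated assumptions: a two-sided test at significance level `alpha`, with the sign of Z giving the direction of the rule. The function name, the `alpha` default, and the direction labels are illustrative, not taken from the slides.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_z_test(values_a, values_b, alpha=0.05):
    """Return (z, direction) for paired samples from sources A and B.

    direction is "A > B", "A < B", or None when the difference is not
    significant. The two-sided test and alpha are assumptions.
    """
    if len(values_a) != len(values_b) or len(values_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")

    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    d_bar = mean(diffs)
    s = stdev(diffs)  # sample standard deviation of the paired differences

    if s == 0:
        # Every pair differs by the same amount.
        if d_bar == 0:
            return 0.0, None
        z = math.copysign(math.inf, d_bar)
    else:
        z = d_bar / (s / math.sqrt(n))

    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    if abs(z) <= critical:
        return z, None
    return z, "A > B" if z > 0 else "A < B"


# Made-up prices of the same six hotels on two sites. A Z-test normally
# assumes a much larger sample; six pairs keep the example readable.
prices_a = [110, 120, 130, 125, 140, 135]
prices_b = [100, 108, 121, 113, 131, 122]
z, direction = paired_z_test(prices_a, prices_b)
```

In the mining loop, a test like this would be run once per frequent itemset X, on the rows of the combined table that match X.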

8 Differential Analysis and Approaches- Pruning Rules Pruning rules –A large number of rules are generated –Essential rules predict unessential rules –Identifying essential rules Direction of rules

9 Differential Analysis and Approaches- ancestors of rules Rules R1, R2 are complementary ancestors of rule R –R1: Y->d, R2: Z->d –R: X->d, and rule R is predicted by complementary ancestors R1 and R2

10 Differential Analysis and Approaches- Profile Representation Identifying essential Rules –Rules are processed level by level –For rule R in level k, all the rules from level 1 to k-1 are visited –Computation cost is expensive Profile Representation –Uniquely describe items contained in the profile X of a rule R –For profile, define would be extremely large when profile X is large –Thus, we modify

11 Differential Analysis and Approaches- Process of Pruning Hash table is used to store differential rules Each level corresponds to a hash table For each rule R in the k-th level –The ancestor rules from level 1 to k/2 are visited –Identifying complementary rules by profile representation –R is an unessential rule if it is predicted by a pair of complementary ancestor rules –Process the next rule
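The pruning pass above can be sketched roughly as follows. Several points are assumptions, because the slide formulas are missing from the transcript: a rule's level is taken to be the number of items in its profile, two ancestors Y->d and Z->d are treated as complementary for X->d when Y and Z are disjoint and together make up X, and a `frozenset` stands in for the slides' profile representation as the hash key. The concept hierarchy is ignored.

```python
from itertools import combinations


def is_predicted(profile, direction, levels):
    """True if some complementary pair of ancestors has the same direction."""
    k = len(profile)
    items = sorted(profile)
    # Visiting sizes 1..k/2 suffices: the complement of a size-j ancestor
    # has size k-j, so larger sizes would only revisit the same pairs.
    for size in range(1, k // 2 + 1):
        small_table = levels.get(size, {})
        large_table = levels.get(k - size, {})
        if not small_table or not large_table:
            continue
        for subset in combinations(items, size):
            y = frozenset(subset)
            z = profile - y  # complementary ancestor profile
            if small_table.get(y) == direction and large_table.get(z) == direction:
                return True
    return False


def prune_unessential(rules):
    """rules maps a frozenset profile to a direction string such as "A > B"."""
    # One hash table (dict) per level, keyed by profile.
    levels = {}
    for profile, direction in rules.items():
        levels.setdefault(len(profile), {})[profile] = direction

    essential = {}
    for k in sorted(levels):
        for profile, direction in levels[k].items():
            if not is_predicted(profile, direction, levels):
                essential[profile] = direction
    return essential


# Made-up rules, for illustration only.
rules = {
    frozenset({"city=Paris"}): "A > B",
    frozenset({"star=4"}): "A > B",
    frozenset({"city=Paris", "star=4"}): "A > B",
    frozenset({"city=Rome", "star=4"}): "A < B",
}
essential = prune_unessential(rules)
```

Here the rule on {city=Paris, star=4} is dropped because both single-item ancestors already point the same way, while the Rome rule is kept because no ancestor pair predicts its direction.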

12 Experiment Results Data Set: four of the most popular travel sites. 120 randomly selected cities all over the world Attributes –Hotel ID, City, Star, Customer Rating, Cleanness Rating, Price, Service Rating Concept Hierarchy for attribute: city

13 Experiment Results - effectiveness

14 Experiment Results – Pruning effectiveness

15 Experiment Results- Efficiency

16 Experiment Results -Mining-Utility of the Approach

17 Conclusion A method to extract a high-level summary of the differences in multiple data sources Differential rule mining – A new data mining problem Statistical test for discovering differential rules A method to prune unessential rules Hash tables are used to speed up the process. Experiments on four travel-related deep web data sources show good results.

18 Questions?

