HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING. Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research).

1 HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING. Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)

2 Scenario Arnab Nandi & Phil Bernstein 2

3 Scenario. Search over structured data: commerce, entertainment. Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.

4 Scenario (diagram). Labels: users, query, results, search engine + data warehouse, 3rd party feed (Amazon.com). Goals: high precision, high recall, minimal human involvement.

5 Example Feed. Indiana Jones and The Kingdom of The Crystal Skull 2008 Ever… 127 Action Comedy PG-13 http://www.indianajones.com/site/index.html Harrison Ford. Warehouse: Movies (Host), 3rd Party Movie Site (Foreign). 57590 Indiana Jones and the Kingdom of the Crystal Skull 02:00 Action/Adventure NR http://www.indianajones.com/ Harrison Ford Karen Allen

6 Schema Matching. Indiana Jones and The Kingdom of The Crystal Skull 2008 Ever… 127 Action Comedy PG-13 http://www.indianajones.com/site/index.html Harrison Ford. Warehouse: Movies (Host), 3rd Party Movie Site (Foreign). 57590 Indiana Jones and the Kingdom of the Crystal Skull 02:00 Action/Adventure NR http://www.indianajones.com/ Harrison Ford Karen Allen

7 Taxonomy Matching. Indiana Jones and The Kingdom of The Crystal Skull 2008 Ever… 127 Action Comedy PG-13 http://www.indianajones.com/site/index.html Harrison Ford. Warehouse: Movies (Host), 3rd Party Movie Site (Foreign). 57590 Indiana Jones and the Kingdom of the Crystal Skull 02:00 Action/Adventure NR http://www.indianajones.com/ Harrison Ford Karen Allen

8 Various Problems. Badly normalized data; unit conversion; formatting choices; in-band signaling; arbitrary labels; non-standard vocabulary / language; zero documentation; not enough instances.

9 Unlike conventional matching… We have web search click data for both the warehouse & the 3rd party website. The databases we are integrating (usually) have a presence on the web. Why not use click data as a feature for schema & taxonomy matching?

10 Outline. Scenario; Using Clicklogs; Core idea; Using Query Distributions; Example; System Architecture; Results.

11 Core idea. If two (sets of) products are searched for by similar queries, then they are similar. Example web search query: "small laptop".

12 Core idea (clicklog diagram). Labels: Warehouse hardware (Small Laptops, Pro. Laptops); Asus.com (eee); query "small laptop"; eee ::: small laptops; X, Y, Z.

13 Query Distributions (chart; axis label: click count)

14 Mapping to Taxonomy. Map each URL to a product, which belongs to a taxonomy. Example: http://www.amazon.com/dp/B001JTA59C maps to Shopping | Electronics | Netbooks, using the 3rd party DB (provided to us).

15 Aggregating Query Distributions (diagram). Labels: Warehouse hardware (Small Laptops, Pro. Laptops); Asus.com (eee); eee ::: small laptops.

16 Aggregate URLs to categories. Aggregate the queries for each URL up to its schema element / taxonomy term. Electronics | Electronics Features | Brands | Asus EEE: netbook, laptop, cheap laptop. Office Products | Office Machines | Netbooks: netbook.

17 Generating Correspondences
Goal: to match two schema elements (or categories), determine whether they have the same distribution of queries searching for them.
Process:
For each page (URL): identify its query distribution; identify the category / schema element of that page.
For each category / schema element C: aggregate over the pages in C to get C's query distribution.
For each foreign category / schema element: find the host category / schema element with the most similar query distribution.
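The three process steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the clicklog rows, URLs, and category names are hypothetical, and `overlap` is a simple stand-in similarity (shared probability mass of identical queries) rather than the metric defined later in the talk.

```python
from collections import Counter, defaultdict

def query_distributions(clicklog, url_to_category):
    """Steps 1 and 2: aggregate (query, clicks, url) rows into one
    normalized query distribution per category / schema element."""
    counts = defaultdict(Counter)
    for query, clicks, url in clicklog:
        category = url_to_category.get(url)
        if category is not None:  # skip pages that cannot be placed
            counts[category][query] += clicks
    dists = {}
    for category, c in counts.items():
        total = sum(c.values())
        dists[category] = {q: n / total for q, n in c.items()}
    return dists

def overlap(p, q):
    """Stand-in similarity (an assumption): mass shared by identical queries."""
    return sum(min(p[x], q[x]) for x in p.keys() & q.keys())

def best_matches(host, foreign, similarity=overlap):
    """Step 3: for each foreign category, pick the most similar host category."""
    return {
        f: max(host, key=lambda h: similarity(host[h], f_dist))
        for f, f_dist in foreign.items()
    }

# Hypothetical data, for illustration only.
host_log = [
    ("digital camera", 30, "host/cam1"),
    ("dslr", 10, "host/cam2"),
    ("smartphone", 40, "host/phone1"),
]
host_map = {"host/cam1": "Cameras", "host/cam2": "Cameras", "host/phone1": "Phones"}
foreign_log = [("dslr", 8, "shop/d1"), ("digital camera", 2, "shop/d1")]
foreign_map = {"shop/d1": "DSLR"}

host = query_distributions(host_log, host_map)
foreign = query_distributions(foreign_log, foreign_map)
matches = best_matches(host, foreign)
print(matches)
```

Passing the similarity function as a parameter keeps the matching step independent of the particular metric chosen.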

19 Example: Taxonomy Matching
query          freq  url
laptop         70    http://searchengine.com/product/macbookpro
laptop         25    http://searchengine.com/product/mininote
laptop         5     http://asus.com/eeepc
netbook        5     http://searchengine.com/product/macbookpro
netbook        20    http://searchengine.com/product/mininote
netbook        15    http://asus.com/eeepc
cheap laptop   5     http://asus.com/eeepc
Categories: Warehouse: Small Laptops; Warehouse: Professional Laptops; eee.

20 Example: Taxonomy Matching. Warehouse: Small Laptops: laptop 25/45, netbook 20/45. Warehouse: Professional Laptops: laptop 70/75, netbook 5/75. eee: laptop 5/25, netbook 15/25, cheap laptop 5/25.
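The fractions on this slide follow from the clicklog rows on slide 19 by summing click counts per category. A short sketch of that arithmetic is given below. The URL-to-category mapping is inferred from the totals (45, 75, 25) and is an assumption, and the third eee query is taken as "cheap laptop", as on slides 20 and 22.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# (query, click count, url) rows from the example clicklog.
clicklog = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap laptop", 5, "http://asus.com/eeepc"),
]

# Assumed mapping from page URL to category, inferred from the slide totals.
url_to_category = {
    "http://searchengine.com/product/macbookpro": "Warehouse: Professional Laptops",
    "http://searchengine.com/product/mininote": "Warehouse: Small Laptops",
    "http://asus.com/eeepc": "eee",
}

# Sum click counts per (category, query).
counts = defaultdict(Counter)
for query, freq, url in clicklog:
    counts[url_to_category[url]][query] += freq

# Normalize each category's counts into a query distribution.
shares = {
    category: {q: Fraction(n, sum(c.values())) for q, n in c.items()}
    for category, c in counts.items()
}

for category, c in counts.items():
    total = sum(c.values())
    print(category, ", ".join(f"{q}: {n}/{total}" for q, n in c.items()))
```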

21 Distribution Similarity Metric: $\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$

22 Example: Taxonomy Matching. Small Laptops vs eee: laptop vs laptop, 1 x min(25/45, 5/25); netbook vs netbook, 1 x min(20/45, 15/25); laptop vs cheap laptop, 0.5 x min(25/45, 5/25). Total: 1 x (5/25) + 1 x (20/45) + 0.5 x (5/25) = 0.74. Distributions: Warehouse: Small Laptops: laptop 25/45, netbook 20/45. Warehouse: Professional Laptops: laptop 70/75, netbook 5/75. eee: laptop 5/25, netbook 15/25, cheap laptop 5/25. Scores against eee: Small Laptops 0.74, Professional Laptops 0.31.
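The worked score can be checked with a short sketch. Two readings are assumed here, since the transcript does not spell them out: Jaccard is computed over the word sets of the two queries (which gives 0.5 for "laptop" vs "cheap laptop"), and MinFreq is the smaller of the two relative frequencies. The distributions are copied from slide 20.

```python
from fractions import Fraction as F

def jaccard(a, b):
    """Jaccard similarity between the word sets of two non-empty queries."""
    wa, wb = set(a.split()), set(b.split())
    return F(len(wa & wb), len(wa | wb))

def similarity(host, foreign):
    """Sum over all (host query, foreign query) pairs of
    Jaccard(q_host, q_foreign) * min(freq_host, freq_foreign)."""
    return sum(
        jaccard(qh, qf) * min(fh, ff)
        for qh, fh in host.items()
        for qf, ff in foreign.items()
    )

small = {"laptop": F(25, 45), "netbook": F(20, 45)}
pro = {"laptop": F(70, 75), "netbook": F(5, 75)}
eee = {"laptop": F(5, 25), "netbook": F(15, 25), "cheap laptop": F(5, 25)}

small_score = similarity(small, eee)
pro_score = similarity(pro, eee)
print(round(float(small_score), 2), round(float(pro_score), 2))
```

Under these assumptions the Small Laptops score rounds to the slide's 0.74. The Professional Laptops score comes out near 0.37 rather than 0.31, so the original computation presumably differs in a detail that the transcript does not preserve; the ranking is the same either way.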

23 Advantages of Clicklogs. Resilient to language. Resilient to new domains, data, and features. As long as people query & click, we have data to learn from. Generates mappings previous methods can't. Example category pairs: Electronics | Electronics Features | Brands | Texas Instruments with Office Products | Office Machines | Calculators; Software | Categories | Programming | Programming Languages | Visual Basic with Software | Developer Tools.

24 System Design

26 Experimenting with Click Logs. Task: commercial warehouse mapping of 258 products from a 70,000-term Amazon.com taxonomy (613 in gold) to a 6,000-term warehouse taxonomy (40 in gold). Data: Live.com (now Bing.com) search query log, with 1.8 million clicks to Amazon.com product pages. Each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22). Clicklog-size experiment: the Amazon-to-warehouse mapping task, consecutively halving the clicklog size used.

27 Summary of Results. 90% precision / recall possible. Query distribution is a good similarity metric. Bigger clicklogs imply better recall. The technique isn't very sensitive to the similarity metric.

28 Precision / Recall. Commercial warehouse mapping: 258 products from a 70K-term Amazon.com taxonomy to a 6,000-term warehouse taxonomy (613 categories used).

30 Match Quality
Columns, in order: Amazon Products | Amazon Categories | Warehouse Products | Warehouse Categories
Amazon Products: 257/258 correct | 241/258 correct | 189/258 correct (73%) | 226/258 correct
Amazon Categories: (blank) | 373/613 correct | 204/400 correct | 525/613 correct (85%)
Warehouse Products: (blank) | (blank) | 392/400 correct | 383/400 correct
Warehouse Categories: (blank) | (blank) | (blank) | 40/40 correct
Observations: QDs are unique to entities. QDs are unique to aggregate classes. QDs of entities are closest to the distributions of their aggregate classes. QDs of similar aggregates are similar.

32 Varying Clicklog Size. The clicklog size was successively decreased by half; recall decreases as the clicklog size is decreased.

34 Comparing Query Distributions. Metric: $\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$. Replace Jaccard with various phrase similarity metrics: minimal difference, due to the size of most queries.
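To illustrate what swapping the phrase similarity metric looks like, the sketch below compares word-level Jaccard with the Dice coefficient on a few short query pairs. Dice is chosen here purely as an example; the transcript does not say which alternative metrics were tried. For one- and two-word queries, set-based metrics can take only a few distinct values, which is one plausible reading of why the choice made little difference.

```python
def tokens(query):
    """Lowercased word set of a query."""
    return set(query.lower().split())

def jaccard(a, b):
    """|A & B| / |A | B| over word sets."""
    x, y = tokens(a), tokens(b)
    return len(x & y) / len(x | y) if x | y else 0.0

def dice(a, b):
    """2 * |A & B| / (|A| + |B|) over word sets (illustrative alternative)."""
    x, y = tokens(a), tokens(b)
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 0.0

pairs = [
    ("laptop", "laptop"),
    ("laptop", "cheap laptop"),
    ("laptop", "netbook"),
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: jaccard={jaccard(a, b):.2f}, dice={dice(a, b):.2f}")
```

Both metrics agree on identical and disjoint queries and order the partial match between them, so a score built from either would rank these pairs the same way.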

36 Related + Future Work. Usage based / crowdsourcing: "Usage-Based Schema Matching" (ICDE 2008), Elmeleegy, H.; Ouzzani, M.; Elmagarmid, A. "Matching schemas in online communities: A web 2.0 approach" (ICDE 2008), R McCann, W Shen, AH Doan. Web-scale integration: "Web-scale Data Integration: You can only afford to Pay As You Go" (CIDR 2007), Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy.

37 Related + Future Work. Mixed methods: "Ontology matching: A machine learning approach" (Handbook on Ontologies 2004), A Doan, J Madhavan, P Domingos, A Halevy. "Learning to match the schemas of data sources: A multistrategy approach" (Machine Learning Journal 2003), A Doan, P Domingos, A Halevy. "Schema and ontology matching with COMA++" (SIGMOD 2005), D Aumueller, HH Do, S Massmann, E Rahm.

38 Conclusion. Unsupervised mapping is possible, with very high recall / precision when enough queries are present. Click logs are promising: they find results that other methods cannot find. As clicklog size increases, the method will produce more mappings. It is combinable with existing methods.

39 Questions? http://arnab.org/contact http://research.microsoft.com/~philbe/

40 Existing Methods. "A Survey of Approaches to Automatic Schema Matching" (VLDBJ 2001), Erhard Rahm, Philip A. Bernstein.

41 Name-based & Instance-based. Not ideal for our use case: we need high precision. Task B: commercial warehouse mapping, 258 products in a 70K-term taxonomy to a 6,000-term taxonomy.

