Topical Crawling for Business Intelligence, by Gautam Pant and Filippo Menczer

1 Topical Crawling for Business Intelligence Gautam Pant * and Filippo Menczer ** * Department of Management Sciences The University of Iowa, Iowa City, IA 52246 ** School of Informatics Indiana University, Bloomington, IN 47408

2 Overview  Topical Crawling  The Business Intelligence Problem  Test Bed  Crawling Algorithms  Results  Finding Better Seeds

3 Crawling as Graph Search  [Figure labels: History, Frontier, Seeds]  Node expansion – Downloading and parsing a page  Open list – Frontier  Closed list – History  Expansion order – Crawl path
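The graph-search view on this slide can be sketched as a short loop. This is a minimal illustration, not the authors' crawler: `fetch_links` is a hypothetical stand-in for downloading and parsing a page, and the toy link graph is invented for the example.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages):
    """Generic crawl loop over a link graph.

    fetch_links(url) is a hypothetical stand-in for downloading and
    parsing a page; it returns the out-links found on that page.
    """
    frontier = deque(seeds)   # open list: URLs waiting to be expanded
    history = []              # closed list: URLs already expanded, in order
    seen = set(seeds)         # every URL ever placed on the frontier
    while frontier and len(history) < max_pages:
        url = frontier.popleft()        # FIFO order gives breadth-first
        history.append(url)
        for link in fetch_links(url):   # node expansion
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return history                      # the expansion order is the crawl path


# Toy in-memory "Web", invented for illustration only.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}

path = crawl(["a"], lambda u: TOY_WEB.get(u, []), max_pages=10)
print(path)  # ['a', 'b', 'c', 'd', 'e']
```

Swapping the FIFO queue for a different frontier ordering changes the crawl path without touching the rest of the loop.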

4 Exhaustive vs. Preferential Crawling  Exhaustive – blind expansion order (e.g. Breadth First)  Preferential – heuristic-based expansion order (e.g. Best First)  Topical Crawling: the guiding heuristic is based on a topic or a set of topics
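A preferential crawler can be sketched by keeping the frontier in a priority queue ordered by a heuristic score. The `score` function, the toy graph, and the score values below are all assumptions made for illustration; in a topical crawler the score would come from a topic-based heuristic.

```python
import heapq
from itertools import count


def best_first_crawl(seeds, fetch_links, score, max_pages):
    """Expand the highest-scoring frontier URL first.

    fetch_links and score are hypothetical callables supplied by the caller.
    """
    tie = count()  # insertion counter, breaks ties between equal scores
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(u), next(tie), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    history = []
    while frontier and len(history) < max_pages:
        _, _, url = heapq.heappop(frontier)
        history.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), next(tie), link))
    return history


# Toy graph and heuristic scores, invented for illustration only.
WEB = {"seed": ["x", "y"], "x": ["x1"], "y": ["y1"], "x1": [], "y1": []}
SCORES = {"seed": 0.0, "x": 0.1, "y": 0.9, "x1": 0.2, "y1": 0.8}

order = best_first_crawl(["seed"], lambda u: WEB[u], SCORES.get, max_pages=10)
print(order)  # ['seed', 'y', 'y1', 'x', 'x1']
```

A breadth-first crawl of the same graph would visit `x` before `y1`; the heuristic instead pulls the crawler down the higher-scoring branch first.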

5 Business Intelligence Problem  Web-based information about related business entities  Related through the area of competence, research thrust, etc.  Topical crawlers can help in creating a small but focused collection of Web pages that is rich in information about related business entities

6 Business Intelligence Problem  A list of business entities is available  We create a focused document collection that can be further explored with ranking, indexing and text-mining tools  We investigate the crawling techniques for the task

7 Finding paths in a competitive community  [Figure: nodes labeled .com, .edu, .org, .gov]

8 Test Bed  DMOZ Categories – “Companies”, “Consultants”, “Manufacturers”  159 topics  Each topic has seeds, targets, keywords and a description  Each crawler crawls up to 10,000 pages for each topic

9 Sample Topic

10 Performance Metrics  Precision@N  Target Recall@N  [Figure: sets labeled Relevant, Targets, Crawled]  |Crawled ∩ Relevant| / |Relevant|  |Crawled ∩ Targets| / |Targets|
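The ratio |Crawled ∩ Targets| / |Targets| from this slide can be computed directly from a crawl path. The sketch below assumes that "@N" means "over the first N pages crawled"; the function names and the sample URLs are hypothetical.

```python
def target_recall(crawled, targets):
    """|Crawled ∩ Targets| / |Targets| for one set of crawled pages."""
    targets = set(targets)
    if not targets:
        raise ValueError("targets must not be empty")
    return len(set(crawled) & targets) / len(targets)


def target_recall_at(crawl_path, targets, n):
    """Target recall over the first n pages of the crawl path (assumed reading of @N)."""
    return target_recall(crawl_path[:n], targets)


# Hypothetical crawl path and target set, for illustration only.
path = ["p1", "t1", "p2", "t2", "p3"]
targets = {"t1", "t2", "t3", "t4"}

print(target_recall_at(path, targets, 2))  # 0.25
print(target_recall_at(path, targets, 5))  # 0.5
```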

11 Crawling Infrastructure

12 Crawling Algorithms  Breadth First  Naïve Best First
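The slide only names the algorithms. One common way to score pages in a naive best-first crawler is the cosine similarity between a page's text and the topic keywords, with the page's out-links inheriting that score; treating this as the scoring rule is an assumption here, not something stated on the slide. The tokenizer and sample texts are likewise illustrative.

```python
import math
import re
from collections import Counter


def term_freqs(text):
    """Lowercased term-frequency vector; a deliberately simple tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    a, b = term_freqs(text_a), term_freqs(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical topic keywords and page texts.
keywords = "industrial pump manufacturers"
on_topic = "We are manufacturers of industrial pumps and valves"
off_topic = "Recipes for summer salads"

print(cosine_similarity(keywords, on_topic) > cosine_similarity(keywords, off_topic))  # True
```

Links extracted from the on-topic page would be placed ahead of links from the off-topic page on the frontier.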

13 Crawling Algorithms – DOM Crawler

14 Hub-Seeking Crawler n – number of seed hosts

15 Performance

16 Improving the Seed Set  Top 10 hubs based on back-links from Google  Avoiding mirrors of DMOZ  Augmented seed set
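The seed-augmentation steps on this slide can be sketched as a small ranking function. The slide does not say how hubs are ranked, so ranking candidates by how many seeds they link to is an assumption; `backlinks`, `is_mirror`, the host names, and the URLs are all hypothetical, and the back-link data would have to come from an external source such as a search engine.

```python
from urllib.parse import urlsplit


def augment_seeds(seeds, backlinks, is_mirror, top_k=10):
    """Return the seeds plus up to top_k hub pages.

    backlinks maps a candidate hub URL to the set of seed URLs it links to.
    is_mirror(url) is a hypothetical predicate that flags directory mirrors.
    """
    seed_set = set(seeds)
    candidates = [
        (len(set(linked) & seed_set), url)
        for url, linked in backlinks.items()
        if url not in seed_set and not is_mirror(url)
    ]
    # Assumed ranking: most seeds linked first, URL as a deterministic tie-break.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    hubs = [url for n, url in candidates[:top_k] if n > 0]
    return list(seeds) + hubs


# Hypothetical data for illustration only.
MIRROR_HOSTS = {"dmoz.org", "mirror.example"}
seeds = ["http://s1.example/", "http://s2.example/"]
backlinks = {
    "http://hub-b.example/": {"http://s1.example/"},
    "http://hub-a.example/": {"http://s1.example/", "http://s2.example/"},
    "http://dmoz.org/cat": {"http://s1.example/", "http://s2.example/"},
    "http://other.example/": set(),
}

augmented = augment_seeds(
    seeds, backlinks, lambda u: urlsplit(u).hostname in MIRROR_HOSTS
)
print(augmented)
```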

17 Performance

18 Related work  Chakrabarti et al. [1998] Use of Hubs  Menczer et al. [2001] Framework for evaluating topical crawlers  Chakrabarti et al. [2002] Use of DOM

19 Conclusion  Investigated the problem of creating a small collection through topical crawling for locating related business entities  The Hub-Seeking crawler, which seeks hubs at crawl time and exploits the tag tree structure of Web pages, outperforms Naïve Best First  Positive effects of identifying hubs before and during the crawl process  Future Work – Find the optimal aggregation node; compare the benefits of identifying hubs in competitive vs. collaborative communities

20 Thank You gautam-pant@uiowa.edu Acknowledgements: Robin McEntire (GlaxoSmithKline R&D) Valdis A. Dzelzkalns (GlaxoSmithKline R&D) Paul Stead (GlaxoSmithKline R&D) NSF grant to FM

