Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. Soumen Chakrabarti, Martin van den Berg, Byron Dom
WWW 1999 2 Portals and portholes. Popular search portals and directories: useful for generic needs; difficult to do serious research. Information needs of net-savvy users are getting very sophisticated. Relatively little business incentive. Need handmade specialty sites: portholes. Resource discovery must be personalized.
WWW 1999 3 Quote. “The emergence of portholes will be one of the major Internet trends of 1999. As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals.” Jim Hake (Founder, Global Information Infrastructure Awards)
WWW 1999 4 Scenario. Disk drive research group wants to track magnetic surface technologies. Compiler research group wants to trawl the web for graduate student resumés. ____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links. Virtual libraries like the Open Directory Project and the Mining Co.
WWW 1999 5 Structured web queries. How many links were found from an environment protection agency site to a site about oil and natural gas in the last year? Apart from cycling, what is the most common topic cited by pages on cycling? Find Web research pages which are widely cited by Hawaiian vacation pages.
WWW 1999 6 Goal. Automatically construct a focused portal (porthole) containing resources that are: relevant to the user’s focus of interest; of high influence and quality; collectively comprehensive. Answer structured web queries by selectively exploring the topics involved in the query.
WWW 1999 7 Tools at hand. Keyword search engines: synonymy, polysemy; abundance, lack of quality. Hand-compiled topic directories: labor intensive, subjective judgements. Resources automatically located using keyword search and link graph distillation: dependence on large crawls and indices.
WWW 1999 8 Estimating popularity. Extensive research on social network theory (Wasserman and Faust). Hyperlink based: large in-degree indicates popularity/authority; not all votes are worth the same. Several similar ideas and refinements: Google (Page and Brin) and HITS (Kleinberg); resource compilation (Chakrabarti et al.); topic distillation (Bharat and Henzinger).
WWW 1999 9 Topic distillation overview. Given web graph and query. Search engine selects sub-graph. Expansion, pruning and edge weights. Nodes iteratively transfer authority to cited neighbors. [Figure labels: Search Engine, Query, The Web, Selected subgraph]
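The iterative authority transfer named on this slide can be sketched roughly as follows. This is a minimal, unweighted illustration in the spirit of HITS (Kleinberg); it leaves out the expansion, pruning and edge-weight steps, and the names `hits` and `normalize` are illustrative assumptions rather than the authors' implementation.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Compute hub and authority scores on a directed graph.

    edges: iterable of (source, target) pairs; source cites target.
    Returns (hub, authority) dictionaries keyed by node.
    """
    edges = list(edges)
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages citing it.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # A page's hub score is the sum of the authority of pages it cites.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth
```

On a toy graph such as `[("h1", "a"), ("h2", "a"), ("h1", "b")]`, page `a` ends up with more authority than `b` because two hubs cite it.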
WWW 1999 10 Preliminary distillation-based approach. Design a keyword query to represent a topic. Run topic distillation periodically. Refine query through trial-and-error. Works well if answer is partially known, e.g., European airlines: +swissair +iberia +klm
WWW 1999 12 Problems with preliminary approach. Dependence on large web crawl and index. System = crawler + index + distiller. Unreliability of keyword match. Engines differ significantly on a given query due to small overlap [Bharat and Broder]. Narrow, arbitrary view of relevant subgraph. Topic model does not improve over time. Difficulty of query construction. Lack of output sensitivity.
WWW 1999 13 Query construction. Query: +“power suppl*” “switch* mode” smps -multiprocessor* “uninterrupt* power suppl*” ups -parcel* Topic: /Companies/Electronics/Power_Supply
WWW 1999 14 Query complexity. Complex queries (966 trials): average words 7.03; average operators ( +*–" ) 4.34. Typical Alta Vista queries are much simpler [Silverstein, Henzinger, Marais and Moricz]: average query words 2.35; average operators ( +*–" ) 0.41. Forcibly adding a hub or authority node helped in 86% of the queries.
WWW 1999 15 Query complexity. Complex queries needed for distillation. Typical Alta Vista queries are much simpler (Silverstein, Henzinger, Marais and Moricz). Forcing a hub or authority helps 86% of the time.
WWW 1999 16 Output sensitivity. Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages. Ideally effort should scale with size of the result. Time spent crawling and indexing sites unrelated to the topic is wasted. Likewise, time that does not improve comprehensiveness is wasted.
WWW 1999 17 Proposed solution. Resource discovery system that can be customized to crawl for any topic by giving examples. Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their centrality. Crawler has guidance hooks controlled by these two scores (relevance and centrality).
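One way to picture the guidance hooks is a best-first crawl whose frontier is ordered by a relevance score. The sketch below is an illustration under stated assumptions, not the system described in the slides: `fetch_links` and `relevance` are hypothetical callables supplied by the caller, only the relevance score is used (centrality is left out), and the rule of not expanding pages below `threshold` is an assumed policy.

```python
import heapq


def focused_crawl(seeds, fetch_links, relevance, max_pages=100, threshold=0.5):
    """Best-first crawl that expands the most promising URLs first.

    fetch_links(url) -> list of outgoing URLs (hypothetical fetcher).
    relevance(url)   -> score in [0, 1] from a topic classifier (hypothetical).
    Returns the visited (url, score) pairs in crawl order.
    """
    # heapq is a min-heap, so priorities are negated to pop the best first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        score = relevance(url)
        visited.append((url, score))
        if score < threshold:
            continue  # assumed policy: do not follow links out of irrelevant pages
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # An unvisited link inherits the score of the page citing it.
                heapq.heappush(frontier, (-score, link))
    return visited
```

With an in-memory link table standing in for the web, pages reachable only through irrelevant pages are never fetched, which is the output-sensitive behavior the slides ask for.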
WWW 1999 19 Relevance. [Figure: topic taxonomy tree rooted at All, with nodes Bus&Econ, Recreation, Companies, Cycling, Bike Shops, Mt. Biking, Clubs, Arts, ...; legend: path nodes, good nodes, subsumed nodes]
WWW 1999 20 Classification. How relevant is a document w.r.t. a class? Supervised learning, filtering, classification, categorization. Many types of classifiers: Bayesian, nearest neighbor, rule-based. Hypertext: both text and links are class-dependent clues. How to model link-based features?
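WWW 1999 21 The “bag-of-words” document model. Decide topic; topic $c$ is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$. Each $c$ has parameters $\theta(c,t)$ for terms $t$. Coin with face probabilities $\theta(c,t)$, where $\sum_t \theta(c,t) = 1$. Fix document length $n(d)$ and keep tossing coin. Given $c$, probability of document is $\Pr[d \mid c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$, where $n(d,t)$ is the number of times term $t$ occurs in document $d$.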
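A small sketch of how such a model could be trained and applied: each class gets a prior and a per-term probability, and a document is scored by summing log probabilities of its terms. The function names `train` and `log_posterior` are illustrative, and the additive (Laplace) smoothing is an assumption made so that unseen terms do not get zero probability; the slides do not say how parameters were estimated.

```python
from collections import Counter
from math import log


def train(docs_by_class, smoothing=1.0):
    """Estimate class priors and per-class term probabilities.

    docs_by_class: {class_name: [list_of_tokens, ...]}.
    Returns (prior, theta) with theta[c][t] summing to 1 over the vocabulary.
    """
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    prior, theta = {}, {}
    for c, docs in docs_by_class.items():
        prior[c] = len(docs) / total_docs
        counts = Counter(t for d in docs for t in d)
        denom = sum(counts.values()) + smoothing * len(vocab)
        theta[c] = {t: (counts[t] + smoothing) / denom for t in vocab}
    return prior, theta


def log_posterior(doc, prior, theta):
    """Unnormalized log Pr[c | d] for every class c.

    The multinomial coefficient is identical for all classes, so it is dropped.
    """
    term_counts = Counter(doc)
    scores = {}
    for c in prior:
        score = log(prior[c])
        for t, n in term_counts.items():
            if t in theta[c]:  # terms outside the training vocabulary are ignored
                score += n * log(theta[c][t])
        scores[c] = score
    return scores
```

The class with the largest score is the predicted topic, for example `max(scores, key=scores.get)`.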
WWW 1999 22 Exploiting link features. c = class, t = text, N = neighbors. Text-only model: Pr[t | c]. Using neighbors’ text to judge my topic: Pr[t, t(N) | c]. Better model: Pr[t, c(N) | c]. Non-linear relaxation?
WWW 1999 23 Improvement using link features. 9600 patents from 12 classes marked by USPTO. Patents have text and cite other patents. Expand test patent to include neighborhood. ‘Forget’ fraction of neighbors’ classes.
WWW 1999 24 Putting it together. [Figure: system architecture with components Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Scheduler, Workers, Topic Distiller, and a Feedback path]
WWW 1999 25 Monitoring the crawler. [Plot: Relevance against Time; series: One URL, Moving Average]
WWW 1999 26 Measures of success. Harvest rate: what fraction of crawled pages are relevant. Robustness across seed sets: separate crawls with random disjoint samples; measure overlap in URLs and servers crawled; measure agreement in best-rated resources. Evidence of non-trivial work: number of links from start set to the best resources.
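As a rough illustration of the harvest-rate measure and of a moving average over per-page relevance, the sketch below works on a list of relevance scores in crawl order. The cutoff of 0.5 for calling a page relevant and the function names are assumptions for illustration only.

```python
def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance reaches the threshold."""
    if not relevance_scores:
        return 0.0
    relevant = sum(1 for s in relevance_scores if s >= threshold)
    return relevant / len(relevance_scores)


def moving_average(values, window):
    """Trailing moving average; early entries average over what is available."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        averages.append(running / min(i + 1, window))
    return averages
```

A falling moving average is a sign that the crawl is drifting off topic, while a steady one suggests the harvest rate is being sustained.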
WWW 1999 29 Top resources after one hour. Recreational and competitive cycling: http://www.truesport.com/Bike/links.htm, http://reality.sgi.com/billh_hampton/jrvs/links.html, http://www.acs.ucalgary.ca/~bentley/mark_links.html. HIV/AIDS research and treatment: http://www.stopaids.org/Otherorgs.html, http://www.iohk.com/UserPages/mlau/aidsinfo.html, http://www.ahandyguide.com/cat1/a/a66.htm. Purer and better than root set.
WWW 1999 32 Distance to best resources. [Two plots: Cycling: cooperative; Mutual funds: competitive]
WWW 1999 33 Robustness of resource discovery. Sample disjoint sets of starting URLs. Two separate crawls. Find best authorities. Order by rank. Find overlap in the top-rated resources.
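The overlap step can be sketched as a comparison of the top-k authorities from two crawls. This is a minimal illustration; the function name `top_k_overlap` and the choice to report the shared fraction of the top k are assumptions, since the slide does not define the exact overlap measure.

```python
def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the k best-ranked URLs that two crawls have in common.

    scores_a, scores_b: {url: authority_score} from two separate crawls.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k
```

Evaluating this for increasing k shows whether crawls started from disjoint seed sets converge on the same best resources.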
WWW 1999 34 Related work. WebWatcher, HotList and ColdList: filtering as post-processing, not acquisition. ReferralWeb: social network on the Web. Ahoy!, Cora: hand-crafted to find home pages and papers. WebCrawler, Fish, Shark, Fetuccino, agents: crawler guided by query keyword matches.
WWW 1999 35 Comparison with agents. Agents usually look for keywords and hand-crafted patterns: cannot learn new vocabulary dynamically; do not use distance-2 centrality information; client-side assistant. We use taxonomy with statistical topic models: models can evolve as crawl proceeds; combine relevance and centrality; broader scope: inter-community linkage analysis and querying.
WWW 1999 36 Conclusion. New architecture for example-driven topic-specific web resource discovery. No dependence on full web crawl and index. Modest desktop hardware adequate. Variable radius goal-directed crawling. High harvest rate. High quality resources found far from keyword query response nodes.