The Focus Project Soumen Chakrabarti (IIT Bombay) David Gibson (Berkeley) Piotr Indyk (Stanford) Kevin McCurley (IBM Almaden) Martin van Den Berg (Xerox)


1 The Focus Project Soumen Chakrabarti (IIT Bombay) David Gibson (Berkeley) Piotr Indyk (Stanford) Kevin McCurley (IBM Almaden) Martin van Den Berg (Xerox) Byron Dom (IBM Almaden)

2 Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery Soumen Chakrabarti (IIT Bombay) Martin van Den Berg (Xerox) Byron Dom (IBM Almaden)

3 Soumen Chakrabarti IIT Bombay Quote 1 Portals and search pages are changing rapidly, in part because their biggest strength — massive size and reach — can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical — and useful — than trying to cover the entire universe. Dan Gillmor, San Jose Mercury News

4 Scenario
- Disk drive research group wants to track magnetic surface technologies
- Compiler research group wants to trawl the web for graduate student resumés
- ____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links
- Virtual libraries like Yahoo!, the Open Directory Project and the Mining Co.

5 Structured web queries
- How many links were found from an environment protection agency site to a site about oil and natural gas in the last year?
- Apart from cycling, what is the most common topic cited by pages on cycling? Answer: “first-aid”
- Find Web research pages which are widely cited by Hawaiian vacation pages

6 Quote 2 As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals. Jim Hake, Founder, Global Information Infrastructure

7 Goals
- Spontaneous, decentralized formation of topical communities
- Automatic construction of a “focused portal” containing resources that are
  - Relevant to the user’s focus of interest
  - Of high influence and quality
  - Collectively comprehensive
- Discovery that combines structure and content

8 Model
- Taxonomy with some ‘chosen’ topics
- Each page has a relevance score w.r.t. chosen topics
- Mendelzon and Milo’s web access cost model
- Goal is to ‘expand’ start set to maximize average relevance
[Figure: taxonomy tree with nodes All, Science, Sports, Cycling, Hiking, Physics, Zoology]

9 Properties to be exploited
- A page with high relevance tends to link to at least some other relevant pages (radius-one rule)
- Given that a page u links to relevant page(s), chances are increased that u points to other relevant pages (radius-two rule)

10 Syntactic “query-by-example”
- If part of the answer is known, trivial search techniques may do quite well
- E.g., “European airlines”
  - +swissair +iberia +klm
- E.g., “Car makers”
  - Which pages link to and ?

12 The backlink architecture
[Figure: servers S1 and S2, clients C and C’; request "GET /P2 HTTP/1.0" carrying a "Referer:" header; Local Backlink Database; query "Who points to S2/P2?"]

13 Backlink rationale
- Centralized backlink service does not scale
- Limited additional storage per server
- Turn hyperlinks into undirected edges
- A series of forward and backward ‘clicks’ can quickly build a topical community
- Can be used to bootstrap the focused crawler
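
The per-server backlink idea can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the project's implementation: the class name `BacklinkServer`, its methods, and the `example` host names are all hypothetical. It only shows a server logging the HTTP `Referer` header of incoming requests and answering "who points to this page?" from its own local store.

```python
from collections import defaultdict


class BacklinkServer:
    """Toy server that remembers which pages linked to its own pages.

    Hypothetical sketch: names and storage layout are assumptions.
    """

    def __init__(self, host):
        self.host = host
        # path on this server -> set of referring URLs seen in requests
        self.backlinks = defaultdict(set)

    def handle_get(self, path, headers):
        """Serve a page and log the Referer header if the client sent one."""
        referer = headers.get("Referer")
        if referer:
            self.backlinks[path].add(referer)
        return f"contents of {self.host}{path}"

    def who_points_to(self, path):
        """Answer a backlink query using only the local database."""
        return sorted(self.backlinks.get(path, ()))


s2 = BacklinkServer("s2.example")
# A client follows a link from a page on S1 to /P2 on S2.
s2.handle_get("/P2", {"Referer": "http://s1.example/P1"})
# A typed-in URL carries no Referer, so nothing is logged.
s2.handle_get("/P2", {})
# A second client later asks S2 who points to /P2.
links = s2.who_points_to("/P2")
print(links)
```

Because each server stores only links into its own pages, the extra storage stays local and no central backlink index is needed, which matches the rationale on this slide.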

14 Backlink example 1

15 Backlink example 2

16 Backlink example 3

17 Backlink example 4

18 Estimating popularity
- Extensive research on social network theory
  - Wasserman and Faust
- Hyperlink based
  - Large in-degree indicates popularity/authority
  - Not all votes are worth the same
- Several similar ideas and refinements
  - Google (Page and Brin) and HITS (Kleinberg)
  - Resource compilation (Chakrabarti et al.)
  - Topic distillation (Bharat and Henzinger)

19 Topic distillation overview
- Given web graph and query
- Search engine selects sub-graph
- Expansion, pruning and edge weights
- Nodes iteratively transfer authority to cited neighbors
[Figure: Query, Search Engine, The Web, Selected subgraph]
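
The iterative transfer of authority can be illustrated with a HITS-style hub/authority update. The sketch below is a simplification under assumptions: it uses unit edge weights (the slide mentions edge weights but does not specify them), skips the expansion and pruning steps, and the function names and the toy graph are hypothetical.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {k: v / norm for k, v in scores.items()}


def distill(edges, iterations=50):
    """Run HITS-style updates on a list of (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages citing it.
        new_auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_auth[dst] += hub[src]
        auth = normalize(new_auth)
        # A page's hub score is the sum of the authority of pages it cites.
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = normalize(new_hub)
    return hub, auth


# Toy subgraph: three hub pages cite two candidate authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = distill(edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
print(best_authority, best_hub)
```

In this toy graph the page cited by all three hubs ends up with the highest authority, and the hub citing both authorities ends up as the best hub.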

20 Preliminary distillation-based approach
- Design a keyword query to represent topics of focus
- Using a large web crawl, run topic distillation on the query
- Refine query by inspecting result and trial and error

21 Problems with preliminary approach
- Unreliability of keyword match
  - Engines differ significantly on a given query due to small overlap [Bharat and Broder]
  - Narrow, arbitrary view of relevant subgraph
  - Topic model does not improve over time
- Dependence on large web crawl and index (lack of “output sensitivity”)
- Difficulty of query construction

22 Output sensitivity
- Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages
- Ideally effort should scale with size of the result
- Time spent crawling and indexing sites unrelated to the topic is wasted
- Likewise, time that does not improve comprehensiveness is wasted

23 Query construction
+“power suppl*” “switch* mode” smps -multiprocessor* “uninterrupt* power suppl*” ups -parcel*
/Companies/Electronics/Power_Supply

24 Query complexity
- Complex queries needed for distillation
- Typical Alta Vista queries are much simpler (Silverstein, Henzinger, Marais and Moricz)
- Forcing a hub or authority helps 86% of the time

25 Proposed solution
- Resource discovery system that can be customized to crawl for any topic by giving examples
- Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their centrality
- Crawler has guidance hooks controlled by these two scores

26 Administration scenario
[Figure: Taxonomy Editor; Current Examples; Suggested Additional Examples; Drag]

27 Relevance
[Figure: taxonomy tree with nodes All, Bus&Econ, Recreation, Arts..., Companies, Cycling, Bike Shops, Mt. Biking, Clubs; legend: Path nodes, Good nodes, Subsumed nodes]
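
One way to read the three node types in this figure is sketched below. The definitions are assumptions, since the slide only gives the legend: "good" nodes are the topics chosen by the user, "path" nodes are ancestors of a good node, and "subsumed" nodes are descendants of a good node. The tree shape, the choice of Cycling as the good topic, and the function name `classify_nodes` are hypothetical.

```python
def classify_nodes(parent, good):
    """Label each taxonomy node relative to the chosen 'good' topics.

    parent maps child -> parent; good is the set of chosen topics.
    Labels follow the assumed definitions given in the lead-in.
    """

    def ancestors(node):
        while node in parent:
            node = parent[node]
            yield node

    nodes = set(parent) | set(parent.values())
    path = {a for g in good for a in ancestors(g)}
    labels = {}
    for n in nodes:
        if n in good:
            labels[n] = "good"
        elif n in path:
            labels[n] = "path"
        elif any(a in good for a in ancestors(n)):
            labels[n] = "subsumed"
        else:
            labels[n] = "other"
    return labels


# Hypothetical tree using the node names visible in the figure.
parent = {
    "Bus&Econ": "All",
    "Recreation": "All",
    "Arts": "All",
    "Companies": "Bus&Econ",
    "Bike Shops": "Companies",
    "Cycling": "Recreation",
    "Mt. Biking": "Cycling",
    "Clubs": "Cycling",
}
labels = classify_nodes(parent, good={"Cycling"})
print(labels["Recreation"], labels["Cycling"], labels["Clubs"])
```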

28 Classification
- How relevant is a document w.r.t. a class?
  - Supervised learning, filtering, classification, categorization
- Many types of classifiers
  - Bayesian, nearest neighbor, rule-based
- Hypertext
  - Both text and links are class-dependent clues
  - How to model link-based features?

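29 The “bag-of-words” document model
- Decide topic; topic c is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$
- Each c has parameters $\theta(c,t)$ for terms t
- Coin with face probabilities $\sum_t \theta(c,t) = 1$
- Fix document length and keep tossing coin
- Given c, probability of document is
  $$\Pr[d \mid c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$$
  where $n(d,t)$ is the number of times term t occurs in document d and $n(d) = \sum_t n(d,t)$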
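
The coin-tossing model on this slide is a multinomial over terms, so the document probability can be computed directly. The sketch below assumes that standard multinomial form (the slide's own formula did not survive extraction); the two-class parameters, the term names, and the function names are made up for illustration, and no smoothing is applied, so every term in the document must have a nonzero parameter.

```python
import math
from collections import Counter


def log_prob_doc_given_class(tokens, theta_c):
    """Log of the multinomial probability of a bag of words under one class."""
    counts = Counter(tokens)
    n = sum(counts.values())
    # log of the multinomial coefficient n! / prod(n_t!)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts.values())
    return log_coeff + sum(k * math.log(theta_c[t]) for t, k in counts.items())


def posterior(tokens, prior, theta):
    """Class posterior Pr[c|d] by Bayes' rule, computed in log space."""
    log_joint = {
        c: math.log(prior[c]) + log_prob_doc_given_class(tokens, theta[c])
        for c in prior
    }
    top = max(log_joint.values())
    unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


# Hypothetical parameters: priors sum to 1, and so do each class's term probabilities.
prior = {"cycling": 0.5, "physics": 0.5}
theta = {
    "cycling": {"bike": 0.6, "race": 0.3, "quark": 0.1},
    "physics": {"bike": 0.1, "race": 0.1, "quark": 0.8},
}
doc = ["bike", "race", "bike"]
p_doc = math.exp(log_prob_doc_given_class(doc, theta["cycling"]))
post = posterior(doc, prior, theta)
print(round(p_doc, 3), max(post, key=post.get))
```

Working in log space avoids underflow for long documents, where the product of many small term probabilities would otherwise round to zero.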

30 Exploiting link features
- c=class, t=text, N=neighbors
- Text-only model: Pr[t|c]
- Using neighbors’ text to judge my topic: Pr[t, t(N) | c]
- Better model: Pr[t, c(N) | c]
- Non-linear relaxation

31 Improvement using link features
- 9600 patents from 12 classes marked by USPTO
- Patents have text and cite other patents
- Expand test patent to include neighborhood
- ‘Forget’ fraction of neighbors’ classes

32 Putting it together
[Figure: system architecture with components Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Scheduler, Workers, Topic Distiller, Feedback]
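
A rough sketch of how a scheduler might use classifier scores to steer the crawl follows. It is an illustration under assumptions rather than the system in the figure: the function `focused_crawl`, the relevance threshold, and the toy link graph are hypothetical, the classifier is replaced by a lookup table, and the topic distiller's centrality feedback is left out.

```python
import heapq


def focused_crawl(seeds, fetch_links, relevance, budget, threshold=0.5):
    """Best-first crawl: expand only pages the classifier judges relevant.

    fetch_links(url) returns out-links; relevance(url) returns a score in [0, 1].
    Unvisited links inherit the relevance of the page that cited them as priority.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        score = relevance(url)
        visited.append((url, score))
        if score < threshold:
            continue  # irrelevant page: do not follow its links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


# Toy web standing in for real fetching and classification.
graph = {
    "seed": ["bike1", "news"],
    "bike1": ["bike2"],
    "news": ["sports", "politics"],
    "bike2": [],
    "sports": [],
    "politics": [],
}
scores = {"seed": 0.9, "bike1": 0.8, "news": 0.1,
          "bike2": 0.95, "sports": 0.2, "politics": 0.0}
visited = focused_crawl(["seed"], graph.__getitem__, scores.__getitem__, budget=10)
urls = [url for url, _ in visited]
print(urls)
```

In the toy run the off-topic "news" page is fetched once but never expanded, so the pages behind it cost nothing, which is the output-sensitivity argument made earlier in the deck.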

33 Monitoring the crawler
[Figure: relevance plotted against time; series: One URL, Moving Average]

34 Measures of success
- Harvest rate
  - What fraction of crawled pages are relevant
- Robustness across seed sets
  - Separate crawls with random disjoint samples
  - Measure overlap in URLs and servers crawled
  - Measure agreement in best-rated resources
- Evidence of non-trivial work
  - #Links from start set to the best resources
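
The harvest rate and the moving-average curve used for monitoring can be computed from the sequence of per-page relevance scores. The sketch below makes two assumptions that the slides do not state: a page counts as relevant when its score reaches a threshold of 0.5, and the moving average uses a fixed-size trailing window. The function names are hypothetical.

```python
def harvest_rate(scores, threshold=0.5):
    """Fraction of crawled pages whose relevance reaches the threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


def moving_average(scores, window):
    """Trailing average of relevance over the last `window` pages fetched."""
    averages = []
    running = 0.0
    for i, s in enumerate(scores):
        running += s
        if i >= window:
            running -= scores[i - window]  # drop the page that left the window
        averages.append(running / min(i + 1, window))
    return averages


# Relevance of four pages in the order they were crawled (made-up values).
crawl_scores = [1.0, 0.0, 1.0, 1.0]
rate = harvest_rate(crawl_scores)
smoothed = moving_average(crawl_scores, window=2)
print(rate, smoothed)
```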

35 Harvest rate
[Figure: two panels, Unfocused and Focused]

36 Crawl robustness
[Figure: panels URL Overlap and Server Overlap; labels Crawl 1 and Crawl 2]

37 Top resources after one hour
- Recreational and competitive cycling
- HIV/AIDS research and treatment
- Purer and better than root set

40 Robustness of resource discovery
- Sample disjoint sets of starting URLs
- Two separate crawls
- Find best authorities
- Order by rank
- Find overlap in the top-rated resources
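
The overlap check between two crawls can be sketched as follows. This is one plausible way to measure it, not necessarily the measure used in the experiments: overlap is taken as the shared fraction of the top k entries, server overlap is computed on host names, and the URLs and function names are invented for the example.

```python
from urllib.parse import urlsplit


def top_k_overlap(ranked_a, ranked_b, k):
    """Fraction of the top-k items of two rankings that appear in both."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k


def servers(urls):
    """Host part of each URL, in the same order."""
    return [urlsplit(u).netloc for u in urls]


# Best-rated resources from two crawls started from disjoint seed sets (made up).
crawl1 = ["http://a.example/x", "http://b.example/y", "http://c.example/z"]
crawl2 = ["http://a.example/x", "http://b.example/other", "http://d.example/q"]

url_overlap = top_k_overlap(crawl1, crawl2, k=3)
server_overlap = top_k_overlap(servers(crawl1), servers(crawl2), k=3)
print(url_overlap, server_overlap)
```

Server overlap is never lower than URL overlap here, because two crawls can agree on a site while picking different pages within it.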

41 Distance to best resources
[Figure: two panels, "Cycling: cooperative" and "Mutual funds: competitive"]

42 Observations
- Random walk on the Web “rapidly mixes” topics
- Yet, there are large coherent paths and clusters
- Focused crawling gives topic distillation richer data to work on
- Combining content with link structure eliminates the need to tune link-based heuristics

43 Related work
- WebWatcher, HotList and ColdList
  - Filtering as post-processing, not acquisition
- ReferralWeb
  - Social network on the Web
- Ahoy!, Cora
  - Hand-crafted to find home pages and papers
- WebCrawler, Fish, Shark, Fetuccino, agents
  - Crawler guided by query keyword matches

44 Comparison with agents
- Agents usually look for keywords and hand-crafted patterns
  - Cannot learn new vocabulary dynamically
  - Do not use distance-2 centrality information
  - Client-side assistant
- We use taxonomy with statistical topic models
  - Models can evolve as crawl proceeds
  - Combine relevance and centrality
  - Broader scope: inter-community linkage analysis and querying

45 Conclusion
- New architecture for example-driven topic-specific web resource discovery
- No dependence on full web crawl and index
- Modest desktop hardware adequate
- Variable radius goal-directed crawling
- High harvest rate
- High quality resources found far from keyword query response nodes

