2
Nurturing content-based collaborative communities on the Web
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering
Indian Institute of Technology Bombay
www.cse.iitb.ernet.in/~soumen
www.cse.iitb.ernet.in/~cfiir
3
EMNLP/VLC 2000
Generic search engines
Struggle to cover the expanding Web
35% coverage in 1997 (Bharat and Broder)
18% in 1999 (Lawrence and Lee Giles)
Google rebounds to 50% in 2000
Moore’s law vs. Web population
Search quality, index freshness
Cannot afford advanced processing
Alta Vista serves >40 million queries / day
Cannot even afford to seek on disk (8 ms)
Limits intelligence of search engines
4
Scale vs. quality
[Figure: approaches placed on two axes, Scale and Quality]
Keyword-based search engines: HotBot, Alta Vista
Link-assisted ranking: Google, Clever
Resource discovery
Focused crawling
Topic distillation
Lexical networks, parsing, semantic indexing
5
The case for vertical portals
“Portals and search pages are changing rapidly, in part because their biggest strength, massive size and reach, can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical, and useful, than trying to cover the entire universe.” (San Jose Mercury News, 1/1999)
6
Scaling through specialization
The Web shows content-based locality
Link-based clusters correlated with content
Content-based communities emerge in a spontaneous, decentralized fashion
Can learn and exploit locality patterns
Analyze page visits and bookmarks
Automatically construct a “focused portal” with resources that
–Have high relevance and quality
–Are up-to-date and collectively comprehensive
7
Roadmap
Hyperlink mining: a short history
Resource discovery
Content-based locality in hypertext
Taxonomy models, topic distillation
Strategies for focused crawling
Data capture and mining architecture
The Memex collaboration system
–Collaborative construction of vertical portals
Link metadata management architecture
–Surfing backwards on the Web
8
Historical background
First generation Web search engines
Delete ‘stopwords’ from queries
Can only do syntactic matching
Users stopped asking good questions!
TREC queries: tens to hundreds of words
Alta Vista: at most 2–3 words
Crisis of abundance
Relevance ranking for very short queries
Quality complements relevance: that’s where hand-made topic directories shine
9
Hyperlink Induced Topic Search
[Figure: a query is sent to a keyword search engine; the response is grown into an expanded graph whose nodes carry hub (h) and authority (a) scores]
‘Hubs’ and ‘authorities’
a = Eh
h = E^T a
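The two update equations can be iterated to a fixed point. The following is a minimal sketch, not the speaker's implementation. It assumes E[v][u] = 1 when page u links to page v, which is the reading under which a = Eh sums hub scores over in-links, and it adds L2 normalization after every step (an assumption; the slide does not say how scores are scaled). The function names and the toy edge list are hypothetical.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit L2 norm (left at zero if all scores are zero)."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    return {key: (value / norm if norm else 0.0) for key, value in scores.items()}


def hits(edges, iterations=50):
    """Iterate a = E h and h = E^T a over a list of (source, target) links."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # a = E h: a page's authority is the sum of hub scores of pages citing it.
        new_auth = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            new_auth[target] += hub[source]
        auth = normalize(new_auth)
        # h = E^T a: a page's hub score is the sum of authority scores it cites.
        new_hub = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            new_hub[source] += auth[target]
        hub = normalize(new_hub)
    return hub, auth


if __name__ == "__main__":
    toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
    hub_scores, auth_scores = hits(toy_edges)
    print(sorted(auth_scores, key=auth_scores.get, reverse=True)[:2])
```

In practice the edge list would come from the expanded graph built around a query response, so this loop runs at query time.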
10
PageRank and Google
Prestige of a page is proportional to sum of prestige of citing pages
Standard bibliometric measure of influence
Simulate a random walk on the Web to precompute prestige of all pages
Sort keyword-matched responses by decreasing prestige
[Figure: pages p1, p2 and p3 each link to p4; the walk follows a random outlink from the current page]
p4 ∝ p1 + p2 + p3
I.e., p = Ep
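The random walk can be simulated by repeatedly pushing each page's prestige along its outlinks. This is a minimal sketch under stated assumptions: the slide only says "follow random outlink from page", so the damping (teleport) factor and the uniform jump from pages with no outlinks are additions made here to guarantee convergence, and the function name and toy graph are hypothetical.

```python
def pagerank(outlinks, damping=0.85, iterations=100):
    """Approximate the stationary distribution of a random surfer.

    outlinks maps every page to the list of pages it links to; every
    link target must also appear as a key.
    """
    pages = sorted(outlinks)
    count = len(pages)
    prestige = {page: 1.0 / count for page in pages}
    for _ in range(iterations):
        # Assumed teleport step: with probability 1 - damping, jump anywhere.
        updated = {page: (1.0 - damping) / count for page in pages}
        for source in pages:
            # Assumed handling of dead ends: jump uniformly to any page.
            targets = outlinks[source] or pages
            share = damping * prestige[source] / len(targets)
            for target in targets:
                updated[target] += share
        prestige = updated
    return prestige


if __name__ == "__main__":
    toy_web = {"p1": ["p4"], "p2": ["p4"], "p3": ["p4"], "p4": ["p1"]}
    scores = pagerank(toy_web)
    print(max(scores, key=scores.get))
```

Because the scores do not depend on any query, they can be computed once offline and reused to sort keyword-matched responses.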
11
Observations
HITS:
Uses text initially to select Web subgraph
Expands subgraph by radius 1 … magic!
h and a scores independent of content
Iterations required at query time
Google/PageRank:
Precomputed query-independent prestige
No iterations needed at query time, faster
Keyword query selects subgraph to rank
No notion of hub or bipartite reinforcement
12
Limitations
Artificial decoupling of text and links
Connectivity-based topic drift (HITS)
“movie awards” → “movies”
Expanders at www.web-popularity.com
Feature diffusion (Google)
“more evil than evil” → www.microsoft.com
New threat of anchor-text spamming
Decoupled ranking (Google)
“harvard mother” → Bill Gates’s bio page!
13
Genealogy
[Figure: lineage of hyperlink mining techniques]
Bibliometry
Google
HITS
Clever@IBM: exploiting anchor text
Topic distillation@Compaq: outlier elimination
Relaxation labeling
Text classification
Hypertext classification
Learning topic paths
Focused crawling
Crawling context graphs
14
Reducing topic drift: anchor text
Page modeled as sequence of tokens and outlinks
“Radius of influence” around each token
Query term matching token increases link weight
Favors hubs and authorities near relevant pages
Better answers than HITS
Ad-hoc “spreading activation”, but no formal model as yet
[Figure label: Query term]
15
Reducing topic drift: Outlier detection
Search response is usually ‘purer’ than radius=1 expansion
Compute document term vectors
Compute centroid of response vectors
Eliminate far-away expanded vectors
Results improve
Why stop at radius=1?
[Figure: vector-space document model showing the keyword search response, the expanded graph, the centroid (×), and the cut-off radius]
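The pruning steps listed on this slide can be sketched as follows. This is an illustration under assumptions, not the method as published: raw term counts stand in for the term vectors, cosine similarity stands in for the distance measure, and the cut-off value 0.3 is arbitrary. All function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one document (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(first, second):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * second.get(term, 0.0) for term, weight in first.items())
    norm_first = math.sqrt(sum(w * w for w in first.values()))
    norm_second = math.sqrt(sum(w * w for w in second.values()))
    return dot / (norm_first * norm_second) if norm_first and norm_second else 0.0


def centroid(vectors):
    """Average of a list of sparse term vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: weight / len(vectors) for term, weight in total.items()}


def prune_expansion(response_docs, expanded_docs, min_similarity=0.3):
    """Keep only expanded documents that lie close to the response centroid."""
    center = centroid([term_vector(doc) for doc in response_docs])
    return [doc for doc in expanded_docs
            if cosine(term_vector(doc), center) >= min_similarity]


if __name__ == "__main__":
    response = ["movie awards oscar ceremony", "oscar awards best movie"]
    expanded = ["oscar awards nominees movie", "cheap web hosting popularity counter"]
    print(prune_expansion(response, expanded))
```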
16
Resource discovery
Given:
Yahoo-like topic tree with example URLs
A selection of good topics to explore
Examples, not queries, define topics
Need 2-way decision, not ad-hoc cut-off
Goal:
Start from the good / relevant examples
Crawl to collect additional relevant URLs
Fetch as few irrelevant URLs as possible
17
A model for relevance
[Figure: topic taxonomy tree with nodes All, Arts, Bus&Econ, Recreation, Companies, Cycling, Bike Shops, Mt. Biking, Clubs, ...; legend: path class, good classes, subsumed classes, blocked class]
18
Pr(c|d) from Pr(d|c) using Bayes rule
Decide topic; topic c is picked with prior probability π(c); Σ_c π(c) = 1
Each c has parameters θ(c,t) for terms t
Coin with face probabilities; Σ_t θ(c,t) = 1
Fix document length n(d) and toss coin
Naïve yet effective; can use other algos
Given c, probability of document is the multinomial
$\Pr(d \mid c) = \binom{n(d)}{\{n(d,t)\}} \prod_{t} \theta(c,t)^{n(d,t)}$, where n(d,t) is the number of times term t occurs in d
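The coin-toss model on this slide corresponds to a multinomial naive Bayes classifier. The sketch below is illustrative only: it assumes whitespace tokenization and add-one (Laplace) smoothing of the term probabilities, neither of which the slide specifies, and it skips terms never seen in training. The function names and training examples are hypothetical. The multinomial coefficient is the same for every class, so it cancels when the posterior is normalized and is omitted.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Estimate class priors and per-class term probabilities from (class, text) pairs."""
    class_counts = Counter(label for label, _ in examples)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in examples:
        tokens = text.lower().split()
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    prior = {label: count / len(examples) for label, count in class_counts.items()}
    theta = {}
    for label in class_counts:
        # Assumed add-one smoothing so that unseen (class, term) pairs keep nonzero mass.
        total = sum(term_counts[label].values()) + len(vocabulary)
        theta[label] = {term: (term_counts[label][term] + 1) / total
                        for term in vocabulary}
    return prior, theta


def posterior(text, prior, theta):
    """Return Pr(c|d) for every class c using Bayes rule in log space."""
    log_score = {}
    for label in prior:
        score = math.log(prior[label])
        for token in text.lower().split():
            if token in theta[label]:  # terms outside the vocabulary are ignored
                score += math.log(theta[label][token])
        log_score[label] = score
    top = max(log_score.values())
    normalizer = sum(math.exp(score - top) for score in log_score.values())
    return {label: math.exp(score - top) / normalizer
            for label, score in log_score.items()}


if __name__ == "__main__":
    training = [
        ("cycling", "mountain bike trail ride"),
        ("cycling", "bike shop repair"),
        ("funds", "mutual fund investment returns"),
        ("funds", "fund manager returns"),
    ]
    prior, theta = train(training)
    print(posterior("bike trail", prior, theta))
```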
19
Enhanced models for hypertext
c=class, d=text, N=neighbors
Text-only model: Pr(d|c)
Using neighbors’ text to judge my topic: Pr(d, d(N) | c)
Better recursive model: Pr(d, c(N) | c)
Relaxation labeling over Markov random fields
Or, EM formulation?
20
Hyperlink modeling boosts accuracy
9600 patents from 12 classes marked by USPTO
Patents have text and prior art links
Expand test patent to include neighborhood
‘Forget’ and re-estimate fraction of neighbors’ classes
(Even better for Yahoo)
21
Resource discovery: basic approach
Topic taxonomy with examples and ‘good’ topics specified
Crawler coupled to hypertext classifier
Crawl frontier expanded in relevance order
Neighbors of good hubs expanded with high priority
[Figure: crawl frontier around the example URLs, illustrating the radius-1 rule and the radius-2 rule]
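Expanding the frontier in relevance order can be sketched with a priority queue. This is a simplified illustration, not the system described in the talk: each unvisited link inherits the relevance of the page that cited it (one reading of the radius-1 rule), the classifier is replaced by a caller-supplied scoring function, and hub-based prioritization is left out. The function names, the toy link graph, and the toy scoring rule are all hypothetical.

```python
import heapq


def focused_crawl(seeds, fetch_links, relevance, budget):
    """Visit up to `budget` pages, always expanding the most promising link first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        score = relevance(url)
        visited.append((url, score))
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # Assumed rule: a link is as promising as the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return visited


# Hypothetical in-memory link graph standing in for real page fetches.
TOY_WEB = {
    "bike/home": ["bike/shops", "news/home"],
    "bike/shops": ["bike/clubs", "bike/trails"],
    "bike/clubs": [],
    "bike/trails": [],
    "news/home": ["news/sports", "news/politics"],
    "news/sports": [],
    "news/politics": [],
}


def toy_links(url):
    return TOY_WEB.get(url, [])


def toy_relevance(url):
    # Stand-in for the hypertext classifier's relevance estimate.
    return 1.0 if url.startswith("bike/") else 0.1


if __name__ == "__main__":
    for url, score in focused_crawl(["bike/home"], toy_links, toy_relevance, budget=5):
        print(url, score)
```

Links found on an irrelevant page sink to the bottom of the queue, so the crawl budget is spent near relevant pages first.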
22
Focused crawler block diagram
[Figure: block diagram with components Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Scheduler, Workers, Topic Distiller, Feedback]
23
Focused crawling evaluation
Harvest rate: what fraction of crawled pages are relevant
Robustness across seed sets
Perform separate crawls with random disjoint samples
Measure overlap in URLs, server IP addresses, and best-rated resources
Evidence of non-trivial work
Path length to the best resources
24
Harvest rate
[Figure: harvest rate plots; panels: Unfocused, Focused]
25
Crawl robustness
[Figure: overlap between Crawl 1 and Crawl 2; panels: URL Overlap, Server Overlap]
26
Robustness of resource quality
Sample disjoint sets of starting URLs
Two separate crawls
Run HITS/Clever
Find best authorities
Order by rank
Find overlap in the top-rated resources
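The last step, measuring overlap in the top-rated resources, can be made concrete as follows. The slide does not define the measure, so this sketch assumes a simple one: the fraction of the top k authorities that two independently seeded crawls have in common. The function name and sample rankings are hypothetical.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k entries shared by two rankings (best first)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


if __name__ == "__main__":
    crawl_1 = ["u1", "u2", "u3", "u4"]
    crawl_2 = ["u2", "u1", "u5", "u3"]
    for k in (1, 2, 3, 4):
        print(k, top_k_overlap(crawl_1, crawl_2, k))
```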
27
Distance to best resources
[Figure: two panels, “Cycling: cooperative” and “Mutual funds: competitive”]
28
A top hub on ‘airlines’ after half an hour of focused crawling
29
A top hub on ‘bicycling’ after one hour of focused crawling
30
Learning context graphs
Topics form connected cliques
“heart disease” ‘swimming’, ‘hiking’ ‘cycling’ “first-aid”!
Radius-1 rule can be myopic
Trapped within boundaries of related topics
From short pre-crawled paths
Can learn frequent chains of related topics
Use this knowledge to circumvent local “topic traps”
31
Context improves focused crawling
32
Roadmap
Hyperlink mining: a short history
Resource discovery
Content-based locality in hypertext
Taxonomy models, topic distillation
Strategies for focused crawling
Data capture and mining architecture
The Memex collaboration system
–Collaborative construction of vertical portals
Link metadata management architecture
–Surfing backwards on the Web
33
Memex project goals
Infrastructure to support spontaneous formation of topic-based communities
Mining algorithms for personal and community level topic management and collaborative resource discovery
Extensible API for plugging in additional hypertext analysis tools
34
Memex project status
Java applet client
Netscape 4.5+ (Javascript) available
IE4+ (ActiveX) planned
Server code for Unix and Windows
Servlets + IBM Universal Database
Berkeley DB lightweight storage manager
Simple-to-install RPMs for Linux planned
About a dozen alpha testers
First beta available 12/2000
35
Creating personal topic spaces
Valuable user input and feedback on topics and associated examples
File manager-like interface
Privacy choice
‘?’ indicates automatic placement by Memex classifier
User cuts and pastes to correct or reinforce the Memex classifier
36
Replaying topic-based contexts
“Where was I when last surfing around /Software/Programming?”
Choice of topic context
Replay of recent browsing context restricted to chosen topic
Active browser monitoring and dynamic layout of new/incremental context graph
Better mobility than one-dimensional history provided by popular browsers
37
Synthesis of a community taxonomy
Users classify URLs into folders
How to synthesize personal folders into common taxonomy?
Combine multiple similarity hints
[Figure: personal folders (Entertainment, Studios, Broadcasting, Media) holding URLs such as kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com; similarity hints: share document, share folder, share terms; resulting themes: ‘Radio’, ‘Television’, ‘Movies’]
38
Setting up the focused crawler
[Screenshot: Taxonomy Editor with panes for Current Examples and Suggested Additional Examples; examples are added by drag]
39
Monitoring harvest rate
[Figure: relevance/harvest rate plotted against time; series: one URL, moving average]
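The moving-average series in this plot can be computed from the per-URL relevance scores as the crawl proceeds. A minimal sketch follows; the window size is an assumption (the slide does not state one) and the function name is hypothetical.

```python
from collections import deque


def moving_harvest_rate(relevances, window=3):
    """Average relevance over the most recent `window` fetched pages, per step."""
    recent = deque(maxlen=window)  # older scores fall off automatically
    averages = []
    for relevance in relevances:
        recent.append(relevance)
        averages.append(sum(recent) / len(recent))
    return averages


if __name__ == "__main__":
    per_url_relevance = [1, 0, 1, 1, 0, 0, 1]
    print(moving_harvest_rate(per_url_relevance, window=3))
```

A sustained drop in the averaged series suggests the crawler has wandered off topic, which the noisy per-URL scores alone would not show clearly.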
40
Overview of the Memex system
[Figure: architecture diagram. Browser: visit, download client JAR, attach, running client applet. Memex server: event-handler servlets (search, folder, context, archive), relational metadata, text index, topic models, mining demons (taxonomy synthesis, resource discovery, recommendation, classification, clustering). Browser and server are connected by the Memex client-server protocol and workload sharing negotiations.]
41
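Surfing backwards using contexts
Space-bounded referrer log
HTTP extension to query backlink data
[Figure: client C follows a link from http://S1/P1 to http://S2/P2 and sends server S2 the request “GET /P2 HTTP/1.0” with the header “Referer: http://S1/P1”; the referrer is recorded in a backlink database; a second client C’ asks “Who points to S2/P2?”]
Backlink database: local or on Memex server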
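A space-bounded referrer log can be sketched as a per-page store that records the Referer header of each request and answers "who points to this page?" queries. The bounding policy here (keep only the most recently seen referrers per page) is an assumption; the slide says only that the log is space-bounded. The class and method names are hypothetical and do not represent the proposed HTTP extension.

```python
from collections import OrderedDict, defaultdict


class BacklinkLog:
    """Remembers a bounded number of referrers for each requested URL."""

    def __init__(self, max_referrers_per_page=3):
        self.max_referrers_per_page = max_referrers_per_page
        self._log = defaultdict(OrderedDict)

    def record(self, target_url, referer):
        """Store the Referer header seen on a request for target_url."""
        if not referer:
            return  # requests without a Referer header carry no backlink
        referrers = self._log[target_url]
        referrers.pop(referer, None)  # re-insert so it counts as most recent
        referrers[referer] = True
        while len(referrers) > self.max_referrers_per_page:
            referrers.popitem(last=False)  # assumed policy: evict the oldest

    def backlinks(self, target_url):
        """Answer 'who points to target_url?' from the recorded referrers."""
        return list(self._log.get(target_url, ()))


if __name__ == "__main__":
    log = BacklinkLog(max_referrers_per_page=2)
    log.record("http://S2/P2", "http://S1/P1")
    print(log.backlinks("http://S2/P2"))
```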
42
Surfing backwards 1
43
Surfing backwards 2
44
Surfing backwards 3
45
Surfing backwards 4
46
User study and analysis
(1999) Significant improvement in finding comprehensive resource lists
Six broad information needs, 25 volunteers
Find good resources within limited time
Backlinks faked using search engines
Blind-reviewed by three other volunteers
(2000) Average path length of the undirected Web graph is much smaller than that of the directed Web graph
(2000) Better focused crawls using backlinks
Proposal to W3C
47
Backlinks improve focused crawling
Follow forward HREF as before
Also expand backlinks using ‘link:’ queries
Classify pages as before
Sometimes distracts into unrewarding work…
…but pays off in the end
48
Surfing backwards: summary
“Life must be lived forwards, but it can only be understood backwards” (Søren Kierkegaard)
Hubs are everywhere!
To find them, look backwards
Bidirectional surfing is a valuable means to seed focused resource discovery
Even if one has to depend on search engines initially for link:… queries
49
Conclusion
Architecture for topic-specific web resource discovery
Driven by examples collected from surfing and bookmarking activity
Reduced dependence on large crawlers
Modest desktop hardware adequate
Variable radius goal-directed crawling
High harvest rate
High quality resources found far from keyword query response nodes