Nurturing content-based collaborative communities on the Web
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering
Indian Institute of Technology Bombay
www.cse.iitb.ernet.in/~soumen
www.cse.iitb.ernet.in/~cfiir

EMNLP/VLC 2000

Generic search engines
- Struggle to cover the expanding Web
  - 35% coverage in 1997 (Bharat and Broder)
  - 18% in 1999 (Lawrence and Giles)
  - Google rebounds to 50% in 2000
  - Moore’s law vs. Web population
  - Search quality, index freshness
- Cannot afford advanced processing
  - Alta Vista serves >40 million queries / day
  - Cannot even afford to seek on disk (8 ms)
  - Limits intelligence of search engines

Scale vs. quality
[Figure: approaches placed on axes of scale and quality]
- Keyword-based search engines: HotBot, Alta Vista
- Link-assisted ranking: Google, Clever
- Resource discovery: focused crawling, topic distillation
- Lexical networks, parsing, semantic indexing

The case for vertical portals
“Portals and search pages are changing rapidly, in part because their biggest strength, massive size and reach, can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical, and useful, than trying to cover the entire universe.” (San Jose Mercury News, 1/1999)

Scaling through specialization
- The Web shows content-based locality
  - Link-based clusters correlated with content
  - Content-based communities emerge in a spontaneous, decentralized fashion
- Can learn and exploit locality patterns
  - Analyze page visits and bookmarks
  - Automatically construct a “focused portal” with resources that
    - Have high relevance and quality
    - Are up-to-date and collectively comprehensive

Roadmap
- Hyperlink mining: a short history
- Resource discovery
  - Content-based locality in hypertext
  - Taxonomy models, topic distillation
  - Strategies for focused crawling
- Data capture and mining architecture
  - The Memex collaboration system
    - Collaborative construction of vertical portals
  - Link metadata management architecture
    - Surfing backwards on the Web

Historical background
- First generation Web search engines
  - Delete ‘stopwords’ from queries
  - Can only do syntactic matching
- Users stopped asking good questions!
  - TREC queries: tens to hundreds of words
  - Alta Vista: at most 2–3 words
- Crisis of abundance
  - Relevance ranking for very short queries
  - Quality complements relevance: that’s where hand-made topic directories shine

Hyperlink Induced Topic Search
[Figure: a keyword query is sent to a search engine; the response set is expanded into a larger graph whose nodes are scored as hubs (h) and authorities (a)]
- ‘Hubs’ and ‘authorities’
- a = Eh
- h = E^T a
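
The two update rules on this slide can be sketched as a short power iteration. This is a minimal illustration and not the original implementation: the edge-list representation, the L2 normalization after each step, and the fixed iteration count are assumptions, and the function name `hits` is hypothetical. The code reads the slide's a = Eh as "the authority score of a page sums the hub scores of the pages that cite it".

```python
import math

def hits(links, n, iterations=50):
    """Return (authority, hub) score lists for pages 0..n-1.

    links is a list of (src, dst) pairs meaning page src links to page dst.
    Assumption: scores are L2-normalized after every update.
    """
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # a = E h: authority of dst accumulates hub scores of citing pages.
        a = [0.0] * n
        for src, dst in links:
            a[dst] += h[src]
        norm = math.sqrt(sum(x * x for x in a)) or 1.0
        a = [x / norm for x in a]
        # h = E^T a: hub score of src accumulates authority of cited pages.
        h = [0.0] * n
        for src, dst in links:
            h[src] += a[dst]
        norm = math.sqrt(sum(x * x for x in h)) or 1.0
        h = [x / norm for x in h]
    return a, h

# Toy graph (hypothetical): pages 0 and 1 both link to pages 2 and 3.
links = [(0, 2), (0, 3), (1, 2), (1, 3)]
authority, hub = hits(links, 4)
```

On this toy graph pages 2 and 3 come out as the authorities and pages 0 and 1 as the hubs, which is the bipartite reinforcement the slide alludes to.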

PageRank and Google
- Prestige of a page is proportional to sum of prestige of citing pages
- Standard bibliometric measure of influence
- Simulate a random walk on the Web to precompute prestige of all pages
- Sort keyword-matched responses by decreasing prestige
[Figure: pages p1, p2, p3 link to p4; the walk follows a random outlink from the current page]
- p4 ∝ p1 + p2 + p3, i.e., p = Ep
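
The random-walk computation described here can be sketched as follows. The slide only states p = Ep with a walk that follows random outlinks; the damping factor of 0.85, the uniform treatment of pages without outlinks, and the function name `pagerank` are assumptions added so that the iteration converges on small graphs.

```python
def pagerank(outlinks, damping=0.85, iterations=100):
    """Precompute prestige by simulating a random walk.

    outlinks maps a page to the list of pages it links to.
    Assumption: with probability 1 - damping the walker jumps to a
    uniformly random page instead of following an outlink.
    """
    pages = sorted(set(outlinks) | {q for qs in outlinks.values() for q in qs})
    n = len(pages)
    p = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        nxt = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = outlinks.get(page, [])
            if targets:
                # Split this page's prestige evenly over its outlinks.
                share = damping * p[page] / len(targets)
                for q in targets:
                    nxt[q] += share
            else:
                # Page without outlinks: spread its prestige uniformly.
                for q in pages:
                    nxt[q] += damping * p[page] / n
        p = nxt
    return p

# The slide's example: p1, p2 and p3 all cite p4 (p4 -> p1 is hypothetical).
graph = {"p1": ["p4"], "p2": ["p4"], "p3": ["p4"], "p4": ["p1"]}
prestige = pagerank(graph)
```

Because the scores are computed once for the whole graph, a query only needs to sort its keyword matches by the stored prestige values.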

Observations
- HITS
  - Uses text initially to select Web subgraph
  - Expands subgraph by radius 1 … magic!
  - h and a scores independent of content
  - Iterations required at query time
- Google/PageRank
  - Precomputed query-independent prestige
  - No iterations needed at query time, faster
  - Keyword query selects subgraph to rank
  - No notion of hub or bipartite reinforcement

Limitations
- Artificial decoupling of text and links
- Connectivity-based topic drift (HITS)
  - “movie awards” → “movies”
  - Expanders at www.web-popularity.com
- Feature diffusion (Google)
  - “more evil than evil” → www.microsoft.com
  - New threat of anchor-text spamming
- Decoupled ranking (Google)
  - “harvard mother” → Bill Gates’s bio page!

Genealogy
[Figure: genealogy of hyperlink-mining techniques; node labels follow]
- Bibliometry
- Google
- HITS
- Clever@IBM
- Exploiting anchor text
- Topic distillation @Compaq
- Outlier elimination
- Relaxation labeling
- Text classification
- Hypertext classification
- Learning topic paths
- Focused crawling
- Crawling context graphs

Reducing topic drift: anchor text
- Page modeled as sequence of tokens and outlinks
- “Radius of influence” around each token
- Query term matching token increases link weight
- Favors hubs and authorities near relevant pages
- Better answers than HITS
- Ad-hoc “spreading activation”, but no formal model as yet
[Figure: a query term within the radius of influence of an outlink]

Reducing topic drift: Outlier detection
- Search response is usually ‘purer’ than radius=1 expansion
- Compute document term vectors
- Compute centroid of response vectors
- Eliminate far-away expanded vectors
- Results improve
- Why stop at radius=1?
[Figure: the keyword search response and the expanded graph in a vector-space document model, with the centroid and a cut-off radius marked]
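
The pruning steps listed above can be sketched in a few lines. This is an illustration under stated assumptions: raw term counts serve as the term vectors, cosine similarity stands in for the distance, and the cut-off `min_similarity=0.2` is an arbitrary illustrative value. All function names are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term vector (assumption: whitespace tokens, raw counts)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_expansion(response_docs, expanded_docs, min_similarity=0.2):
    """Keep only expanded documents that lie close to the response centroid."""
    centroid = Counter()
    for doc in response_docs:
        # Summing is enough: cosine similarity ignores the vector's scale.
        centroid.update(term_vector(doc))
    return [d for d in expanded_docs
            if cosine(term_vector(d), centroid) >= min_similarity]

response = ["movie awards ceremony", "movie awards winners list"]
expanded = ["movie awards nominees", "cheap flights hotel deals"]
kept = prune_expansion(response, expanded)
```

Here the off-topic expanded page shares no terms with the response set and is dropped, while the on-topic one is kept.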

EMNLP/VLC 2000 Resource discovery  Given Yahoo-like topic tree with example URLs A selection of good topics to explore  Examples, not queries, define topics Need 2-way decision, not ad-hoc cut-off  Goal Start from the good / relevant examples Crawl to collect additional relevant URLs Fetch as few irrelevant URLs as possible

A model for relevance
[Figure: topic taxonomy rooted at All, with nodes Arts, Bus&Econ, Recreation, ..., Companies, Cycling, Bike Shops, Mt. Biking, Clubs; legend: path class, good classes, subsumed classes, blocked class]

Pr(c|d) from Pr(d|c) using Bayes rule
- Decide topic; topic c is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$
- Each c has parameters $\theta(c,t)$ for terms t
- Coin with face probabilities $\theta(c,t)$; $\sum_t \theta(c,t) = 1$
- Fix document length n(d) and toss coin
- Naïve yet effective; can use other algorithms
- Given c, probability of document is
  $\Pr(d \mid c) = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$
  where n(d,t) is the number of times term t occurs in d
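
The generative model on this slide leads to a small multinomial naive Bayes classifier. The sketch below is illustrative only: add-one smoothing of the term parameters, whitespace tokenization, and the function names `train` and `posterior` are assumptions. The multinomial coefficient is the same for every class, so it cancels when Pr(c|d) is normalized.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Estimate priors and per-class term parameters from (class, text) pairs.

    Assumption: term probabilities use add-one (Laplace) smoothing.
    """
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for c, text in examples:
        tokens = text.lower().split()
        class_docs[c] += 1
        term_counts[c].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    prior = {c: class_docs[c] / total for c in class_docs}
    theta = {}
    for c in class_docs:
        denom = sum(term_counts[c].values()) + len(vocab)
        theta[c] = {t: (term_counts[c][t] + 1) / denom for t in vocab}
    return prior, theta, vocab

def posterior(text, prior, theta, vocab):
    """Pr(c|d) by Bayes rule, computed in log space for stability."""
    log_score = {}
    for c in prior:
        s = math.log(prior[c])
        for t in text.lower().split():
            if t in vocab:  # terms never seen in training are skipped
                s += math.log(theta[c][t])
        log_score[c] = s
    top = max(log_score.values())
    z = sum(math.exp(s - top) for s in log_score.values())
    return {c: math.exp(s - top) / z for c, s in log_score.items()}

examples = [("cycling", "bike trail mountain bike"),
            ("finance", "mutual fund stock market")]
prior, theta, vocab = train(examples)
post = posterior("mountain bike shop", prior, theta, vocab)
```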

Enhanced models for hypertext
- c=class, d=text, N=neighbors
- Text-only model: Pr(d|c)
- Using neighbors’ text to judge my topic: Pr(d, d(N) | c)
- Better recursive model: Pr(d, c(N) | c)
- Relaxation labeling over Markov random fields
- Or an EM formulation?

Hyperlink modeling boosts accuracy
- 9600 patents from 12 classes marked by USPTO
- Patents have text and prior art links
- Expand test patent to include neighborhood
- ‘Forget’ and re-estimate fraction of neighbors’ classes
(Even better for Yahoo)

Resource discovery: basic approach
- Topic taxonomy with examples and ‘good’ topics specified
- Crawler coupled to hypertext classifier
- Crawl frontier expanded in relevance order
- Neighbors of good hubs expanded with high priority
[Figure: example URLs and unvisited neighbors marked ‘?’, illustrating the radius-1 rule and the radius-2 rule]
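
Expanding the crawl frontier in relevance order can be sketched with a priority queue. This is a simplified illustration, not the system's scheduler: `fetch_links` and `relevance` are hypothetical callables standing in for the page fetcher and the hypertext classifier, and an unseen link simply inherits the relevance of the page that cites it (a radius-1 style heuristic; hub-based prioritization is omitted).

```python
import heapq

def focused_crawl(seeds, fetch_links, relevance, budget):
    """Visit up to `budget` pages, always expanding the most promising link.

    fetch_links(url) -> list of outlinked URLs (hypothetical fetcher).
    relevance(url)   -> score in [0, 1] (hypothetical classifier).
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        score = relevance(url)
        visited.append((url, score))
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # An unseen link is queued with the relevance of its citer.
                heapq.heappush(frontier, (-score, link))
    return visited

# Toy in-memory Web (hypothetical data).
web = {"seed": ["bike1", "fund1"], "bike1": ["bike2"], "fund1": ["fund2"]}
topic_score = {"seed": 1.0, "bike1": 0.9, "bike2": 0.8, "fund1": 0.1, "fund2": 0.1}
crawled = focused_crawl(["seed"],
                        lambda u: web.get(u, []),
                        lambda u: topic_score.get(u, 0.0),
                        budget=4)
```

In the toy run the crawler still fetches one irrelevant page linked from the seed, but the low score of that page pushes its own outlinks to the back of the queue, so they are never fetched within the budget.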

Focused crawler block diagram
[Figure: block diagram; component labels follow]
- Taxonomy Database
- Taxonomy Editor
- Example Browser
- Crawl Database
- Hypertext Classifier (Learn)
- Topic Models
- Hypertext Classifier (Apply)
- Scheduler
- Workers
- Topic Distiller
- Feedback

Focused crawling evaluation
- Harvest rate
  - What fraction of crawled pages are relevant
- Robustness across seed sets
  - Perform separate crawls with random disjoint samples
  - Measure overlap in URLs, server IP addresses, and best-rated resources
- Evidence of non-trivial work
  - Path length to the best resources
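
Two of these measures are simple enough to sketch directly. The slides do not say how relevance is thresholded or how overlap is normalized, so the 0.5 threshold and the division by the smaller crawl are assumptions, as are the function names.

```python
def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance reaches the threshold."""
    if not relevance_scores:
        return 0.0
    relevant = sum(1 for r in relevance_scores if r >= threshold)
    return relevant / len(relevance_scores)

def overlap(crawl_a, crawl_b):
    """Share of the smaller crawl that also appears in the other crawl.

    Works for URLs, server addresses, or top-rated resources alike.
    """
    a, b = set(crawl_a), set(crawl_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

rate = harvest_rate([0.9, 0.2, 0.7, 0.6])
shared = overlap(["a", "b", "c"], ["b", "c", "d"])
```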

Harvest rate
[Figure: harvest rate of unfocused vs. focused crawls]

Crawl robustness
[Figure: URL overlap and server overlap between Crawl 1 and Crawl 2]

Robustness of resource quality
- Sample disjoint sets of starting URLs
- Two separate crawls
- Run HITS/Clever
- Find best authorities
- Order by rank
- Find overlap in the top-rated resources

Distance to best resources
[Figure: two panels, “Cycling: cooperative” and “Mutual funds: competitive”]

A top hub on ‘airlines’ after half an hour of focused crawling

A top hub on ‘bicycling’ after one hour of focused crawling

Learning context graphs
- Topics form connected cliques
  - “heart disease” → ‘swimming’, ‘hiking’, ‘cycling’ → “first-aid”!
- Radius-1 rule can be myopic
  - Trapped within boundaries of related topics
- From short pre-crawled paths
  - Can learn frequent chains of related topics
  - Use this knowledge to circumvent local “topic traps”

Context improves focused crawling

Roadmap
- Hyperlink mining: a short history
- Resource discovery
  - Content-based locality in hypertext
  - Taxonomy models, topic distillation
  - Strategies for focused crawling
- Data capture and mining architecture
  - The Memex collaboration system
    - Collaborative construction of vertical portals
  - Link metadata management architecture
    - Surfing backwards on the Web

Memex project goals
- Infrastructure to support spontaneous formation of topic-based communities
- Mining algorithms for personal and community level topic management and collaborative resource discovery
- Extensible API for plugging in additional hypertext analysis tools

Memex project status
- Java applet client
  - Netscape 4.5+ (JavaScript) available
  - IE4+ (ActiveX) planned
- Server code for Unix and Windows
  - Servlets + IBM Universal Database
  - Berkeley DB lightweight storage manager
  - Simple-to-install RPMs for Linux planned
- About a dozen alpha testers
- First beta available 12/2000

Creating personal topic spaces
- Valuable user input and feedback on topics and associated examples
[Figure: screenshot with callouts]
- File manager-like interface
- Privacy choice
- ‘?’ indicates automatic placement by Memex classifier
- User cuts and pastes to correct or reinforce the Memex classifier

Replaying topic-based contexts
- “Where was I when last surfing around /Software/Programming?”
[Figure: screenshot with callouts]
- Choice of topic context
- Replay of recent browsing context restricted to chosen topic
- Active browser monitoring and dynamic layout of new/incremental context graph
- Better mobility than one-dimensional history provided by popular browsers

Synthesis of a community taxonomy
- Users classify URLs into folders
- How to synthesize personal folders into common taxonomy?
- Combine multiple similarity hints
  - Share document
  - Share folder
  - Share terms
[Figure: personal folders (Entertainment, Studios, Broadcasting, Media) holding kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com are merged into themes ‘Radio’, ‘Television’, ‘Movies’]
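
One way to combine the similarity hints is a weighted sum of set overlaps. The slides do not specify the combination rule, so the Jaccard measure, the equal weights, the folder representation, and the function names below are all assumptions; the URLs come from the slide but their grouping into folders is hypothetical, and the "share folder" hint is left out for brevity.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def folder_similarity(folder_a, folder_b, w_docs=0.5, w_terms=0.5):
    """Blend shared-document and shared-term evidence between two folders.

    Each folder is a dict with 'urls' and 'terms' lists (assumed layout).
    """
    return (w_docs * jaccard(folder_a["urls"], folder_b["urls"])
            + w_terms * jaccard(folder_a["terms"], folder_b["terms"]))

radio = {"urls": ["kpfa.org", "bbc.co.uk"], "terms": ["radio", "news"]}
tv = {"urls": ["bbc.co.uk", "kron.com"], "terms": ["television", "news"]}
movies = {"urls": ["miramax.com"], "terms": ["movies", "film"]}

radio_tv = folder_similarity(radio, tv)
radio_movies = folder_similarity(radio, movies)
```

Folders from different users whose pairwise similarity is high would be candidates for merging under a common theme.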

Setting up the focused crawler
[Figure: screenshot of the Taxonomy Editor showing current examples and suggested additional examples, which can be dragged into place]

Monitoring harvest rate
[Figure: relevance / harvest rate plotted against time, showing one point per URL and a moving average]
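
The moving average in such a monitoring plot can be computed incrementally as pages arrive. The window size and the function name are assumptions; the slide does not state which window was used.

```python
from collections import deque

def moving_average(scores, window=100):
    """Running mean of the most recent `window` relevance scores."""
    recent = deque(maxlen=window)
    running = 0.0
    averages = []
    for s in scores:
        if len(recent) == window:
            # The oldest score is about to be evicted by append().
            running -= recent[0]
        recent.append(s)
        running += s
        averages.append(running / len(recent))
    return averages

smoothed = moving_average([1, 0, 1, 1], window=2)
```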

Overview of the Memex system
[Figure: architecture diagram; labels follow]
- Browser: client JAR, running client applet; Visit, Download, Attach
- Memex client-server protocol and workload sharing negotiations
- Memex server: event-handler servlets; Search, Folder, Context, Archive
- Relational metadata, text index, topic models
- Mining demons: taxonomy synthesis, resource discovery, recommendation, classification, clustering

Surfing backwards using contexts
- Space-bounded referrer log
- HTTP extension to query backlink data
[Figure: client C follows a link from http://S1/P1 to http://S2/P2, sending server S2 the request
  GET /P2 HTTP/1.0
  Referer: http://S1/P1
The referrer is recorded in a backlink database, kept locally or on the Memex server; another client C’ can then ask “Who points to S2/P2?”]
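
A space-bounded referrer log can be sketched as a fixed-capacity map from target pages to the pages that referred to them. The eviction policy is not described in the slides; evicting the least recently recorded pair is an assumption, and the class and method names are hypothetical.

```python
from collections import OrderedDict

class BacklinkLog:
    """Fixed-capacity log of (target, referrer) pairs seen in Referer headers."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.entries = OrderedDict()  # key order runs from oldest to newest

    def record(self, target, referrer):
        """Note that `referrer` links to `target`, refreshing repeated pairs."""
        key = (target, referrer)
        self.entries.pop(key, None)
        self.entries[key] = None
        if len(self.entries) > self.capacity:
            # Assumed policy: drop the least recently recorded pair.
            self.entries.popitem(last=False)

    def backlinks(self, target):
        """Answer 'who points to target?' from the retained pairs."""
        return [ref for (tgt, ref) in self.entries if tgt == target]

log = BacklinkLog(capacity=2)
log.record("http://S2/P2", "http://S1/P1")
log.record("http://S2/P2", "http://S3/P3")
log.record("http://S2/P2", "http://S4/P4")
pointers = log.backlinks("http://S2/P2")
```

With a capacity of two, the oldest referrer has been evicted by the time the query is made, which is the trade-off a space bound implies.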

Surfing backwards 1

User study and analysis
- (1999) Significant improvement in finding comprehensive resource lists
  - Six broad information needs, 25 volunteers
  - Find good resources within limited time
  - Backlinks faked using search engines
  - Blind-reviewed by three other volunteers
- (2000) Average path length of the undirected Web graph is much smaller than that of the directed Web graph
- (2000) Better focused crawls using backlinks
- Proposal to W3C

Backlinks improve focused crawling
- Follow forward HREF as before
- Also expand backlinks using ‘link:’ queries
- Classify pages as before
[Figure: plot with callouts “Sometimes distracts in unrewarding work…” and “…but pays off in the end”]

Surfing backwards: summary
- “Life must be lived forwards, but it can only be understood backwards” (Søren Kierkegaard)
- Hubs are everywhere!
  - To find them, look backwards
- Bidirectional surfing is a valuable means to seed focused resource discovery
  - Even if one has to depend on search engines initially for link:… queries

Conclusion
- Architecture for topic-specific web resource discovery
- Driven by examples collected from surfing and bookmarking activity
- Reduced dependence on large crawlers
- Modest desktop hardware adequate
- Variable radius goal-directed crawling
- High harvest rate
- High quality resources found far from keyword query response nodes
