Text Retrieval in Peer to Peer Systems David Karger MIT.

1 Text Retrieval in Peer to Peer Systems David Karger MIT

2 Information Retrieval before P2P The traditional approach

3 Information Retrieval
Most of our information base is text (listed from neat to messy):
  - academic journals
  - books and encyclopedias
  - news feeds
  - world wide web pages
How do we find what we need?

4 The Classic IR Model
User has information need
User formulates textual query
System processes corpus of documents
System extracts relevant documents
User refines query
Metrics:
  - recall: % of relevant documents that are retrieved
  - precision: % of retrieved documents that are relevant
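The two metrics can be made concrete with a small sketch. The function name `precision_recall` and the convention that an empty retrieved set scores precision 1.0 are assumptions made here for illustration, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Assumed convention: retrieving nothing counts as perfect precision.
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    # Assumed convention: with no relevant documents, recall is trivially 1.0.
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Four documents retrieved, three relevant, two in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(f"precision={p:.2f} recall={r:.2f}")
```

Under these conventions, fetching nothing gives full precision with zero recall, and fetching the whole corpus gives full recall with low precision.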

5 Precision-Recall Tradeoff
[Figure: plot with recall and precision axes, each running to 100%. Labeled points: Fetch Nothing, Fetch Everything, CIA, Web Search, Library.]

6 Specific Retrieval Algorithms
Define relevance
  - Build a model of documents, meanings
  - Ignore computational cost
Implement efficiently
  - Preprocessing: TB corpora call for big-iron machines (or simulations)
  - Interaction: after 1/2 second, user notices delay; after 10 seconds, user gives up (historical perspective; changed by web)

7 Boolean Keyword Search
Q: “Do harsh winters affect steel production?”
  - Query: steel AND winter
Output:
  - “Last WINTER, overproduction of STEEL led to...”
  - “STEEL automobiles resist WINTER weather poorly.”
  - “Boston must STEEL itself for another bad WINTER”
  - “the Pittsburgh STEELers started WINTER training...”
Not Output:
  - “Cold weather caused increased metal prices as orders for radiators and automobiles picked up...”

8 Implementing Boolean Search
Typical: OR of ANDs
  - handle each OR branch separately, then aggregate
For ANDs, inverted index:
  - Per term, list of documents containing that term
  - intersect lists for query terms
Basically a database join
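A minimal sketch of the inverted-index approach for a single AND clause follows. The helper names `build_inverted_index` and `boolean_and` and the whole-word tokenizer are assumptions for illustration; the sample documents are shortened from the earlier steel/winter example.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: runs of letters, case-folded, no stemming.
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def boolean_and(index, terms):
    """Intersect the per-term document lists for one AND clause."""
    lists = [set(index.get(term.lower(), ())) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []


docs = {
    1: "Last winter, overproduction of steel led to",
    2: "Steel automobiles resist winter weather poorly.",
    3: "Boston must steel itself for another bad winter",
    4: "the Pittsburgh Steelers started winter training",
    5: "Cold weather caused increased metal prices",
}
index = build_inverted_index(docs)
matches = boolean_and(index, ["steel", "winter"])
print(matches)
```

Because this sketch matches whole words only, document 4 ("Steelers") is not returned, unlike the substring-style match in the slide's example.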

9 Intersection Algorithms (as in DB)
Method 1: direct list merge
  - Linear work in summed size of lists
Method 2: examine candidates
  - Start with shortest term list
  - For each list entry, check for other search terms
  - Linear in smallest list
  - Good if at least one rare term, but requires forward index (list of terms in each document)
  - No gain if all search terms common (“flying fish”)
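Both intersection methods can be sketched briefly. The function names and the toy index are assumptions for illustration; the code assumes each term list is sorted and that the forward index maps a document id to its set of terms.

```python
def merge_intersect(a, b):
    """Method 1: two-pointer merge of sorted lists, linear in len(a) + len(b)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def candidate_intersect(index, forward_index, terms):
    """Method 2: walk the shortest list, check other terms in the forward index."""
    rarest = min(terms, key=lambda t: len(index.get(t, [])))
    others = [t for t in terms if t != rarest]
    return [
        doc for doc in index.get(rarest, [])
        if all(t in forward_index[doc] for t in others)
    ]


# Hypothetical toy data: two common terms and one rare term.
index = {
    "flying": [1, 2, 3, 5, 8],
    "fish": [2, 3, 4, 8, 9],
    "coelacanth": [8],
}
forward_index = {
    1: {"flying"}, 2: {"flying", "fish"}, 3: {"flying", "fish"},
    4: {"fish"}, 5: {"flying"}, 8: {"flying", "fish", "coelacanth"},
    9: {"fish"},
}
print(merge_intersect(index["flying"], index["fish"]))
print(candidate_intersect(index, forward_index, ["flying", "fish", "coelacanth"]))
```

With the rare term present, the second method inspects a single candidate; for "flying fish" alone it would still scan a list as long as the ones the merge reads.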

10 Problems with Boolean Approach
Synonymy
  - several words for same thing
  - if author used different one, query won’t match
Polysemy
  - one word can mean many things (“bank”)
  - query matches wrong meanings
Harsh cutoffs (1 wrong keyword kills)
  - user can’t type descriptive paragraph...
Terms have uniform influence
  - repeated occurrence same as single occurrence
  - common terms treated same as rare ones

11 Fixing Problems
Synonymy
  - thesaurus can add equivalent terms to query
  - increases recall, but lowers precision
  - expensive to construct (semantics: manual)
Polysemy
  - use more query terms to disambiguate
  - user might not know more terms
  - increases precision, but lowers recall
Harsh cutoffs
  - quorum system (maximize # matching terms)
Uniformity?

12 Vector Space Model
Document is a vector with one coordinate per term
  - 0-1 for presence/absence of term (quorum)
  - real valued to represent term “importance”: term frequency in document increases value; term frequency in corpus decreases value
Dot product with query measures similarity
Best known implementation: inverted index
  - for each query term, list documents containing it
  - accumulate dot products
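The accumulator-style implementation can be sketched as follows. The specific weight used here (term count times the log of inverse document frequency) is an assumed, common choice; the slide only says that in-document frequency raises the value and corpus frequency lowers it. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_weighted_index(docs):
    """docs: {doc_id: list of terms}. Returns {term: [(doc_id, weight), ...]}."""
    n_docs = len(docs)
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            # Assumed weighting: tf * log(N / df).
            idf = math.log(n_docs / doc_freq[term])
            index[term].append((doc_id, tf * idf))
    return index


def rank(index, query_terms):
    """Accumulate the query-document dot product one query term at a time."""
    scores = defaultdict(float)
    for term, q_weight in Counter(query_terms).items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    "a": ["steel", "steel", "winter"],
    "b": ["steel", "summer"],
    "c": ["winter", "weather"],
}
ranked = rank(build_weighted_index(docs), ["steel", "winter"])
print(ranked)
```

Unlike the Boolean AND, documents "b" and "c" still appear in the ranking even though each contains only one of the two query terms.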

13 Vector Space Advantages
Smoother than Boolean search
  - Provides ranking rather than sharp cut-off
Tends to allow/encourage queries with many nonzero terms
  - Easy to “expand query” with synonyms
  - Hopefully polysemes will “interfere constructively”
  - May even add relevant documents to query (100s or 1000s of terms)

14 P2PIR Simulating big iron

15 Web Search Info From Google
Web queries
  - Almost all queries 2 terms only
  - “Boolean vector space” model (tiny recall OK)
  - Zipf distribution, so caching queries helps some
Corpus
  - 3B pages, 10K average size, 30TB total
  - Inverted index: roughly the same size
  - Fits in a “moderate” P2P system of 30K nodes
  - But must be partitioned. How?
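The corpus figures can be checked with a few lines of arithmetic. This sketch assumes "10K" means 10,000 bytes and uses decimal units; the variable names are illustrative.

```python
pages = 3_000_000_000      # 3B pages
avg_page_bytes = 10_000    # 10K average size (assumed to mean bytes)
nodes = 30_000             # "moderate" P2P system

corpus_bytes = pages * avg_page_bytes    # total corpus size
per_node_bytes = corpus_bytes // nodes   # even split across nodes

print(corpus_bytes / 10**12, "TB total")
print(per_node_bytes / 10**9, "GB per node")
```

With an index of roughly the same size as the corpus, an even split would put on the order of 1 GB of index on each node.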

16 Obvious: Partition Documents
Node builds full inverted index for its subset
Query quite tractable per node
Merge results sent back from each node
Used by Google (in data center) and Gnutella
Drawback: query broadcast to all nodes
  - OK for Google data center; bad for P2P

17 Alternative: Partition Terms
One node owns a few terms of inverted index
  - Term pair is “key” for distributed hash table
Talk only to nodes that own query terms
They return desired inverted-index lists
Results intersected at query issuer
Drawback: transfer huge inverted index lists
Alternative: send first term-list to second
  - Ships 1 (perhaps small) list instead of 2
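The "send first term-list to second" variant can be simulated in-process. Everything here is an assumed simplification: a hash-modulo placement stands in for a real distributed hash table, single terms are used as keys, list lengths are assumed known up front, and a transfer is counted even if two terms happen to land on the same node. The names `Node`, `owner`, `publish`, and `search` are hypothetical.

```python
import hashlib


class Node:
    def __init__(self):
        self.postings = {}  # term -> set of doc ids for terms this node owns


def owner(term, num_nodes):
    """Assumed placement rule: stable hash of the term, modulo node count."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def publish(nodes, doc_id, terms):
    for term in set(terms):
        node = nodes[owner(term, len(nodes))]
        node.postings.setdefault(term, set()).add(doc_id)


def search(nodes, terms):
    """Return (matching doc ids, number of list entries shipped)."""
    def postings(term):
        return nodes[owner(term, len(nodes))].postings.get(term, set())

    ordered = sorted(set(terms), key=lambda t: len(postings(t)))
    if not ordered:
        return set(), 0
    result = set(postings(ordered[0]))  # start from the shortest list
    shipped = 0
    for term in ordered[1:]:
        shipped += len(result)    # current list travels to the next owner
        result &= postings(term)  # intersected where the next list lives
    shipped += len(result)        # final answer returns to the issuer
    return result, shipped


nodes = [Node() for _ in range(4)]
publish(nodes, 1, ["steel", "winter"])
publish(nodes, 2, ["steel", "summer"])
publish(nodes, 3, ["steel", "winter", "boston"])
publish(nodes, 4, ["steel"])
print(search(nodes, ["steel", "winter"]))
```

In this toy run the short "winter" list (2 entries) is shipped instead of both lists (2 + 4 entries), which is the saving the slide describes.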

18 Avoiding Communication (Om Gnawali et
Build inverted index on term pairs
  - Pre-answering all queries
Partition pairs among nodes
Search contacts one node
Problem: pre-computation cost
  - Size-n document generates n^2 pairs
  - Each pair must be communicated
  - Each pair must be stored

19 Good Cases
Music search
  - “document” is song title + author
  - n small, so n^2 factor unimportant
Document windows
  - Usually, good docs have query terms “nearby”
  - Scan window of length 5, take pairs in window
  - 10 pairs/window, so 10n per document
  - So linear in corpus size as before
Bundle pairs to ship over sparse overlay
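The windowed pair generation can be sketched as below. A window of 5 terms contains C(5,2) = 10 unordered pairs, which is where the 10-per-window figure comes from. The function name `window_pairs` and the choice to deduplicate pairs across overlapping windows are assumptions for illustration.

```python
from itertools import combinations


def window_pairs(terms, window=5):
    """Unordered term pairs that co-occur within `window` consecutive positions."""
    pairs = set()
    # A document shorter than the window is treated as a single window.
    for start in range(max(1, len(terms) - window + 1)):
        for a, b in combinations(terms[start:start + window], 2):
            if a != b:
                pairs.add(tuple(sorted((a, b))))
    return pairs


doc = list("abcdefgh")  # eight distinct one-letter "terms"
print(len(window_pairs(doc)), "pairs with window 5")
print(len(window_pairs(doc, window=len(doc))), "pairs with no window limit")
```

Because overlapping windows share pairs, the deduplicated count in this sketch stays below the 10-pairs-per-window bound, and it grows linearly with document length rather than quadratically.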

20 What About Vector Space?
Weighting terms is easy
But cannot limit search to pair list
  - However, need only highest-scored documents on individual terms
  - So, pre-compute and store small “winner list”
Vector space encourages many-term queries
  - Find pairs with small intersection
  - Index triples, quadruples, etc.
  - Apply branch and bound techniques

21 Google Pushback
No need for P2P
  - More precisely: “keep peers in our data center”
  - Exploit high local communication bandwidth
  - Economics support large server farm
More load? Buy more servers
Main bottleneck: content provider bandwidth
  - Limits rate of crawl
  - Google index often weeks out of date
  - Distributed crawler won’t help

22 Google Pushback Pushback
P2P might help
  - Let each node build own index
  - Ship changes to Google
Potential applications
  - real-time index
  - new-relevant-content notification
Problem: SPAM
  - Content providers will lie about index changes
  - Use P2P system to spot-check?

23 Person-to-Person IR New modalities

24 P2P: Systems Perspective
Distributed system has more resources
  - Computation/Storage
  - Reliability
Can exploit, if successfully hide
  - Latency
  - Bandwidth
Goal: simulate reliable big iron
  - Solve traditional problems that need resources
  - File storage, factoring, database queries, IR

25 P2P: Social Perspective
Applications based on person-to-person interactions
  - Messaging
  - Linking/community building (the web)
  - Reputation management (Mojo Nation)
  - File-sharing collaborations (just now)
Need not run on top of P2P network

26 The “Pathetic Fallacy” of P2P
Assumption that network layer should mirror social layer
  - E.g. “peers should be nodes with similar interests”
Many applications work fine on one (big, reliable) machine
  - Placement on P2P system is “coincidental”
  - On other side of “one big machine” abstraction
Breaching abstraction has bad consequences
  - Peering to “friends” unlikely to optimize efficiency, reliability

27 P2P Opportunity: Leverage Involvement of People
Each individual manipulates information
  - In much more powerful, semantic ways than machines can achieve
Record that manipulation
Exploit to help others do better retrieval

28 Link-based Retrieval
Simultaneous work:
  - Kleinberg at IBM
  - Brin/Page at Stanford/Google
People find “good” web pages, link to them
  - So, a page with large in-degree is good
  - Refine: target of many good nodes is good
Mathematically, random walk model
  - Page rank = stationary probability of random walk
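The random-walk view can be sketched with a short power iteration. The damping factor (a random jump with probability 0.15), the uniform handling of pages without out-links, and the fixed iteration count are assumptions added here so the walk has a well-defined stationary distribution; the slide states only that page rank is the stationary probability of a random walk.

```python
def page_rank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}, summing to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no out-links: the walker jumps to a random page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages point at "c", and "c" points at "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = page_rank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page "a" has only one in-link, but that link comes from the highly ranked "c", so "a" ends up well above "b" and "d". This is the "target of many good nodes is good" refinement in action.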

29 Applications
Search
  - Raise relevance of high page-rank pages
  - If lazy, limit corpus to high page-rank
  - Anchor text better description than page contents
Crawl
  - Page rank computed before seeing the page
  - Prioritize high page-rank pages for crawl
People add usable info no system could find

30 P2P: Systems/Social Interactions
Distributed system has novel properties
Exploit them to enable novel capabilities
E.g., anonymity
  - Relies on partition of control/knowledge
E.g., privacy
  - Allow limited access to my private information
  - Gain (false, but important) sense of safety by keeping it on my machine

31 Expertise Networks
Haystack (Karger et al), Shock (Adar et al)
Route questions to appropriate expert
  - Use text to describe knowledge
  - Based on human entry, or indexing of human’s personal files
Might be unwilling to admit knowledge
  - P2P framework can protect anonymity
  - Shock achieves this by Gnutella-style query broadcast
  - More efficient approach?

32 Other New Aspects
Personal information sharing
  - Unwilling to “publish” mail, documents to world
  - But might allow search, access in some cases
  - Keeping data, index on own machine gives (false) sense of security, privacy
Anonymity
  - P2P provides strong anonymity primitives
  - Can be exploited, e.g., for “recommending” embarrassing content

33 Sample Application
Social: “Secret Web”
  - Maintain links for use by page-rank algorithm
  - But, links are secret from most others
  - Need random walk through link path
Implement via recursive lookup
  - Censorproof? Spamproof?

34 Semantics vs. Syntax
Clearly, using word meanings would help
Some systems try to implement semantics
But this is a core AI problem, unsolved
Current attempts don’t scale to large corpora
All current large systems are syntactic only
Idea: use computational power of P2P
Idea: use humans to attach semantics

35 Conclusion: Two Approaches to P2P
Hide P2P (Partition to Partition)
  - Goal: illusion of single server
  - Know how to do task on single server
  - Devise tools to achieve same in distributed sys.
  - Focus on surmounting drawbacks: systems
Exploit P2P (Person to Person)
  - Determine new opportunities afforded by P2P
  - Perhaps impossible on single server
  - Focus on new applications: AI? HCI?
