Peer-to-Peer Information Search
Sebastian Michel, Ecole Polytechnique Fédérale Lausanne, Lausanne, Switzerland
Josiane Xavier Parreira, Max-Planck Institute for Informatics, Saarbrücken, Germany

Peer-to-Peer Information Search - SBBD 2007 Tutorial (2/6/2015)
Outline of Part 1
- Introduction to P2P Systems
- Distributed Hashtables & Range Queries
- Peer-to-Peer IR (Query Routing, Result Merging)
- Overlapping Sources / Multi-key Statistics
- Top-k Query Processing
- Probabilistic Pruning
- Distributed Top-k

P2P Systems
- Known from Napster and others
- Sharing of mostly illegal content (mp3, movies): P2P = Pirate-to-Pirate?
- New kind of network organization: no client/server anymore
- Basic ideas:
  - Each peer connects to a few other peers
  - All peers together form powerful networks
- Potential benefits:
  - No single point of failure
  - Load is spread across multiple peers
  - (Resilient to failures and dynamics)
- Peer: "one that is of equal standing with another" (source: Merriam-Webster Online Dictionary)

Napster
- Developed in 1998; first P2P file-sharing system
- Central server (index)
- Client software sends information about users' contents to the server (publish file statistics)
- Users send queries to the server
- Server responds with the IP addresses of users that store matching files
- File download then happens directly between users: peer-to-peer file sharing!

Gnutella
- Protocol for distributed file sharing
- Started in 2000; in 2005: 1.81 million computers connected*
- Unstructured network
- Truly decentralized
- Uses message flooding during query execution
- Later: version with super nodes and query routing
* http://www.slyck.com/news.php?story=814

Gnutella Style
[Figure: a query ("Paris Hilton?") is flooded through the network; the TTL carried by the message drops from 3 to 0 hop by hop.]

Gnutella Style
- Pros: no complex statistical bookkeeping
- Cons:
  - a lot of network traffic
  - some peers might not be reachable (TTL)
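
The flooding scheme and its TTL limit can be sketched in a few lines. This is a minimal illustration, not the Gnutella wire protocol: the overlay `graph`, the function name `flood`, and the convention that `ttl` counts the hops a query may still travel are assumptions made for the example.

```python
from collections import deque

def flood(graph, origin, ttl):
    """Return the set of peers a flooded query reaches within `ttl` hops."""
    reached = {origin}
    queue = deque([(origin, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL expired: the query is not forwarded any further
        for neighbor in graph[peer]:
            if neighbor not in reached:  # peers ignore queries they already saw
                reached.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return reached

# Hypothetical overlay: a chain a - b - c - d - e
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(sorted(flood(graph, "a", ttl=2)))
```

With a TTL of 2 the peers d and e are never asked, which is the reachability problem listed under the cons.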

Bit Torrent
- Idea: load sharing through file splitting
- A lot of (legal) software distributors offer software through BitTorrent
- Download information is kept in a small .torrent file
- One tracker node per file (specified in the .torrent file)
- The client requests a random peer list from the tracker node, then requests the file segments (segment 1 ... segment 5) from those peers
- Incentives: "tit-for-tat"; each peer remembers collaborative peers and assigns different priorities

Literature
- Book: Andy Oram: Peer-to-Peer: Harnessing the Power of Disruptive Technologies. O'Reilly Media, Inc.

Overlay Networks
- Built on top of existing networks
- Different ways to build an overlay network:
  - structured
  - unstructured
  - hybrid

Self* Properties (Promises)
- Self-Organizing: evolves, grows, ... without being guided/managed
- Self-Optimizing
- Self-Configuring
- Self-Healing: Self-Restoration, Self-Diagnostics
- Self-Protecting

Distributed Hash Tables
- Hash table: given a key, return the bucket id; based on a hash function (like SHA-1)
- Now: distributed. For a given key, return the id of the peer currently responsible for the key
- Challenge: purely distributed protocols that cope with node failures, departures, and arrivals; no central manager

Chord
- Uses an m-bit identifier space ordered in a modulo-2^m circle, the Chord ring
- Maps peers and objects to identifiers in the Chord ring, using the hash function SHA-1
- Uses consistent hashing: an object with identifier id is placed on the successor peer, succ(id), which is the first node whose identifier is equal to, or follows, id on the Chord ring
- Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that k ≤ p and there is no node p' with k ≤ p' and p' < p
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and keys k10, k24, k30, k38, k54]
Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
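
The successor rule can be checked against the ring shown on this slide. The sketch below assumes m = 6 (a ring of 64 positions, which fits the peer ids p1 to p56) and uses the ids directly instead of SHA-1 hashes; `succ` is the slide's name for the mapping, everything else is illustrative.

```python
import bisect

M = 6  # identifier bits (assumed), so the ring has 2**M = 64 positions
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # peer ids from the slide, sorted

def succ(ident):
    """First peer whose id is equal to or follows `ident` on the ring."""
    i = bisect.bisect_left(PEERS, ident % 2**M)
    return PEERS[i % len(PEERS)]  # wrap around past the largest peer id

for key in (10, 24, 30, 38, 54, 60):
    print(f"k{key} -> p{succ(key)}")
```

Key 60 is not on the slide; it is included only to show the wrap-around to p1.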

Chord: Finger Tables
- Peer n maintains routing information about peers that lie on the Chord ring at logarithmically increasing distance
- Chord ring: p1, p8, p14, p21, p32, p38, p42, p48, p51, p56
- Finger table of p8: p8+1 -> p14, p8+2 -> p14, p8+4 -> p14, p8+8 -> p21, p8+16 -> p32, p8+32 -> p42
- Finger table of p42: p42+1 -> p48, p42+2 -> p48, p42+4 -> p48, p42+8 -> p51, p42+16 -> p1, p42+32 -> p14
- Finger table of p51: p51+1 -> p56, p51+2 -> p56, p51+4 -> p56, p51+8 -> p1, p51+16 -> p8, p51+32 -> p21
- Lookup(54): routed from p8 via p42 and p51 to p56, which holds k54
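
The finger tables and the route of Lookup(54) can be reproduced with a small simulation. It is a sketch under simplifying assumptions: m = 6, a global view of all peers (a real Chord node only knows its own fingers), and helper names (`finger_table`, `between`, `lookup`) chosen for this example.

```python
import bisect

M = 6  # identifier bits (assumed)
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # peer ids from the slide

def succ(ident):
    i = bisect.bisect_left(PEERS, ident % 2**M)
    return PEERS[i % len(PEERS)]

def finger_table(n):
    """Finger i points to succ(n + 2**i), for i = 0..M-1."""
    return [succ(n + 2**i) for i in range(M)]

def between(x, a, b, inclusive_right=False):
    """True if x lies in the ring interval (a, b), or (a, b] if requested."""
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b  # interval wraps around position 0

def lookup(start, key):
    """Return the peers visited while resolving `key`, starting at `start`."""
    path, n = [start], start
    while True:
        fingers = finger_table(n)
        if between(key, n, fingers[0], inclusive_right=True):
            return path + [fingers[0]]  # the immediate successor owns the key
        # otherwise forward to the closest finger that precedes the key
        n = next(f for f in reversed(fingers) if between(f, n, key))
        path.append(n)

print(finger_table(8))
print(lookup(8, 54))
```

Each hop at least halves the remaining ring distance to the key, which is where the logarithmic lookup cost comes from.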

Node Joins in Chord
Example: new peer p42 joins between p38 and p48; p48 currently holds k39, k40, k43
- p42 runs lookup(42) and sets its succ pointer to p48
- Moving keys: k39 and k40 move from p48 to p42
- p38 updates its succ pointer to p42
Steps:
  init_finger_tables()
  successor = node.find_successor()
  predecessor = successor.predecessor
  predecessor.successor = new
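
The join example from this slide (p42 entering between p38 and p48) can be replayed with a toy model. The sketch assumes a global view of the ring as a dictionary from peer id to stored keys; real Chord repairs the pointers through its stabilization protocol, which is not modelled here, and the names `join` and `in_range` are illustrative.

```python
def in_range(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # interval wraps around position 0

def join(ring, new):
    """Insert peer `new` into `ring` (peer id -> set of stored keys)."""
    nodes = sorted(ring)
    # successor: first existing peer at or after the new id (with wrap-around)
    successor = next((p for p in nodes if p >= new), nodes[0])
    predecessor = nodes[nodes.index(successor) - 1]
    # keys in (predecessor, new] now belong to the new peer
    moved = {k for k in ring[successor] if in_range(k, predecessor, new)}
    ring[successor] -= moved
    ring[new] = moved
    return predecessor, successor

ring = {38: set(), 48: {39, 40, 43}}
neighbors = join(ring, 42)
print(neighbors, ring)
```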

And others...
- P-Grid: Karl Aberer: P-Grid: A Self-Organizing Access Structure for P2P Information Systems. CoopIS 2001: 179-194
- CAN: Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, Scott Shenker: A scalable content-addressable network. 161-172
- Pastry: Antony I. T. Rowstron, Peter Druschel: Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. Middleware 2001: 329-350
- Bamboo: Sean Rhea, Dennis Geels, Timothy Roscoe, John Kubiatowicz: Handling Churn in a DHT. Proceedings of the USENIX Annual Technical Conference, June 2004

Range queries
- A range query [v1, v2] searches for those peers that store data with key value k ∈ [v1, v2]
- DHTs efficiently support only exact-match queries
- The naïve approach to processing range queries in DHTs is to query each value of the range individually
- This is highly expensive: one DHT lookup per value in the range

DHTs and Range Queries
- Order-preserving hash functions usually lead to skewed distributions
- There are two main solutions to cope with load imbalances, i.e. to perform load balancing:
  - transferring load, or
  - replicating data

DHT and Range Queries (2)
- Existing approaches to deal with range queries:
  - Locality-preserving hashing: OP-Chord: Triantafillou et al. (2003); Skip Graphs: Aspnes et al. (2004)
  - Hashing ranges of values instead of each value individually: CAN-based: Andrzejak et al. (2002), Sahin et al. (2004)
- Another problem in that context: access load imbalances
  - One possible solution: transferring "hot data" to deal with those load imbalances
  - However, data transfer does not solve access load imbalances in skewed access (query) distributions

HotRod: replicating hot arcs
Theoni Pitoura et al. EDBT 2006.
- A peer is "hot" (or overloaded) when its load exceeds the upper limit of its resource capacity
- An arc of peers is "hot" when at least one of its peers is hot
- Replicate ranges of values

Efficient Load Balancing

Building a P2P Search Engine (Peer-to-Peer Information Retrieval)
- "Distributed Google"
- The P2P approach is well suited:
  - large number of peers
  - exploit mostly idle resources
  - intellectual input of the user community
  - scalable and self-organizing

Information Retrieval Basics
[Figure: a document is mapped to its terms, each annotated with its number of occurrences (term frequency), e.g. 5x, 7x, 4x.]

Information Retrieval Basics (2)
- Index lists with (DocId: tf*idf) entries, sorted by score; B+ tree on terms
- Top-k query processing: find the k documents with the highest total score
- Query execution: usually using some kind of threshold algorithm*:
  - sequential scans over the index lists (round-robin)
  - (random accesses to fetch missing scores)
  - aggregate scores
  - stop when the threshold is reached
* e.g. Fagin's algorithm TA or a variant without random accesses
[Figure: three index lists with (DocId: score) entries such as d53: 0.8, d55: 0.6, d12: 0.5, d51: 0.6, d28: 0.7, d11: 0.6, ...]

Going distributed: Index Organization
- Every peer has its own collection (full documents)
- Peer index: distributed index = index of peer descriptions (per term, a list of peers such as Peer 1, Peer 2, Peer 3)
- Document index: distributed index lists of (DocId: score) entries, e.g. d17: 0.3, d44: 0.4, ...

(Full) Document Index
- Straightforward extension of the centralized document index
- Each peer is responsible for storing the index list for a subset of terms
- Query routing: DHT lookups
- Query execution: distributed top-k [TPUT '04, KLEE '05]
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56]

Peer Index
- Each peer has its own local index (e.g., created by web crawls)
- Peers publish compact per-term descriptions of their index
- Query routing:
  1. DHT lookups
  2. Retrieve metadata
  3. Find the most promising peers
- Query execution: send the complete query to the selected peers and merge the incoming results
- Distributed directory (term -> list of peers), e.g. a: P1, P6, P4; b: P5, P3, P1, P6, ...

P2P Search with Minerva
- Query peer P0 holds a local index X0 and bookmarks B0
- Peer lists (directory), based on a scalable, churn-resilient DHT with O(log n) key lookup:
  - term g: 13, 11, 45, ...; term a: 17, 11, 92, ...; term f: 43, 65, 92, ...; term c: 13, 92, 45, ...
  - url x: 37, 44, 12, ...; url y: 75, 43, 12, ...; url z: 54, 128, 7, ...
- Query routing aims to optimize benefit/cost, driven by distributed statistics on peers' content quality, content overlap, freshness, authority, trust, etc. (peer ranking & statistics)
- Maintain a semantic/social/statistical overlay network (SON)
- Exploit community behavior (bookmarks, links, tags, clicks, etc.)

Two major Problems
- Query Routing: the task of finding "high quality" peers, aka database/collection/peer selection
- Result Merging: the task of merging the obtained results into a final ranking
Overview articles:
- J. Callan (2000): "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers, pp. 127-150
- Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)

Query Routing
- Given a query Q = {term1, term2, ..., termN}: select the most promising peers
- Based on per-term, per-peer statistics:
  - document frequency
  - vocabulary size
  - plus normalization issues, like collection frequency and average vocabulary size
- Most popular: CORI, GlOSS, Decision Theoretic Framework (DTF)

CORI
- Apply document ranking to resource ranking
[Figure: inference network connecting the query q via terms t1, t2, t3, ..., tk to resources p1, p2, ..., pj-1, pj]
- C = #peers
- df = document frequency
- cf = collection frequency
- cw = # distinct words per peer
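
The scoring formula itself is not preserved on this slide. The sketch below uses the form in which CORI is commonly cited from Callan et al. (SIGIR 1995): T = df / (df + 50 + 150 * cw / avg_cw), I = log((C + 0.5) / cf) / log(C + 1), belief = b + (1 - b) * T * I. The constants 50, 150 and b = 0.4 are the usual defaults and, like all function names and the sample statistics, are assumptions of this example rather than content of the slide.

```python
import math

def cori_belief(df, cf, cw, avg_cw, num_peers, b=0.4):
    """Belief that a peer is relevant for one term (commonly cited CORI form)."""
    t = df / (df + 50 + 150 * cw / avg_cw)  # tf-like part, from the peer's df
    i = math.log((num_peers + 0.5) / cf) / math.log(num_peers + 1.0)  # idf-like part
    return b + (1 - b) * t * i

def cori_score(query, peer, cf, avg_cw, num_peers):
    """Average the per-term beliefs over all query terms."""
    beliefs = [
        cori_belief(peer["df"].get(term, 0), cf[term], peer["cw"], avg_cw, num_peers)
        for term in query
    ]
    return sum(beliefs) / len(beliefs)

# Hypothetical statistics for two peers out of C = 10
cf = {"football": 2}  # number of peers that contain the term
p1 = {"df": {"football": 120}, "cw": 5000}
p2 = {"df": {"football": 3}, "cw": 5000}
for name, peer in (("p1", p1), ("p2", p2)):
    print(name, round(cori_score(["football"], peer, cf, avg_cw=5000, num_peers=10), 3))
```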

Literature
- J. Callan (2000): "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers, pp. 127-150
- Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)
- CORI: James P. Callan, Zhihong Lu, W. Bruce Croft: Searching Distributed Collections with Inference Networks. SIGIR 1995: 21-28
- GlOSS: Luis Gravano, Hector Garcia-Molina, Anthony Tomasic: GlOSS: Text-Source Discovery over the Internet. ACM Trans. Database Syst. 24(2): 229-264 (1999)
- Decision Theoretic Framework: Norbert Fuhr: A Decision-Theoretic Approach to Database Selection in Networked IR. ACM Trans. Inf. Syst. 17(3): 229-249 (1999)

Result Merging
Problem: incomparable scores
- Different corpus statistics
  - the df component used in tf*idf scoring functions is not globally known
  - user with a lot of high-quality documents for term a: high df
  - non-expert user with some bad documents for term a: low df
- Different scoring functions
  - completely different functions
  - different parameters in the same function

Result Merging Approaches
- Score normalization by using global statistics
  - computation of global statistics is difficult (not obvious)
  - solution using gossip
- Score re-computation with the query initiator's local statistics
  - requires re-ranking and knowledge about document contents
- Score re-computation using query routing scores
  - routing scores are available anyway

Global DF Estimation
- gdf (global document frequency) of a term is an interesting key measure, but overlap among peers makes simple distributed counting infeasible
- Hash sketches [Flajolet/Martin 1985]: duplicate-insensitive cardinality estimator for multisets
  - hash each multiset element x onto an m-bit bitvector and remember the least significant 1 bit
  - rough intuition: the least significant bit is set by half of the documents, the second bit by 1/4 of the documents, ...
  - theory says: the most significant bit set is an estimator of log(n), with n = #documents
  - higher accuracy: average multiple iid sketches

Global DF Estimation
- Hash sketches of different peers are collected at the directory peer
- Distributivity is free: ∪_i { ρ(h(x)) | x ∈ S_i } = { ρ(h(x)) | x ∈ ∪_i S_i }, where ρ(h(x)) is the bit position that element x sets
- gdf estimation algorithm:
  - each peer p posts a hash sketch for each (discriminative) term t to the directory
  - the directory peer for term t forms the union of the incoming hash sketches
  - when a peer needs to know gdf(t), it simply asks the directory peer for t
- Sliding-window techniques for dynamic adjustment
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum: Global Document Frequency Estimation in Peer-to-Peer Web Search. WebDB 2006
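
Hash sketches and their union at the directory peer can be sketched as follows. The code is an illustration under assumptions: SHA-1 with a seed prefix stands in for independent hash functions, the sketch length is 32 bits, and the estimator is the classic Flajolet-Martin one (position of the lowest unset bit, correction factor 0.77351), which may differ in detail from the variant used in the cited paper.

```python
import hashlib

BITS = 32  # sketch length in bits (assumed)

def sketch(items, seed=0):
    """Bitmap with bit rho(h(x)) set for every item x."""
    bitmap = 0
    for x in items:
        digest = hashlib.sha1(f"{seed}:{x}".encode()).digest()
        h = int.from_bytes(digest[:4], "big")
        # rho(h): index of the least-significant 1 bit of the hash value
        rho = (h & -h).bit_length() - 1 if h else BITS - 1
        bitmap |= 1 << rho
    return bitmap

def estimate(bitmaps):
    """Flajolet-Martin estimate of the number of distinct items."""
    def lowest_zero(b):
        r = 0
        while b >> r & 1:
            r += 1
        return r
    mean_r = sum(lowest_zero(b) for b in bitmaps) / len(bitmaps)
    return 2 ** mean_r / 0.77351

# Two hypothetical peers whose document sets for a term overlap
peer1, peer2 = range(0, 3000), range(2000, 5000)
seeds = range(16)
merged = [sketch(peer1, s) | sketch(peer2, s) for s in seeds]  # union = bitwise OR
print(round(estimate(merged)))  # rough estimate of the 5000 distinct documents
```

Because the union is a plain bitwise OR, documents held by both peers are not counted twice, and the directory peer never needs to see document ids.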

Autonomous Peers: Overlapping Sources
[Figure: querying peer choosing among peers A, B, C, D, E with overlapping content. Recall vs. #peers: without overlap awareness the selection {A}, {A,B}, {A,B,C}, {A,..,D} needs 4 peers; the overlap-aware routing strategy {A}, {A,E} needs 2 peers.]

How?
- Enrich published statistics with overlap estimators
- Interested in NOVELTY and QUALITY
- Iterative greedy selection process:
  - select the first peer based on quality
  - select each next peer by quality * novelty
- Suitable synopses for overlap estimation:
  - Bloom filter [Bloom 1970]
  - hash sketches [Flajolet & Martin 1985]
  - min-wise independent permutations [Broder 1997]
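
The greedy quality * novelty selection can be sketched as below. For readability the example computes novelty exactly from document-id sets; in a P2P setting it would be estimated from the published synopses instead. The definition of novelty as the fraction of a peer's documents not yet covered, the peer contents, and the quality values are all assumptions of this example.

```python
def select_peers(peers, quality, budget):
    """Greedily pick up to `budget` peers, ranking by quality * novelty."""
    covered, chosen = set(), []
    remaining = dict(peers)

    def gain(name):
        docs = remaining[name]
        # novelty: fraction of this peer's documents not yet covered
        novelty = len(docs - covered) / len(docs) if docs else 0.0
        return quality[name] * novelty

    while remaining and len(chosen) < budget:
        best = max(remaining, key=gain)
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen

# Hypothetical peers: B is good but almost a copy of A, E is weaker but new
peers = {"A": set(range(1, 9)), "B": set(range(1, 8)), "E": set(range(9, 15))}
quality = {"A": 1.0, "B": 0.9, "E": 0.6}
print(select_peers(peers, quality, budget=2))
```

A pure quality ranking would pick A and B here; weighting by novelty picks A and E, as in the overlap-aware routing example.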

Min-Wise Independent Permutations [Broder 97]
- Compute N random permutations of the set of ids and keep the minimum of each permutation: the MIPs vector
- Example, set of ids: 17, 21, 3, 12, 24, 8
  - h1(x) = 7x + 3 mod 51: 20, 48, 24, 36, 18, 8 -> minimum 8
  - h2(x) = 5x + 6 mod 51: 40, 9, 21, 15, 24, 46 -> minimum 9
  - ...
  - hN(x) = 3x + 9 mod 51: 9, 21, 18, 45, 30, 33 -> minimum 9
  - MIPs vector (minima of the permutations): 8, 9, ..., 9
- MIPs are an unbiased estimator of overlap: P[min{h(x) | x ∈ A} = min{h(y) | y ∈ B}] = |A ∩ B| / |A ∪ B|
- Example: MIPs(set1) = 8, 9, 33, 24, 36, 9 and MIPs(set2) = 8, 24, 45, 24, 48, 13 agree in 2 of 6 positions: estimated overlap = 2/6
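
The numbers on this slide can be reproduced directly. The sketch uses the three linear hash functions shown on the slide; the function names `mips` and `resemblance` are illustrative, and a practical deployment would use many more (and better mixing) permutations.

```python
MOD = 51
PERMS = [(7, 3), (5, 6), (3, 9)]  # (a, b) of h(x) = (a*x + b) mod 51, from the slide

def mips(ids):
    """MIPs vector: the minimum of every permutation over the id set."""
    return [min((a * x + b) % MOD for x in ids) for a, b in PERMS]

def resemblance(v1, v2):
    """Fraction of positions in which two MIPs vectors agree."""
    return sum(x == y for x, y in zip(v1, v2)) / len(v1)

print(mips([17, 21, 3, 12, 24, 8]))
print(resemblance([8, 9, 33, 24, 36, 9], [8, 24, 45, 24, 48, 13]))
```

The second call compares the two example vectors from the slide, which agree in two of six positions.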

Bloom Filter [Bloom 1970]
- Bit array of size m
- k hash functions h_i: docId_space -> {1, .., m}
- Insert n docs by hashing the ids and setting the corresponding bits
- A document is reported as contained in the Bloom filter if all of its corresponding bits are set
- Probability of false positives (pfp): tradeoff between accuracy and efficiency
[Figure: 16-bit array; one document sets bits h1 = 9 and h2 = 2, another sets h1 = 14 and h2 = 6; a lookup with h1 = 15, h2 = 9 fails (X), a lookup with h1 = 6, h2 = 2 finds both bits set.]
Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.
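
A Bloom filter along these lines fits in a few lines of code. This is a sketch, assuming SHA-1 with an index prefix as a stand-in for k independent hash functions and 0-based bit positions; the class and method names are illustrative. For n inserted documents the false-positive probability is approximately (1 - e^(-kn/m))^k.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k, self.bits = m, k, 0  # bit array kept in one integer

    def _positions(self, doc_id):
        # k bit positions for this document id
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{doc_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, doc_id):
        for pos in self._positions(doc_id):
            self.bits |= 1 << pos

    def __contains__(self, doc_id):
        # may return a false positive, never a false negative
        return all(self.bits >> pos & 1 for pos in self._positions(doc_id))

bf = BloomFilter(m=256, k=3)
for doc_id in (17, 21, 3, 12, 24, 8):
    bf.add(doc_id)
print(17 in bf, bin(bf.bits).count("1"))
```

Two filters built with the same m and hash functions can be combined with a bitwise OR (union), which is what makes them usable as overlap synopses.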

Multi-Key Statistics
- Solves an interesting problem: a peer with a lot of documents on american football and a lot of documents about pop music may not have a single document about american music
- This cannot be predicted using per-term statistics
- Obvious: Recall that....

Multi-Key Statistics in P2P
- Motivation: estimated_quality(a and b) = quality(a) + quality(b) = df_a + df_b != df_(a and b)
- Impossible (infeasible) to consider all term pairs, triplets, quadruples, ...
- Query driven: analyze query logs at the directory peers
- Plus data-driven verification: P[Anna|Kournikova] =...... P[Andy|Rodick] = P[Berlin|Marathon] =
- No additional messages + shorter lists + highly accurate
- Additional statistics are often not needed
- The whole process can be easily integrated into peer-level P2P IR
Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181

Single-term vs. multi-term P2P document indexing
- Single-term indexing: term1 -> posting list1, term2 -> posting list2, ..., termM -> posting listM; long posting lists, small vocabulary
- Multi-term keys: on PEER 1, key11 -> posting list11, key12 -> posting list12, ..., key1i -> posting list1i; on PEER N, keyN1 -> posting listN1, ..., keyNj -> posting listNj; short posting lists, large vocabulary
- Multi-term indexing:
  - make use of highly discriminative keys
  - limit the influence of overly long index lists
  - consider term pairs (triplets, ...) for shorter lists
  - efficient query processing
Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686

Literature
Overlap awareness:
- Ronak Desai, Qi Yang, Zonghuan Wu, Weiyi Meng, Clement T. Yu: Identifying redundant search engines in a very large scale metasearch engine context. WIDM 2006: 51-58
- Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Improving collection selection with overlap awareness in P2P search engines. SIGIR 2005: 67-74
- Thomas Hernandez, Subbarao Kambhampati: Improving text collection selection with coverage and overlap statistics. WWW (Special interest tracks and posters) 2005: 1128-1129
Sketches:
- Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3): 630-659 (2000)
- Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
- Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.

Literature
Multi-key statistics:
- Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. ICDE 2007: 1096-1105
- Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686
- Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181

For the IR people... Why top-k?
- Users cannot take a look at all matching documents
- E.g., Google provides millions of documents about Britney Spears
- Requires ranking (scoring): in text retrieval, for instance, plus of course PageRank if you wish
- Remember part one: local query execution at each peer (peer-index model) AND truly distributed top-k processing in the full document index

For the DB guys...
- Table with schema (id, attribute, value)
- Query:
    SELECT id, aggr(value) FROM table GROUP BY id ORDER BY aggr(value) DESC LIMIT k

For the networking guys...
- Network monitoring: find clients that cause high network traffic
- Per-monitor traffic tables (IP: bytes in kB):
  - Table 1: 192.168.1.7: 31kB; 192.168.1.3: 23kB; 192.168.1.4: 12kB
  - Table 2: 192.168.1.8: 81kB; 192.168.1.3: 33kB; 192.168.1.1: 12kB
  - Table 3: 192.168.1.4: 53kB; 192.168.1.3: 21kB; 192.168.1.1: 9kB
  - Table 4: 192.168.1.1: 29kB; 192.168.1.4: 28kB; 192.168.1.5: 12kB

Computational Model
- m lists with (itemId, score) pairs, sorted by score in descending order
- One list per attribute (e.g. term)
- Aggregation function aggr()
- Monotonicity is important: for all items a, b: if $s_i(a) \le s_i(b)$ for all lists $i$, then $aggr(s_1(a), \dots, s_m(a)) \le aggr(s_1(b), \dots, s_m(b))$, with $s_i(x)$ denoting the score of item x in list i
- Goal: return the top-k items w.r.t. their aggregated (overall) scores

How to process this?
- Most popular: family of threshold algorithms
  - Fagin, 1999
  - Nepal/Ramakrishna, 1999
  - Güntzer/Balke/Kießling, 2001
- Basic ideas:
  - keep an upper and a lower score bound for each document
  - lowerbound (or worstscore) = sum of the scores seen so far, assuming 0 for unseen dimensions
  - upperbound (or bestscore) = lowerbound + highest possible value for unseen dimensions
  - know what we've got already; know what to expect
  - stop if no further step can improve the current (i.e. final) ranking

Fagin's NRA
NRA(q, L):
  top-k := ∅; candidates := ∅; min-k := 0;
  scan all lists L_i (i = 1..m) in parallel:
    consider item d at position pos_i in L_i;
    E(d) := E(d) ∪ {i}; high_i := s_i(q_i, d);
    worstscore(d) := aggr{s_ν(q_ν, d) | ν ∈ E(d)};
    bestscore(d) := aggr{aggr{s_ν(q_ν, d) | ν ∈ E(d)}, aggr{high_ν | ν ∉ E(d)}};
    if worstscore(d) > min-k then
      remove argmin_d' {worstscore(d') | d' ∈ top-k} from top-k;
      add d to top-k;
      min-k := min{worstscore(d') | d' ∈ top-k};
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d};
    threshold := max{bestscore(d') | d' ∈ candidates};
    if threshold ≤ min-k then exit;

Top-k Search (example with k = 1)
- Data items: d1, ..., dn; query: q = (t1, t2, t3)
- Index lists hold entries such as s(t1, d1) = 0.7, ..., s(tm, d1) = 0.2, sorted by score:
  - t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, ...
  - t2: d64 0.8, d23 0.6, d10 0.6, ..., d10 0.2, d78 0.1
  - t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, ...
- Scan depth 1:
    Rank  Doc  Worstscore  Bestscore
    1     d78  0.9         2.4
    2     d64  0.8         2.4
    3     d10  0.7         2.4
- Scan depth 2:
    Rank  Doc  Worstscore  Bestscore
    1     d78  1.4         2.0
    2     d23  1.4         1.9
    3     d64  0.8         2.1
    4     d10  0.7         2.1
- Scan depth 3:
    Rank  Doc  Worstscore  Bestscore
    1     d10  2.1         2.1
    2     d78  1.4         2.0
    3     d23  1.4         1.8
    4     d64  1.2         2.0
- STOP!
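
The example can be replayed with a compact NRA-style implementation. It is a sketch under assumptions: summation as the aggregation function, bounds recomputed once per scan depth (a round-based simplification of the item-by-item pseudocode), and index lists taken from the slide with the ambiguous tail of t2 shortened. The function name `nra` and its return format are illustrative.

```python
LISTS = [  # (doc, score) entries sorted by descending score
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d78", 0.1)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
]

def nra(lists, k):
    """Return (top-k docs, scan depth at which the scan stopped)."""
    m = len(lists)
    seen = {}  # doc -> {list index: score seen in that list}
    high = [lst[0][1] for lst in lists]  # last score read from each list
    for depth in range(max(len(lst) for lst in lists)):
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # list exhausted: nothing left to gain
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in s)
                for d, s in seen.items()}
        ranked = sorted(worst, key=worst.get, reverse=True)
        if len(ranked) < k:
            continue
        min_k = worst[ranked[k - 1]]
        # best possible score of any candidate or of a completely unseen doc
        threshold = max([best[d] for d in ranked[k:]] + [sum(high)])
        if threshold <= min_k:
            return ranked[:k], depth + 1
    return ranked[:k], depth + 1

print(nra(LISTS, k=1))
```

For k = 1 the scan stops after depth 3 with d10 as the winner, matching the tables on the slide.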

Approximate top-k: Evolution of a Candidate's Score
- Observation: pruning is often overly conservative (deep scans, high memory for the priority queue)
- "What is the probability that d qualifies for the top-k?"
- Drop d from the candidate queue if that probability is too low
[Figure: score vs. scan depth; bestscore(d) decreases and worstscore(d) increases with scan depth while min-k rises.]

Safe Thresholding vs. Probabilistic Guarantees
- NRA is based on an invariant on worstscore(d) and bestscore(d)
- The invariant is relaxed into a probabilistic threshold test
- Or equivalently, a test on δ(d), the gap between min-k and worstscore(d)
[Figure: score axis showing worstscore(d), min-k, bestscore(d), and δ(d)]
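
The formulas of this slide are not preserved in the text. A plausible reconstruction, following the cited work on top-k evaluation with probabilistic guarantees and assuming summation as the aggregation function, is given below. Here $E(d)$ is the set of lists in which $d$ has been seen, $S_i$ is a random variable for the unknown score of $d$ in list $i$, and $\varepsilon$ is the pruning threshold; this notation is an assumption of the sketch.

```latex
\begin{align*}
\text{worstscore}(d) = \sum_{i \in E(d)} s_i(d) \;&\le\; s(d) \;\le\;
  \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} \text{high}_i = \text{bestscore}(d) \\
\text{drop } d \text{ if}\quad
  & P\Big[\, \text{worstscore}(d) + \sum_{i \notin E(d)} S_i > \text{min-}k \Big] \le \varepsilon \\
\text{equivalently}\quad
  & P\Big[\, \sum_{i \notin E(d)} S_i > \delta(d) \Big] \le \varepsilon,
  \qquad \delta(d) = \text{min-}k - \text{worstscore}(d)
\end{align*}
```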

Expected Result Quality
- Missing relevant items: the probability p_miss of missing a true top-k object equals the probability of erroneously dropping a candidate from the queue
- For each candidate: p_miss ≤ ε
- P[recall = r/k] = P[precision = r/k] =
- E[precision] = E[recall] =
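
The right-hand sides are missing in the extracted text. Under the additional assumption that each of the k true top-k items is missed independently with probability $p_{miss}$, the number of recovered items is binomially distributed, which gives the following sketch of the intended result:

```latex
\begin{align*}
P[\text{recall} = r/k] = P[\text{precision} = r/k]
  &= \binom{k}{r} (1 - p_{miss})^{r} \, p_{miss}^{\,k-r} \\
E[\text{precision}] = E[\text{recall}]
  &= \sum_{r=0}^{k} \frac{r}{k} \binom{k}{r} (1 - p_{miss})^{r} \, p_{miss}^{\,k-r}
   = 1 - p_{miss} \;\ge\; 1 - \varepsilon
\end{align*}
```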

Going distributed
- Key observations:
  - Network traffic is crucial
  - Number of round trips is crucial
- Straightforward application of TA/NRA?
  - expensive: huge number of round trips
  - even with batching: unpredictable performance

Where is the data?
- Consider:
  - network consumption
  - per-peer load
  - latency (query response time): network, I/O, processing
[Figure: query initiator P0 and peers P1, P2, P3, each holding one index list: t1 (d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, ...), t2 (d64 0.8, d23 0.6, d10 0.6, ..., d10 0.2, d78 0.1), t3 (d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, ...)]

Three Phase Uniform Threshold Algorithm (TPUT) [Cao and Wang, PODC 2004]
Exactly 3 phases:
1. Fetch the k best entries (d, sj) from each of P1 ... Pm and aggregate (Σ_{j=1..m} sj(d)) at the query initiator
2. Ask each of P1 ... Pm for all entries with sj > min-k / m and aggregate the results at the query initiator; min-k is the score of the item currently at rank k
3. Fetch the missing scores for all candidates by random lookups at P1 ... Pm
First distributed top-k algorithm with a fixed number of phases!
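
The three phases can be simulated on a single machine. This is a sketch, not the authors' implementation: each peer's list is a dictionary, summation is the aggregation function, the candidate pruning that TPUT performs between phases 2 and 3 is left out, and phase 2 uses >= instead of > so that ties at the threshold are not lost. The lists reuse the running top-k example (with the ambiguous tail of t2 shortened).

```python
LISTS = [  # one index list per peer: doc -> local score
    {"d78": 0.9, "d23": 0.8, "d10": 0.8, "d1": 0.7, "d88": 0.2},
    {"d64": 0.8, "d23": 0.6, "d10": 0.6, "d78": 0.1},
    {"d10": 0.7, "d78": 0.5, "d64": 0.4, "d99": 0.2, "d34": 0.1},
]

def tput(lists, k):
    """Return the top-k (doc, total score) pairs using three fixed phases."""
    m = len(lists)

    # Phase 1: fetch the k best entries from every peer, sum the partial scores
    partial = {}
    for lst in lists:
        best = sorted(lst.items(), key=lambda e: e[1], reverse=True)[:k]
        for doc, score in best:
            partial[doc] = partial.get(doc, 0.0) + score
    sums = sorted(partial.values(), reverse=True)
    min_k = sums[k - 1] if len(sums) >= k else 0.0

    # Phase 2: fetch every entry whose local score reaches min_k / m
    threshold = min_k / m
    candidates = set(partial)
    for lst in lists:
        candidates.update(doc for doc, score in lst.items() if score >= threshold)

    # Phase 3: random lookups for the missing scores of all candidates
    totals = {doc: sum(lst.get(doc, 0.0) for lst in lists) for doc in candidates}
    return sorted(totals.items(), key=lambda e: e[1], reverse=True)[:k]

print([doc for doc, _ in tput(LISTS, k=2)])
```

The number of message rounds is fixed at three regardless of the data; what varies is how many entries phase 2 has to ship, which depends on how small min-k / m turns out to be.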

TPUT
[Figure: coordinator peer P0 keeps the current top-k and the candidate set; cohort peers Pi, Pj, ... each hold one index list sorted by score. Each cohort first sends its top k entries, then all candidates with scores above min-k / m; finally the coordinator retrieves the missing scores.]

Analysis of TPUT
- Theorem: TPUT is an exact algorithm, i.e. it identifies the true top-k items
- Proof (sketch): TPUT cannot miss a true top-k item. Assume it misses one, i.e. the item is below min-k / m in all m lists. Then its overall score is < m * (min-k / m) = min-k, so it is not a true top-k item!
[Figure: state after phase 2 for list 1, list 2, list 3; entries below the min-k / m line add up to a score < min-k.]

Analysis of TPUT
- If min-k / m is small, TPUT retrieves a lot of data in phase 2: high network traffic
- Random accesses: high per-peer load
KLEE [VLDB '05]
- Different philosophy: approximate answers
- Efficiency:
  - reduces (docId, score)-pair transfers
  - no random accesses at each peer
- Two pillars:
  - the HistogramBlooms structure
  - the Candidate List Filter structure

Additional Data Structures
- Equi-width histogram
  - plus a Bloom filter for each cell
  - plus the average score per cell
  - plus the upper/lower score
- Usage during phase 1:
  - fetch the top-k from each list
  - plus the top-c cells
  - "increase" the min-k / m threshold

KLEE
[Figure: coordinator peer P0 with the current top-k and candidate set; cohort peers Pi and Pj each send their top k entries together with a histogram of c cells, each cell summarized by a Bloom filter of b bits; the min-k / m threshold is marked on each index list.]

KLEE: Candidate Set Reduction
[Figure: coordinator peer P0 builds a candidate filter matrix (bit vectors) from the current top-k and candidate set and sends it to cohort peers Pi and Pj, which use it to filter the candidates above min-k / m in their index lists.]

KLEE: Candidate Retrieval
[Figure: cohort peers Pi and Pj scan their index lists below the top k down to an early stopping point and return only the candidates that match the candidate filter matrix to coordinator peer P0.]

Literature
- Ronald Fagin: Combining Fuzzy Information from Multiple Systems. J. Comput. Syst. Sci. 58(1): 83-99 (1999)
- Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
- Surya Nepal, M. V. Ramakrishna: Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29
- Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Towards Efficient Multi-Feature Queries in Heterogeneous Environments. ITCC 2001: 622-628
- Martin Theobald, Gerhard Weikum, Ralf Schenkel: Top-k Query Evaluation with Probabilistic Guarantees. VLDB 2004: 648-659
- Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, Gerhard Weikum: IO-Top-k: Index-access Optimized Top-k Query Processing. VLDB 2006: 475-486
- Amélie Marian, Nicolas Bruno, Luis Gravano: Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst. 29(2): 319-362 (2004)
- Pei Cao, Zhe Wang: Efficient top-K query calculation in distributed networks. PODC 2004: 206-215
- Sebastian Michel, Peter Triantafillou, Gerhard Weikum: KLEE: A Framework for Distributed Top-k Query Algorithms. VLDB 2005: 637-648

Part II: Social Search

Motivation
- People connected through a network
  - People create links to other people
  - Links can express friendship, recommendations, etc.
  - Different graph structures appear
- Sharing interests
  - Enables users to find others who share common interests
  - Similar users can provide relevant content
- Users and content are spread over different sites
  - Distributed nature and continuously increasing size call for peer-to-peer approaches

Outline of the Second Part
- Link Analysis: The Web as a Graph
  - PageRank
  - Distributed approaches: BlockRank, Local PageRank + ServerRank, Adaptive OPIC, JXP
- Identifying common interests: Semantic Overlay Networks
  - Crespo and Garcia Molina
  - pSearch
  - p2pDating
- Social Networks: A new paradigm
  - What people share
  - Social graphs
  - Links, tags, users analysis
77
Links are everywhere…
…connecting Web pages
[Figure: links among the pages www.openp2p.com/..., www.searchtools.com, www.searchengines.com, www.searchengineguide.com, www.searchengineshowdown.com and searchenginewatch.com]
78
Links are everywhere…
…connecting people
[Figure: example of a friends network on Flickr]
79
Links are everywhere…
…connecting products
80
Link Analysis
The set of nodes (e.g., web pages, people, products, etc.) and the links connecting them define a graph
[Figure: graph over the pages www.openp2p.com/..., www.searchtools.com, www.searchengines.com, www.searchengineguide.com, www.searchengineshowdown.com and searchenginewatch.com]
81
Link Analysis
At the end we have something like this…
Lots of useful information can be obtained from the analysis of such graphs
82
Adjacency Matrix
Matrix representation of graphs
Given a graph G with n nodes, its adjacency matrix A is n×n, with
  a_ij = 1, if there is a link from node i to node j
  a_ij = 0, otherwise
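The definition above can be sketched in a few lines of Python. The function name `adjacency_matrix` and the three-page example graph are illustrative assumptions, not part of the tutorial.

```python
def adjacency_matrix(nodes, links):
    """Return the n x n adjacency matrix as a list of rows.

    a[i][j] is 1 if there is a link from nodes[i] to nodes[j], else 0.
    """
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    a = [[0] * n for _ in range(n)]
    for src, dst in links:
        a[index[src]][index[dst]] = 1
    return a


# Hypothetical three-page graph: A -> B, A -> C, B -> C, C -> A.
nodes = ["A", "B", "C"]
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
matrix = adjacency_matrix(nodes, links)

# Row sums give the outdegree of each node.
outdegrees = [sum(row) for row in matrix]
```

A dense matrix is only practical for small graphs; for Web-scale graphs an adjacency list is the usual choice.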
83
PageRank – Exploring the Wisdom of Crowds
Measures the relative importance of pages in the graph
Importance of a page depends on the importance of the pages that point to it
Random Surfer Model: once on a page, the surfer chooses to follow one of the outlinks with prob. α, or to jump to a random page with prob. (1-α)
PR: probability of being at a certain page after a sufficiently large number of jumps
S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf. 1998.
84
PageRank – Formal Definition
N → total number of pages
PR(p) → PageRank of page p
out(p) → outdegree of p
ε → random jump probability
Can be computed using the power iteration method
In practice, more efficient versions can be used
Google is believed to use it on the Web graph, combined with other metrics, to rank its search results
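As a minimal sketch of the power iteration method, the code below assumes the commonly used form PR(q) = ε/N + (1 - ε) · Σ PR(p)/out(p), summed over all pages p that link to q, with ε playing the role of (1 - α) in the random surfer model. The handling of pages without outlinks (their score is spread uniformly), the default ε = 0.15 and the example graph are assumptions made for illustration.

```python
def pagerank(out_links, eps=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for PR(q) = eps/N + (1 - eps) * sum_{p -> q} PR(p)/out(p).

    out_links maps every page to the list of pages it links to.
    Pages without outlinks spread their score uniformly (an assumption).
    """
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # uniform starting vector
    for _ in range(max_iter):
        dangling = sum(pr[p] for p in pages if not out_links[p])
        new = {p: eps / n + (1 - eps) * dangling / n for p in pages}
        for p in pages:
            for q in out_links[p]:
                new[q] += (1 - eps) * pr[p] / len(out_links[p])
        delta = sum(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if delta < tol:                        # stop once the vector is stable
            break
    return pr


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
```

In this small graph C collects score from both A and B, so it ends up ranked highest, followed by A and then B.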
85
PageRank – Matrix Notation
A → matrix containing the transition probabilities, built from P and E, where P_ij = 1/out(i) if there is a link from i to j, and 0 otherwise; E is the random-jump matrix
The probability distribution vector at time k is computed iteratively from the starting vector
PageRank → stationary distribution of the Markov chain described by A, i.e., the principal eigenvector of A
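One standard way to write the matrix form, consistent with the definitions of P, E and ε used here, is sketched below. The uniform jump matrix E, the symbol x_k for the distribution at time k and the row-vector convention are assumptions; with this convention the stationary distribution is the principal left eigenvector of A.

```latex
\begin{aligned}
P_{ij} &= \begin{cases} 1/\mathrm{out}(i) & \text{if there is a link from } i \text{ to } j \\ 0 & \text{otherwise} \end{cases} \\
E_{ij} &= \frac{1}{N} \\
A &= (1-\varepsilon)\,P + \varepsilon\,E \\
\mathbf{x}_k^{T} &= \mathbf{x}_{k-1}^{T} A \;=\; \mathbf{x}_0^{T} A^{k} \\
\boldsymbol{\pi}^{T} &= \boldsymbol{\pi}^{T} A, \qquad \textstyle\sum_i \pi_i = 1
\end{aligned}
```

Reading off component q of the last line gives PR(q) = ε/N + (1 - ε) · Σ PR(p)/out(p) over the pages p linking to q, which is the per-page form of the definition.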
86
Going Distributed
PageRank in principle needs the whole graph in one place
Shortcomings:
  Not scalable for huge graphs, like the Web
  Slow updates – computing PageRank on such a huge graph can take weeks
  Not suitable for different network architectures (e.g., P2P)
Distributed approaches, where the graph is partitioned, are clearly needed
Some distributed approaches (more details on the next slides):
  Local PageRank + ServerRank (Wang et al.)
  BlockRank (Kamvar et al.)
  JXP (Parreira et al.)
87
The “Block Structure”
Most links are between web pages on the same host
[Figure: adjacency matrix with blocks of 1s for pages from Host A and pages from Host B]
Block structure can be exploited for speeding up and/or distributing the PR computation
88
BlockRank
PageRank in three steps:
  1. Computes “local PageRanks” of pages for each host, by considering only intra-host links
  2. Computes the importance of each host, using the local PR values and the inter-host links
  3. Combines the previous values to create the starting vector for the standard PR algorithm
Speeds up computation
Step 1 can be parallelized
Still needs the whole matrix for step 3
S. Kamvar, T. Haveliwala, C. Manning & G. Golub. Exploiting the block structure of the web for computing pagerank. Technical report, Stanford University, 2003.
89
Going Distributed… Local PR + ServerRank
Similar to BlockRank
  Local PR: PR computed inside each server using intra-server links
  ServerRank: PR computed on the server graph using inter-server links
Server graph does not need to be materialized; computation is done by exchanging messages among servers
Local PR and ServerRank are combined to approximate the true PR of a page
Values can be further refined by using Local PR info in the ServerRank computation and vice versa
Server partition can be a limitation…
Y. Wang & D. J. DeWitt. Computing pagerank in a distributed internet search system. In VLDB, 2004.
90
Partition at “peer level”
In P2P networks, server partition is not suitable
91
Partition at “peer level”
Every peer crawls Web fragments at its discretion
Peers have only local (incomplete) information
Pages might link to, or be linked from, pages at other peers
Overlaps between peers’ graphs may occur
Peers are a priori unaware of other peers’ contents
92
Adaptive OPIC
OPIC: Online Page Importance Computation
Computes the importance of a page on-line, with few resources
Algorithm:
  Pages initially receive some cash
  Pages are randomly visited
  When a page is visited, its cash is distributed among the pages it points to
  The importance of a given page is computed from the history of cash of that page
Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In WWW, 2003.
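The cash game can be sketched as a small simulation. To keep the run deterministic, this sketch visits pages in a fixed round-robin order instead of randomly, which is an assumption; the function name `opic`, the requirement that every page has at least one outlink, and the three-page link structure are also illustrative and not taken from the tutorial.

```python
def opic(out_links, start_page, rounds):
    """Simulate the OPIC cash game with a round-robin visiting order.

    Every page must have at least one outlink. All cash starts at
    start_page; the total amount of cash in the system stays 1.0.
    """
    pages = list(out_links)
    cash = {p: 0.0 for p in pages}
    cash[start_page] = 1.0
    history = {p: 0.0 for p in pages}
    for _ in range(rounds):
        for page in pages:               # fair visiting order (assumption)
            amount = cash[page]
            history[page] += amount      # record the cash before giving it away
            cash[page] = 0.0
            share = amount / len(out_links[page])
            for target in out_links[page]:
                cash[target] += share
    total = sum(history.values())
    # Importance of a page = its share of all cash ever recorded.
    return {p: history[p] / total for p in pages}


# Hypothetical link structure over three pages.
web = {"Alice": ["Bob", "George"], "Bob": ["Alice"], "George": ["Alice", "Bob"]}
importance = opic(web, "Alice", rounds=2000)
```

For this particular graph the shares settle near 4/9, 3/9 and 2/9 for Alice, Bob and George; no link matrix is ever built, only per-page cash and history counters.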
93
Adaptive OPIC
Example: small Web of 3 pages (Alice, Bob, George)
Alice has all the cash to start (importance is independent of the initial state)
Cash-game history:
  Alice received   600 (200+400)       40%
  Bob received     600 (200+100+300)   40%
  George received  300 (200+100)       20%
94
Adaptive OPIC
No particular graph partition
No need to store the link matrix
Adapts to changes in the web graph by considering only the recent part of the cash history of each page
  Time window: [now-T, now]
High number of messages exchanged
Does not handle the case where the same page is stored at more than one place
95
The JXP Algorithm
Decentralized algorithm for computing global authority scores of pages in a P2P network
  Runs locally at every peer
  No coordinator, asynchronous
Combines local PageRank computations + meetings between peers
JXP scores converge to the true global PageRank scores
Josiane Xavier Parreira, Carlos Castillo, Debora Donato, Sebastian Michel and Gerhard Weikum: The JXP Method for Robust PageRank Approximation in a Peer-to-Peer Web Search Network. The VLDB Journal, 2007.
96
The JXP Algorithm
“World Node”: special node attached to the local graph at every peer
  Compact representation of all other pages in the network
“Special features”:
  All links from local pages to external pages point to the World Node
  Links from external pages that point to local pages (discovered during meetings) are represented at the World Node
    Score and outdegree of these external pages are stored; World Node outgoing links are weighted to reflect the score mass given by the original link
  Self-loop link to represent transitions among external pages
97
The JXP Algorithm
Initialization step:
  Local graph is extended by adding the world node
  PageRank is computed on the extended graph → JXP scores
Main algorithm (for every peer P_i in the network):
  Select a peer P_j to meet
  Update the world node
    Add edges for pages in P_j that point to pages in P_i
    If an edge already exists at the world node, the score of the source page is updated by taking the higher of the two scores
  Compute PageRank → JXP scores
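The world-node update performed during a meeting can be sketched as follows. The data layout (dictionaries keyed by page id), the function name `update_world_node` and the decision to skip pages that both peers store are hypothetical choices for illustration; the sketch covers only the update step, not the PageRank recomputation that follows it.

```python
def update_world_node(world_node, local_pages, other_peer_pages):
    """Merge what peer P_j reports into the world node of peer P_i.

    world_node: dict external_page -> {"score", "outdegree", "targets"}
    local_pages: set of page ids stored at P_i
    other_peer_pages: dict page -> {"score", "out_links"} reported by P_j
    """
    for page, info in other_peer_pages.items():
        if page in local_pages:
            continue  # overlapping page, already part of the local graph
        targets = {q for q in info["out_links"] if q in local_pages}
        if not targets:
            continue  # no link into P_i, nothing to record
        entry = world_node.get(page)
        if entry is None:
            world_node[page] = {
                "score": info["score"],
                "outdegree": len(info["out_links"]),
                "targets": targets,
            }
        else:
            # Edge already known: keep the highest score seen so far.
            entry["score"] = max(entry["score"], info["score"])
            entry["targets"] |= targets
    return world_node


# Hypothetical meeting: P_i stores pages a and b and already knows page x.
world = {"x": {"score": 0.10, "outdegree": 2, "targets": {"a"}}}
local = {"a", "b"}
reported = {
    "x": {"score": 0.25, "out_links": ["a", "z"]},
    "y": {"score": 0.05, "out_links": ["b"]},
    "z": {"score": 0.30, "out_links": ["x"]},
}
update_world_node(world, local, reported)
```

After this meeting the score stored for x rises to 0.25, y is added with a link to b, and z is ignored because it does not point to any local page.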
98
The JXP Algorithm
Theorem: “In a fair series of JXP meetings, the JXP scores of all nodes converge to the true global PR scores”
99
Locating parts of the Graph
“Finding peers that share common interests”
Many applications can benefit from it
Distributed PR
  In principle, peers need to send content only to the peers that contain their successors
  Random messages guarantee that those peers will eventually be reached, but part of the messages will be “wasted”
100
WASTED MEETING! We want to avoid it!
101
Locating parts of the Graph
Query answering
  Ideal: forward the query only to peers that are more likely to provide good answers to it
  Query flooding is very expensive
  Hash-based queries are not suitable for approximate queries
102
Locating parts of the Graph
Locating “relevant” peers
  Increases performance
  Reduces traffic load
Idea: group peers according to the semantics of their content and place them into different overlay networks
104
Semantic Overlay Networks
Partition the P2P network into several thematic networks
Peers with similar or beneficial/complementary content are “clustered” together
Queries for a given kind of content are forwarded only to peers holding such content
Flooding in smaller networks with a smaller TTL (or more results with the same TTL)
105
Overlay Networks: Random vs. Semantic
Random
  Peers connect to a small set of random peers
  Queries are flooded through the network
  Peers with unrelated content receive the query
  Low performance:
    High number of messages
    Low recall if only few peers are contacted
Semantic
  Peers connect to peers with related content → cluster of peers
  Peers identify the query’s topic and forward it only to the set of peers on that topic
  Messages to peers with unrelated content are avoided
  Better performance:
    Smaller number of messages
    High recall by asking only few peers
106
When creating SONs…
Two main things to consider:
  Node partitioning
  Clustering criteria
Node partitioning – when does a peer belong to SON A?
  When it contains a doc of type A
  When it contains more than x docs of type A
  Fewer peers per SON → more results sooner
  Fewer SONs per peer → fewer connections
Clustering criteria – clustering must provide:
  Load balance
    Each category has a similar number of nodes
    Each node belongs to a small number of categories
  Easy and accurate way to classify a document
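The two node-partitioning rules can be expressed with a single threshold x, where x = 0 corresponds to “contains a doc of type A”. The function name `assign_sons`, the input layout and the example peers are assumptions made for this sketch.

```python
from collections import Counter


def assign_sons(peer_docs, x=0):
    """Map each SON (category) to the set of peers that join it.

    peer_docs: dict peer -> list of category labels, one per document.
    A peer joins SON c when it holds more than x documents of type c.
    """
    sons = {}
    for peer, categories in peer_docs.items():
        for category, count in Counter(categories).items():
            if count > x:
                sons.setdefault(category, set()).add(peer)
    return sons


# Hypothetical music-sharing peers, one label per stored document.
peers = {
    "p1": ["rock", "rock", "jazz"],
    "p2": ["jazz", "jazz", "jazz"],
    "p3": ["rock", "jazz"],
}
loose = assign_sons(peers, x=0)   # one matching doc is enough
strict = assign_sons(peers, x=1)  # more than one matching doc required
```

With x = 0 every peer joins the jazz SON, while x = 1 leaves one peer per SON; this is the trade-off between fewer peers per SON and not missing peers that hold only a little matching content.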
107
Crespo and Garcia-Molina
Uses a classification hierarchy to form the overlay networks
Documents and queries are classified into one or more concepts
Queries are forwarded to peers in the super/sub concepts
A. Crespo and H. Garcia-Molina. Semantic Overlay Networks for P2P Systems. Technical report, Stanford University, January 2003.
108
Crespo and Garcia-Molina
Reported results show a significant reduction in the number of messages
  Music file sharing scenario, to get half the documents that match a query:
    SONs: 461 msgs
    Gnutella: 1731 msgs
SON links are “logical”:
  Two peers that are connected in a SON can actually be many hops away from each other
The requirement that the hierarchy and the classification algorithm are shared among all nodes might be a problem
109
pSearch
Semantic overlay on top of Content-Addressable Networks (CANs)
Latent Semantic Indexing (LSI) is used to generate a semantic vector for each document
Semantic vectors are used as keys to store document indices in the CAN
  Indices close in semantics are stored close in the overlay
Two types of operations:
  Publish document indices
  Process queries
Chunqiang Tang, Zhichen Xu, and Sandhya Dwarkadas. Peer-to-peer Information Retrieval Using Self-Organizing Semantic Overlay Networks. In SIGCOMM, 2003.
110
pSearch Key Idea
[Figure: a document and a query placed as points in the semantic space]
111
pSearch Key Idea
[Figure: the semantic space divided among nodes A to I, with the document and the query placed in it]
112
Background: Content-Addressable Network
[Figure: Cartesian space divided into zones A to E]
Partition the Cartesian space into zones
Each zone is assigned to a computer
Neighboring zones are routing neighbors
An object key is a point in the space
Object lookup is done through routing
113
Background: Vector Space Model
Term vectors represent documents and queries
  Elements correspond to the importance of a term in the document or query
Statistical computation of vector elements
  Term frequency * inverse document frequency
Ranking of retrieved documents
  Similarity between document vector and query vector
114
Background: Vector Space Model
A: “books on computer networks”
B: “network routing in P2P networks”
Q: “P2P network”
Vocabulary: computer, network, P2P, routing
[Figure: term vectors Va, Vb and Vq over this vocabulary, with the values 0, 0.25, 0.375 and 0.5]
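The example can be worked through in code under a few stated assumptions: the texts are tokenized by hand (words outside the vocabulary are dropped and “networks” is mapped to “network”), vector elements are plain relative term frequencies with the inverse document frequency factor left out, and similarity is the inner product. Under these assumptions the two scores come out as 0.25 and 0.375, the same numbers that appear on the slide, but the weighting actually used there is not stated.

```python
def term_vector(tokens, vocabulary):
    """Relative term frequencies of the tokens over a fixed vocabulary."""
    return [tokens.count(term) / len(tokens) for term in vocabulary]


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


vocabulary = ["computer", "network", "P2P", "routing"]

# Hand-tokenized texts (assumption): only vocabulary terms are kept.
doc_a = ["computer", "network"]                   # "books on computer networks"
doc_b = ["network", "routing", "P2P", "network"]  # "network routing in P2P networks"
query = ["P2P", "network"]                        # "P2P network"

v_a = term_vector(doc_a, vocabulary)
v_b = term_vector(doc_b, vocabulary)
v_q = term_vector(query, vocabulary)

sim_a = dot(v_a, v_q)
sim_b = dot(v_b, v_q)
```

Document B ranks above document A for this query because it shares both query terms, while A shares only “network”.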
115
Background: Latent Semantic Indexing
The dimension of the document vectors has to match the dimension of the CAN
Latent Semantic Indexing uses Singular Value Decomposition (SVD)
  Maps a high-dimensional term vector to a low-dimensional semantic vector
  Elements correspond to the importance of an abstract concept in the document/query
Also helps to overcome the synonym problem (e.g., a user looks for “car” and does not find a document about “automobile”)
116
Background: Latent Semantic Indexing
[Figure: the terms × documents matrix with term vectors Va, Vb, … is mapped by SVD to semantic vectors V’a, V’b, …]
SVD: singular value decomposition
  Reduce dimensionality
  Suppress noise
  Discover word semantics (e.g., car and automobile)
117
pSearch Basic Algorithm: Steps
1. Receive a new document A: generate a semantic vector V_a, store the key in the index
2. Receive a new query Q: generate a semantic vector V_q, route the query in the overlay
3. The query is flooded to nodes within a radius r, where r is determined by a similarity threshold or the number of wanted documents
4. All receiving nodes do a local search and report references to the best matching documents
118
pSearch Illustration
[Figure: a document and a query in the overlay, with the search region for the query; the labels 1 to 4 mark the steps of the basic algorithm]
119
p2pDating
Start with a randomly connected network
Peers meet other peers they do not know (“blind dates”)
If a peer “likes” another, it will remember it as a “friend”
  A remembers B: abstract link A → B
  Directed links preserve peers’ autonomy
SONs dynamically evolve from the meeting process
J. X. Parreira et al. p2pDating: Real Life Inspired Semantic Overlay Networks for Web Search. Information Processing & Management [43], 643-664
120
p2pDating
Finding new friends
  Random meetings (blind dates)
  Meet friends of friends
If A and B are friends, it is very likely that B’s friends are friends of A as well
[Figure: peers A and B, with B’s friends]
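The two ways of finding new friends can be combined in a simple partner-selection routine. The function name `pick_meeting_partner`, the mixing parameter `p_random` and the example network are hypothetical; the tutorial does not say how the two strategies are balanced.

```python
import random


def pick_meeting_partner(peer, friends, all_peers, rng, p_random=0.5):
    """Choose whom `peer` meets next.

    friends: dict peer -> set of peers it remembers as friends.
    With probability p_random go on a blind date, otherwise try a
    friend of a friend; fall back to a blind date if there is none.
    """
    known = friends.get(peer, set()) | {peer}
    fof = set()
    for friend in friends.get(peer, set()):
        fof |= friends.get(friend, set())
    fof -= known                     # only peers not yet known to `peer`
    if fof and rng.random() >= p_random:
        return rng.choice(sorted(fof))
    strangers = [p for p in all_peers if p not in known]
    return rng.choice(strangers) if strangers else None


# Hypothetical network: A remembers B, B remembers A and C.
friends = {"A": {"B"}, "B": {"A", "C"}, "C": set()}
peers = ["A", "B", "C", "D"]
rng = random.Random(0)

fof_pick = pick_meeting_partner("A", friends, peers, rng, p_random=0.0)
blind_pick = pick_meeting_partner("A", friends, peers, rng, p_random=1.0)
```

With `p_random=0.0` peer A always meets C, the only friend of a friend it does not know yet; with `p_random=1.0` it meets any peer it does not already know.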
121
Defining Good Friends
Criteria for defining a good friend: a combination of different measures
  History: credits for good behavior in the past
    Response time, query result precision, etc.
  Collection similarity
  Collection overlap
    Different ways of estimating the overlap between two collections
  Number of links between peers
  Etc.
Peers might have more than one list of friends
  E.g., according to different criteria
122
Going Social…
Before:
  Only few content producers (e.g., companies, universities)
  Analysis was done using the content itself plus a few implicit recommendations (links)
  Very little information about the content consumers (mainly through query logs)
Nowadays:
  New technologies to facilitate content sharing
  Content consumers are now also content producers and content describers (e.g., explicit recommendations, tags, etc.)
  More and more crowd wisdom that can be harvested
125
Social Networks
A social structure made of nodes (which are generally individuals or organizations) that are tied by one or more specific types of relations, such as
  values
  visions
  ideas
  friends
  conflict
  web links
  etc.
Social networks have been studied for over a century
126
Social Network Services
Enable the creation of online social networks for communities of people who share interests and activities, or who are interested in exploring the interests and activities of others
Online communities offer an easy way for users to publish and share their content.
127
Social Networking Growth
Several social networking sites have experienced dramatic growth during the past year.
Worldwide Growth of Selected Social Networking Sites, June 2007 vs. June 2006, Users Age 15+ (Source: comScore)
Total unique visitors in millions:

Social Networking Site   Jun-06   Jun-07   % Change
MySpace                   66.41   114.15         72
Facebook                  14.08    52.17        270
Hi5                       18.10    28.17         56
Friendster                14.92    24.68         65
Orkut                     13.59    24.12         78
Bebo                       6.69    18.20        172
Tagged                     1.51    13.17        774
128
What people share…
129
Social Networks
Besides sharing content, a user can…
  …describe documents using tags
  …maintain a list of friends
  …comment on other users’ content, exchange opinions, discover users with a similar profile
In contrast to the Web graph, in social graphs users are part of the model
130
Social Content Graph
Sihem Amer-Yahia, Michael Benedikt, Philip Bohannon: Challenges in Searching Online Communities. IEEE Data Eng. Bull. 30(2): 23-31 (2007)
131
Social Graphs
[Figure: graph connecting users, tags and docs]
Other models are also possible
  Directed vs. undirected edges
  Etc.
Standard IR techniques for Web retrieval need to be adapted to work on social networks
  A lot of current research is dedicated to this area
132
Social Networks
The Wisdom of Crowds: Beyond PR
  Spectral analysis of various graphs
  E.g., SocialPageRank, FolkRank
Tag semantic analysis
  Discovering semantics from tag co-occurrence
  E.g., SocialSimRank
Distributed view
  Exploiting social relations to enhance search
  E.g., PeerSpective
133
Link Analysis in Social Networks
SocialPageRank: high-quality web pages are usually popularly annotated, and popular web pages, up-to-date web users and hot social annotations can be mutually enhanced.
Let M_UT, M_TD, M_DU be the matrices corresponding to the relations Users–Tags, Tags–Docs, Docs–Users
Compute iteratively: [formula image]
[Figure: cycle among Documents, Users and Tags, with edges labeled a, b, c]
S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu: Optimizing Web Search Using Social Annotation. WWW 2007
134
Link Analysis in Social Networks
FolkRank
Define graph G as the union of the graphs Users–Tags, Tags–Docs, Docs–Users
Assume each user has a personal preference vector
Compute iteratively: [formula image]
FolkRank vector of docs is: [formula image]
Andreas Hotho, Robert Jäschke, Christoph Schmitz, Gerd Stumme: Information Retrieval in Folksonomies: Search and Ranking. ESWC 2006: 411-426
135
Tag Similarity
SocialSimRank
Idea: similar annotations (tags) are usually assigned to similar web pages by users with common interests.
  sim(t1, t2) ~ aggr {sim(d1, d2) | (t1, d1), (t2, d2) ∈ Tagging}
  sim(d1, d2) ~ aggr {sim(t1, t2) | (t1, d1), (t2, d2) ∈ Tagging}
S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu: Optimizing Web Search Using Social Annotation. WWW 2007
136
Exploring friendship connections
PeerSpective: users can query the pages their friends have viewed
HTTP proxies on users’ computers index all browsed content
When a Google search is performed, the query is also sent to the other proxies in parallel
Alan Mislove, Krishna P. Gummadi, and Peter Druschel. Exploiting Social Networks for Internet Search. HotNets, 2006.
137
Social Networks
New paradigm of publishing and searching content
Rich data
  Different link structures
  User input for free!
Relatively recent topic: lots of research opportunities
The list of works mentioned is by no means complete; there is still a lot to do
Since we are talking about Web 2.0… http://p2pinformationsearch.blogspot.com/