DELIS Highlights: Efficient and Intelligent Top-k Search in Peer-to-Peer Systems presented by Gerhard Weikum (Max-Planck Institute of Computer Science)

1 DELIS Highlights: Efficient and Intelligent Top-k Search in Peer-to-Peer Systems presented by Gerhard Weikum (Max-Planck Institute of Computer Science)

2 Gerhard Weikum (MPII), Data Management on Dynamic P2P, Subproject 6
Introduction: Why Peer-to-Peer Web Search?
Vision: Self-organizing P2P Web Search Engine with Google-or-better functionality
Proof of Concept for Scalable & Self-Organizing Data Structures and Algorithms (e.g., DHTs, Randomized Overlay Networks, Epidemic Spreading)
Powerful Search Methods for Each Peer (Concept-based Search, Query Expansion, Personalization, etc.)
Leverage Intellectual Input at Each Peer (Bookmarks, Feedback, Query Logs, Click Streams, Evolving Web, etc.)
Collaboration among Peers (Query Routing, Incentives, Fairness, Anonymity, etc.)
Better Search Result Quality (Precision, Recall, etc.)
Breaking Information Monopolies
Testbed for CS Models, Algorithms, Technologies and Experimental Platform

3 Introduction: What Google Can't Do
Killer queries (disregarding NLP QA, multilingual, multimedia): drama with three women making a prophecy to a British nobleman that he will become king

4 Outline: Introduction; Vision; Demo; Efficient Top-k Search; Ontology-based Query Expansion; Exploiting User Behavior; Isolating Selfish Peers

6 Efficient Top-k Search
TA: efficient & principled top-k query processing with monotonic score aggregation.
Example scale (Google): > 10 million terms, > 8 billion docs, > 4 TB index.

Data items: d1, …, dn, each with per-term scores, e.g. s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3), k = 1

Index lists (sorted by descending score):
  t1: d78 0.9, d23 0.8, d10 0.8, …
  t2: d64 0.8, d23 0.6, d10 0.6, …
  t3: d10 0.7, d78 0.5, d64 0.4, …
  Lower-ranked entries shown on the slide: d1 0.7, d88 0.2, d10 0.2, d78 0.1, d99 0.2, d34 0.1

TA with sorted access only (NRA) (Fagin 01, Güntzer/Kießling/Balke 01):
  scan index lists; consider d at pos_i in L_i;
  E(d) := E(d) ∪ {i}; high_i := s(t_i, d);
  worstscore(d) := aggr{s(t_j, d) | j ∈ E(d)};
  bestscore(d) := aggr{worstscore(d), aggr{high_j | j ∉ E(d)}};
  if worstscore(d) > min-k then
    add d to top-k;
    min-k := min{worstscore(d') | d' ∈ top-k};
  else if bestscore(d) > min-k then
    cand := cand ∪ {d};
  threshold := max{bestscore(d') | d' ∈ cand};
  if threshold ≤ min-k then exit;

Scan depth 1:
  Rank  Doc  Worstscore  Bestscore
  1     d78  0.9         2.4
  2     d64  0.8         2.4
  3     d10  0.7         2.4

Scan depth 2:
  Rank  Doc  Worstscore  Bestscore
  1     d78  1.4         2.0
  2     d23  1.4         1.9
  3     d64  0.8         2.1
  4     d10  0.7         2.1

Scan depth 3:
  Rank  Doc  Worstscore  Bestscore
  1     d10  2.1         2.1
  2     d78  1.4         2.0
  3     d23  1.4         1.8
  4     d64  1.2         2.0
STOP!
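The NRA scheme on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation behind the talk: scores are aggregated by sum, the lists are scanned round-robin one position at a time, the stop test is run once per scan depth, and documents not yet seen anywhere are covered by a virtual candidate whose bestscore is the sum of the current high_i values. The function name `nra_top_k` and the constant `EXAMPLE_LISTS` are made up for this sketch; the data holds only the first three entries of each index list from the slide.

```python
def nra_top_k(index_lists, k):
    """NRA-style top-k using sorted access only, with sum as aggregation.

    index_lists: one list per query term of (doc, score) pairs, sorted by
    descending score. Returns (top-k list of (doc, worstscore), scan depth).
    """
    m = len(index_lists)
    depth = max(len(lst) for lst in index_lists)
    seen = {}  # doc -> {list index: score}; the keys play the role of E(d)
    high = [lst[0][1] if lst else 0.0 for lst in index_lists]

    def worst(d):
        return sum(seen[d].values())

    def best(d):
        return worst(d) + sum(high[i] for i in range(m) if i not in seen[d])

    for pos in range(depth):
        for i, lst in enumerate(index_lists):
            if pos < len(lst):
                doc, score = lst[pos]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # list exhausted: nothing more to gain here
        ranked = sorted(seen, key=worst, reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        min_k = worst(top[-1])
        # Best possible score of any candidate outside the current top-k,
        # including a document that has not been seen in any list yet.
        threshold = max([best(d) for d in rest] + [sum(high)])
        if threshold <= min_k:
            return [(d, worst(d)) for d in top], pos + 1
    ranked = sorted(seen, key=worst, reverse=True)[:k]
    return [(d, worst(d)) for d in ranked], depth


# First three entries of each index list from the slide (k = 1 example).
EXAMPLE_LISTS = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8)],  # t1
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6)],  # t2
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4)],  # t3
]
```

Calling `nra_top_k(EXAMPLE_LISTS, 1)` stops after scan depth 3 with d10 as the single result, which matches the STOP! on the slide.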

7 Probabilistic Pruning of Top-k Candidates
TA family of algorithms based on invariant (with sum as aggr): worstscore(d) ≤ score(d) ≤ bestscore(d)
Add d to top-k result if worstscore(d) > min-k
Drop d only if bestscore(d) < min-k, otherwise keep in priority queue (PQ)
Often overly conservative (deep scans, high memory for PQ)
Approximate top-k with probabilistic guarantees: discard candidates d from queue if p(d) ≤ ε, where p(d) is the predicted probability that the final score of d exceeds min-k; then E[rel. precision@k] = 1 − ε
Score predictor can use LSTs & Chernoff bounds, Poisson approximations, or histogram convolution
[Figure: score over scan depth, with bestscore(d) and worstscore(d) approaching each other relative to min-k, and the point where d is dropped from the priority queue]
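Of the score predictors named on this slide, histogram convolution is the easiest to illustrate. The sketch below is an assumption-laden toy, not the predictor used in the experiments: each unseen list contributes an independent, discretised score distribution (bucket width 0.1), the distributions are convolved, and a candidate is dropped when the estimated probability of still beating min-k is at most a pruning bound called `epsilon` here. All function and parameter names are hypothetical.

```python
def convolve(hist_a, hist_b):
    """Distribution of the sum of two independent discretised scores.

    hist[i] is the probability that the score equals i * bucket.
    """
    out = [0.0] * (len(hist_a) + len(hist_b) - 1)
    for i, pa in enumerate(hist_a):
        for j, pb in enumerate(hist_b):
            out[i + j] += pa * pb
    return out


def prob_exceeds(worstscore, min_k, unseen_hists, bucket=0.1):
    """Estimate p(d) = P[worstscore + sum of unseen scores > min_k]."""
    total = [1.0]  # distribution of an empty sum: score 0 with probability 1
    for hist in unseen_hists:
        total = convolve(total, hist)
    gap = min_k - worstscore
    # Small tolerance so that bucket values like 3 * 0.1 compare robustly.
    return sum(p for idx, p in enumerate(total) if idx * bucket > gap + 1e-12)


def should_drop(worstscore, min_k, unseen_hists, epsilon, bucket=0.1):
    """Drop candidate d from the queue if p(d) <= epsilon."""
    return prob_exceeds(worstscore, min_k, unseen_hists, bucket) <= epsilon
```

With two unseen lists that each yield a score of 0.0, 0.1 or 0.2 with probabilities 0.5, 0.3 and 0.2, a candidate with worstscore 1.0 against min-k 1.25 gets p(d) = 0.16, so it is kept for epsilon = 0.1 and dropped for epsilon = 0.2.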

8 Experiments with TREC-12 Web-Track Benchmark
On .GOV corpus from TREC-12 Web track: 1.25 million docs (html, pdf, etc.)
50 keyword queries, e.g.: "Lewis Clark expedition", "juvenile delinquency", "legalization Marihuana", "air bag safety reducing injuries death facts"

                      TA-sorted    Prob-sorted (smart)
  #sorted accesses    2,263,652    527,980
  elapsed time [s]    148.7        15.9
  max queue size      10849        400
  relative precision  1            0.87
  rank distance       0            39.5
  score error         0            0.031

Speedup by factor 10 at high precision/recall (relative to TA-sorted); aggressive queue mgt. even yields factor 100 at 30-50 % prec./recall

9 Outline: Introduction; Vision; Demo; Efficient Top-k Search; Ontology-based Query Expansion; Exploiting User Behavior; Isolating Selfish Peers

10 Query Expansion
Threshold-based query expansion: substitute ~w by (c1 | ... | ck) with all ci for which sim(w, ci) reaches a given similarity threshold
"Old hat" in IR; highly disputed for danger of topic dilution
Problem: choice of threshold
Approach to careful expansion:
  determine phrases from query or best initial query results (e.g., forming 3-grams and looking up ontology/thesaurus entries)
  if uniquely mapped to one concept then expand with synonyms and weighted hyponyms
  alternatively use statistical learning methods for word sense disambiguation
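Threshold-based expansion, and why the threshold is hard to choose, can be shown with a tiny sketch. It assumes a thesaurus stored as a dictionary from a term to related concepts with similarity values, and it assumes the comparison is "greater than or equal"; the function name `expand_term` and the data layout are hypothetical, and the similarity values are the ones shown for "performance" later in the talk.

```python
def expand_term(term, thesaurus, threshold):
    """Return the term plus every concept c with sim(term, c) >= threshold."""
    related = thesaurus.get(term, {})
    return [term] + sorted(c for c, sim in related.items() if sim >= threshold)


# Similarities for "performance" as listed on the thesaurus slide of this talk.
THESAURUS = {
    "performance": {
        "response time": 0.7,
        "throughput": 0.6,
        "queueing": 0.3,
        "delay": 0.25,
    }
}
```

With a threshold of 0.5 the query term expands to three terms, with 0.2 to five. Small changes in the threshold therefore change both the cost of the query and the risk of topic dilution.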

11 Query Expansion Example
From TREC 2004 Robust Track:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.

Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20],...}}

135530 sorted accesses in 11.073s.

Results:
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME
...

Result excerpts:
A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies....
Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but,...
... for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials....

12 Top-k Query Processing with Query Expansion
Consider expandable query "algorithm and ~performance" with score
  Σ_{i ∈ q} max_{j ∈ onto(i)} { sim(i,j) * s_j(d) }
Dynamic query expansion with incremental on-demand merging of additional index lists
+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift
[Figure: B+ tree index on terms with score-sorted index lists for "algorithm", "performance", "response time" and "throughput"; a thesaurus / meta-index expands performance into response time: 0.7, throughput: 0.6, queueing: 0.3, delay: 0.25, ...]
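The scoring function on this slide takes, for every query term, the best similarity-weighted score among the term's expansions and sums these maxima. A minimal sketch follows, assuming that onto(i) contains the term itself with similarity 1.0 and that a document scores 0 for terms it does not contain. The function name `expanded_score` and the per-document scores in the usage note are hypothetical. The sketch evaluates the formula for one document only; it does not implement the incremental merging of index lists.

```python
def expanded_score(doc_scores, query_terms, onto):
    """Sum over query terms i of max over j in onto(i) of sim(i, j) * s_j(d).

    doc_scores: term -> score s_j(d) of one document d
    onto:       term -> {expansion term: similarity}; terms without an
                entry are treated as expanding only to themselves
    """
    total = 0.0
    for term in query_terms:
        expansions = onto.get(term, {term: 1.0})
        total += max(
            (sim * doc_scores.get(j, 0.0) for j, sim in expansions.items()),
            default=0.0,
        )
    return total


ONTO = {
    "performance": {
        "performance": 1.0,   # assumption: the term itself has similarity 1
        "response time": 0.7,
        "throughput": 0.6,
        "queueing": 0.3,
        "delay": 0.25,
    }
}
```

For a document with hypothetical scores algorithm 0.4, performance 0.5, response time 0.8 and throughput 0.8, the "performance" part contributes max(0.5, 0.56, 0.48) = 0.56, so the total is 0.96. Because only the maximum counts per query term, adding many weak expansions cannot inflate a document's score, which is the intuition behind the "no topic drift" claim.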

13 Experiments with TREC-13 Robust-Track Benchmark
On AQUAINT corpus (news articles): 528 000 docs, 2 GB raw data, 8 GB for all indexes; with Okapi BM25 probabilistic scoring model
50 most difficult queries, e.g.: "transportation tunnel disasters" and "Hubble telescope achievements", potentially expanded into "earthquake, flood, wind, seismology, accident, car, auto, train, ..." and "astronomical, electromagnetic radiation, cosmic source, nebulae, ..." respectively

Configurations (ε = probabilistic pruning bound):
  A: no exp. (ε=0.1)
  B: static exp. (expansion threshold 0.3, ε=0.0)
  C: static exp. (expansion threshold 0.3, ε=0.1)
  D: incr. merge (ε=0.1)

                    A          B           C          D
  #sorted acc.      1,333,756  10,586,175  3,622,686  5,671,493
  #random acc.      0          555,176     49,783     34,895
  elapsed time [s]  9.3        156.6       79.6       43.8
  max #terms        4          59          59         59
  relative prec.    0.934      1.0         0.541      0.786
  precision@10      0.248      0.286       0.238      0.298
  MAP               0.091      0.111       0.086      0.110

Speedup by factor 4 at high precision/recall; no topic drift, no need for threshold tuning; also handles TREC-13 Terabyte benchmark

14 Outline: Introduction; Vision; Demo; Efficient Top-k Search; Ontology-based Query Expansion; Exploiting User Behavior; Isolating Selfish Peers

15 Exploiting Query Logs and Click Streams
From PageRank: uniformly random choice of links + random jumps
Authority (page q) = stationary prob. of visiting q
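The PageRank random walk described here (follow a uniformly chosen outgoing link, or jump to a random page) has a stationary distribution that can be approximated by power iteration. The sketch below is a textbook-style illustration, not the code used in the project. The jump probability of 0.15, the fixed iteration count and the uniform handling of pages without outgoing links are assumptions, and the function name `pagerank` is chosen for this example only.

```python
def pagerank(links, jump=0.15, iterations=100):
    """Approximate the stationary visiting probabilities of the random surfer.

    links: page -> list of pages it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: jump / n for v in nodes}  # random-jump share
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = (1.0 - jump) * rank[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Page without outgoing links: spread its mass uniformly.
                for t in nodes:
                    nxt[t] += (1.0 - jump) * rank[v] / n
        rank = nxt
    return rank
```

For the three-page graph `{"a": ["b"], "b": ["a"], "c": ["a"]}`, page a ends up with the highest authority and page c, which nobody links to, with only the random-jump share.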

16 Exploiting Query Logs and Click Streams
From PageRank: uniformly random choice of links + random jumps
to QRank: + query-doc transitions + query-query transitions + doc-doc transitions on implicit links (w/ thesaurus), with probabilities estimated from log statistics

17 Preliminary Experiments
Setup: 70 000 Wikipedia docs, 18 volunteers posing Trivial-Pursuit queries; ca. 500 queries, ca. 300 refinements, ca. 1000 positive clicks; ca. 15 000 implicit links based on doc-doc similarity
Results (assessment by blind-test users):
  QRank top-10 result preferred over PageRank in 81% of all cases
  QRank has 50.3% precision@10, PageRank has 33.9%
Untrained example query "philosophy":
      PageRank                   QRank
  1.  Philosophy                 Philosophy
  2.  GNU free doc. license      GNU free doc. license
  3.  Free software foundation   Early modern philosophy
  4.  Richard Stallman           Mysticism
  5.  Debian                     Aristotle

18 Outline: Introduction; Vision; Demo; Efficient Top-k Search; Ontology-based Query Expansion; Exploiting User Behavior; Isolating Selfish Peers

19 Self-Organization for Isolating Selfish Peers
Collaborative P2P Search
[Figure: query peer P0 with local index X0, bookmarks B0 and peer lists (directory), with entries such as term g: 13, 11, 45, ...; term a: 17, 11, 92, ...; term f: 43, 65, 92, ...; term c: 13, 92, 45, ...; url x: 37, 44, 12, ...; url y: 75, 43, 12, ...; url z: 54, 128, 7, ...]
Susceptible to misbehavior! How do we identify and penalize or isolate selfish/malicious peers?

20 Self-Organization for Isolating Selfish Peers
Rationale: mimic evolution in biological / social networks; tag selfish vs. altruistic peers and bias interactions towards similar peers
Algorithm:
  periodically do
    each peer compares its "utility" with a random peer
    if the other peer has higher utility then copy that peer's strategy and links (reproduction)
    mutate with small probability: change behavior, change links
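One cycle of the compare, copy and mutate loop can be sketched as follows. This is a toy under several assumptions and not the simulator behind the reported results: peers are dictionaries with a numeric "strategy" and a list of "links", the utility function is supplied by the caller, all comparisons in a cycle use a snapshot taken at the start of the cycle, and a mutation replaces the strategy with a random value or the links with one random peer index. The function name `evolve` and the peer layout are hypothetical.

```python
import random


def evolve(peers, utility, rng, mutation_rate=0.01):
    """Run one cycle of utility comparison, reproduction and mutation in place."""
    n = len(peers)
    # Snapshot so that every comparison sees the state at the start of the cycle.
    snapshot = [dict(p, links=list(p["links"])) for p in peers]
    scores = [utility(p) for p in snapshot]
    for i, peer in enumerate(peers):
        j = rng.randrange(n)  # pick a random peer to compare with
        if scores[j] > scores[i]:
            # Reproduction: copy the better peer's strategy and links.
            peer["strategy"] = snapshot[j]["strategy"]
            peer["links"] = list(snapshot[j]["links"])
        if rng.random() < mutation_rate:
            peer["strategy"] = rng.random()      # change behavior
        if rng.random() < mutation_rate:
            peer["links"] = [rng.randrange(n)]   # change links
    return peers
```

With mutation switched off, no peer's utility can decrease in a cycle, since a peer only ever copies a strictly better one. Mutation is what lets new behaviors and link structures enter the population.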

21 Self-Organization for Isolating Selfish Peers
Simulation Results for P2P File Sharing
Peers generate queries and answer queries based on P ∈ [0,1], with extreme behaviors: selfish P = 1.0 and altruistic P = 0.0
Peer utility = # hits (queries answered)
Mutation: change P randomly
[Figure: typical run for 10^4 peers; x-axis: cycles (0 to 100); y-axis: average per node (0 to 60); curves: queries generated, hits; annotations: selfishness reduces, average performance increases]

22 The End
Thank you!

