DELIS Highlights: Efficient and Intelligent Top-k Search in Peer-to-Peer Systems presented by Gerhard Weikum (Max-Planck Institute of Computer Science)
Gerhard Weikum (MPII), Data Management on Dynamic P2P, Subproject 6

Introduction: Why Peer-to-Peer Web Search?

Vision: self-organizing P2P Web search engine with Google-or-better functionality

- Proof of concept for scalable & self-organizing data structures and algorithms (e.g., DHTs, randomized overlay networks, epidemic spreading)
- Powerful search methods for each peer (concept-based search, query expansion, personalization, etc.)
- Leverage intellectual input at each peer (bookmarks, feedback, query logs, click streams, evolving Web, etc.)
- Collaboration among peers (query routing, incentives, fairness, anonymity, etc.)
- Better search result quality (precision, recall, etc.)
- Breaking information monopolies
- Testbed for CS models, algorithms, technologies, and experimental platform
What Google Can't Do

Killer queries (disregarding NLP QA, multilingual, multimedia):
drama with three women making a prophecy to a British nobleman that he will become king
Outline

- Vision
- Demo
- Efficient Top-k Search
- Ontology-based Query Expansion
- Exploiting User Behavior
- Isolating Selfish Peers
Efficient Top-k Search

TA: efficient & principled top-k query processing with monotonic score aggregation

Data items: d_1, ..., d_n, with per-term scores such as s(t_1, d_1) = 0.7, ..., s(t_m, d_1) = 0.2
Query: q = (t_1, t_2, t_3)
Index lists: one list L_i per term t_i, sorted by descending score
Scale example (Google): > 10 million terms, > 8 billion docs, > 4 TB index

TA with sorted access only (NRA) (Fagin 01, Güntzer/Kießling/Balke 01):

  scan index lists; consider d at pos_i in L_i:
    E(d) := E(d) ∪ {i};
    high_i := s(t_i, d);
    worstscore(d) := aggr{ s(t_j, d) | j ∈ E(d) };
    bestscore(d) := aggr{ worstscore(d), aggr{ high_j | j ∉ E(d) } };
    if worstscore(d) > min-k then
      add d to top-k;
      min-k := min{ worstscore(d') | d' ∈ top-k };
    else if bestscore(d) > min-k then
      cand := cand ∪ {d};
    threshold := max{ bestscore(d') | d' ∈ cand };
    if threshold ≤ min-k then exit;

[Figure: index lists for t_1, t_2, t_3 and the rank / doc / worstscore / bestscore tables after scan depths 1, 2, and 3 for k = 1, ending with "STOP!"]
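The NRA scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides: it assumes sum as the aggregation function, non-negative scores, and round-robin scanning of in-memory lists. The names nra_top_k, seen, and high are chosen here for illustration. The returned scores are worstscores, i.e. lower bounds that may be smaller than the exact scores when the scan stops early.

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (doc id, score); each list is sorted by descending score


def nra_top_k(index_lists: List[List[Posting]], k: int) -> List[Tuple[str, float]]:
    """Return the top-k docs with their worstscores, using sorted access only."""
    m = len(index_lists)
    seen: Dict[str, Dict[int, float]] = {}  # d -> {i: s(t_i, d)} for all i in E(d)
    high = [lst[0][1] if lst else 0.0 for lst in index_lists]
    max_depth = max((len(lst) for lst in index_lists), default=0)

    def worst(d: str) -> float:
        return sum(seen[d].values())

    def best(d: str) -> float:
        return worst(d) + sum(high[i] for i in range(m) if i not in seen[d])

    def ranking() -> List[str]:
        return sorted(seen, key=lambda d: (-worst(d), d))

    for depth in range(max_depth):
        # One round of sorted accesses: read position `depth` of every list.
        for i, lst in enumerate(index_lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # list exhausted: it cannot contribute any more
        ranked = ranking()
        if len(ranked) < k:
            continue
        min_k = worst(ranked[k - 1])
        # Best possible score of anything outside the current top-k,
        # including documents that have not been seen in any list yet.
        threshold = max([best(d) for d in ranked[k:]] + [sum(high)])
        if threshold <= min_k:
            break  # no candidate can still overtake the current top-k
    return [(d, worst(d)) for d in ranking()[:k]]
```

Called with two lists [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)] and [("d2", 0.7), ("d1", 0.2), ("d3", 0.1)] and k = 1, the scan stops after depth 2 with d2 as the top result, without ever reading d3.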
Probabilistic Pruning of Top-k Candidates

TA family of algorithms based on invariant (with sum as aggr):
  worstscore(d) ≤ score(d) ≤ bestscore(d)
- Add d to top-k result if worstscore(d) > min-k
- Drop d only if bestscore(d) < min-k, otherwise keep it in the priority queue (PQ)
Often overly conservative (deep scans, high memory for PQ)

Approximate top-k with probabilistic guarantees:
- discard candidates d from the queue if p(d), the predicted probability that d can still qualify for the top-k, falls below a threshold
- score predictor can use LSTs & Chernoff bounds, Poisson approximations, or histogram convolution
- E[rel. ...] = 1 ...

[Figure: candidate d with worstscore(d) and bestscore(d) on either side of min-k, plotted over scan depth; d is dropped from the priority queue]
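One way to picture a histogram-convolution score predictor is the sketch below. It rests on assumptions that are not in the slides: each unseen list contributes an independent score drawn from an equi-width histogram, bucket b stands for the exact score b * bucket_width, and the names convolve, prob_exceeds, and the pruning threshold eps are hypothetical.

```python
from typing import List


def convolve(p: List[float], q: List[float]) -> List[float]:
    """Distribution of the sum of two independent bucketed scores."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def prob_exceeds(histograms: List[List[float]], bucket_width: float,
                 worstscore: float, min_k: float) -> float:
    """Estimate P[worstscore + sum of still-unknown scores > min_k].

    histograms[i][b] is the probability that the unknown score from the
    i-th unseen list equals b * bucket_width (an assumed discretization).
    """
    dist = [1.0]  # distribution of the sum of zero terms: all mass at 0
    for h in histograms:
        dist = convolve(dist, h)
    delta = min_k - worstscore  # score mass the candidate still has to gain
    return sum(p for b, p in enumerate(dist) if b * bucket_width > delta)


def should_drop(histograms: List[List[float]], bucket_width: float,
                worstscore: float, min_k: float, eps: float) -> bool:
    """Prune the candidate when its chance of reaching the top-k is at most eps."""
    return prob_exceeds(histograms, bucket_width, worstscore, min_k) <= eps
```

With two unseen lists that each yield a score of 0 or 0.5 with equal probability, a candidate with worstscore 0.2 beats min-k = 0.8 only if both scores are 0.5, so the estimate is 0.25. Such a candidate would be dropped for eps = 0.3 and kept for eps = 0.1.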
Experiments with TREC-12 Web-Track Benchmark

On the .GOV corpus from the TREC-12 Web track: 1.25 million docs (HTML, PDF, etc.)
50 keyword queries, e.g.: "Lewis Clark expedition", "juvenile delinquency", "legalization Marihuana", "air bag safety reducing injuries death facts"

                      TA-sorted    Prob-sorted (smart)
#sorted accesses      2,263,652    527,980
elapsed time [s]      ...          ...
max queue size        ...          ...
relative precision    1            0.87
rank distance         0            39.5
score error           ...          ...

Speedup by factor 10 at high precision/recall (relative to TA-sorted); aggressive queue management even yields factor 100 at ... % precision/recall
Query Expansion

Threshold-based query expansion:
substitute ~w by (c_1 | ... | c_k) with all c_i for which sim(w, c_i) reaches a threshold
- "Old hat" in IR; highly disputed for danger of topic dilution
- Problem: choice of threshold

Approach to careful expansion:
- determine phrases from the query or the best initial query results (e.g., forming 3-grams and looking up ontology/thesaurus entries)
- if a phrase maps uniquely to one concept, then expand with synonyms and weighted hyponyms
- alternatively, use statistical learning methods for word sense disambiguation
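A toy sketch of threshold-based expansion makes the threshold problem concrete. The thesaurus contents, the similarity values, and the function name expand_term are made up for illustration, and the "at least the threshold" comparison is an assumption.

```python
from typing import Dict, List


def expand_term(term: str, thesaurus: Dict[str, Dict[str, float]],
                threshold: float) -> List[str]:
    """Replace ~term by the term plus every concept whose similarity reaches the threshold."""
    related = thesaurus.get(term, {})
    return [term] + sorted(c for c, sim in related.items() if sim >= threshold)


# Hypothetical thesaurus entry with similarity weights.
toy_thesaurus = {
    "crime": {"mafia": 0.8, "syndicate": 0.6, "mob": 0.4, "hand": 0.1},
}

strict = expand_term("crime", toy_thesaurus, 0.7)   # few, closely related terms
loose = expand_term("crime", toy_thesaurus, 0.05)   # many terms, risk of topic dilution
```

A high threshold keeps only "mafia", while a low one also pulls in "hand", which shows how sensitive the expanded query is to this single parameter.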
Query Expansion Example

From TREC 2004 Robust Track:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.

Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20], ...}

... sorted accesses in ... s.

Results:
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME
...

Excerpts from result documents:

A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies. ...

Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but, for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials. ...
Top-k Query Processing with Query Expansion

Consider the expandable query "algorithm and ~performance" with score

  \sum_{i \in q} \max_{j \in onto(i)} \{ sim(i,j) \cdot s_j(d) \}

Dynamic query expansion with incremental on-demand merging of additional index lists:
+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift

Thesaurus / meta-index entry for "performance": response time: 0.7, throughput: 0.6, queueing: 0.3, delay: ...

[Figure: B+ tree index on terms with the index lists for "algorithm" and "performance", plus the lists for "response time" and "throughput" that are merged in on demand]
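The merging of expansion lists can be pictured as one virtual index list for ~performance, ordered by sim(i,j) * s_j(d). The sketch below is a simplification under stated assumptions: all lists fit in memory, the head of every expansion list is pushed onto a heap up front (a real system would open a list only once its best possible weighted score becomes relevant), and the name incremental_merge and the posting values are invented for illustration.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

Posting = Tuple[str, float]  # (doc id, score); each list is sorted by descending score


def incremental_merge(lists: Dict[str, List[Posting]],
                      sims: Dict[str, float]) -> Iterator[Tuple[str, float, str]]:
    """Yield (doc, weighted score, source term) in descending weighted-score order.

    Because entries arrive in descending order of sim * score, the first time a
    document appears it carries its maximum over all expansion terms.
    """
    heap: List[Tuple[float, str, int]] = []
    for term, postings in lists.items():
        if postings:
            heapq.heappush(heap, (-sims[term] * postings[0][1], term, 0))
    emitted = set()
    while heap:
        neg_score, term, pos = heapq.heappop(heap)
        doc = lists[term][pos][0]
        if pos + 1 < len(lists[term]):
            nxt = lists[term][pos + 1][1]
            heapq.heappush(heap, (-sims[term] * nxt, term, pos + 1))
        if doc not in emitted:  # later, lower-weighted duplicates are skipped
            emitted.add(doc)
            yield doc, -neg_score, term


# Hypothetical postings; similarities follow the thesaurus entry on the slide.
example_lists = {
    "performance": [("d1", 0.8), ("d2", 0.4)],
    "response time": [("d3", 1.0), ("d1", 0.9)],
    "throughput": [("d2", 0.9)],
}
example_sims = {"performance": 1.0, "response time": 0.7, "throughput": 0.6}
merged = list(incremental_merge(example_lists, example_sims))
```

Since the result is a generator, a top-k algorithm can consume it entry by entry and stop as soon as its threshold test succeeds, so the tails of the expansion lists are never read.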
Experiments with TREC-13 Robust-Track Benchmark

On the AQUAINT corpus (news articles): ... docs, 2 GB raw data, 8 GB for all indexes
With Okapi BM25 probabilistic scoring model
50 most difficult queries, e.g.:
- "transportation tunnel disasters", potentially expanded into "earthquake, flood, wind, seismology, accident, car, auto, train, ..."
- "Hubble telescope achievements", potentially expanded into "astronomical, electromagnetic radiation, cosmic source, nebulae, ..."

                    no exp.      static exp.             static exp.             incr. merge
                    (... = 0.1)  (... = 0.3, ... = 0.0)  (... = 0.3, ... = 0.1)  (... = 0.1)
#sorted acc.        1,333,756    10,586,175              3,622,686               5,671,493
#random acc.        0            555,176                 49,783                  34,895
elapsed time [s]    ...          ...                     ...                     ...
max #terms          ...          ...                     ...                     ...
relative prec.      ...          ...                     ...                     ...
MAP                 ...          ...                     ...                     ...

Speedup by factor 4 at high precision/recall; no topic drift, no need for threshold tuning; also handles TREC-13 Terabyte benchmark
Exploiting Query Logs and Click Streams

From PageRank: uniformly random choice of links + random jumps
Authority(page q) = stationary probability of visiting q
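The stationary visiting probability of such a random walk can be approximated by power iteration. The sketch below is a standard textbook formulation, not code from the project: the jump probability of 0.15, the fixed iteration count, and the treatment of pages without outgoing links (jump uniformly) are assumptions made here.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], jump: float = 0.15,
             iterations: int = 100) -> Dict[str, float]:
    """Approximate the stationary distribution of the random-surfer walk."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        nxt = {u: jump / n for u in nodes}  # random jump to any page
        for u in nodes:
            out = links.get(u, [])
            if out:
                # Follow one of the outgoing links uniformly at random.
                share = (1.0 - jump) * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Page without links: spread its mass uniformly (assumption).
                for v in nodes:
                    nxt[v] += (1.0 - jump) * rank[u] / n
        rank = nxt
    return rank


toy_web = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
authority = pagerank(toy_web)
```

In this toy graph, page b is linked from both a and c and ends up with the highest authority, followed by c and then a.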
Exploiting Query Logs and Click Streams

From PageRank: uniformly random choice of links + random jumps
to QRank:
+ query-doc transitions
+ query-query transitions
+ doc-doc transitions on implicit links (w/ thesaurus)
with probabilities estimated from log statistics
Preliminary Experiments

Setup:
- Wikipedia docs, 18 volunteers posing Trivial-Pursuit queries
- ca. 500 queries, ca. 300 refinements, ca. ... positive clicks
- ca. ... implicit links based on doc-doc similarity

Results (assessment by blind-test users):
- QRank top-10 result preferred over PageRank in 81% of all cases
- QRank has 50.3%, PageRank has 33.9%

Untrained example query "philosophy":

     PageRank                    QRank
  1. Philosophy                  Philosophy
  2. GNU free doc. license       GNU free doc. license
  3. Free software foundation    Early modern philosophy
  4. Richard Stallman            Mysticism
  5. Debian                      Aristotle
Collaborative P2P Search

Query peer P0 holds:
- local index X0, e.g., term g: 13, 11, 45, ...; term a: 17, 11, 92, ...; term f: 43, 65, 92, ...
- bookmarks B0
- peer lists (directory), e.g., term g: 13, 11, 45, ...; term c: 13, 92, 45, ...; url x: 37, 44, 12, ...; url y: 75, 43, 12, ...; url z: 54, 128, 7, ...

Susceptible to misbehavior!
How do we identify and penalize or isolate selfish/malicious peers?
Self-Organization for Isolating Selfish Peers

Rationale: mimic evolution in biological / social networks;
tag selfish vs. altruistic peers and bias interactions towards similar peers

Algorithm:
  periodically do
    each peer compares its "utility" with a random peer
    if the other peer has higher utility then
      copy that peer's strategy and links (reproduction)
    mutate with small probability: change behavior, change links
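One period of this copy-and-mutate scheme might look as follows. This is a minimal sketch under assumptions: the Peer fields, the function name evolve, the mutation rate of 0.01, and the choice to rewire a mutated peer to a single random neighbor are all illustrative, and utilities are expected to be recomputed by the surrounding simulation before each call.

```python
import random
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Peer:
    selfishness: float                            # strategy, a value in [0, 1]
    links: Set[int] = field(default_factory=set)  # indices of neighboring peers
    utility: float = 0.0                          # measured over the last period


def evolve(peers: List[Peer], rng: random.Random, mutation_rate: float = 0.01) -> None:
    """Run one period: compare with a random peer, copy if it does better, mutate."""
    n = len(peers)
    for i, peer in enumerate(peers):
        other = peers[rng.randrange(n)]
        if other.utility > peer.utility:
            # Reproduction: take over the more successful strategy and links.
            peer.selfishness = other.selfishness
            peer.links = set(other.links) - {i}   # never link to oneself
        if rng.random() < mutation_rate:
            peer.selfishness = rng.random()       # change behavior
        if n > 1 and rng.random() < mutation_rate:
            # Change links: rewire to one random other peer (assumption).
            peer.links = {rng.choice([x for x in range(n) if x != i])}
```

Passing a seeded random.Random instance keeps simulation runs reproducible, and setting mutation_rate to 0.0 isolates the pure copying dynamics.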
Simulation Results for P2P File Sharing

- peers generate queries and answer queries based on P ∈ [0,1], with extreme behaviors: selfish P = 1.0 and altruistic P = 0.0
- peer utility = # hits (queries answered)
- mutation: change P randomly

Typical run for 10^4 peers:
- selfishness decreases
- average performance increases

[Figure: queries generated and hits, averaged per node, plotted over cycles]
The End

Thank you!