DELIS Highlights: Efficient and Intelligent Top-k Search in Peer-to-Peer Systems presented by Gerhard Weikum (Max-Planck Institute of Computer Science)


Gerhard Weikum (MPII), Data Management on Dynamic P2P, Subproject 6
Why Peer-to-Peer Web Search?
Vision: Self-organizing P2P Web Search Engine with Google-or-better functionality
Proof of Concept for Scalable & Self-Organizing Data Structures and Algorithms (e.g., DHTs, Randomized Overlay Networks, Epidemic Spreading)
Powerful Search Methods for Each Peer (Concept-based Search, Query Expansion, Personalization, etc.)
Leverage Intellectual Input at Each Peer (Bookmarks, Feedback, Query Logs, Click Streams, Evolving Web, etc.)
Collaboration among Peers (Query Routing, Incentives, Fairness, Anonymity, etc.)
Better Search Result Quality (Precision, Recall, etc.)
Breaking Information Monopolies
Testbed for CS Models, Algorithms, Technologies and Experimental Platform

What Google Can't Do
Killer queries (disregarding NLP QA, multilingual, multimedia):
drama with three women making a prophecy to a British nobleman that he will become king

Outline
Vision
Demo
Efficient Top-k Search
Ontology-based Query Expansion
Exploiting User Behavior
Isolating Selfish Peers

Efficient Top-k Search
Data items: d1, …, dn with scores s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3)
Ex. Google: > 10 million terms, > 8 billion docs, > 4 TB index
[Figure: index lists for t1, t2, t3, and the candidate table (Rank, Doc, Worst-score, Best-score) after scan depths 1, 2, and 3 for k = 1, ending with "STOP!"]
TA with sorted access only (NRA) (Fagin 01, Güntzer/Kießling/Balke 01):
  scan index lists; consider d at pos_i in L_i;
  E(d) := E(d) ∪ {i}; high_i := s(t_i,d);
  worstscore(d) := aggr{s(t_ν,d) | ν ∈ E(d)};
  bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
  if worstscore(d) > min-k then
    add d to top-k;
    min-k := min{worstscore(d') | d' ∈ top-k};
  else if bestscore(d) > min-k then
    cand := cand ∪ {d};
  threshold := max{bestscore(d') | d' ∈ cand};
  if threshold ≤ min-k then exit;
TA: efficient & principled top-k query processing with monotonic score aggr.
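
The NRA pseudocode on this slide can be sketched in Python as follows. This is a minimal illustration, not the speaker's implementation: it assumes sum as the monotonic aggregation, treats a missing score as 0, and additionally bounds documents not yet seen in any list by the sum of the current high_i values. The function name `nra_top_k` and the toy index lists in the usage note are made up for the example.

```python
from collections import defaultdict


def nra_top_k(index_lists, k):
    """Top-k with sorted access only (NRA-style), using sum as aggregation.

    index_lists: one list per query term, each holding (doc, score) pairs
    sorted by descending score. Expects k >= 1.
    Returns (top_k, scan_depth); reported scores are worstscores.
    """
    m = len(index_lists)
    seen = defaultdict(dict)   # doc -> {list index i: s(t_i, d)}; keys form E(d)
    high = [0.0] * m           # high_i: last score read from list i
    top_k, worst, depth = [], {}, 0
    max_depth = max((len(lst) for lst in index_lists), default=0)

    while depth < max_depth:
        # One round of sorted accesses: read position `depth` of every list.
        for i, lst in enumerate(index_lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen[doc][i] = score
                high[i] = score
            else:
                high[i] = 0.0  # exhausted list: nothing more to gain here
        depth += 1

        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}

        ranked = sorted(worst, key=lambda d: (-worst[d], str(d)))
        top_k = ranked[:k]
        if len(top_k) < k:
            continue
        min_k = worst[top_k[-1]]

        # Best score any non-top-k document could still reach, including
        # documents not seen in any list so far (bounded by sum(high)).
        threshold = max([best[d] for d in ranked[k:]] + [sum(high)])
        if threshold <= min_k:
            break

    return [(d, worst[d]) for d in top_k], depth
```

For example, with the hypothetical lists `[("d1", 0.9), ("d2", 0.8), ("d3", 0.1)]`, `[("d2", 0.7), ("d1", 0.6), ("d3", 0.2)]`, and `[("d1", 0.5), ("d3", 0.4), ("d2", 0.3)]` and k = 1, the scan can stop before the lists are exhausted because no other candidate's bestscore exceeds the worstscore of d1.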

Probabilistic Pruning of Top-k Candidates
TA family of algorithms based on invariant (with sum as aggr): worstscore(d) ≤ … ≤ bestscore(d)
Add d to top-k result if worstscore(d) > min-k
Drop d only if bestscore(d) < min-k, otherwise keep in PQ
→ Often overly conservative (deep scans, high memory for PQ)
Approximate top-k with probabilistic guarantees: discard candidates d from queue if p(d) ≤ ε
score predictor can use LSTs & Chernoff bounds, Poisson approximations, or histogram convolution
⇒ E[rel. precision@k] = 1 − ε
[Figure: score vs. scan depth, showing bestscore(d), worstscore(d), and min-k, and the point at which d is dropped from the priority queue]
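
One of the score predictors named on this slide, histogram convolution, can be sketched as below. This is only an illustration under stated assumptions: the scores a candidate may still collect in its unseen lists are modeled as independent discrete distributions, p(d) is taken to be the probability that worstscore(d) plus those scores exceeds min-k, and the names `convolve`, `exceed_probability`, `should_prune`, and the parameter `epsilon` are hypothetical.

```python
from itertools import product


def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete score variables.

    Each distribution is a dict {score value: probability}.
    """
    out = {}
    for (va, pa), (vb, pb) in product(dist_a.items(), dist_b.items()):
        key = round(va + vb, 10)  # merge sums that differ only by float noise
        out[key] = out.get(key, 0.0) + pa * pb
    return out


def exceed_probability(worstscore, unseen_dists, min_k):
    """Estimate P[worstscore + scores from unseen lists > min_k]."""
    total = {0.0: 1.0}  # distribution of "no additional score"
    for dist in unseen_dists:
        total = convolve(total, dist)
    return sum(p for v, p in total.items() if worstscore + v > min_k)


def should_prune(worstscore, unseen_dists, min_k, epsilon):
    """Drop the candidate when its chance of reaching the top-k is at most epsilon."""
    return exceed_probability(worstscore, unseen_dists, min_k) <= epsilon
```

A larger `epsilon` prunes more aggressively and shortens scans, at the price of possibly dropping a true top-k result; `epsilon = 0` falls back to keeping every candidate that has any estimated chance of qualifying.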

Experiments with TREC-12 Web-Track Benchmark
on .GOV corpus from TREC-12 Web track: 1.25 million docs (html, pdf, etc.)
50 keyword queries, e.g.: "Lewis Clark expedition", "juvenile delinquency", "legalization Marihuana", "air bag safety reducing injuries death facts"
                     TA-sorted    Prob-sorted (smart)
#sorted accesses     2,263,652    527,980
elapsed time [s]
max queue size
relative precision   1            0.87
rank distance        0            39.5
score error
speedup by factor 10 at high precision/recall (relative to TA-sorted); aggressive queue mgt. even yields factor 100 at % prec./recall

Query Expansion
Threshold-based query expansion: substitute ~w by (c1 | ... | ck) with all ci for which sim(w, ci) reaches a given threshold
"Old hat" in IR; highly disputed for danger of topic dilution
Problem: choice of threshold
Approach to careful expansion:
determine phrases from query or best initial query results (e.g., forming 3-grams and looking up ontology/thesaurus entries)
if uniquely mapped to one concept then expand with synonyms and weighted hyponyms
alternatively use statistical learning methods for word sense disambiguation
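
The threshold-based variant can be sketched in a few lines, which also makes the tuning problem visible. This is a toy illustration: the function name `expand_term` is hypothetical, the comparison is assumed to be "greater than or equal", and the similarity values for "performance" are the ones shown on the later thesaurus slide of this talk.

```python
def expand_term(word, similarity, threshold):
    """Return the concepts c with sim(word, c) >= threshold, most similar first.

    similarity: dict {word: {concept: sim value in [0, 1]}} (toy thesaurus).
    """
    candidates = similarity.get(word, {})
    kept = [(c, s) for c, s in candidates.items() if s >= threshold]
    return sorted(kept, key=lambda cs: (-cs[1], cs[0]))


# Toy thesaurus entry; values taken from the thesaurus / meta-index slide.
THESAURUS = {
    "performance": {"response time": 0.7, "throughput": 0.6, "queueing": 0.3},
}

narrow = expand_term("performance", THESAURUS, 0.5)
broad = expand_term("performance", THESAURUS, 0.2)
```

Lowering the threshold from 0.5 to 0.2 pulls in "queueing" as well, which is how a poorly chosen threshold leads to the topic dilution mentioned on the slide.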

Query Expansion Example
From TREC 2004 Robust Track:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.
Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20],...}}
sorted accesses in s.
Results:
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME
... A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies....
Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but,
for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials....

Top-k Query Processing with Query Expansion
consider expandable query "algorithm and ~performance" with score Σ_{i∈q} max_{j∈onto(i)} { sim(i,j) · s_j(d) }
dynamic query expansion with incremental on-demand merging of additional index lists
+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift
[Figure: B+ tree index on terms with index lists for algorithm, performance, response time, and throughput; thesaurus / meta-index entry for performance: response time: 0.7, throughput: 0.6, queueing: 0.3, delay]
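
The score on this slide, and the idea of merging expansion lists on demand, can be sketched as follows. This is an illustration under assumptions, not the system's code: onto(i) is assumed to be non-empty and to contain the term itself with similarity 1.0, missing scores count as 0, and the helper names `expanded_score`, `_weighted`, and `merged_postings` are hypothetical. The merge relies on Python's `heapq.merge`, which consumes its inputs lazily.

```python
import heapq
from operator import itemgetter


def expanded_score(doc_scores, query, onto):
    """Sum over query terms i of max over j in onto(i) of sim(i, j) * s_j(d).

    doc_scores: {term: s_term(d)} for one document d.
    onto: {term i: {expansion term j: sim(i, j)}}; terms without an entry
    are only matched against themselves.
    """
    total = 0.0
    for term in query:
        expansions = onto.get(term, {term: 1.0})
        total += max(sim * doc_scores.get(j, 0.0) for j, sim in expansions.items())
    return total


def _weighted(sim, term, postings):
    """Lazily yield (sim * score, doc, term) for one expansion term."""
    for doc, score in postings:
        yield (sim * score, doc, term)


def merged_postings(expansions, index):
    """Merge the index lists of all expansion terms into one virtual list.

    index: {term: [(doc, score), ...] sorted by descending score}.
    Entries come out in descending order of sim * score, and each list is
    read only as far as the consumer actually iterates.
    """
    streams = [_weighted(sim, j, index.get(j, [])) for j, sim in expansions.items()]
    return heapq.merge(*streams, key=itemgetter(0), reverse=True)
```

Because only the maximum over onto(i) contributes to the score, an expansion with low similarity cannot outweigh a good match on a closer term, and no cutoff has to be fixed in advance.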

Experiments with TREC-13 Robust-Track Benchmark
on AQUAINT corpus (news articles): docs, 2 GB raw data, 8 GB for all indexes
with Okapi BM25 probabilistic scoring model
50 most difficult queries, e.g.:
"transportation tunnel disasters", potentially expanded into: "earthquake, flood, wind, seismology, accident, car, auto, train,..."
"Hubble telescope achievements", potentially expanded into: "astronomical, electromagnetic radiation, cosmic source, nebulae,..."
                    no exp.      static exp.       static exp.       incr. merge
                    (ε=0.1)      (θ=0.3, ε=0.0)    (θ=0.3, ε=0.1)    (ε=0.1)
#sorted acc.        1,333,756    10,586,175        3,622,686         5,671,493
#random acc.        0            555,176           49,783            34,895
elapsed time [s]
max #terms
relative prec.
MAP
speedup by factor 4 at high precision/recall; no topic drift, no need for threshold tuning; also handles TREC-13 Terabyte benchmark

Exploiting Query Logs and Click Streams
from PageRank: uniformly random choice of links + random jumps
Authority (page q) = stationary prob. of visiting q

Exploiting Query Logs and Click Streams
from PageRank: uniformly random choice of links + random jumps
to QRank:
+ query-doc transitions
+ query-query transitions
+ doc-doc transitions on implicit links (w/ thesaurus)
with probabilities estimated from log statistics
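
Both PageRank and the QRank extension define authority as the stationary probability of a random walk, which can be approximated by power iteration. The sketch below assumes a single graph whose nodes may be documents or queries and whose edge weights stand in for the transition statistics from the logs. The jump probability of 0.15, the uniform handling of nodes without successors, and the function name `stationary_distribution` are assumptions made for the example, not details given in the talk.

```python
def stationary_distribution(transitions, jump=0.15, iterations=100):
    """Approximate the stationary visiting probabilities of a random walk.

    transitions: {node: {successor: weight}}. With probability `jump` the
    walker moves to a uniformly random node, otherwise it follows an
    outgoing edge chosen in proportion to its weight.
    """
    nodes = sorted(set(transitions) | {v for succ in transitions.values() for v in succ})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: jump / n for v in nodes}
        for u in nodes:
            succ = transitions.get(u, {})
            total = sum(succ.values())
            if total == 0:
                # No outgoing edges: spread this node's mass uniformly.
                for v in nodes:
                    nxt[v] += (1 - jump) * rank[u] / n
            else:
                for v, w in succ.items():
                    nxt[v] += (1 - jump) * rank[u] * w / total
        rank = nxt
    return rank
```

With equal weights on hyperlinks only, this is the PageRank setting; adding query nodes plus query-doc, query-query, and implicit doc-doc edges with log-derived weights gives a walk in the spirit of QRank.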

Preliminary Experiments
Setup: Wikipedia docs, 18 volunteers posing Trivial-Pursuit queries
ca. 500 queries, ca. 300 refinements, ca positive clicks, ca implicit links based on doc-doc similarity
Results (assessment by blind-test users):
QRank top-10 result preferred over PageRank in 81% of all cases
QRank has 50.3%, PageRank has 33.9%
Untrained example query "philosophy":
     PageRank                    QRank
1.   Philosophy                  Philosophy
2.   GNU free doc. license       GNU free doc. license
3.   Free software foundation    Early modern philosophy
4.   Richard Stallman            Mysticism
5.   Debian                      Aristotle

Collaborative P2P Search
[Figure: query peer P0 with local index X0 and bookmarks B0; peer lists (directory) with entries such as term g: 13, 11, 45, ...; term a: 17, 11, 92, ...; term f: 43, 65, 92, ...; term c: 13, 92, 45, ...; url x: 37, 44, 12, ...; url y: 75, 43, 12, ...; url z: 54, 128, 7, ...]
Susceptible to misbehavior!
How do we identify and penalize or isolate selfish/malicious peers?

Self-Organization for Isolating Selfish Peers
Rationale: mimic evolution in biological / social networks;
tag selfish vs. altruistic peers and bias interactions towards similar peers
Algorithm:
periodically do
  each peer compares its "utility" with a random peer
  if the other peer has higher utility then
    copy that peer's strategy and links (reproduction)
  mutate with small probability: change behavior, change links
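
One periodic round of this copy-and-mutate scheme can be sketched as below. It is a simplified illustration, not the simulated protocol itself: the peer representation, the function name `slac_round`, the default mutation rate of 0.01, and the choice to rewire to a single random peer on mutation are all assumptions made for the example.

```python
import random


def slac_round(peers, utility, rng, mutation_rate=0.01):
    """Run one periodic round over all peers (needs at least two peers).

    peers: {peer_id: {"strategy": float in [0, 1], "links": set of peer ids}}
    utility: function peer_id -> number (for example hits, i.e. queries answered)
    rng: a random.Random instance, so that runs can be made reproducible
    """
    ids = list(peers)
    for pid in ids:
        others = [x for x in ids if x != pid]
        other = rng.choice(others)
        # Reproduction: imitate a randomly chosen peer that is doing better.
        if utility(other) > utility(pid):
            peers[pid]["strategy"] = peers[other]["strategy"]
            peers[pid]["links"] = set(peers[other]["links"]) - {pid}
        # Mutation: with small probability change behavior and links.
        if rng.random() < mutation_rate:
            peers[pid]["strategy"] = rng.random()
            peers[pid]["links"] = {rng.choice(others)}
```

Copying links together with the strategy is what moves a peer into the neighborhood of the peer it imitates, so peers with similar behavior tend to end up interacting with each other.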

Simulation Results for P2P File Sharing
peers generate queries and answer queries based on P ∈ [0,1], with extreme behaviors: selfish P = 1.0 and altruistic P = 0.0
peer utility = # hits (queries answered)
mutation: change P randomly
[Figure: typical run for 10^4 peers; x-axis: cycles; y-axis: average per node of queries generated and hits]
Selfishness reduces
Average performance increases

The End
Thank you!