P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer. Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks.

Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks
Gleb Skobeltsyn, Karl Aberer
EPFL, Ecole Polytechnique Fédérale de Lausanne, Switzerland
P2PIR'2006, collocated with CIKM'06, Arlington VA, USA, Nov 11, 2006

2 / 25 Problem definition
Given a document corpus stored in a DHT-based P2P network
Provide an efficient indexing mechanism that finds the matching documents for a multi-term query
Traffic consumption is to be minimized
The storage space provided by peers is limited
Solutions: broadcast, naïve indexing of terms, HDK…

3 / 25 How does the naïve approach work? (1)
Naïve approach 1: store the terms' inverted lists in a DHT
An inverted list contains document ids.
Index entries (key K, inverted list I): (h(T1), {I1, I2}), (h(T2), {I2, I3}), (h(T3), {I4, I5})
Query: "T1 AND T2". Labels in the figure: {I1, I2}, {I2}.
This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor

4 / 25 How does the naïve approach work? (2)
Naïve approach 2: store the terms' inverted lists in a DHT
An inverted list contains document summaries.
Index entries (key K, inverted list I): (h(T1), {I1, I2}), (h(T2), {I2, I3}), (h(T3), {I4, I5})
Query: "T1 AND T2". Labels in the figure: {I2}, {I2}, OR.
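The per-term lookup described on the two slides above can be sketched as follows. This is a minimal illustration, not the original system: the DHT is modelled as a single in-memory dict, `h` uses SHA-1 as an arbitrary stand-in for the DHT hash function, and the `transferred` counter assumes that every fetched list is shipped in full, which is only a crude way to show why large inverted lists cost traffic.

```python
import hashlib


def h(term):
    """Map a term to a DHT key (SHA-1 is an illustrative choice)."""
    return hashlib.sha1(term.encode("utf-8")).hexdigest()


# Toy stand-in for the DHT: key -> inverted list of document ids,
# using the entries shown on the slide.
dht = {
    h("T1"): {"I1", "I2"},
    h("T2"): {"I2", "I3"},
    h("T3"): {"I4", "I5"},
}


def naive_and_query(terms):
    """Fetch one inverted list per term and intersect them.

    Returns the matching ids and the number of ids that were fetched
    (assumption: every fetched list is shipped in full).
    """
    lists = [dht.get(h(t), set()) for t in terms]
    transferred = sum(len(lst) for lst in lists)
    result = set.intersection(*lists) if lists else set()
    return result, transferred


result, transferred = naive_and_query(["T1", "T2"])
print(result, transferred)
```

With the three entries from the slide, the conjunctive query on T1 and T2 fetches four ids to return a single match.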

5 / 25 Can we do better?
Inverted lists can be very large => they consume traffic
Indexing all or selected terms in all documents => huge redundancy in the index, space limitations
Indexing term combinations => how to choose them?
Many index items are never or very rarely used.
Our idea:
– Indexing = caching
– Efficiently fill the available (distributed) storage space with result sets for popular queries
– Use the stored caches to answer queries

6 / 25 What is our idea?
Conventionally, the index is generated purely from the data
This produces a very large number of unused index entries
Let us exploit the query popularity distribution by gathering statistics!
We try to build an index specifically targeted at the current query log
The size of the index is bounded by the available storage provided by peers
Everything that is not indexed is searched via broadcast

7 / 25 What is our idea? Another explanation
Given a set of documents, each document contains a set of terms
We have an inverted index over all extracted terms: {key = h(term)} -> {inverted list}
Documents:
D1: T1, T2, T3
D2: T1, T4, T5
D3: T1, T2, T6
D4: T5, T6, T7
D5: T1, T8, T9
Search keys and inverted lists:
T1: D1, D2, D3, D5
T2: D1, D3
T3: D1
T4: D2
T5: D2, D4
T6: D3, D4
T7: D4
T8: D5
T9: D5
Query popularity:
T1 & T2: very high
T3: high
T3 & T4: high
T7: low
T8 & T9: very low
We can monitor query load statistics:
– Delete unused index entries
– Index term combinations (queries), e.g. T1 & T2: D1, D3

8 / 25 Idea: example
Data:
ID  Data
1   Efficient search in P2P
2   Query processing in P2P networks
3   Search via network flooding
Index:
Term        Inv. list
Efficient   1
Search      1, 3
P2P         1, 2
Query       2
Processing  2
Network     2, 3
Flooding    3
Query statistics: search flooding search query P2P flooding query processing P2P query P2P
Query        Inv. list
Flooding     3
Query & P2P  2
Term              Inv. list
Efficient         1
Query             2
Processing        2
Flooding          3
P2P & Search      1
Network & Search  3
Network & P2P     2

9 / 25 What are we searching for?
Cache all queries
Index all data
Query-driven indexing structure
Query subsumption?
Unused index items?

10 / 25 Contents
Motivation & Idea
Query subsumption
Optimization problem
DCT's indexing and caching strategy:
– Meta-index
– Cache management
– Top-K caching
– Load Balancing
Evaluations
Conclusions

11 / 25 Query subsumption
Given a query q, we are interested in locating at least one cache for a query q' s.t. RS(q') contains RS(q)
Query subsumption: q' subsumes q if all terms of q' are contained in q. That means RS(q') contains RS(q).
We can demonstrate subsumption on a lattice of size $2^m - 1$, where m is the number of terms
Figure: query subsumption if a and cd are cached
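The subsumption test is a subset check on term sets, and the lattice is the set of all non-empty term combinations. The sketch below assumes a four-term vocabulary {a, b, c, d} purely for illustration (the figure's actual vocabulary is not recoverable from the transcript) and treats each letter as one term.

```python
from itertools import combinations


def subsumes(q_prime, q):
    """q' subsumes q if every term of q' also occurs in q."""
    return frozenset(q_prime) <= frozenset(q)


def lattice(terms):
    """All non-empty term combinations: 2^m - 1 queries for m terms."""
    terms = sorted(terms)
    return [
        frozenset(combo)
        for n in range(1, len(terms) + 1)
        for combo in combinations(terms, n)
    ]


# Caches named on the slide: "a" and "cd" (one letter = one term).
cached = [frozenset("a"), frozenset("cd")]

all_queries = lattice("abcd")  # hypothetical four-term vocabulary
answerable = [q for q in all_queries if any(subsumes(c, q) for c in cached)]

print(len(all_queries), len(answerable))
```

Under these assumptions, two cached queries already cover 10 of the 15 possible queries, which is the effect the lattice figure is meant to convey.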

12 / 25 Optimization problem
A vocabulary $T = \{t_1, t_2, \dots, t_m\}$: all terms in the query load.
– A query $q = \{t_1, t_2, \dots, t_n\}$: $q \in 2^T$
– A document $d = \{t_1, t_2, \dots, t_r\}$: $d \in 2^T$
A document d is a valid answer for a query q if and only if d contains q.
A query load $L = \{q_1, q_2, \dots, q_l\}$: $q_i \in 2^T$
– $p(q_i)$ is the probability and $|RS(q_i)|$ the result set size of $q_i \in L$
A cachehit function:
– cachehit(q) = 1, if there exists a cached query q' subsuming q;
– cachehit(q) = 0, otherwise.
Problem: find a set of cached queries $\Omega$ such that:
– $\Omega = \arg\max \sum_{q_i \in L} \mathrm{cachehit}(q_i) \cdot p(q_i)$
– subject to the storage constraint $S_\Omega = \sum_{q_i \in \Omega} |RS(q_i)| < S_0$
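To make the objective and the storage constraint concrete, the sketch below evaluates them on a toy query load and finds the best cache set by exhaustive search. The exhaustive search is only an illustration for tiny inputs and is not the strategy DCT uses; the query load, the candidate set and all numbers are invented for the example.

```python
from itertools import combinations


def subsumes(q_prime, q):
    return frozenset(q_prime) <= frozenset(q)


def cachehit(q, omega):
    """1 if some cached query in omega subsumes q, else 0."""
    return 1 if any(subsumes(c, q) for c in omega) else 0


def objective(omega, load):
    """Sum of cachehit(q_i) * p(q_i) over the query load."""
    return sum(cachehit(q, omega) * p for q, p in load.items())


def storage(omega, rs_size):
    """S_Omega: total result set size of the cached queries."""
    return sum(rs_size[c] for c in omega)


def best_cache_set(candidates, load, rs_size, s0):
    """Brute force over all candidate subsets (exponential, toy sizes only)."""
    best, best_score = (), 0.0
    for k in range(1, len(candidates) + 1):
        for omega in combinations(candidates, k):
            if storage(omega, rs_size) < s0:
                score = objective(omega, load)
                if score > best_score:
                    best, best_score = omega, score
    return set(best), best_score


# Hypothetical query load: query -> probability p(q).
load = {frozenset("acd"): 0.5, frozenset("ab"): 0.25, frozenset("cde"): 0.25}
# Hypothetical cache candidates with their result set sizes |RS(q)|.
rs_size = {frozenset("a"): 100, frozenset("cd"): 40, frozenset("ab"): 10}

omega, score = best_cache_set(list(rs_size), load, rs_size, s0=60)
print(omega, score)
```

Here the cache for "a" would cover two of the three queries but does not fit the budget, while "cd" and "ab" together fit and cover the whole load.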

13 / 25 DCT: Indexing and caching strategy
DCT caches the result sets of certain queries without constraining the physical cache locations
Each peer runs two services:
– Meta-index service: stores index items with cache locations
– Caching service: answers a query from a cache
Meta-index: given a query q, finds a list of cache locations capable of answering q.
Caching service: returns the result set for q from the cache of q' (q' subsumes q).

14 / 25 DCT: Meta-index
The meta-index is based on the standard DHT indexing functionality.
Index update: if a peer $\pi$ caches a query q, it advertises the cache availability in the meta-index: it inserts a tuple {q -> address($\pi$)} at the peer responsible for a random term from q.
Lookup: if a query $q = t_1 \& t_2 \& \dots \& t_n$ is submitted, every peer responsible for one of $t_1, t_2, \dots, t_n$ is asked to provide the set of caches it indexes that subsume q. One of them (if any) is chosen randomly.

15 / 25 DCT: Meta-index example
q = "acd" is submitted at $\pi_{orig}$
1. $\pi_{orig}$ looks up the meta-index: it contacts peers $\pi_a$, $\pi_c$ and $\pi_d$ *
2. $\pi_a$, $\pi_c$ and $\pi_d$ respond with the known locations of caches subsuming q *
3. $\pi_{orig}$ randomly selects a cache from the obtained list. Assume "cd" is picked.
4. RS(q) is sent to $\pi_{orig}$
* interactions with $\pi_d$ are not shown
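The meta-index update and lookup steps can be sketched with one dict bucket per term standing in for "the peer responsible for that term". The class name `MetaIndex`, its method names and the peer addresses are hypothetical; a real deployment would route each bucket access through the DHT.

```python
import random
from collections import defaultdict


class MetaIndex:
    """Toy meta-index: one bucket per term, no real network or DHT routing."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        # term -> list of (cached query, address of the caching peer)
        self._by_term = defaultdict(list)

    def advertise(self, query, address):
        """Index the cache under ONE randomly chosen term of the query."""
        q = frozenset(query)
        term = self._rng.choice(sorted(q))
        self._by_term[term].append((q, address))
        return term

    def lookup(self, query):
        """Ask the bucket of every query term for caches that subsume q."""
        q = frozenset(query)
        found = []
        for term in q:
            for cached, address in self._by_term.get(term, []):
                if cached <= q:  # cached query subsumes q
                    found.append((cached, address))
        return found

    def pick_cache(self, query):
        """Choose one of the subsuming caches at random, if any."""
        found = self.lookup(query)
        return self._rng.choice(found) if found else None


index = MetaIndex(seed=0)
index.advertise("a", "peer-1")    # one letter = one term
index.advertise("cd", "peer-2")
index.advertise("ab", "peer-3")   # does not subsume "acd"

found = index.lookup("acd")
chosen = index.pick_cache("acd")
print(found, chosen)
```

Advertising under a single random term is enough for the lookup to succeed: any cached q' that subsumes q has all its terms in q, so the term it was indexed under is always among the buckets the lookup visits.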

16 / 25 DCT: Cache Management
Each peer provides some storage space $s_0$ for caches
Caches with low profits are evicted: profit(q) = popularity(q) / (|RS(q)| + 1)
Every time a peer has to broadcast a query, it tries to cache it
The query q with the result set size |RS(q)| is cached if:
– There is enough free space to store RS(q), or
– There is NOT enough free space, but the least profitable caches can be dropped to fit the cache for q.
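A minimal sketch of the admission and eviction rule follows. The slide does not say when dropping existing caches is allowed, so this version assumes that a cache may only be evicted if its profit is lower than that of the incoming query; the class name `PeerCache` and the numbers in the example are hypothetical, and re-caching an already cached query is not handled.

```python
class PeerCache:
    """Toy per-peer cache with profit-based eviction (sizes in records)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # query -> (popularity, result set size)

    @staticmethod
    def profit(popularity, rs_size):
        # profit(q) = popularity(q) / (|RS(q)| + 1), as on the slide
        return popularity / (rs_size + 1)

    def free_space(self):
        return self.capacity - sum(size for _, size in self.entries.values())

    def try_cache(self, query, popularity, rs_size):
        """Cache the query if it fits, evicting less profitable caches if needed."""
        query = frozenset(query)
        if rs_size > self.capacity:
            return False
        free = self.free_space()
        victims = []
        if rs_size > free:
            new_profit = self.profit(popularity, rs_size)
            by_profit = sorted(
                self.entries.items(), key=lambda kv: self.profit(*kv[1])
            )
            for cached, (pop, size) in by_profit:
                # Assumption: never evict a cache at least as profitable as q.
                if self.profit(pop, size) >= new_profit:
                    break
                victims.append(cached)
                free += size
                if free >= rs_size:
                    break
            if free < rs_size:
                return False
        for cached in victims:
            del self.entries[cached]
        self.entries[query] = (popularity, rs_size)
        return True


cache = PeerCache(capacity=100)
r1 = cache.try_cache("a", popularity=50, rs_size=60)   # fits
r2 = cache.try_cache("cd", popularity=30, rs_size=30)  # fits, 10 left
r3 = cache.try_cache("ab", popularity=40, rs_size=20)  # evicts "a"
r4 = cache.try_cache("xy", popularity=1, rs_size=80)   # rejected
print(r1, r2, r3, r4, sorted("".join(sorted(q)) for q in cache.entries))
```

In the example, "ab" displaces the least profitable cache "a", while the rarely asked, large query "xy" is rejected because no cached entry has a lower profit.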

17 / 25 DCT: Top-K caching
Problem:
– A popular query q with a large result set might NOT be cached, as its profit is relatively low
Solution:
– Introduce a top-k cache:
– It can serve only q, no subsumption;
– But it consumes little space and avoids broadcasting the popular q
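A top-k cache can be sketched as an exact-match store that keeps only the first k results of a query. The slide does not specify how results are ranked, so the sketch assumes the result list arrives already ranked; the class name `TopKCache` and the document ids are hypothetical.

```python
class TopKCache:
    """Toy top-k cache: exact-match lookups over truncated result lists."""

    def __init__(self, k):
        self.k = k
        self._store = {}  # query -> first k results

    def put(self, query, ranked_results):
        # Assumption: ranked_results is already ordered best-first.
        self._store[frozenset(query)] = list(ranked_results[: self.k])

    def get(self, query):
        """Return the stored top-k list for exactly this query, else None."""
        return self._store.get(frozenset(query))


topk = TopKCache(k=3)
topk.put("ab", ["d7", "d2", "d9", "d4", "d1"])  # large result set, truncated

hit = topk.get("ab")    # the popular query itself is served
miss = topk.get("abc")  # a narrower query is NOT served from this cache
print(hit, miss)
```

One plausible reason for the "no subsumption" restriction, reflected here, is that a truncated list may lack documents that match a narrower query, so it cannot stand in for the full result set.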

18 / 25 Evaluation

19 / 25 Evaluations: query load and data
Source data:
– English Wikipedia XML dump (6 GB), 05.2006
– Two Wikipedia query traces from August and September 2004
Query load properties (August trace):
– 1.3M unique queries, asked 4.6M times during the month
– 500K repeated at least twice, 800K asked only once
– 225K unique terms in both traces (after stemming)
– Average number of terms in a query = 2.6
Java simulation:
– Simulates a number of virtual peers
– Each peer provides 200K records of storage space

20 / 25 Evaluations: how much storage do we need?
98% maximum cache hit rate with unlimited storage
81% maximum cache hit rate with unlimited storage but no subsumption

21 / 25 Evaluations: Traffic consumption
100 peers, 200K records each
Converges to an 85% cache hit rate with 100 x 200K = 20M records of global cache capacity
The naïve approach requires at least 240M records for the term index (if built for query load terms only)

22 / 25 Evaluations: stress test
300 peers, 200K records each
Converges to a 97% cache hit rate with 300 x 200K = 60M records of capacity
Very small cache hit drop when the query load changes, thanks to subsumption
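The capacity figures on the two slides above follow from simple arithmetic, worked out below using only numbers stated on the slides. Because the 240M figure is given as a lower bound for the naïve term index, the ratios are lower bounds as well; the function and constant names are illustrative.

```python
RECORDS_PER_PEER = 200_000        # storage offered by each simulated peer
NAIVE_TERM_INDEX = 240_000_000    # "at least 240M records" for the naive index


def global_capacity(peers, records_per_peer=RECORDS_PER_PEER):
    """Total cache capacity of the network, in records."""
    return peers * records_per_peer


for peers, hit_rate in ((100, 0.85), (300, 0.97)):
    capacity = global_capacity(peers)
    ratio = NAIVE_TERM_INDEX / capacity
    print(f"{peers} peers: {capacity:,} records, "
          f"hit rate {hit_rate:.0%}, naive index >= {ratio:.0f}x larger")
```

With 100 peers the cache uses at most one twelfth of the storage of the naïve term index, and with 300 peers at most one quarter.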

23 / 25 Evaluations: load balancing
Cache imbalance => only several peers are overloaded
Meta-index imbalance => has less impact, can be partially avoided

24 / 25 Conclusions
Distributed Cache Table: a (quite) large-scale distributed cache for P2P IR applications, based on both:
– Query load
– Data distribution
Properties:
– Efficiently utilizes and adapts to the available storage space
– Trades off a huge index size against extra traffic costs for broadcasting rare queries
– Subsumption is important: it makes the cache resilient to query load changes
– Sufficiently load balanced
– Requires 1-2 orders of magnitude less traffic than the naïve approach
– Requires substantially less storage than a per-term index

25 / 25 Last slide
Thank you for your attention! Questions?
