
1 Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems. Presented by: Dongmei Jia, Illinois Institute of Technology, April 11, 2008. D. Jia, W. G. Yee, L. T. Nguyen, O. Frieder. Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems. In Proc. of the 7th IEEE Intl. Conf. on Peer-to-Peer Computing (P2P), Ireland, Sept. 2007.

2 Outline: Objective, Problem, Proposed Approach, Experimental Results, Conclusions.

3 Objective: To improve the accuracy of search in P2P file-sharing systems. –Finding poorly described data.

4 Problem Statement. Characteristics: –Binary files (e.g., music files). –Each replica is described with a descriptor, which is sparse and varies across peers. –Queries are conjunctive. Problem: poor/sparse descriptions make files hard to match with queries!

5 Approach. Peers independently search the network for other descriptors of their local files. –They incorporate those descriptors into the local replica's descriptor. –The search is implemented by "probe" queries.

6 Example. Two descriptors of file F exist: D1 = {Mozart} on Peer1 and D2 = {piano} on Peer2. The query Q = {Mozart, piano} is conjunctive, so neither replica matches it and no result is returned for Q. Peer1 then probes ("tell me your description of F"), receives D2 = {piano}, and merges it to obtain D1' = {Mozart, piano}, a descriptor that now matches Q.
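The flow of this example can be made concrete with a small sketch (a minimal illustration, not code from the paper; all names are hypothetical): conjunctive matching fails against each sparse descriptor alone and succeeds once the probe reply is merged in.

```python
# Minimal sketch of the example above; names and structure are illustrative only.

def matches(query: set, descriptor: set) -> bool:
    """Conjunctive matching: every query term must appear in the descriptor."""
    return query <= descriptor

q = {"mozart", "piano"}
d1 = {"mozart"}   # Peer1's descriptor of file F
d2 = {"piano"}    # Peer2's descriptor of file F

assert not matches(q, d1) and not matches(q, d2)   # no result returned for Q

d1_tuned = d1 | d2           # Peer1 merges the descriptor learned via its probe
assert matches(q, d1_tuned)  # the tuned replica now satisfies Q
```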

7 How P2P File-Sharing Systems Work. Peers share a set of files. Each replica of a file is identified by a descriptor. –Every descriptor contains a unique hash key (MD5) identifying the file. A query is routed to all reachable peers. Each query result contains the matching replica's descriptor and the identity of the source server.
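As a rough illustration of this data model (the exact descriptor layout is not given in the transcript, so the fields and function names below are assumptions), a replica descriptor could pair the MD5 key of the file contents with the peer's sparse term set:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    """One peer's description of a local file replica (illustrative layout)."""
    key: str                                 # MD5 hash identifying the file globally
    terms: set = field(default_factory=set)  # sparse, peer-specific description terms

def make_descriptor(file_bytes: bytes, terms: set) -> Descriptor:
    # The MD5 of the file contents serves as the replica's shared key.
    return Descriptor(key=hashlib.md5(file_bytes).hexdigest(), terms=set(terms))

d = make_descriptor(b"...binary audio data...", {"mozart"})
# A query result would then carry this descriptor plus the identity of the source server.
```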

8 Probe Query Design. A probe query contains one term: the key of a file. –It matches all replicas of the file reached by the probe query.

9 Design Challenges. When to probe; what file to probe; what to do with probe results; how to control cost; and how to do all of this in a fully distributed way.

10 When to Probe? When a peer is not busy and under-utilized –measured by the number of responses returned, N_r. When a peer has a high desire to participate –measured by the number of files published, N_f. When the system is active –measured by the number of queries received, N_q.

11 When to Probe? (Cont'd) Triggering mechanism: probe when T > N_r / (N_f · N_q) + N_p, with T, N_f, N_q > 0, where T is a user-defined threshold, N_p is the number of probe queries already performed, and N_r / (N_f · N_q) is the number of results returned per shared file per incoming query. All of these metrics are maintained locally by each peer, so the mechanism is easy to implement.
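A direct translation of this trigger into code could look like the following sketch (based on the inequality as reconstructed above; the variable names are mine, not the paper's):

```python
def should_probe(T: float, n_r: int, n_f: int, n_q: int, n_p: int) -> bool:
    """Fire a probe when T > N_r / (N_f * N_q) + N_p.

    All counters are kept locally by the peer:
      n_r -- query responses this peer has returned
      n_f -- files this peer publishes
      n_q -- queries this peer has received
      n_p -- probe queries this peer has already issued
    """
    if n_f <= 0 or n_q <= 0:      # the condition requires N_f, N_q > 0
        return False
    return T > n_r / (n_f * n_q) + n_p
```

Under this reading, a larger T lets peers probe more often, while the N_p term throttles peers that have already issued many probes.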

12 What File to Probe? The goal is to increase the file's participation level. Criteria to choose from: –the file that has been probed least (RR); –the file that appears in the fewest or most query responses (LPF or MPF); –the file with the smallest descriptor.

13 What to Do with Probe Results? Select terms from the result set to add to the local descriptor: –random (rand), –weighted random (wrand), –most frequent (mfreq), –least frequent (lfreq). Stop when the local descriptor size limit is reached.
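One way to implement these four term-copying policies is sketched below (assuming probe results arrive as term sets and that the size limit counts distinct terms; the function and parameter names are illustrative, not from the paper):

```python
import random
from collections import Counter

def copy_terms(probe_results, local_terms, size_limit, strategy="wrand"):
    """Add terms found in probe results to the local descriptor until size_limit is hit."""
    pool = Counter(t for d in probe_results for t in d if t not in local_terms)
    tuned = set(local_terms)
    while pool and len(tuned) < size_limit:
        terms, freqs = zip(*pool.items())
        if strategy == "rand":        # uniform over candidate terms
            pick = random.choice(terms)
        elif strategy == "wrand":     # weighted by how often the term appears in results
            pick = random.choices(terms, weights=freqs, k=1)[0]
        elif strategy == "mfreq":     # most frequent candidate first
            pick = max(pool, key=pool.get)
        elif strategy == "lfreq":     # least frequent candidate first
            pick = min(pool, key=pool.get)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        tuned.add(pick)
        del pool[pick]
    return tuned
```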

14 Experimental Setup. (Slide tables: query length distribution; parameters used in the simulation.)

15 Metrics. Let A be the set of replicas of the desired file and R the result set of the query. MRR (mean reciprocal rank): the average over queries of 1/rank of the first relevant result. Precision = |A ∩ R| / |R|. Recall = |A ∩ R| / |A|.
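With A and R as defined above, the metrics can be computed as in this sketch (standard IR definitions; the slide itself does not spell out the formulas, so the reconstruction is mine):

```python
def precision(A: set, R: set) -> float:
    """Fraction of returned results that are replicas of the desired file."""
    return len(A & R) / len(R) if R else 0.0

def recall(A: set, R: set) -> float:
    """Fraction of the desired file's replicas that were returned."""
    return len(A & R) / len(A) if A else 0.0

def mrr(ranked_results_per_query, relevant_per_query) -> float:
    """Mean over queries of 1/rank of the first relevant result (0 if none is returned)."""
    total = 0.0
    for results, A in zip(ranked_results_per_query, relevant_per_query):
        rank = next((i + 1 for i, r in enumerate(results) if r in A), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_results_per_query)
```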

16 Data. TREC wt2g Web track. –An arbitrary set of 1,000 Web documents from 37 Web domains. –Preprocessing: stemming and removal of HTML markup and stop words. –Final data set: 800,000 terms, of which 37,000 are unique.

17 Experimental Results – Applying Probe Results to Local Descriptors. (Figure: MRR with various term copying techniques.)

18 Experimental Results – Probe Triggering. No probing (base case). Random: –assign each peer a probability of probing; –5K probes are issued over the 10K queries. T5K: –tune T to perform 5K probes over the 10K queries.

19 Experimental Results – Probe Triggering (Cont'd). MRR improves by about 20% with random triggering and by about 30% with T5K.

20 Experimental Results – Probe Triggering (Cont'd). Probing dramatically increases the MRR of longer queries, mitigating the query over-specification problem.

21 Experimental Results – Probe Triggering (Cont'd). (Figure: effect of various probing rates on MRR.)

22 Experimental Results – Probe File Selection. Rand – randomly select a file to probe (base case). LPF – least popular first: min query hits; on a tie, min descriptor size. MPF – most popular first: max query hits; on a tie, min descriptor size. RR-LPF – round-robin LPF: min probes; on a tie, LPF. RR-MPF – round-robin MPF: min probes; on a tie, MPF.
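The round-robin variants amount to lexicographic sort keys over per-file counters; RR-MPF, for instance, could be sketched as follows (the counter names are assumptions for illustration, not from the paper):

```python
def pick_file_rr_mpf(stats):
    """RR-MPF: least-probed file first; break ties by most query hits,
    then by smallest descriptor."""
    return min(stats, key=lambda f: (stats[f]["probes"],          # round-robin: fewest probes
                                     -stats[f]["query_hits"],     # MPF: most popular first
                                     stats[f]["descriptor_size"]))

# Example of the per-file bookkeeping a peer might keep locally:
stats = {
    "key_a": {"probes": 2, "query_hits": 40, "descriptor_size": 5},
    "key_b": {"probes": 1, "query_hits": 12, "descriptor_size": 3},
}
print(pick_file_rr_mpf(stats))   # -> "key_b" (least probed wins before popularity)
```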

23 Experimental Results – Probe File Selection (Cont'd). Compared with the Rand base case, only RR-MPF offers both better performance (~+10%) and lower cost (~-10%). Relative to Rand:

         MRR   Cost   Recall   Prec.   Pct. Contained
Rand      =     =       =       =       =
LPF       <     <       <       <       >
MPF       <     >       <       <       <
RR-LPF    <     <       >       <       >
RR-MPF    >     <       >       >       >

24 Putting Them Together… Probe configuration: T5K, RR-MPF, wrand. Probing improves MRR by ~30%.

25 Explanation. Triggering (tuning T): probes are issued by underactive peers. File selection: RR avoids the same file being probed repeatedly; MPF improves a peer's ability to share popular files. Term copying: wrand selects from the bag of words in proportion to term frequency, allowing new queries to be matched with a bias toward more strongly associated terms.

26 How to Control Cost? Cost components: –probe query results; –file query results. Cost metric: average number of responses per file query. Randomly sample each type of result on the server side with probability P. What is the impact on performance?
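Server-side sampling of this kind could look like the sketch below (assuming each matching response is kept independently with probability P; this is an illustration, not the paper's implementation):

```python
import random

def sample_responses(responses, p):
    """Return each matching response (file-query or probe-query) with probability p."""
    return [r for r in responses if random.random() < p]
```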

27 Performance/Cost Analysis. (Figure: total per-file-query cost for different file and probe query sampling rates.)

28 Performance/Cost Analysis (Cont'd). MRR is increased in all sampling settings.

29 Performance/Cost Analysis (Cont'd). Example: sampling can both reduce cost (by ~15%) and improve performance (by ~18%).

30 Conclusions and Future Work. Probing enriches data descriptions: MRR is improved by ~30%. Sampling is effective in controlling cost: it can reduce cost by 15% and improve performance by 18% at the same time. Future work: better ways of controlling cost.

31 Thank You! Any Questions?

