SPAM DETECTION IN P2P SYSTEMS Team Matrix: Abhishek Ghag, Darshan Kapadia, Pratik Singh


1 SPAM DETECTION IN P2P SYSTEMS Team Matrix: Abhishek Ghag, Darshan Kapadia, Pratik Singh

2 AGENDA REFRESHER ANALYSIS OF PAPER 1 ANALYSIS OF PAPER 2 ANALYSIS OF PAPER 3 PROGRESS SOFTWARE DESIGN

3 REFRESHER P2P Basics Spam The Spam Detection Problem Approaches to the Spam Detection Problem Proposal References

4 Pollution in P2P File Sharing Systems. Authors: Jian Liang, Rakesh Kumar, Yongjian Xi, Keith W. Ross. Conference: INFOCOM 2005, 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Date: March 2005. Publisher: IEEE. Pages: 1174-1185.

5 OVERVIEW Injecting pollution. The FastTrack crawling system. Automated pollution detection system. False positives and negatives. Pollution is pervasive for recent popular songs. Anti-pollution mechanisms.

6 Introduction P2P file sharing causes heavy losses to the music industry. Three ways to minimize P2P usage: 1) Take the P2P system providers to court. 2) Prosecute individual users for copyright infringement. 3) Sabotage the P2P file-sharing systems by injecting polluted files in large volumes. The polluted files then spread through the network and frustrate P2P users.

7 Classification of P2P Pollution P2P pollution falls mainly into two types: 1) content pollution and 2) metadata pollution. Pollution can also be intentional or non-intentional.

8 The FastTrack Crawling System Developed to gather data from over 30,000 nodes of the FastTrack network. Examines up to 20,000 nodes in one hour. The study was conducted on the collected data. The ranking system is ineffective.

9 Automated Pollution Detection Goal: to detect whether a file is polluted without downloading. Two general heuristics: First, the polluting parties usually tamper with the binary format of the data, so the file can no longer be decoded into the corresponding PCM format; a file that cannot be decoded is therefore treated as polluted. Second, polluted versions of audio files have durations that are significantly shorter or longer than the official CD version.
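The two heuristics on this slide can be sketched as a single check. This is a minimal illustration, not the paper's implementation: the `AudioVersion` record, the `is_polluted` name, and the 10% duration tolerance are assumptions made for the example, and the PCM decoding step itself is left out (its outcome is passed in as a flag).

```python
from dataclasses import dataclass


@dataclass
class AudioVersion:
    """Hypothetical record for one version of a song found in the network."""
    decodable: bool    # did an audio decoder manage to produce PCM output?
    duration_s: float  # measured playing time in seconds


def is_polluted(version: AudioVersion, official_duration_s: float,
                tolerance: float = 0.10) -> bool:
    """Flag a version as polluted if it fails either heuristic.

    tolerance is an assumed threshold (fraction of the official CD length).
    """
    # Heuristic 1: a tampered binary format makes the file non-decodable.
    if not version.decodable:
        return True
    # Heuristic 2: duration deviates significantly from the official version.
    deviation = abs(version.duration_s - official_duration_s) / official_duration_s
    return deviation > tolerance
```

A version must pass both checks to count as clean; being decodable alone is not enough.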

10 Ratings and Pollution Users are allowed to rate the integrity of a file. Rankings from multiple responses are aggregated. A file is falsely rated if it is rated as good although it is not. The higher the rate of false copies, the higher the pollution. New versions are introduced without the old ones being dealt with.

11 Anti-Pollution Mechanisms Anti-pollution mechanisms can be broadly classified into: 1) Detection without downloading: the file is checked for pollution without downloading it. This mechanism depends on the appraisal of other peers who have actually downloaded the file. Techniques of this type include: a) Rigid Trust b) Web of Trust c) Reputation Systems d) Blocking the IP address of the polluter

12 2) Detection with downloading: the polluted file is detected by downloading either a portion of the file or the complete file. The techniques of this type are: a) Matching b) User Filtering

13 Conclusion In almost all P2P file-sharing systems, a large number of recent popular songs have multiple copies and many versions. Over 60% of the copies are polluted, and the pollution is intentional. The rating system of the current FastTrack network is ineffective.

14 Spam Characterization and Detection in Peer-to-Peer File-Sharing Systems Authors: Dongmei Jia, Wai Gen Yee, Ophir Frieder. Title: Spam Characterization and Detection in Peer-to-Peer File-Sharing Systems. Conference: Proceedings of the 17th ACM Conference on Information and Knowledge Management. Date: October 2008. Publisher: ACM. URL: http://portal.acm.org.ezproxy.rit.edu/citation.cfm?id=1458082.1458128&coll=portal&dl=ACM&CFID=14901064&CFTOKEN=96029385

15 Organization of the Paper Introduction Related Work Query Processing Classification of Spam Features of P2P Spam Feature Based Spam Detection Conclusion

16 Introduction Related Work

17

18 Query Processing The client issues a query. Each server compares the query with its own files. On a match, the server returns its system identifier and the file descriptor. The client groups the individual results by key. The client ranks the groups according to some ranking function. The client downloads the file and becomes a server for it.
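The client-side grouping and ranking steps can be sketched as follows. The tuple layout `(key, peer_id, descriptor)` and both function names are hypothetical, and ranking by group size is only one possible choice for the unspecified "ranking function".

```python
from collections import defaultdict


def group_by_key(results):
    """Group individual results, given as (key, peer_id, descriptor) tuples,
    by file key. Returns a dict mapping key -> list of (peer_id, descriptor)."""
    groups = defaultdict(list)
    for key, peer_id, descriptor in results:
        groups[key].append((peer_id, descriptor))
    return dict(groups)


def rank_by_group_size(groups):
    """Assumed ranking function: larger groups (more replicas) rank first.
    sorted() is stable, so ties keep their arrival order."""
    return sorted(groups, key=lambda k: len(groups[k]), reverse=True)


results = [
    ("KEY_A", "peer1", "song one.mp3"),
    ("KEY_B", "peer2", "song two.mp3"),
    ("KEY_A", "peer3", "song one live.mp3"),
]
groups = group_by_key(results)
ranking = rank_by_group_size(groups)
```

Because group size drives the order in this sketch, a peer that floods the result set with replicas of one key can push that key up the ranking.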

19 Spamming Steps 1, 3 and 5. Object Reputation on step 1. Feature based Spam Detection on steps 3 and 5.

20 Classification of Spam Type 1: Files whose replicas have semantically different descriptors. The spammer might name a file after a currently popular song or give multiple names to the same file. E.g., different song titles for the same key 26NZUBS655CC66COLKMWHUVJGUXRPVUF: “12 days after christmas.mp3” “i want you thalia.mp3” “come on be my girl.mp3” …

21 Classification of Spam Type 2: Files with long descriptors. Here a spammer inserts a single long descriptor for the file. E.g., a single replica descriptor for key 1200473A4BB17724194C5B9C271F3DC4: “Aerosmith, Van Halen, Quiet Riot, Kiss, Poison, Acdc, Accept, Def Leappard, Boney M, Megadeth, Metallica, Offspring, Beastie Boys, Run Dmc, Buckcherry, Salty Dog Remix.mp3”

22 Classification of Spam Type 3: Files whose descriptors contain no query terms. A server that wishes to share a file may return it regardless of whether it matches the query. E.g., “Can you afford 0.09 www.BuyLegalMP3.com.mp3”

23 Classification of Spam Type 4: Files that are highly replicated on a single peer. Normal users do not create multiple replicas of the same file on a single server; this is aimed at manipulating the group size. E.g., 177 replicas of the file DY2QXX3MYW75SRCWSSUG6GY3FS7N7YC shared on a single peer.

24 Features of P2P Spam Candidate features: replication degree of a file (numRep); number of hosts on which a file is shared (numHost); average descriptor length of a file (avgDLen); vocabulary size of a file’s group descriptor (numUniqueTerms); variance of terms in replica descriptors of a file; per-host replication degree of a file (repPerHost); average file replication degree on a peer (avgRepDegree).
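A rough sketch of how the per-file features might be computed from the replicas in one result group. Several details are assumptions for illustration: replicas arrive as `(peer_id, descriptor)` pairs, descriptors are tokenized by lowercasing and splitting on whitespace, descriptor length is counted in terms, and the group descriptor is taken to be the union of all replica descriptors' terms. The function name `file_features` is hypothetical.

```python
def file_features(replicas):
    """Compute per-file candidate features from (peer_id, descriptor) pairs."""
    if not replicas:
        raise ValueError("a file group needs at least one replica")
    term_lists = [descriptor.lower().split() for _, descriptor in replicas]
    hosts = {peer_id for peer_id, _ in replicas}
    vocabulary = set().union(*term_lists)  # assumed group descriptor
    num_rep = len(replicas)
    return {
        "numRep": num_rep,                                        # replication degree
        "numHost": len(hosts),                                    # distinct sharing hosts
        "avgDLen": sum(len(t) for t in term_lists) / num_rep,     # mean terms per descriptor
        "numUniqueTerms": len(vocabulary),                        # vocabulary size
        "repPerHost": num_rep / len(hosts),                       # per-host replication
    }


features = file_features([("p1", "a b"), ("p1", "a c"), ("p2", "a b")])
```

avgRepDegree is a per-peer rather than per-file feature, so it is left out of this per-file sketch.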

25 Variance of terms in replica descriptors of a file: Jaccard Distance The Jaccard distance between a single replica descriptor Di and the file group descriptor G is defined as: 1 - |Di ∩ G| / |Di ∪ G|
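A small sketch of the Jaccard distance, assuming both descriptors are represented as sets of terms. The function name and the convention of returning 0.0 for two empty sets are choices made for this example.

```python
def jaccard_distance(replica_terms: set, group_terms: set) -> float:
    """1 - |Di ∩ G| / |Di ∪ G| for term sets Di and G."""
    union = replica_terms | group_terms
    if not union:
        return 0.0  # assumed convention: two empty descriptors are identical
    return 1 - len(replica_terms & group_terms) / len(union)


# A replica sharing 2 of the group's 4 terms: 1 - 2/4
d = jaccard_distance({"a", "b"}, {"a", "b", "c", "d"})
```

A replica whose terms cover the whole group descriptor has distance 0; one sharing no terms with it has distance 1.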

26 Cosine Distance The cosine distance between replica descriptor Di and file group descriptor G is defined as: 1 - (VG·VDi) / (|VG| |VDi|)
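A sketch of the cosine distance, assuming VG and VDi are raw term-frequency vectors built from the descriptors' terms (the slide does not say how the vectors are weighted). The function name and the handling of empty descriptors are choices made for this example.

```python
import math
from collections import Counter


def cosine_distance(replica_terms, group_terms) -> float:
    """1 - (VG·VDi) / (|VG| |VDi|) with term-frequency vectors."""
    v_d = Counter(replica_terms)
    v_g = Counter(group_terms)
    dot = sum(count * v_g[term] for term, count in v_d.items())
    norm = (math.sqrt(sum(c * c for c in v_d.values()))
            * math.sqrt(sum(c * c for c in v_g.values())))
    if norm == 0:
        return 1.0  # assumed convention: an empty descriptor matches nothing
    return 1 - dot / norm


same = cosine_distance(["a", "b"], ["a", "b"])
disjoint = cosine_distance(["a", "b"], ["c", "d"])
```

Identical descriptors give a distance of (numerically) zero, and descriptors with no terms in common give 1.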

27 Effectiveness of Features
P2P feature       % spam in first 20 result files
numRep            75%
numHost           40%
avgDLen           0%
numUniqueTerms    95%
Jaccard           100%
Cosine            100%
repPerHost        100%
From paper: Spam Characterization and Detection in Peer to Peer File Sharing Systems.

28 Vocabulary size of a file’s group descriptor From Paper Spam Characterization and Detection in Peer to Peer File Sharing Systems.

29 Jaccard Distance From Paper Spam Characterization and Detection in Peer to Peer File Sharing Systems.

30 Cosine Distance From Paper Spam Characterization and Detection in Peer to Peer File Sharing Systems.

31 Query Processing 1. The client issues a query. 2. Each server compares the query with its own files. 3. On a match, the server returns its system identifier and the file descriptor. 4. The client groups the individual results by key. 5. The client ranks the groups according to some ranking function. 6. The client downloads the file and becomes a server for it.

32 Algorithm for Spam Detection For Type 2 and 3 spam: 5a. Groups are ranked by cosine similarity (or some other query-dependent ranking function).

33 For Type 1 and 4 spam: 5b. Identify the top-M results as candidate results. 5c. Re-rank the top-M results by either numUniqueTerms or Jaccard/cosine distance. Results that are low in the order are more likely to be Type 1 spam than those higher up. 5d. Identify the top-N results, where N < M, as the new candidate results. 5e. Re-rank the top-N results by their per-host file replication degree. Results that are low in the order are more likely to be Type 4 spam than those higher up.
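Steps 5b to 5e can be sketched as two successive re-rankings. The `Group` record and `rerank` function are hypothetical names, and the sort directions are an inference from the slides rather than something they state: ascending order is used so that groups with a large vocabulary (Type 1) or high per-host replication (Type 4) end up low in the list.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical per-file result group with precomputed features."""
    key: str
    num_unique_terms: int
    rep_per_host: float


def rerank(ranked_groups, m: int, n: int):
    """ranked_groups is already ordered by the query-dependent ranking (5a)."""
    if not n < m:
        raise ValueError("N must be smaller than M")
    # 5b + 5c: take the top-M candidates, re-rank by vocabulary size.
    top_m = sorted(ranked_groups[:m], key=lambda g: g.num_unique_terms)
    # 5d + 5e: keep the top-N, re-rank by per-host replication degree.
    return sorted(top_m[:n], key=lambda g: g.rep_per_host)


ranked = [
    Group("a", 3, 1.0),
    Group("b", 50, 1.0),   # many distinct descriptor terms: Type 1 suspect
    Group("c", 4, 177.0),  # heavily replicated on one peer: Type 4 suspect
    Group("d", 5, 1.0),
    Group("e", 2, 1.0),
]
final = [g.key for g in rerank(ranked, m=4, n=3)]
```

In this example the Type 1 suspect "b" is cut at step 5d and the Type 4 suspect "c" is pushed to the bottom of the final list.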

34 Probe Queries Local descriptors Number of replicas Unique files Identifier for the Peer

35 Papers. Author: Dongmei Jia. Title: Cost Effective Spam Detection Techniques in P2P File Sharing Systems. Conference: Proceedings of the 2008 ACM Workshop on Large-Scale Distributed Systems for Information Retrieval. Date: October 2008. Publisher: ACM. URL: http://portal.acm.org.ezproxy.rit.edu/results.cfm?coll=portal&dl=ACM&CFID=14901064&CFTOKEN=96029385

36 Cost Analysis Random Sampling of Query Results Piggybacking Limiting scope

37 From Paper Cost Effective Spam Detection Techniques in P2P File Sharing Systems.

38 Experimental Results From Paper Cost Effective Spam Detection Techniques in P2P File Sharing Systems.

39 SPAM DETECTION IN P2P SYSTEMS Progress Software Design

40 References Author: Dongmei Jia. Title: Cost Effective Spam Detection Techniques in P2P File Sharing Systems. Conference: Proceedings of the 2008 ACM Workshop on Large-Scale Distributed Systems for Information Retrieval. Date: October 2008. Publisher: ACM. URL: http://portal.acm.org.ezproxy.rit.edu/results.cfm?coll=portal&dl=ACM&CFID=14901064&CFTOKEN=96029385

41 Authors: Dongmei Jia, Wai Gen Yee, Ophir Frieder. Title: Spam Characterization and Detection in Peer-to-Peer File-Sharing Systems. Conference: Proceedings of the 17th ACM Conference on Information and Knowledge Management. Date: October 2008. Publisher: ACM. URL: http://portal.acm.org.ezproxy.rit.edu/citation.cfm?id=1458082.1458128&coll=portal&dl=ACM&CFID=14901064&CFTOKEN=96029385

42 References Authors: Jian Liang, Rakesh Kumar, Yongjian Xi, Keith W. Ross. Title: Pollution in P2P File Sharing Systems. Conference: INFOCOM 2005, 24th Annual Joint Conference of the IEEE Computer and Communications Societies, Proceedings IEEE. Date: March 2005. Publisher: IEEE. URL: http://ieeexplore.ieee.org.ezproxy.rit.edu/stamp/stamp.jsp?arnumber=1498344&isnumber=32100

43 Questions???

