
Automatic Detection of Social Tag Spams Using a Text Mining Approach
Hsin-Chang Yang, Associate Professor
Department of Information Management, National University of Kaohsiung

Outline
- Introduction
- Association Discovery by SOM
- Tag Spam Detection
- Experimental Results
- Conclusions

Social Bookmarking: Why?
Social bookmarking services (also known as folksonomies) are gaining popularity because they offer the following benefits:
- Less effort in Web page annotation
- Improved retrieval precision
- Simpler Web page classification

How Does Folksonomy Work?
Simple: a user ($u_i$) annotates a Web page ($o_j$) with a set of tags ($T_{ij}$).
A folksonomy is generally represented as a set of tuples $(u_i, o_j, T_{ij})$, where $u_i \in U$, $o_j \in O$, and $t_{ij} \in T$.
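The tuple representation can be sketched in a few lines of Python. This is only an illustration: the `Post` type, the `tags_for_page` helper, and the example users, URLs, and tags are all hypothetical and do not come from the talk.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Post(NamedTuple):
    """One folksonomy tuple (u_i, o_j, T_ij)."""
    user: str             # u_i, an element of U
    page: str             # o_j, an element of O
    tags: FrozenSet[str]  # T_ij, a subset of the tag vocabulary T


# Hypothetical example data, not taken from the dataset in the talk.
posts: List[Post] = [
    Post("alice", "http://example.org/som", frozenset({"som", "clustering"})),
    Post("bob", "http://example.org/som", frozenset({"som", "neural-network"})),
    Post("alice", "http://example.org/tags", frozenset({"folksonomy"})),
]


def tags_for_page(all_posts: List[Post], page: str) -> Set[str]:
    """Union of the tag sets that any user attached to `page`."""
    result: Set[str] = set()
    for post in all_posts:
        if post.page == page:
            result |= post.tags
    return result


som_tags = tags_for_page(posts, "http://example.org/som")
```

Here `som_tags` collects the tags that both users assigned to the same page, which is the collaborative aspect of a folksonomy.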

Characteristics of Folksonomy
- Collaboration
- Semantic relatedness
- Possibility of spam

Tag Spams
- Tags that are unrelated or improperly related to the content or semantics of the annotated Web pages.
- They arise for advertisement or promotional purposes.
- They mislead users and degrade retrieval results.

System Architecture
[Diagram: Web pages and tags each pass through preprocessing (producing Web page vectors and tag vectors), SOM training (producing synaptic weight vectors), and labeling (producing page clusters and tag clusters); association discovery then produces the page/tag associations.]

Preprocessing
- Bag-of-words approach.
- Web page $P_i$ is transformed to a binary vector $\mathbf{P}_i$.
- $T_i$, which is the tag list of $P_i$, is transformed to a binary vector $\mathbf{T}_i$.
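A minimal sketch of binary bag-of-words encoding, assuming whitespace tokenization and lowercasing (the talk does not state how text was tokenized). The function names and example strings are hypothetical.

```python
from typing import Iterable, List


def build_vocabulary(documents: Iterable[str]) -> List[str]:
    """Sorted list of the distinct lowercase tokens across all documents."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def to_binary_vector(text: str, vocabulary: List[str]) -> List[int]:
    """1 where the vocabulary term occurs in `text`, 0 elsewhere.

    Term frequency is deliberately ignored: the vector is binary.
    """
    present = set(text.lower().split())
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical page texts used only to build a small vocabulary.
pages = ["self organizing map tutorial", "cheap pills online"]
vocabulary = build_vocabulary(pages)
vector = to_binary_vector("map tutorial map", vocabulary)
```

The same two functions could be applied to tag lists by joining the tags of a page into one string, with a separate vocabulary for tags.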

SOM Training
- All $\mathbf{P}_i$ and $\mathbf{T}_i$ were trained by the self-organizing map algorithm separately.
- Two maps, $M_P$ and $M_T$, were obtained after the training.
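For readers unfamiliar with self-organizing maps, the following is a small pure-Python sketch of the standard online SOM update. The learning-rate and neighborhood schedules (linear decay, Gaussian neighborhood) and all function and parameter names are assumptions; the talk does not specify them.

```python
import math
import random
from typing import List, Sequence


def best_matching_unit(weights: List[List[float]], x: Sequence[float]) -> int:
    """Index of the neuron whose weight vector is closest to x (squared Euclidean)."""
    def sqdist(w: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(w, x))
    return min(range(len(weights)), key=lambda n: sqdist(weights[n]))


def train_som(vectors: List[Sequence[float]], rows: int, cols: int,
              epochs: int, lr0: float = 0.5, seed: int = 0) -> List[List[float]]:
    """Train a rows x cols map; returns one synaptic weight vector per neuron."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # assumed linear decay
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        for x in vectors:
            bmu_row, bmu_col = divmod(best_matching_unit(weights, x), cols)
            for n, w in enumerate(weights):
                r, c = divmod(n, cols)
                grid_d2 = (r - bmu_row) ** 2 + (c - bmu_col) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius * radius))
                # Pull each neuron toward x, scaled by its grid distance to the BMU.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


# Tiny hypothetical input; the talk used 10 x 10 maps with 400 and 200 epochs.
demo_weights = train_som([[1, 0, 0], [0, 0, 1]], rows=2, cols=2, epochs=20)
```

In the setting of the talk, this routine would be run twice, once on the page vectors and once on the tag vectors, giving two independent maps.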

Labeling
- We labeled each Web page on $M_P$ by finding its most similar neuron.
- A page cluster map (PCM) was obtained after all pages were labeled.
- The same approach was applied to all tag lists on $M_T$ to obtain the tag cluster map (TCM).
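The labeling step can be sketched as a nearest-neuron assignment. This sketch assumes that "most similar" means smallest Euclidean distance to a neuron's weight vector, which the talk does not state; the names and the two-neuron example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def nearest_neuron(weights: List[Sequence[float]], x: Sequence[float]) -> int:
    """Index of the weight vector closest to x (squared Euclidean distance)."""
    return min(
        range(len(weights)),
        key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], x)),
    )


def label_map(weights: List[Sequence[float]],
              vectors: List[Sequence[float]]) -> Dict[int, List[int]]:
    """Map each neuron index to the indices of the vectors labeled to it.

    Neurons that attract at least one vector form the clusters of the map.
    """
    clusters: Dict[int, List[int]] = defaultdict(list)
    for idx, x in enumerate(vectors):
        clusters[nearest_neuron(weights, x)].append(idx)
    return dict(clusters)


# Hypothetical trained weights for a two-neuron map and three page vectors.
weights = [[1.0, 0.0], [0.0, 1.0]]
page_vectors = [[1, 0], [0.9, 0.1], [0, 1]]
pcm = label_map(weights, page_vectors)
```

Applying `label_map` to the tag vectors with the tag map's weights would give the TCM in the same way.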

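Association Discovery
- Finding associations between page clusters and tag clusters.
- We used a voting scheme to find the associations.
[Diagram: each pair $P_i$, $T_i$ adds one vote (+1) to the association between the PCM cluster of $P_i$ and the TCM cluster of $T_i$.]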
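One way to read the voting scheme is sketched below: every training pair casts one vote for the (page cluster, tag cluster) combination it was labeled to. The `min_votes` cutoff used to turn vote counts into related pairs is an assumption; the talk does not say how votes are converted into associations.

```python
from collections import Counter
from typing import List, Set, Tuple


def count_votes(page_labels: List[int], tag_labels: List[int]) -> Counter:
    """One vote per training pair for its (page cluster, tag cluster) combination."""
    votes: Counter = Counter()
    for page_cluster, tag_cluster in zip(page_labels, tag_labels):
        votes[(page_cluster, tag_cluster)] += 1
    return votes


def related_pairs(votes: Counter, min_votes: int = 1) -> Set[Tuple[int, int]]:
    """Cluster pairs with at least `min_votes` votes (the cutoff is an assumption)."""
    return {pair for pair, n in votes.items() if n >= min_votes}


# Hypothetical cluster labels for three training pairs.
page_labels = [0, 0, 3]   # PCM neuron index for each page
tag_labels = [5, 5, 7]    # TCM neuron index for each tag list
votes = count_votes(page_labels, tag_labels)
strong = related_pairs(votes, min_votes=2)
```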

Architecture of Tag Spam Detection
[Diagram: the incoming Web page and the incoming tag list are preprocessed into an incoming page vector and an incoming tag vector, then labeled against the PCM and TCM to give a labeled page cluster and a labeled tag cluster; spam detection uses these together with the page/tag associations and outputs the tag spams.]

Spam Detection
Two types of tag spam detection:
- Document-scope detection (post-level detection): the whole tag list is identified as spam.
- Tag-scope detection (tag-level detection): individual tags are identified as spam.
Let $P_I$ and $T_I$ be the incoming Web page and its tag list, respectively.
Let $P_I$ and $T_I$ be labeled to a page cluster and a tag cluster, respectively.

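Document-Scope Detection
Relatedness between the page cluster and the tag cluster: [equation not preserved in the transcript]
- $Q$: neighborhood of [symbol not preserved]
- $A = [a_{ij}]$ is the correlation matrix between PCM and TCM; $a_{pk} = 1$ if page cluster $p$ and tag cluster $k$ are related, otherwise $a_{pk} = 0$
- $D$: geometric distance between two clusters
$T_I$ is identified as spam if [condition not preserved in the transcript]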

Tag-Scope Detection
- A tag is spam if it is inconsistent with the other tags in the same tag cluster.
- Let $T_i = \{t_{ij}\}$ be a tag list and [definition of $W$ not preserved in the transcript].
- An incoming tag $t_{Ij} \in T_I$ is spam if $t_{Ij} \notin W$.
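The membership test can be sketched as follows. Because the definition of $W$ did not survive in the transcript, this sketch assumes that $W$ is the set of all tags from training tag lists labeled to the same tag cluster as the incoming tag list; that reading, and every name and data value below, is an assumption rather than the author's stated method.

```python
from typing import List, Set


def cluster_vocabulary(tag_lists: List[List[str]], tag_labels: List[int],
                       cluster: int) -> Set[str]:
    """Assumed W: all tags of training tag lists labeled to `cluster`."""
    words: Set[str] = set()
    for tags, label in zip(tag_lists, tag_labels):
        if label == cluster:
            words.update(tags)
    return words


def spam_tags(incoming_tags: List[str], w: Set[str]) -> List[str]:
    """Incoming tags that fall outside W, kept in their original order."""
    return [tag for tag in incoming_tags if tag not in w]


# Hypothetical training tag lists with their TCM cluster labels.
training_tag_lists = [["som", "clustering"], ["som", "neural"], ["recipes", "food"]]
training_labels = [4, 4, 9]
w = cluster_vocabulary(training_tag_lists, training_labels, cluster=4)
flagged = spam_tags(["som", "casino", "neural"], w)
```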

Experimental Result: Dataset
- 1500 Web page / tag list pairs collected from www.delicious.com
- Each pair was inspected manually at both post level and tag level
- 583 distinct Web pages
- Sizes of vocabularies: Web pages 13437; tag lists 5157
- Average number of tags per page: 4.7

Experimental Result: Parameters
- Map sizes: PCM 10 × 10; TCM 10 × 10
- Training epochs: PCM 400; TCM 200
- [symbol not preserved in the transcript]: 0.7

Experimental Result
Number of training / test data: 1000 / 500
Confusion matrix for document-scope detection:

                      Actual spam   Actual non-spam
Predicted spam            118              65
Predicted non-spam         44             273

Accuracy = (118 + 273) / 500 = 78.2%
Recall = 118 / (118 + 44) = 72.8%
Precision = 118 / (118 + 65) = 64.5%
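The three reported figures follow directly from the four confusion-matrix cells. The short check below recomputes them; the function name and dictionary keys are illustrative only.

```python
from typing import Dict


def detection_metrics(tp: float, fp: float, fn: float, tn: float) -> Dict[str, float]:
    """Accuracy, recall, and precision from confusion-matrix counts.

    tp: predicted spam, actually spam      fp: predicted spam, actually non-spam
    fn: predicted non-spam, actually spam  tn: predicted non-spam, actually non-spam
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Cell values from the document-scope confusion matrix on the 500 test pairs.
metrics = detection_metrics(tp=118, fp=65, fn=44, tn=273)
percentages = {name: round(value * 100, 1) for name, value in metrics.items()}
```

The same function accepts the fractional cell averages reported for the cross-validated results.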

Further Result of Document-Scope Detection
Result after 10-fold cross validation
Confusion matrix:

                      Actual spam   Actual non-spam
Predicted spam           123.1            62.3
Predicted non-spam        43.6           271

Accuracy = (123.1 + 271) / 500 = 78.8%
Recall = 123.1 / (123.1 + 43.6) = 73.8%
Precision = 123.1 / (123.1 + 62.3) = 66.4%

Further Result of Tag-Scope Detection
Result after 10-fold cross validation
Confusion matrix:

                      Actual spam   Actual non-spam
Predicted spam            1.4*             0.7
Predicted non-spam        0.4              2.2

* average number of tags per page

Accuracy = (1.4 + 2.2) / 4.7 = 76.6%
Recall = 1.4 / (1.4 + 0.4) = 77.8%
Precision = 1.4 / (1.4 + 0.7) = 66.7%

Conclusions
- A novel scheme for tag spam detection based on text mining.
- Relatedness between Web pages and tags was discovered using self-organizing maps.
- The scheme uses only the content of Web pages rather than user behavior.

Thanks for your attention.
