Stable web spam detection using features based on lexical items


1 Stable web spam detection using features based on lexical items
Source: Computers & Security, 2014, 46:79-93. Authors: M. Luckner, M. Gad, P. Sobkowiak. Speaker: Jia Qing Wang. Date: 2016/12/08

2 Outline Introduction Proposed features Experiments & Results
Discussions Conclusion

3 Introduction(1/3) The typical Web spam detection scheme
Extract features: novel high-quality features for web pages (link features, content features, ...). Dimensionality reduction: feature selection and feature extraction methods such as PCA and LDA. Classifier. Experiment results.

4 Introduction(2/3) The main contributions of this paper
Create a web spam detector that works over years, by using datasets from different years as training and testing sets; Select several new features based on lexical items; Verify the high influence of the selected new features; Improve the accuracy of Web spam detection.

5 Introduction(3/3) Data preprocessing
Before calculating the actual features, each analyzed HTML document was transformed into three separate forms: the Visible Text document, which contains only the pure text between tags; the Non-blank Visible Text document, obtained by removing all space characters from the Visible Text document (e.g. "v i a g r a" becomes "viagra"); and the Distinct Domains document, the set of unique domain names extracted from the Visible Text document and the whole origin document.

6 Proposed features(1/7) Basic features
Commonly used statistics in computing features in Web spam detection: the average length, maximum length, and standard deviation of the length. Statistics of links, URLs, domains, and words: the number of words in the title of an HTML document; the number of dots in the document's domain; the count of IP addresses in the Distinct Domains document; the rate of compression by the bzip2 algorithm, the entropy of characters, the entropy of words, and the length, for both origin texts and visible ones. (Note: a link does not necessarily point to a resource, but a URL can always locate one; a domain identifies a host and carries some identifying meaning.)
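The entropy features above can be sketched in a few lines. The exact formulation is not shown on the slide, so Shannon entropy over empirical frequencies is an assumption here:

```python
import math
from collections import Counter

# Sketch of two basic features: the entropy of characters and the entropy
# of words. The paper's exact formulation is not shown on the slide;
# Shannon entropy over empirical frequencies is assumed.
def shannon_entropy(items):
    """Shannon entropy (bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "free free free viagra"
print(shannon_entropy(text))          # entropy of characters
print(shannon_entropy(text.split()))  # entropy of words
```

Highly repetitive spam text yields low entropy for both variants, which is what makes these statistics discriminative.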

7 Proposed features(2/7) Features based on Consonant Clusters (6)
A Consonant Cluster event was defined as a sequence of three or more consonants (extracted by a regular expression). In the Distinct Domains document and the Non-blank Visible Text document, statistics of Consonant Clusters were calculated as new features. (6)
Detecting spam built from created words: usually, created words do not respect the physical properties of phonemes, so long sequences without vowels can be a determinant of spam. Good examples are words with the small letter 'l' in place of the capital letter 'I', such as PRlCE, PROFlTS, or SATlSFACTlON.
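The slides do not give the paper's actual regular expression, so the pattern below is an assumption that follows the slide's definition (a run of three or more consonants; 'y' is treated as a vowel, consistent with the syllable features later in the deck):

```python
import re

# Assumed pattern for a Consonant Cluster event: three or more consecutive
# consonants ('y' is treated as a vowel here, matching the deck's
# vowel-based syllable counting).
CLUSTER_RE = re.compile(r"[bcdfghjklmnpqrstvwxz]{3,}", re.IGNORECASE)

def consonant_clusters(text):
    """Return all consonant-cluster events found in the text."""
    return CLUSTER_RE.findall(text)

# Substituting 'l' for 'I' leaves long vowel-free runs that the pattern catches.
print(consonant_clusters("PRlCE PROFlTS SATlSFACTlON"))
```

From the matched events one can then compute the count, average length, and similar statistics per document form.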

8 Proposed features(3/7) Weird Combinations (2)
Such as: v1agra, p0rn, credit4U, StuffForFree, qwq23ewc
In the Distinct Domains document: total weird combinations / the number of unique domains. In the Visible Text document: total weird combinations / the number of all characters.
Detecting spam that hides prohibited content.
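The paper's precise definition of a weird combination is not on the slide; a minimal sketch, assuming a weird combination is a token that mixes letters and digits, could look like this:

```python
import re

# Assumed definition: a "weird combination" is a token containing both
# letters and digits (v1agra, p0rn, credit4U, ...). The paper's exact
# rule is not given on the slide.
WEIRD_RE = re.compile(r"\b(?=\w*[a-z])(?=\w*\d)\w+\b", re.IGNORECASE)

def weird_combination_rate(text, denominator):
    """Ratio of weird combinations to a given denominator
    (unique-domain count for the Distinct Domains document,
    total character count for the Visible Text document)."""
    return len(WEIRD_RE.findall(text)) / denominator

text = "v1agra p0rn credit4U StuffForFree qwq23ewc"
print(weird_combination_rate(text, len(text)))
```

Note that this heuristic misses purely alphabetic oddities such as StuffForFree; catching those would need an additional rule (e.g. on camel-case runs).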

9 Proposed features(4/7) Analysis of chars (8)
In the Visible Text document and the Distinct Domains document (2): non-ASCII characters / the number of all characters. In the Non-blank Visible Text documents, statistics of all continuous sequences of letters from the Latin alphabet. (3) In the Non-blank Visible Text documents, statistics of all continuous sequences of non-Latin symbols. (3)

10 Proposed features(5/7) Analysis of lexical items (based on words and syllables in the Visible Text document) Word items: a continuous sequence of letters that is not prefixed or suffixed with numbers or underscores. Word Syllable Count feature: the number of continuous sequences of the basic vowel characters. (1) The average count of syllables in a word, the maximum count of syllables in a word, and the standard deviation of the Word Syllable Count distribution. (3) Sentence Count feature: extracted with a regular expression. (1)
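The lexical-item features above can be sketched directly from the slide's definitions; the regular expressions are assumptions, since the paper's exact patterns are not shown:

```python
import re
import statistics

# Word item: a continuous sequence of letters not prefixed or suffixed
# with digits or underscores (the lookarounds enforce this).
WORD_RE = re.compile(r"(?<!\w)[A-Za-z]+(?!\w)")
# Word Syllable Count: number of continuous runs of basic vowel characters.
VOWEL_RUN_RE = re.compile(r"[aeiouy]+", re.IGNORECASE)

def word_syllable_count(word):
    return len(VOWEL_RUN_RE.findall(word))

def syllable_features(text):
    """Average, maximum, and standard deviation of Word Syllable Count."""
    counts = [word_syllable_count(w) for w in WORD_RE.findall(text)]
    return (statistics.mean(counts), max(counts), statistics.pstdev(counts))

print(syllable_features("detection of spam created by bots"))
```

The vowel-run heuristic is approximate (it miscounts silent vowels), but it is cheap and stable, which suits a feature used only for discrimination rather than linguistics.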

11 Proposed features(6/7) Gunning Fog Index
Based on the number of words whose syllable count is greater than 2. It can be useful to detect spam created by Internet bots or by persons with a limited vocabulary.
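The slide omits the formula. The standard Gunning Fog Index is 0.4 * (words per sentence + 100 * complex-word ratio), where a complex word has a syllable count greater than 2, matching the slide's definition. A sketch using the deck's vowel-run syllable heuristic:

```python
import re

VOWEL_RUN_RE = re.compile(r"[aeiouy]+", re.IGNORECASE)

def syllables(word):
    """Syllable count as the number of continuous vowel runs."""
    return len(VOWEL_RUN_RE.findall(word))

def gunning_fog(text):
    """Gunning Fog Index:
    0.4 * (words/sentences + 100 * complex_words/words),
    where a complex word has more than 2 syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if syllables(w) > 2]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

sample = "Buy cheap pills now. Amazing unbelievable discounts today."
print(round(gunning_fog(sample), 2))  # 11.6
```

The sentence splitter here is a simplification; the paper only says the Sentence Count feature is extracted with a regular expression.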

12 Proposed features(7/7) Significance of features (22 new features)

13 Experiments & Results (1/7)
Test the stability of dataset usage; try to find a web spam detector that works over years. Datasets: two datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007, used interchangeably as the learning set and the testing set. Classifier: a modified SVM, where f(x) is the distance to the decision line and p(x) is the calculated probability of correct classification for the point x. Evaluation measure: AUC, the area under the ROC curve.
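The slides do not give the exact mapping from the decision value f(x) to the probability p(x). A common choice for SVMs is Platt's sigmoid, sketched here as an assumption (in practice the parameters A and B are fitted on held-out data):

```python
import math

# Assumed mapping from SVM decision value to probability (Platt scaling).
# A and B are placeholders; the paper's actual calibration is not shown.
def platt_probability(f_x, A=-1.0, B=0.0):
    """Map a signed distance to the decision line to a probability."""
    return 1.0 / (1.0 + math.exp(A * f_x + B))

# Points on the decision line get p = 0.5; points far on the positive
# side approach p = 1.
print(platt_probability(0.0))
print(platt_probability(3.0))
```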

14 Experiments & Results (2/7)
Looking at the three result plots together: in terms of accuracy, when a 2006 set is used as the learning set, the results are stable regardless of the testing set; when a 2007 set is the learning set, the results are much less stable. The specificity also stays at a steady level when the learning set is a 2006 set. The sensitivity on the testing sets is less stable: the 2006 testing sets reach above 80%, while the 2007 sets reach only about 43%-57%. Since sensitivity represents the spam detection rate, and the 2007 data is far more imbalanced than the 2006 data, direct detection performs worse on 2007. Taken separately, these measures are not sufficient to analyze detection performance; the ROC curve combines specificity and sensitivity and is more intuitive, so the paper reports ROC curves and their AUC values for the different experiments.

15 Experiments & Results (3/7)
When the test data comes from the same year, the AUC of the 2006 sets is higher than that of the 2007 sets; when the test data comes from a different year, the stability is better than that of sensitivity, but still worse than that of accuracy. Looking directly at the AUC: the 2007 sets' AUC is much smaller than the 2006 sets', but overall the stability is much better.

16 Experiments & Results (4/7)
Analyze the influence of the new features

17 Experiments & Results (5/7)
To prove that the difference was significant, we performed Wilcoxon's Signed-Rank test for paired scores (one method of hypothesis testing). Reject the null hypothesis (p = 6.4 × 10^-4) at the 0.05 level. Accept (p = 3.4 × 10^-2), at the 0.05 level, the hypothesis that the mean difference between the AUCs is ... The full set of features was statistically significantly better for all pairs except the learning set 2007 I and the testing sets 2006 I and 2006 II. Rejecting the null hypothesis means the two feature sets differ significantly; accepting the hypothesis means ...
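A library routine such as scipy.stats.wilcoxon would normally be used for this test; as a self-contained sketch, the test statistic W for paired scores can be computed as follows:

```python
# Minimal sketch of the Wilcoxon Signed-Rank statistic: rank the absolute
# paired differences (averaging ranks for ties, dropping zero differences),
# then W = min(sum of positive ranks, sum of negative ranks). The p-value
# lookup is omitted here.
def wilcoxon_statistic(xs, ys):
    """Return W = min(W+, W-) for paired scores xs, ys."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    ranked = sorted(diffs, key=abs)
    ranks = {}
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and abs(ranked[j]) == abs(ranked[i]):
            j += 1
        avg = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks.setdefault(abs(ranked[k]), avg)
        i = j
    w_plus = sum(ranks[abs(d)] for d in diffs if d > 0)
    w_minus = sum(ranks[abs(d)] for d in diffs if d < 0)
    return min(w_plus, w_minus)

# Hypothetical paired AUC scores (full feature set vs. reduced set).
print(wilcoxon_statistic([0.9, 0.8, 0.85], [0.7, 0.75, 0.86]))  # 1.0
```

A small W relative to what chance would produce leads to rejecting the null hypothesis that the paired scores come from the same distribution.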

18 Experiments & Results (6/7)
Analysis of stability: split the data from 2006 into 30 random subsets, train on each subset, and test on all data from both 2006 and 2007.

19 Experiments & Results (7/7)
Analysis of stability: Wilcoxon's Signed-Rank test for 30 paired scores. The accuracy for 2006 and 2007 was not statistically significantly different (the difference is 0.01; stable). The specificity is also stable (the difference is 0.001). The sensitivity: the average difference is 0.35. The AUC: the average difference in the AUC between years is 0.18.

20 Discussions (1/2)
Method | UK2006 AUC | UK2007 AUC
Our method | 0.895 | 0.745
Qualified link analysis and language models [1] | 0.88 | 0.76
Multilayer Perceptrons and Support Vector Machines [2] | 0.80 | 0.72
Methods using the 2006 and 2007 data together | AUC
Our method | 0.738
C4.5 tree classifier trained on the data from 2006 and tested on the data from 2007 [3] | 0.73
[1] Araujo L, Martinez-Romo J. Web spam detection: new classification features based on qualified link analysis and language models. IEEE Transactions on Information Forensics and Security 2010;5(3):581-90.
[2] Goh KL, Singh A, Lim KH. Multilayer perceptrons neural network based web spam detection application. In: Signal and Information Processing (ChinaSIP), 2013 IEEE China Summit & International Conference on; 2013. p. 636-40.
[3] Erdelyi M, Benczur AA, Masanes J, Siklosi D. Web spam filtering in internet archives. In: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb '09. New York, NY, USA: ACM; p. 17-20.

21 Discussions (2/2) We used the WEKA toolkit to create random forests (RF), evaluated by 10-fold cross-validation on the WEBSPAM-UK2007 dataset. The obtained AUC (0.991) is better than in the discussed works [4][5][6], which shows the effectiveness of the features.
[4] Bíró I, Siklósi D, Szabó J, Benczúr AA. Linked latent dirichlet allocation in web spam filtering. In: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb '09. New York, NY, USA: ACM; p. 37-40.
[5] Erdélyi M, Garzó A, Benczúr AA. Web spam classification: a few features worth more. In: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, WebQuality '11. New York, NY, USA: ACM; p. 27-34.
[6] Dong C, Zhou B. Effectively detecting content spam on the web using topical diversity measures. In: Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology, WI-IAT '12, vol. 01. Washington, DC, USA: IEEE Computer Society; 2012.

22 Conclusion This paper has shown that data from WEBSPAM-UK2006 can be used to create classifiers that work stably both on the WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets. This paper proved that the proposed new features improved the classification results.

23 Thanks!

