Identifying Spam Web Pages Based on Content Similarity
Sole Pera
CS 653 – Term paper project

Overview
• Introduction
• Previous Work
• Methodology
  o Word-similarity
  o Hidden Content
  o Phrase-similarity
• Initial Results
• Conclusion

Introduction
• Information is available on the Web. However, 14% of the Web consists of spam Web pages.
• Spam Web pages:
  o Web pages that receive an unjustifiably favorable relevance or high ranking, regardless of their true value.
  o They attempt to deceive a search engine's relevancy ranking algorithm.
• Serious retrieval problem:
  o The quality of Web search is affected.
  o Search engines' reputation is damaged.
  o Users' trust in the retrieval process is weakened.

Previous Work
• Content Analysis:
  o [Ntoulas et al., 2006] introduce and combine several heuristics based on the content of a Web page (number of words in a page, average length of words, fraction of visible content).
• Link Analysis:
  o [Becchetti et al., 2006] and [Benczur et al., 2005] consider links to and from a given Web page in order to determine if it is spam.

Methodology
• Focus on the title and the body of a Web page in order to determine whether the page is spam:
  o In legitimate Web pages, the title and the body are closely related.
  o In spam Web pages, the title and the body are usually not related.

Methodology
• Computing the title-body similarity:
  o Word-correlation factors, computed using Wikipedia documents:
  o Degree of resemblance between t (a word in a title) and B (the body of a Web page):
  o Degree of similarity between the words in the title and the words in the body of a Web page:
• Status of a Web page:
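The formulas on this slide are not part of the transcript, so the following is only a minimal sketch of how a title-body similarity of this kind could be computed. Everything specific in it is an assumption for illustration: the toy `WORD_CF` table stands in for correlation factors that would be precomputed from Wikipedia documents, the resemblance between a title word t and the body B is combined with a noisy-or rule, the title score is a plain average, and the threshold of 0.5 is hypothetical.

```python
from math import prod

# Hypothetical word-correlation factors in [0, 1]. In the approach described
# above these would be precomputed from Wikipedia documents.
WORD_CF = {
    ("cheap", "discount"): 0.6,
    ("cheap", "price"): 0.5,
    ("flights", "airline"): 0.7,
    ("flights", "travel"): 0.4,
}


def cf(w1, w2):
    """Symmetric lookup of the correlation factor between two words."""
    if w1 == w2:
        return 1.0
    return WORD_CF.get((w1, w2), WORD_CF.get((w2, w1), 0.0))


def resemblance(t, body):
    """Resemblance between title word t and body B (assumed noisy-or rule)."""
    return 1.0 - prod(1.0 - cf(t, b) for b in body)


def title_body_similarity(title, body):
    """Average resemblance of the title words to the body (assumed averaging)."""
    if not title:
        return 0.0
    return sum(resemblance(t, body) for t in title) / len(title)


def is_spam(title, body, threshold=0.5):
    """Flag a page as spam when title and body are weakly related.

    The threshold is a placeholder, not a value from the slides.
    """
    return title_body_similarity(title, body) < threshold
```

With the toy table, a title `["cheap", "flights"]` scores well against a body containing "discount", "price", and "airline", and scores zero against an unrelated body such as `["casino", "poker"]`, which is the contrast the method relies on.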

Methodology
• Fraction of Hidden Content:
  o Proportion of markup content of a given Web page (spam Web pages tend to contain less markup than legitimate Web pages):
  o Threshold value to determine the status of a Web page:
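As a rough illustration of the markup proportion, the sketch below measures the share of a page's characters that are not visible text, using Python's standard `html.parser`. The exact definition and the threshold used in the project are not given in the transcript; the character-based ratio, the treatment of `script` and `style` content as non-visible, and the default threshold of 0.5 are all assumptions.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside of <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def markup_fraction(html):
    """Share of the page's characters that are markup rather than visible text."""
    if not html:
        return 0.0
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    visible = len("".join(parser.chunks))
    return 1.0 - visible / len(html)


def looks_like_spam(html, threshold=0.5):
    """Flag pages with little markup; the threshold is a hypothetical value."""
    return markup_fraction(html) < threshold
```

A page that is mostly raw text with a single tag yields a low markup fraction and is flagged, while a page whose text is wrapped in markup is not.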

Methodology
• Phrase similarity value
  o Use the Odds measure to determine the phrase-correlation factor (based on the word-correlation factor):
  o Phrase similarity threshold value
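The Odds measure itself is standard: for a probability-like value p, odds(p) = p / (1 - p). How the slide derives the phrase-correlation factor from word-correlation factors is not preserved in the transcript, so the sketch below assumes one plausible reading: multiply the correlation factors of the aligned words of two phrases and take the odds of that product. The function name `phrase_correlation` is hypothetical.

```python
from math import inf, prod


def odds(p):
    """Odds of a value p in [0, 1]: p / (1 - p), with odds(1) taken as infinity."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return inf if p == 1.0 else p / (1.0 - p)


def phrase_correlation(word_cfs):
    """Assumed phrase-correlation factor: odds of the product of the
    word-correlation factors of the aligned words in two phrases."""
    return odds(prod(word_cfs))
```

Under this assumption, two phrases whose aligned words have correlation factors 0.5 and 0.5 get odds(0.25), roughly 0.33, and a single weakly correlated word pair pulls the whole phrase score down.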

Overall Spam Detection Approach

Experimental Results
• WEBSPAM-UK2006: 77.9 million classified (spam, non-spam, borderline) Web pages.
• Accuracy – Error Rate, using phrase similarity:
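For reference, accuracy and error rate can be computed from predicted and true labels as in the short sketch below. It assumes binary labels (`True` for spam, `False` for non-spam); how borderline pages were handled in the experiments is not stated in the transcript.

```python
def accuracy(predicted, actual):
    """Fraction of pages whose predicted label matches the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def error_rate(predicted, actual):
    """Fraction of misclassified pages, the complement of accuracy."""
    return 1.0 - accuracy(predicted, actual)
```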

Experimental Results
• Enhancement of the phrase similarity approach:
  o Method A: only phrase similarity.
  o Method B: phrase similarity as well as hidden content.

Experimental Results
• Our performance (in terms of F-Measure) compared with other known spam-detection approaches.
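The F-Measure used for this comparison is commonly the harmonic mean of precision and recall. The sketch below assumes that balanced (F1) form and binary labels with `True` meaning spam; the transcript does not say which variant or weighting was used.

```python
def f_measure(predicted, actual):
    """Balanced F-measure (F1) for the spam class, from boolean label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # spam correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)    # non-spam flagged as spam
    fn = sum(1 for p, a in pairs if not p and a)    # spam that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```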

Conclusion
• By using the phrases (words) in the title and body of a Web page, as well as the fraction of hidden content, we achieve 92% accuracy.
• Computationally inexpensive: can be incorporated into existing search engines to enhance Web searches.

Questions
