Blocking Blog Spam with Language Model Disagreement
Gilad Mishne (Amsterdam), David Carmel (IBM Israel)
AIRWeb 2005
What is Blog Spam?
  Bots post comments unrelated to the original blog post
  Comments contain links to irrelevant sites
  The links are meant to fool Google into ranking those sites higher
Current Solutions
  Require registration
  Require solving a puzzle before posting
  Disallow HTML in comments
  Disable comments on old posts
  Filter by IP address
  Limit the comment rate
Objective
  Filter out blog spam
Approach
  Compare the content of each comment with the content of the post it is attached to
KL-Divergence Similarity
  Use the KL-divergence between the language models of the post and the comment as a similarity score
  Lower score = higher similarity
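A minimal sketch of such a score, not the authors' implementation. The slides do not say how the language models are estimated, so this assumes whitespace tokenization, additive smoothing with a hypothetical `alpha`, and the direction KL(post || comment). The function names and the sample texts are made up for illustration.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=0.5):
    """Additively smoothed unigram distribution over a fixed vocabulary.

    alpha is an assumed smoothing constant; it keeps every probability
    above zero so the divergence stays finite.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in bits; p and q are dicts over the same vocabulary."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def post_comment_score(post, comment, alpha=0.5):
    """Lower score = comment language is closer to the post language."""
    vocab = set(post.lower().split()) | set(comment.lower().split())
    p = unigram_model(post, vocab, alpha)
    q = unigram_model(comment, vocab, alpha)
    return kl_divergence(p, q)


# Made-up texts for illustration only.
post = "our trip to the alps hiking trails and mountain huts"
on_topic = "which hiking trails in the alps did you like"
off_topic = "cheap pills online casino best poker bonus"

# The on-topic comment should score lower (more similar) than the off-topic one.
print(post_comment_score(post, on_topic) < post_comment_score(post, off_topic))
```

KL-divergence is asymmetric, so swapping the two arguments gives a different score; which direction works better is left open here.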
Clustering with Gaussian Mixture
  Cluster with a Gaussian mixture model
  Split all comments of a post into 2 groups by KL-divergence value
  The group with the higher KL-divergence values is the spam group
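The clustering step can be sketched as a two-component, one-dimensional Gaussian mixture fitted with EM. This is an illustrative sketch under assumptions, not the authors' code: the initialization (components started at the minimum and maximum score), the iteration count, the variance floor, and the sample scores are all hypothetical choices.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(values, iterations=100, min_var=1e-6):
    """EM for a 1-D mixture of two Gaussians.

    Returns (weights, means, variances), each a list of length 2.
    """
    lo, hi = min(values), max(values)
    means = [lo, hi]  # assumed initialization: start at the extremes
    spread = max((hi - lo) ** 2 / 4, min_var)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component has no support; keep its old parameters
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, min_var)
    return weights, means, variances


def label_spam(values):
    """True for values that are more likely under the higher-mean component."""
    weights, means, variances = fit_two_gaussians(values)
    spam = 0 if means[0] > means[1] else 1
    labels = []
    for x in values:
        dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
        labels.append(dens[spam] > dens[1 - spam])
    return labels


# Made-up KL-divergence scores for the comments of one post.
kl_scores = [0.4, 0.5, 0.6, 0.55, 2.9, 3.1, 3.0, 3.3]
print(label_spam(kl_scores))
```

Because the mixture is fitted per post, no labeled training data is needed; the split depends only on how that post's comment scores are distributed.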
Limitations
  Spammers can cheat the system by using words similar to the post in their comments
  Posts and comments are often too short to estimate a language model
    – Follow the links
Experiment Corpus
  50 random blog posts with 1024 comments
  At least 3 comments per post
  32% of comments are valid
  68% of comments are spam
Sample Spam Comments
Result
  Baseline: classify each comment as spam with 68% probability
  Threshold multiplier: adjusts the classification boundary
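A small sketch of the two ideas on this slide, under stated assumptions. The baseline is read here as a random guesser that says "spam" with probability 0.68 on a corpus that is 68% spam. The slide does not say how the classification boundary is obtained, so `threshold` is a hypothetical input and the multiplier is assumed to scale it directly; the scores are made up.

```python
def random_baseline_accuracy(spam_rate, p_guess_spam):
    """Expected accuracy of a classifier that guesses 'spam' at random.

    It is right when it says 'spam' on a spam comment or 'valid' on a
    valid one, and the guess is independent of the true label.
    """
    return spam_rate * p_guess_spam + (1 - spam_rate) * (1 - p_guess_spam)


def classify(kl_scores, threshold, multiplier=1.0):
    """Flag a comment as spam when its score exceeds multiplier * threshold.

    threshold is a hypothetical boundary between the two clusters;
    multiplier < 1 flags more comments, multiplier > 1 flags fewer.
    """
    return [score > multiplier * threshold for score in kl_scores]


# 0.68 * 0.68 + 0.32 * 0.32
print(round(random_baseline_accuracy(0.68, 0.68), 4))  # 0.5648

scores = [0.5, 1.4, 2.2, 3.0]  # made-up KL-divergence values
print(classify(scores, threshold=2.0, multiplier=0.6))  # [False, True, True, True]
print(classify(scores, threshold=2.0, multiplier=1.2))  # [False, False, False, True]
```

Lowering the multiplier catches more spam at the cost of flagging more valid comments; raising it does the opposite.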
Conclusion
  No training
  No hand-coded rules
  Still working on
    – Following the link to the website