Blocking Blog Spam with Language Model Disagreement Gilad Mishne (Amsterdam) David Carmel (IBM Israel) AIRWeb 2005.



What is Blog Spam?
Bots post comments that are unrelated to the original blog post.
The comments contain links to irrelevant sites.
The links are used to fool Google.

Current Solutions
Require registration
Require solving a puzzle
Prevent HTML in comments
Prevent comments on old posts
IP filtering
Limit the comment rate

Objective
Filter out blog spam.

Approach
Compare the content of the post with the content of each comment.

KL-Divergence Similarity
Use KL-divergence as a similarity score between a post and a comment.
Lower score = higher similarity.
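The scoring step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not say how the language models are built, so the whitespace tokenizer, the add-alpha smoothing, the direction of the divergence (post relative to comment), and all function names here are assumptions.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=1.0):
    """Smoothed unigram distribution over vocab (add-alpha smoothing is an assumption)."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def post_comment_divergence(post, comment, alpha=1.0):
    """Lower value = comment language is closer to the post language."""
    vocab = set(post.lower().split()) | set(comment.lower().split())
    p = unigram_model(post, vocab, alpha)
    q = unigram_model(comment, vocab, alpha)
    return kl_divergence(p, q)


post = "my cat sat on the warm mat today"
on_topic = "your cat likes the warm mat"
spam = "cheap pills casino poker online"
print(post_comment_divergence(post, on_topic))
print(post_comment_divergence(post, spam))
```

Smoothing is needed because an unsmoothed model assigns zero probability to unseen words, which makes the divergence undefined. In this toy example the on-topic comment scores lower than the spam comment.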

Clustering with a Gaussian Mixture
Use clustering based on a Gaussian mixture model.
Cluster all comments of a post into 2 groups by KL-divergence value.
The group with the higher KL-divergence values is the spam group.
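The clustering step can be sketched with a small one-dimensional EM fit. This is an illustrative sketch under assumptions, not the authors' code: the initialization (means at the smallest and largest value), the fixed iteration count, the variance floor, and the function names are all choices made here for the example.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(values, iterations=50, min_var=1e-6):
    """EM for a 1-D mixture of two Gaussians; returns (means, variances, weights)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]  # assumption: start the two components at the extremes
    spread = max((hi - lo) ** 2 / 4.0, min_var)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value
        resp = []
        for x in values:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate mean, variance and weight of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, min_var)  # floor avoids a collapsed component
            weights[k] = nk / len(values)
    return means, variances, weights


def flag_spam(divergences):
    """True for values assigned to the component with the higher mean."""
    means, variances, weights = fit_two_gaussians(divergences)
    spam_k = 0 if means[0] > means[1] else 1
    flags = []
    for x in divergences:
        dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
        flags.append(dens[spam_k] > dens[1 - spam_k])
    return flags


# Hypothetical divergence values for the comments of one post
scores = [0.10, 0.12, 0.15, 0.11, 0.90, 0.95, 1.00, 0.92]
print(flag_spam(scores))
```

Because the fit is done per post, no labeled training data is needed; the only decision is which of the two components counts as spam, and here that is the one with the higher mean.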

Limitations
Spammers can cheat the system by using words similar to the post in their comments.
Posts and comments are often too short to extract a reliable language model.
– Proposed remedy: follow the links

Experiment Corpus
50 random blog posts with 1024 comments
At least 3 comments per post
32% of the comments are valid
68% of the comments are spam

Sample Spam Comments
(The examples appear only on the original slide.)

Results
Baseline: classify each comment as spam with 68% probability (the share of spam in the corpus).
Threshold multiplier: adjusts the classification boundary.
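Both items can be made concrete with a short sketch. The slides give no formulas, so the following rests on assumptions: the baseline is taken to label each comment independently at random, and the multiplier is taken to scale a base threshold on the KL-divergence score. The function names and example values are hypothetical.

```python
def random_baseline_accuracy(spam_rate):
    """Expected accuracy when each comment is labeled spam with probability spam_rate."""
    # Correct when a spam comment is labeled spam, or a valid one is labeled valid.
    return spam_rate ** 2 + (1.0 - spam_rate) ** 2


def classify_with_multiplier(divergences, base_threshold, multiplier=1.0):
    """Flag a comment as spam when its divergence exceeds the scaled boundary."""
    boundary = base_threshold * multiplier
    return [d > boundary for d in divergences]


print(random_baseline_accuracy(0.68))

# Hypothetical scores and base threshold, for illustration only
scores = [0.2, 0.5, 0.9]
print(classify_with_multiplier(scores, base_threshold=0.5, multiplier=1.0))
print(classify_with_multiplier(scores, base_threshold=0.5, multiplier=0.8))
```

Under the independence assumption the random baseline is right about 56.5% of the time on a corpus with 68% spam. A multiplier below 1 lowers the boundary and flags more comments as spam; a multiplier above 1 raises it and flags fewer.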

Conclusion
No training required
No hand-coded rules
Still in progress:
– Following the links to the linked websites