Crowd Fraud Detection in Internet Advertising Tian Tian 1 Jun Zhu 1 Fen Xia 2 Xin Zhuang 2 Tong Zhang 2 Tsinghua University 1 Baidu Inc. 2 1.

Slides:

Advertisements

Similar presentations

Beliefs & Biases in Web Search

Advertisements

Psychological Advertising: Exploring User Psychology for Click Prediction in Sponsored Search Date: 2014/03/25 Author: Taifeng Wang, Jiang Bian, Shusen.

Contextual Advertising by Combining Relevance with Click Feedback D. Chakrabarti D. Agarwal V. Josifovski.

The Lane’s Gifts vs. Google Report part 2 Andrew Holoman.

Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.

Cloak and Dagger. In a nutshell… Cloaking Cloaking in search engines Search engines’ response to cloaking Lifetime of cloaked search results Cloaked pages.

SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.

SASH Spatial Approximation Sample Hierarchy

Small-world Overlay P2P Network

Time-dependent Similarity Measure of Queries Using Historical Click- through Data Qiankun Zhao*, Steven C. H. Hoi*, Tie-Yan Liu, et al. Presented by: Tie-Yan.

Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:

Spatial Outlier Detection and implementation in Weka Implemented by: Shan Huang Jisu Oh CSCI8715 Class Project, April Presented by Jisu.

1 BotGraph: Large Scale Spamming Botnet Detection Yao Zhao EECS Department Northwestern University.

Beyond the perimeter: the need for early detection of Denial of Service Attacks John Haggerty,Qi Shi,Madjid Merabti Presented by Abhijit Pandey.

Internet Cache Pollution Attacks and Countermeasures Yan Gao, Leiwen Deng, Aleksandar Kuzmanovic, and Yan Chen Electrical Engineering and Computer Science.

Mining Behavior Models Wenke Lee College of Computing Georgia Institute of Technology.

(C) 2001 SNU CSE Biointelligence Lab Incremental Classification Using Tree- Based Sampling for Large Data H. Yoon, K. Alsabti, and S. Ranka Instance Selection.

BotGraph: Large Scale Spamming Botnet Detection Yao Zhao Yinglian Xie *, Fang Yu *, Qifa Ke *, Yuan Yu *, Yan Chen and Eliot Gillum ‡ EECS Department,

Unconstrained Endpoint Profiling (Googling the Internet)‏ Ionut Trestian Supranamaya Ranjan Aleksandar Kuzmanovic Antonio Nucci Northwestern University.

FACE DETECTION AND RECOGNITION By: Paranjith Singh Lohiya Ravi Babu Lavu.

Introduction to Monte Carlo Methods D.J.C. Mackay.

1. Introduction Generally Intrusion Detection Systems (IDSs), as special-purpose devices to detect network anomalies and attacks, are using two approaches.

Water Contamination Detection – Methodology and Empirical Results IPN-ISRAEL WATER WEEK (I 2 W 2 ) Eyal Brill Holon institute of Technology, Faculty of.

Incremental Support Vector Machine Classification Second SIAM International Conference on Data Mining Arlington, Virginia, April 11-13, 2002 Glenn Fung.

A Statistical Anomaly Detection Technique based on Three Different Network Features Yuji Waizumi Tohoku Univ.

Intrusion Detection Jie Lin. Outline Introduction A Frame for Intrusion Detection System Intrusion Detection Techniques Ideas for Improving Intrusion.

Alert Correlation for Extracting Attack Strategies Authors: B. Zhu and A. A. Ghorbani Source: IJNS review paper Reporter: Chun-Ta Li ( 李俊達 )

From Devices to People: Attribution of Search Activity in Multi-User Settings Ryen White, Ahmed Hassan, Adish Singla, Eric Horvitz Microsoft Research,

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 2007 (TPDS 2007)

Chapter 1 Introduction to Simulation

Network Intrusion Detection Using Random Forests Jiong Zhang Mohammad Zulkernine School of Computing Queen's University Kingston, Ontario, Canada.

P.1Service Control Technologies for Peer-to-peer Traffic in Next Generation Networks Part2: An Approach of Passive Peer based Caching to Mitigate P2P Inter-domain.

SIGCOMM 2002 New Directions in Traffic Measurement and Accounting Focusing on the Elephants, Ignoring the Mice Cristian Estan and George Varghese University.

An affinity-driven clustering approach for service discovery and composition for pervasive computing J. Gaber and M.Bakhouya Laboratoire SeT Université.

Scalable and Efficient Data Streaming Algorithms for Detecting Common Content in Internet Traffic Minho Sung Networking & Telecommunications Group College.

CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.

1 Characterizing Botnet from Spam Records Presenter: Yi-Ren Yeh ( 葉倚任 ) Authors: L. Zhuang, J. Dunagan, D. R. Simon, H. J. Wang, I. Osipkov, G. Hulten,

Presented by, Lokesh Chikkakempanna Authoritative Sources in a Hyperlinked environment.

Collusion-Resistance Misbehaving User Detection Schemes Speaker: Jing-Kai Lou 2015/10/131.

Report on Intrusion Detection and Data Fusion By Ganesh Godavari.

Understanding Crowds’ Migration on the Web Yong Wang Komal Pal Aleksandar Kuzmanovic Northwestern University

Improving Cloaking Detection Using Search Query Popularity and Monetizability Kumar Chellapilla and David M Chickering Live Labs, Microsoft.

Mapping Internet Sensors with Probe Response Attacks Authors: John Bethencourt, Jason Franklin, Mary Vernon Published At: Usenix Security Symposium, 2005.

SAMPLING TECHNIQUES. Definitions Statistical inference: is a conclusion concerning a population of observations (or units) made on the bases of the results.

1 LD-Sketch: A Distributed Sketching Design for Accurate and Scalable Anomaly Detection in Network Data Streams Qun Huang and Patrick P. C. Lee The Chinese.

Detecting Influenza Outbreaks by Analyzing Twitter Messages By Aron Culotta Jedsada Chartree 02/28/11.

BEHAVIORAL TARGETING IN ON-LINE ADVERTISING: AN EMPIRICAL STUDY AUTHORS: JOANNA JAWORSKA MARCIN SYDOW IN DEFENSE: XILING SUN & ARINDAM PAUL.

BotGraph: Large Scale Spamming Botnet Detection Yao Zhao, Yinglian Xie, Fang Yu, Qifa Ke, Yuan Yu, Yan Chen, and Eliot Gillum Speaker: 林佳宜.

Anomaly Detection in Data Mining. Hybrid Approach between Filtering- and-refinement and DBSCAN Eng. Ştefan-Iulian Handra Prof. Dr. Eng. Horia Cioc ârlie.

April 28, 2003 Early Fault Detection and Failure Prediction in Large Software Systems Felix Salfner and Miroslaw Malek Department of Computer Science Humboldt.

A Content-Based Approach to Collaborative Filtering Brandon Douthit-Wood CS 470 – Final Presentation.

Neural Network Implementation of Poker AI

AJ Guardado Comp. Sci. 49S.  In order to correctly manage programs (AdSense, AdWords), properly charge for the PPC revenue model, and detect invalid.

SybilGuard: Defending Against Sybil Attacks via Social Networks.

1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.

Post-Ranking query suggestion by diversifying search Chao Wang.

Click to edit Master title style Multi-Destination Routing and the Design of Peer-to-Peer Overlays Authors John Buford Panasonic Princeton Lab, USA. Alan.

Effective Anomaly Detection with Scarce Training Data Presenter: 葉倚任 Author: W. Robertson, F. Maggi, C. Kruegel and G. Vigna NDSS

Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:

Network-based and Attack-resilient Length Signature Generation for Zero-day Polymorphic Worms Zhichun Li 1, Lanjia Wang 2, Yan Chen 1 and Judy Fu 3 1 Lab.

An Energy-Efficient Approach for Real-Time Tracking of Moving Objects in Multi-Level Sensor Networks Vincent S. Tseng, Eric H. C. Lu, & Kawuu W. Lin Institute.

A Nonparametric Method for Early Detection of Trending Topics Zhang Advisor: Prof. Aravind Srinivasan.

CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks (WWW2013) BEUTEL, ALEX, WANHONG XU, VENKATESAN GURUSWAMI, CHRISTOPHER.

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

Network Anomaly Detection Using Autonomous System Flow Aggregates Thienne Johnson 1,2 and Loukas Lazos 1 1 Department of Electrical and Computer Engineering.

Mustafa Gokce Baydogan, George Runger and Eugene Tuv INFORMS Annual Meeting 2011, Charlotte A Bag-of-Features Framework for Time Series Classification.

Experience Report: System Log Analysis for Anomaly Detection

The PageRank Citation Ranking: Bringing Order to the Web

Flavio Toffalini, Ivan Homoliak, Athul Harilal,

Presentation transcript:

Crowd Fraud Detection in Internet Advertising Tian Tian 1 Jun Zhu 1 Fen Xia 2 Xin Zhuang 2 Tong Zhang 2 Tsinghua University 1 Baidu Inc. 2 1

Outline Motivation Characteristic Analysis Detection Methods Empirical Results Conclusion 2

About Internet Advertising Charged by the volume of clicks 3

What is Crowd Fraud? Rise the risk of fraud. Malwares, Auto clickers.. A group of People 4 A group of people driven by economic benefits work together to increase fraudulent traffic on certain targets.

What’s new? Crowd FraudConventional Number of workers Traffic per person Behavior pattern? Have normal clicks? Large Small Random Yes Few Large Regular No 5

Outline Motivation Characteristic Analysis Detection Methods Empirical Results Conclusion 6

Characteristic Analysis we collect click datasets of both normal traffics and crowd fraud traffics. 7

Moderateness the hit frequencies of crowd fraud target queries will be neither too small nor too large. Aim to raise traffic human efficiency is limited 8

Synchronicity Target Synchronicity Surfers can be grouped into coalitions ; each coalition attack a common set of advertisers Temporal Synchronicity most clicks toward an advertiser happen within common short time period 9

Dispersivity Crowd fraud surfers may search unrelated queries Normal: Eye cream, Cleansing milk, Skin care Crowd Fraud: Beach BBQ, Hospital, Royal jelly Same Business Domain Different Business Domain No Real Information Demand 10

Outline Motivation Characteristic Analysis Detection Methods Empirical Results Conclusion 11

Crowd Fraud Detection Based on the above characteristics, we propose a Crowd Fraud Detection Method of Search Engine 12

Construction Stage Remove irrelevant data based on moderateness more than 70% Reorganize the remain logs into a surfer-advertiser inverted list 13 IP: abc,{ } {Ader ID: 26 },time: 456,{Ader ID: 64,time:136},… Click history Click event

Clustering Stage—— Formulation Detect malicious surfer coalitions in which all surfers have similar behavior patterns 14 Click histories

Clustering Stage—— Formulation Sync-similarity numerically equals to the number of targets shared (and happened in the same time period) by two click histories. 15 IP: a,{ {Id:12,Time:135}, {Id:13,Time:45}, {Id:28,Time:97}} IP: b,{ {Id:12,Time:122}, {Id:13,Time:135}, {Id:21,Time:15}} | |<24|45-135|>24 Sync-similarity=0 Sync-similarity=1

Clustering Stage—— Formulation We define Coalition center μ, with the same form of Click history. Looks like clustering! But Number of coalitions is very hard to decide in advance 16

Clustering Stage——Algorithm Inspired by nonparametric clustering DP-means Each normal surfer as an one-member-coalition 1 Update = Assignment Step + UpdateCenter Step 17

Filtering Stage Remove false alarm clusters (e.g. games ad.) False alarm clusters usually focus on one business domain, which invalid the dispersivity. Use query-advertiser inverted list! 18 Query: game,{Advertiser Id: 2,19,79,184,336,…} Center: k,{ Advertiser Id: 2,19,66,79} Query: game,{Advertiser Id: 2,19,79,184,336,…} Center: k,{ Advertiser Id: 2,19,66,79} At least 3 Advertisers in this coalition share same business domain!

Parallelization The real world click logs can be very large, a serial algorithm may cannot be used. So we develop a parallel implementation to make it practical in real scenarios. Its difficulty is in the Assignment step of the nonparametric clustering algorithm. 19

Parallel Assignment Step 20 Divide data into epoches Assign in parallel Save assignments Gather new clusters Merge similar clusters Save merged clusters Go to next epoch

Parallel Assignment Step Notes The parallel algorithm is equivalent to the serial algorithm. The validation step is the bottleneck of algorithm, we can skip it to speed up. 21

Outline Motivation Characteristic Analysis Detection Methods Empirical Results Conclusion 22

Synthetic Data Experiments Label the data is hard, so we build a Synthetic Dataset to test the Recall performance. Normal Part: Simulate 1 million surfers and 100 thousand advertisers. Each normal surfer randomly clicks 10 advertisers Hit times are uniformly sampled from [1, 240] Fraudulent Part: Generate L coalitions, L at 100, 250, 500, 750 and 1,000. each consists of 200 surfers and 5 advertisers and assigns each advertiser a random hit time. 23

Synthetic Data Experiments Number of discovered coalitions for both settings increase about linearly with the number coalitions. Recall rates of the algorithm without validation are lower, but acceptable when coalitions are rare % 65% 3x~4x

Real World Data Experiments One week data of advertisement click logs of a real Chinese commercial search engine. 398 million logs, 6.5 million unique IPs, 330 thousand unique advertiser Ids and 2.9 million unique queries. 25

Convergence & Scalability (a) shows the number of malicious IPs we found during iterating, converges after about 30 iterations. (b) shows the overall running time after each epoch, increases about linearly. 26

Accuracy Found 231 malicious coalitions After filtering, 210 coalitions are removed 29.3 K logs, 20 of the remain satisfy the dispersivity condition 27 1.Compared with commercial rule-based system new discovered logs are labeled by experts. 90% of them are crowd fraud.

Outline Motivation Characteristic Analysis Detection Methods Empirical Results Conclusion 28

Conclusion We formally analyze the crowd fraud problem for Internet advertising. 29

Conclusion We formally analyze the crowd fraud problem for Internet advertising. We present an effective method to detect crowd fraud. 30

Conclusion We formally analyze the crowd fraud problem for Internet advertising. We present an effective method to detect crowd fraud. We scale up the method for large-scale search engine advertising. Experiments on both synthetic and real world data show the effectiveness. 31

Thank You! We formally analyze the crowd fraud problem for Internet advertising. We present an effective method to detect crowd fraud. We scale up the method for large-scale search engine advertising. Experiments on both synthetic and real world data show the effectiveness. 32 Tian