
Clickstream Models & Sybil Detection Gang Wang ( 王刚 ) UC Santa Barbara




1 Clickstream Models & Sybil Detection Gang Wang ( 王刚 ) UC Santa Barbara

2 Modeling User Clickstream Events
- User-generated events
  - E.g. profile load, link follow, photo browse, friend invite
  - Assume we have event type, userID, timestamp
- Intuition: Sybil users act differently from normal users
  - Goal-oriented: focus on specific actions, fewer "extraneous" events
  - Time-limited: focused on efficient use of time, smaller gaps?
- If Sybil users are forced to mimic normal users, is that a win?

3 System Overview
[Figure: system diagram with the labels Clickstream Log, Sequence Clustering, Cluster Coloring, Known Good Users, Legit, Sybils, Incoming Clickstream]

4 Clickstream Models
- Clickstream log
  - User clicks (click type) with timestamp
- Modeling clickstreams
  - Event-only Sequence Model: order of events, e.g. ABCDA
  - Time-based Model: sequence of inter-arrival times, e.g. {t1, t2, t3, …}
  - Hybrid Model: sequence of click events with time, e.g. A(t1)B(t2)C(t3)D(t4)A
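The three models can be sketched in a few lines of Python. This is only an illustration, not the speaker's code: it assumes each click is a `(click type, timestamp)` pair with single-character click types, and the function names are made up for the example.

```python
from typing import List, Optional, Tuple

# Assumed layout of one click: (click type, timestamp in seconds).
Event = Tuple[str, float]


def event_sequence(events: List[Event]) -> str:
    """Event-only model: click types in time order, e.g. 'ABCDA'."""
    ordered = sorted(events, key=lambda e: e[1])
    return "".join(kind for kind, _ in ordered)


def inter_arrival_times(events: List[Event]) -> List[float]:
    """Time-based model: gaps between consecutive clicks."""
    times = sorted(t for _, t in events)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def hybrid_sequence(events: List[Event]) -> List[Tuple[str, Optional[float]]]:
    """Hybrid model: each click paired with the gap to the next click.

    The last click has no following gap, so it is paired with None,
    mirroring the trailing 'A' in A(t1)B(t2)C(t3)D(t4)A.
    """
    ordered = sorted(events, key=lambda e: e[1])
    gaps: List[Optional[float]] = list(inter_arrival_times(events)) + [None]
    return [(kind, gap) for (kind, _), gap in zip(ordered, gaps)]


# Hypothetical session used only to exercise the functions.
session = [("A", 0.0), ("B", 2.0), ("C", 3.0), ("D", 10.0), ("A", 11.0)]
```

For the hypothetical `session`, `event_sequence` yields the slide's `ABCDA` example and `inter_arrival_times` yields the four gaps between the five clicks.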

5 Clickstream Clustering
- Similarity graph
  - Vertices: users (or sessions)
  - Edges: weighted by the similarity score of the two users' clickstreams
- Clustering similar clickstreams together
  - Graph partitioning using METIS
- Q: How to compare two clickstreams?
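A minimal sketch of building the similarity graph, assuming clickstreams are stored as strings keyed by user ID. The Jaccard similarity over click types is a stand-in chosen for brevity, not the similarity score used in the talk, and the METIS partitioning step is not shown.

```python
from itertools import combinations
from typing import Callable, Dict, Tuple


def jaccard_similarity(s1: str, s2: str) -> float:
    """Placeholder similarity: overlap of the click types seen in each stream."""
    a, b = set(s1), set(s2)
    if not a and not b:
        return 1.0  # two empty streams are treated as identical
    return len(a & b) / len(a | b)


def build_similarity_graph(
    clickstreams: Dict[str, str],
    similarity: Callable[[str, str], float],
) -> Dict[Tuple[str, str], float]:
    """Return one weighted edge per user pair: {(user_a, user_b): weight}."""
    edges: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(sorted(clickstreams), 2):
        edges[(a, b)] = similarity(clickstreams[a], clickstreams[b])
    return edges


# Hypothetical users and clickstreams.
graph = build_similarity_graph(
    {"u1": "AAB", "u2": "AAC", "u3": "AAB"}, jaccard_similarity
)
```

The resulting edge dictionary is the input a graph partitioner would consume. Comparing every pair of users is quadratic in the number of users, which this sketch makes no attempt to reduce.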

6 Distance Functions of Each Model
- Click Sequence (CS) Model
  - Ngram overlap
    - S1 = AAB, S2 = AAC
    - ngram1 = {A, B, AA, AB, AAB}
    - ngram2 = {A, C, AA, AC, AAC}
  - Ngram+count
    - S1 = AAB, S2 = AAC
    - ngram1 = {A(2), B(1), AA(1), AB(1), AAB(1)}
    - ngram2 = {A(2), C(1), AA(1), AC(1), AAC(1)}
    - Euclidean distance between the normalized count vectors:
      V1 = (2,1,0,1,0,1,1,0)/6, V2 = (2,0,1,1,1,0,0,1)/6
- Time-based Model
  - Compare the distributions of inter-arrival times
  - K-S test
- Hybrid Model
  - Bucketize inter-arrival times
  - Compute 5grams (similar to the CS Model)
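The distance functions can be sketched as follows. The n-gram+count distance follows the slide's worked example (counts normalized by their total, then Euclidean distance). The slide does not give a formula for "ngram overlap", so the Jaccard form below is an assumption, and `ks_statistic` computes only the two-sample K-S statistic (the largest gap between the two empirical CDFs), not a full test with p-values. All function names are illustrative.

```python
from bisect import bisect_right
from collections import Counter
from math import sqrt
from typing import Sequence


def ngram_counts(seq: str, max_n: int) -> Counter:
    """Count every contiguous n-gram of length 1..max_n in seq."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def ngram_overlap_distance(s1: str, s2: str, max_n: int) -> float:
    """Assumed form of 'ngram overlap': 1 - |shared n-grams| / |all n-grams|."""
    a, b = set(ngram_counts(s1, max_n)), set(ngram_counts(s2, max_n))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def ngram_count_distance(s1: str, s2: str, max_n: int) -> float:
    """Euclidean distance between count vectors normalized by total count."""
    c1, c2 = ngram_counts(s1, max_n), ngram_counts(s2, max_n)
    t1 = sum(c1.values()) or 1  # avoid dividing by zero for empty input
    t2 = sum(c2.values()) or 1
    keys = set(c1) | set(c2)
    return sqrt(sum((c1[k] / t1 - c2[k] / t2) ** 2 for k in keys))


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample K-S statistic for non-empty samples a and b."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )
```

For S1 = AAB and S2 = AAC with n-grams up to length 3, `ngram_counts` reproduces the counts on the slide, and the two normalized vectors differ by 1/6 in six of their eight coordinates.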

7 Detection in a Nutshell
- Inputs:
  - Trained clusters
  - Input sequences for testing
- Methodology: given a test sequence A
  - K nearest neighbor: find the top-k nearest sequences in the trained clusters
  - Nearest Cluster: find the nearest cluster based on average distance to the sequences in the cluster
  - Nearest Cluster (center): pre-compute the center(s) of each cluster, then find the nearest cluster center
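The three detection methods can be sketched as below. Several details are assumptions not stated on the slide: the k nearest neighbors vote by majority, a cluster "center" is taken to be its medoid (the member with the smallest total distance to the others), and `toy_distance` is a throwaway distance used only to make the example runnable.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Distance = Callable[[str, str], float]


def toy_distance(s1: str, s2: str) -> float:
    """Placeholder distance: L1 difference of click-type counts."""
    c1, c2 = Counter(s1), Counter(s2)
    return float(sum(abs(c1[k] - c2[k]) for k in set(c1) | set(c2)))


def knn_label(test: str, labeled: List[Tuple[str, str]],
              distance: Distance, k: int = 3) -> str:
    """K nearest neighbor: majority label among the k closest sequences."""
    nearest = sorted(labeled, key=lambda item: distance(test, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def nearest_cluster_avg(test: str, clusters: Dict[str, List[str]],
                        distance: Distance) -> str:
    """Nearest Cluster: smallest average distance to a cluster's members."""
    return min(
        clusters,
        key=lambda c: sum(distance(test, s) for s in clusters[c]) / len(clusters[c]),
    )


def medoid(seqs: List[str], distance: Distance) -> str:
    """Assumed 'center': member with the smallest total distance to the rest."""
    return min(seqs, key=lambda c: sum(distance(c, s) for s in seqs))


def nearest_cluster_center(test: str, centers: Dict[str, str],
                           distance: Distance) -> str:
    """Nearest Cluster (center): compare only against pre-computed centers."""
    return min(centers, key=lambda c: distance(test, centers[c]))


# Hypothetical trained clusters, already labeled.
clusters = {"legit": ["ABCD", "ABCE", "ABDD"], "sybil": ["FFFF", "FFFG", "FFGF"]}
labeled = [(s, name) for name, seqs in clusters.items() for s in seqs]
centers = {name: medoid(seqs, toy_distance) for name, seqs in clusters.items()}
```

The center-based variant does one distance computation per cluster at test time instead of one per training sequence, which is consistent with the low overhead the talk attributes to it.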

8 Clustering Sequences
- How well can each method separate Sybils from legitimate users?

| Model (sequence type) | Distance function | (False positives, False negatives) of users, listed in order for 20, 50, 100 clusters |
|---|---|---|
| Click Sequence Model (Categories) | unigram | (3%, 6%), (1%, 7%), (2%, 4%) |
| | unigram+count | (1%, 4%), (1%, 3%) |
| | 10gram | (1%, 3%), (2%, 2%) |
| | 10gram+count | (1%, 4%), (2%, 4%), (1%, 2%) |
| Time-based Model | K-S test | (9%, 8%), (2%, 10%), (5%, 10%) |
| Hybrid Model (Categories) | 5gram | (3%, 2%), (2%, 2%) |
| | 5gram+count | (3%, 4%), (4%, 5%), (1%, 2%) |

9 Detection Accuracy
- Basics
  - Train on one group of users and test on the other group
  - Clusters trained using the Hybrid Model
- Key takeaways
  - High accuracy with 50 clicks in the test sequence
  - Nearest Cluster (Center) method achieves high accuracy with minor computation overhead

(False positives, False negatives) of users:

| Number of clicks in the sequence (length) | K-nearest Neighbors (k=3) | Nearest Cluster (Avg. Distance) | Nearest Cluster (Center) |
|---|---|---|---|
| Length <= 50 | (1.5%, 2.1%) | (0.6%, 2.6%) | (0.4%, 2.3%) |
| Length <= 100 | (0.9%, 1.8%) | (0.2%, 2.5%) | (0.3%, 2.3%) |
| All | (0.6%, 3%) | (0.4%, 2.8%) | (0.4%, 2.3%) |

10 Can the Model Be Effective Over Time?
- Experiment method
  - Use the first two weeks of data to train the model
  - Test on the following two weeks of data

(False positives, False negatives) of users:

| Model | K-nearest Neighbors (k=3) | Nearest Cluster (Avg. Distance) | Nearest Cluster (Center) |
|---|---|---|---|
| Click Sequence Model | (1.8%, 1%) | (3%, 2%) | (3%, 0.8%) |
| Hybrid Model | (3%, 2%) | (3%, 1%) | (1.2%, 1.4%) |

11 Still Ongoing Work
- With broad interest and applications
- As Sybil detection tool
  - Code being tested internally at Renren
    - Trained with 10K users (2-week log)
    - Testing on 1 million users (1-week log)
    - 5 Sybil clusters, 22K suspicious profiles
  - Further improvement
    - Training with longer clickstreams (half of the users have <5 clicks in 2 weeks)
    - More conservative in labeling Sybil clusters
- As user modeling tool
  - Code being tested by LinkedIn as user profiler

12 Some Useful Tools
- Graph partitioning
  - METIS
- Community detection
  - Louvain code: https://sites.google.com/site/findcommunities/

13 Other Ongoing Works/Ideas
- Fighting against crowdturfing
  - Crowdturfing: real users are paid to spam
  - How to detect these malicious real users
    - User behavior model
    - Network-wide temporal anomaly detection
- Information dissemination
  - Content sharing via social edges
    - How often will users click on the content
    - How often will users comment on the content
  - Sybil detection, targeted ad placement

14 Questions? Thank You!

