Clickprints on the Web: Are there Signatures in Web Browsing Data?

Presentation transcript:

Clickprints on the Web: Are there Signatures in Web Browsing Data? Balaji Padmanabhan The Wharton School University of Pennsylvania Yinghui (Catherine) Yang Graduate School of Management University of California, Davis

Signatures in technology-mediated applications
Unique typing patterns, or “keystroke dynamics”: Miller 1994, Monrose and Rubin 1997, Everitt and McOwan 2003. In an experiment involving 42 user profiles, Monrose and Rubin (1997) show that, depending on the classifier used, between 80 and 90 percent of users can be automatically recognized using features such as the latency between keystrokes and the length of time different keys are pressed.
Writeprints: Li, Zheng and Chen (2006). Experiments involving 10 users on two different message boards suggest that “writeprints” could well exist, since the accuracies obtained were between 92 and 99 percent.
Walkie Talkie? Mäntyjärvi et al. 2005: individuals may have unique “gait” or walking patterns when they move with mobile devices.

Motivating Questions Do unique behavioral signatures exist in Web browsing data? How can behavioral signatures be learned? Why is this useful?

How to Decide Whether Signatures Exist
Two general methods:
1. Build features and classify. Build features/variables to describe users’ activities; learn a classifier (user ID as the dependent variable); check its accuracy on unseen data; answer the question.
2. A patterns-based approach. Pick a pattern representation and search for distinguishing patterns, e.g. for user k, “total_time < 5 minutes and number of pages > 50” may be a unique clickprint since there is no other user for whom this is true.
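The patterns-based check can be sketched in a few lines of Python. This is only an illustration, not the authors' method: the session fields `total_time_min` and `num_pages`, the helper `is_unique_clickprint`, and the toy data are all hypothetical, and it assumes a pattern "is true for a user" when it holds for at least one of that user's sessions.

```python
def is_unique_clickprint(pattern, sessions_by_user, user):
    """Return True if `pattern` holds for `user` and for no other user.

    Assumption: a pattern holds for a user when at least one of that
    user's sessions satisfies it.
    """
    holds_for_user = any(pattern(s) for s in sessions_by_user[user])
    holds_for_others = any(
        pattern(s)
        for other, sessions in sessions_by_user.items()
        if other != user
        for s in sessions
    )
    return holds_for_user and not holds_for_others


def fast_heavy_browsing(session):
    # The example pattern from the slide: short sessions with many pages.
    return session["total_time_min"] < 5 and session["num_pages"] > 50


# Toy data with hypothetical field names.
sessions_by_user = {
    "k": [
        {"total_time_min": 4.0, "num_pages": 60},
        {"total_time_min": 20.0, "num_pages": 12},
    ],
    "j": [{"total_time_min": 3.0, "num_pages": 10}],
}

k_is_unique = is_unique_clickprint(fast_heavy_browsing, sessions_by_user, "k")
j_is_unique = is_unique_clickprint(fast_heavy_browsing, sessions_by_user, "j")
```

Here the pattern distinguishes user k because no session of user j satisfies it.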

The Aggregation Question
Given a unit of analysis (click/session), how much aggregation is needed before there is enough information in each aggregation to uniquely identify a person?
For some level of aggregation, agg, we’d like {c_1, c_2, …, c_agg} → user.
Feature construction, F: {c_1, c_2, …, c_k} → <v_1, v_2, …, v_q, user>
Building a predictive model, M: {<v_1, v_2, …, v_q, user>} → user = M(v_1, v_2, …, v_q)
Find the smallest level of aggregation agg at which unique clickprints (accuracy > threshold) exist.
Key elements: how features are constructed for a group of sessions; how much aggregation needs to be done.
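One way to form the groups {c_1, …, c_agg} is sketched below. The slides do not say how sessions are grouped, so consecutive, non-overlapping groups with the incomplete tail dropped is an assumption, and `aggregate_sessions` is a hypothetical helper name.

```python
def aggregate_sessions(sessions, agg):
    """Split one user's ordered sessions into consecutive groups of size `agg`.

    Assumption: groups do not overlap, and a trailing group with fewer
    than `agg` sessions is discarded.
    """
    if agg < 1:
        raise ValueError("agg must be at least 1")
    n_groups = len(sessions) // agg
    return [sessions[i * agg:(i + 1) * agg] for i in range(n_groups)]


# Seven sessions (represented here by their indices) at agg = 3.
groups = aggregate_sessions(list(range(7)), 3)
```

Each group then becomes one record for feature construction F, labeled with the user it came from.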

An example of aggregations

Experiments and Design
Data: comScore Networks, 50,000 users, 1 year. User-centric data: a session is a user’s activities across Web sites.
Created multiple data sets by combining sessions from 2, 3, 4, 5, 10, 15 and 20 users (140 data sets in total).
User selection: users with household size 1; users with enough sessions for adequate out-of-sample testing (users with > 300 sessions in a year).
First 2/3 of sessions as training, last 1/3 of sessions as hold-out.
Same number of sessions for the selected users in each data set, to guarantee the same class prior before and after aggregation.
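The temporal hold-out can be illustrated as follows. This is a minimal sketch: the `start` timestamp field and the function name `temporal_split` are hypothetical, and integer arithmetic is used for the 2/3 cut to avoid floating-point rounding.

```python
def temporal_split(sessions):
    """Return (training, hold_out): the first 2/3 and last 1/3 of sessions in time order."""
    ordered = sorted(sessions, key=lambda s: s["start"])
    cut = (2 * len(ordered)) // 3
    return ordered[:cut], ordered[cut:]


# Nine toy sessions given out of order; "start" is a hypothetical timestamp.
sessions = [{"start": t} for t in (5, 1, 9, 3, 7, 2, 8, 4, 6)]
train, holdout = temporal_split(sessions)
train_starts = [s["start"] for s in train]
holdout_starts = [s["start"] for s in holdout]
```

Splitting by time rather than at random keeps every hold-out session later than every training session for that user.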

Experiments and Design: The Features
For a single session: (i) the duration; (ii) the number of pages viewed; (iii) the starting time (in seconds after 12:00 am); (iv) the number of sites visited; and (v) binary variables indicating whether each of the top k (= 5, 10) Web sites is visited. Note: these top-k Web sites for each user are identified only from the training set.
For sets of sessions: create variables capturing the distributions of these measures. Mean, median, variance, max and min for the continuous attributes; frequency counts for the top Web sites.
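A possible feature constructor for a set of sessions is sketched below. The field names (`duration`, `num_pages`, `start_time`, `num_sites`, `sites`) and the function `group_features` are hypothetical, and population variance is an assumption, chosen because it is still defined when a group holds a single session.

```python
from statistics import mean, median, pvariance

# Hypothetical field names for the four continuous per-session measures.
CONTINUOUS = ("duration", "num_pages", "start_time", "num_sites")


def group_features(group, top_sites):
    """Summarize one group of sessions as a flat feature dictionary."""
    features = {}
    for name in CONTINUOUS:
        values = [session[name] for session in group]
        features[name + "_mean"] = mean(values)
        features[name + "_median"] = median(values)
        # Population variance is an assumption (defined even for one session).
        features[name + "_var"] = pvariance(values)
        features[name + "_max"] = max(values)
        features[name + "_min"] = min(values)
    for site in top_sites:
        # Frequency count: number of sessions in the group that visit the site.
        features["count_" + site] = sum(
            1 for session in group if site in session["sites"]
        )
    return features


group = [
    {"duration": 10, "num_pages": 4, "start_time": 3600, "num_sites": 2,
     "sites": {"a.com", "b.com"}},
    {"duration": 30, "num_pages": 8, "start_time": 7200, "num_sites": 1,
     "sites": {"a.com"}},
]
features = group_features(group, top_sites=["a.com", "b.com", "c.com"])
```

With four continuous measures, five summary statistics each, and k top sites, every group yields 20 + k features.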

Experiments and Design
Classifier: J4.8 classification tree in Weka.
Model goodness: temporal hold-out samples (last 1/3 of sessions for testing).
Threshold accuracy: 90%; other levels were also used.
Increase the aggregation level and stop when accuracy is high enough or the stopping condition is reached. The stopping condition was set to agg = 30 in these experiments.
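The incremental search over aggregation levels can be sketched as below. The callable `accuracy_at` is a hypothetical stand-in for the whole pipeline (aggregate, build features, train the tree, score on the hold-out); the accuracy values in the example are invented for illustration and are not results from the study. Whether the comparison is strict is not settled by the slides, so `>=` is an assumption.

```python
def smallest_agg(accuracy_at, threshold=0.90, max_agg=30):
    """Return the smallest agg in 1..max_agg whose hold-out accuracy reaches
    `threshold`, or None if the stopping condition (max_agg) is hit first."""
    for agg in range(1, max_agg + 1):
        if accuracy_at(agg) >= threshold:
            return agg
    return None


# Invented accuracies, for illustration only.
toy_accuracy = {1: 0.70, 2: 0.85, 3: 0.92, 4: 0.95}
found = smallest_agg(lambda agg: toy_accuracy.get(agg, 0.99))
not_found = smallest_agg(lambda agg: 0.50)
```

A run that returns None corresponds to a run that never reaches the threshold before agg = 30.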

Results for one specific accuracy threshold
The optimal levels of aggregation averaged across 20 runs for 90% accuracy (top 10 Web sites).

# of users   Mean agg   % runs with agg < 30
2            1.05       100%
3            1.26       95%
4            1.78       90%
5            2.16
10           4.24       85%
15           5.2        75%
20           8.9        50%

Heuristic for Large Problems: A Monotonicity Assumption
accuracy(M | agg1) ≥ accuracy(M | agg2) whenever agg1 ≥ agg2
In words: the goodness of the model when applied to “more aggregated” data is never worse than the goodness of the model applied to “less aggregated” data.
Can then use a binary search procedure to find the optimal agg.
Perhaps not very useful when the useful agg values are much smaller, as in our problems/experiments.
We are continuing to study when this may work and be useful.

Conclusion
Contribution: significance of the problem and initial results.
Challenges: scale; what is a signature?
On-going/future research: pattern-based signatures; application-driven signature problems (e.g. fraud detection, personalization, etc.).

Thank you.

Related Work
Learning user profiles online: Aggarwal et al. (1998), Adomavicius and Tuzhilin (2001), Mobasher et al. (2002).
User profiles for fraud detection: Fawcett and Provost (1996), Cortes and Pregibon (2001).
Data preprocessing: Cooley et al. (1999), Zheng et al. (2003).
Online intrusion detection: Ellis et al. (2004).

Binary search for the optimal aggregation
Start with N users’ Web sessions mixed together.
Assume that the range of aggregations we wish to consider is 1, 2, 3, …, K sessions.
Consider accuracy at agg = K/2.
If this accuracy ≥ threshold, recursively search in the lower half of the sequence.
If this accuracy < threshold, recursively search in the higher half of the sequence.
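The procedure above can be written iteratively, as in this sketch. It is valid only under the monotonicity assumption (accuracy never decreases as agg grows); `accuracy_at` is a hypothetical callable returning hold-out accuracy for a given agg, and the accuracy values are invented for illustration.

```python
def binary_search_agg(accuracy_at, threshold, K):
    """Return the smallest agg in 1..K with accuracy >= threshold, or None.

    Correct only if accuracy is non-decreasing in agg.
    """
    lo, hi = 1, K
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if accuracy_at(mid) >= threshold:
            best = mid      # mid works; look for something smaller
            hi = mid - 1
        else:
            lo = mid + 1    # mid fails; the answer must be larger
    return best


# Invented, monotone accuracies for agg = 1..8.
toy = [0.60, 0.70, 0.80, 0.88, 0.91, 0.93, 0.95, 0.97]
calls = []


def accuracy_at(agg):
    calls.append(agg)
    return toy[agg - 1]


optimal = binary_search_agg(accuracy_at, threshold=0.90, K=8)
```

In this toy run the search evaluates three aggregation levels (4, 6, 5) instead of the five a linear scan from agg = 1 would need, which is where the saving comes from when K is large.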

Histogram of number of sessions

Distribution of the agg values