Clickprints on the Web: Are there Signatures in Web Browsing Data?


1 Clickprints on the Web: Are there Signatures in Web Browsing Data?
Balaji Padmanabhan The Wharton School University of Pennsylvania Yinghui (Catherine) Yang Graduate School of Management University of California, Davis

2 Signatures in technology-mediated applications
Unique typing patterns, or “keystroke dynamics”: Miller 1994, Monrose and Rubin 1997, Everitt and McOwan 2003. In an experiment involving 42 user profiles, Monrose and Rubin (1997) show that, depending on the classifier used, between 80 and 90 percent of users can be automatically recognized using features such as the latency between keystrokes and the length of time different keys are pressed. Writeprints: Li, Zheng and Chen (2006). Experiments involving 10 users in two different message boards suggest that “writeprints” could well exist, since the accuracies obtained were between 92 and 99 percent. Walkie Talkie? Mäntyjärvi et al. 2005: individuals may have unique “gait” or walking patterns when they move with mobile devices.

3 Motivating Questions Do unique behavioral signatures exist in Web browsing data? How can behavioral signatures be learned? Why is this useful?

4 How to Decide Whether Signatures Exist
Two General Methods: Build features and classify: build features/variables to describe users’ activities, learn a classifier (user ID as the dependent variable), check its accuracy on unseen data, and answer the question. A patterns-based approach: pick a pattern representation and search for distinguishing patterns. E.g., for user k, “total_time < 5 minutes and number of pages > 50” may be a unique clickprint since there is no other user for whom this is true.
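The pattern-based check can be sketched in a few lines. The session records and the pattern's thresholds below are illustrative stand-ins, not data or values from the paper; the only idea taken from the slide is that a pattern is a clickprint for user k when it matches some of k's sessions and no other user's.

```python
# Hypothetical session summaries: (user_id, total_time_minutes, pages_viewed).
sessions = [
    ("alice", 4.2, 63),
    ("bob", 12.0, 40),
    ("carol", 3.5, 18),
    ("bob", 6.1, 55),
]

def pattern(total_time, pages):
    """The example pattern from the slide: total_time < 5 min and pages > 50."""
    return total_time < 5 and pages > 50

def is_unique_clickprint(target_user):
    """The pattern is a unique clickprint for target_user if it matches at
    least one of their sessions and no session of any other user."""
    matches = {u for (u, t, p) in sessions if pattern(t, p)}
    return matches == {target_user}

print(is_unique_clickprint("alice"))  # True: only alice satisfies the pattern
```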

5 The Aggregation Question
Given a unit of analysis (click/session), how much aggregation is needed before there is enough information in each aggregation to uniquely identify a person? For some level of aggregation, agg, we’d like {c1, c2, …, cagg} → user. {c1, c2, …, ck} → <v1, v2, …, vq, user> (feature construction, F). {<v1, v2, …, vq, user>} → user = M(v1, v2, …, vq) (building a predictive model). Find the smallest level of aggregation agg at which unique clickprints (accuracy > threshold) exist. Key elements: how features are constructed for a group of sessions, and how much aggregation needs to be done.
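A minimal sketch of forming the aggregations {c1, …, cagg}: one plausible reading, assumed here, is that a user's consecutive sessions are chunked into non-overlapping groups of agg sessions, each group becoming one labeled example.

```python
# Chunk a user's chronologically ordered sessions into consecutive groups of
# size agg; a final incomplete group is dropped. This grouping scheme is an
# assumption for illustration, not prescribed by the slides.
def aggregate(user_sessions, agg):
    return [user_sessions[i:i + agg]
            for i in range(0, len(user_sessions) - agg + 1, agg)]

sessions = list(range(7))  # 7 sessions, represented here by stand-in IDs
print(aggregate(sessions, 3))  # [[0, 1, 2], [3, 4, 5]] -- session 6 dropped
```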

6 An example of aggregations

7 An example of aggregations

8 Experiments and Design
comScore Networks, 50,000 users, 1 year User-Centric data A session is a user’s activities across Web sites Created multiple data sets by combining sessions from 2, 3, 4, 5, 10, 15, 20 users (140 data sets in total) User selection: Users with household size 1 Users with enough sessions for adequate out-of-sample testing Pick users with > 300 sessions in a year First 2/3 sessions as training, last 1/3 sessions as hold-out Same number of sessions for the selected users in each data set to guarantee same class prior before and after aggregation.
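The temporal split described above (first 2/3 of each user's sessions for training, last 1/3 held out) can be sketched as follows; the data layout is a hypothetical simplification where each user maps to a list of session timestamps.

```python
# Hedged sketch of the temporal hold-out split: per user, sessions are put in
# chronological order, the first two thirds train the model, the rest test it.
def temporal_split(sessions_by_user):
    train, test = {}, {}
    for user, sessions in sessions_by_user.items():
        ordered = sorted(sessions)          # chronological order
        cut = (2 * len(ordered)) // 3       # first 2/3 as training
        train[user], test[user] = ordered[:cut], ordered[cut:]
    return train, test

tr, te = temporal_split({"u1": [3, 1, 2, 6, 5, 4]})
print(tr["u1"], te["u1"])  # [1, 2, 3, 4] [5, 6]
```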

9 Experiments and Design
The Features. For a single session: (i) the duration, (ii) the number of pages viewed, (iii) the starting time (in seconds after 12.00am), (iv) the number of sites visited, and (v) binary variables indicating whether each of the top k (= 5, 10) Web sites was visited (note: these top-k Web sites for each user are identified only from the training set). For sets of sessions: create variables capturing distributions of these measures: mean, median, variance, max and min for the continuous attributes, and frequency counts for the top Web sites.
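The group-level feature construction can be sketched as below for one continuous attribute (duration) plus the top-site indicators; extending it to the other continuous attributes is mechanical. The function name and data layout are illustrative assumptions.

```python
from statistics import mean, median, pvariance

# Summarize a group of sessions: mean/median/variance/max/min of a continuous
# attribute, and frequency counts over the user's top-k sites (which, per the
# slide, are identified from the training set only).
def group_features(durations, top_site_flags):
    """durations: per-session durations; top_site_flags: per-session 0/1
    vectors saying whether each top-k site was visited in that session."""
    feats = {
        "dur_mean": mean(durations),
        "dur_median": median(durations),
        "dur_var": pvariance(durations),
        "dur_max": max(durations),
        "dur_min": min(durations),
    }
    k = len(top_site_flags[0])
    for j in range(k):  # frequency count of visits to top site j
        feats[f"site{j}_count"] = sum(flags[j] for flags in top_site_flags)
    return feats

f = group_features([10, 20, 30], [[1, 0], [1, 1], [0, 1]])
print(f["dur_mean"], f["site0_count"], f["site1_count"])  # 20 2 2
```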

10 Experiments and Design
Classifier: J4.8 classification tree in Weka. Model goodness: temporal hold-out samples (last 1/3 for testing). Threshold accuracy: 90%; other levels were also used. Increase the aggregation level and stop when accuracy is high enough or the stopping condition is reached (agg = 30 in these experiments).
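The sequential search described above can be sketched as a simple loop. Here `evaluate(agg)` is a hypothetical stand-in for the whole pipeline at one aggregation level (aggregate, featurize, train the J4.8 tree, score on the hold-out set); the toy curve passed in at the end is invented for illustration.

```python
# Starting at agg = 1, increase the level of aggregation and stop as soon as
# hold-out accuracy reaches the threshold, capping at agg = 30 as in the
# experiments described above.
def smallest_agg(evaluate, threshold=0.90, max_agg=30):
    for agg in range(1, max_agg + 1):
        if evaluate(agg) >= threshold:
            return agg
    return None  # no aggregation level up to the cap was accurate enough

# Toy accuracy curve: accuracy jumps past 90% once agg reaches 4.
print(smallest_agg(lambda a: 0.95 if a >= 4 else 0.6))  # 4
```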

11 Results for one specific accuracy threshold
The optimal levels of aggregation averaged across 20 runs for 90% accuracy (top 10 Web sites).

# of users   Mean agg   % runs with agg < 30
 2           1.05       100%
 3           1.26        95%
 4           1.78        90%
 5           2.16
10           4.24        85%
15           5.2         75%
20           8.9         50%

12 Heuristic for Large Problems: A Monotonicity Assumption
accuracy(M | agg1) ≥ accuracy(M | agg2) whenever agg1 ≥ agg2. In words: the goodness of the model when applied to “more aggregated” data is never worse than the goodness of the model applied to “less aggregated” data. One can then use a binary search procedure to find the optimal agg. Perhaps not very useful when the useful agg values are much smaller, as in our problems/experiments. We are continuing to study when this may work and be useful.

13 Conclusion Contribution: significance of the problem and initial results. Challenges: scale; what is a signature? On-going/future research: pattern-based signatures; application-driven signature problems (e.g. fraud detection, personalization, etc.)

14 Thank you.

15 Related Work
Learning user profiles online: Aggarwal et al. (1998), Adomavicius and Tuzhilin (2001), Mobasher et al. (2002). User profiles for fraud detection: Fawcett and Provost (1996), Cortes and Pregibon (2001). Data preprocessing: Cooley et al. (1999), Zheng et al. (2003). Online intrusion detection: Ellis et al. (2004).

16 Binary search for the optimal aggregation
Start with N users’ Web sessions mixed together. Assume that the range of aggregations we wish to consider is 1, 2, 3, …, K sessions. Consider accuracy at agg = K/2. If this accuracy ≥ threshold, then recursively search in the lower half of the sequence; if this accuracy < threshold, then recursively search in the higher half of the sequence.
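The binary search above, which is valid under the monotonicity assumption, can be sketched as follows. As before, `evaluate(agg)` is a hypothetical stand-in for training and scoring the classifier at that aggregation level, and the toy accuracy curve is invented for illustration.

```python
# Binary search for the smallest agg in 1..K whose accuracy meets the
# threshold, assuming accuracy is monotone non-decreasing in agg.
def binary_search_agg(evaluate, K, threshold=0.90):
    lo, hi = 1, K
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if evaluate(mid) >= threshold:
            best, hi = mid, mid - 1   # accurate enough: try a smaller agg
        else:
            lo = mid + 1              # not accurate enough: aggregate more
    return best

# Toy accuracy curve: accuracy jumps past 90% once agg reaches 4.
print(binary_search_agg(lambda a: 0.95 if a >= 4 else 0.6, K=30))  # 4
```

With K = 30 this needs about log2(30) ≈ 5 evaluations instead of up to 30, which is the payoff of the monotonicity assumption when it holds.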

17 Histogram of number of sessions

18 Distribution of the agg values

