Project Presentations

Project Presentations
Thursday next week, each student will make a 4-minute presentation on their project in class (with 1 or 2 minutes for questions) me your Powerpoint or PDF slides, with your name (e.g., joesmith.ppt), before 10am next Thursday Suggested content: Definition of the task/goal Description of data sets Description of algorithms Experimental results and conclusions Be visual where possible! (i.e., use figures, graphs, etc) Final project report will be due by 12 noon Tuesday of finals week – more details to come later Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

ICS 278: Data Mining Lecture 18: Analysis of Web User Data
Padhraic Smyth Department of Information and Computer Science University of California, Irvine Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Outline Basic concepts in Web mining
Analyzing user navigation or clickstream data Predictive modeling of Web navigation behavior Markov modeling methods Analyzing search engine data Ecommerce aspects of Web log mining Automated recommender systems Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Further Reading Modeling the Internet and the Web, P. Baldi, P. Frasconi, P. Smyth, Wiley, 2003. ACM Transactions on Internet Technology (ACM TOIT) – can be accessed via ACM Digital Library (available from UCI IP addresses). Annual WebKDD workshops at the ACM SIGKDD conferences. Papers on Web page prediction Selective Markov models for predicting Web page accesses, M. Deshpande, G. Karypis, ACM Transactions on Internet Technology, May 2004. Model-based clustering and visualization of navigation patterns on a Web site, Cadez et al, Journal of Data Mining and Knowledge Discovery, 2003. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Introduction to Web Mining
Useful to study human digital behavior, e.g. search engine data can be used for Exploration e.g. # of queries per session? Modeling e.g. any time of day dependence? Prediction e.g. which pages are relevant? Applications Understand social implications of Web usage Design of better tools for information access E-commerce applications Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Advertising Applications
Revenue of many internet companies is driven by advertising Key problem: Given user data: Pages browsed Keywords used in search Demographics Determine the most relevant ads (in real-time) Currently about 50% of keyword searches can not be matched effectively to any ads (other aspects include bidding/pricing of ads) Another major problem: “click fraud” Algorithms that can automatically detect when online advertisements are being manipulated (this is a major problem for Internet advertising) Understanding the user is key to these types of applications Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Data Sources for Web Mining
Web content Text and HTML content on Web pages, e.g., categorization of content Web connectivity Hyperlink/directed-graph structure of the Web e.g., using PageRank to infer importance of Web pages e.g., using links to improve accuracy in classification of Web pages Web user data Data on how users interact with the Web Navigation data, aka “clickstream” data Search query data (keywords for users) Online transaction data (e.g., purchases at an ecommerce store) Volume of data? Large portals (e.g., Yahoo!, MSN) report 100’s of millions of users per month Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Flowchart of a typical Web Mining process (From Cooley, ACM TOIT,
2003) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

How our Web navigation is recorded…
Web logs Record activity between client browser and a specific Web server Easily available Can be augmented with cookies (provide notion of “state”) Search engine records Text in queries, which pages were viewed, which snippets were clicked on, etc Client-side browsing records Automatically recorded by client-side software Harder to obtain, but much more accurate than server-side logs Other sources Web site registration, purchases, , etc ISP recording of Web browsing Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Web Server Log Files Server Transfer Log: Referrer Log: Agent Log:
transactions between a browser and server are logged IP address, the time of the request Method of the request (GET, HEAD, POST…) Status code, a response from the server Size in byte of the transaction Referrer Log: where the request originated Agent Log: browser software making the request (spider) Error Log: request resulted in errors (404) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

W3C Extended Log File Format
cs = client-to-server actions s = server actions c = client actions sc = server-to-client actions Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Example of Web Log entries
Apache web log: [29/Mar/2002:03:58: ] "GET /~sophal/whole5.gif HTTP/1.0" " "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)" [29/Mar/2002:03:59: ] "GET /~alexlam/resume.html HTTP/1.0" "-" "Mozilla/5.0 (Slurp/cat; [29/Mar/2002:03:00: ] "GET /~tahir/indextop.html HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)“ [29/Mar/2002:03:00: ] "GET /~tahir/animate.js HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)“ Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Routine Server Log Analysis
Typical statistics/histograms that are computed Most and least visited web pages Entry and exit pages Referrals from other sites or search engines What are the searched keywords How many clicks/page views a page received Error reports, like broken links Many software products that produce standard reports of this type of data Very useful for Web site managers But does not provide “deep” insights e.g., are there clusters/groups of users that use the site in different ways? Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Visualization of Web Log Data over Time
Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Descriptive Summary Statistics
Histograms, scatter plots, time-series plots Very important! Helps to understand the big picture Provides “marginal” context for any model-building models aggregate behavior, not individuals Challenging for Web log data Examples Session lengths (e.g., power laws) Click rates as a function of time, content Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

L = number of page requests in a single session
from visitors to over 1 week in November 2002 (robots removed) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Best fit of simple power law model Log P(L) = -a Log L + b
or P(L) = b L-a Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Web data measurement issues
Important to understand how data is collected Web data is collected automatically via software logging tools Advantage: No manual supervision required Disadvantage: Data can be skewed (e.g. due to the presence of robot traffic) Important to identify robots (also known as crawlers, spiders) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

A time-series plot of ICS Website data
Number of page requests per hour as a function of time from page requests in the Web server logs during the first week of April 2002. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Example: Web Traffic from Commercial Site
Example: Web Traffic from Commercial Site (slide from Ronny Kohavi, Amazon) Sept-11 Note significant drop in human traffic, not bot traffic Weekends Internal Perfor-mance bot Registration at Search Engine sites Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Robot / human identification
Removal of robot data is important preprocessing step before clickstream analysis Robot page-requests often identified using a variety of heuristics e.g. some robots self-identify themselves in the server logs All robots in principle should visit robots.txt on the Web Server Also, robots should identify themselves via the User Agent field in page requests But other robots actively try to disguise that they are robots Patterns of access Robots explore the entire website in breadth first fashion Humans access web-pages more typically in depth-first fashion Timing between page-requests can be more regular for robots (e.g., every 5 seconds) Duration of sessions, number of page-requests per day: often unusually large (e.g., 1000’s of page-requests per day) for robots. Tan and Kumar (Journal of Data Mining and Knowledge Discovery, 2002) provide a detailed description of using classification techniques to learn how to detect robots Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Fractions of Robot Data (from Tan and Kumar, 2002)
Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

From Tan and Kumar, Overall accuracies of around 90% were obtained using decision tree classifiers, trained on sessions of lengths 1, 2, 3, 4,.. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Page requests, caching, and proxy servers
In theory, requester browser requests a page from a Web server and the request is processed In practice, there are Other users Browser caching Dynamic addressing in local network Proxy Server caching Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

A graphical summary of how page requests from an individual user can be masked at various stages between the user’s local computer and the Web server. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Web server logs are therefore not so ideal in terms of a complete and faithful representation of individual page views There are heuristics to try to infer the true actions of the user: - Path completion (Cooley et al. 1999) e.g. If known B -> F and not C -> F, then session ABCF can be interpreted as ABCBF Anderson et al for more heuristics In general case, it is hard to know what exactly the user viewed Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Identifying individual users from Web server logs
Useful to associate specific page requests to specific individual users IP address most frequently used Disadvantages One IP address can belong to several users Dynamic allocation of IP address Better to use cookies (or login ID if available) Information in the cookie can be accessed by the Web server to identify an individual user over time Actions by the same user during different sessions can be linked together Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Identifying individual users from Web server logs
Commercial websites use cookies extensively 97 % of users have cookies enabled permanently on their browsers (source: Amazon.com, 2003) However … There are privacy issues – need implicit user cooperation Cookies can be deleted / disabled Another option is to enforce user registration High reliability But can discourage potential visitors Large portals (such as Yahoo!) have high fraction of logged-in users Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Sessionizing Time oriented (robust)
e.g., by gaps between requests not more than 20 minutes between successive requests this is a heuristic – but is a standard “rule” used in practice Navigation oriented (good for short sessions and when timestamps unreliable) Referrer is previous page in session, or Referrer is undefined but request within 10 secs, or Link from previous to current page in web site Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Client-side data Advantages of collecting data at the client side:
Direct recording of page requests (eliminates ‘masking’ due to caching) Recording of all browser-related actions by a user (including visits to multiple websites) More-reliable identification of individual users (e.g. by login ID for multiple users on a single computer) Preferred mode of data collection for studies of navigation behavior on the Web Companies like ComScore and Nielsen use client-side software to track home computer users Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Client-side data Statistics like ‘Time per session’ and ‘Page-view duration’ are more reliable in client-side data Some limitations Still some statistics like ‘Page-view duration’ cannot be totally reliable e.g. user might go to fetch coffee Need explicit user cooperation Typically recorded on home computers – may not reflect a complete picture of Web browsing behavior Web surfing data can be collected at intermediate points like ISPs, proxy servers Can be used to create user profile and target advertise Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Modeling Clickrate Data
200k Alexa users, client-side, over 24 hours ignore URLs requested goal is to build a time-series model that characterizes user click rates Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Markov-Poisson Model (Scott and Smyth, 2003)
Doubly stochastic process Locally constant Poisson rate indexed by M Markov states Fit a model with M = 3 states absence of a Web session Web session with slow click rate: 1 minute rate Web session with rapid click rate: 10 second rate Used hierarchical Bayes on individuals Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Hierarchical Bayes Model
Population Prior p(l|q) l1 Individual 1 li Individual i lN Individual N D1 D1 D2 D3 D1 D2 Individuals with little data get “shrunk” to the prior Individuals with a lot of data are more data-driven Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Prediction with Hierarchical Bayes
Population Prior p(l|q) New Individual l = ? l1 Individual 1 lN Individual N D1 D2 D1 D2 D3 First few clicks Historical Training Data Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Early studies from 1995 to 1997 Earliest studies on client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997) In both studies, data was collected by logging Web browser commands Population consisted of faculty, staff and students Both studies found clicking on the hypertext anchors as the most common action using ‘back button’ was the second common action Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Early studies from 1995 to 1997 high probability of page revisitation (~ ) Lower bound because the page requests prior to the start of the studies are not accounted for Humans are creatures of habit? Content of the pages changed over time? strong recency (page that is revisited is usually the page that was visited in the recent past) effect Correlates with the ‘back button’ usage Similar repetitive actions are found in telephone number dialing etc Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

The Cockburn and McKenzie study from 2002
Earlier studies were outdates Web has changed dramatically in the past few years Cockburn and McKenzie (2002) provides a more up-to-date analysis Analyzed the daily history.dat files produced by the Netscape browser for 17 users for about 4 months Population studied consisted of faculty, staff and graduate students Study found revisitation rates higher than past 94 and 95 studies (~0.81) Time-window is three times that of past studies Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Revisitation rate less biased than the previous studies? Human behavior changed from an exploratory mode to a utilitarian mode? The more pages user visits, the more are the requests for new pages The most frequently requested page for each user can account for a relatively large fraction of his/her page requests Useful to see the scatter plot of the distinct number of pages requested per user versus the total pages requested Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

The number of distinct pages visited versus page vocabulary size of each of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Bar chart of the ratio of the number of page requests for the most frequent page divided by the total number of page requests, for 17 users in the Cockburn McKenzie (2002) study Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Outline Basic concepts in Web log data analysis
Predictive modeling of Web navigation behavior Markov modeling methods Analyzing search engine data Ecommerce aspects of Web log mining Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Markov models for page prediction
General approach is to use a finite-state Markov chain Each state can be a specific Web page or a category of Web pages If only interested in the order of visits (and not in time), each new request can be modeled as a transition of states Issues Self-transition Time-independence Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

For simplicity, consider order-dependent, time-independent finite-state Markov chain with M states Let s be a sequence of observed states of length L. e.g. s = ABBCAABBCCBBAA with three states A, B and C. st is state at position t (1<=t<=L). In general, first-order Markov assumption This provides a simple generative model to produce sequential data Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

If we denote Tij = P(st = j|st-1 = i), we can define a M x M transition matrix Properties Strong first-order assumption Simple way to capture sequential dependence If each page is a state and if W pages, O(W2), W can be of the order 105 to 106 for a CS dept. of a university To alleviate, we can cluster W pages into M clusters, each assigned a state in the Markov model Clustering can be done manually, based on directory structure on the Web server, or automatic clustering using clustering techniques Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Tij = P(st = j|st-1 = i) represents the probability that an individual user’s next request will be from category j, given they were in category i We can add E, an end-state to the model E.g. for three categories with end state: - E denotes the end of a sequence, and start of a new sequence Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

First-order Markov model assumes that the next state is based only on the current state Limitations Doesn’t consider ‘long-term memory’ We can try to capture more memory with kth-order Markov chain Inordinate amount of training data O(Mk+1) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Parameter estimation for Markov model transitions
Smoothed parameter estimates of transition probabilities are If nij = 0 for some transition (i, j) then instead of having a parameter estimate of 0 (ML), we will have allowing prior knowledge to be incorporated If nij > 0, we get a smooth combination of the data-driven information (nij) and the prior Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Parameter estimation for Markov models
One simple way to set prior parameter is Consider alpha as the effective sample size Partition the states into two sets, set 1 containing all states directly linked to state i and the remaining in set 2 Assign uniform probability r/K to all states in set 2 (all set 2 states are equally likely) The remaining (1-r) can be either uniformly assigned among set 1 elements or weighted by some measure Prior probabilities in and out of E can be set based on our prior knowledge of how likely we think a user is to exit the site from a particular state Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Predicting page requests with Markov models
Deshpande and Karypis (2004) propose schemes to prune kth-order Markov state space Provide systematic but modest improvements Another way is to use empirical smoothing techniques that combine different models from order 1 to order k (Chen and Goodman 1996) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Mixtures of Markov Chains
Cadez et al. (2003) and Sen and Hansen (2003) replace the first-order Markov chain: with a mixture of first-order Markov chains where c is a discrete-value hidden variable taking K values Sk P(c = k) = 1 and P(st | st-1, c = k) is the transition matrix for the kth mixture component One interpretation of this is user behavior consists of K different navigation behaviors described by the K Markov chains Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Modeling Web Page Requests with Markov chain mixtures
MSNBC Web logs Order of 2 million individual users per day different session lengths per individual difficult visualization and clustering problem WebCanvas uses mixtures of Markov chains to cluster individuals based on their observed sequences software tool: EM mixture modeling + visualization Next few slides are based on material in: I. Cadez et al, Model-based clustering and visualization of navigation patterns on a Web site, Journal of Data Mining and Knowledge Discovery, 2003. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

From Web logs to sequences
, -, 3/22/00, 10:35:11, W3SVC, SRVR1, , 781, 363, 875, 200, 0, GET, /top.html, -, , -, 3/22/00, 10:35:16, W3SVC, SRVR1, , 5288, 524, 414, 200, 0, POST, /spt/main.html, -, , -, 3/22/00, 10:35:17, W3SVC, SRVR1, , 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, , -, 3/22/00, 16:18:50, W3SVC, SRVR1, , 60, 425, 72, 304, 0, GET, /top.html, -, , -, 3/22/00, 16:18:58, W3SVC, SRVR1, , 8322, 527, 414, 200, 0, POST, /spt/main.html, -, , -, 3/22/00, 16:18:59, W3SVC, SRVR1, , 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, , -, 3/22/00, 20:54:37, W3SVC, SRVR1, , 140, 199, 875, 200, 0, GET, /top.html, -, , -, 3/22/00, 20:54:55, W3SVC, SRVR1, , 17766, 365, 414, 200, 0, POST, /spt/main.html, -, , -, 3/22/00, 20:54:55, W3SVC, SRVR1, , 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, , -, 3/22/00, 20:55:07, W3SVC, SRVR1, , 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, , -, 3/22/00, 20:55:36, W3SVC, SRVR1, , 1061, 382, 414, 200, 0, POST, /spt/main.html, -, , -, 3/22/00, 20:55:36, W3SVC, SRVR1, , 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, , -, 3/22/00, 20:55:39, W3SVC, SRVR1, , 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, , -, 3/22/00, 20:56:03, W3SVC, SRVR1, , 1081, 382, 414, 200, 0, POST, /spt/main.html, -, , -, 3/22/00, 20:56:04, W3SVC, SRVR1, , 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, , -, 3/22/00, 20:56:33, W3SVC, SRVR1, , 0, 262, 72, 304, 0, GET, /top.html, -, , -, 3/22/00, 20:56:52, W3SVC, SRVR1, , 19598, 382, 414, 200, 0, POST, /spt/main.html, -, User 1 2 3 2 2 3 3 3 1 1 1 3 1 3 3 3 3 User 2 3 3 3 1 1 1 User 3 7 7 7 7 7 7 7 7 User 4 1 5 1 1 1 5 1 5 1 1 1 1 1 1 User 5 5 1 1 5 … … Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Clusters of Finite State Machines
B B D D E E A B D Cluster 3 E Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Learning Problem Assumptions Given Learn Solution
data is being generated by K different groups Each group is described by a stochastic finite state machine (SFSM) aka, a Markov model with an end-state Given A set of sequences from different users of different lengths Learn A “mixture” of K different stochastic finite state machines Solution EM is very easy: fractional counts of transitions efficient and accurate, scales as O(KN) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Sketch of EM Algorithm for Mixtures of Markov Chains
Model = mixture of K Markov chains (K fixed) Input data = N categorical sequences (can be variable length) Initialization: Generate random initial transition matrices for each of the K groups E-step: Compute p( sequence i | model k), for i=1,..N, k = 1,…K Use Bayes rule to compute p(model k | sequence i) Yields membership probabilities for each sequence M-step: Estimate the transition probabilities for each cluster, given membership probabilities Consists of “fractional counting of transitions”, e.g., sequence with probability 0.8 in cluster k, results in transition counts weighted by 0.8 Repeat E and M steps until convergence Complexity of each iteration is O(K N L) where L is the average sequence length Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Prediction with Markov mixtures
P(st+1 | s[1,t] ) = Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

P(st+1 | s[1,t] ) = S P(st+1 , k | s[1,t] ) = S P(st+1 | k , s[1,t] ) P(k | s[1,t] ) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

P(st+1 | s[1,t] ) = S P(st+1 , k | s[1,t] ) = S P(st+1 | k , s[1,t] ) P(k | s[1,t] ) = S P(st+1 | k , st ) P(k | s[1,t] ) Prediction of kth component Membership, based on sequence history => Predictions are a convex combination of K different component transition matrices, with weights based on sequence history Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Experimental Methodology
Model Training: fit 2 types of models mixtures of histograms (multinomials) mixtures of finite state machines Train on a full day’s worth of MSNBC Web data Model Evaluation: “one-step-ahead” prediction on unseen test data Test sequences from a different day of Web logs compute log P(user’s next click | previous clicks, model) Using equation on the previous slide logP score: Rewarded if next click was given high P by the model Punished if next click was given low P by the model negative average of logP scores ~ “predictive entropy” Has a natural interpretation Lower bounded by 0 bits (perfect prediction) Upper bounded by log M bits, where M is the number of categories Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Timing Results Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

WebCanvas Software tool for Web log visualization
uses Markov mixtures to cluster data for display extensively used within Microsoft also applied to non-Web data (e.g., how users navigate in Word, etc) Algorithm and visualization are in latest release of SQLServer (the “sequence mining” tool) Model-based visualization random sample of actual sequences interactive tiled windows displayed for visualization more effective than planar graphs traffic-flow movie in Microsoft Site Server v3.0 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Insights from WebCanvas for MSNBC data
From msnbc.com site adminstrators…. significant heterogeneity of behavior relatively focused activity of many users typically only 1 or 2 categories of pages many individuals not entering via main page detected problems with the weather page missing transitions (e.g., tech <=> business) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Possible Extensions Adding time-dependence Uncategorized Web pages
adding time-between clicks, time of day effects Uncategorized Web pages coupling page content with sequence models Modeling “switching” behaviors allowing users to switch between behaviors Could use a topic-style model: users = mixtures of behaviors e.g., Girolami M & Kaban A., Sequential Activity Profiling: Latent Dirichlet Allocation of Markov Chains, Journal of Data Mining and Knowledge Discovery, Vol 10, Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Related Work Mixtures of Markov chains
special case: Poulsen (1990) general case: Ridgeway (1997), Smyth (1997) Clustering of Web page sequences non-probabilistic approaches (Fu et al, 1999) Markov models for prediction Anderson et al (IJCAI, 2001): mixtures of Markov outperform other sequential models for page-request prediction Sen and Hansen 2003 Zukerman et al. 1999 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Analysis of Search Engine Query Logs
# of Queries Source Time Period Lau & Horvitz 4690 Excite Sep 1997 Silverstein et al 1 Billion AltaVista 6 weeks in Aug & Sep 1998 Spink et al (series of studies) 1 Million for each time period Sep 1997 Dec 1999 May 2001 Xie & O’Hallaron 110,000 Vivisimo 35 days Jan & Feb 2001 1.9 Million 8 hrs in a day, Dec 1999 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Main Results Average number of terms in a query ranges from a low of 2.2 to a high of 2.6 The most common number of terms in a query was 2 The majority of users don’t refine their query The number of users who viewed only a single page increased from 29% (1997) to 51% (2001) (Excite) 85% of users viewed only first page of search results (AltaVista) 45% (2001) of queries are about Commerce, Travel, Economy, People (was 20% in 1997) All four studies produced a generally consistent set of findings about user behavior in a search engine context most users view relatively few pages per query most users don’t use advanced search features For both search engine, Distribution of query lengths is qualitatively similar Adult content, 1/ to 1/ Entertainment or recreation, 20% 1997 to 7% 2001 Business 20% 1997 to 45% 2001 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Xie and O Halloran Study (2002)
For both search engine, Distribution of query lengths is qualitatively similar - Query Length Distributions (bars) - Poisson Model (dots & lines) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Power-law Characteristics of Common Queries
Power-Law in log-log space Frequency f(r) of Queries with Rank r queries from Vivisimo 1.9 Million queries from Excite There are strong regularities in terms of patterns of behavior in how we search the Web The easiest way to see a power law is to plot the data in log/log space. If there is a power law relationship, you get a straight line Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Ecommerce Data Page request Web logs combined with
Purchase (market-basket) information User address information (if they make a purchase) Demographics information (can be purchased) s to/from the customer Search query information Product ratings information Main focus here is to increase revenue Data mining widely used by online commerce companies like Amazon This is a very rich source of problems for data mining What products should we advertise to this person? Can we do dynamic pricing? If a person buys X should we also suggest Y? Who are our best customers? etc Additional Reading Kohavi, Mason, Parekh, Zheng, Lessons and challenges from mining ecommerce retail data, Machine Learning Journal, 2004 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Predicting Purchase Behavior
Can use predictive models, e.g., logistic regression, to try to predict on real-time if a customer will make a purchase or not Statistical models: couple click-rate with purchase behavior Markov-type model through different states product viewing detailed product information reviews combine states with click rate and page content to predict p(purchase | data up to time t) Reference: Alan L. Montgomery, Shibo Li, Kannan Srinivasan, and John C. Liechty (2004), “Modeling Online Browsing and Path Analysis Using Clickstream Data,” Marketing Science, Vol. 23, No. 4, Fall 2004, p Potentially useful for ecommerce applications, e.g., real-time pricing/discounts but generally difficulty to predict if a customer will make a purchase or not Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Recommender Systems “Vote data” = n x m sparse binary matrix
m columns = “products”, e.g., books for purchase or movies for viewing n rows = users Interpretation: Implicit Ratings: v(i,j) = user i’s rating of product j (e.g. on a scale of 1 to 5) Explicit Purchases: v(i,j) = 1 if user i purchased product j entry = 0 if no purchase or rating We will refer to non-zero entries generically as “votes” Automated recommender systems Given votes by a user on a subset of items, recommend other items that the user may be interested in Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Examples of Recommender Systems
Books and movies purchasing: Amazon.com, Cdnow.com, etc Movie recommendations: Netflix MovieLens (movielens.umn.edu) Digital library recommendations CiteSeer (Popescul et al, 2001): m = 177,000 documents N = 33,000 users Each user accessed 18 documents on average (0.01% of the database -> very sparse!) Web page recommendations E.g., Alexa toolbar ( Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Treatment of Zero’s in Ratings Data
Ratings data (e.g., rating movies on Netflix) User voluntarily assigns scores to movies viewed e.g., 5 for best and 1 for worst Interpretation of a score of 0 The user has not seen this movie The user has seen the movie but has not rated it A 0 score is not necessarily the same as “missing” but often treated that way In much research work on recommender systems, ratings data is converted into binary votes e.g., ratings from >3 mapped to a “vote” of 1, <3 “mapped” to 0 Not ideal since now the 0 score can represent low ratings or “unrated” Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Different recommender algorithms
Nearest-neighbor/collaborative filtering algorithms Cluster-based algorithms Probabilistic model-based algorithms Details discussed in class…. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Additional Aspects of Recommender Systems
Dimension reduction Techniques like SVD can be used to perform predictions in a lower-dimensional space Content-based recommender systems In many cases there is additional information about the items E.g., reviews and synposes of movies A different approach to recommender algorithms is to make predictions on new items based on properties of rated items This approach can be combined with collaborative/user data Particularly useful (e.g.) when many items have no ratings e.g., Decoste et al (IUI, 2005) report that 85% of movies have no ratings in a Yahoo! recommender system Additional data on users, e.g., demographic data May be useful, e.g., in clustering users Sequential aspect of recommendations e.g., novel Markov Decision Process approach by Shani et al, JMLR, 2005 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

General Issues The “cold start” problem Sparsity of data
How to make accurate recommendations for new users Sparsity of data Computational issues For real-time applications need to be able to make recommendations very quickly Significant engineering involved, many tricks Algorithm evaluation Not always clear what the evaluation metric (score) should be See next slide Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Evaluation of Recommender Systems
Research papers use historical data to evaluate and compare different recommender algorithms predictions typically made on items whose ratings are known e.g., leave-1-out method, each positive vote for each user in a test data set is in turn “left out” predictions on left-out items made given rated items e.g., predict-given-k method Make predictions on rated items given k=1, k=5, k=20 ratings See Herlocker et al (2004) for detailed discussion of evaluation Approach 1: measure quality of rankings Score = weighted sum of true votes in top 10 predicted items Approach 2: directly measure prediction accuracy Mean-absolute-error (MAE) between predictions and actual votes Typical MAE on large data sets ~ 20% (normalized) E.g., on a 5-point scale predictions are within 1 point on average Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Evaluation of Recommender Systems
Cautionary note: It is not clear that prediction on historical data is a meaningful way to evaluate recommender algorithms, especially for purchasing Consider: User purchases products A, B, C Algorithm ranks C highly given A and B, gets a good score However, what if the user would have purchased C anyway, i.e., making this recommendation would have had no impact? (or possibly a negative impact!) What we would really like to do is reward recommender algorithms that lead the user to purchase products that they would not have purchased without the recommendation This can’t be done based on historical data alone Requires direct “live” experiments (which is often how companies evaluate recommender algorithms) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Additional Reading on Recommender Systems
GroupLens research group, Papers, demo systems, data sets Breese et al, Empirical analysis of predictive algorithms for collaboration filtering, 1998 Schafer et al, Recommender systems in e-commerce, 1999 Sarwar et al, Analysis of recommendation algorithms for e-commerce, 2000 Herlocker et al, Evaluating collaborative filtering recommender systems, ACM TOIS, 2004 Shani et al, An MDP-based recommender system, 2005 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Project Presentations

Similar presentations

Presentation on theme: "Project Presentations"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Project Presentations

Similar presentations

Presentation on theme: "Project Presentations"— Presentation transcript:

Similar presentations

About project

Feedback