Empirical Investigations of WWW Surfing Paths Jim Pitkow User Interface Research Xerox Palo Alto Research Center.

Presentation transcript:

Empirical Investigations of WWW Surfing Paths Jim Pitkow User Interface Research Xerox Palo Alto Research Center

August 1999
Agenda
A few characteristics of the Web and clicks
Aggregate click models
  – user surfing behaviors
  – post-hoc hit prediction
Individual click models
  – entropy
  – path prediction

Web is big, Web is good
1 new server every 2 seconds
7.5 new pages per second

Users, sessions, and clicks
Current Internet Universe Estimate: 97.1 million
Time spent/month: 7:28:16
Number of unique sites visited/month: 15
Page views/month: 313
Number of sessions/month: 16
Page views/session: 19
Time spent/session: 0:28:01
Time spent/site: 0:31:27
Duration of a page viewed: 0:01:28
Source: Nielsen//NetRatings, February 1999

Popularity of pages: Zipf
Zipf distribution:
  – frequency is inversely proportional to rank
Zipf's law:
  – the slope of the log-log rank-frequency plot equals minus one
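The slope claim can be checked numerically. The following is a minimal sketch, assuming synthetic hit counts that follow frequency = 1000 / rank exactly; the function name `loglog_slope` and the data are illustrative and not taken from the talk.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(frequencies, reverse=True)  # rank 1 = most popular page
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical page-hit counts that are exactly inversely proportional to rank.
ideal_hits = [1000.0 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal_hits)
print(round(slope, 6))
```

Real access logs only approximate this behavior, so a fitted slope near minus one (rather than exactly minus one) is what one would look for.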

[Figure: surfers moving from Enter to Exit along paths p1, p2, p3; panels (a) to (d)]
Users enter a website at various pages and begin surfing
Continuing surfers distribute themselves down various paths
Surfers arrive at pages having traveled different paths
After some number of page visits surfers leave the web site

Model of surfing
$V_L = V_{L-1} + \varepsilon_L$
where $L$ is the number of clicks and the $\varepsilon_L$ are independent and identically distributed Gaussian random variables
Surfing proceeds until the perceived cost is larger than the discounted expected future value
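The stopping rule can be illustrated with a small simulation. This is only a sketch under stated assumptions: the starting value, the negative drift, the standard deviation, and the threshold of zero are all hypothetical parameters chosen for illustration, and the "perceived cost" is simplified to a fixed threshold on the page value.

```python
import random
import statistics

def surf_length(rng, start_value=1.0, drift=-0.1, sigma=0.5,
                threshold=0.0, max_clicks=10_000):
    """Number of clicks until the page value first drops below the threshold."""
    value = start_value
    for clicks in range(1, max_clicks + 1):
        # V_L = V_{L-1} + eps_L, with eps_L drawn i.i.d. from a Gaussian
        value += rng.gauss(drift, sigma)
        if value < threshold:
            return clicks
    return max_clicks  # safety cap, practically never reached with these values

rng = random.Random(1999)  # fixed seed so the run is repeatable
lengths = [surf_length(rng) for _ in range(5000)]
mean_clicks = statistics.mean(lengths)
median_clicks = statistics.median(lengths)
print(mean_clicks, median_clicks)
```

With these assumed parameters the simulated path lengths are strongly right-skewed, so the mean sits above the median.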

Random walk with a stopping threshold
Two-parameter inverse Gaussian distribution
mean: $E[L] = \mu$
variance: $\mathrm{Var}[L] = \mu^3/\lambda$
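For reference, the standard two-parameter inverse Gaussian density is written below. The symbol names $\mu$ (mean) and $\lambda$ (scale) are assumed here, since the slide's own symbols did not survive extraction.

$$
P(L) = \sqrt{\frac{\lambda}{2\pi L^{3}}}\,
\exp\!\left[-\frac{\lambda\,(L-\mu)^{2}}{2\mu^{2}L}\right], \qquad L > 0
$$

Under this parameterization the mean is $\mu$ and the variance is $\mu^{3}/\lambda$, so a small $\lambda$ relative to $\mu$ means a wide, heavily skewed distribution.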

Experimental design
Client data
  – Georgia Tech (3 weeks, August 1994)
  – Boston University (1995)
  tens of thousands of requests
Proxy data
  – AOL (5 days in December 1997)
  tens of millions of requests
Server data
  – Xerox WWW Site (week during May 1997)

Probability distribution function
[Figure: probability distribution over number of clicks]
Mode: 1 click/site
Median: 3-4 clicks/site
Mean: 8-10 clicks/site

Cumulative distribution function
[Figure legend: asterisks = experimental data; solid line = inverse Gaussian]
75% of the distribution accounted for in three clicks

Two observations
The inverse Gaussian distribution has a very long tail
  – expect to see large deviations from average
Due to the asymmetric nature of the inverse Gaussian distribution, typical behavior does not equal average behavior

An interesting derivation
Up to a constant given by the third term, the probability of finding a group surfing at a given level scales inversely in proportion to its depth
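The equation this slide refers to did not survive extraction. One plausible reconstruction, assuming the standard inverse Gaussian density with mean $\mu$ and scale $\lambda$ (symbol names assumed), is to take logarithms of the density:

$$
\log P(L) = -\frac{3}{2}\log L \;-\; \frac{\lambda\,(L-\mu)^{2}}{2\mu^{2}L} \;+\; \log\sqrt{\frac{\lambda}{2\pi}}
$$

The third term does not depend on $L$. Where the second term is small compared with the first, a log-log plot of $P(L)$ against $L$ is approximately a straight line with slope $-3/2$, which is one way to read "scales inversely with depth".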

Number of surfers at each level

Implications of the Law of Surfing
Implications for techniques designed to enhance performance
  – HTTP Keep-Alive, pre-fetching of content
Can adapt content based on a user's location on the curve in a cost-sensitive manner
  – different user modalities (browser, searcher, etc.)
  – expend different CPU resources for different users
Web site visitation modeling

Spreading Activation
Pump activation into source
Activation spreads through the network
Activation settles into asymptotic pattern

Matrix Formulation
C: Sources
R: Network
A: Activation
Content + Usage + Topology
Technique
$A(t) = C + M A(t-1)$
$M = (1 - \gamma) I + \alpha R$
Networks: User Paths, Text Similarity, Topology
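The iteration can be sketched in a few lines. This is an illustrative sketch only: it assumes the update A(t) = C + M A(t-1) with M = (1 - gamma) I + alpha R, where gamma (decay) and alpha (spread) are assumed names for the two parameters, and the three-node network and parameter values are made up.

```python
def spread_activation(C, R, gamma, alpha, steps=200):
    """Iterate A(t) = C + M A(t-1) with M = (1 - gamma) I + alpha R."""
    n = len(C)
    A = [0.0] * n  # start with no activation anywhere
    for _ in range(steps):
        A = [
            C[i] + (1.0 - gamma) * A[i]
            + alpha * sum(R[i][j] * A[j] for j in range(n))
            for i in range(n)
        ]
    return A

# Hypothetical 3-node chain; R[i][j] is the link strength from node j to node i,
# and each column sums to 1.
R = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
C = [1.0, 0.0, 0.0]  # activation is pumped into node 0 only
gamma, alpha = 0.5, 0.3
A = spread_activation(C, R, gamma, alpha)
print([round(a, 4) for a in A])
```

With these values the iteration settles to a fixed pattern in which activation falls off with distance from the source node. Convergence is not guaranteed for arbitrary gamma, alpha, and R.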

Application to hit prediction
Let $f_L$ be the fraction of users who, having surfed along $L-1$ links, continue to surf to depth $L$
Define the activation value $N_{i,L}$ as the number of users who are at node $i$ after surfing through $L$ clicks
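One way these definitions could be combined into a prediction step is sketched below. The update rule and the matrix `S` (the proportion of users at node k who go to node i when they continue) are assumptions introduced for illustration; the slide text gives only the definitions of the continuation fraction and the activation values.

```python
def next_level(N, S, f):
    """Users per node one click deeper: f * sum_k S[i][k] * N[k].

    N : users per node at the current depth
    S : S[i][k] = assumed proportion of continuing users who move from k to i
    f : assumed fraction of users who continue surfing at this depth
    """
    n = len(N)
    return [f * sum(S[i][k] * N[k] for k in range(n)) for i in range(n)]

# Hypothetical 3-page site; each column of S sums to 1.
S = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
N0 = [100.0, 0.0, 0.0]          # 100 users enter at page 0
N1 = next_level(N0, S, f=0.6)   # 60% continue to the next click
N2 = next_level(N1, S, f=0.5)   # half of those continue again
print(N1, N2)
```

Summing the per-node values across depths would give a predicted hit count per page, which could then be compared with observed hits.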

Predicted versus observed

Surfing probabilities by outlink density

Investigating user paths
Each user path can be thought of as an ngram and represented as a tuple of the form $\langle x_1, x_2, \ldots, x_n \rangle$ to indicate a sequence of page clicks
  – Distribution of ngrams is the Law of Surfing
Determine the conditional probability of seeing the next page given a matching prior ngram
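Estimating these conditional probabilities from logged paths amounts to counting. A minimal sketch follows, assuming paths are lists of page identifiers and the prior ngram has fixed length k; the function name and the toy paths are hypothetical.

```python
from collections import Counter, defaultdict

def next_page_probabilities(paths, k):
    """Estimate p(next page | previous k pages) by counting ngrams in paths."""
    counts = defaultdict(Counter)
    for path in paths:
        for i in range(k, len(path)):
            prior = tuple(path[i - k:i])   # the k pages clicked just before
            counts[prior][path[i]] += 1
    return {
        prior: {page: c / sum(nexts.values()) for page, c in nexts.items()}
        for prior, nexts in counts.items()
    }

# Toy click paths, made up for illustration.
paths = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"]]
probs = next_page_probabilities(paths, k=2)
print(probs[("A", "B")])
```

Longer priors give sharper estimates when they match, but they are seen less often in the data, which is the trade-off the later slides quantify.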

Uncertainty and entropy
Conditional probabilities are also known as kth-order Markov approximations/models
Entropy is the expected (average) uncertainty of the random variable, measured in bits
  – minimal number of bits to encode information
  – uncertainty in the sequence of letters in languages
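As a concrete illustration of entropy in bits, the sketch below computes the empirical Shannon entropy of a sequence of observations. The function name and the sample sequences are illustrative only.

```python
import math
from collections import Counter

def entropy_bits(observations):
    """Empirical Shannon entropy, in bits, of a sequence of observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

fair = entropy_bits(["H", "T", "H", "T"])          # two equally likely outcomes
certain = entropy_bits(["A", "A", "A", "A"])       # a single outcome, no uncertainty
four_way = entropy_bits(["a", "b", "c", "d"])      # four equally likely outcomes
print(fair, certain, four_way)
```

Two equally likely outcomes need one bit, four need two bits, and a fully predictable sequence needs none.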

Mathematics of entropy

Conditional entropy
Conditional probability
Chaining rule
Joint probabilities
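The equations for this slide were not preserved. The standard textbook definitions that the labels most likely refer to are given below for discrete random variables $X$ and $Y$; the notation is assumed, not taken from the slide.

$$
\begin{aligned}
H(X) &= -\sum_{x} p(x)\,\log_2 p(x) \\
p(y \mid x) &= \frac{p(x, y)}{p(x)} \\
H(Y \mid X) &= -\sum_{x,\,y} p(x, y)\,\log_2 p(y \mid x) \\
H(X, Y) &= H(X) + H(Y \mid X)
\end{aligned}
$$

The last line is the chain rule: the uncertainty of the pair equals the uncertainty of $X$ plus the remaining uncertainty of $Y$ once $X$ is known.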

Entropy versus ngram length

Path predictions
Pr(PPM): the probability that a penultimate path, $\langle x_{n-1}, \ldots, x_{n-k} \rangle$, observed in the test data was matched by the same penultimate path in the training data
Pr(Hit|PPM): the probability that page $x_n$ is visited, given that $\langle x_{n-1}, \ldots, x_{n-k} \rangle$ is the penultimate path and the highest probability conditional on that path is $p(x_n \mid x_{n-1}, \ldots, x_{n-k})$
Pr(Hit) = Pr(Hit|PPM) * Pr(PPM): the probability that the page visited in the test set is the one estimated from the training set as the most likely to occur
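The three quantities can be estimated from a train/test split of logged paths. The sketch below is one possible reading, assuming fixed-length penultimate paths of k pages and a "predict the most frequent next page" rule; the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train(paths, k):
    """Count which page follows each length-k penultimate path."""
    model = defaultdict(Counter)
    for path in paths:
        for i in range(k, len(path)):
            model[tuple(path[i - k:i])][path[i]] += 1
    return model

def evaluate(model, test_paths, k):
    """Return (Pr(PPM), Pr(Hit|PPM), Pr(Hit)) measured on the test paths."""
    cases = matches = hits = 0
    for path in test_paths:
        for i in range(k, len(path)):
            cases += 1
            prior = tuple(path[i - k:i])
            if prior in model:                       # penultimate path match
                matches += 1
                predicted = model[prior].most_common(1)[0][0]
                hits += int(predicted == path[i])    # most likely page was visited
    pr_ppm = matches / cases if cases else 0.0
    pr_hit_given_ppm = hits / matches if matches else 0.0
    return pr_ppm, pr_hit_given_ppm, pr_ppm * pr_hit_given_ppm

# Toy data, made up for illustration.
training = [["A", "B", "C"], ["A", "B", "C"], ["A", "B", "D"]]
testing = [["A", "B", "C"], ["X", "Y", "Z"]]
pr_ppm, pr_hit_given_ppm, pr_hit = evaluate(train(training, 2), testing, 2)
print(pr_ppm, pr_hit_given_ppm, pr_hit)
```

Raising k typically lowers Pr(PPM), because long paths recur less often, while raising Pr(Hit|PPM) for the paths that do match.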

Pr(Path Match)

Pr(Hit|Path Match)

Pr(Hit)

Principles of path prediction
Paths are not completely modeled by 1st-order Markov approximations
Path Specificity Principle: longer paths contain more predictive power than shorter paths
Complexity Reduction: keep models as simple and as small as possible

Agenda revisited
A few characteristics of the Web and clicks
Aggregate click models
  – user surfing behaviors
  – post-hoc hit prediction
Individual click models
  – entropy
  – path prediction

Areas of further investigation
Advance the state of predictive modeling
  – Move from post-hoc to a-priori prediction of user interactions on the Web
  – Test hypothetical models of Web site usage
Validate existing and new models on more representative data sets
Understand new and emerging applications
  – streaming video and audio, mobile, etc.

More information
istl/projects/uir/projects/Webology.html