CS-791/891--Preservation of Digital Objects and Collections


Estimating Frequency of Change
Written by Junghoo Cho and Hector Garcia-Molina
Presented by Suman Kumar Narsing

Topics covered:
1. INTRODUCTION
2. TAXONOMY OF ISSUES
3. PRELIMINARIES
4. ESTIMATION OF FREQUENCY: EXISTENCE OF CHANGE
5. ESTIMATION OF FREQUENCY: LAST DATE OF CHANGE
6. EXPERIMENTS
7. CONCLUSION

1. INTRODUCTION: Many data sources are now available online. They are autonomous and are updated independently. Ex: CNN and the NY Times, online stores, etc. Because the sources are updated autonomously, clients do not know exactly when or how often they change.

APPLICATIONS THAT BENEFIT FROM AN ESTIMATED CHANGE FREQUENCY:
Improving a Web crawler.
Improving the update policy of a data warehouse.
Improving Web caching.
Data mining.

CHALLENGES IN ESTIMATING THE FREQUENCY OF CHANGE:
Incomplete change history.
Irregular access interval.
Difference in available information.

EXAMPLE 1: A Web crawler accessed a page once a day for 10 days and detected 6 changes. From this data, the estimated change frequency is 6/10 = 0.6 times a day.
EXAMPLE 2: In a Web cache, a user accessed a Web page 4 times: on day 1, day 2, day 7 and day 10. The page was found to have changed on day 2 and day 7. What does this imply? Does the page change every 10/2 = 5 days on average? We cannot be sure, because the page may have changed more than once between two accesses.

2. TAXONOMY OF ISSUES: What do we mean by "change of an element"? What does "element" mean? What does "change" mean? Here an element is a Web page, and a change is any modification to the page.

Developing Taxonomy:
How do we trace the history of an element?
    Passive monitoring
    Active monitoring
        Regular interval
        Random interval
What information do we have?
    Complete history of changes
    Last date of change
    Existence of change

Developing Taxonomy (contd.):
How do we use the estimated frequency?
    Estimation of frequency
    Categorization of frequency

3. PRELIMINARIES:
Poisson process: the model for the changes of an element.
$X(t)$: number of occurrences of a change in the interval $(0, t]$.
$\lambda$: rate (frequency) of the Poisson process.
For $s \ge 0$ and $t > 0$, the random variable $X(s+t) - X(s)$ has the Poisson probability distribution
$$\Pr\{X(s+t) - X(s) = k\} = \frac{(\lambda t)^k e^{-\lambda t}}{k!}, \quad k = 0, 1, \ldots$$
The number of events expected to occur in a unit interval is
$$E[X(t+1) - X(t)] = \sum_{k=0}^{\infty} k \Pr\{X(t+1) - X(t) = k\} = \sum_{k=0}^{\infty} k \, \frac{\lambda^k e^{-\lambda}}{k!} = \lambda.$$
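The Poisson formulas above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names `poisson_pmf` and `expected_changes` and the truncation point `k_max` are assumptions chosen for illustration.

```python
import math


def poisson_pmf(k: int, rate: float, t: float = 1.0) -> float:
    """Pr{X(s+t) - X(s) = k} for a Poisson process with the given rate."""
    mu = rate * t  # expected number of changes in an interval of length t
    return mu ** k * math.exp(-mu) / math.factorial(k)


def expected_changes(rate: float, t: float = 1.0, k_max: int = 100) -> float:
    """Approximate E[X(s+t) - X(s)] by truncating the sum over k at k_max.

    The truncation is an assumption of this sketch; for moderate rate * t
    the discarded tail is negligible.
    """
    return sum(k * poisson_pmf(k, rate, t) for k in range(k_max + 1))


if __name__ == "__main__":
    lam = 0.6  # hypothetical rate: 0.6 changes per day
    print(expected_changes(lam))       # close to lam
    print(expected_changes(lam, t=5))  # close to lam * 5
```

With a unit interval the truncated sum should agree with λ up to floating-point error, which matches the identity E[X(t+1) − X(t)] = λ.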

Graphs explaining the importance of λ:

Estimator: $\hat{\lambda} = X/T$, where X is the number of detected changes and T is the monitoring period. The distribution of $\hat{\lambda}$ determines how effective the estimator is, judged by:
Bias.
Efficiency.
Consistency.

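4. ESTIMATION OF FREQUENCY: EXISTENCE OF CHANGE: The element is accessed n times at a regular interval $I = 1/f$, so the total time elapsed is $T = nI = n/f$. From here on we estimate the frequency ratio $r = \lambda/f$ instead of $\lambda$; its estimator is $\hat{r} = \hat{\lambda}/f = \frac{1}{f} \cdot \frac{X}{T} = X/n$.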

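Estimating from n repeated accesses to the element, with $\hat{r} = X/n$ as the estimate of the frequency ratio $r = \lambda/f$:
Is the estimator biased? Theorem 4.1: The expected value of the estimator is $E[\hat{r}] = 1 - e^{-r}$. Since $1 - e^{-r} < r$ for every $r > 0$, the estimator is biased and underestimates r.
Is the estimator consistent? No: as n grows, $\hat{r}$ converges to $1 - e^{-r}$ rather than to r.
How efficient is the estimator? Corollary 4.2: Each access detects a change with probability $1 - e^{-r}$, so X is binomial and the standard deviation of $\hat{r} = X/n$ is $\sigma = \sqrt{e^{-r}(1 - e^{-r})/n}$.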
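The bias stated in Theorem 4.1 can be illustrated with an exact calculation. This is a sketch under the assumption that each access independently detects a change with probability 1 − e^{−r}, so that X is binomial; the function names are hypothetical and not taken from the slides or the paper.

```python
import math


def detect_probability(r: float) -> float:
    """Probability that at least one change occurs between two accesses."""
    return 1.0 - math.exp(-r)


def expected_naive_estimate(r: float, n: int) -> float:
    """Exact E[X/n] when X ~ Binomial(n, 1 - e^{-r})."""
    p = detect_probability(r)
    return sum(
        (x / n) * math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)
        for x in range(n + 1)
    )


def naive_std(r: float, n: int) -> float:
    """Standard deviation of X/n under the same binomial assumption."""
    p = detect_probability(r)
    return math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # r = 1 means the element changes, on average, once per access interval.
    for r in (0.1, 1.0, 3.0):
        print(r, expected_naive_estimate(r, n=10), naive_std(r, n=10))
```

The expected value stays near 1 − e^{−r} regardless of n, so the gap to the true r widens as r grows: when the element changes much faster than it is accessed, most changes go undetected.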

5. ESTIMATION OF FREQUENCY: LAST DATE OF CHANGE: Let T be the time to the previous event in a Poisson process with rate $\lambda$. Then the expected value of T is $E[T] = 1/\lambda$. The new estimator consists of three functions:
Init()
Update()
Estimate()

The estimator using last-modified dates:
    Init()                      /* initialize variables */
        N = 0;                  /* total number of accesses */
        X = 0;                  /* number of detected changes */
        T = 0;                  /* sum of the times from changes */
    Update(Ti, Ii)              /* update variables */
        /* Ti: time to the previous change at the i-th access */
        /* Ii: interval since the previous access */
        N = N + 1;
        /* Has the element changed? */
        If (Ti < Ii) then       /* The element has changed. */
            X = X + 1;
            T = T + Ti;
        else                    /* The element has not changed. */
            T = T + Ii;
    Estimate()                  /* return the estimated lambda */
        return X/T;
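The pseudocode above translates directly into a small class. The following is a sketch only: the class name `LastModifiedEstimator`, the argument names, and the error raised when no time has accumulated are assumptions added for illustration, not part of the slides.

```python
class LastModifiedEstimator:
    """Running estimate of the change rate from last-modified dates."""

    def __init__(self) -> None:
        # Mirrors Init() in the pseudocode.
        self.n = 0    # total number of accesses
        self.x = 0    # number of detected changes
        self.t = 0.0  # sum of the times from changes

    def update(self, time_since_change: float, access_interval: float) -> None:
        """Mirrors Update(Ti, Ii): record one access to the element."""
        self.n += 1
        if time_since_change < access_interval:
            # The element changed since the previous access.
            self.x += 1
            self.t += time_since_change
        else:
            # No change since the previous access.
            self.t += access_interval

    def estimate(self) -> float:
        """Mirrors Estimate(): return X/T."""
        if self.t == 0:
            # Assumption of this sketch: refuse to divide by zero.
            raise ValueError("no elapsed time recorded yet")
        return self.x / self.t


if __name__ == "__main__":
    est = LastModifiedEstimator()
    # Hypothetical daily accesses (interval 1.0) with made-up change ages.
    est.update(0.5, 1.0)   # changed half a day before this access
    est.update(3.0, 1.0)   # last change predates the previous access
    est.update(0.25, 1.0)  # changed a quarter of a day before this access
    print(est.estimate())  # 2 changes over 1.75 accumulated days
```

Keeping only the three counters means the estimate can be refreshed after every access without storing the access history.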

6. EXPERIMENTS: Non-Poisson model. Improvement from last modification date. Effectiveness of estimators for real Web data.

COMPARISON OF NAÏVE ESTIMATOR AND OURS

Application to a Web crawler: Uniform Policy: Naïve Policy. Our Policy.

7. CONCLUSION: Future work: an adaptive scheme that handles a changing λ.

REFERENCES: Junghoo Cho, Hector Garcia-Molina. "Estimating frequency of change." ACM Transactions on Internet Technology, 3(3), August 2003. http://oak.cs.ucla.edu/~cho/papers/cho-freq.pdf

THANK YOU