Published by Steven O'Connor. Modified over 6 years ago.

1
Effective Change Detection Using Sampling. Junghoo "John" Cho, Alexandros Ntoulas (UCLA)

2
Junghoo "John" Cho (UCLA Computer Science). Application: Web search engines/crawlers, Web archives, data warehouses, ... Problem: polling. [Diagram labels: remote database, local database, query, update]

3
Existing Approach. Round robin: download pages in a round-robin manner. Change-frequency based [CLW98, CGM00, EMT01]: estimate the change frequency, then adjust the download frequency; proven to be optimal.

4
Our Approach. Sampling-based: sample k pages from each source, then download more pages from the sources with more changed samples.

5
Comparison. Frequency based: proven to be optimal; change history required; change frequency is difficult to estimate. Sampling based: can be worse than the frequency-based policy; no history or frequency estimation required. Experimental comparison later.

6
Questions. Are we assuming correlation? How to use sampling results (proportional vs. greedy)? How many samples? Dynamic sample size adjustment? What if we have very limited resources?

7
Is Correlation Necessary? Random sampling: correlation is not necessary, only random sampling. More discussion later. [Figure labels: 4/5, 1/5]

9
Download Model (1). Fixed download cycle: say, once a month. Fixed download resources in each cycle: say, 100,000 page downloads every month. Goal: download as many changes as we can. ChangeRatio = (number of changed and downloaded pages) / (number of downloaded pages)
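The ChangeRatio metric above can be written as a small function. This is a minimal sketch: the function name, the set-based representation, and the page identifiers are illustrative assumptions, not something the slides specify.

```python
def change_ratio(downloaded, changed):
    """Fraction of the downloaded pages that had actually changed.

    `downloaded` and `changed` are collections of page identifiers
    (hypothetical representation, chosen for illustration).
    """
    downloaded = set(downloaded)
    if not downloaded:
        return 0.0  # assumption: define the ratio as 0 when nothing was downloaded
    return len(downloaded & set(changed)) / len(downloaded)


# Hypothetical cycle: 4 pages downloaded, 3 of them had changed.
ratio = change_ratio(["a", "b", "c", "d"], ["a", "b", "c", "z"])
print(ratio)  # 0.75
```

Page "z" changed but was not downloaded, so it does not count toward the ratio.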

10
Download Model (2). Two-stage sampling policy: a sampling stage followed by a download stage. Sampling itself requires page downloads.
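The sampling stage might look like the sketch below. The names `sampling_stage` and `has_changed` are hypothetical; in a real crawler, `has_changed` would download the page and compare it with the local copy, which is why every sample consumes download budget.

```python
import random


def sampling_stage(sites, has_changed, k, rng):
    """Sample up to k pages per site; return {site: (n_sampled, n_changed)}.

    `sites` maps a site name to its list of page ids. `has_changed` is a
    hypothetical callable standing in for "download and compare".
    """
    results = {}
    for site, pages in sites.items():
        sample = rng.sample(pages, min(k, len(pages)))  # uniform, without replacement
        n_changed = sum(1 for page in sample if has_changed(page))
        results[site] = (len(sample), n_changed)
    return results


# Toy data (made up): every page of site A changed, no page of site B did.
sites = {"A": [("A", i) for i in range(20)], "B": [("B", i) for i in range(20)]}
samples = sampling_stage(sites, lambda page: page[0] == "A", k=5, rng=random.Random(0))
print(samples)  # {'A': (5, 5), 'B': (5, 0)}
```

The download stage then spends whatever budget is left after the samples have been paid for.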

11
How to Use Sampling Result? Sites A and B, each with 20 pages. 20 total downloads: 5 samples from each site, leaving 10 page downloads. Sample results: 4/5 changed in A, 1/5 changed in B.

12
Proportional Policy. Download pages in proportion to the detected changes: 8 pages from A, 2 pages from B (samples: A 4/5, B 1/5).
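A minimal sketch of the proportional split follows. The function name and the largest-remainder rounding are assumptions added for illustration; the slide gives only the 8/2 outcome. Per-site capacity is ignored to keep the sketch short.

```python
def proportional_allocation(changed_samples, budget):
    """Split `budget` downloads across sites in proportion to changed samples."""
    total = sum(changed_samples.values())
    if total == 0:
        # Assumption: with no detected changes there is nothing to be proportional to.
        return {site: 0 for site in changed_samples}
    exact = {site: budget * c / total for site, c in changed_samples.items()}
    alloc = {site: int(share) for site, share in exact.items()}
    # Largest-remainder rounding so the allocation sums to the budget.
    leftover = budget - sum(alloc.values())
    by_fraction = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for site in by_fraction[:leftover]:
        alloc[site] += 1
    return alloc


# The slide's example: 4 of 5 samples changed in A, 1 of 5 in B, 10 downloads left.
allocation = proportional_allocation({"A": 4, "B": 1}, budget=10)
print(allocation)  # {'A': 8, 'B': 2}
```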

13
Greedy Policy. Download pages from the sites with the most changes: 10 pages from A (samples: A 4/5, B 1/5).
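The greedy policy can be sketched as below. The function name and the `remaining_pages` argument are illustrative assumptions; ranking by the raw count of changed samples assumes every site was sampled with the same k.

```python
def greedy_allocation(changed_samples, remaining_pages, budget):
    """Spend the budget on the sites with the most changed samples first.

    `remaining_pages` is the number of not-yet-downloaded pages per site,
    which caps how much a single site can absorb.
    """
    alloc = {site: 0 for site in changed_samples}
    ranked = sorted(changed_samples, key=changed_samples.get, reverse=True)
    for site in ranked:
        if budget == 0:
            break
        take = min(remaining_pages[site], budget)
        alloc[site] = take
        budget -= take
    return alloc


# The slide's example: 20 pages per site, 5 already sampled, 10 downloads left.
allocation = greedy_allocation({"A": 4, "B": 1}, {"A": 15, "B": 15}, budget=10)
print(allocation)  # {'A': 10, 'B': 0}
```

If the budget exceeded A's 15 remaining pages, the overflow would go to B, the next site in the ranking.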

14
Optimality of Greedy. Theorem: greedy is optimal if we make download decisions purely based on sampling results. The optimality is probabilistic: it holds for expected values.
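The two-site example from the earlier slides gives a feel for the claim. The sketch below is not the proof; it only compares the expected number of changed pages under both allocations, assuming each site's sampled fraction is used as the estimate of its true change fraction.

```python
from fractions import Fraction


def expected_changes(allocation, estimated_fraction):
    """Expected number of changed pages among the downloaded ones."""
    return sum(n * estimated_fraction[site] for site, n in allocation.items())


# Estimates taken from the samples: 4/5 changed in A, 1/5 changed in B.
estimates = {"A": Fraction(4, 5), "B": Fraction(1, 5)}

proportional = expected_changes({"A": 8, "B": 2}, estimates)
greedy = expected_changes({"A": 10, "B": 0}, estimates)
print(proportional, greedy)  # 34/5 8
```

Under these estimates the proportional split expects 6.8 changed pages and the greedy split expects 8: every download moved from A to B trades a 4/5 chance of a change for a 1/5 chance.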

16
How Many Samples? Too few samples: inaccurate change estimates. Too many samples: resources wasted on sampling. How to determine the optimal sample size?

17
Optimal Sample Size. Factors to consider: total number of pages that we maintain; number of pages that we can download in the current cycle; number of pages in each Web site; change distribution. Scenario 1: A 90/100, B 10/100. Scenario 2: A 60/100, B 40/100.
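To see why the change distribution matters, the sketch below computes how often k samples per site rank A strictly above B in each scenario. It reads "90/100" as 90 changed pages out of 100 and assumes uniform sampling without replacement (a hypergeometric model); both readings are assumptions, and the function names are illustrative.

```python
from math import comb


def sample_pmf(site_size, changed, k):
    """P(x changed pages among k sampled), for x = 0..k (hypergeometric)."""
    total = comb(site_size, k)
    return [
        comb(changed, x) * comb(site_size - changed, k - x) / total
        for x in range(k + 1)
    ]


def prob_a_ranked_first(changed_a, changed_b, site_size, k):
    """Probability that A's sample shows strictly more changes than B's."""
    pmf_a = sample_pmf(site_size, changed_a, k)
    pmf_b = sample_pmf(site_size, changed_b, k)
    return sum(
        pa * pb
        for xa, pa in enumerate(pmf_a)
        for xb, pb in enumerate(pmf_b)
        if xa > xb
    )


for k in (1, 5, 20):
    scenario_1 = prob_a_ranked_first(90, 10, 100, k)
    scenario_2 = prob_a_ranked_first(60, 40, 100, k)
    print(k, round(scenario_1, 3), round(scenario_2, 3))
```

When the sites differ sharply (Scenario 1), a handful of samples already separates them; when they are close (Scenario 2), more samples are needed before the ranking becomes reliable.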

18
Change Fraction Distribution. For each site i, take the fraction of its pages that changed; f( ) is the distribution of these values across sites. [Plot: f( ), vertical axis "fraction of sites", with t marked]

19
Optimal Sample Size. N: number of pages in a site. r: (number of pages to download) / (number of pages we maintain). Analysis is complex; [expression lost in extraction] is a good rule of thumb.

20
Dynamic Sample Size? Do we need the same sample size for every site? Example change fractions: A = 0, B = 0.45, C = 0.55, D = 1.

21
Adaptive Sampling. If the estimated change fraction is high or low enough, make an early decision. What does "high enough" mean? The confidence interval lies above the threshold t.
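One way to turn the confidence-interval idea into a stopping rule is sketched below. The slide does not say which interval is used; the Wilson score interval, the 95% level (z = 1.96), and the function names here are assumptions chosen for illustration.

```python
from math import sqrt


def wilson_interval(changed, sampled, z=1.96):
    """Wilson score interval for a site's change fraction (assumed interval type)."""
    if sampled == 0:
        return 0.0, 1.0
    p = changed / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def early_decision(changed, sampled, threshold, z=1.96):
    """Decide as soon as the whole interval is on one side of the threshold t."""
    low, high = wilson_interval(changed, sampled, z)
    if low > threshold:
        return "download"
    if high < threshold:
        return "skip"
    return "keep sampling"


# After 10 samples, with a hypothetical threshold t = 0.5:
print(early_decision(10, 10, 0.5))  # download
print(early_decision(0, 10, 0.5))   # skip
print(early_decision(5, 10, 0.5))   # keep sampling
```

Sites whose change fraction is far from t are settled after a few samples, while sites near t keep drawing samples, so the sampling budget concentrates where the decision is hard.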

22
In the Paper. More details on: optimal sample size; the adaptive policy; the cases where resources are too limited for sampling.

23
Experiments. 353,000 pages from 252 sites, mostly popular sites (Yahoo, CNN, Microsoft, …), ~1,400 pages from each site, collected by following links in a breadth-first manner. Monthly change history for 6 months: 5 download cycles. In the experiments, 100,000 page downloads in each download cycle.

24
Comparison of Policies. [Figure: ChangeRatio of each policy]

25
Optimal Sample Size. Optimal sample size ~ 10 through 60; ~ 20. [Plot: ChangeRatio vs. sample size]

26
Comparison of Long-Term Performance. Problem: we have only 5 download cycles of data. Solution: extrapolate the history (repeat).

27
Frequency vs. Sampling. [Plot: ChangeRatio per download cycle for the frequency-based and greedy policies]

28
Related Work. Frequency-based policy: Coffman et al., Journal of Scheduling 1998; Cho et al., SIGMOD 2000; Edwards et al., WWW 2001. Source cooperation: Olston et al., SIGMOD 2002.

29
Conclusion. Sampling-based policy: great short-term performance; no change history required. Frequency-based policy: potentially good long-term performance if the change frequency does not change. Greedy is easy to implement and shows high performance.

30
Future Work. Combination of sampling and frequency-based policies: switch to the frequency-based policy after a while. Good partitioning for sampling: site based? Directory based? Content based? Link-structure based?

31
Questions?
