Effective Change Detection Using Sampling. Junghoo "John" Cho and Alexandros Ntoulas, UCLA.

Presentation transcript:

Slide 1: Effective Change Detection Using Sampling
- Junghoo "John" Cho and Alexandros Ntoulas, UCLA

Junghoo "John" Cho (UCLA Computer Science), slide 2: Application and Problem
- Application: Web search engines/crawlers, Web archives, data warehouses, ...
- Problem: polling (diagram labels: remote database, local database, query, update)

Slide 3: Existing Approach
- Round robin
  - Download pages in a round-robin manner
- Change-frequency based [CLW98, CGM00, EMT01]
  - Estimate the change frequency
  - Adjust the download frequency
  - Proven to be optimal

Slide 4: Our Approach
- Sampling-based
  - Sample k pages from each source
  - Download more pages from the sources with more changed samples

Slide 5: Comparison
- Frequency based
  - Proven to be optimal
  - Change history required
  - Difficult to estimate the change frequency
- Sampling based
  - Can be worse than the frequency-based policy
  - No history or frequency estimation required
- Experimental comparison later

Slide 6: Questions
- Are we assuming correlation?
- How to use sampling results?
  - Proportional vs. greedy
- How many samples?
- Dynamic sample size adjustment?
- What if we have very limited resources?

Slide 7: Is Correlation Necessary?
- Random sampling
- Correlation is not necessary, only random sampling
- More discussion later
- Figure labels: 4/5, 1/5

Slide 8: Questions
- Are we assuming correlation?
- How to use sampling results?
  - Proportional vs. greedy
- How many samples?
- Dynamic sample size adjustment?
- What if we have very limited resources?

Slide 9: Download Model (1)
- Fixed download cycle
  - Say, once a month
- Fixed download resources in each cycle
  - Say, 100,000 page downloads every month
- Goal: download as many changes as we can
- ChangeRatio = (number of changed and downloaded pages) / (number of downloaded pages)
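As a minimal sketch of the ChangeRatio metric defined on this slide, the function below counts how many downloaded pages had actually changed. The function name, the page identifiers, and the choice to return 0.0 when nothing was downloaded are assumptions made for illustration.

```python
def change_ratio(downloaded, changed):
    """Fraction of downloaded pages that had actually changed.

    downloaded: iterable of page ids fetched in this cycle
    changed:    iterable of page ids that changed since the last cycle
    """
    downloaded = set(downloaded)
    if not downloaded:
        return 0.0  # assumption: define the ratio as 0 when nothing was downloaded
    return len(downloaded & set(changed)) / len(downloaded)


# Hypothetical example: 4 pages downloaded, 2 of them had changed.
ratio = change_ratio(["a", "b", "c", "d"], ["b", "d", "e"])
```

Page "e" changed but was not downloaded, so it does not count toward the numerator.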

Slide 10: Download Model (2)
- Two-stage sampling policy
  - Sampling stage
  - Download stage
- Sampling requires page downloads
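The two-stage policy can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the data layout (a dict of page-id lists and a set of changed ids), the function name, and the use of a greedy download stage are all hypothetical choices, and every site is assumed to be non-empty. Sampled pages count against the download budget, since sampling requires page downloads.

```python
import random


def two_stage_cycle(sites, changed, budget, k, rng):
    """Run one download cycle and return its ChangeRatio.

    sites:   {site name: list of page ids} (each site assumed non-empty)
    changed: set of page ids that changed since the last cycle
    budget:  total page downloads allowed, including samples
    k:       number of pages sampled per site
    """
    downloaded, estimates, leftovers = [], {}, {}

    # Sampling stage: download k random pages per site and estimate
    # the fraction of changed pages from the sample.
    for name, pages in sites.items():
        sample = rng.sample(pages, min(k, len(pages)))
        downloaded.extend(sample)
        estimates[name] = sum(p in changed for p in sample) / len(sample)
        sampled = set(sample)
        leftovers[name] = [p for p in pages if p not in sampled]

    # Download stage (greedy, an assumption here): spend the remaining
    # budget on the sites with the highest estimated change fraction.
    remaining = budget - len(downloaded)
    for name in sorted(estimates, key=estimates.get, reverse=True):
        if remaining <= 0:
            break
        batch = leftovers[name][:remaining]
        downloaded.extend(batch)
        remaining -= len(batch)

    return sum(p in changed for p in downloaded) / len(downloaded)


# Hypothetical data: every page of site A changed, no page of site B did.
sites = {"A": [("A", i) for i in range(20)], "B": [("B", i) for i in range(20)]}
ratio = two_stage_cycle(sites, set(sites["A"]), budget=20, k=5, rng=random.Random(0))
```

In this extreme example the 5 samples spent on site B are the only wasted downloads, which is the cost of sampling that the later slides on sample size try to balance.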

Slide 11: How to Use Sampling Result?
- Sites A and B, each with 20 pages
- 20 total downloads, 5 samples from each site
- 10 page downloads remaining
- Figure: 4/5 of the samples from A changed, 1/5 of the samples from B changed

Slide 12: Proportional Policy
- Download pages proportionally to the detected changes
- 8 pages from A, 2 pages from B
- Figure: A 4/5, B 1/5
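The proportional allocation in this example can be sketched as below. The function name is hypothetical, and two details are assumptions the slides do not specify: leftover downloads from rounding go to the sites with the largest remainders, and the budget is split evenly if no sample changed. Site capacity is ignored for brevity.

```python
def proportional_allocation(changed_samples, budget):
    """Split `budget` downloads across sites in proportion to the
    number of changed pages found in each site's sample.

    changed_samples: {site name: number of changed pages in its sample}
    """
    weights = dict(changed_samples)
    if sum(weights.values()) == 0:
        # Assumption: with no detected changes, treat all sites equally.
        weights = {site: 1 for site in weights}
    total = sum(weights.values())

    # Integer arithmetic avoids floating-point rounding surprises.
    alloc = {site: w * budget // total for site, w in weights.items()}
    remainders = {site: w * budget % total for site, w in weights.items()}

    # Hand out downloads lost to rounding, largest remainder first.
    leftover = budget - sum(alloc.values())
    for site in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[site] += 1
    return alloc


# The slide's example: 4 of 5 samples changed at A, 1 of 5 at B, 10 downloads left.
allocation = proportional_allocation({"A": 4, "B": 1}, budget=10)
```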

Slide 13: Greedy Policy
- Download pages from the sites with the most changes
- 10 pages from A
- Figure: A 4/5, B 1/5
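A minimal sketch of the greedy allocation for the same example follows. The function name and input layout are hypothetical; the sketch assumes every site was sampled at least once and that a site cannot supply more pages than it has left after sampling.

```python
def greedy_allocation(samples, site_sizes, budget):
    """Give the remaining downloads to the sites with the highest
    sampled change fraction, filling one site before moving on.

    samples:    {site name: (changed pages in sample, sample size)}
    site_sizes: {site name: total number of pages in the site}
    budget:     downloads left after the sampling stage
    """
    # Rank sites by the fraction of sampled pages that changed.
    order = sorted(samples, key=lambda s: samples[s][0] / samples[s][1], reverse=True)
    alloc = {site: 0 for site in samples}
    for site in order:
        if budget == 0:
            break
        unsampled = site_sizes[site] - samples[site][1]
        take = min(unsampled, budget)
        alloc[site] = take
        budget -= take
    return alloc


# The slide's example: two 20-page sites, 5 samples each, 10 downloads left.
allocation = greedy_allocation(
    {"A": (4, 5), "B": (1, 5)}, {"A": 20, "B": 20}, budget=10
)
```

If the budget exceeded the 15 unsampled pages of site A, the remainder would spill over to site B.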

Slide 14: Optimality of Greedy
- Theorem: greedy is optimal if we make download decisions purely based on sampling results
- Probabilistic optimality, in terms of expected values

Slide 15: Questions
- Are we assuming correlation?
- How to use sampling results?
  - Proportional vs. greedy
- How many samples?
- Dynamic sample size adjustment?
- What if we have very limited resources?

Slide 16: How Many Samples?
- Too few samples: inaccurate change estimates
- Too many samples: waste of resources for sampling
- How to determine the optimal sample size?

Slide 17: Optimal Sample Size
- Factors to consider
  - Total number of pages that we maintain
  - Number of pages that we can download in the current cycle
  - Number of pages in each Web site
  - Change distribution
    - Scenario 1: A 90/100, B 10/100
    - Scenario 2: A 60/100, B 40/100
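One way to see why the change distribution affects the sample size is to compute how often a small sample ranks the two sites correctly in each scenario. The sketch below is an illustration, not taken from the slides: it assumes pages are sampled without replacement (so sample counts follow a hypergeometric distribution), and the function names and the sample size of 5 are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, n_pages, n_changed, k):
    """P(exactly x changed pages in a sample of k pages drawn without
    replacement from a site with n_pages pages, n_changed of them changed)."""
    return comb(n_changed, x) * comb(n_pages - n_changed, k - x) / comb(n_pages, k)


def prob_correct_ranking(n_pages, changed_a, changed_b, k):
    """P(the sample from A shows strictly more changes than the sample
    from B), when both sites have n_pages pages and k samples each."""
    return sum(
        hypergeom_pmf(a, n_pages, changed_a, k) * hypergeom_pmf(b, n_pages, changed_b, k)
        for a in range(k + 1)
        for b in range(a)  # b < a: A's sample looks more changed than B's
    )


# Scenario 1 (A: 90/100, B: 10/100) vs. Scenario 2 (A: 60/100, B: 40/100).
p_scenario_1 = prob_correct_ranking(100, 90, 10, k=5)
p_scenario_2 = prob_correct_ranking(100, 60, 40, k=5)
```

With the same sample size, the widely separated sites of Scenario 1 are ranked correctly more often than the close sites of Scenario 2, so the second scenario needs more samples for the same confidence.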

Slide 18: Change Fraction Distribution
- For each site i: the fraction of changed pages in site i (the symbol is missing from the transcript)
- f( ): distribution of these values
- Figure labels: fraction of sites, f( ), t

Slide 19: Optimal Sample Size
- N: number of pages in a site
- r: (number of pages to download) / (number of pages we maintain)
- Analysis is complex
- [expression missing from the transcript] is a good rule of thumb

Slide 20: Dynamic Sample Size?
- Do we need the same sample size for every site?
- Example change fractions: A = 0, B = 0.45, C = 0.55, D = 1

Slide 21: Adaptive Sampling
- If the estimated change fraction is high/low enough, make an early decision
- What does "high enough" mean?
  - Confidence interval above the threshold
- Figure labels: threshold t, confidence intervals for individual sites i
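The early-decision idea can be sketched as follows. The slides do not say which confidence interval is used, so the Wilson score interval at roughly 95% confidence (z = 1.96) is an assumption, as are the function names, the decision labels, and the example threshold of 0.5.

```python
from math import sqrt


def wilson_interval(changed, sampled, z=1.96):
    """Wilson score interval for the change fraction of a site, given
    `changed` changed pages among `sampled` sampled pages (sampled > 0)."""
    p = changed / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return center - half, center + half


def early_decision(changed, sampled, threshold, z=1.96):
    """Stop sampling a site once its whole confidence interval lies on
    one side of the threshold; otherwise keep sampling."""
    low, high = wilson_interval(changed, sampled, z)
    if low > threshold:
        return "download"       # confidently above the threshold
    if high < threshold:
        return "skip"           # confidently below the threshold
    return "keep sampling"      # interval still straddles the threshold


# Hypothetical samples of 10 pages each, threshold 0.5.
decisions = [early_decision(c, 10, threshold=0.5) for c in (10, 0, 5)]
```

Sites whose change fraction is far from the threshold settle after a few samples, while sites near it keep consuming samples, which matches the motivation for a per-site sample size.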

Slide 22: In the Paper
- More details on
  - Optimal sample size
  - Adaptive policy
  - The cases where resources are too limited for sampling

Slide 23: Experiments
- 353,000 pages from 252 sites
  - Mostly popular sites: Yahoo, CNN, Microsoft, ...
  - ~1,400 pages from each site
  - Followed the links in a breadth-first manner
- Monthly change history for 6 months
  - 5 download cycles
- In the experiments, 100,000 page downloads in each download cycle

Slide 24: Comparison of Policies
- Figure: ChangeRatio of each policy

Slide 25: Optimal Sample Size
- Optimal sample size ~ 10 through 60
- ~ 20
- Figure: ChangeRatio vs. sample size

Slide 26: Comparison of Long-Term Performance
- Problem: we have data for only 5 download cycles
- Solution: extrapolate the history by repeating it
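A minimal sketch of extrapolating a short history by repetition is shown below. How exactly the authors repeated the history is not stated on the slide, so the cyclic repetition, the function name, and the 0/1 encoding of "page changed in this cycle" are assumptions.

```python
from itertools import cycle, islice


def extrapolate_history(observed, n_cycles):
    """Extend an observed per-cycle change history to n_cycles entries
    by repeating the observed pattern from the start."""
    return list(islice(cycle(observed), n_cycles))


# Hypothetical page history: 1 = changed in that cycle, 0 = unchanged.
observed = [1, 0, 1, 1, 0]          # the 5 observed download cycles
extended = extrapolate_history(observed, 12)
```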

Slide 27: Frequency vs. Sampling
- Figure: ChangeRatio vs. download cycle for the frequency-based and greedy policies

Slide 28: Related Work
- Frequency-based policy
  - Coffman et al., Journal of Scheduling, 1998
  - Cho et al., SIGMOD 2000
  - Edwards et al., WWW 2001
- Source cooperation
  - Olston et al., SIGMOD 2002

Slide 29: Conclusion
- Sampling-based policy
  - Great short-term performance
  - No change history required
- Frequency-based policy
  - Potentially good long-term performance if the change frequency does not change
- Greedy is easy to implement and shows high performance

Slide 30: Future Work
- Combination of sampling-based and frequency-based policies
  - Switch to the frequency-based policy after a while
- Good partitioning for sampling?
  - Site based? Directory based? Content based? Link-structure based?

Slide 31: Questions?