
Looking at both the Present and the Past to Efficiently Update Replicas of Web Content Luciano Barbosa * Ana Carolina Salgado ! Francisco Tenorio ! Jacques.


1 Looking at both the Present and the Past to Efficiently Update Replicas of Web Content
Luciano Barbosa*, Ana Carolina Salgado!, Francisco Tenorio!, Jacques Robin!, Juliana Freire*
* University of Utah
! Universidade Federal de Pernambuco

2 WIDM 2005 - Looking at both the Present and the Past to Efficiently Update Replicas of Web Content
Traditional Library Scenario
[Diagram labels: Central Catalog; Shelves; Add a book; Update or remove a book]
- Perfect scenario
- All the information up-to-date

3 Huge-Web Library Scenario
[Diagram labels: Central Catalog; Shelves; Add a book; Update or remove a book]
- Anybody can add and remove books without notifying the librarian
- And there are billions of books

4 Challenges
- Autonomous and non-cooperative sources
- Lots of data: billions of pages
- Dynamism
  - 40% of Web pages change at least once a week (Cho and Garcia-Molina, 2000)
- Applications run over limited resources
  - Search engines only cover a subset of the pages on the Web
  - It takes search engines an average of 186 days to update pages (Lawrence and Giles, 1999)
- Update too often: wasted resources
- Update sporadically: stale content

5 Applications that Keep Replicas of Web Content
- Search engine
  - Stale content
    - Results have broken links, and new page content is not available
    - Low quality of results
- Proxy server
- Web archive
  - E.g., http://www.archive.org

6 Current Solutions
Two main approaches: push and pull
- Push
  - Site provides information about the change frequency of its pages
  - Efficient, but requires cooperation
  - E.g., Google Sitemaps
- Pull
  - No cooperation required
  - Uniform policy
  - Non-uniform policy
    - Application learns change frequency
    - Expensive to learn: needs exhaustive crawls until frequencies are learned
Can we do better?
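The difference between the two pull policies can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the daily visit budget, the made-up change rates, and the choice of a proportional rule as the non-uniform policy are not taken from the slides.

```python
def uniform_schedule(pages, budget_per_day):
    """Uniform policy: every page gets the same revisit interval (in days)."""
    interval = len(pages) / budget_per_day
    return {url: interval for url in pages}


def proportional_schedule(change_rates, budget_per_day):
    """Non-uniform policy (assumed proportional rule): visit each page at a
    rate proportional to its estimated change rate.

    change_rates maps URL -> estimated changes per day (all values > 0).
    """
    total = sum(change_rates.values())
    schedule = {}
    for url, rate in change_rates.items():
        visits_per_day = budget_per_day * rate / total
        schedule[url] = 1.0 / visits_per_day  # revisit interval in days
    return schedule


# Made-up change-rate estimates for three hypothetical pages.
rates = {"a.html": 1.0, "b.html": 0.5, "c.html": 0.1}
uniform = uniform_schedule(rates, budget_per_day=1.6)
proportional = proportional_schedule(rates, budget_per_day=1.6)
```

With the same budget, the uniform policy revisits all three pages at one interval, while the proportional policy revisits the fast-changing page far more often than the slow one. The proportional policy needs change-rate estimates first, which is the learning cost the slide refers to.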

7 Our Solution: Also look at the present!
- Similar to “pull” approaches
- Non-uniform policy
- Look at the present to reduce the cost of learning
  - Take page content into account
- Quickly adapts to changes in update frequencies
- Avoids unnecessary visits

8 Updating Web Content: Our Solution
[Flowchart labels: Crawler; Page; New Page? (Yes / No); Static Classifier; Historic Classifier; Page history; Change History; Change prediction; Change Predictions; Phase 1; Phase 2]
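The dispatch shown in the diagram can be sketched roughly as below. This is an illustration under assumptions: the function and variable names are hypothetical, the two toy classifiers are placeholders rather than the trained models, and the class labels other than "one week" are invented for the example.

```python
def predict_change_class(url, page, history_store, static_classifier, historic_classifier):
    """Route a crawled page to one of the two classifiers.

    A page with no recorded history is treated as new and goes to the static
    classifier; a page seen before goes to the historic classifier.
    """
    history = history_store.get(url)
    if history is None:
        return static_classifier(page)
    return historic_classifier(history)


# Toy stand-ins for the two classifiers (NOT the models from the talk).
def toy_static(page):
    return "one day" if "Last-Modified" not in page["headers"] else "one month"


def toy_historic(history):
    return history["current_class"]


store = {"http://example.org/old": {"current_class": "one week"}}
page = {"headers": {}, "body": "<html></html>"}

first = predict_change_class("http://example.org/new", page, store, toy_static, toy_historic)
second = predict_change_class("http://example.org/old", page, store, toy_static, toy_historic)
```

Passing the classifiers in as callables keeps the routing logic independent of how either prediction is made.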

9 Building the Solution: Overview
1. Gathering the data
2. Creating the change rate groups
3. Learning static features
4. Learning from history

10 Gathering the Data Set
- 100 most accessed sites of the Brazilian Web
- Breadth-first search down to depth 9
- Total of about 85,000 URLs
  - 2/3 used to build the classifiers
  - 1/3 used to run the experimental evaluation
- Each page visited once a day for 100 days
- Result:
  - Attributes of pages
  - History of page changes: the average change rate of each page
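One way to turn daily visits into an average change rate is sketched below. The slides do not say how changes were detected, so comparing content hashes of consecutive snapshots is an assumption, as are the function names and the sample data.

```python
import hashlib


def checksum(content: bytes) -> str:
    """Content fingerprint used to detect a change (assumed detection method)."""
    return hashlib.sha256(content).hexdigest()


def average_change_rate(daily_snapshots):
    """Fraction of consecutive daily visits on which the page content differed."""
    digests = [checksum(snapshot) for snapshot in daily_snapshots]
    if len(digests) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(digests, digests[1:]) if prev != cur)
    return changes / (len(digests) - 1)


# Five made-up daily snapshots of one page: it changes twice over four intervals.
snapshots = [b"v1", b"v1", b"v2", b"v2", b"v3"]
rate = average_change_rate(snapshots)
```

With 100 daily visits per page, the same function would average over 99 intervals.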

11 Creating the Change Rate Groups
- Discretizing the change rates
  - Classification task
- Performed an unsupervised discretization
- Result: [figure not included in the transcript]
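The slide does not name the discretization method, so the following sketch uses equal-frequency binning purely as an example of an unsupervised discretization. The number of groups and the sample change rates are made up.

```python
def equal_frequency_cuts(values, n_bins):
    """Cut points that split the sorted values into n_bins groups of near-equal size."""
    ordered = sorted(values)
    cuts = []
    for i in range(1, n_bins):
        idx = i * len(ordered) // n_bins
        cuts.append(ordered[idx])
    return cuts


def assign_group(value, cuts):
    """Index of the change-rate group for a value; group 0 is the least dynamic."""
    group = 0
    for cut in cuts:
        if value >= cut:
            group += 1
    return group


# Made-up average change rates for eight pages.
rates = [0.0, 0.01, 0.02, 0.1, 0.15, 0.5, 0.9, 1.0]
cuts = equal_frequency_cuts(rates, n_bins=4)
groups = [assign_group(r, cuts) for r in rates]
```

The group indices then serve as class labels, which turns change-rate prediction into a classification task.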

12 Static Classifier
- Classifies pages into modification groups
- Based on some static features
- Relation between some Web page attributes and page dynamism
  - Dynamic pages are larger and have more images (Douglis et al., 1999)
  - The absence of the HTTP header LAST-MODIFIED indicates more dynamic pages (Brewington and Cybenko, 2000)
- Attributes used
  - Presence of the HTTP header LAST-MODIFIED, file size in bytes, number of images, depth of a page in its domain...
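Three of the listed attributes can be extracted from a fetched page with the Python standard library alone, as in this sketch. The function name, the feature keys, and the sample response are hypothetical; the depth attribute is left out here.

```python
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Counts <img> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1


def static_features(headers, body: bytes):
    """Build a feature dict from response headers and the raw page body."""
    counter = ImageCounter()
    counter.feed(body.decode("utf-8", errors="replace"))
    header_names = {name.lower() for name in headers}  # header names are case-insensitive
    return {
        "has_last_modified": "last-modified" in header_names,
        "size_bytes": len(body),
        "num_images": counter.images,
    }


# Made-up response: no Last-Modified header, two images.
body = b"<html><body><img src='a.png'><img src='b.png'></body></html>"
features = static_features({"Content-Type": "text/html"}, body)
```

All three values are available from a single fetch, so a page can be classified on its first visit without any change history.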

13 Feature Selection
- Determining the relevance of different features
  - Make sure that the features are really relevant
- Wrapper method
  - Uses classification algorithms
  - Chooses the subset that results in the lowest error rate
- Result
  - Depth of a page in its domain is not relevant
  - Remaining features used in the static classifier
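A wrapper method can be sketched as an exhaustive search over feature subsets, scoring each subset with a classifier. The sketch below is an assumption-laden illustration: it uses a leave-one-out 1-nearest-neighbour evaluator and four made-up pages, not the algorithms or data from the talk.

```python
from itertools import combinations


def loo_error_1nn(rows, labels, features):
    """Leave-one-out error of a 1-nearest-neighbour classifier using only `features`."""
    errors = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue
            dist = sum((row[f] - other[f]) ** 2 for f in features)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        if best_label != labels[i]:
            errors += 1
    return errors / len(rows)


def wrapper_select(rows, labels, all_features, evaluate=loo_error_1nn):
    """Try every non-empty feature subset; keep the first one with the lowest error."""
    best_subset, best_error = None, None
    for size in range(1, len(all_features) + 1):
        for subset in combinations(all_features, size):
            error = evaluate(rows, labels, subset)
            if best_error is None or error < best_error:
                best_subset, best_error = subset, error
    return best_subset, best_error


# Made-up toy data in which "size" separates the classes and "depth" does not.
rows = [
    {"size": 1.0, "depth": 5.0},
    {"size": 1.2, "depth": 1.0},
    {"size": 5.0, "depth": 5.2},
    {"size": 5.3, "depth": 1.1},
]
labels = ["static", "static", "dynamic", "dynamic"]
selected, error = wrapper_select(rows, labels, ["size", "depth"])
```

Exhaustive search is only practical for a handful of features, since the number of subsets doubles with each feature added.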

14 Building the Static Classifier
- Classifies pages into modification groups
- Classification algorithms
  - J48: decision tree
  - NaiveBayes: naïve Bayes
  - IBk: k-nearest neighbor
- Measures of performance
  - Error test rate
  - Classification time
- Results

Algorithm              Error test rate   Classification time
J48 without pruning    11.9              2.41s
J48 postpruning        10.7              1.63s
NaiveBayes             40.5120.5s
IBk with k=1           11.274393.15s
IBk with k=2           11.886335.49s

15 Learning from History
- Historic classifier
  - Classifies pages into modification groups
  - Based on change history
- Basic idea: pages whose change rate is close to the average rate of their class
- Each modification class has:
  - Average update rate
  - Window size: number of visits needed to re-classify a page
  - Minimum and maximum change average thresholds

16 Learning from History
Example: class “one week”, ws=3
[Flowchart: after each visit to page P, check whether the number of visits equals 3. If it does, compare the page's change average with the class thresholds (higher than max. threshold? lower than min. threshold?). Outcomes: move to a lower class, move to a higher class, or keep at the same class.]
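The window-based re-classification can be sketched as follows. Several things here are assumptions and not taken from the slides: the class list and its ordering (most frequent first), all threshold values, the names, and in particular the direction of each move (a change average above the maximum threshold is assumed to move the page to a more frequently visited class).

```python
from dataclasses import dataclass, field


@dataclass
class ChangeClass:
    name: str
    window_size: int      # visits needed before re-classification ("ws")
    min_threshold: float  # made-up values, for illustration only
    max_threshold: float


@dataclass
class PageState:
    class_index: int
    observations: list = field(default_factory=list)


def record_visit(state, changed, classes):
    """Record one visit; re-classify once the window for the current class is full."""
    state.observations.append(1 if changed else 0)
    current = classes[state.class_index]
    if len(state.observations) < current.window_size:
        return state.class_index
    average = sum(state.observations) / len(state.observations)
    state.observations.clear()  # start a fresh window
    if average > current.max_threshold and state.class_index > 0:
        state.class_index -= 1  # ASSUMED direction: toward more frequent visits
    elif average < current.min_threshold and state.class_index < len(classes) - 1:
        state.class_index += 1  # ASSUMED direction: toward less frequent visits
    return state.class_index


# Hypothetical classes, ordered from most to least frequently visited.
classes = [
    ChangeClass("one day", 3, 0.5, 1.0),
    ChangeClass("one week", 3, 0.3, 0.7),
    ChangeClass("one month", 3, 0.0, 0.4),
]

page = PageState(class_index=1)  # starts in "one week"
for changed in (True, True, True):
    record_visit(page, changed, classes)
```

Because a decision is made only once per window, a single unusual visit does not move a page; three consecutive observations are needed in the "one week" example.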

17 Experiment
- Compare against Bayesian estimator (Cho and Garcia-Molina)
  - First visit: make any assumption about the page behavior
  - Over time: Bayesian inference
- 1/3 of the monitored data
- Performance measure: error rate
  - Low error rate: pages are visited close to the actual frequency
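The slides do not give the exact formula for the error rate. One plausible reading, assumed here, is the percentage of pages whose predicted change class differs from the class derived from their observed history. The function name and sample labels are made up.

```python
def error_rate(predicted, actual):
    """Percentage of pages assigned to the wrong change class (assumed definition)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return 100.0 * wrong / len(actual)


# Made-up labels for four pages; one of them is misclassified.
predicted = ["one day", "one week", "one week", "one month"]
actual = ["one day", "one week", "one month", "one month"]
rate = error_rate(predicted, actual)
```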

18 Results

Classifier    Error rate
Random        75.22
J48           25.64

- Static classifier is more effective than making no assumption about the page behavior

Configuration        Error rate
Random + Bayesian    34.73
J48 + Historic       14.95

- Combining historic and static gives the best performance

19 Related Work
- Cho and Garcia-Molina
  - Uniform policy is always superior to the proportional (non-uniform) approach
  - Overall freshness is maximized, but their measure penalizes the most dynamic pages, which may not be updated as frequently as they change
- Pandey and Olston
  - User-centric approach to guide the update process
  - Relevance related to the likelihood that a page is viewed

20 Conclusion
- Efficient strategy for keeping replicas of Web content current:
  - Look at page contents
  - Adapt quickly to changes in update frequency
- Static classifier is effective
  - Page contents are a good indication of a page's change behavior
- Using static and historic information leads to improved performance
- Future work
  - Take additional features into account, e.g., PageRank and backlinks
  - Experiment with other learning techniques

21 THE END.

