1 Looking at both the Present and the Past to Efficiently Update Replicas of Web Content
Luciano Barbosa*, Ana Carolina Salgado†, Francisco Tenorio†, Jacques Robin†, Juliana Freire*
*University of Utah   †Universidade Federal de Pernambuco

2 Keeping Web Information Up-to-date
Types of applications: proxy servers, search engines
Quality of results: broken links

3 Keeping Web Information Up-to-date
Challenges in updating Web data:
Sources are autonomous and independent
Lots of data: billions of pages
Dynamism: 40% of Web pages change at least once a week (Cho and Garcia-Molina, 2000)
Applications run with limited resources:
Search engine coverage – 42% (Lawrence and Giles, 1999)
Average time for a search engine to update a page – 186 days

4 Updating Web Content: Our Solution
Basic idea: predict the change rate of pages and update them based on this prediction
Two phases:
First visit: page attributes, e.g., file size and number of images
Over time: change history

5 Updating Web Content: Our Solution
[Architecture diagram: the crawler fetches a page; if the page is new, the static classifier predicts its change rate from page attributes; otherwise the historic classifier predicts it from the page's change history. Predictions are stored as change predictions, and observed changes are appended to the change history.]
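As a rough illustration of the flow in this diagram, here is a minimal Python sketch; the class and method names (ReplicaUpdater, on_page_fetched) and the way predictions and histories are stored are my assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed names) of the flow in the diagram above:
# new pages go to the static classifier, revisited pages to the historic one.

class ReplicaUpdater:
    def __init__(self, static_clf, historic_clf):
        self.static_clf = static_clf        # trained on first-visit page attributes
        self.historic_clf = historic_clf    # trained on per-page change histories
        self.change_history = {}            # url -> list of observed "changed?" flags
        self.change_predictions = {}        # url -> latest change-rate prediction

    def on_page_fetched(self, url, page_features, changed=None):
        history = self.change_history.get(url)
        if history is None:
            # First visit: only static attributes are available.
            prediction = self.static_clf.predict(page_features)
            self.change_history[url] = []
        else:
            # Revisit: record whether the page changed, then use its history.
            history.append(bool(changed))
            prediction = self.historic_clf.predict(history)
        self.change_predictions[url] = prediction
        return prediction
```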

6 Building the Solution: Overview
1. Gathering the training set
2. Creating the change rate groups
3. Learning static features
4. Learning from history

7 Gathering the Training Set
100 most accessed sites of the Brazilian Web
Breadth-first search down to depth 9
Total of 84,699 URLs
Each page visited once a day for 100 days (a rough crawl sketch follows)
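A minimal sketch of such a crawl, assuming change detection by content hashing during the daily visits; the helper names, regex-based link extraction, and use of the requests library are illustrative assumptions, not the paper's setup.

```python
import hashlib
import re
from collections import deque
from urllib.parse import urljoin

import requests  # assumed HTTP client; any fetcher would work


def extract_links(html, base_url):
    """Very rough href extraction; a real crawler would use an HTML parser."""
    return {urljoin(base_url, href) for href in re.findall(r'href="([^"#]+)"', html)}


def crawl_bfs(seeds, max_depth=9):
    """Breadth-first crawl of the seed sites down to max_depth, returning URLs."""
    seen, queue = set(seeds), deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for link in extract_links(requests.get(url, timeout=10).text, url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen


def daily_snapshot(urls, last_hash, change_log):
    """One daily pass: record for each URL whether its content hash changed.
    The very first pass marks every URL as changed, since there is no prior hash."""
    for url in urls:
        digest = hashlib.md5(requests.get(url, timeout=10).content).hexdigest()
        change_log.setdefault(url, []).append(digest != last_hash.get(url))
        last_hash[url] = digest
```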

8 Creating the Change Rate Groups
Goal: predict the average interval of time at which a given page is modified
As posed, this is a regression task; discretizing the target attribute turns it into a classification task
An unsupervised discretization was performed
Result: change rate groups (shown graphically on the original slide)
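To illustrate the discretization step, here is a minimal sketch using equal-frequency binning; the paper only states that the discretization was unsupervised, so the number of groups and the binning method here are assumptions.

```python
import numpy as np


def change_rate_groups(avg_change_intervals, n_groups=4):
    """Unsupervised (equal-frequency) discretization of per-page average change
    intervals (in days) into change-rate group labels 0..n_groups-1."""
    intervals = np.asarray(avg_change_intervals, dtype=float)
    # Put bin edges at the quantiles so each group holds roughly the same number of pages.
    edges = np.quantile(intervals, np.linspace(0.0, 1.0, n_groups + 1))
    # Interior edges split the range; np.digitize then assigns a group to each page.
    return np.clip(np.digitize(intervals, edges[1:-1]), 0, n_groups - 1)
```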

9 Learning Static Features
Certain Web page attributes are related to a page's dynamism:
Dynamic pages are larger and have more images [Douglis et al.]
The absence of the HTTP LAST-MODIFIED header indicates that a page is more volatile than pages that carry this information [Brewington and Cybenko]

10 Learning Static Features
Attributes used (a rough extraction sketch follows this list):
Number of links
Number of e-mail addresses
Existence of the HTTP header LAST-MODIFIED
File size in bytes (without HTML tags)
Number of images
Depth of a page in its domain (a domain covers, for instance, every page in *.yahoo.com for the site www.yahoo.com)
Directory level of the page URL relative to the Web server's root URL (e.g., www.yahoo.com is level 1, www.yahoo.com/mail/ is level 2, and so on)
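Here is the extraction sketch referred to above; the regex-based counting and the dictionary layout are illustrative assumptions, and the depth-in-domain attribute would come from the crawler rather than from the page itself.

```python
import re
from urllib.parse import urlparse


def static_features(url, html, headers):
    """Extract the static attributes listed above from one fetched page.
    headers: HTTP response headers (a case-insensitive mapping is assumed)."""
    text_only = re.sub(r"<[^>]+>", "", html)   # strip tags for the file-size attribute
    path = urlparse(url).path
    return {
        "num_links": len(re.findall(r"<a\s", html, re.I)),
        "num_emails": len(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)),
        "has_last_modified": "Last-Modified" in headers,
        "size_bytes": len(text_only.encode("utf-8")),
        "num_images": len(re.findall(r"<img\s", html, re.I)),
        # Directory level: www.yahoo.com -> 1, www.yahoo.com/mail/ -> 2, and so on.
        "dir_level": 1 + sum(1 for part in path.split("/") if part),
        # The depth of the page in its domain is known from the crawl itself,
        # not from the page content, so it is not computed here.
    }
```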

11 Determining the Relevance of the Features
Feature selection task: wrapper method with backward elimination (a sketch follows)
Result: the depth of a page in its domain is not relevant; the remaining features are used in the static classifier
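The backward-elimination sketch referred to above, written with scikit-learn as an assumption (a decision tree as the wrapped classifier and cross-validated accuracy as the score; the paper does not specify these details).

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def backward_elimination(X, y, feature_names, cv=5):
    """Wrapper feature selection: greedily drop any feature whose removal does not
    lower cross-validated accuracy, until no such feature remains."""
    remaining = list(range(X.shape[1]))
    best = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv).mean()
    improved = True
    while improved and len(remaining) > 1:
        improved = False
        for i in list(remaining):
            candidate = [j for j in remaining if j != i]
            score = cross_val_score(DecisionTreeClassifier(),
                                    X[:, candidate], y, cv=cv).mean()
            if score >= best:              # removing feature i does not hurt
                best, remaining = score, candidate
                improved = True
                break
    return [feature_names[i] for i in remaining], best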

12 Building the Static Classifier
Classification algorithms:
J48: decision tree
NaiveBayes: naive Bayes
IBk: k-nearest neighbors
Measures of performance: error rate and classification time
Results (a comparable scikit-learn sketch follows the table):

Algorithm               Test error rate (%)   Classification time
J48 without pruning     11.9                  2.41 s
J48 with postpruning    10.7                  1.63 s
NaiveBayes              40.5                  120.5 s
IBk with k=1            11.27                 4393.15 s
IBk with k=2            11.88                 6335.49 s
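The algorithm names J48, NaiveBayes, and IBk are Weka's; a roughly equivalent comparison could be sketched in scikit-learn as below, where the train/test split, the timing loop, and GaussianNB as the naive Bayes variant are assumptions for illustration.

```python
import time

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y):
    """Compare error rate and classification time, mirroring the slide's table."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    models = {
        "decision tree (cf. J48)": DecisionTreeClassifier(),
        "naive Bayes (cf. NaiveBayes)": GaussianNB(),
        "1-NN (cf. IBk, k=1)": KNeighborsClassifier(n_neighbors=1),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        start = time.perf_counter()
        predictions = model.predict(X_te)          # time only the classification step
        elapsed = time.perf_counter() - start
        error = 1.0 - accuracy_score(y_te, predictions)
        print(f"{name}: error rate {error:.1%}, classification time {elapsed:.2f}s")
```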

13 Learning from History

14 Learning from History

15 Experimental Results

16 Experimental Results

17 Future Directions

