1 Old Dominion University February 1st, 2005
Crawling the Web Old Dominion University February 1st, 2005 Good afternoon, everyone. Thank you for coming to my presentation. My name is Junghoo Cho, and the title of my presentation is "Crawling the Web: discovery and maintenance of large-scale web data." Under this title, I will talk about how we can design and implement an effective Web crawler. Junghoo "John" Cho UCLA

2 What is a Crawler?
[Slide diagram: initial urls feed a "to visit urls" queue; the crawler repeatedly runs "get next url", "get page", and "extract urls", recording visited urls and storing the downloaded pages from the web.]
But before talking about how to design an effective web crawler, let me briefly discuss what a web crawler is. A Web crawler is a program that downloads pages from the Web in order to provide the downloaded pages to other applications, such as Web search engines. Typically, a crawler operates in the following way. It starts with an initial set of URLs, and for each of these URLs it downloads the page, extracts all the URLs embedded in the downloaded pages, and uses these URLs as the next set of pages to crawl. This process is repeated until the crawler decides to stop for various reasons.
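As a concrete illustration of this loop, here is a minimal sketch in Python. It is my illustration, not the speaker's crawler: the breadth-first queue, the standard-library fetching, and the page limit are simplifying assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(initial_urls, max_pages=100):
    to_visit = list(initial_urls)      # "to visit urls" queue
    visited = set()                    # "visited urls"
    pages = {}                         # downloaded pages
    while to_visit and len(pages) < max_pages:
        url = to_visit.pop(0)          # get next url
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                   # skip pages that fail to download
        pages[url] = html              # get page
        parser = LinkExtractor()
        parser.feed(html)              # extract urls
        to_visit.extend(u for u in (urljoin(url, link) for link in parser.links)
                        if u.startswith("http"))
    return pages
```

A real crawler would of course add politeness, parallelism, and prioritization, which are exactly the issues discussed in the next slides.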

3 Applications
Internet search engines (Google, Yahoo), Web archiving, Data mining
Then you may wonder why this is an interesting topic to study. Well, there are many applications that require a crawler as part of their services. For example, most internet search engines have to download pages in advance and build keyword indexes to answer users' queries fast. Many comparison shopping services also need to download Web pages to extract price and availability information from various Web merchants. In addition, if we want to run a mining algorithm on Web data to extract interesting information, we need a local copy of the Web pages, because otherwise the mining algorithm may take forever to finish. Considering how many people use these services and rely on them every day, we can see how important a Web crawler is. Even a minor improvement in the Web crawler may enhance users' experience of the internet quite significantly.

4 Crawling Issues (1) Load at visited web sites
Space out requests to a site. Limit the number of requests to a site per day. Limit the depth of the crawl.
So what are the challenges in implementing an effective crawler? The first challenge is the load at visited web sites. To download pages, the crawler has to visit millions of Web sites, which are run by other organizations and which have to serve their own clients. If the crawler somehow interferes with their primary operations, the administrators of those Web sites often get upset and sometimes completely block access from our crawler. So the crawler has to be very careful in visiting the sites. For example, there is a protocol called "robots.txt" in which site administrators can specify what part of a site a crawler may download, and the crawler has to strictly abide by that protocol. We also need to give enough pause between requests to a single site, and we may want to limit the total number of downloads from a single site per day, because Web sites are sometimes charged by the bandwidth they use per day.
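To make these politeness rules concrete, here is a small sketch of the per-site bookkeeping a crawler might keep. The class name, the 10-second delay, and the per-day cap are placeholders for illustration, not a prescribed configuration.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

class PoliteFetcher:
    """Tracks per-site state so a crawler can respect robots.txt,
    pause between requests, and cap daily downloads per site."""
    def __init__(self, delay_seconds=10, max_per_day=3000):
        self.delay = delay_seconds
        self.max_per_day = max_per_day
        self.last_request = {}   # host -> time of last request
        self.count_today = {}    # host -> downloads so far today
        self.robots = {}         # host -> parsed robots.txt

    def allowed(self, url, agent="*"):
        host = urlparse(url).netloc
        if host not in self.robots:
            rp = robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                pass             # if robots.txt is unreachable, can_fetch() stays conservative
            self.robots[host] = rp
        return self.robots[host].can_fetch(agent, url)

    def wait_for_slot(self, url):
        """Returns False if the per-day cap is reached; otherwise sleeps until
        enough time has passed since the last request to this host."""
        host = urlparse(url).netloc
        if self.count_today.get(host, 0) >= self.max_per_day:
            return False
        elapsed = time.time() - self.last_request.get(host, 0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request[host] = time.time()
        self.count_today[host] = self.count_today.get(host, 0) + 1
        return True
```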

5 Crawling Issues (2)
Load at crawler: parallelize
[Slide diagram: two crawling processes run in parallel, each with its own "get next url", "get page", and "extract urls" steps, sharing the initial urls, the "to visit urls" queue, the visited urls, and the stored pages.]
And at the same time, the crawler also has to handle a very heavy load on its own machines. Note that the crawler often has to download hundreds of millions of pages in a short period of time. In order to handle this load, we may have to parallelize the crawling process across multiple machines. One simple way of parallelizing is to replicate only the download part: there is still a central queue of URLs holding all pages to download, but multiple processes on multiple machines get URLs from this central queue and download the pages in parallel. Clearly we may also want to split the central queue itself, but if we are not careful, multiple processes may download the same page multiple times, because some pages are cross-linked and one crawling process may not know that another process has already downloaded a page.
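A minimal sketch of the "replicate only the download part" scheme described above, using threads that share one central queue. This is a simplification: a real parallel crawler would run on separate machines, and the fetch function is assumed to be supplied by the caller.

```python
import queue
import threading

def parallel_download(urls, fetch, num_workers=4):
    """Downloads all URLs with several workers pulling from a single
    central queue, so no URL is handed out twice."""
    central_queue = queue.Queue()
    for url in urls:
        central_queue.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = central_queue.get_nowait()   # get next url
            except queue.Empty:
                return
            page = fetch(url)                      # get page (caller-supplied)
            with lock:
                results[url] = page

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every worker pulls from the same queue, the duplicate-download problem only appears once the queue itself is split across machines, which is the harder case the transcript alludes to.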

6 Crawling Issues (3) Scope of crawl Not enough space for “all” pages
Not enough time to visit "all" pages. Solution: visit "important" pages.
The third challenge is the scope of the crawl. Because the Web is so large, in many cases we do not have enough space to store all the pages available on the Web. Even if we do have enough space, we may not have enough time to visit all pages, because after we download a certain number of pages we also want to refresh them to keep them up to date. For this reason, most of the time the crawler can download only a small subset of the Web, not the entire Web. In this context, exactly which pages the crawler decides to crawl becomes important, because this decision significantly impacts the "quality" of the downloaded pages. So the crawler has to visit "important" pages first. For example, if the crawler wants to download "popular" pages linked by many other pages, it may want to first download the pages with the highest number of incoming links from the pages that have already been downloaded. And if the crawler wants to download pages related to Intel, it may download a page first if the link pointing to that page contains the word "Intel."
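One way to picture "important pages first" is a priority queue over the crawl frontier, scored by the number of known in-links and by whether the URL mentions a topic keyword. This is only an illustrative heuristic consistent with the examples above, not the exact ordering metric from the talk; the boost value is arbitrary.

```python
import heapq

def crawl_priority(url, in_link_count, keyword=None):
    """Higher score = crawl sooner; heapq is a min-heap, so we negate."""
    score = in_link_count
    if keyword and keyword.lower() in url.lower():
        score += 100          # arbitrary boost for topic-relevant URLs
    return -score

def order_frontier(frontier_urls, in_links, keyword="intel"):
    """frontier_urls: URLs not yet crawled.
    in_links: dict mapping url -> number of already-downloaded pages linking to it."""
    heap = [(crawl_priority(u, in_links.get(u, 0), keyword), u)
            for u in frontier_urls]
    heapq.heapify(heap)
    while heap:
        _, url = heapq.heappop(heap)
        yield url

# Pages with many in-links or "intel" in the URL come out first.
frontier = ["http://a.com/x", "http://intel.com/news", "http://b.org/y"]
print(list(order_frontier(frontier, {"http://a.com/x": 7, "http://b.org/y": 2})))
```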

7 Crawling Issues (4) Replication Pages mirrored at multiple locations
The fourth challenge is replication. For various reasons, there are tons of replicated pages available on the Web. For example, the Java manuals are replicated on more than 50 different Web sites, and the manual consists of several hundred web pages. Obviously, we do not want to download all the replicated pages multiple times when we can download only a small subset of the Web. The interesting challenge, then, is how the crawler can automatically identify these replicated or mirrored pages. To make matters worse, when people replicate pages, they do not necessarily copy them bit by bit. They often make small modifications, like deleting copyright notices or adding some personal links to the pages, so replicas are not necessarily exact copies. Then how can the crawler identify these similar pages, not just exact copies? What would be the right definition of the "similar pages" that the crawler should avoid downloading?

8 Crawling Issues (5) Incremental crawling
How do we avoid crawling from scratch? How do we keep pages "fresh"?
And the final issue is incremental crawling. Because Web pages change constantly, the crawler has to refresh or revisit downloaded pages periodically to keep them up to date. One easy solution to this problem is to periodically download a new set of pages from scratch and then replace the old copy with this fresh copy. But can we avoid starting from scratch every time? When different pages change differently, how can we exploit the difference to maximize the "freshness" of the downloaded pages?

9 Papers on Web Crawler Load on sites [PAWS00] Parallel crawler [WWW03]
Page selection [WWW7] Replicated page detection [SIGMOD00] Page freshness [SIGMOD00] Crawler architecture [VLDB00]
In my thesis work, I studied many of these challenges. For example, in my PAWS paper, I studied the issue of load on the visited sites, and in my recent work submitted to VLDB, I studied the parallelization problem. In addition, in my SIGMOD paper, I studied the copy identification problem. Out of the work that I have done at Stanford, what I want to focus on today is how we can keep pages up to date.

10 Outline of This Talk How can we maintain pages fresh?
How does the Web change? What do we mean by "fresh" pages? How should we refresh pages?
So in the remainder of this presentation I will discuss how a crawler can keep pages up to date. In order to answer this question, we need to understand the following issues. First, we have to understand how Web pages change over time, so in the beginning I will present experimental results that address this question. Second, we need to clarify our notion of "freshness." Although we have an intuitive notion of fresh pages or freshness, we have to define this notion precisely in order to study the problem scientifically. Based on these two understandings, we can then compare various refresh policies and try to identify which policy is the best one.

11 Web Evolution Experiment
How often does a Web page change? How long does a page stay on the Web? How long does it take for 50% of the Web to change? How do we model Web changes?
Now let me move on to the first topic, the Web Evolution Experiment. In this experiment, I tried to answer the following questions. How often does a Web page change? What is the average change interval of a page? How long does a page stay on the Web? What is the average lifespan of a page? How long does it take for 50% of the Web to change? And finally, what would be a good model to describe Web changes? Is there a good mathematical model that fits the observed changes well?

12 Experimental Setup February 17 to June 24, 1999
270 sites visited (with permission): identified 400 sites with highest "PageRank", contacted administrators.
720,000 pages collected: 3,000 pages from each site daily; start at root, visit breadth first (get new & old pages); ran only 9pm - 6am, 10 seconds between site requests.
In order to answer these questions, I did the following experiment. From February 17 until June 24 in 1999, for about 4 months, I monitored the change history of about 720,000 Web pages on a daily basis. The way I selected the pages for this experiment was as follows. Based on the pages in our WebBase repository, I first identified the 400 most popular sites; by most popular I mean the Web sites linked by the highest number of pages in our repository. Then I contacted the administrators of the 400 sites to get their permission for my experiment. Out of these 400 sites, about 270 said it was okay to visit their sites on a daily basis. For each of these 270 sites, I downloaded about 3,000 pages every day. By the way, my advisor Hector always jokes that his biggest contribution to my thesis is to this experiment, because when I tried to send the initial contact message, he rephrased it from "is it okay?" to something like "if you don't respond I will assume yes," and thanks to his biggest contribution, I ended up visiting 270 sites. The way I downloaded pages was that every day I started from the root pages of the sites and followed links in breadth-first order. Note that I did not download a preselected set of URLs: because I followed the links every day, as old pages disappeared and new pages were created I could also detect these pages. Finally, to minimize the load on any particular site, I ran the crawler only at night, from 9PM to 6AM, and I gave at least 10-second pauses between requests to a single site.

13 Average Change Interval
[Chart: fraction of pages vs. average change interval]
Based on the data obtained this way, I tried to answer the questions that I posed before, and this is one of the results. This graph shows the average change interval of the pages: the horizontal axis shows the average change interval and the vertical axis shows the fraction of pages with the given interval. There are certain important issues in estimating page change intervals, but let me first summarize the results here; I will talk about those issues later on. From this graph, we can see that Web pages have very different characteristics. For example, from the first bar of the graph, about 23% of the pages had an average change interval shorter than one day, and from the last bar, more than 30% of the pages had an average change interval longer than 4 months. To clarify, in obtaining this graph I considered any change to a page as a change. When I clarify this, one common question I get is how dynamic pages would have affected the result. The answer is yes, dynamic pages would have affected the results, but not necessarily as much as we might expect. For example, advertising banners do not necessarily change the content of the page even when the banner changes, because they usually point to a central ad server and the banner is changed at that server. Also, access counters are typically a JavaScript or CGI script that points to another program generating the counter, so the page itself does not necessarily change.

14 Change Interval – By Domain
[Chart: fraction of pages vs. average change interval, broken down by domain]
I also tried to break those statistics down by domain, and this graph shows the change intervals by domain. For example, from the first blue bar of the graph, about 40% of the pages in the com domain had an average change interval shorter than one day, while more than 50% of the pages in the education or government domains had an average change interval longer than 4 months. This result is kind of embarrassing because it seems like people in the education or government domains do not work at all, as we usually expect, but keep in mind that this result is from before the stock market downturn; after the downturn, I don't know what it would look like. One thing I want to note in this graph is that Web pages indeed have very different characteristics. Because pages change at very different rates, if we exploit this, we may significantly improve the freshness of the downloaded pages.

15 Modeling Web Evolution
Poisson process with rate λ. T is the time to the next event: fT(t) = λ e^(-λt)  (t > 0)
That was one result, the average change interval, and another result I want to show is the Web page change model. From the change history data, I wanted to identify whether there is any mathematical model that describes the Web page changes well, so we compared various statistical models against our data and learned that a Poisson process with rate λ matches our experimental results well. Let me briefly remind you what a Poisson model is. Under the Poisson model, each page has its own change frequency, represented by λ. For example, the pages on the CNN web site may have an average change frequency of once every day, while my personal home page may have an average change frequency of once every year; different pages have different change frequencies. Also, under the Poisson model, if we measure the time between changes of a page and plot the distribution of the intervals, the graph should follow an exponential distribution.
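To see what this assumption means in practice, the sketch below simulates a page with change rate λ and checks that the inter-change times behave like the exponential density above. This is a simulation only, not the experimental data; the rate and sample size are made up.

```python
import random

def simulate_change_intervals(lam, n_changes=100_000, seed=0):
    """Under a Poisson process with rate lam, the time T to the next change
    has density f_T(t) = lam * exp(-lam * t); random.expovariate samples it."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) for _ in range(n_changes)]

lam = 0.1                      # one change every 10 days on average
intervals = simulate_change_intervals(lam)
mean_interval = sum(intervals) / len(intervals)
print(f"mean interval: {mean_interval:.2f} days (expected {1 / lam:.2f})")

# Fraction of changes with interval in [10, 20) days; for an exponential
# distribution this should be close to exp(-lam*10) - exp(-lam*20).
frac = sum(10 <= t < 20 for t in intervals) / len(intervals)
print(f"fraction in [10, 20): {frac:.3f}")
```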

16 Change Interval of Pages
For pages that change every 10 days on average.
[Chart: fraction of changes with given interval vs. interval in days, with the Poisson model overlaid]
And that is the exact distribution we obtained when we plotted the graph using the experimental data. For example, this graph shows the distribution for pages whose average change interval is 10 days. The horizontal axis is the change interval in days and the vertical axis is the fraction of changes that occurred at the given interval. From the graph we can clearly see that the Poisson model describes the experimental data very well, although there are small variations. I also plotted similar graphs for pages whose average change interval is 20 days and so on, and the result was the same. So in the remainder of this presentation, I will assume that the Poisson model is a good model for describing the changes of Web pages.

17 Change Metrics: Freshness
Freshness of element ei at time t:
F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise.
Freshness of the database S at time t:
F(S; t) = (1/N) Σ_{i=1..N} F(ei; t)   (assume "equal importance" of pages)
These were some of the results from my web evolution experiment, and now let me move on to the second topic, the definition of freshness. Intuitively we have a certain notion of fresh pages or freshness, so let us go over a simple example to see exactly what we mean by that. In this example, we consider two crawlers that maintain 100 pages each, exactly the same pages, but the first crawler is relatively effective, so out of the 100 pages it maintains about 90 pages up to date on average. The second crawler is relatively ineffective, and out of 100 pages it maintains 10 pages up to date. In this scenario, we clearly think the pages maintained by the first crawler are "fresher" than those of the second one, and if we think about it, what we mean by the freshness of the pages is the fraction of the pages that are up to date. That is the intuition we try to capture with the freshness metric. Under this metric, if a page is up to date, and by up to date we mean that the page in our local collection is exactly the same as the page on the real Web, we define its freshness to be 1, and if the page is out of date we define its freshness to be 0. The freshness of our entire collection of pages, or our entire database, is defined as the average freshness of all the pages in the collection. So under this definition, if we maintain 90 pages up to date out of 100, the freshness is 0.9. Note that under this definition every page is considered equal; later on we will consider what happens if a certain page is more important than others.

18 Change Metrics: Age
Age of element ei at time t:
A(ei; t) = 0 if ei is up-to-date at time t, t - (modification time of ei) otherwise.
Age of the database S at time t:
A(S; t) = (1/N) Σ_{i=1..N} A(ei; t)   (assume "equal importance" of pages)
That is the freshness metric, but freshness does not necessarily capture our whole notion of "freshness." To understand what I mean by this, let us go back to our previous example. We have two crawlers maintaining 100 pages each, but in this case both crawlers are very ineffective and all 100 pages are obsolete in both cases. Even so, if the first crawler updated its collection one day ago and the second crawler updated its collection one year ago, we clearly think that the first collection is more "current" than the second one. So we also have a notion of the "currency" or "age" of the pages, and that is the notion we try to capture with the age metric. Under the age metric, the age of a page is zero if the page is up to date, and if the page is obsolete we use the time since its last modification as its age. The age of the entire collection is the average of the age values of all the pages in the collection. So if all pages changed one day ago and we have not synchronized our database since then, the age of our database is one day.
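A small sketch of the two metrics just defined, computed for a snapshot of a local collection. The element representation and the time values are made up for illustration; time is in arbitrary units (say, days).

```python
from dataclasses import dataclass

@dataclass
class Element:
    local_copy_time: float     # when we last refreshed this page
    last_change_time: float    # when the page last changed on the web

def freshness(e, t):
    """F(e; t) = 1 if our copy is up to date at time t, else 0."""
    return 1.0 if e.local_copy_time >= e.last_change_time else 0.0

def age(e, t):
    """A(e; t) = 0 if up to date, else time since the page's last modification."""
    return 0.0 if e.local_copy_time >= e.last_change_time else t - e.last_change_time

def database_freshness(elements, t):
    return sum(freshness(e, t) for e in elements) / len(elements)

def database_age(elements, t):
    return sum(age(e, t) for e in elements) / len(elements)

# 90 of 100 pages up to date -> F(S; t) = 0.9, as in the example above.
db = [Element(local_copy_time=5.0, last_change_time=3.0)] * 90 + \
     [Element(local_copy_time=5.0, last_change_time=6.0)] * 10
print(database_freshness(db, t=7.0), database_age(db, t=7.0))
```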

19 Change Metrics: Time Averages
[Chart: evolution of F(ei) and A(ei) over time, with marks where the page is updated on the web and where we refresh it]
In this graph, I show how the freshness and age of a page evolve over time. At the beginning of a day, when the page is up to date, its freshness is 1 and its age is zero. When the page changes in the middle of the day, its freshness drops to zero, and its age increases linearly from that point on. Later, when we refresh the page, its freshness recovers to 1 and its age goes back down to zero, and this process repeats forever. Because the freshness and age values change over time, we take the time average of these values and use it as the representative freshness or age of the page. For example, in this graph the yellow lines would be the time averages of the freshness and age values, and mathematically we can express this averaging with a formula like the one below: we take the integral of the freshness value over time and divide the integral by the length of the interval to get the average.
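The slide's own notation is not preserved on this page, but a standard formalization consistent with the transcript's description is:

```latex
\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t}\int_{0}^{t} F(e_i;\tau)\,d\tau,
\qquad
\bar{A}(e_i) = \lim_{t \to \infty} \frac{1}{t}\int_{0}^{t} A(e_i;\tau)\,d\tau,
\qquad
\bar{F}(S) = \frac{1}{N}\sum_{i=1}^{N} \bar{F}(e_i).
```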

20 Refresh Order
Fixed order: explicit list of URLs to visit
Random order: start from seed URLs & follow links
Purely random: refresh pages on demand, as requested by user

21 Freshness vs. Revisit Frequency
r = λ / f = average change frequency / average visit frequency
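The curve itself is not recoverable from this page, but under the Poisson model the time-averaged freshness of a page that changes at rate λ and is revisited periodically at frequency f has a standard closed form; stated here as a reconstruction, not copied from the slide:

```latex
\bar{F}(\lambda, f) \;=\; \frac{1 - e^{-\lambda/f}}{\lambda/f} \;=\; \frac{1 - e^{-r}}{r},
\qquad r = \lambda / f .
```

Freshness falls from 1 toward 0 as r grows, that is, as the page changes much faster than we revisit it.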

22 Age vs. Revisit Frequency
Vertical axis = Age / (time to refresh all N elements); r = λ / f = average change frequency / average visit frequency
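A corresponding expression for age can be derived under the same periodic-revisit, Poisson-change assumptions; again this is a reconstruction rather than the slide's formula, with I = 1/f the revisit interval:

```latex
\bar{A}(\lambda, f) \;=\; \frac{1}{f}\left(\frac{1}{2} \;-\; \frac{1}{r} \;+\; \frac{1 - e^{-r}}{r^{2}}\right),
\qquad r = \lambda / f .
```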

23 Trick Question
Two-page database: e1 changes daily, e2 changes once a week. Can visit one page per week. How should we visit pages?
e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
e1 e1 e1 e1 e1 e1 ...
e2 e2 e2 e2 e2 e2 ...
?
In order to identify the optimal refresh policy for the freshness metric, we can analyze the Poisson model mathematically, but before we do that, let me go over a very simple example to get an intuition for what is happening and what result we may expect. In this example, I assume a very tiny crawler that maintains only two pages. Page e1 changes once every day and page e2 changes once every week. And because it is such a tiny crawler, it can visit only one page every week. In this scenario, how should we refresh the pages? One obvious strategy is what we call the uniform policy, in which we revisit pages at the same frequency regardless of their change frequencies: this week we revisit e1, next week we revisit e2, and so on. Another policy that we can easily come up with is what we call the proportional policy, under which we visit a page more often in proportion to its change frequency: in this example, we visit e1 seven times, then e2, then e1 seven times, then e2, and so on, because e1 changes seven times more often. And there are many other alternatives, like visiting only e1 because it changes more often, or visiting only e2 because it changes less often, and so on. Out of these policies, which do you think will perform best? Which policy will give us the highest value under the freshness metric?

24 Proportional Often Not Good!
Visit fast-changing e1 → get 1/2 day of freshness
Visit slow-changing e2 → get 1/2 week of freshness
Visiting e2 is a better deal!
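To back up this intuition, here is a small simulation of the two-page example under the Poisson model, comparing the uniform policy, the proportional policy, and a policy that only ever visits e2. The time step, horizon, and seed are arbitrary choices of mine; with them, the ordering that comes out matches the slide's point (proportional does worst, visiting only e2 beats uniform).

```python
import random

def simulate(policy, days=7_000, dt=0.01, seed=1):
    """Two-page database: e1 changes ~once/day, e2 ~once/week (Poisson).
    The crawler refreshes one page per week, chosen by policy(week).
    Returns the time-averaged freshness of the database."""
    rng = random.Random(seed)
    lam = {"e1": 1.0, "e2": 1.0 / 7.0}          # changes per day
    fresh = {"e1": True, "e2": True}
    steps_per_week = round(7.0 / dt)
    total = 0.0
    for step in range(int(days / dt)):
        if step % steps_per_week == 0:
            fresh[policy(step // steps_per_week)] = True   # weekly refresh
        for page, rate in lam.items():
            if rng.random() < rate * dt:
                fresh[page] = False                         # page changed on the web
        total += (fresh["e1"] + fresh["e2"]) / 2 * dt
    return total / days

uniform = lambda week: "e1" if week % 2 == 0 else "e2"
proportional = lambda week: "e2" if week % 8 == 7 else "e1"   # e1 seven weeks, then e2
only_e2 = lambda week: "e2"

for name, pol in [("uniform", uniform), ("proportional", proportional), ("only e2", only_e2)]:
    print(name, round(simulate(pol), 3))
```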

25 Optimal Refresh Frequency
Problem: Given λi and f, find fi that maximize F(S).

26 Solution
Compute F(ei); Lagrange multiplier method; all (λi, fi) pairs satisfy the same equation.
So how do we solve this problem? First we compute F(ei) based on the periodicity of the refresh process; from this we can compute F(S) and use the Lagrange multiplier method. Fortunately, because F(S) is a linear combination of the F(ei), we can prove that all (λi, fi) pairs satisfy the same equation.
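For intuition, the allocation can also be found numerically. The sketch below maximizes average freshness subject to a fixed total refresh budget; it is a numerical illustration under the closed-form freshness assumption stated earlier, not the paper's closed-form Lagrange solution, and the change rates and budget are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def expected_freshness(f, lam):
    """Time-averaged freshness of a Poisson page with change rate lam when it
    is refreshed periodically at frequency f (assumed closed form)."""
    f = np.maximum(f, 1e-12)              # guard against division by zero
    r = lam / f
    return (1.0 - np.exp(-r)) / r

def optimal_frequencies(lam, budget):
    """Numerically split a total refresh budget across pages to maximize
    average freshness (a sketch only)."""
    n = len(lam)
    x0 = np.full(n, budget / n)           # start from the uniform allocation
    constraint = {"type": "eq", "fun": lambda f: f.sum() - budget}
    bounds = [(0.0, budget)] * n
    result = minimize(lambda f: -expected_freshness(f, lam).mean(),
                      x0, method="SLSQP", bounds=bounds,
                      constraints=[constraint])
    return result.x

# Hypothetical change rates (changes per day) and a budget of 5 refreshes per day.
lam = np.array([9.0, 6.0, 3.0, 1.0, 0.5])
print(np.round(optimal_frequencies(lam, budget=5.0), 2))
# Pages that change much faster than we can refresh tend to get little budget.
```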

27 Optimal Refresh Frequency
Shape of curve is the same in all cases Holds for any change frequency distribution

28 Optimal Refresh for Age
Shape of curve is the same in all cases Holds for any change frequency distribution

29 Comparing Policies Based on Statistics from experiment
So now we know how to optimally refresh pages when pages change at different rates. But how much benefit would we get if we adopted the optimal policy? To answer this question, I compared various policies based on the statistics collected from our experiment, and I show the expected freshness and age values in this table. From the table, one important conclusion we can draw is that if we are not careful, we can do a really lousy job: the proportional policy results in extremely low freshness and really high age values, so we definitely need to avoid it. We can also see that the uniform policy does a pretty decent job. However, we can still improve freshness and age quite significantly by adopting the optimal policy; for example, the age value decreases by about 25% compared to the uniform policy.
(Based on statistics from the experiment and a revisit frequency of once every month.)

30 Topics to Follow Weighted Freshness Non-Poisson Model
Change Frequency Estimation
Thank you for your attention so far; the main topics I wanted to discuss are essentially over. From now on I will discuss some enhancements to the previous topics and some related issues. The first topic is weighted freshness, which handles the case where pages have different importance, and the second topic is frequency estimation by the crawler.

31 Not Every Page is Equal!
Some pages are "more important": e1 accessed by users 10 times/day, e2 accessed by users 20 times/day.
So far, one thing we have ignored is that not every page is equal. In many cases, certain pages are more important than others. For example, assume there are two pages e1 and e2, and e2 is more popular, so users access e2 twice as often as e1. In this case, we may consider e2 twice as important as e1, because by keeping e2 up to date, we let users see fresh pages twice as often. So in this example, we may define the freshness of the database with a formula in which we weight the freshness of e2 by 2 and the freshness of e1 by 1, making e2 twice as important. In general, we may give different weights wi to different pages and define the freshness of the database accordingly. Then, under this more general freshness definition, how would the result be different?
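The weighted definition alluded to above can plausibly be written as follows; this is a reconstruction consistent with the transcript, not the slide's exact notation:

```latex
\bar{F}(S) \;=\; \frac{\sum_{i=1}^{N} w_i\, \bar{F}(e_i)}{\sum_{i=1}^{N} w_i}
\qquad \text{(in the two-page example, } w_{e_1} = 1,\; w_{e_2} = 2\text{)} .
```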

32 Weighted Freshness
[Chart: optimal revisit frequency f vs. change frequency λ, with one curve for weight w = 1 and one for w = 2]
Again, we can analyze the Poisson process model under the new definition of freshness, and this is the result we get. In this graph, the horizontal axis is the change frequency of the pages and the vertical axis is the optimal revisit frequency for weighted freshness. To obtain this graph, we assumed there are two types of pages, pages with weight 1 and pages with weight 2. The pages with weight 1 follow the inner curve and the pages with weight 2 follow the outer curve. It may not be obvious from this graph, but the outer curve has the exact same shape as the inner curve, scaled by a factor of 2. In general, if a page has weight k, it follows a curve scaled by a factor of k compared to a page with weight 1. Looking at this graph, we can see that we should revisit a page more often when it has a higher weight, but not proportionally more often. For example, for pages that change about twice a day, we should visit the weight-2 pages about twice as often as the weight-1 pages, but for pages that change three times a day, we should not visit the weight-1 page at all, while we should visit the weight-2 page very often.

33 Non-Poisson Model
[Chart: fraction of changes with given interval vs. interval in days, comparing a heavy-tail distribution with the Poisson model]

34 Optimal Revisit Frequency for Heavy-Tail Distribution

35 Principle of Diminishing Return
T: time to next change; its distribution is continuous and differentiable; every page changes; definition of change rate λ.

36 Change Frequency Estimation
How to estimate change frequency?
Naïve estimator: X/T, where X is the number of detected changes and T is the monitoring period. 2 changes in 10 days: 0.2 times/day.
Problem: incomplete change history.
[Timeline diagram: the page is visited once every day; some changes between visits go undetected]
So great! Now we know how we should refresh pages when we know how often a page changes. But the remaining question is how the crawler can estimate how often a web page changes. In order to implement the refresh policies I described, the crawler has to estimate the change frequency of pages. Given that the crawler repeatedly visits a page and knows its change history, the estimation may look relatively straightforward. For example, if the crawler detected 2 changes in 10 days while visiting the page every day, we may conclude that the change frequency of the page is 0.2 times/day. However, it is not as straightforward as that, because the crawler has an incomplete change history of the page. The straightforward estimator, which has been used for a long time and is proven to be very effective in traditional statistical theory, does not work very well in our context, because the crawler has only a limited or incomplete change history. To illustrate this issue, consider this example: a page changed 4 times in 10 days, but because the crawler visits the page only once every day, it detects only three changes and thus estimates that its change frequency is 0.3 times/day, not 0.4 times/day. Then how can the crawler account for these missed changes?

37 Improved Estimator Based on the Poisson model
X: number of detected changes; N: number of accesses; f: access frequency.
3 changes in 10 days: ~0.36 times/day → accounts for "missed" changes.
Clearly it is impossible to figure out exactly how many changes we missed, but we can still get some help from the Poisson model that we discovered. Because we know that page changes follow the Poisson model, based on the detected change history we can roughly guess how many changes we may have missed. Based on this intuition, we mathematically analyzed the Poisson model and studied various frequency estimators in various scenarios, and this is one estimator we obtained through this analysis. For example, for the previous case, when we detected 3 changes in 10 days, our new estimator predicts that the page changes about 0.36 times/day, not 0.3 times/day. Our estimate is slightly higher than the naïve one because the new estimator accounts for the missed changes.
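The slide's exact formula is not preserved on this page, so take the following as an assumption: an estimator of the form below, derived from the Poisson model, reproduces the roughly 0.36 changes/day figure quoted above for 3 detected changes in 10 daily accesses, and is shown alongside the naïve estimator for comparison.

```python
import math

def naive_estimator(X, N, f):
    """X detected changes over N accesses made at frequency f (per day)."""
    return X / N * f                      # e.g. 3/10 * 1 = 0.3 changes/day

def improved_estimator(X, N, f):
    """Corrects for changes missed between accesses, assuming Poisson changes.
    The probability that an access interval contains at least one change is
    1 - e^{-lambda/f}; the fraction of intervals with a detected change, X/N,
    estimates that probability, and solving for lambda gives the formula below."""
    if X >= N:
        raise ValueError("every interval changed; the rate cannot be estimated")
    return -f * math.log(1.0 - X / N)

f = 1.0   # one access per day
print(naive_estimator(3, 10, f))                 # 0.3 changes/day
print(round(improved_estimator(3, 10, f), 2))    # ~0.36 changes/day
```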

38 Improved Estimator Bias Efficiency Consistency
So how do we know this new estimator is indeed better than the previous one? One thing we did was a theoretical analysis of the estimator: we analyzed its bias, efficiency, and consistency, and showed that it is better on all of these measures.

39 Improvement Significant?
Application to a Web crawler: visit pages once every week for 5 weeks, estimate the change frequency, adjust the revisit frequency based on the estimate.
Uniform: do not adjust. Naïve: based on the naïve estimator. Ours: based on our improved estimator.
But in addition, we also studied how much impact our new estimator may have on an actual application. To see this, we ran a simulated crawler on the 4-month change history and measured how much improvement we might get. The experiment was as follows. Using the data, we ran a simulated crawler that visited pages once every week for the first 5 weeks. Then, based on the changes detected during this period, the crawler estimated the change frequencies of pages and adjusted the page revisit frequencies accordingly. In doing this, we compared three choices. One is the uniform policy, in which the crawler revisits every page at the same frequency. Another is the naïve policy, in which the crawler uses the naïve estimator to predict the change frequency. And the last one is our policy, in which it uses our proposed estimator. One thing I have to emphasize is that in adjusting the revisit frequency, we made sure that the average revisit frequency over all pages is the same for all three policies, in order to make a fair comparison.

40 Improvement from Our Estimator
Policy    Detected changes   Ratio to uniform
Uniform   2,147,589          100%
Naïve     4,145,582          193%
Ours      4,892,116          228%
(9,200,000 visits in total)
And this is the result we obtained. In this table I show how many changes the crawler detected from the same total number of visits, so the best policy is the one with the highest number. From the table, we can clearly see that our policy makes a significant difference: compared to the uniform policy we detected more than twice as many changes, and even compared to the naïve policy, we detected about 18% more changes. So simply by using our new estimator, the crawler can be made much more effective than before.

41 Other Estimators Irregular access interval Last-modified date
Categorization
Although I did not go over other estimators in this presentation, in my thesis I also studied other scenarios and tried to design good estimators for each of them. For example, in certain cases the crawler may have the last-modified date of a page and may want to use this information to better estimate the change frequency. So in my thesis I propose an estimator that can exploit this last-modified date and predict the change frequency much better than is possible otherwise.

42 Summary Web evolution experiment Change metric Refresh policy
Frequency estimator
In summary, in this presentation we discussed how we can keep pages up to date. To address this issue, I went over a Web evolution experiment, change metrics, refresh policies, and frequency estimators.

43 The End Thank you for your attention For more information visit

