“Keeping Up with the Changing Web.” Brian E. Brewington and George Cybenko, Dartmouth College. Presented by: Shruthi R Bompelli.
The Web is a huge collection of decentralized Web pages modified at random times. Information depreciates over time and is a commodity. The value of information is subjective and domain-specific. The domain decides:
- its initial value
- how long the information remains useful
- the rate at which the value depreciates
Questions arise...
- When do our previous observations become stale and need refreshing?
- How can we schedule these refresh operations to satisfy a required level of currency within bandwidth and computing limitations?
- How much data can be observed in a given time?
- When should we check for new oncoming traffic?
- How can we determine when significant changes have occurred? The emphasis is on the magnitude or amount of change.
Kinds of Changes.
- Content / Semantic Changes: modifications of the contents or text of a page. (Example: tournaments.)
- Presentation Changes: changes to the representation or appearance of the page that do not reflect changes in the content. (Example: modifications to colors, fonts, backgrounds.)
- Structural Changes: modifications of URLs or of the anchor text for links, that is, of the underlying connection of the document to other documents. (Example: link destinations on a “Weekly Hot Links” page, where the page is the same except for the links.)
- Behavioral Changes: modifications to the active components of a document. (Example: changes in scripts, plug-ins, applets.)
“Change is a Change, though Minor.” Informant, now known as TracerLock.
- Takes a user-specified group of URLs.
- Runs searches of user queries every 3 days or periodically.
- Works against the search engines (Google, Altavista, Excite, Lycos, Infoseek).
- Notifies the user (by email) of any new matches that have appeared.
- Used by Google, alias Deja.com, to filter results and send them back to you.
- Queries are run at night, to decrease load.
Informant... Services offered:
- News monitoring: be notified within 15 minutes whenever a story matching your text query is published on an online news site.
- Finance: track stock prices and receive updates in response to rapid price changes or news stories mentioning the company.
- Personal ads: receive emails whenever a new ad matching your criteria appears on the online personals sites.
- URL changes: receive email updates whenever the contents of a particular URL change.
Informant merged with TracerLock on Nov 5th. www.tracerlock.com
Search Engines: keep track of the ever-changing Web by finding, indexing, and reindexing pages. Involved processing about 100,000 Web pages per day and about 3 million overall. Each observation includes:
- “Last-Modified” time stamp (if given)
- Time of observation (using the remote server’s timestamp)
- Document summary information:
  - Number of bytes (content length)
  - Number of images, tables, links, banners, ads
  - Text, links and image references
Last-Modified time stamps show that 65% of the documents are modified during US working hours (5 am to 5 pm). Poisson process: the probability of an event (a change to a page) in any short time interval is independent of the time since the last event.
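The Poisson model above can be illustrated with a short simulation. This is a minimal sketch, not the authors' code: the function names and the example rate are assumptions chosen for illustration. Gaps between changes are drawn from an exponential distribution, which is what makes the chance of a change independent of the time since the last one.

```python
import math
import random


def simulate_change_times(rate, horizon, rng):
    """Return change times in [0, horizon) for a Poisson process.

    rate is the expected number of changes per unit time (hypothetical value
    supplied by the caller); gaps between changes are exponential.
    """
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def prob_no_change(rate, interval):
    """Probability that a Poisson-changing page does not change in an interval."""
    return math.exp(-rate * interval)


if __name__ == "__main__":
    rng = random.Random(0)
    changes = simulate_change_times(rate=0.1, horizon=100.0, rng=rng)
    print(len(changes), "changes in 100 time units")
    print("P(no change over 7 time units) =", prob_no_change(0.1, 7.0))
```

Because the process is memoryless, `prob_no_change` depends only on the length of the interval, not on where the interval starts.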
Lifetime: independent, identically distributed time periods between modifications. Observe the time between successive modifications.
Age: time since the present lifetime began. Observe the time since the most recent modification.
[Figure: timeline of modifications along a time axis from 0 to 4, with lifetimes of 1.53, 1.14, 0.62 and 0.84 marked, and the age measured back from the observation time.]
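The two quantities can be computed directly from a list of modification times. The sketch below is illustrative only; the function names and the sample timestamps are hypothetical and not taken from the slides.

```python
def lifetimes(mod_times):
    """Return the time between each pair of successive modifications."""
    ordered = sorted(mod_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def age(mod_times, now):
    """Return the time elapsed since the most recent modification at or before now."""
    past = [t for t in mod_times if t <= now]
    if not past:
        raise ValueError("no modification observed at or before 'now'")
    return now - max(past)


if __name__ == "__main__":
    mods = [0.0, 1.5, 2.0]        # hypothetical modification times
    print(lifetimes(mods))        # completed lifetimes
    print(age(mods, now=2.75))    # age of the current, still-open lifetime
```

A lifetime is only known once the next modification has happened, whereas an age can be read off at any observation time.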
Measurement of Lifetime. (a) PDF: Probability Density Function. (b) CDF: Cumulative Distribution Function. 1 in 5 pages is younger than 12 days. 1 in 4 pages is younger than 20 days.
Pages that change:
- Quickly: there is no way to know whether the observed change is the only change since the last observation.
- Slowly: we are less likely to observe the changes if we monitor the page for a short time.
[Figure: two timelines, where x = modification and o = observation. 1. The second observation (o) will miss two changes (x). 2. The observation window is not big enough to see any changes (x). Labels: observation timespan, actual lifetime, observed lifetime.]
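Both sampling problems can be made concrete by counting how many changes fall between consecutive observations. This is a hedged sketch: the function name and the sample timelines are hypothetical, and it assumes the observer can only tell whether a page changed since the previous visit, not how many times.

```python
from bisect import bisect_right


def observed_vs_actual(change_times, observation_times):
    """Return (detected, actual) change counts over the observed span.

    detected: number of observation gaps in which at least one change occurred
              (the most an observer comparing snapshots can notice).
    actual:   number of changes that really happened inside those gaps.
    """
    changes = sorted(change_times)
    obs = sorted(observation_times)
    detected = 0
    actual = 0
    for start, end in zip(obs, obs[1:]):
        # Changes with start < t <= end
        n = bisect_right(changes, end) - bisect_right(changes, start)
        actual += n
        if n > 0:
            detected += 1
    return detected, actual


if __name__ == "__main__":
    # Fast-changing page: several changes hide behind one detected change.
    print(observed_vs_actual([1, 2, 5, 6, 7], [0, 3, 4, 8]))
    # Slow-changing page: the observation window ends before any change.
    print(observed_vs_actual([10], [0, 1, 2]))
```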
Assumptions for estimation of change:
- the pages change according to independent Poisson processes
- the distribution of lifetimes is in a distinguishable form
- the time for which a page is observed is independent of that page’s change rate
Looking at the graph:
- Mean lifetime: 117 days
- Fastest-changing quartile: 62 days
- Slowest-changing quartile: 190 days
“Current” means “up-to-date.” A web page entry in a search engine (SE) is β-current if the page has not changed between the last observation and β time units ago. β is the grace period.
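The definition can be written as a small predicate. This is only a sketch of the definition as stated on the slide; the function name, argument names and sample values are hypothetical.

```python
def is_beta_current(last_observed, change_times, now, beta):
    """Return True if the entry is beta-current at time 'now'.

    The entry is beta-current when no change happened after the last
    observation and up to (now - beta). Changes inside the most recent
    beta time units (the grace period) are forgiven.
    """
    cutoff = now - beta
    return not any(last_observed < t <= cutoff for t in change_times)


if __name__ == "__main__":
    # Observed at t=0, queried at t=10, grace period of 7 time units.
    print(is_beta_current(0, [5], now=10, beta=7))  # change falls inside the grace period
    print(is_beta_current(0, [2], now=10, beta=7))  # change is older than the grace period
```

If the last observation is itself more recent than β time units ago, the interval is empty and the entry is β-current regardless of any changes.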
(α, β)-current: a search engine is (α, β)-current if the probability of a randomly chosen web page having a β-current entry is at least α. Any source has a spectrum of possibilities; here are some possible values (guesses):
- Newspaper: (0.9, 1 day)
- Television news: (0.95, 1 hour)
- Broker watching stocks: (0.95, 30 min)
- Air traffic controller: (0.95, 20 sec)
- Web search engine: (0.6, 1 day)
- An old web page’s links: (0.4, 70 days)
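(α, β)-current: for a page that changes as a Poisson process with rate λ and is revisited every T time units,

$$\alpha = \frac{\beta}{T} + \frac{1 - e^{-\lambda (T - \beta)}}{\lambda T}$$

- β: grace period
- α: probability that the entry is β-current
- T: reindexing period (the search engine visits each document every T time units)
- λT: relative reindexing time
- v = β/T: grace period fraction
- λ: rate of Poisson changes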
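As a quick numerical check, the relation α = β/T + (1 − e^(−λ(T−β)))/(λT) can be evaluated directly. The sketch below is illustrative: the function name is hypothetical, and using the reciprocal of the 117-day mean lifetime as a single rate λ is an assumption made only for the example.

```python
import math


def currency_probability(rate, period, grace):
    """Probability that an entry is beta-current for a single Poisson rate.

    rate   -- lambda, changes per unit time
    period -- T, time between visits to the page
    grace  -- beta, the grace period (same time unit as period)
    """
    if period <= 0 or rate < 0 or grace < 0:
        raise ValueError("period must be positive; rate and grace non-negative")
    if grace >= period or rate == 0:
        return 1.0  # every moment is inside the grace period, or the page never changes
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for small x
    return grace / period - math.expm1(-rate * (period - grace)) / (rate * period)


if __name__ == "__main__":
    # Assumed single rate: one change per 117 days on average.
    print(currency_probability(rate=1 / 117, period=18, grace=7))
```

The first term is the fraction of each reindexing cycle that lies inside the grace period; the second is the average probability of no change over the remainder of the cycle.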
Reindexing strategy:
- As the relative reindexing time λT grows, the probability α approaches the grace period fraction v = β/T.
- For large λT, an observation becomes worthless almost immediately, because the pages are changing far more quickly than the reindexing time. Only the fraction β/T of all reindexes falling within the grace period will be β-current.
- When λT is small, observations are made more often than pages change, so α approaches 1 as λT approaches 0.
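Because α falls steadily as the reindexing period T grows, the longest period that still meets a target α can be found by bisection. This is a sketch under a strong simplifying assumption: every page shares one change rate λ. The function names and the example rate are hypothetical, so the result need not match figures derived from a measured collection with many different rates.

```python
import math


def currency_probability(rate, period, grace):
    """alpha = beta/T + (1 - exp(-lambda (T - beta))) / (lambda T)."""
    if grace >= period:
        return 1.0
    return grace / period - math.expm1(-rate * (period - grace)) / (rate * period)


def max_reindex_period(rate, grace, target, iterations=100):
    """Largest period T with currency_probability(rate, T, grace) >= target.

    Assumes rate > 0 and 0 < target < 1, so a finite answer exists.
    """
    if rate <= 0 or not 0 < target < 1:
        raise ValueError("need rate > 0 and 0 < target < 1")
    lo = grace                      # alpha is 1 at T = beta
    hi = max(2 * grace, 1.0)
    while currency_probability(rate, hi, grace) >= target:
        hi *= 2                     # expand until the target is missed
    for _ in range(iterations):     # bisect between a passing and a failing period
        mid = (lo + hi) / 2
        if currency_probability(rate, mid, grace) >= target:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Assumed single rate of one change per 117 days, 7-day grace period.
    print(max_reindex_period(rate=1 / 117, grace=7, target=0.95))
```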
Bandwidth needed. For (0.95, 1-week) currency of this collection:
- Must re-index with a period of around 18 days.
- A (0.95, 1-day) index of the whole web (~800 million pages) processes about 104 megabits/sec.
- A more “modest” (0.95, 1-week) index of 150 million pages will process 9.4 megabits/sec.
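The bandwidth figures follow from simple arithmetic: every page must be fetched once per reindexing period. The sketch below shows the calculation; the function name is hypothetical, and the average page size of 12,000 bytes is an assumption for illustration, since the slides do not state the size used.

```python
SECONDS_PER_DAY = 86_400


def reindex_bandwidth_mbps(num_pages, avg_page_bytes, period_days):
    """Average bandwidth, in megabits per second, needed to fetch every page
    once per reindexing period."""
    total_bits = num_pages * avg_page_bytes * 8
    return total_bits / (period_days * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    # 150 million pages every 18 days, with an ASSUMED 12,000 bytes per page.
    print(reindex_bandwidth_mbps(150e6, 12_000, 18))
```

With that assumed page size, the 150-million-page case comes out at roughly 9.3 megabits/sec, in the same range as the figure quoted on the slide.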
Summary:
- About one in five pages has been modified within the last 12 days.
- (0.95, 1-week) currency on our collection: must observe every 18 days.
- Ideas: more specialty search engines? Distributed monitoring / remote update?
- Other work: algorithms for scheduling observations based on source change rate and importance.
- Future study: Path Manager.