Published by Stephanie Reese. Modified over 2 years ago.

1
Modeling and Managing Content Changes in Text Databases
Panos Ipeirotis (New York University), Alexandros Ntoulas (UCLA), Junghoo Cho (UCLA), Luis Gravano (Columbia University)

2
Panos Ipeirotis – New York University
Metasearchers Provide Access to Text Databases
Large number of hidden-web databases available
Contents not accessible through Google
Need to query each database separately
Broadcasting queries to all databases is not feasible (~100,000 DBs)
Figure: the query "thrombopenia" is sent to a metasearcher that fronts NYTimes Archives, PubMed, and USPTO

3
Metasearchers Provide Access to Text Databases
Database selection relies on simple content summaries: vocabulary, word frequencies
Figure: the metasearcher consults the summaries of NYTimes Archives, PubMed, USPTO, ... for the query word; entries shown: thrombopenia 26,...; thrombopenia 0; thrombopenia ?

4
Extracting Content Summaries from Text Databases
For hidden-web databases (query-only access):
  Send queries to database
  Retrieve top matching documents
  Use document sample as database representative
For crawlable databases:
  Retrieve documents by following links (crawling)
  Stop when all documents retrieved
Content summary contains:
  Words in sample (or crawl)
  Document frequency of each word in sample (or crawl)

PubMed (11,868,552 documents)
Word          #Documents
aids             123,826
cancer         1,598,896
heart            706,537
hepatitis        124,320
thrombopenia      26,887
…
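The content summary described above (words plus the document frequency of each word) can be sketched in a few lines. This is a minimal illustration, not the authors' extraction code: the function name `content_summary`, the tokenizer, and the three toy documents are assumptions made for the example.

```python
from collections import Counter
import re

def content_summary(documents):
    """Map each word to the number of documents that contain it."""
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document: document frequency,
        # not the number of occurrences inside the document.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        doc_freq.update(words)
    return dict(doc_freq)

# Hypothetical document sample (for example, the top matches of a few queries).
sample = [
    "Thrombopenia in cancer patients",
    "Heart disease and cancer",
    "Hepatitis and AIDS",
]
summary = content_summary(sample)
print(summary["cancer"])  # number of sampled documents containing "cancer"
```

For a crawlable database the same function would run over all retrieved pages instead of a sample.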

5
Never-update Policy
Current practice: construct summary once, never update
Extracted (old) summary may:
  Miss new words (from new documents)
  Contain obsolete words (from deleted documents)
  Provide inaccurate frequency estimates

Word       #Docs, NY Times (Oct 29, 2004)   #Docs, NY Times (Mar 29, 2005)
tsunami    (0)                              250
recount    2,302                            (0)
grokster   2                                78
…

6
Research Challenge
Updating summaries is costly!
Challenge:
  Maintain good quality of summaries, and
  Minimize number of updates
If summaries do not change: problem solved!
If summaries change: estimate rate of change and schedule updates

7
Outline
Do content summaries change over time?
Which database properties affect the rate of change?
How to schedule updates with constrained resources?

8
Data for our Study: 152 Web Databases
Randomly picked from Open Directory
Multiple domains
Multiple topics
Searchable (to construct summaries by querying)
Crawlable (to retrieve full contents)
Examples: …, www.intellihealth.com, www.fda.gov, www.si.edu

9
Data for our Study: 152 Web Databases
Study period: Oct 2002 – Oct […]
[…] weekly snapshots for each database
5 million pages in each snapshot (approx.)
65 GB per snapshot (3.3 TB total)
For each week and each database, we built:
  Complete summary (by scanning all pages)
  Approximate summary (by query-based sampling)

10
Measuring Changes over Time
Recall: How many words in the current summary are also in the old (extracted) summary?
  Shows how well old summaries cover the current (unknown) vocabulary
  Higher values are better
Precision: How many words in the old (extracted) summary are still in the current summary?
  Shows how many obsolete words exist in the old summaries
  Higher values are better
Results for complete summaries (similar for approximate)
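Read as set operations over the two vocabularies, the two measures could look like the sketch below. It is an unweighted illustration only; the function names are assumptions, and the study itself may weight words differently. The toy vocabularies reuse the words from the NY Times example earlier in the deck.

```python
def recall(old_words, current_words):
    """Fraction of the current vocabulary that the old summary covers."""
    current = set(current_words)
    if not current:
        return 1.0  # nothing to cover
    return len(current & set(old_words)) / len(current)

def precision(old_words, current_words):
    """Fraction of the old vocabulary that is still present (not obsolete)."""
    old = set(old_words)
    if not old:
        return 1.0  # no words, hence no obsolete words
    return len(old & set(current_words)) / len(old)

# Words with a non-zero document count in each summary.
old_vocab = {"recount", "grokster"}
current_vocab = {"tsunami", "grokster"}
print(recall(old_vocab, current_vocab), precision(old_vocab, current_vocab))
```

Here the old summary misses "tsunami" (recall drops) and still lists "recount" (precision drops).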

11
Summaries over Time: Conclusions
Databases (and their summaries) are not static
Quality of old summaries deteriorates over time
Quality decreases for both complete and approximate content summaries (see paper for details)
How often should we refresh the summaries?

12
Outline
Do content summaries change over time?
Which database properties affect the rate of change?
How to schedule updates with constrained resources?

13
Survival Analysis
Survival analysis: a collection of statistical techniques for predicting the time until an event occurs
Initially used to measure length of survival of patients under different treatments (hence the name)
Used to measure the effect of different parameters (e.g., weight, race) on survival time
We want to predict the time until the next update and find database properties that affect this time

14
Survival Analysis for Summary Updates
Survival time of summary: time until the current database summary is sufficiently different from the old one (i.e., an update is required)
Old summary changes at time t if: KL divergence(current, old) > τ, where τ is the change sensitivity threshold
Survival analysis estimates the probability that a database summary changes within time t
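The change test above can be sketched as follows. The slide does not say how word probabilities are derived or how unseen words are handled, so the additive smoothing, the shared vocabulary, and the names `word_distribution`, `kl_divergence`, and `needs_update` are assumptions made for this illustration.

```python
import math

def word_distribution(doc_freq, vocabulary, smoothing=1.0):
    """Smoothed word probabilities over a shared vocabulary (assumed scheme)."""
    total = sum(doc_freq.get(w, 0) + smoothing for w in vocabulary)
    return {w: (doc_freq.get(w, 0) + smoothing) / total for w in vocabulary}

def kl_divergence(current, old, smoothing=1.0):
    """KL(current || old) between the word distributions of two summaries."""
    vocabulary = set(current) | set(old)
    p = word_distribution(current, vocabulary, smoothing)
    q = word_distribution(old, vocabulary, smoothing)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocabulary)

def needs_update(current, old, tau):
    """The old summary counts as changed once the divergence exceeds tau."""
    return kl_divergence(current, old) > tau

# Document frequencies taken from the NY Times example earlier in the deck.
old = {"recount": 2302, "grokster": 2}
current = {"tsunami": 250, "grokster": 78}
print(needs_update(current, old, tau=0.5))
```

Smoothing keeps the divergence finite when the current summary contains a word the old one has never seen, which is exactly the case the never-update policy runs into.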

15
Modeling Goals
Goal: Estimate database-specific survival time distribution
Exponential distribution S(t) = exp(-λt) is common for survival times
  λ captures rate of change
  Need to estimate λ for each database
  Preferably, infer λ from database properties (with no training)
Intuitive (and wrong) approach: data + multiple regression
  Study contains a large number of incomplete observations
  Target variable S(t) typically not Gaussian
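To make the role of λ concrete, the sketch below evaluates the exponential survival function from the slide and inverts it. The rate λ = 0.1 per week is a made-up value, and the function names are chosen only for this example.

```python
import math

def survival(t, lam):
    """S(t) = exp(-lam * t): probability the summary is unchanged at time t."""
    return math.exp(-lam * t)

def change_probability(t, lam):
    """Probability that the summary has changed within time t."""
    return 1.0 - survival(t, lam)

def time_until_change_probability(p, lam):
    """Time t at which the change probability reaches p (inverse of the above)."""
    return -math.log(1.0 - p) / lam

lam = 0.1  # hypothetical rate of change, per week
print(change_probability(10, lam))              # chance of a change within 10 weeks
print(time_until_change_probability(0.5, lam))  # weeks until a change is as likely as not
print(1.0 / lam)                                # mean survival time of an exponential
```

A larger λ shortens every one of these times, which is why a per-database estimate of λ is enough to drive an update schedule.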

16
Survival Times and Incomplete Data
Figure: survival times for a database plotted by week (X marks), with censored cases at week 52, the end of the study
Many observations are incomplete (aka censored)
Censored data give partial information (database did not change)

17
Using Censored Data
By ignoring censored cases we get underestimates of survival times, so we perform more update operations than needed
By using censored cases as-is we (again) get underestimates
Survival analysis extends the lifetime of censored cases
Figure: best fits of S(t) when ignoring censored data, when using censored data as-is, and when using censored data
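The three ways of treating censored cases can be compared on a toy data set. The sketch assumes the exponential model from the earlier slide, for which the standard censoring-aware maximum-likelihood estimate is λ = (number of observed changes) / (total observed time); the durations below are invented for illustration.

```python
def mean_survival_ignoring_censored(durations, changed):
    """Average only the cases where a change was actually observed."""
    observed = [d for d, c in zip(durations, changed) if c]
    return sum(observed) / len(observed)

def mean_survival_censored_as_is(durations):
    """Treat every censored duration as if a change happened at that time."""
    return sum(durations) / len(durations)

def mean_survival_censoring_aware(durations, changed):
    """Exponential MLE with right-censoring: total observed time / observed changes."""
    events = sum(changed)
    return sum(durations) / events

# Hypothetical weeks until change; False marks cases still unchanged at week 52.
durations = [5, 12, 20, 52, 52]
changed = [True, True, True, False, False]

print(mean_survival_ignoring_censored(durations, changed))
print(mean_survival_censored_as_is(durations))
print(mean_survival_censoring_aware(durations, changed))
```

Both shortcuts yield a shorter mean survival time than the censoring-aware estimate, which matches the slide's point that they lead to more update operations than needed.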

18
Database Properties and Survival Times
For our analysis, we use Cox Proportional Hazards Regression
  Uses censored data effectively (i.e., database did not change within time T)
  Derives effect of database properties on rate of change
    E.g., if you double the size of a database, it changes twice as fast
  No assumptions about the form of the survival function
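For reference, the general form of a Cox proportional hazards model is written out below. The symbols are the textbook ones, not notation from the deck: h is the hazard (instantaneous rate of change), h_0 the unspecified baseline hazard, x the vector of database properties, and β the fitted coefficients. How the study encodes database size as a covariate is not stated here, so the closing remark is only an assumption-based reading of the "double the size" example.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\frac{h(t \mid x_a)}{h(t \mid x_b)} = \exp\!\left(\beta^{\top}(x_a - x_b)\right)
```

The ratio on the right does not depend on h_0(t), which is why no form has to be assumed for the survival function. If size entered the model as ln(size) with coefficient β_size, doubling the size would multiply the hazard by 2^{β_size}, and β_size = 1 would correspond to "changes twice as fast".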

19
Cox PH Regression Results
Examined effect of:
  Change-sensitivity threshold τ (higher τ, longer survival)
  Topic (does not matter, except for health-related sites)
  Size (larger databases change faster!)
  Number of words (does not matter)
  Differences of summaries extracted in consecutive weeks (sites that changed frequently in the past change frequently in the future)
  Domain (details in next slide)
Legend: rate of change increases / rate of change decreases

20
Baseline Survival Functions by Domain
Effect of domain:
  GOV changes more slowly than any other domain
  EDU changes fast in the short term, but more slowly in the long term
  COM and other commercial sites change faster than the rest

21
Results of Cox PH Analysis
Cox PH analysis gives a formula for predicting the time between updates for any database
Rate of change depends on:
  domain
  database size
  history of change
  threshold τ
By knowing the time between updates, we can schedule update operations better!

22
Outline
Do content summaries change over time?
Which database properties affect the rate of change?
How to schedule updates with constrained resources?

23
Deriving an Update Policy
Naïve policy:
  Updates all databases at the same time (i.e., assumes identical change rates)
  Suboptimal use of resources
Our policy:
  Use change rate as predicted by survival analysis
  Exploit database-specific estimates for rate of change

24
Scheduling Updates
Database        Rate of change λ   Average time between updates
                                   10 weeks     40 weeks
Toms Hardware   …                  … weeks      46 weeks
USPS            …                  … weeks      34 weeks
With plentiful resources, we update sites according to their rate of change
When resources are constrained, we update sites that change too frequently less often
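The two scheduling observations above can be reproduced with a small numerical experiment. This is not the authors' scheduling algorithm: it assumes exponential change with rate λ, periodic refreshes, and the time-averaged probability that a summary is still current, (1 - exp(-λ/f)) / (λ/f) for f refreshes per week, as the quality measure. The two rates and both budgets are made-up values, and a grid search stands in for a proper optimizer.

```python
import math

def avg_freshness(lam, freq):
    """Time-averaged probability that a summary is still current, for
    exponential change rate `lam` and `freq` evenly spaced refreshes per week."""
    if freq <= 0:
        return 0.0  # never refreshed: stale in the long run
    x = lam / freq
    return (1.0 - math.exp(-x)) / x

def best_share_for_fast(lam_fast, lam_slow, budget, steps=100):
    """Grid-search the share of the refresh budget given to the fast-changing
    database that maximizes the total freshness of both databases."""
    best_share, best_total = 0.0, -1.0
    for i in range(steps + 1):
        share = i / steps
        total = (avg_freshness(lam_fast, share * budget)
                 + avg_freshness(lam_slow, (1 - share) * budget))
        if total > best_total:
            best_share, best_total = share, total
    return best_share

lam_fast, lam_slow = 2.0, 0.1  # hypothetical change rates, per week
print(best_share_for_fast(lam_fast, lam_slow, budget=20.0))  # plentiful resources
print(best_share_for_fast(lam_fast, lam_slow, budget=0.2))   # constrained resources
```

Under this model a generous budget favors the fast-changing database, while a tight budget is better spent on the slow-changing one, because a database that changes far faster than it can be refreshed is stale almost immediately either way.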

25
Scheduling Results
Clever scheduling improves quality of summaries (according to KL, precision and recall)
Our policy allows users to select change thresholds optimally according to available resources, or vice versa (see paper)

26
Updating Content Summaries: Contributions
Extensive experimental study (1 year, 152 databases): established the need to periodically update statistics (summaries) for text databases
Change frequency model: showed that database characteristics can predict time between updates
Scheduling algorithms: devised update policies that exploit the survival model and use available resources efficiently

27
Current and Future Work
Current:
  Compared with machine learning techniques
  Applied technique for web crawling
Future:
  Apply survival analysis for refreshing db statistics (materialized views, index statistics, …)
  Examine efficiency of survival analysis models
  Create generative models for modeling database changes

28
Thank you! Questions?

29
Related Work
Brewington & Cybenko, WWW9, Computer 2000
Cho & Garcia-Molina, VLDB 2000, SIGMOD 2000, TOIT 2003
Coffman, J. Scheduling, 1998
Olston & Widom, SIGMOD 2002

30
Measuring Changes over Time
KL divergence: How similar is the word distribution in old and current summaries?
  Identical summaries: KL=0
  Higher values are worse
Results for complete summaries (similar for approximate)
