Caching with “Good Enough” Currency, Consistency, and Completeness

1 Caching with “Good Enough” Currency, Consistency, and Completeness
Hongfei Guo (University of Wisconsin), Per-Åke Larson (Microsoft Research), Raghu Ramakrishnan (University of Wisconsin)

2 Motivation — Scaling Google
When the search engine was first created (in the CS department at Stanford), the request volume was small, and a laptop might have sufficed. As its popularity increased, it needed to scale. They use partitioning and replication to share the workload.

3 Motivation — Scaling A DBMS By Caching
How can we tell whether the cached data is “good enough” for an application? There are NO data quality requirements from the applications and NO data quality guarantees from the caching DBMS!
Diagram: Application Server (app-specific code), Caching DBMS, Backend DBMS; updates reach the cache asynchronously.
Now let’s look at a DBMS-backed service. The workload is heavy, and the Internet sits between the users and the back-end server. There are two tasks: workload sharing and better user response time. Caching is commonly used to achieve these goals. For scalability reasons, updates are propagated asynchronously, so cached data can be out of sync. Different applications may have different tolerance for stale data.
The problem: no requirements, no guarantees. Thus people manually choose the correct data source (analogous to the early days, when indices were chosen manually). This has three drawbacks: it is all hand-crafted; it requires modifying the applications, e.g., to route the queries to the desired replica whenever the underlying strategy changes; and there are no guarantees on data quality (you get what you get).
Why can't the caching DBMS do this? It lacks an understanding of the applications' needs. If it understood the applications better, this common task could be moved inside the caching DBMS. My research provides a framework towards this ambitious goal.

4 The Big Picture
Apps: specify data quality requirements in queries
Cache: enforces data quality constraints [SIGMOD 2004] [SIGMOD 2004 Demo]
Cache admin: specifies the local data quality to be maintained by the cache (data quality-aware database caching model) [this presentation]
System performance evaluation [dissertation]
Granularity: view level; finer granularity (partitions of a view)
Diagram: Application Server, Caching DBMS, Backend DBMS.
This makes it possible to provide DBMS guarantees. Further, the additional knowledge gives the caching DBMS the ca… Knowing the requirements and the local data quality, query processing at the cache… One step further, with the additional knowledge about the data quality requirements of the apps, we can do better cache management. I will talk about the first three parts.

5 Data Quality Metrics (informal)
Currency: the elapsed time since this copy became stale.
Consistency: a query result is (snapshot) consistent iff it is as if it had been evaluated from a snapshot of the master database.
C&C: Currency & Consistency.
How do we measure data quality? We use currency to describe how old an object is. We define it as … And consistency is used to describe the relationship among a group of objects.

6 Roadmap
Background
Cache data quality properties
Cache property specification
Enforcing data quality constraints
Experiments
Future directions and conclusions
In the rest of my talk, I will explain each part in turn. Then I will briefly mention some of my other research and discuss some future directions. OK, how do we specify “good enough”?

7 Why Define Cache Properties?
Diagram: Query processing, Cache properties (= contract), Cache maintenance.
Query processing relies on the contract. OK, why do we need cache properties? The idea is to add an abstraction layer…

8 Cache Properties (P+3C)
Presence: per object
Consistency: a set of objects
Completeness: per predicate
Currency: object staleness
We defined four fundamental cache properties:…

9 Basic Concepts
Figure labels: Master Database (objects, tables), snapshots H1 and H2, Cache (View 1).
Before I proceed, let’s first look at the data model. The master database is a set of objects, organized into tables. The cache is a set of select views. The master database evolves due to committed update transactions. We call the database state after each update transaction a history. Each object of the cache is copied from some snapshot of the master database. Different objects might be copied from different snapshots of the master database.

10 Cache Property Examples
Currency = now - stale point
Figure labels: Present, Consistent, Complete; View 1, View 2, View 3; Master Database (H1, H2); Cache; Stale point.
Stale: a copy is stale when it is not the same as the master. The first time it becomes stale is its stale point. Currency is the time elapsed since then.
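The formula on this slide can be illustrated with a small Python sketch. It is only an illustration under assumptions not in the slides: the master history of one object is modeled as a sorted list of update commit times, and `stale_point` and `currency` are hypothetical helper names.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def stale_point(copied_at: float, update_times: Sequence[float]) -> Optional[float]:
    """Commit time of the first master update after the copy was taken.

    update_times must be sorted in ascending order (an assumption of this sketch).
    Returns None if no later update exists, i.e. the copy is not stale.
    """
    i = bisect_right(update_times, copied_at)
    return update_times[i] if i < len(update_times) else None


def currency(now: float, copied_at: float, update_times: Sequence[float]) -> float:
    """Currency = now - stale point; 0 if the copy has not become stale yet."""
    sp = stale_point(copied_at, update_times)
    if sp is None or sp > now:
        return 0.0
    return now - sp
```

For example, a copy taken at time 10 of an object updated at times 5, 20, and 60 has stale point 20, so at time 100 its currency is 80.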

11 Roadmap
Background
Cache data quality properties
Cache property specification
Enforcing data quality constraints
Experiments
Future directions and conclusions
In the rest of my talk, I will explain each part in turn. Then I will briefly mention some of my other research and discuss some future directions. OK, how do we specify “good enough”?

12 Specifying Cache Properties
Specified as integrity constraints:
Presence constraint
Consistency constraint
Completeness constraint
Presence correlation constraint
Consistency correlation constraint
OK, we defined a set of fundamental cache properties. Now how can a cache admin specify the desired properties? Our solution is to specify them as integrity constraints. Each such constraint specifies a cache property for a group of objects. Now the question is, how do we specify which group of objects? Conceptually, we do so through a control table mechanism.

13 Presence Constraint
AuthorCopy (authorId, name, city): (1, Alice, Madison), (2, Bob, Madison), (3, Cedric, Seattle)
AuthorList_PCT (authorId): 1, 2, 3
Figure labels: Backend DBMS, Caching DBMS.

14 Presence Constraint
Partially materialized view [Zhou et al 2005]
CREATE VIEW AuthorCopy AS SELECT * FROM Authors
CREATE TABLE AuthorList_PCT (authorId int)
ALTER VIEW AuthorCopy ADD PRESENCE ON authorId IN (SELECT authorId FROM AuthorList_PCT)
AuthorCopy (authorId, name, city): (1, Alice, Madison), (2, Bob, Madison), (3, Cedric, Seattle)
AuthorList_PCT (authorId): 1, 2, 3
Figure labels: control-key (authorId), control-table (AuthorList_PCT).
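The control-table mechanism on this slide can be mimicked outside a DBMS. The following Python sketch is an illustration only, not the prototype's implementation: the backend table is assumed to be a list of dicts, and `materialize_pmv` is a hypothetical helper.

```python
# Hypothetical in-memory stand-in for the backend Authors table.
AUTHORS = [
    {"authorId": 1, "name": "Alice", "city": "Madison"},
    {"authorId": 2, "name": "Bob", "city": "Madison"},
    {"authorId": 3, "name": "Cedric", "city": "Seattle"},
]


def materialize_pmv(backend_rows, control_key, control_table):
    """Keep only the rows whose control-key value appears in the control table."""
    wanted = set(control_table)
    return [row for row in backend_rows if row[control_key] in wanted]


# With control table {1, 3}, only Alice and Cedric are present in the view.
author_list_pct = [1, 3]
author_copy = materialize_pmv(AUTHORS, "authorId", author_list_pct)
```

Changing the contents of the control table changes which rows must be present, without redefining the view.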

15 Consistency Constraint
CREATE TABLE CityList_CsCT (city string)
ALTER VIEW AuthorCopy ADD CONSISTENCY ON city IN (SELECT city FROM CityList_CsCT)
AuthorCopy (authorId, name, city): (1, Alice, Madison), (2, Bob, Madison), (3, Cedric, Seattle)
CityList_CsCT (city): Madison
AuthorList_PCT (authorId): 1, 2, 3
Figure labels: Cache Region, Backend DBMS.

16 Completeness Constraint
CREATE TABLE CityList_CpCT (city string)
ALTER VIEW AuthorCopy ADD COMPLETENESS ON city IN (SELECT city FROM CityList_CpCT)
AuthorCopy (authorId, name, city): (1, Alice, Madison), (2, Bob, Madison), (3, Cedric, Seattle)
CityList_CpCT (city): Madison
AuthorList_PCT (authorId): 1, 3
Figure label: Backend DBMS.
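One way to read the completeness constraint: for every city listed in the control table, all backend authors from that city must be in the cache. The Python sketch below checks that reading on in-memory data. It is an assumption-laden illustration; `is_complete` and the list-of-dicts tables are hypothetical.

```python
# Hypothetical in-memory stand-in for the backend Authors table.
BACKEND_AUTHORS = [
    {"authorId": 1, "name": "Alice", "city": "Madison"},
    {"authorId": 2, "name": "Bob", "city": "Madison"},
    {"authorId": 3, "name": "Cedric", "city": "Seattle"},
]


def is_complete(cache_rows, backend_rows, column, control_values, key="authorId"):
    """True if every backend row whose `column` value is listed in
    control_values is also present in the cache (matched on `key`)."""
    cached_keys = {row[key] for row in cache_rows}
    listed = set(control_values)
    return all(
        row[key] in cached_keys
        for row in backend_rows
        if row[column] in listed
    )


# Presence alone (authors 1 and 3) misses Bob, who also lives in Madison.
presence_only = [r for r in BACKEND_AUTHORS if r["authorId"] in {1, 3}]
full_copy = list(BACKEND_AUTHORS)
```

Here `is_complete(presence_only, BACKEND_AUTHORS, "city", ["Madison"])` is false while the same check on `full_copy` is true, which matches the slide: Bob appears in AuthorCopy even though AuthorList_PCT lists only authors 1 and 3.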

17 Presence Correlation Constraint
AuthorList_PCT (authorId): 1, 2, 3
AuthorCopy (authorId, name, city): (1, Alice, Madison), (2, Bob, Madison), (3, Cedric, Seattle)
BookCopy (isbn, authorId, title): (111, 1, aaa), (222, 1, bbb), (333, 2, ccc), (444, 3, ddd), (555, 3, eee)
ALTER VIEW BookCopy ADD PRESENCE ON authorId IN (SELECT authorId FROM AuthorCopy)
Figure label: Backend DBMS.
So far, the constraints specify desired properties for a single materialized view. How about cache properties across views?

18 Presence Correlation Constraint
AuthorList_PCT (authorId): 1, 2, 3
AuthorCopy (authorId, name, city): (1, Alice, Madison), (2, Bob, Madison), (3, Cedric, Seattle)
BookCopy (isbn, authorId, title): (111, 1, aaa), (222, 1, bbb), (333, 2, ccc), (444, 3, ddd), (555, 3, eee)
Graph: AuthorList_PCT, AuthorCopy, BookCopy, with edges labeled authorId.
So far, the constraints specify desired properties for a single materialized view. How about cache properties across views? Point out that they can have the same parents.

19 Consistency Correlation Constraint
AuthorList_PCT (authorId): 1, 2, 3
AuthorCopy (authorId, name, city): (1, Alice, Madison), (2, Bob, Madison), (3, Cedric, Seattle)
BookCopy (isbn, authorId, title): (111, 1, aaa), (222, 1, bbb), (333, 2, ccc), (444, 3, ddd), (555, 3, eee)
ALTER VIEW BookCopy ADD CONSISTENCY ROOT
Figure label: Backend DBMS.
So far, the constraints specify desired properties for a single materialized view. How about cache properties across views?

20 Consistency Correlation Constraint
AuthorList_PCT (authorId): 1, 2, 3
AuthorCopy (authorId, name, city): (1, Alice, Madison), (2, Bob, Madison), (3, Cedric, Seattle)
BookCopy (isbn, authorId, title): (111, 1, aaa), (222, 1, bbb), (333, 2, ccc), (444, 3, ddd), (555, 3, eee)
Graph: AuthorList_PCT, AuthorCopy, BookCopy, with edges labeled authorId.
So far, the constraints specify desired properties for a single materialized view. How about cache properties across views?

21 Cache Schema Example
Graph nodes: AuthorList_PCT, ReviewerList_PCT, AuthorCopy, ReviewerCopy, BookCopy, ReviewCopy.
Edge labels: authorId, reviewerId, reviewId, isbn.
To summarize, a cache schema can be represented by a graph.

22 Roadmap
Background
Cache data quality properties
Cache property specification
Enforcing data quality constraints
Experiments
Future directions and conclusions
In the rest of my talk, I will explain each part in turn. Then I will briefly mention some of my other research and discuss some future directions. OK, how do we specify “good enough”?

23 Changing The Assumptions
Fully materialized views → partially materialized views
Consistent views → row-level consistency
Push-based maintenance → pull-based maintenance
More general algorithms
Run-time checks for consistency constraints that cannot be validated at compile time
Simple case: view-level quality control. General case: fine-granularity (row-level) quality control.

24 Run-time C&C Checking
When view V matches expression E:
Plan diagram: ChoosePlan with a guard selects between a local plan using V and a remote plan requesting E.
Currency guard: checks whether the local view V satisfies the currency requirement.
Consistency guard: checks whether the local view V satisfies the consistency requirement.
Explain view matching. Explain SWU. In our special case, what is a currency guard? Note that our currency guard is at the view level instead of the query level. One advantage of this is that if a plan involves more than one currency guard, say two, then during execution it is possible that for one SWU local data is used, but for the other remote data is used.
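The guard-then-choose behavior described on this slide can be sketched in Python. This is a simplified model, not the prototype's operator: staleness and bounds are plain numbers in seconds, the plans are callables, and `guarded_plan` as well as the plan strings are hypothetical.

```python
def guarded_plan(staleness_s, bound_s, local_plan, remote_plan):
    """ChoosePlan-style switch: run the local plan only if the currency
    guard passes, otherwise fall back to the remote plan."""
    if staleness_s <= bound_s:
        return local_plan()
    return remote_plan()


# Two guards in one query, each evaluated independently at run time:
# custCopy is fresh enough (30 s <= 600 s), orderCopy is not (900 s > 600 s).
cust = guarded_plan(30, 600, lambda: "local custCopy", lambda: "remote Customer")
orders = guarded_plan(900, 600, lambda: "local orderCopy", lambda: "remote Orders")
```

Because each guard is attached to a view rather than to the whole query, the same execution can mix local and remote inputs, as in the two-guard case mentioned in the notes.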

25 Performance Evaluation Goals
Consistency guard overhead:
Simple checks
A spectrum of checks ranging from simple to complicated
We added more functionality to the cache. What is the overhead associated with it?

26 Experimental Setting
Back-end hosts a TPCD database tpcd1gh with scale factor 1.0 (~1GB)
Cache server has a shadow of tpcd1gh
Two local views: custCopy, orderCopy
LAN connection between cache and backend server
So far, I have described how we enforce C&C constraints in a caching DBMS. Compared to a normal plan, in order to guarantee currency requirements, our plan has to include run-time currency checking. How much overhead does this incur? We ran experiments to answer this question. The database is 1 GB TPCD. As local data we have two simple projection views. Take out the region information; it isn't used here.

27 Queries Used
Characterize the queries.
Q1: the simplest query, an index lookup on a primary key that returns a single row (the fastest possible query).
Q2: a simple, fast index nested-loop join query that returns 6 rows.
Q3: a lookup query on a non-key column that returns about 6000 rows.
What plans? For Q1 and Q3, the local view of Customer is used with a currency guard; for Q2, both local views are used, so it has two currency guards. Q1 and Q2 are designed to show the worst-case overhead, since those queries are short.

28 Simple Consistency Guards Overhead
Chart: execution time (ms), local vs. remote; overhead labels shown: 1.6%, 1.72%, 1.66%, 1.59%, 16.56%, 14.00%.
How do we get the numbers? How to read this table: local case, remote case. We measure both the absolute and the relative overhead. Q1 and Q2: the absolute value is small but a significant percentage, because the query itself is small. Q3: the percentage is small, but the absolute value is high. Why? (For the remote case there is too much variance. Appr. S level. Too much noise.) The overhead is small. We want to investigate it further.

29 Single Table Consistency Guard Overhead
Chart: execution time (ms), local vs. remote (Qa is used); overhead labels shown: 2.33%, 7.48%, 8.79%, 6.06%, 4.95%, 71.41%, 62.85%, 16.98%, 58.32%, 23.77%.

30 Future Directions
Adaptive data quality-aware caching policies: control-table content? Refresh intervals?
Improve the current prototype: read-write transactions? Time-line constraints?
Automate cache design/tuning: how to get a good cache schema (i.e., cache region granularity, assignment)?
Apply “good enough” to other forms of replication: indexing data?
We envision two lines of future research. So far, given a set of cached data, our techniques guarantee that the query results satisfy the C&C requirements. We didn’t touch upon the cache management problem, that is, what to cache and how to maintain the cache. Much research has addressed this problem. But now we know more about the workload. With this additional information about C&C requirements, can we do better?

31 Summary
Goal: fine-grained data quality-aware cache management
A comprehensive solution:
How does the cache track data quality? Four cache properties
How does the admin specify cache properties? Dynamic cache model
How to maintain the cache efficiently? Efficient cache maintenance and “safety”
How to enforce C&C constraints for queries? Efficiently enforce C&C checking
So long, and thanks for all the fish!
Questions?

32 Min’s Comments
Motivating cache.
Emphasize default (easy to use) for both queries and cache admin.
Performance goal: what do you want to test? Overhead, not performance.

33 Proposed SQL Syntax
Example tables: BookCopy (bid, title, author), ReviewCopy (rid, bid, text).
SELECT * FROM Books B, Reviews R WHERE B.bid = R.bid AND B.title = "Databases"
CURRENCY BOUND 10 min ON (B, R)
CURRENCY BOUND 10 min ON (B), 30 min ON (R)
CURRENCY BOUND 10 min ON (B, R) BY B.bid
Clause components: currency bound, consistency class, group by.
We extend SQL syntax to include a currency clause for each query block. There are three components in this clause. The pair of parentheses… So this currency clause says: the whole… Let’s look at another example. This currency clause says: in the query results, … In this example, we see the third component: the group-by phrase. It can be added to a consistency class to specify that the scope of the consistency class is only within each group. This currency clause says that if we group the query result by book id, then each group has to be consistent, but different groups don’t have to be mutually consistent.
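The effect of the group-by phrase can be made concrete with a small check. The Python sketch below assumes, purely for illustration, that every joined result row carries the ids of the master snapshots its Books and Reviews parts were copied from (`b_snapshot`, `r_snapshot`); these field names and the `consistent_by_group` helper are hypothetical.

```python
from collections import defaultdict


def consistent_by_group(rows, group_key):
    """True if, within each group, all row parts come from one snapshot.
    Different groups may come from different snapshots."""
    snapshots = defaultdict(set)
    for row in rows:
        snapshots[row[group_key]].update((row["b_snapshot"], row["r_snapshot"]))
    return all(len(ids) == 1 for ids in snapshots.values())


# Group bid=1 comes entirely from snapshot 7 and group bid=2 from snapshot 9:
# acceptable under "ON (B, R) BY B.bid", even though the result as a whole
# is not drawn from a single snapshot.
result = [
    {"bid": 1, "rid": 10, "b_snapshot": 7, "r_snapshot": 7},
    {"bid": 1, "rid": 11, "b_snapshot": 7, "r_snapshot": 7},
    {"bid": 2, "rid": 12, "b_snapshot": 9, "r_snapshot": 9},
]
mixed = [{"bid": 1, "rid": 10, "b_snapshot": 7, "r_snapshot": 8}]
```

Dropping the BY phrase corresponds to treating the whole result as one group, which `result` would fail.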

34 Pull-Maintenance
Refresh a region by pulling query results.
When refreshing a region, also refresh the affected closure:
All overlapping regions
All correlated regions
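A possible way to compute the affected closure is a graph traversal over the "overlaps" and "correlated" relationships. The Python sketch below assumes the closure is transitive and that both relationships are given as adjacency dicts; `affected_closure` and the region names are hypothetical.

```python
from collections import deque


def affected_closure(start, overlaps, correlated):
    """Breadth-first search over overlapping and correlated regions.

    overlaps and correlated map a region name to a set of neighbor regions.
    Returns every region that must be refreshed together with `start`.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        region = queue.popleft()
        neighbors = overlaps.get(region, set()) | correlated.get(region, set())
        for nxt in neighbors - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen


# Hypothetical regions: r1 overlaps r2, and r2 is correlated with r3.
overlaps = {"r1": {"r2"}, "r2": {"r1"}}
correlated = {"r2": {"r3"}}
closure = affected_closure("r1", overlaps, correlated)
```

Refreshing r1 then also refreshes r2 and r3, while an unrelated region r4 would be left alone.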

35 Theoretical Results
Definition (safe partially materialized views): A partially materialized view V is safe if the following two conditions hold for every instance of the cache that satisfies all integrity constraints:
1. For any pair of regions in V, either they don’t overlap or one is contained in the other.
2. If V is gray, let X denote the set of regions in V defined by presence control-key values. X is a partitioning of V, and no pair of regions in X is contained in any one region defined on V.
Cache schema design rules (syntactically checkable conditions, polynomial):
Rule 1: A cache graph is a DAG.
Rule 2: Only red nodes can have independent completeness or consistency control-tables.
Rule 3: Every PMV with more than one parent must be a red circle.
Rule 4: If a PMV has the shared-row problem according to Lemma 5.2, then it cannot be gray.
Rule 5: A PMV cannot have non-compatible control-tables.
Theorem (property holds for every instance): Given a cache schema <W, E>, if it satisfies the design rules, then every PMV in W is safe. Conversely, if the schema violates one of these rules, there is an instance of the cache satisfying all specified integrity constraints in which some PMV is unsafe.
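The first safety condition (any two regions are either disjoint or nested) can be checked on one concrete cache instance with a few lines of Python. This is only a sketch: regions are modeled as sets of row ids, and `regions_nested_or_disjoint` is a hypothetical name. The definition itself quantifies over every instance, which the design rules are meant to guarantee syntactically; the code checks a single instance.

```python
from itertools import combinations


def regions_nested_or_disjoint(regions):
    """True if every pair of regions is disjoint or one contains the other."""
    for a, b in combinations(regions, 2):
        overlap = a & b
        if overlap and not (a <= b or b <= a):
            return False  # partial overlap: condition 1 is violated
    return True


# {1, 2} and {3} are disjoint and both nest inside {1, 2, 3}.
ok_regions = [{1, 2}, {3}, {1, 2, 3}]
# {1, 2} and {2, 3} share row 2 but neither contains the other.
bad_regions = [{1, 2}, {2, 3}]
```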

36 Pull-Maintenance
BookCopy (isbn, authorId, title): (111, 1, aaa), (222, 1, bbb), (333, 1, ccc), (444, 3, aaa), (555, 4, eee)
AuthorList_PCT (authorId): 1, 3, 4
TitleList_CsCT (title): aaa

37 Pull-Maintenance
AuthorCopy (authorId, name, city): (1, Alice, Madison), (3, Cedric, Seattle)
BookCopy (isbn, authorId, title): (111, 1, aaa), (222, 1, bbb), (333, 1, ccc), (444, 3, aaa), (555, 3, eee)
Graph: AuthorList_PCT, AuthorCopy, BookCopy, with edges labeled authorId.

38 Inefficient Pulling
Shared-row problem
AuthorCopy (authorId, name, city): (1, Alice, Madison), (3, Cedric, Seattle)
AuthorBookCopy (authorId, isbn): (1, 111), (1, 222), (1, 333), (3, 111), (3, 555)
BookCopy (isbn, price, title): titles aaa, bbb, ccc, eee
Edge labels: authorId, isbn.

39 Issues
Inefficient pulling: calculating the affected closure requires checking the rows.
Efficient pulling: the affected closure does NOT depend on the instance of a view; it only requires a forward pull among correlated views.
Analogous to database schema design: good schemas, bad schemas.

40 Related Work
Relaxing data quality
Distributed databases: Read-only transactions [Garcia-Molina et al. 1982]; Demarcation protocol [Barbará et al 1992]; TACC [Yu et al. 2000]; Epsilon-serializability [Pu et al. 1992]
Warehousing and web views: WebViews [Labrinidis et al 2003]; FAS [Röhm et al. 2002]; Obsolescent views [Gal 1999]; Distributed views [Segev et al 1990]; Freshness-driven web caching [Li et al 2003]
Replica management: Quasi-copies [Alonso et al. 1998], [Gallersdörfer et al. 1995]; Good-enough views [Seligman et al. 1997]; TRAPP [Olston et al. 2000]
Caching
Database caching: DBCache [Altinel et al. 2003]; Constraint-based database caching [Härder et al. 2004]; Mid-Tier caching [TimesTen 2002]; Shared-storage caching [Khalil et al 2002]
Others: Semantic caching [Dar et al 1996]; Cache in Postgres [Stonebraker et al 1990]; Predicate-based caching [Keller et al 1996]; WATCHMAN [Scheuermann et al 1996]; Cache investment [Kossmann et al 2000]; DECAF [Kiernan et al 2000]; Proxy caching [Luo et al 2001]
Uniqueness of our approach (query-centric):
Query: specifies fine-grained C&C constraints
Admin: flexible local data quality control in terms of granularity and properties
Caching DBMS: provides C&C guarantees for individual queries

