Adaptive Cache Mode Selection for Queries over Raw Data


1 Adaptive Cache Mode Selection for Queries over Raw Data
Tahir Azim, Azqa Nadeem and Anastasia Ailamaki

Hi, today I am going to talk about the importance of choosing the correct cache mode in order to run queries efficiently over raw data. I will also describe a heuristic algorithm we came up with to decide the optimal cache mode adaptively at runtime.

2 Which cache mode is better for performance?
Cache Mode Tradeoffs

Lazy Mode: Cache only offsets of satisfying tuples. Lower caching overhead (space and time), but higher scan cost.
Eager Mode: Materialize and cache satisfying tuples. Higher caching overhead (space and time), but lower scan cost.

When I say "cache modes", I am talking about the form in which the cache holds data. It could contain just the offsets of the tuples satisfying a query operator; we call this lazy caching. Alternatively, it could contain the satisfying tuples themselves; we call this eager caching. Lazy caching has significantly lower caching overhead in terms of both space and time. But a lazy cache is more expensive to reuse, because the data has to be read back and parsed from the original file. Eager caching, on the other hand, has higher space and time overhead because it caches entire satisfying tuples in memory, but it is very cheap to reuse because the cached data does not have to be parsed again. The question that arises from this trade-off, and that we try to answer, is: which cache mode achieves better performance on a query workload?

3 Caches: Crucial for Queries on Raw Data
Reading and parsing raw files incurs high I/O and CPU cost
SQL queries on raw files can be ~6x slower than on a pre-loaded database*
Positional maps, indexes and caches are crucial for performance
We focus on maximizing cache efficiency

This is an important question to answer because running queries over raw data is expensive, due to the cost of reading and parsing that data. According to the NoDB paper, SQL queries running over raw data are almost 6 times slower than if they were run on a pre-loaded database. So raw data analytics only becomes feasible using techniques such as positional maps, dynamically created indexes and caches, and optimal use of these techniques is crucial to extract the best performance. In this talk, we focus on the question of choosing the optimal cache mode for best performance.

* Alagiannis, Ioannis, et al. "NoDB: Efficient query execution on raw data files." SIGMOD 2012.

4 Experiment on TPC-H CSV and JSON data
10 GB SF-10 CSV data:

SELECT agg(attr_1), …, agg(attr_n)
FROM subset of {customer, orders, lineitem, partsupp, part}
WHERE <join clauses on selected tables>
AND <range predicates on all selected tables with random selectivity>

2.5 GB JSON generated from SF-1 data:

SELECT agg(attr_1), …, agg(attr_n)
FROM orderLineItems
WHERE <range predicate with random selectivity over a randomly chosen numeric attribute>

Caching Policy:
Cache outputs of selection operators
Use a cost-based eviction policy*

To explore this question further, we ran a pair of experiments on CSV and JSON data derived from the TPC-H standard. On SF-10 CSV data, we ran a sequence of 100 select-project-join queries. Each query operates on a randomly selected subset of tables; there is a range predicate on each table, and the tables are joined on their primary keys. For JSON, we generated a 2.5 GB JSON dataset from SF-1 data. Each record in the JSON file associates an order with the list of line items included in that order. We then ran 100 select-project-aggregate queries on this dataset. As our caching policy, we admit the outputs of selection operators into the cache, and we evict using a state-of-the-art cost-based eviction policy.

* Azim, Tahir, et al. "ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data". PVLDB 11(3), 2017.
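To illustrate the shape of the CSV workload, a random select-project-join query could be generated along the following lines (the table names follow TPC-H, but the predicate form and the `random_query` helper are hypothetical; the actual generator used in the experiments is not shown in the talk):

```python
import random

TABLES = ["customer", "orders", "lineitem", "partsupp", "part"]

def random_query(rng):
    # Pick a random non-empty subset of the TPC-H tables...
    tables = rng.sample(TABLES, rng.randint(1, len(TABLES)))
    # ...and give each selected table a range predicate with random selectivity.
    preds = [f"{t}_key <= {rng.random():.2f} * max_{t}_key" for t in tables]
    return (f"SELECT agg(attr_1) FROM {', '.join(tables)} "
            f"WHERE {' AND '.join(preds)}")
```

Drawing 100 such queries (with join clauses added on the selected tables' primary keys) yields a workload in the spirit of the one described above.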

5 Better Caching Mode Depends on Workload and Cache Size
100 queries on Proteus/ReCache

[CSV Workload and JSON Workload plots]

We ran these experiments on Proteus, our query engine for in-situ raw data analytics. Our results show that neither eager nor lazy caching is ideal for all workloads and all cache sizes. Instead, on both the CSV and JSON datasets, lazy caching performs better than eager caching at small cache sizes, but as the cache size increases, the performance of eager caching improves until it outperforms lazy caching. So choosing a fixed caching mode beforehand is not always a good idea: choosing the wrong caching mode can result in a median performance penalty of 27% and a worst-case penalty of up to 200%. In hindsight, this is somewhat intuitive: with lazy caching on a small cache, you get far fewer evictions than with eager caching, so cache hits become more likely. With eager caching, you have to evict more frequently, so you get more frequent cache evictions and fewer cache hits. In any case, choosing the appropriate caching mode is not a trivial task, because you most likely do not know the workload in advance, so you cannot always determine the crossover point of the two curves ahead of time.

Max slowdown of 200% and median of 27% using the wrong caching mode

6 Adaptive Cache Mode Selection
Use a "shadow" cache to simulate caching in the alternate mode
Keep a running estimate of total benefit in both modes
Use a statistical significance test (t-test) to decide if the alternate mode is more beneficial

To solve this problem, we have developed an algorithm that automatically selects the optimal cache mode at runtime. The basic idea is to use a low-overhead shadow cache that simulates caching in the mode alternate to the current one. The shadow cache enables us to keep a running estimate of the total benefit being accrued in both modes. Finally, a test of statistical significance helps decide whether it makes sense to switch to the alternate mode.
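The switch decision can be sketched as follows, assuming per-query benefit samples for each mode. The talk only states that a t-test is used; the use of Welch's two-sample form, the one-sided comparison, and the 1.96 cutoff (roughly 5% significance for large samples) are assumptions of this sketch:

```python
import math

def mean_var(xs):
    # Sample mean and (unbiased) sample variance.
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, v

def should_switch(actual_benefits, shadow_benefits, t_crit=1.96):
    """Switch modes if the shadow cache's mean benefit is significantly
    higher than the actual cache's (Welch's two-sample t statistic)."""
    ma, va = mean_var(actual_benefits)
    ms, vs = mean_var(shadow_benefits)
    se = math.sqrt(va / len(actual_benefits) + vs / len(shadow_benefits))
    if se == 0:
        return ms > ma  # degenerate case: no variance in either sample
    t = (ms - ma) / se
    return t > t_crit  # one-sided: shadow must be *significantly* better
```

Requiring significance, rather than switching whenever the shadow estimate is merely larger, avoids flip-flopping between modes on noisy timing measurements.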

7 Shadow Cache Design

Everything added to the actual cache is also added to the shadow cache
Shadow cache makes its own eviction decisions
Shadow cache entry stores no duplicate cached data
Only maintains a pointer to the actual cache entry
Estimates shadow cache benefits based on the actual cache's metadata

The shadow cache is designed to mimic the behavior of the actual cache: everything that gets added to the actual cache also gets added to the shadow cache. For eviction, however, the shadow cache runs its eviction algorithm independently of the actual cache. (The idea is that the shadow cache may fill up and require evictions at a completely different time than the actual cache.) The shadow cache is also lightweight, in terms of both space and time overhead. It stores no duplicate cached data; instead, it just maintains a pointer to the actual cache entry, and it uses the metadata of that entry to estimate its own potential benefits.
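A minimal sketch of this bookkeeping, with illustrative names (the eviction choice below is a placeholder; ReCache actually uses a cost-based eviction policy):

```python
class ShadowEntry:
    # No duplicate data: just a pointer to the actual cache entry, whose
    # metadata (tuple counts, measured timings) drives the benefit estimates.
    def __init__(self, actual_entry):
        self.actual = actual_entry

class ShadowCache:
    def __init__(self, capacity_bytes, estimate_size):
        self.capacity = capacity_bytes
        self.estimate_size = estimate_size  # entry size in the *alternate* mode
        self.entries = {}  # key -> ShadowEntry

    def admit(self, key, actual_entry):
        # Mirror every admission into the actual cache...
        self.entries[key] = ShadowEntry(actual_entry)
        # ...but evict independently: the shadow cache may fill up at a
        # completely different time than the actual cache.
        while self.total_size() > self.capacity and self.entries:
            victim = min(self.entries)  # placeholder victim choice; ReCache
            del self.entries[victim]    # uses a cost-based eviction policy

    def total_size(self):
        return sum(self.estimate_size(e.actual) for e in self.entries.values())
```

Because sizes are estimated in the alternate mode, a lazy shadow of an eager cache stays far under its budget, while an eager shadow of a lazy cache fills up (and evicts) much sooner, which is exactly the behavior difference the algorithm needs to observe.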

8 Benefit Metric

[Timeline figure: operator execution time t and materialization time c; on a cache hit, lookup time l and cache scan time s]

Cost of operator execution: t
Cost of materialization: c
Cost of finding a match: l
Cost of scanning the cache: s
Cache item benefit metric: (t + c) - (s + l)

We quantify the benefit of each cached item using a benefit metric based on dynamically measured timing information. The metric is meant to capture the cost of reconstructing a cached item in case it is evicted. Assume t is the cost of executing the operator whose results are being cached, including the cost of parsing the fields relevant to the operator, and c is the cost of parsing and caching the full tuples. Then (t + c) is the cost of reconstructing the item in the cache if it is evicted; this is the benefit the item adds to the cache. On the other hand, assume that on a cache hit, l is the time taken to look up the element in the cache and s is the time it takes to scan the cache. Then (s + l) is the overhead of using a cache item. Subtracting the two gives the net benefit of a cache item. We can use this metric to compute the net benefit of an item that is in the actual cache, but it is not possible to measure these timing values for items that are only in the shadow cache, so we estimate them from those of the actual cache item.
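The metric transcribes directly into code; the function below simply restates the slide's formula using the timings defined above:

```python
def cache_item_benefit(t, c, l, s):
    """Net benefit of a cached item: (t + c) - (s + l).

    t: cost of executing the operator whose results are cached
    c: cost of materializing (parsing and caching) the full tuples
    l: cost of looking the item up in the cache on a hit
    s: cost of scanning the cache
    """
    reconstruction_cost = t + c   # what we save while the item stays cached
    reuse_overhead = l + s        # what each hit costs us
    return reconstruction_cost - reuse_overhead
```

A negative result means the item costs more to use than it would cost to rebuild, i.e. caching it is a net loss.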

9 Estimating Shadow Cache Benefits
If shadow cache is in lazy mode:
Size = sizeof(int) * NumTuples
s = t_eager * NumTuples / N
c ≃ 0

If shadow cache is in eager mode:
Size = NumTuples * AvgTupleSize
s ≃ 0
c = constant * c_lazy

We estimate the timing values for the shadow cache separately for the two cases. If the shadow cache is in lazy mode, its size is simply the number of tuples times the size of an offset. The scan cost is the total operator cost for the relation, scaled by the fraction of the relation's tuples that are in the cache. We assume the cost of materializing the results to be very small and close to zero. On the other hand, if the shadow cache is in eager mode, we estimate its size as the number of tuples times the average tuple size we have seen. We assume the scan cost to be close to zero. Finally, we scale the cost of cache creation by a fixed constant; in our experience, and based on the ReCache paper, a constant value of 3 yields sufficiently good results. After summing these estimates over all items in the shadow cache, we run a significance test to compare the benefits of the shadow cache and the actual cache. If the shadow cache has a significantly higher benefit, we switch to its mode.

Use a significance test to compare the total benefit of the shadow and actual caches
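These estimates can be written out as a small sketch (the function names and the 4-byte offset size are assumptions of this sketch; the constant of 3 follows the value quoted in the talk):

```python
INT_SIZE = 4          # assumed bytes per cached offset
EAGER_CONSTANT = 3.0  # scaling factor for eager materialization cost

def estimate_lazy(num_tuples, t_eager, n_rel):
    # t_eager: measured operator cost over the whole relation
    # n_rel:   total number of tuples in the relation
    size = INT_SIZE * num_tuples        # one offset per cached tuple
    s = t_eager * num_tuples / n_rel    # scan cost scaled by cached fraction
    c = 0.0                             # materialization assumed ~free
    return size, s, c

def estimate_eager(num_tuples, avg_tuple_size, c_lazy):
    # c_lazy: measured lazy-mode creation cost for the same item
    size = num_tuples * avg_tuple_size  # full tuples held in memory
    s = 0.0                             # cache scan assumed ~free
    c = EAGER_CONSTANT * c_lazy         # creation cost scaled by the constant
    return size, s, c
```

Summing the per-item benefits derived from these estimates gives the shadow cache's total benefit, which is then compared against the actual cache's measured total via the significance test.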

10 Cache Performance using Adaptive Cache Mode Selection (Acme)
[CSV Workload and JSON Workload plots]

To evaluate this approach, we compare its performance on the same workloads shown earlier. We find that our approach quickly chooses the appropriate cache mode and achieves performance close to the optimal. The only cases where it does not switch to the more efficient cache mode are when it deems the difference between the two modes insignificant (e.g. at a cache size of 2048 MB). Overall, its median performance is within 2% of the optimal, and within 16% of the optimal in the worst case (the average difference from the optimal is 4.7%). We found the overhead of the additional shadow cache monitoring and significance testing to be at most 0.1%.

Max difference of 16% and median of 2% from the optimal

11 Adapting to a Changing Workload
We also test whether our approach can adapt to a frequently changing workload. We run a sequence of 100 queries on the CSV dataset, where queries 25 to 75 have very low selectivity, while the others have random selectivity between 0 and 100%. Between queries 25 and 75, complete tuples can therefore fit in the cache, which makes eager caching the more efficient option. The graph shows that Acme is able to make this adjustment: between queries 30 and 90 (when the mode switches happen), it closely follows the eager caching curve; otherwise, it tracks the lazy caching curve.

Acme adapts effectively to a changing query workload

12 Conclusion

Neither lazy nor eager caching is ideal for all workloads and cache sizes
Maximum performance penalty of 200% and median of 27%
Dynamically selecting the caching mode gets you closer to optimal performance
Reduces maximum performance penalty to 16% and median to 2%

In conclusion, neither eager nor lazy caching is optimal for all workloads and cache sizes; choosing the wrong caching mode causes a worst-case performance difference from the optimal of up to 200% and a median of 27%. We solve this problem using an algorithm that dynamically chooses the optimal caching mode by observing the workload and the behavior of the cache. This reduces the worst-case performance penalty to 16% and the median to just 2%.

13 Thanks!

16 Backup Slides

17 Cache Design Challenges
Cache Admission Policy: Is admitting an item into the cache worth it?
Cache Eviction Policy: When the cache is full, which item should be evicted?
Cache Mode Selection: Should data be cached as full tuples or just offsets?

Now, designing a cache comes with its own series of challenges. Of course, choosing a cache admission policy and a cache eviction policy are the two classic problems in cache design…

18 Cache Design Challenges
Cache Admission Policy: Is admitting an item into the cache worth it?
Cache Eviction Policy: When the cache is full, which item should be evicted?
Cache Mode Selection: Should data be cached as full tuples or just offsets?

But we will focus on a third design question: how do we select the correct caching mode? In other words, should data satisfying a query operator be cached as full tuples, or just as the file offsets of the satisfying tuples?

19 Performance compared to eager and lazy caching
This graph shows cumulative workload execution time with no limit on cache size, on a sequence of 100 queries of the following format:

SELECT agg(attr_1), ..., agg(attr_n)
FROM subset of {customer, orders, lineitem, partsupp, part} of size n
WHERE <equijoin clauses on selected tables>
AND <range predicates on each selected table with random selectivity>

The outputs of the selection operators are admitted to the cache. ReCache is configured to cache and, where possible, reuse the outputs of the selection operators in each query, including reuse by subsuming queries. Overall, query response time improves by 62% compared to a system with no cache, and by 47% compared to a lazy cache. Compared to an eager cache, response time for the cost-adaptive cache over the entire workload is virtually the same, with a difference of just 3%. Eager caching performs well over a long query sequence touching a small set of tables due to its aggressive caching strategy, which may be expensive in the short term but eventually pays off over the long run. ReCache's admission policy adds overhead on such a workload because it uses lazy caching when the short-term overhead is large; the cache then only switches to eager mode when it is first reused. Despite this overhead, ReCache performs almost identically to the eager caching strategy in the long term.

With unlimited cache size, eager caching improves performance by 62% compared to a system with no cache.

