Hopkins Storage Systems Lab, Department of Computer Science
A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching
Xiaodan Wang, Tanu Malik, Randal Burns (Johns Hopkins University)
Stratos Papadomanolakis, Anastassia Ailamaki (Carnegie Mellon University)

Overview
Motivation
– Data-intensive scientific database federations
– Mid-tier caching improves scalability
Choosing the unit of cache replacement
– Minimize aggregate network traffic
– Improve query execution performance
Query prototypes
– Cache groups of columns
– Adapt to changes in the workload

OpenSkyQuery
Federation of sky surveys (a virtual telescope)
– Expected to grow from 30 sites to over 100
Available over the Internet (community of astronomers, educational users)
Sites are autonomous, heterogeneous, and geographically distributed
Data-intensive workload (large data sets, network-bound)
Scaling through mid-tier caching
– Minimize network traffic
– Offload query processing

Caching Schema
Difficult to achieve good query performance
– Caches employ commodity hardware
– An index-free environment
Both network and query performance are sensitive to the granularity of cache replacement
Fine granularity (column)
– Poor network performance at small cache sizes
– High I/O overhead
Coarse granularity (table)
– Groups unrelated columns
– Inefficient query and network performance

Contributions
Cache workload-defined groups of columns (query prototypes)
Adaptive: candidate query prototypes are discovered incrementally from the request stream
Self-organizing: each prototype describes a physical schema optimized for a specific class of queries
Improve in-cache query execution performance without sacrificing network savings

Caching for Network Savings
Identify and cache database objects that provide network savings
– Requests that access these objects are serviced from the cache
– Reduces contention for network bandwidth
Bypass Yield Caching (Malik et al., ICDE’05)
– Caching framework that uses economic principles to maximize network savings
– Database objects are ranked by yield (expected network savings per unit of cache space utilized)
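The yield ranking described above can be illustrated with a small sketch. This is not the Bypass Yield algorithm from the cited paper; it is a simplified greedy selection that assumes the expected network savings of each object is already known. The class `CacheObject`, the function `choose_by_yield`, and all sizes and savings figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheObject:
    """A candidate database object (hypothetical representation)."""
    name: str
    size_bytes: int                # cache space the object would occupy
    expected_savings_bytes: float  # network traffic avoided if it is cached

    @property
    def yield_per_byte(self) -> float:
        # Yield: expected network savings per unit of cache space used.
        return self.expected_savings_bytes / self.size_bytes


def choose_by_yield(objects, capacity_bytes):
    """Greedily admit objects in descending yield order while they fit."""
    chosen, used = [], 0
    for obj in sorted(objects, key=lambda o: o.yield_per_byte, reverse=True):
        if used + obj.size_bytes <= capacity_bytes:
            chosen.append(obj)
            used += obj.size_bytes
    return chosen


# Made-up candidates, for illustration only.
candidates = [
    CacheObject("a", size_bytes=10, expected_savings_bytes=100),
    CacheObject("b", size_bytes=50, expected_savings_bytes=100),
    CacheObject("c", size_bytes=20, expected_savings_bytes=120),
]
selected = choose_by_yield(candidates, capacity_bytes=40)
print([obj.name for obj in selected])
```

With these made-up numbers, object "b" saves as much traffic as "a" but occupies five times the space, so it ranks last and does not fit in the remaining capacity.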

Choosing the Unit of Cache Replacement
Semantic caching is unsuitable for astronomy
– Lacks locality (objects are rarely reused)
– Evaluating query containment is difficult (nested queries, complex joins, and user-defined functions are common)
Employ schema-based caching
– Queries reuse the same set of columns
– Derive popular columns from the workload
– Analogous to materialized views

File-Bundling (Otoo et al., SC’04)
Loading only columns with high yield at small cache sizes
[Figure: columns A through J accessed by queries Q1 through Q4; the cache holds columns B, C, H, and I]
Caching columns B, C, H, and I results in no cache hits
Solution: cache groups of columns
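The file-bundling effect can be sketched in a few lines: a query is a cache hit only if every column it needs is cached. The slide's figure did not survive transcription, so the column sets assigned to Q1 through Q4 below are hypothetical, chosen only so that caching B, C, H, and I yields no hits, as the slide states. The helper `cache_hits` is likewise an illustrative name.

```python
def cache_hits(queries, cached_columns):
    """Return the queries whose entire column set is present in the cache."""
    return [name for name, cols in queries.items() if cols <= cached_columns]


# Hypothetical access sets; the original figure's assignments are unknown.
QUERIES = {
    "Q1": {"A", "B"},
    "Q2": {"C", "D"},
    "Q3": {"G", "H"},
    "Q4": {"I", "J"},
}

# Four individually popular columns, none of which completes a query.
print(cache_hits(QUERIES, {"B", "C", "H", "I"}))

# The same number of columns, chosen as whole groups instead.
print(cache_hits(QUERIES, {"A", "B", "I", "J"}))
```

Under these assumed access sets, the first cache serves no query while the second, holding the same number of columns, serves Q1 and Q4 in full.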

Caching Groups of Columns
Existing schema-based caching models are static (e.g., CacheTables, MTCache, TimesTen)
– Do not account for dynamic workload access patterns
– Groupings follow the physical schema of the backend database or are defined a priori
– May group columns that are rarely used together
Query prototype caching
– Identifies the best groupings from the workload
– Minimizes query execution cost against prototypes without sacrificing network savings

Query Prototype
Given a query q_i, define the Query Access Set, QAS(q_i), as the set of attributes accessed by q_i
q_i and q_j share the same query prototype if they access the same attributes (QAS(q_i) = QAS(q_j))
Example:
SELECT objID FROM Galaxy, SpecObj WHERE objID = bestobjID and specclass = 2 and z between 0.121 and 0.127
QAS = {Galaxy:objID, SpecObj:bestobjID, SpecObj:specclass, SpecObj:z}
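Grouping queries by their access sets follows directly from the definition above. The sketch below assumes the access set of each query has already been extracted (no SQL parsing is attempted); the function `group_by_prototype` and the query identifiers are hypothetical.

```python
from collections import defaultdict


def group_by_prototype(query_access_sets):
    """Map each distinct access set (a prototype) to the queries that share it."""
    prototypes = defaultdict(list)
    for query_id, attributes in query_access_sets.items():
        # frozenset ignores attribute order, so equal sets map to one prototype.
        prototypes[frozenset(attributes)].append(query_id)
    return dict(prototypes)


# q1 uses the access set from the slide's example; q2 is assumed to be the
# same query with different constants; q3 touches a different attribute set.
access_sets = {
    "q1": ["Galaxy:objID", "SpecObj:bestobjID", "SpecObj:specclass", "SpecObj:z"],
    "q2": ["SpecObj:z", "SpecObj:specclass", "SpecObj:bestobjID", "Galaxy:objID"],
    "q3": ["Galaxy:objID"],
}
for prototype, query_ids in group_by_prototype(access_sets).items():
    print(sorted(prototype), query_ids)
```

Because only the set of attributes matters, queries that differ in constants or predicate order fall under the same prototype.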

Query Prototype
[Figure: base tables R1 (columns A1, A2, A3) and R2 (columns B1, B2, B3); the cache holds one prototype for Q1 (A2, A3, B1) and one for Q2 (B1, B2, B3)]
QAS(Q1) = {R1:A2, R1:A3, R2:B1}
QAS(Q2) = {R2:B1, R2:B2, R2:B3}
B1 is replicated in the cache
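The replication noted on this slide can be made concrete with the two access sets it lists. The sketch below assumes each prototype is stored as its own copy of its columns, which is how the slide's figure appears to depict the cache; the function `replicated_columns` is a hypothetical helper.

```python
from collections import Counter


def replicated_columns(prototypes):
    """Return the columns that appear in more than one cached prototype."""
    counts = Counter(col for cols in prototypes.values() for col in cols)
    return {col for col, n in counts.items() if n > 1}


# Access sets taken from the slide.
PROTOTYPES = {
    "Q1": {"R1:A2", "R1:A3", "R2:B1"},
    "Q2": {"R2:B1", "R2:B2", "R2:B3"},
}

stored = sum(len(cols) for cols in PROTOTYPES.values())  # column copies held
distinct = len(set().union(*PROTOTYPES.values()))        # distinct columns
print(replicated_columns(PROTOTYPES), stored, distinct)
```

Here the cache holds six column copies for five distinct columns; the extra copy of R2:B1 is the space cost of letting each prototype answer its queries without a join against another fragment.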

Workload Properties
Read-only queries
One-month trace against the Sloan Digital Sky Survey (SDSS) Data Release 4 (2TB)
1.4 million queries generating 360GB of network traffic
1176 query prototypes describe the entire workload
11 prototypes capture 91% of the queries
6 prototypes generate 89% of the network traffic

Experiments
Evaluate caching of tables, columns, vertical partitions, and query prototypes
AutoPart (Papadomanolakis et al., SSDBM’04)
– An automated partitioning algorithm for large scientific databases
– Groups columns in order to improve query execution performance
– Produces the best workload-driven, static grouping

Network Savings

Cache Pollution

Query Performance

Discussion
Improving network and query execution performance are complementary goals
Columns should be grouped together at small cache sizes (cache hits suffer due to file-bundling)
Column groupings should be adaptive because
– Workload access pattern is dynamic
– Indexes are not available

Questions?

Schema Reuse
Localized to a small subset of tables

Schema Reuse
Similar reuse among columns

Object Reuse
Few objects are reused

SkyQuery
Federation middleware built at Hopkins
Wrapper/Mediator architecture using web services

Load Cost

Scan Cost
When scanning large tables, the useful region is a small fraction of the data
Incur I/O overhead for accessing data from extraneous columns
Spatial locality among related columns

Join Cost
Results must be joined for queries that access multiple fragments
Access should be localized to few fragments to minimize join overhead
