
1 Processing Data-Intensive Queries in Petabyte-Scale Scientific Databases
Xiaodan Wang
Advisor: Randal Burns
Department of Computer Science, Johns Hopkins University

2 Big Picture
Ensure high throughput for concurrent accesses to peta-scale scientific datasets
Data-intensive analysis queries
  – Correlate, mine, and extract features
  – Batch workloads with multiple simultaneous queries
  – Join data partitioned and distributed across multiple nodes
Scale of exploration limited
  – I/O: Scanning vast amounts of data over hours or days
  – Network: Transferring lots of data over large distances

3 Querying on a Global Scale
SkyQuery database federation for Astronomy
  – Publicly accessible virtual telescope
  – Sharing of heterogeneous data
  – Geographically dispersed (30 across NA, EA, EU)
High network cost for federated join queries
  – Joins on terabyte datasets between nodes
  – Queries last minutes, producing hundreds of MB in results
  – Network transfers consume up to 70% of the time
Data volume and geography limit scale

4 Incorporating Network Structure

5 Network-Aware Join Scheduling
Capture network heterogeneity
  – Metric that exploits excess capacity for routing
  – Decentralized local optimizations
  – Two-approximate, MST-based solution
  – Supports parallelism and trade-offs with I/O cost
Ten-fold reduction in network utilization for SkyQuery (ICDE’08)
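The slide names a minimum spanning tree (MST) as the basis of the join schedule but does not give the algorithm. As a rough illustration of that building block only, the sketch below runs Prim's algorithm over a table of pairwise transfer costs. The site names, the cost values, and the function name `minimum_spanning_tree` are all hypothetical, and the sketch does not reproduce the capacity-based metric or the decentralized optimizations mentioned above.

```python
import heapq

def minimum_spanning_tree(costs, root):
    """Prim's algorithm over a symmetric cost table: costs[u][v] = transfer cost."""
    visited = {root}
    # Heap of candidate links (cost, from_site, to_site) leaving the visited set.
    frontier = [(w, root, v) for v, w in costs[root].items()]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(visited) < len(costs):
        w, u, v = heapq.heappop(frontier)
        if v in visited:
            continue  # this link would close a cycle
        visited.add(v)
        tree.append((u, v, w))
        for nxt, w2 in costs[v].items():
            if nxt not in visited:
                heapq.heappush(frontier, (w2, v, nxt))
    return tree

# Hypothetical federation of four sites with made-up link costs.
costs = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
tree = minimum_spanning_tree(costs, "A")
total_cost = sum(w for _, _, w in tree)
```

Intermediate join results would then be shipped along the selected links rather than over arbitrary site pairs, which is one way a tree over cheap links can cut total network transfer.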

6 Scanning Peta-Scale Data
Data-intensive scan queries
  – Executed against a clustered index
  – Span multiple nodes (partitioned by space/time in cluster)
Incredibly I/O bound
  – Full DB scans lasting hours or days
  – Multiple concurrent queries (millions/month)
  – Significant data reuse between queries
[Figure panels: Turbulence, Astronomy]

7 LifeRaft: Data-Driven Batch Scheduling
Schedule queries greedily based on contention
  – Contentious regions amortize I/O over more queries
  – Two-fold improvement in throughput (CIDR’09)
Example query: SELECT ... FROM … WHERE region(‘circle 181.3 -0.76 6.5’) and specclass = 2 and …
[Figure: query pipeline with the labels Astronomers, HTM Query, Pre-processing & Decomposition, Sub-query regions (Q1, Q2, Q3), LifeRaft Scheduling, Reordering & Co-scheduling, Query Results]
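One way to picture greedy, contention-driven scheduling is the sketch below. It is not the LifeRaft implementation: the scoring formula, the function name `greedy_bucket_order`, and the sample queries are assumptions made for illustration. Each step reads the data region ("bucket") with the highest score and serves every pending sub-query on that bucket with a single read. The `bias` parameter blends age and contention, following the deck's later labels (bias 0 for contention based, bias 1 for age based). Ages are measured against a fixed `now`, which is a simplification.

```python
from collections import defaultdict

def greedy_bucket_order(pending, now, bias=0.0):
    """pending: query id -> (arrival time, set of bucket ids). Returns the bucket read order."""
    remaining = {q: set(buckets) for q, (_, buckets) in pending.items()}
    arrival = {q: t for q, (t, _) in pending.items()}
    order = []
    while any(remaining.values()):
        # Group the outstanding sub-queries by the bucket they need.
        waiting = defaultdict(list)
        for q, buckets in remaining.items():
            for b in buckets:
                waiting[b].append(q)
        max_contention = max(len(qs) for qs in waiting.values())
        max_age = max(now - arrival[q] for qs in waiting.values() for q in qs) or 1

        def score(b):
            contention = len(waiting[b]) / max_contention
            age = max(now - arrival[q] for q in waiting[b]) / max_age
            return bias * age + (1 - bias) * contention  # assumed blend

        # Iterating in sorted order makes ties resolve to the lowest bucket id.
        best = max(sorted(waiting), key=score)
        order.append(best)
        for q in waiting[best]:
            remaining[q].discard(best)  # one read serves all waiting queries
    return order

# Hypothetical workload: three queries overlapping on buckets 3 and 4.
pending = {
    "Q1": (0, {1, 2, 3}),
    "Q2": (1, {3, 4}),
    "Q3": (2, {3, 4, 5}),
}
by_contention = greedy_bucket_order(pending, now=3, bias=0.0)
by_age = greedy_bucket_order(pending, now=3, bias=1.0)
```

With `bias=0.0` the shared buckets 3 and 4 are read first, so one read is amortized over several queries. With `bias=1.0` the oldest query's buckets go first, which favors response time over sharing.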

8 Job-Aware Batch Scheduling
Sequence of queries related to the same experiment
  – Predict I/O for long-running experiments
  – Queries may be order dependent
Batch interface for scientists
  – Session IDs to explicitly link queries
  – Pre-declare time/space regions of interest
  – Pre-package operations
  – Submit all queries at once
Pre-fetching to improve response time
  – Bounding box over the data accessed
  – Extrapolate trajectory of job based on time/space
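The two pre-fetching bullets can be illustrated with a small sketch. It assumes that each query in a job touches a set of 3-D points and that the job moves linearly from one query to the next. Both assumptions, and the function names `bounding_box` and `extrapolate_box`, are hypothetical and not taken from the talk.

```python
def bounding_box(points):
    """Axis-aligned bounding box (lower corner, upper corner) of the accessed points."""
    lower = tuple(min(coord) for coord in zip(*points))
    upper = tuple(max(coord) for coord in zip(*points))
    return lower, upper

def extrapolate_box(prev_box, last_box):
    """Predict the next box by repeating the displacement between the last two boxes."""
    (prev_lo, prev_hi), (last_lo, last_hi) = prev_box, last_box
    next_lo = tuple(2 * l - p for p, l in zip(prev_lo, last_lo))
    next_hi = tuple(2 * l - p for p, l in zip(prev_hi, last_hi))
    return next_lo, next_hi

# Hypothetical job: two consecutive queries, the second shifted by +1 along x.
box_1 = bounding_box([(0, 0, 0), (2, 2, 2)])
box_2 = bounding_box([(1, 0, 0), (3, 2, 2)])
predicted = extrapolate_box(box_1, box_2)  # region a scheduler might pre-fetch
```

A scheduler could read the predicted region ahead of the next query in the session. A real predictor would also need to handle jobs that change direction or jump between time steps.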

9 Job-Aware Batch Scheduling
Delays evaluation of time-steps that are accessed in the future
[Figure: schedules of jobs J1–J5 over time steps T1–T5, comparing Job-Aware scheduling with LifeRaft; annotation: Revisit]

10 Extending Batch Scheduling
Provide starvation resistance
  – Short interactive queries that focus on a small region
  – Soft constraints on completion order
  – Hard constraints on response time
[Figure: User-perceived delay (Turbulence, July 22nd), annotated "4x overhead"]

11 Extending Batch Scheduling
Cooperative LifeRaft
  – Beyond single-node LifeRaft
  – Coordinate scheduling across multiple nodes
      Communicate to refine local decisions
      Avoid delaying a query that spans multiple nodes
      Heterogeneity in workload allocation and performance

12 Thank You!

13 Supplementary Slides

14 Extending Batch Scheduling
Query buffering
  – Large intermediate results
  – May need to page results to disk

15 A Case for Batch Processing
70% of queries reuse turbulence simulation results from a dozen timesteps
Varied query sizes, ranging from <1 sec to several hours
[Figure: axis labels "<1 sec" and ">1 hr"]

16 Scheduling Behavior
Sub-divide queries by bucket:
  – Qi: Qi1, Qi2, Qi3
  – Qj: Qj3, Qj4, Qj5, Qj6, Qj7, Qj8
Assumptions:
  – Inter-query time of 1 sec
  – I/O for each bucket of 1 sec
  – Cache size of 2
  – Join cost is negligible
[Figure: buckets B1–B8 overlaid with queries Qi, Qj, Qk; labels include "Qj – Qj5, Qj6, Qj7, Qj8"]

17 Arrival order with no sharing
Completion times:
  – Qi – 3 sec
  – Qj – 8 sec
  – Qk – 13 sec
  – Avg – 8 sec
Tp – .2 qry/sec
[Figure: timeline of bucket reads, one sub-query per read (Qi1 on B1, Qi2 on B2, Qi3 on B3, Qj1 on B1, …), with arrival and end markers for Qi, Qj, Qk over buckets B1–B8]

18 Age based scheduling (bias 1)
Completion times:
  – Qi – 3 sec
  – Qj – 7 sec
  – Qk – 7 sec
  – Avg – 5.6 sec
Tp – .33 qry/sec
[Figure: timeline of bucket reads with shared reads for Qj and Qk (B1, B4, B6, B7, B8) and for Qi and Qj (B3), with arrival and end markers for Qi, Qj, Qk over buckets B1–B8]

19 Contention based scheduling (bias 0)
Completion times:
  – Qi – 7 sec
  – Qj – 5 sec
  – Qk – 6 sec
  – Avg – 6 sec (5.6)
Tp – .38 qry/sec (.33)
[Figure: timeline of bucket reads with shared reads for Qj and Qk (B1, B4, B6, B7, B8) and for Qi and Qj (B3), with arrival and end markers for Qi, Qj, Qk over buckets B1–B8]
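The averages and throughput figures on slides 17 to 19 can be re-derived from the per-query completion times. The sketch below makes two assumptions that the slides do not state outright: queries arrive 1 sec apart (the inter-query time from slide 16), with completion time measured from each query's arrival, and throughput is the number of queries divided by the time at which the last query ends. The function name `summarize` is made up for this example.

```python
def summarize(completion_times, inter_arrival=1.0):
    """Return (average completion time, throughput) for queries listed in arrival order."""
    n = len(completion_times)
    average = sum(completion_times) / n
    # Query k arrives at k * inter_arrival and ends completion_times[k] later.
    end_times = [k * inter_arrival + c for k, c in enumerate(completion_times)]
    throughput = n / max(end_times)
    return average, throughput

# Completion times (Qi, Qj, Qk) as listed on slides 17, 18, and 19.
policies = {
    "arrival order, no sharing": [3, 8, 13],
    "age based (bias 1)": [3, 7, 7],
    "contention based (bias 0)": [7, 5, 6],
}
for name, times in policies.items():
    average, throughput = summarize(times)
    print(f"{name}: avg {average:.2f} sec, {throughput:.3f} qry/sec")
```

Under these assumptions the three policies come out at 8.00 sec and 0.200 qry/sec, 5.67 sec and 0.333 qry/sec, and 6.00 sec and 0.375 qry/sec. These agree with the slide values (.2, .33, .38) to the precision shown. Contention based scheduling gives the highest throughput, while age based scheduling gives the lowest average completion time.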

20 Reducing I/O: Adaptive Physical Design
Minimize cost of query execution and transitioning
  – 40% reduction in I/O

