IBM Research © 2005 IBM Corporation Parallel Querying with Non-Dedicated Nodes Vijayshankar Raman, Wei Han, Inderpal Narang IBM Almaden Research Center.

IBM Research © 2005 IBM Corporation Parallel Querying with Non-Dedicated Computers, Aug 30 2005

Properties of a relational database
• Ease of schema evolution
• Declarative querying
• Transparent scalability (does not quite work)

Today: Partitioning is the basis for parallelism
• Static partitioning (on the base tables)
• Dynamic partitioning via exchange operators
• Claim: partitioning does not handle non-dedicated nodes well
[Figure labels: L1, O1, Sa; L2, O2, Sb; L3, O3, Sc]

Problems of partitioning
• Hard to scale incrementally
  – Data must be re-partitioned
  – Disk and CPU must be scaled together
    • DBA must ensure partition-CPU affinity
• Homogeneity assumptions
  – Same plan runs on each node
  – Identical software needed on all nodes
• Susceptible to load variations, node failures / stalls, …
  – Response time is dictated by the speed of the slowest processor
  – Bad for transient compute resources
    • E.g. we want the ability to interrupt query work with higher-priority local work
[Figure labels: exchange, initial partitioning]

GOAL: A more graceful scale-out solution
• Sacrifice partitioning for scalability
  – Avoid initial partitioning
  – No exchange
• New means for work allocation in the absence of partitioning
  – Handles heterogeneity and load variations better
• Two design features
  – Data In The Network (DITN)
    • Shared files on high-speed networks (e.g. SAN)
  – Intra-fragment parallelism
    • Send SQL fragments to heterogeneous join processors: each performs the same join, over a different subset of the cross-product space
    • Easy fault-tolerance
    • Can use heterogeneous nodes: whatever is available at that time

DITN Architecture
1. Find idle coprocessors $P_1, P_2, P_3, P_4, P_5, P_6$
2. Prepare O, L, C
3. Logically divide O x L x C into work-units $W_i$
4. In parallel, run SQL queries for $W_i$ at $P_i$
5. Property: $\mathrm{SPJAG}(O \times L \times C) = \mathrm{AG}\big(\bigcup_i \mathrm{SPJAG}(W_i)\big)$
Restrictions (will return to this at the end)
• $P_i$ cannot use indexes at the information integrator
• Isolation issues
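The property in step 5 can be checked on a toy example. The following is a minimal sketch, not the authors' implementation: the tables, the column layout, and the helper names `spjag`, `split`, and `ditn` are all hypothetical, and the aggregate is assumed to be a SUM, which can be re-aggregated from partial results.

```python
from collections import Counter
from itertools import product

# Hypothetical toy tables: orders are (orderkey, priority),
# lineitems are (orderkey, quantity).
orders = [(1, "HIGH"), (2, "LOW"), (3, "HIGH"), (4, "LOW")]
lineitems = [(1, 5), (1, 7), (2, 3), (3, 4), (4, 1), (4, 2)]

def spjag(o_part, l_part):
    """Join two table slices on order key and sum quantity per priority."""
    totals = Counter()
    for (okey, prio), (lkey, qty) in product(o_part, l_part):
        if okey == lkey:
            totals[prio] += qty
    return totals

def split(rows, pieces):
    """Cut a table into contiguous slices of near-equal size."""
    size = -(-len(rows) // pieces)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def ditn(o_cuts, l_cuts):
    """Run the same join on every grid cell (work-unit), then re-aggregate."""
    final = Counter()
    for o_part, l_part in product(split(orders, o_cuts), split(lineitems, l_cuts)):
        final.update(spjag(o_part, l_part))  # Counter.update adds the partial sums
    return final
```

Each call to `spjag` inside `ditn` stands in for one co-processor's SQL fragment; no tuples move between work-units, and only the small partial aggregates are combined at the end.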

Why Data in the Network
• Observation: network bandwidth >> query operator bandwidth
  – Network bandwidth: in Gbps (SAN/LAN); scan: 10-100 Mbps; sort: about 10 Mbps
  – Interconnect transfers data faster than query operators can process it
• But exploiting this fast interconnect via SQL is tricky
  – E.g. ODBC scan: 10x slower than local scan
• Instead, keep temp files in a shared storage system (e.g. SAN-FS)
  – Allows exploitation of full network bandwidth
• Immediate benefits
  – Fast data transfer
  – DBMS doesn't have to worry about disks, I/O parallelism, parallel scans, etc.
  – Independent scaling of CPU and I/O

Work Allocation without Partitioning
• For each join: we now have to join the off-diagonal rectangles also
• Minimize response time: $\mathrm{RT} = \max(\text{RT of each work-unit}) = \max_{i,j} \mathrm{JoinCost}(|L_i|, |O_j|)$
• How to optimize the work allocation?
  – Roughly: cut the join hyper-rectangle into n pieces to minimize the maximum perimeter
  – Simplification: assume that the join is cut into a grid
    • Choices: number of cuts on each table, size of each cut, allocation of work-units to processors
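The grid simplification can be made concrete as follows. This is a sketch under assumptions: `join_cost` is a placeholder (the sum of the input sizes), not the cost model from the paper, and the cut sizes are made-up numbers.

```python
from itertools import product

def join_cost(*sizes):
    # Assumed placeholder cost model: linear in the total input size.
    return sum(sizes)

def grid_work_units(cuts_per_table):
    """Every combination of one cut per table is one work-unit (a grid cell)."""
    return list(product(*cuts_per_table))

def response_time(cuts_per_table):
    """Response time is the cost of the most expensive work-unit."""
    return max(join_cost(*unit) for unit in grid_work_units(cuts_per_table))

# Hypothetical cut sizes: L in two cuts, O in three uneven cuts.
cuts = [[300, 300], [100, 50, 50]]
units = grid_work_units(cuts)
rt = response_time(cuts)
```

With these numbers there are 2 x 3 = 6 work-units, including the off-diagonal ones, and the response time is set by the largest cell.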

Allocation to homogeneous processors
• Theorem: for monotonic JoinCost, RT is minimized when each cut (on a table) is of the same size
• So allocation is done into rectangles of size $|T_1|/p_1, |T_2|/p_2, \dots, |T_n|/p_n$
• Theorem: for symmetric JoinCost, RT is minimized when $|T_1|/p_1 = |T_2|/p_2 = \dots = |T_n|/p_n$
• E.g., with 10 processors, cut Lineitem into 5 parts and Orders into 2
• Note: cutting each table into the same number of partitions (as is usually done) is sub-optimal
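The 10-processor example can be reproduced with a brute-force search over the ways to factor the processor count into cuts per table. This is an illustrative sketch: the helper names are hypothetical, the cost function is an assumed symmetric placeholder (sum of input sizes), and the table sizes are assumed to be in a 4:1 ratio (6,000,000 and 1,500,000 rows).

```python
def factorizations(n, k):
    """Yield all k-tuples of positive integers whose product is n."""
    if k == 1:
        yield (n,)
        return
    for d in range(1, n + 1):
        if n % d == 0:
            for rest in factorizations(n // d, k - 1):
                yield (d,) + rest

def best_grid(sizes, processors, cost):
    """Pick the number of equal-size cuts per table that minimizes work-unit cost."""
    best = None
    for cuts in factorizations(processors, len(sizes)):
        rt = cost(*[size / c for size, c in zip(sizes, cuts)])
        if best is None or rt < best[0]:
            best = (rt, cuts)
    return best

def symmetric_cost(*sizes):
    # Assumed placeholder: a symmetric, monotonic cost function.
    return sum(sizes)

# Assumed sizes for Lineitem and Orders (4:1 ratio).
rt, cuts = best_grid([6_000_000, 1_500_000], 10, symmetric_cost)
```

Under these assumptions the search picks 5 cuts on the larger table and 2 on the smaller one, which brings the two cut sizes as close to each other as the factorization of 10 allows.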

Allocation to heterogeneous co-processors
• Response time of query: RT = max(RT of each work-unit)
  – Choose the size of each work-unit, and the allocation of work-units to co-processors, so as to minimize RT
• Like a bin packing problem
  – Solve for the number of cuts on each table, assuming homogeneity
  – Then solve a Linear Program to find the optimal size of each cut
  – Have to make some approximations in order to avoid an Integer Program (see paper)
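The Linear Program itself is in the paper and is not reproduced here. As a stand-in, the sketch below covers only a special case where the answer has a closed form: a single table, one cut per co-processor, and a cost that is linear in the cut size. In that case all co-processors finish together when each cut is proportional to its co-processor's speed. The speeds and table size are hypothetical.

```python
def proportional_cuts(table_size, speeds):
    """Size each cut in proportion to its co-processor's speed (asymmetric)."""
    total = sum(speeds)
    return [table_size * s / total for s in speeds]

def equal_cuts(table_size, speeds):
    """Give every co-processor the same amount of work (symmetric)."""
    return [table_size / len(speeds)] * len(speeds)

def response_time(cuts, speeds):
    """The slowest co-processor to finish determines the response time."""
    return max(cut / speed for cut, speed in zip(cuts, speeds))

# Hypothetical setup: 2 fast nodes (speed 4) and 4 slow nodes (speed 1).
fast, mixed = [4, 4], [4, 4, 1, 1, 1, 1]
rt_fast_only = response_time(equal_cuts(1200, fast), fast)
rt_symmetric = response_time(equal_cuts(1200, mixed), mixed)
rt_asymmetric = response_time(proportional_cuts(1200, mixed), mixed)
```

In this made-up setup the symmetric allocation over all six nodes is slower than using the two fast nodes alone, while the speed-proportional allocation is faster than both.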

Failure/Stall Resiliency by Work-Unit Reassignment
Without tuple shipping between plans, failure handling is easy
• If co-processors A, B, C have finished by time X, and co-processor D has not finished by time $X(1+f)$
  – Take D's work-unit and assign it to the fastest among A, B, C (say A)
  – When either D or A returns, close the cursor on the other
• Can generalize to a work-stealing scheme
  – E.g. with 10 co-processors, assign each to 1/20th of the cross-product space
  – When a co-processor returns with a result, assign it more work
• Tradeoff: finer work allocation => more flexible work-stealing, BUT more redundant work
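The timeout rule can be written as a small decision function. This is a sketch of one reading of the slide, not the authors' code: X is taken to be the latest finish time among the co-processors that are done, and if several co-processors are stalled, helpers are handed out fastest-first (an assumption, since the slide only covers a single straggler). The function name and the input format are hypothetical.

```python
def reassign_stragglers(finish_times, now, f):
    """Map each stalled co-processor to a finished one that should redo its work-unit.

    finish_times: co-processor name -> finish time, or None if still running.
    now: current time; f: slack factor from the X(1+f) rule.
    """
    done = {p: t for p, t in finish_times.items() if t is not None}
    stragglers = [p for p, t in finish_times.items() if t is None]
    if not done or not stragglers:
        return {}
    x = max(done.values())           # time by which the finished ones were done
    if now < x * (1 + f):
        return {}                    # still within the allowed slack
    helpers = sorted(done, key=done.get)  # fastest finisher first
    return {s: helpers[i % len(helpers)] for i, s in enumerate(stragglers)}

status = {"A": 8.0, "B": 10.0, "C": 9.0, "D": None}
too_early = reassign_stragglers(status, now=14.0, f=0.5)
after_timeout = reassign_stragglers(status, now=15.0, f=0.5)
```

Because work-units do not exchange tuples, the caller can run D's work-unit on A as well and simply keep whichever result arrives first.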

Analysis: What do we lose by not partitioning
• Say a join of L x O x C (TPC-H) with 12 processors: $12 = p_1 p_2 p_3$
• RT without partitioning ~ $\mathrm{JoinCost}(|L|/p_1, |O|/p_2, |C|/p_3)$
• RT with partitioning ~ $\mathrm{JoinCost}(|L|/p_1 p_2 p_3, |O|/p_1 p_2 p_3, |C|/p_1 p_2 p_3)$
• At $p_1 = 6$, $p_2 = 2$, $p_3 = 1$, the loss in CPU speedup is $\mathrm{JoinCost}(|L|/6, |O|/2, |C|) \approx 2\,\mathrm{JoinCost}(|L|/12, |O|/12, |C|/12)$
• Note: I/O speedup is unaffected
• Can close the gap further with partitioning
• Sort the largest tables of the join, e.g. L and O, on their join column
  – Now the loss is $\mathrm{JoinCost}(|L|/12, |O|/12, |C|) \,/\, \mathrm{JoinCost}(|L|/12, |O|/12, |C|/12)$
• Still avoids exchange => can use heterogeneous, non-dedicated nodes, but causes problems with isolation
Optimization: selective clustering
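The two loss ratios can be evaluated numerically once a cost function is chosen. The sketch below is illustrative only: it assumes a placeholder JoinCost equal to the sum of the input sizes and uses the TPC-H scale factor 1 row counts for Lineitem, Orders, and Customer. The slide's factor of about 2 comes from the authors' own cost model, which is not reproduced here, so the numbers differ.

```python
def join_cost(*sizes):
    # Assumed placeholder cost model: linear in the total input size.
    return sum(sizes)

def speedup_loss(sizes, cuts, processors):
    """Ratio of work-unit cost with the given cuts to cost under full partitioning."""
    without_partitioning = join_cost(*[s / c for s, c in zip(sizes, cuts)])
    with_partitioning = join_cost(*[s / processors for s in sizes])
    return without_partitioning / with_partitioning

# TPC-H scale factor 1 row counts: Lineitem, Orders, Customer.
sizes = [6_001_215, 1_500_000, 150_000]

loss_grid = speedup_loss(sizes, cuts=(6, 2, 1), processors=12)
# With L and O sorted on the join column, both are effectively cut 12 ways.
loss_clustered = speedup_loss(sizes, cuts=(12, 12, 1), processors=12)
```

Under this placeholder cost the grid allocation loses roughly a factor of 3 and the clustered variant roughly a factor of 1.2; the direction of the comparison, not the exact values, is the point.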

Lightweight Join Processor
• Work allocation via query fragments => co-processors can be heterogeneous
• Need not have a full DBMS; a join processor is enough
• E.g. screen saver for join processing
• We use a trimmed-down version of Apache Derby
  – Parse CSV files
  – Predicates, projections, sort-merge joins, aggregates, group by

Performance degradation due to not partitioning
[Figure panels: O ⋈ L; S ⋈ O ⋈ L; S ⋈ O ⋈ L ⋈ C ⋈ N ⋈ R]
• At 10 nodes on S x O x L x C x N x R, DITN is about 2.1x slower than PBP (work allocation: L/5, O/2, S, C, N, R)
• DITN2PART has very little slowdown
  – But needs total clustering
• Slowdown oscillates due to the discreteness of work allocation

Failure/Stall Resiliency by Work-Unit Reassignment
• Orders x Lineitem, group by o_orderpriority
• 5 co-processors
• Impose high load on one co-processor as soon as the query begins
• At 60% load (50% wait), DITN times out and switches to an alternative
[Figure labels: DITN2PART, PBP]

Importance of Asymmetric Allocation
• Initially 2 fast nodes; then add 4 slow nodes
• With symmetric allocation, adding slow nodes can slow down the system
Contrast between DITN-symmetric and DITN-asymmetric

Danger of Tying Partition to CPU
• Repeated execution of O ⋈ L
• Impose 75% CPU load on one of the 5 co-processors during the 3rd iteration
• PBP continues to use this slow node throughout
• DITN switches to another node after two iterations

Related Work
• Parallel query processing – Gamma, XPRS, many commercial systems
  – Mostly shared-nothing
  – Shared-disk: IBM Sysplex
    • Queries done via tuple shipping between co-processors
  – Oracle
    • Shared disk, but hash joins done via partitioning (static/dynamic)
• Mariposa – similar query-fragment-level work allocation
• Load balancing: Exchange, Flux, River, skew avoidance in hash joins
• Fault-tolerant exchange (FLUX)
• Polar*, OGSA-DQP
• Distributed Eddies
• Query execution on P2P systems

Summary and Future Work
• Partitioning-based parallelism does not handle non-dedicated nodes
• Proposal: avoid partitioning
  – Share data via the storage system
  – Intra-fragment parallelism instead of exchange
  – Careful work allocation to optimize response time
• Promising initial results: only 2x slowdown with 10 nodes
Open questions
• Index scans: want shared reads without latching
• Isolation: DITN: uncommitted read; DITN2PART: read-only
• Scaling to large numbers of nodes
• Multi-query optimization to reuse shared temp tables
