
Slide 1: Parallel Database Systems: A SNAP Application
Bell & Gray, 4/15/95
Gordon Bell, 450 Old Oak Court, Los Altos, CA 94022, GBell@Microsoft.com
Jim Gray, 310 Filbert, San Francisco, CA 94133, Gray@Microsoft.com

Slide 2: Outline
- Cyberspace pep talk: databases are the dirt of cyberspace; billions of clients mean millions of servers
- Parallel imperative: the hardware trend is many little devices, so servers are arrays of commodity components; PCs are the bricks of cyberspace; must automate parallel {design / operation / use}
- Software parallelism via dataflow and data partitioning
- Parallel database techniques: parallel execution of many little jobs (OLTP), data partitioning, pipeline execution, automation techniques
- Summary

Slide 3: Kinds of Information Processing
                 Point-to-Point         Broadcast
  Immediate      conversation, money    lecture, concert     (the Network)
  Time-shifted   mail                   book, newspaper      (the Database)
- It's ALL going electronic
- Immediate traffic is being stored for analysis (so ALL of it ends up in databases)
- Analysis and automatic processing are being added

Slide 4: Why Put Everything in Cyberspace?
- Low rent: minimum $/byte
- Shrinks time: now or later
- Shrinks space: here or there
- Automates processing: knowbots locate, process, analyze, and summarize
- Covers point-to-point or broadcast, immediate or time-delayed: the network and the database

Slide 5: Databases: Information at Your Fingertips
- Information Network, Knowledge Navigator: all information will be in an online database (somewhere)
- You might record everything you (see the sketch below)
  - read: 10 MB/day, 400 GB/lifetime (two tapes)
  - hear: 400 MB/day, 16 TB/lifetime (a tape per decade)
  - see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (maybe someday)
- Data storage, organization, and analysis is a challenge; that is what databases are about
- Databases do a good job on records; now working on text, spatial, image, and sound
- This talk is about automatic parallel search (the outer loop); the techniques work for ALL kinds of data
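A quick back-of-the-envelope check of the per-lifetime figures above (a sketch; it assumes roughly 100 years of recording at the per-day rates on the slide):

# Rough check of the lifetime-storage estimates (assumes ~100 years of recording).
DAYS_PER_LIFETIME = 365 * 100

rates_mb_per_day = {"read": 10, "hear": 400, "see": 40_000}  # rates from the slide

for activity, mb_per_day in rates_mb_per_day.items():
    lifetime_tb = mb_per_day * DAYS_PER_LIFETIME / 1_000_000
    print(f"{activity}: {lifetime_tb:,.1f} TB per lifetime")
# read ~0.4 TB (400 GB), hear ~15 TB, see ~1,500 TB (1.5 PB): consistent with the slide.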

Slide 6: Databases Store ALL Data Types
- The old world: millions of objects, each about 100 bytes
    People(Name, Address): Mike, NY | Won, Berk | David, Austin
- The new world: billions of objects, big objects (1 MB), objects with behavior (methods)
    People(Name, Address, Papers, Picture, Voice): Mike, NY, ... | Won, Berk, ... | David, Austin, ...
- Paperless office, Library of Congress online, all information online: entertainment, publishing, business
- Information Network, Knowledge Navigator, Information at Your Fingertips

Slide 7: Magnetic Storage Cheaper than Paper
- File cabinet: cabinet (4 drawer) $250, paper (24,000 sheets) $250, space (2 x 3 @ $10/ft2) $180; total $700, about 3 cents/sheet (see the sketch below)
- Disk: an 8 GB disk costs $4,000
  - ASCII: 4 million pages, 0.1 cents/sheet (30x cheaper than paper)
  - Image: 200,000 pages, 2 cents/sheet (similar to paper)
- Conclusion: store everything on disk
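A small sketch of the cents-per-sheet arithmetic above (the floor-space charge and the page counts are taken from the slide as given):

# Paper vs. disk cost per sheet, using the figures from the slide.
cabinet_total = 250 + 250 + 180           # cabinet + paper + floor space, dollars
paper_cents_per_sheet = cabinet_total / 24_000 * 100
print(f"paper: {paper_cents_per_sheet:.1f} cents/sheet")            # ~2.9 cents

disk_cost = 4_000                          # 8 GB disk, dollars
ascii_pages = 4_000_000                    # works out to ~2 KB per ASCII page
image_pages = 200_000                      # works out to ~40 KB per image page
print(f"ASCII: {disk_cost / ascii_pages * 100:.2f} cents/sheet")    # ~0.10 cents
print(f"image: {disk_cost / image_pages * 100:.1f} cents/sheet")    # ~2.0 cents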

Slide 8: Cyberspace Demographics
- Computer history: most computers are small; most of the money is in clients and wiring
- Desktop share of spending: 1990: 50%; 1995: 75%
- 1950 national computer, 1960 corporate computer, 1970 site computer, 1980 departmental computer, 1990 personal computer, 2000: ?
- NEXT: 1 billion of X, for some X (the phone?)

Slide 9: Billions of Clients
- Every device will be intelligent: doors, rooms, cars, ...
- Computing will be ubiquitous

Slide 10: Billions of Clients Need Millions of Servers
- All clients are networked to servers; clients may be nomadic or on-demand; fast clients want faster servers
- Clients: mobile and fixed
- Servers provide data, control, coordination, and communication
- Super servers: large databases, high traffic, shared data

Slide 11: If Hardware Is Free, Where Will the Money Go?
- All clients and servers will be based on PC technology; economies of scale give the lowest price
- Traditional budget: 40% vendor, 60% staff
- If hardware_price = software_price = 0, then what?
- The money will go to CONTENT (databases), NEW APPLICATIONS, and AUTOMATION
- Automation is essential (analogy: 1920 telephone operators); today it takes a systems programmer per MIPS and a DBA per 10 GB

Slide 12: The New Computer Industry
- Horizontal integration is the new structure; each layer picks the best from the layer below
- Desktop market share: 1991: 50%; 1995: 75%; Compaq is the biggest computer company
  Function          Example
  Operation         AT&T
  Integration       EDS
  Applications      SAP
  Middleware        Oracle
  Baseware          Microsoft
  Systems           Compaq
  Silicon & Oxide   Intel & Seagate

Slide 13: Constant Dollars vs. Constant Work
- Constant work: one SuperServer could do all the world's computation
- Constant dollars: the world spends 10% on information processing; computers are moving from 5% penetration to 50%, i.e. from $300B to $3T
- We have the patent on the byte and the algorithm

Slide 14: The Seven Price Tiers
- $10: wristwatch computers
- $100: pocket / palm computers
- $1,000: portable computers
- $10,000: personal computers (desktop)
- $100,000: departmental computers (closet)
- $1,000,000: site computers (glass house)
- $10,000,000: regional computers (glass castle)
- A SuperServer costs more than $100,000; a mainframe costs more than $1M; it must be an array of processors, disks, tapes, and comm ports

Slide 15: Software Economics: Bill's Law
- Bill Joy's law (Sun): don't write software for less than 100,000 platforms (@ $10M engineering expense, $1,000 price)
- Bill Gates's law: don't write software for less than 1,000,000 platforms (@ $10M engineering expense, $100 price)
- Examples: UNIX vs. NT: $3,500 vs. $500; Oracle vs. SQL Server: $100,000 vs. $6,000; no spreadsheet or presentation pack on Unix/VMS/...
- Result: commoditization of base software and hardware

Slide 16: What Comes Next
- MANY new clients
- Applications to enable clients and servers
- Super-servers

Slide 17: Outline
- Cyberspace pep talk: databases are the dirt of cyberspace; billions of clients mean millions of servers
- Parallel imperative: the hardware trend is many little devices, so servers are arrays of commodity components; PCs are the bricks of cyberspace; must automate parallel {design / operation / use}
- Software parallelism via dataflow and data partitioning
- Parallel database techniques: parallel execution of many little jobs (OLTP), data partitioning, pipeline execution, automation techniques
- Summary

Slide 18: Hardware Trends
- Few generic parts: CPU, RAM, disk and tape arrays, ATM for LAN/WAN, ?? for CAN, ?? for OS
- These parts will be inexpensive (commodity components)
- Systems will be arrays of these parts
- Software challenge: how to program arrays
(Figure: Moore's law restated, "many little won over few big": mainframe $1M, mini $100K, micro $10K; disk form factors shrinking 9", 5.25", 3.5", 2.5", 1.8")

Slide 19: Future SuperServer
- An array of processors, disks, tapes, and comm lines
- Challenge: how to program it; must use parallelism
  - Pipeline: hide latency
  - Partition: bandwidth and scaleup
- Scale: 100 nodes = 1 Tips; 1,000 disks = 10 Terrorbytes; 100 tape transports = 1,000 tapes = 1 petabyte; high-speed network (10 Gb/s)

Slide 20: Great Debate: Shared What?
- Shared Memory (SMP): easy to program, difficult to build, difficult to scale up (Sequent, SGI, Sun)
- Shared Disk: (VMScluster, Sysplex)
- Shared Nothing (network): hard to program, easy to build, easy to scale up (Tandem, Teradata, SP2)
- The winner will be a synthesis of these ideas
- Distributed shared memory (DASH, Encore) blurs the distinction between network and bus (locality is still important), but gives shared memory the message cost

Slide 21: The Hardware Is in Place... and Then a Miracle Occurs
- SNAP: Scaleable Networks And Platforms
- Commodity distributed OS, built on commodity platforms and a commodity network interconnect
- The miracle (the "?") is the software that ties these commodity pieces together

Slide 22: Why Parallel Access to Data?
- The issue is BANDWIDTH
- At 10 MB/s it takes 1.2 days to scan a terabyte; with 1,000-way parallelism the same scan takes under two minutes (see the sketch below)
- Parallelism: divide a big problem into many smaller ones to be solved in parallel
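A quick sketch of the bandwidth arithmetic (it assumes a 1 TB table, which is roughly what 1.2 days at 10 MB/s works out to):

# Scan time for a table, serial vs. partitioned across many disks/CPUs.
table_bytes = 1_000_000_000_000           # 1 TB (assumed; ~1.2 days at 10 MB/s)
bandwidth = 10 * 1_000_000                # 10 MB/s per scanner

serial_seconds = table_bytes / bandwidth
parallel_seconds = serial_seconds / 1_000     # 1,000-way partitioned scan

print(f"serial:   {serial_seconds / 86_400:.1f} days")     # ~1.2 days
print(f"parallel: {parallel_seconds / 60:.1f} minutes")     # ~1.7 minutes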

Slide 23: Dataflow Programming: Prefetch and Postwrite Hide Latency
- The issue is LATENCY: we can't wait for the data to arrive
- Need a memory that gets the data in advance (about 100 MB/s)
- Solution: pipeline from the source (tape, disk, RAM, ...) to the CPU cache, and pipeline the results to the destination (see the sketch below)
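A minimal sketch of the prefetch idea: a background thread reads the next blocks while the current one is being processed, so I/O latency overlaps with computation. The read_block and process_block callables and the pipeline depth are placeholders, not from the original:

import threading, queue

def prefetching_reader(read_block, process_block, depth=4):
    """Pipeline: a reader thread stays `depth` blocks ahead of the consumer."""
    buf = queue.Queue(maxsize=depth)

    def reader():
        while True:
            block = read_block()            # e.g. file.read(1 << 20)
            buf.put(block)
            if not block:                   # an empty block signals end of stream
                return

    threading.Thread(target=reader, daemon=True).start()
    while True:
        block = buf.get()                   # usually already in memory: latency hidden
        if not block:
            break
        process_block(block)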

Slide 24: The New Law of Computing
- Grosch's law: 2x the money buys 4x the performance (1 MIPS for $1; 1,000 MIPS for $32, i.e. $0.03/MIPS)
- The parallel law: 2x the money buys 2x the performance (1 MIPS for $1; 1,000 MIPS for $1,000)
- The parallel law needs linear speedup and linear scaleup, which is not always possible

Slide 25: Parallelism: Performance Is the Goal
- The goal is to get 'good' performance
- Law 1: a parallel system should be faster than the serial system
- Law 2: a parallel system should give near-linear scaleup or near-linear speedup, or both
- Parallelism is faster, not cheaper: it trades money for time

Slide 26: Parallelism: Speedup and Scaleup
- Speedup: same job, more hardware: less time
- Scaleup: bigger job, more hardware: same time
- Transaction scaleup: more clients and more servers, same response time (e.g. a 100 GB server with 1k clients growing to a 1 TB server with 10k clients)

Slide 27: The Perils of Parallelism
- Startup: creating processes, opening files, optimization
- Interference: device contention (CPU, disk, bus) and logical contention (locks, hotspots, server, log, ...)
- Skew: if tasks get very small, the variance exceeds the service time (see the sketch below)
(Figure: a bad speedup curve: speedup vs. processors and disks, ranging from linearity down to no parallelism benefit)
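A small model of how startup, interference, and skew bend the speedup curve; the constants are made up for illustration, not measurements from the talk:

def elapsed(n, work=1000.0, startup=1.0, interference=0.02, skew=0.05):
    """Elapsed time for a perfectly divisible `work`-second job on n processors.
    startup: process/file-open setup cost; interference: seconds lost to contention
    per extra processor; skew: the slowest partition runs this much longer."""
    per_task = work / n
    return startup + per_task * (1 + skew) + interference * (n - 1)

for n in (1, 10, 100, 1000):
    print(n, "processors -> speedup", round(elapsed(1) / elapsed(n), 1))
# Speedup climbs, then flattens and eventually falls as startup, interference, and skew dominate.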

Slide 28: Outline
- Cyberspace pep talk: databases are the dirt of cyberspace; billions of clients mean millions of servers
- Parallel imperative: the hardware trend is many little devices, so servers are arrays of commodity components; PCs are the bricks of cyberspace; must automate parallel {design / operation / use}
- Software parallelism via dataflow and data partitioning
- Parallel database techniques: parallel execution of many little jobs (OLTP), data partitioning, pipeline execution, automation techniques
- Summary

Slide 29: Kinds of Parallel Execution
- Pipeline parallelism: one sequential program feeds its output directly to the next sequential program
- Partition parallelism: run many copies of the same sequential program, each on its own partition; inputs split N ways and outputs merge M ways

Slide 30: Data Rivers: Split + Merge Streams
- N producers and M consumers share a "river" of N x M data streams (see the sketch below)
- Producers add records to the river; consumers take records from the river; each program stays purely sequential
- The river does flow control and buffering, and does the partitioning and merging of data records
- The river is the Exchange operator in Volcano
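A minimal sketch of a data river with hash partitioning, using in-process queues to stand in for the network; the queue depth, the hash split, and the record shape are all illustrative assumptions:

import queue

class River:
    """N producers -> M consumers; the river buffers, partitions, and merges."""
    def __init__(self, num_consumers, depth=1024):
        self.lanes = [queue.Queue(maxsize=depth) for _ in range(num_consumers)]

    def put(self, key, record):
        # Partition by hash of the key so equal keys meet at the same consumer.
        self.lanes[hash(key) % len(self.lanes)].put(record)

    def get(self, consumer_id):
        # Each consumer reads only its own lane (the merge of all producers).
        return self.lanes[consumer_id].get()

# Usage sketch: each producer calls river.put(rec["id"], rec); consumer i loops on river.get(i).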

Slide 31: Partitioned Data and Execution
- Spreads computation and I/O among processors
- Partitioned data gives NATURAL execution parallelism

Slide 32: Partition + Merge + Pipeline Execution
- Pure dataflow programming gives linear speedup and scaleup
- But the top (merge) node may become a bottleneck, so...

Slide 33: N x M Way Parallelism
- N inputs, M outputs, no bottlenecks

Slide 34: Why Are Relational Operators Successful for Parallelism?
- The relational data model gives uniform operators on uniform data streams, closed under composition (see the sketch below)
- Each operator consumes one or two input streams; each stream is a uniform collection of data
- Sequential data in, sequential data out: pure dataflow
- Partitioning some operators (e.g. aggregates, non-equijoin, sort, ...) requires innovation
- The payoff: AUTOMATIC PARALLELISM
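A sketch of "closed under composition": every operator consumes and produces a stream of records, so operators snap together into a dataflow plan. The generator-based operators and the sample table are purely illustrative:

def scan(table):                       # leaf operator: a stream of records
    yield from table

def select(stream, predicate):         # one stream in, one stream out
    return (r for r in stream if predicate(r))

def project(stream, columns):
    return ({c: r[c] for c in columns} for r in stream)

# Because every operator has the same stream-in/stream-out shape, a plan is just
# composition, and each stage can later be split across partitions.
emp = [{"name": "Mike", "dept": "db", "sal": 10},
       {"name": "Won", "dept": "os", "sal": 12}]
plan = project(select(scan(emp), lambda r: r["dept"] == "db"), ["name"])
print(list(plan))                      # [{'name': 'Mike'}]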

Slide 35: SQL: A Non-Procedural Programming Language
- SQL is a non-procedural (functional-style) language: a query describes the answer set, not how to compute it
- The optimizer picks the best execution plan: the dataflow web (pipelines), the degree of parallelism (partitioning), and other execution parameters (process placement, memory, ...)
(Figure: the GUI and schema feed the optimizer, which does execution planning; plans run as rivers and executors, watched by a monitor)

Slide 36: Database Systems Hide Parallelism
- Automate system management via tools: data placement, data organization (indexing), periodic tasks (dump / recover / reorganize)
- Automatic fault tolerance: duplexing and failover, transactions
- Automatic parallelism: among transactions (locking) and within a transaction (parallel execution)

Slide 37: Success Stories
- Online transaction processing (many little jobs): SQL systems support 3,700 tps-A (24 CPUs, 240 disks) and 21,000 tpm-C (110 CPUs, 800 disks)
- Batch, decision support, and utility (few big jobs, with parallelism inside each): scan data at 100 MB/s, linear scaleup to 50 processors
(Figure: transactions/sec and records/sec both grow linearly with hardware)

Slide 38: Kinds of Partitioned Data
- Split a SQL table across a subset of nodes and disks; partition within the set by (see the sketch below):
  - Range: good for equijoins, range queries, and group-by
  - Hash: good for equijoins
  - Round robin: good for spreading load
- Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning
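The three partitioning schemes reduce to three small routing functions; this is a sketch, and the range boundaries and node counts are arbitrary examples:

from itertools import count

def range_partition(key, boundaries):
    """Route by sorted split points, e.g. boundaries=[10, 20, 30, 40]."""
    for node, bound in enumerate(boundaries):
        if key < bound:
            return node
    return len(boundaries)

def hash_partition(key, nodes):
    return hash(key) % nodes

_rr = count()
def round_robin_partition(nodes):
    return next(_rr) % nodes

# e.g. range_partition(25, [10, 20, 30, 40]) -> 2; hash_partition("smith", 8) -> some node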

Slide 39: Index Partitioning
- Hash indices partition by hash
- B-tree indices partition as a forest of trees, one tree per range (e.g. 0..9, 10..19, 20..29, 30..39, 40.. or A..C, D..F, G..M, N..R, S..)
- The primary index clusters the data

Slide 40: Secondary Index Partitioning
- In shared nothing, secondary indices are problematic
- Partition the secondary index by base-table key ranges:
  - Insert: completely local (but what about uniqueness?)
  - Lookup: examines ALL trees; a unique index requires a lookup on every insert
- Partition the secondary index by secondary-key ranges:
  - Insert: touches two nodes (base and index)
  - Lookup: touches two nodes (index, then base); uniqueness is easy
- Teradata solution: partition non-unique indices by base table, unique indices by secondary key

Slide 41: Picking Data Ranges
- Disk partitioning: for range partitioning, sample the load on the disks and cool hot disks by making their ranges smaller; for hash partitioning, cool hot disks by remapping some buckets to other disks
- River partitioning: use hashing and assume a uniform distribution; for range partitioning, sample the data and use a histogram to level the bulk (see the sketch below)
- Teradata, Tandem, and Oracle use these tricks
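A sketch of picking range boundaries from a sample so that each partition gets about the same number of rows (an equi-depth histogram); the sample source is a placeholder:

def pick_range_boundaries(sample_keys, partitions):
    """Return partitions-1 split points that divide the sampled keys evenly."""
    keys = sorted(sample_keys)
    step = len(keys) / partitions
    return [keys[int(i * step)] for i in range(1, partitions)]

# Example: a skewed sample still yields roughly balanced ranges.
sample = [1] * 50 + list(range(2, 52))            # half the rows have key 1
print(pick_range_boundaries(sample, 4))           # e.g. [1, 2, 27]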

Slide 42: Parallel Data Scan
  select image
  from   landsat
  where  date between 1970 and 1990
    and  overlaps(location, :Rockies)
    and  snow_cover(image) > .7;
- The query combines temporal, spatial, and image tests over Landsat(date, loc, image), e.g. rows from 1/2/72 through 4/8/95 at locations 33N 120W to 34N 120W
- Assign one process per processor/disk: each finds the images with the right date and location, analyzes each image, and returns it if at least 70% snow-covered

Slide 43: Simple Aggregates (Sort or Hash?)
- Simple aggregates (count, min, max, ...) can use indices: more compact, and the index sometimes already holds the aggregate info
- GROUP BY aggregates: scan in category order if possible (use indices); else, if the categories fit in RAM, use a RAM hash table of categories; else sort by category into a temp file and do the math in the merge step

Slide 44: Parallel Aggregates
- Each aggregate function needs a decomposition strategy (see the sketch below):
  count(S) = sum over partitions of count(s(i)), and likewise for sum()
  avg(S) = (sum over partitions of sum(s(i))) / (sum over partitions of count(s(i)))
  and so on...
- For grouped aggregates, compute sub-aggregates close to the source and drop the sub-aggregates into a hash-partitioned river
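A sketch of the decomposition: each partition produces partial (sum, count) pairs per group, and a combine step adds the partials and derives avg at the end. The grouping and value-extraction functions and the sample rows are illustrative:

from collections import defaultdict

def partial_aggregate(rows, group_of, value_of):
    """Run near the data: per-group (sum, count) for one partition."""
    partials = defaultdict(lambda: [0.0, 0])
    for r in rows:
        s_c = partials[group_of(r)]
        s_c[0] += value_of(r)
        s_c[1] += 1
    return partials

def combine(all_partials):
    """Run after the hash river: add the partials, then finish avg."""
    totals = defaultdict(lambda: [0.0, 0])
    for partials in all_partials:
        for g, (s, c) in partials.items():
            totals[g][0] += s
            totals[g][1] += c
    return {g: s / c for g, (s, c) in totals.items()}

parts = [partial_aggregate(p, lambda r: r["dept"], lambda r: r["sal"])
         for p in ([{"dept": "db", "sal": 10}], [{"dept": "db", "sal": 14}])]
print(combine(parts))    # {'db': 12.0}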

Slide 45: Sort
- Used for loading and reorganization (sort makes data sequential), building B-trees, reports, and non-equijoins
- Rarely used for aggregates or equijoins when hashing is available
- Should run at 10 MB/s or better; that is faster than one disk, so striped scratch files are needed
- In-memory sort runs at about 250 K records/s
(Figure: input data is sorted into runs, and the runs are merged into sorted output)

Slide 46: Parallel Sort Design (M inputs, N outputs)
- A scan (or another source) feeds a river that is range- or hash-partitioned (see the sketch below)
- Each output stream does sub-sorts that generate runs, then merges its runs
- The disk runs and the merge are not needed if the sort fits in memory
- Scales nearly linearly: log(10^12) / log(10^6) = 12/6 = 2, so a million-times-bigger sort does only about 2x the comparisons per record
- Sort is the benchmark from hell for shared-nothing machines: network traffic equals disk bandwidth, and there is no data filtering at the source
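A sketch of the M-in, N-out sort: producers range-partition their input into the river, each consumer sorts locally, and concatenating the consumers' outputs in partition order yields the global sort. The boundaries and data are illustrative, and a real system would spill runs to disk rather than sort in one call:

import bisect

def parallel_sort(partitions, boundaries):
    """partitions: M unsorted input lists. boundaries: N-1 split points."""
    n_out = len(boundaries) + 1
    buckets = [[] for _ in range(n_out)]          # the "river": range-partitioned
    for part in partitions:                       # M producers
        for key in part:
            buckets[bisect.bisect_right(boundaries, key)].append(key)
    for b in buckets:                             # N consumers sort locally
        b.sort()
    return [k for b in buckets for k in b]        # outputs concatenate in order

print(parallel_sort([[9, 3, 27], [14, 1, 40]], boundaries=[10, 30]))
# [1, 3, 9, 14, 27, 40]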

Slide 47: Blocking Operators = Short Pipelines
- An operator is blocking if it produces no output until it has consumed all of its input
- Examples: sort, aggregates, hash join (which reads all of one operand before producing output)
- Blocking operators kill pipeline parallelism and make partition parallelism all the more important
- The database-load template, for example, has three blocked phases

Slide 48: Nested-Loops Join
- If the inner table is indexed on the join columns (B-tree or hash): sequentially scan the outer table (from the start key) and, for each outer record, probe the inner table for matching records (see the sketch below)
- Works best if the inner table fits in RAM (i.e. a small inner); works great if the inner is a B-tree or hash in RAM
- Partitions well: replicate the inner table at each outer partition (if the outer is partitioned on the join column, don't replicate the inner, partition it)
- Works for all joins (outer, non-equijoin, Cartesian, exclusion, ...)
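A sketch of an indexed nested-loops join: the "index" is an in-RAM hash map from join key to inner rows, and the outer table is streamed past it. The table shapes and key names are illustrative:

from collections import defaultdict

def nested_loops_join(outer, inner, outer_key, inner_key):
    """Stream the outer table; probe an in-RAM index on the inner table."""
    index = defaultdict(list)                  # the inner's hash index on the join column
    for row in inner:
        index[row[inner_key]].append(row)
    for o in outer:                            # sequential scan of the outer table
        for i in index.get(o[outer_key], []):  # probe for matching inner records
            yield {**o, **i}

orders = [{"cust": 1, "item": "disk"}, {"cust": 2, "item": "tape"}]
custs = [{"cust": 1, "name": "Mike"}]
print(list(nested_loops_join(orders, custs, "cust", "cust")))
# [{'cust': 1, 'item': 'disk', 'name': 'Mike'}]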

Slide 49: Merge Join (and Sort-Merge Join)
- If both tables are sorted on the join columns (B-tree or hash): sequentially scan each from the start key; if left < right advance left, if they match emit, if left > right advance right (see the sketch below)
- Nice sequential scan of the data (disk speed); an N x M match (duplicates on both sides) degenerates toward a Cartesian product and may cause a backwards rescan
- Sort-merge join sorts the inputs first, then does the merge
- Partitions well: partition the smaller table to match the larger table's partitions
- Works for all joins (outer, non-equijoin, Cartesian, exclusion, ...)
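A sketch of the merge step on two inputs already sorted on the join key, grouping each key so that duplicates on both sides are handled without a backwards rescan; the key-extraction function and sample data are illustrative:

from itertools import groupby

def merge_join(left, right, key):
    """Both inputs must already be sorted on `key`."""
    lgroups = groupby(left, key)               # (key, rows) in ascending key order
    rgroups = groupby(right, key)
    l = next(lgroups, None)
    r = next(rgroups, None)
    while l and r:
        if l[0] < r[0]:
            l = next(lgroups, None)            # advance left
        elif l[0] > r[0]:
            r = next(rgroups, None)            # advance right
        else:                                  # matching keys: cross the two groups
            rrows = list(r[1])
            for lrow in l[1]:
                for rrow in rrows:
                    yield lrow, rrow
            l, r = next(lgroups, None), next(rgroups, None)

a = [(1, "x"), (2, "y"), (2, "z")]
b = [(2, "p"), (3, "q")]
print(list(merge_join(a, b, key=lambda t: t[0])))
# [((2, 'y'), (2, 'p')), ((2, 'z'), (2, 'p'))]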

Slide 50: Hash Join
- Hash the smaller table into N buckets (hope N = 1)
- If N = 1: read the larger table and hash-probe into the smaller one
- Else: hash both tables to disk by bucket, then do a bucket-by-bucket hash join (see the sketch below)
- Purely sequential data behavior; hashing also reduces skew
- Beats sort-merge and nested loops unless the data is already clustered
- Good for equijoin, outer join, and exclusion join
- Lots of papers; products are just now appearing (what went wrong?)
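A sketch of the bucket-by-bucket case: partition both inputs by hash of the join key, then join matching bucket pairs with an in-memory hash table. In a real system the buckets would be written to disk; the key function, bucket count, and sample data are illustrative:

def bucket_hash_join(left, right, key, n_buckets=4):
    """Partition both inputs by hash, then join matching buckets in memory (equijoin)."""
    lparts = [[] for _ in range(n_buckets)]
    rparts = [[] for _ in range(n_buckets)]
    for row in left:                                   # these buckets would spill to disk
        lparts[hash(key(row)) % n_buckets].append(row)
    for row in right:
        rparts[hash(key(row)) % n_buckets].append(row)
    for lp, rp in zip(lparts, rparts):                 # bucket-by-bucket in-memory join
        table = {}
        for row in lp:
            table.setdefault(key(row), []).append(row)
        for row in rp:
            for match in table.get(key(row), []):
                yield match, row

parts = [{"pno": 1, "name": "disk"}]
shipments = [{"pno": 1, "qty": 5}, {"pno": 2, "qty": 7}]
print(list(bucket_hash_join(parts, shipments, key=lambda r: r["pno"])))
# [({'pno': 1, 'name': 'disk'}, {'pno': 1, 'qty': 5})]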

Slide 51: Observation: Execution Is Easy, Automation Is Hard
- It is easy to build a fast parallel execution environment (no one has done it, but it is just programming)
- It is hard to write a robust, world-class query optimizer: there are many tricks, and one quickly hits the complexity barrier
- Common approach: pick the best sequential plan; pick the degree of parallelism from bottleneck analysis; bind operators to processes; place processes at nodes; place scratch files near the processes; use memory as a constraint

Slide 52: Systems That Work This Way
- Shared nothing: Teradata (400 nodes), Tandem (110 nodes), IBM SP2 / DB2 (48 nodes), AT&T & Sybase (112 nodes), Informix/SP2 (48 nodes)
- Shared disk: Oracle (170 nodes), Rdb (24 nodes)
- Shared memory: Informix (9 nodes), RedBrick (? nodes)

Slide 53: Research Problems
- Automatic data placement (partitioning: random or organized)
- Automatic parallel programming (process placement)
- Parallel concepts, algorithms, and tools
- Parallel query optimization
- Execution techniques: load balancing, checkpoint/restart, pacing, ...
(Target configuration, as on slide 19: 100 nodes = 1 Tips, 1,000 disks = 10 Terrorbytes, 100 tape transports = 1,000 tapes = 1 petabyte, 10 Gb/s network)

Slide 54: Summary
- Cyberspace is growing: databases are the dirt of cyberspace, PCs are the bricks, networks are the mortar
- Many little devices: performance comes from arrays of {CPU, disk, tape}
- Then a miracle occurs: a scaleable distributed OS and network (SNAP: Scaleable Networks And Platforms)
- Then parallel database systems give software parallelism: OLTP runs lots of little jobs in parallel; batch TP uses dataflow and data partitioning
- Automate processor and storage array administration, and automate processor and storage array programming: 2000 platforms as easy as 1 platform

