1 What Happens When Processing, Storage, Bandwidth are Free and Infinite? Jim Gray, Microsoft Research
2 Outline l Hardware CyberBricks –all nodes are very intelligent l Software CyberBricks –standard way to interconnect intelligent nodes l What next? –Processing migrates to where the power is Disk, network, display controllers have full-blown OS Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them Computer is a federated distributed system.
3 A Hypothetical Question: Taking things to the limit l Moore's law, 100x per decade: –Exa-instructions per second in 30 years –Exa-bit memory chips –Exa-byte disks l Gilder's Law of the Telecosm: 3x/year more bandwidth, 60,000x per decade! –40 Gbps per fiber today
4 Grove's Law l Link bandwidth doubles every 100 years! l Not much has happened to telephones lately l Still twisted pair
5 Gilder's Telecosm Law: 3x bandwidth/year for 25 more years l Today: –10 Gbps per channel –4 channels per fiber: 40 Gbps –32 fibers/bundle = 1.2 Tbps/bundle l In lab: 3 Tbps/fiber (400 x WDM) l In theory: 25 Tbps per fiber l 1 Tbps = USA 1996 WAN bisection bandwidth l 1 fiber = 25 Tbps
6 Thesis: Many little beat few big l The smoking, hairy golf ball l How to connect the many little parts? l How to program the many little parts? l Fault tolerance? [Chart: $1 million mainframe, $100 K mini, $10 K micro, nano; disc form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; 1 M SPECmarks, 1 TFLOP; 10^6 clocks to bulk ram; event-horizon on chip; VM reincarnated; multi-program cache, on-chip SMP; 10 nano-second ram, 10 microsecond ram, 10 millisecond disc, 10 second tape archive; pico processor, 10 pico-second ram; capacities from 1 MB to 1 MM TB]
7 Year 2000 4 B Machine l The Year 2000 commodity PC l 1 Billion Instructions/Sec l .1 Billion Bytes RAM l 1 Billion Bits/s Net l 10 Billion Bytes Disk l Billion Pixel display –3000 x 3000 x 24 l 1,000 $ [Diagram: 1 Bips processor, .1 B byte RAM, 10 GB disk, 1 B bits/sec LAN/WAN]
8 4 B PCs: The Bricks of Cyberspace l Cost 1,000 $ l Come with –OS (NT, POSIX,..) –DBMS –High speed Net –System management –GUI / OOUI –Tools l Compatible with everyone else l CyberBricks
9 Super Server: 4T Machine l Array of 1,000 4B machines: –1 Bips processors –1 B B DRAM –10 B B disks –1 Bbps comm lines –1 TB tape robot l A few megabucks l Challenge: –Manageability –Programmability –Security –Availability –Scaleability –Affordability –As easy as a single system l Future servers are CLUSTERS of processors, discs l Distributed database techniques make clusters work [Diagram: Cyber Brick, a 4B machine: CPU, 50 GB Disc, 5 GB RAM]
10 Functionally Specialized Cards l Storage l Network l Display M MB DRAM P mips processor ASIC Today: P=50 mips M= 2 MB In a few years P= 200 mips M= 64 MB
11 It's Already True of Printers: Peripheral = CyberBrick l You buy a printer l You get –several network interfaces –a Postscript engine: cpu, memory, software, a spooler (soon) –and… a print engine.
12 System On A Chip l Integrate Processing with memory on one chip –chip is 75% memory now –1MB cache >> 1960 supercomputers –256 Mb memory chip is 32 MB! –IRAM, CRAM, PIM,… projects abound l Integrate Networking with processing on one chip –system bus is a kind of network –ATM, FiberChannel, Ethernet,.. Logic on chip. –Direct IO (no intermediate bus) l Functionally specialized cards shrink to a chip.
13 Tera Byte Backplane l TODAY –Disk controller is 10 mips risc engine with 2MB DRAM –NIC is similar power l SOON –Will become 100 mips systems with 100 MB DRAM. l They are nodes in a federation (can run Oracle on NT in disk controller). l Advantages –Uniform programming model –Great tools –Security –economics (cyberbricks) –Move computation to data (minimize traffic) All Device Controllers will be Cray 1s Central Processor & Memory
14 With Tera Byte Interconnect and Super Computer Adapters l Processing is incidental to –Networking –Storage –UI l Disk Controller/NIC is –faster than device –close to device –Can borrow device package & power l So use idle capacity for computation. l Run app in device. Tera Byte Backplane
15 Implications l Conventional: –Offload device handling to NIC/HBA –higher-level protocols: I2O, NASD, VIA… –SMP and Cluster parallelism is important. l Radical: –Move app to NIC/device controller –higher-higher-level protocols: CORBA / DCOM. –Cluster parallelism is VERY important. [Diagrams: Tera Byte Backplane; Central Processor & Memory]
16 How Do They Talk to Each Other? l Each node has an OS l Each node has local resources: A federation. l Each node does not completely trust the others. l Nodes use RPC to talk to each other l CORBA? DCOM? IIOP? RMI? l One or all of the above. l Huge leverage in high-level interfaces. l Same old distributed system story. Wire(s) VIAL/VIPL streams datagrams RPC? Applications VIAL/VIPL streams datagrams RPC? Applications
17 Outline l Hardware CyberBricks –all nodes are very intelligent l Software CyberBricks –standard way to interconnect intelligent nodes l What next? –Processing migrates to where the power is Disk, network, display controllers have full-blown OS Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them Computer is a federated distributed system.
18 Objects! l It's a zoo l ORBs, COM, CORBA,.. l Object Relational Databases l Objects and 3-tier computing
19 History and Alphabet Soup [Diagram: Solaris, UNIX International, Open Software Foundation (OSF) DCE, NT, ODBC, XA / TX, Object Management Group (OMG) CORBA, Open Group, X/Open; DCE brings RPC, GUIDs, IDL, DNS, Kerberos; Microsoft COM: DCOM is based on OSF-DCE technology, DCOM and ActiveX extend it]
20 The Promise l Objects are Software CyberBricks –productivity breakthrough (plug ins) –manageability breakthrough (modules) l Microsoft Promises Cairo distributed objects, secure, transparent, fast invocation l IBM/Sun/Oracle/Netscape promise CORBA + Open Doc + Java Beans + l All will deliver l Customers can pick the best one Both camps Share key goals: l Encapsulation: hide implementation l Polymorphism: generic ops key to GUI and reuse l Uniform Naming l Discovery: finding a service l Fault handling: transactions l Versioning: allow upgrades l Transparency: local/remote l Security: who has authority l Shrink-wrap: minimal inheritance l Automation: easy
21 The OLE-COM Experience l Macintosh had Publish & Subscribe l PowerPoint needed graphs: –plugged MS Graph in as a component l Office adopted OLE –one graph program for all of Office l Internet arrived –URLs are object references –Office is Web-enabled right away! l Office97 smaller than Office95 because of shared components l It works!!
22 Linking And Embedding: Objects are data modules; transactions are execution modules l Link: pointer to object somewhere else –Think URL in Internet l Embed: bytes are here l Objects may be active; can call back to subscribers
23 Objects Meet Databases: basis for universal data servers, access, & integration l DBMS engine l Object-oriented (COM-oriented) interface to data l Breaks DBMS into components l Anything can be a data source l Optimization/navigation on top of other data sources l Makes an RDBMS an O-R DBMS, assuming the optimizer understands objects [Diagram: database, spreadsheet, photos, mail, map, document]
24 The BIG Picture Components and transactions l Software modules are objects l Object Request Broker (a.k.a., Transaction Processing Monitor) connects objects (clients to servers) l Standard interfaces allow software plug-ins l Transaction ties execution of a job into an atomic unit: all-or-nothing, durable, isolated l ActiveX Components are a 250M$/year business. Object Request Broker
25 Transaction Object Request Broker (ORB) Orchestrates RPC l Registers Servers l Manages pools of servers l Connects clients to servers l Does Naming, request-level authorization, l Provides transaction coordination l Direct and queued invocation l Old names: –Transaction Processing Monitor, –Web server, –NetWare Object-Request Broker
26 The OO Points So Far l Objects are software Cyber Bricks l Object interconnect standards are emerging l Cyber Bricks become Federated Systems. l Next points: –put processing close to data –do parallel processing.
27 Three Tier Computing l Clients do presentation, gather input l Clients do some workflow (Xscript) l Clients send high-level requests to ORB l ORB dispatches work-flows and business objects -- proxies for client, orchestrate flows & queues l Server-side workflow scripts call on distributed business objects to execute task Database Business Objects workflow Presentation
28 The Three Tiers [Diagram: Web Client: HTML, VB or Java Script Engine, VB or Java Virtual Machine, VBScript / JavaScript, VB / Java plug-ins; Internet via HTTP+ and DCOM to Middleware: ORB, TP Monitor, Web Server, object server pool; DCOM (oleDB, ODBC,...) to the Object & Data server; LU6.2 to IBM Legacy Gateways]
29 Transaction Processing Evolution to Three Tier: Intelligence migrated to clients l Mainframe batch processing (centralized) l Dumb terminals & Remote Job Entry l Intelligent terminals, database backends l Workflow Systems, Object Request Brokers, Application Generators [Timeline: mainframe cards; green screen 3270; Active; Server, TP Monitor, ORB]
30 Web Evolution to Three Tier: Intelligence migrated to clients (like TP) l Character-mode clients, smart servers l GUI Browsers - Web file servers l GUI Plugins - Web dispatchers - CGI l Smart clients - Web dispatcher (ORB), pools of app servers (ISAPI, Viper), workflow scripts at client & server [Timeline: archie, gopher, green screen; Web Server; Mosaic, WAIS; NS & IE; Active]
31 PC Evolution to Three Tier Intelligence migrated to server l Stand-alone PC (centralized) l PC + File & print server message per I/O l PC + Database server message per SQL statement l PC + App server message per transaction l ActiveX Client, ORB ActiveX server, Xscript disk I/O IO request reply SQL Statement Transaction
32 Why Did Everyone Go To Three-Tier? l Manageability –Business rules must be with data –Middleware operations tools l Performance (scaleability) –Server resources are precious –ORB dispatches requests to server pools l Technology & Physics –Put UI processing near user –Put shared data processing near shared data –Minimizes data moves –Encapsulate / modularity [Diagram: presentation, workflow, business objects, database]
33 Why Put Business Objects at Server? l DADs (raw data): Customer comes to store, takes what he wants, fills out invoice, leaves money for goods. Easy to build. No clerks. l MOMs (business objects): Customer comes to store with list, gives list to clerk, clerk gets goods, makes invoice, customer pays clerk, gets goods. Easy to manage. Clerk controls access. Encapsulation.
34 The OO Points So Far l Objects are software Cyber Bricks l Object interconnect standards are emerging l Cyber Bricks become Federated Systems. l Put processing close to data l Next point: –do parallel processing.
35 Parallelism: the OTHER half of Super-Servers l Clusters of machines allow two kinds of parallelism –Many little jobs: Online transaction processing TPC A, B, C,… –A few big jobs: data search & analysis TPC D, DSS, OLAP l Both give automatic Parallelism
36 Why Parallel Access To Data? l At 10 MB/s, 1.2 days to scan a terabyte l 1,000x parallel: 100 second SCAN l Parallelism: divide a big problem into many smaller ones to be solved in parallel l BANDWIDTH
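The scan arithmetic on this slide can be checked directly (a back-of-the-envelope sketch; the 1 TB table size and decimal units are assumptions consistent with the slide's figures):

```python
TERABYTE = 1e12          # bytes, decimal units (assumed)
RATE = 10e6              # one disc scans at 10 MB/s

serial_seconds = TERABYTE / RATE          # 100,000 s on one disc
serial_days = serial_seconds / 86_400     # ~1.16 days, the slide's "1.2 days"

# 1,000-way parallel scan: each of 1,000 discs reads 1/1,000 of the data
parallel_seconds = serial_seconds / 1_000  # 100 s

print(serial_days, parallel_seconds)
```

The point is pure bandwidth: a thousand cheap spindles turn a day-long scan into under two minutes.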
37 Kinds of Parallel Execution l Pipeline: any sequential program feeds any sequential program; stages overlap in time l Partition: inputs split N ways, outputs merge M ways; the same sequential program runs on each partition
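The two forms can be sketched with plain Python generators (a conceptual sketch, not a real executor; the operator names and the tiny table are made up):

```python
# Pipeline parallelism: operators form a chain, scan -> filter -> project,
# so tuples stream through and the stages can overlap in time.
def scan(table):
    yield from table

def filter_op(rows, pred):
    return (r for r in rows if pred(r))

def project(rows, col):
    return (r[col] for r in rows)

# Partition parallelism: split the input N ways, run the SAME sequential
# program on each partition, then merge the N output streams.
def split(table, n):
    return [table[i::n] for i in range(n)]

def merge(streams):
    return [row for s in streams for row in s]

table = [{"price": p} for p in range(10)]

# pipeline: one chain of operators over the whole table
pipelined = list(project(filter_op(scan(table), lambda r: r["price"] > 5), "price"))

# partition: the same chain run independently on each of 3 partitions
partitioned = merge(
    list(project(filter_op(scan(p), lambda r: r["price"] > 5), "price"))
    for p in split(table, 3)
)

print(sorted(pipelined), sorted(partitioned))  # same answer either way
```

A real executor would run the partitions on separate processors; the answer is identical, which is what makes the parallelism automatic.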
38 Why are Relational Operators Successful for Parallelism? l Relational data model: uniform operators on uniform data streams l Closed under composition l Each operator consumes 1 or 2 input streams l Each stream is a uniform collection of data l Sequential data in and out: pure dataflow l Partitioning some operators (e.g. aggregates, non-equi-join, sort,..) requires innovation l AUTOMATIC PARALLELISM
39 Database Systems Hide Parallelism l Automate system management via tools –data placement –data organization (indexing) –periodic tasks (dump / recover / reorganize) l Automatic fault tolerance –duplex & failover –transactions l Automatic parallelism –among transactions (locking) –within a transaction (parallel execution)
40 SQL: a Non-Procedural Programming Language l SQL: functional programming language, describes the answer set. l Optimizer picks best execution plan –picks data flow web (pipeline) –degree of parallelism (partitioning) –other execution parameters (process placement, memory,...) [Diagram: GUI, schema, plan, monitor; optimizer, execution planning; rivers, executors]
41 Automatic Data Partitioning l Split a SQL table across a subset of nodes & disks l Partition within set: Range, Hash, or Round Robin –Range: good for equijoins, range queries, group-by –Hash: good for equijoins –Round Robin: good to spread load l Shared disk and memory are less sensitive to partitioning; shared nothing benefits from "good" partitioning
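The three schemes amount to three placement functions (a sketch; the node counts and split points are made-up examples, and crc32 stands in for whatever hash a real system uses):

```python
import zlib

def range_partition(key, split_points):
    """Range: keys below the first split point go to node 0, and so on.
    Keeps nearby keys together, hence good for range queries and group-by."""
    for node, boundary in enumerate(split_points):
        if key < boundary:
            return node
    return len(split_points)

def hash_partition(key, n_nodes):
    """Hash: the same key always lands on the same node, hence good for
    equijoins. crc32 is used only to get a hash that is stable across runs."""
    return zlib.crc32(repr(key).encode()) % n_nodes

def round_robin(row_number, n_nodes):
    """Round robin: ignores the key entirely, hence good to spread load."""
    return row_number % n_nodes

# 4 nodes, range split points at 100, 200, 300
print(range_partition(42, [100, 200, 300]),    # node 0
      range_partition(250, [100, 200, 300]),   # node 2
      round_robin(7, 4))                       # node 3
```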
42 N x M way Parallelism N inputs, M outputs, no bottlenecks.
43 Parallel Objects? l How does all this DB parallelism connect to hardware/software Cyber Bricks? l To scale to large client sets –need lots of independent parallel execution –Comes for free from the ORB. l To scale to large data sets –need intra-program parallelism (like parallel DBs) –Requires some invention.
44 Outline l Hardware CyberBricks –all nodes are very intelligent l Software CyberBricks –standard way to interconnect intelligent nodes l What next? –Processing migrates to where the power is Disk, network, display controllers have full-blown OS Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them Computer is a federated distributed system. Parallel execution is important
45 MORE SLIDES but there is only so much time. Too bad
46 The Disk Farm On a Card l The 100 GB disc card: an array of discs l Can be used as –100 discs –1 striped disc –10 fault-tolerant discs –...etc l LOTS of accesses/second, bandwidth [14" card] l Life is cheap, it's the accessories that cost ya. l Processors are cheap, it's the peripherals that cost ya (a 10 k$ disc card).
47 Parallelism: Performance is the Goal Goal is to get 'good' performance. Trade time for money. Law 1: parallel system should be faster than serial system Law 2: parallel system should give near-linear scaleup or near-linear speedup or both. Parallel DBMSs obey these laws
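The two laws are usually measured as speedup (the same job on N times the hardware) and scaleup (an N times bigger job on N times the hardware); a small sketch of the definitions, with made-up timing numbers:

```python
def speedup(t_serial, t_parallel):
    """Law 1-style metric: how much faster N processors solve the SAME job.
    Linear speedup means speedup == N."""
    return t_serial / t_parallel

def scaleup(t_small_on_small, t_big_on_big):
    """Law 2-style metric: N x bigger job on N x bigger hardware.
    Linear scaleup means elapsed time stays the same (ratio == 1.0)."""
    return t_small_on_small / t_big_on_big

# hypothetical measurements: a 1,000 s job takes 125 s on 10 processors
print(speedup(1000, 125))    # 8.0, near-linear (ideal would be 10)

# the 10x bigger job on 10x the hardware still takes about 1,000 s
print(scaleup(1000, 1050))   # ~0.95, near-linear scaleup
```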
48 Success Stories l Online Transaction Processing –many little jobs –SQL systems support 50 k tpm-C (44 cpus, 600 disks, 2 nodes) l Batch (decision support and utility) –few big jobs, parallelism inside –Scan data at 100 MB/s –Linear scaleup to 1,000 processors [Charts: transactions/sec vs hardware; records/sec vs hardware]
49 The New Law of Computing l Grosch's Law: 2x $ is 4x performance –1 MIPS for 1 $; 1,000 MIPS for 32 $ (.03 $/MIPS) l Parallel Law: 2x $ is 2x performance –1 MIPS for 1 $; 1,000 MIPS for 1,000 $ –Needs linear speedup and linear scaleup –Not always possible
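The contrast can be checked numerically using the slide's own figures (a sketch; performance in MIPS, cost in dollars):

```python
def grosch_cost(mips):
    """Grosch's Law: performance grows as the SQUARE of cost
    (2x $ buys 4x performance), so cost grows as sqrt(performance)."""
    return mips ** 0.5

def parallel_cost(mips, dollars_per_mips=1.0):
    """Parallel Law: linear, 2x $ buys 2x performance."""
    return mips * dollars_per_mips

print(grosch_cost(1000))    # ~31.6 $: the slide's "1,000 MIPS for 32 $"
print(parallel_cost(1000))  # 1,000 $: the slide's "1,000 MIPS for 1,000 $"
```

Under Grosch's economics the big machine is the bargain; clusters only win because commodity parts drop the $/MIPS constant far below the mainframe's.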
50 Clusters being built l Teradata: 1,000 nodes (30 k$/slice) l Tandem, VMScluster: 150 nodes (100 k$/slice) l Intel: 9,000 nodes, 55 M$ (6 k$/slice) l Teradata, Tandem, DEC moving to NT + low slice price l IBM: 512 nodes, 100 M$ (200 k$/slice) l PC clusters (bare-handed) at dozens of nodes: web servers (msn, PointCast,…), DB servers l KEY TECHNOLOGY HERE IS THE APPS. –Apps distribute data –Apps distribute execution
51 Great Debate: Shared What? SMP or Cluster? l Shared Memory (SMP): easy to program, difficult to build, difficult to scaleup (Sequent, SGI, Sun) l Shared Disk (VMScluster, Sysplex) l Shared Nothing (network): hard to program, easy to build, easy to scaleup (Tandem, Teradata, SP2) l Winner will be a synthesis of these ideas l Distributed shared memory (DASH, Encore) blurs the distinction between network and bus (locality still important) but gives shared memory the message cost.
52 BOTH SMP and Cluster? Grow Up with SMP 4xP6 is now standard Grow Out with Cluster Cluster has inexpensive parts Cluster of PCs
53 Clusters Have Advantages l Clients and Servers made from the same stuff. l Inexpensive: –Built with commodity components l Fault tolerance: –Spare modules mask failures l Modular growth –grow by adding small modules
54 Meta-Message: Technology Ratios Are Important l If everything gets faster & cheaper at the same rate THEN nothing really changes. l Things getting MUCH BETTER: –communication speed & cost 1,000x –processor speed & cost 100x –storage size & cost 100x l Things staying about the same –speed of light (more or less constant) –people (10x more expensive) –storage speed (only 10x better)
55 Storage Ratios Changed l 10x better access time l 10x more bandwidth l 4,000x lower media price l DRAM/DISK price ratio: 100:1 to 10:1 to 50:1
56 Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs [Charts: typical system size (bytes) vs access time (seconds), and price ($/MB) vs access time, for cache, main, secondary disc, online tape, nearline tape, offline tape]
57 Network Speeds l Speed of light did not change l Link bandwidth grew 60% / year l WAN speeds limited by politics –if voice is X$/minute, how much is video? l Gbps to desktop today! l 10 Gbps channel is coming. l 3 Tbps fibers in laboratory thru parallelism (WDM). l Paradox: –WAN link has 40 Gbps –Processor bus is Gbps [Chart: processor speeds (i/s) and LAN & WAN comm speedups (b/s) by year]
58 MicroProcessor Speeds Went Up l Clock rates went from 10 KHz to 400 MHz l Processors are now 6-way issue l SPECInt fits in cache, so it tracks cpu speed l Peak Advertised Performance (PAP) is 1.2 BIPS l Real Application Performance (RAP) is 100 MIPS l Similar curves for –DEC VAX & Alpha –HP/PA –IBM RS6000 / PowerPC –MIPS & SGI –SUN
59 Performance = Storage Accesses, not Instructions Executed l In the old days we counted instructions and IOs l Now we count memory references l Processors wait most of the time [Chart: where the clock ticks go for AlphaSort: sort, disc wait, OS, memory wait, D-cache miss, I-cache miss, B-cache data miss] l 70 MIPS; real apps have worse I-cache misses, so run at 60 MIPS if well tuned, 20 MIPS if not
60 Storage Latency: How Far Away is the Data?
61 Tape Farms for Tertiary Storage, Not Mainframe Silos l Many independent tape robots (like a disc farm) l 10 K$ robot: 14 tapes, 500 GB, 5 MB/s, 20 $/GB, 30 Maps l 100 robots: 50 TB, 50 $/GB, 3 K Maps, 1 M$, scan in 27 hours
62 The Metrics: Disk and Tape Farms Win [Chart comparing a 1000x disc farm, an STC tape robot (6,000 tapes, 8 readers), and a 100x DLT tape farm on GB/K$, Maps, Kaps, and SCANS/day] l Data Motel: data checks in, but it never checks out
63 Tape & Optical: Beware of the Media Myth l Optical is cheap: 200 $/platter, 2 GB/platter => 100 $/GB (2x cheaper than disc) l Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc).
64 Tape & Optical Reality: Media is 10% of System Cost l Tape needs a robot (10 k$ … 3 m$) plus tapes (at 20 GB each) => 20 $/GB … $/GB (1x…10x cheaper than disc) l Optical needs a robot (100 k$); 100 platters = 200 GB (TODAY) => 400 $/GB (more expensive than mag disc) l Robots have poor access times l Not good for Library of Congress (25 TB) l Data motel: data checks in but it never checks out!
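The media-vs-system gap is just the robot cost amortized over capacity (a sketch using the 10 K$, 14-tape, 500 GB robot from the tape-farm slide and the 30 $/tape media price; those pairings are this sketch's assumption):

```python
def media_dollars_per_gb(price_per_unit, gb_per_unit):
    """The media myth: cost of the bare cartridge per GB."""
    return price_per_unit / gb_per_unit

def system_dollars_per_gb(robot_price, n_tapes, tape_price, total_gb):
    """The reality: robot plus media, amortized over total capacity.
    The robot, not the media, dominates the cost."""
    return (robot_price + n_tapes * tape_price) / total_gb

print(media_dollars_per_gb(30, 20))                # 1.5 $/GB media
print(system_dollars_per_gb(10_000, 14, 30, 500))  # ~21 $/GB as a system
```

With these figures the 420 $ of tape is about 4% of the 10,420 $ system, which is the slide's "media is 10% of system cost" point.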
65 The Access Time Myth The Myth: seek or pick time dominates The reality: (1) Queuing dominates (2) Transfer dominates BLOBs (3) Disk seeks often short Implication: many cheap servers better than one fast expensive server –shorter queues –parallel transfer –lower cost/access and cost/byte This is now obvious for disk arrays This will be obvious for tape arrays
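The "queuing dominates" claim is the standard single-server queueing result: response time blows up with utilization (a sketch using the M/M/1 formula; the 10 ms per-access service time is a made-up disk figure):

```python
def mm1_response_time(service_time, utilization):
    """M/M/1 mean response time = service_time / (1 - utilization).
    As the one fast server gets busy, queueing delay, not the raw
    seek or pick time, dominates the access."""
    assert 0 <= utilization < 1
    return service_time / (1 - utilization)

S = 0.010  # 10 ms per access (seek + rotate + transfer), assumed
for rho in (0.1, 0.5, 0.8, 0.9):
    print(f"{rho:.0%} busy: {mm1_response_time(S, rho) * 1000:.0f} ms per access")
```

At 90% utilization each access takes about 100 ms, ten times the device time, which is why many cheap, lightly loaded servers give shorter queues than one fast, expensive server running hot.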
66 Billions Of Clients l Every device will be intelligent l Doors, rooms, cars… l Computing will be ubiquitous
67 Billions Of Clients Need Millions Of Servers l All clients networked to servers –may be nomadic or on-demand l Fast clients want faster servers l Servers provide –Shared data –Control –Coordination –Communication [Diagram: mobile clients, fixed clients, server, superserver]
68 1987: 256 tps Benchmark l 14 M$ computer (Tandem) l A dozen people l False floor, 2 rooms of machines l Simulated 25,600 clients l A 32-node processor array l A 40 GB disk array (80 drives) [Staff: OS expert, network expert, DB expert, performance expert, hardware experts, admin expert, auditor, manager]
69 1988: DB2 + CICS Mainframe 65 tps l IBM 4381 l Simulated network of 800 clients l 2 M$ computer l Staff of 6 to do benchmark [Hardware: 2 x 3725 network controllers; 16 GB disk farm: 4 x 8 x .5 GB; refrigerator-sized CPU]
70 1997: 10 years later, 1 Person and 1 box = 1250 tps l 1 breadbox: ~5x the 1987 machine room l 23 GB is hand-held l One person does all the work l Cost/tps is 1,000x less: 25 micro-dollars per transaction [Hardware: 4 x 200 MHz cpus, 1/2 GB DRAM, 12 x 4 GB disks, 3 x 7 x 4 GB disk arrays; one person covers hardware, OS, net, DB, and app expertise]
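The "1,000x less" claim is a $/tps ratio (a sketch; the 1997 box price is not on the slide, so the 50 k$ figure below is a guess used only for illustration):

```python
def dollars_per_tps(system_price, tps):
    """Price/performance for an OLTP system."""
    return system_price / tps

cost_1987 = dollars_per_tps(14_000_000, 256)  # ~54,700 $/tps (Tandem machine room)
cost_1997 = dollars_per_tps(50_000, 1250)     # 40 $/tps (one assumed 50 k$ box)

print(cost_1987, cost_1997, cost_1987 / cost_1997)  # ratio well over 1,000x
```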
71 What Happened? l Moore's law: things get 4x better every 3 years (applies to computers, storage, and networks) l New Economics (commodity class: price $/mips, software k$/year): –mainframe: 10,… –minicomputer: … –microcomputer: 10, 1 l GUI: human-computer tradeoff: optimize for people, not computers [Chart: price vs time for mainframe, mini, micro]
72 What Happens Next l Last 10 years: 1,000x improvement l Next 10 years: ???? l Today: text and image servers are free –25 $/hit => advertising pays for them l Future: video, audio, … servers are free l You ain't seen nothing yet! [Chart: performance]
73 Smart Cards l Then (1979): Bull CP8 two-chip card, first public demonstration 1979 l Now (1997): EMV card with dynamic authentication (EMV = Europay, MasterCard, Visa standard); door key, vending machines, photocopiers l Courtesy of Dennis Roberson, NCR.
74 Smart Card Memory Capacity l Cards will be able to store data (e.g. medical), books, movies,… money l ~KB today but growing super-exponentially [Chart: memory size (bits): 3 K, 10 K, 1 M, 300 M; "you are here"] l Source: PIN/Card-Tech; courtesy of Dennis Roberson, NCR