Presentation on theme: "What Happens When Processing Storage Bandwidth are Free and Infinite?"— Presentation transcript:
1 What Happens When Processing Storage Bandwidth are Free and Infinite?
Jim Gray, Microsoft Research
2 Outline
Hardware CyberBricks: all nodes are very intelligent
Software CyberBricks: standard way to interconnect intelligent nodes
What next? Processing migrates to where the power is.
Disk, network, and display controllers have full-blown OSes.
Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them.
The computer is a federated distributed system.
3 A Hypothetical Question: Taking Things to the Limit
Moore's law (100x per decade):
Exa-instructions per second in 30 years
Exa-bit memory chips
Exa-byte disks
Gilder's Law of the Telecosm (3x/year more bandwidth: ~60,000x per decade!):
40 Gbps per fiber today
4 Grove's Law: Link bandwidth doubles every 100 years!
Not much has happened to telephones lately. Still twisted pair.
5 Gilder's Telecosm Law: 3x bandwidth/year for 25 more years
Today: 10 Gbps per channel; 4 channels per fiber: 40 Gbps; 32 fibers/bundle = 1.2 Tbps/bundle
In lab: 3 Tbps/fiber (400x WDM)
In theory: 25 Tbps per fiber
1 fiber = 25 Tbps; 1 Tbps = USA 1996 WAN bisection bandwidth
6 Thesis: Many Little Beat Few Big
[chart: price vs size, from the $1 million mainframe (14") through the $100 K mini (9") and $10 K micro (5.25", 3.5", 2.5", 1.8") down to nano and pico processors; storage hierarchy from 10-second tape archive, 10-millisecond disc, 10-microsecond RAM, 10-nanosecond RAM to 10-picosecond on-chip RAM; 1 M SPECmarks, 1 TFLOP; 10^6 clocks to bulk RAM; event horizon on chip; VM reincarnated; multi-program cache, on-chip SMP; the "smoking, hairy golf ball"]
How to connect the many little parts? How to program the many little parts? Fault tolerance?
Gray OGI 12/11/97
7 The Year 2000 Commodity PC (a "4B machine"), 1,000 $
Billion instructions/sec processor
.1 Billion bytes RAM
Billion bits/sec LAN/WAN
10 Billion bytes disk
Billion-pixel display (3000 x 3000 x 24)
8 4B PCs: The Bricks of Cyberspace
Cost 1,000 $. Come with:
OS (NT, POSIX, ..), DBMS, high-speed net, system management, GUI / OOUI, tools
Compatible with everyone else: CyberBricks
9 Super Server: 4T Machine
Array of 1,000 4B machines: 1 B ips processors, 1 B B DRAM, 10 B B disks, 1 Bbps comm lines, 1 TB tape robot. A few megabucks.
Challenge: manageability, programmability, security, availability, scaleability, affordability. As easy as a single system.
[diagram: CyberBrick = a 4B machine with CPU, 5 GB RAM, 50 GB disc]
Future servers are CLUSTERS of processors and discs. Distributed database techniques make clusters work.
10 Functionally Specialized Cards
A card = P mips processor + M MB DRAM + ASIC, specialized for storage, network, or display.
Today: P = 50 mips, M = 2 MB
In a few years: P = 200 mips, M = 64 MB
11 It's Already True of Printers: Peripheral = CyberBrick
You buy a printer. You get:
several network interfaces
a PostScript engine (cpu, memory, software)
a spooler (soon)
and... a print engine.
12 System On A Chip
Integrate processing with memory on one chip:
chip is 75% memory now
1 MB cache >> 1960 supercomputers
256 Mb memory chip is 32 MB!
IRAM, CRAM, PIM, ... projects abound
Integrate networking with processing on one chip:
system bus is a kind of network
ATM, FiberChannel, Ethernet, .. logic on chip
direct IO (no intermediate bus)
Functionally specialized cards shrink to a chip.
13 All Device Controllers will be Cray 1's
TODAY: the disk controller is a 10 mips risc engine with 2 MB DRAM; the NIC is similar power.
SOON: they will become 100 mips systems with 100 MB DRAM.
They are nodes in a federation (can run Oracle on NT in the disk controller).
Advantages: uniform programming model, great tools, security, economics (CyberBricks), move computation to data (minimize traffic).
[diagram: central processor & memory on a Tera Byte backplane]
14 With Tera Byte Interconnect and Super Computer Adapters
Processing is incidental to networking, storage, UI.
The disk controller/NIC is faster than the device, close to the device, and can borrow the device package & power.
So use the idle capacity for computation: run the app in the device.
[diagram: Tera Byte backplane]
15 Implications
Conventional: offload device handling to NIC/HBA; higher-level protocols: I2O, NASD, VIA, ...; SMP and cluster parallelism is important.
Radical: move the app to the NIC/device controller; higher-higher-level protocols: CORBA / DCOM; cluster parallelism is VERY important.
[diagram: central processor & memory on a Tera Byte backplane]
16 How Do They Talk to Each Other?
Each node has an OS.
Each node has local resources: a federation.
Each node does not completely trust the others.
Nodes use RPC to talk to each other: CORBA? DCOM? IIOP? RMI? One or all of the above.
Huge leverage in high-level interfaces.
Same old distributed-system story.
[diagram: applications talk via RPC, streams, or datagrams over VIAL/VIPL over the wire(s)]
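The federation idea above can be sketched in a few lines. A minimal, hypothetical sketch (the `Node` and `Federation` classes are illustrative, not from the talk): each node exports a few high-level operations and refuses everything else, and nodes reach one another only through those exported interfaces, never through each other's internals.

```python
# Minimal sketch of "intelligent nodes" in a federation. Each node owns
# local resources and exposes only the operations it chooses to export;
# all cross-node calls go through a generic RPC-style dispatcher.

class Node:
    """A node with local resources and a set of exported operations."""
    def __init__(self, name):
        self.name = name
        self._exports = {}

    def export(self, op_name, fn):
        self._exports[op_name] = fn

    def invoke(self, op_name, *args):
        # Nodes do not completely trust each other: only exported
        # operations may be called.
        if op_name not in self._exports:
            raise PermissionError(f"{self.name} does not export {op_name}")
        return self._exports[op_name](*args)

class Federation:
    """Nodes find each other by name and talk only through exported ops."""
    def __init__(self):
        self._nodes = {}

    def register(self, node):
        self._nodes[node.name] = node

    def call(self, node_name, op_name, *args):
        return self._nodes[node_name].invoke(op_name, *args)

fed = Federation()
disk = Node("disk-controller")
disk.export("read_block", lambda n: f"block-{n}")
fed.register(disk)
print(fed.call("disk-controller", "read_block", 7))  # block-7
```

The leverage of high-level interfaces shows up in the dispatcher: clients name an operation, not a memory address, so the implementation behind `read_block` can move (into the disk controller, onto another node) without touching callers.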
17 Outline
Hardware CyberBricks: all nodes are very intelligent
Software CyberBricks: standard way to interconnect intelligent nodes
What next? Processing migrates to where the power is.
Disk, network, and display controllers have full-blown OSes.
Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them.
The computer is a federated distributed system.
18 Objects! It's a zoo
ORBs, COM, CORBA, ..
Object Relational Databases
Objects and 3-tier computing
19 History and Alphabet Soup
Microsoft DCOM is based on OSF-DCE technology; DCOM and ActiveX extend it.
[timeline, 1985-1995: Open Software Foundation (OSF) DCE (RPC, GUIDs, IDL, Kerberos, DNS) leading to COM; UNIX International, Solaris, NT, X/Open, the Open Group; Object Management Group (OMG) CORBA; ODBC, XA / TX]
20 The Promise
Objects are software CyberBricks: a productivity breakthrough (plug-ins) and a manageability breakthrough (modules).
Microsoft promises Cairo: distributed objects; secure, transparent, fast invocation.
IBM/Sun/Oracle/Netscape promise CORBA + OpenDoc + Java Beans + ...
All will deliver; customers can pick the best one.
Both camps share key goals:
Encapsulation: hide implementation
Polymorphism: generic ops, key to GUI and reuse
Uniform naming
Discovery: finding a service
Fault handling: transactions
Versioning: allow upgrades
Transparency: local/remote
Security: who has authority
Shrink-wrap: minimal inheritance
Automation: easy
21 The OLE-COM Experience: It works!!
Macintosh had Publish & Subscribe.
PowerPoint needed graphs: plugged MS Graph in as a component.
Office adopted OLE: one graph program for all of Office.
The Internet arrived: URLs are object references, so Office was Web-enabled right away!
Office97 is smaller than Office95 because of shared components.
22 Linking And Embedding
Objects are data modules; transactions are execution modules.
Link: a pointer to an object somewhere else (think URL in the Internet).
Embed: the bytes are here.
Objects may be active; they can call back to subscribers.
23 Objects Meet Databases: the basis for universal data servers, access, & integration
Object-oriented (COM-oriented) interface to data.
Breaks the DBMS into components.
Anything can be a data source.
Optimization/navigation "on top of" other data sources.
Makes an RDBMS an O-R DBMS, assuming the optimizer understands objects.
[diagram: DBMS engine over database, spreadsheet, photos, mail, map, document]
24 The BIG Picture: Components and transactions
Software modules are objects.
An Object Request Broker (a.k.a. Transaction Processing Monitor) connects objects (clients to servers).
Standard interfaces allow software plug-ins.
A transaction ties the execution of a "job" into an atomic unit: all-or-nothing, durable, isolated.
ActiveX components are a 250 M$/year business.
25 Object Request Broker (ORB): Orchestrates RPC
Registers servers; manages pools of servers; connects clients to servers.
Does naming and request-level authorization.
Provides transaction coordination; direct and queued invocation.
Old names: Transaction Processing Monitor, Web server, NetWare.
26 The OO Points So Far
Objects are software CyberBricks.
Object interconnect standards are emerging.
CyberBricks become federated systems.
Next points: put processing close to data; do parallel processing.
27 Three Tier Computing
Clients do presentation, gather input.
Clients do some workflow (Xscript).
Clients send high-level requests to the ORB.
The ORB dispatches workflows and business objects, the proxies for the client, which orchestrate flows & queues.
Server-side workflow scripts call on distributed business objects to execute the task.
[diagram: presentation, workflow, business objects, database]
29 Transaction Processing Evolution to Three Tier: intelligence migrated to the clients
Mainframe batch processing (centralized): cards.
Dumb terminals & Remote Job Entry: green-screen 3270.
Intelligent terminals, database back-ends.
Workflow systems, Object Request Brokers, application generators: ActiveX clients, TP monitor / ORB servers.
30 Web Evolution to Three Tier: intelligence migrated to the clients (like TP)
Character-mode clients, smart servers: archie, gopher, WAIS, green screen.
GUI browsers, Web file servers: Mosaic.
GUI plug-ins, Web dispatchers, CGI: NS & IE.
Smart clients, Web dispatcher (ORB), pools of app servers (ISAPI, Viper), workflow scripts at client & server: ActiveX.
31 PC Evolution to Three Tier: intelligence migrated to the server
Stand-alone PC (centralized).
PC + file & print server: a message per I/O (IO request / reply, disk I/O).
PC + database server: a message per SQL statement.
PC + app server: a message per transaction.
ActiveX client, ORB, ActiveX server, Xscript.
32 Why Did Everyone Go To Three-Tier?
Manageability: business rules must be with the data; middleware operations tools.
Performance (scaleability): server resources are precious; the ORB dispatches requests to server pools.
Technology & physics: put UI processing near the user; put shared-data processing near the shared data; minimize data moves.
Encapsulation / modularity.
[diagram: presentation, workflow, business objects, database]
33 Why Put Business Objects at the Server?
MOM's business objects: the customer comes to the store with a list and gives the list to the clerk; the clerk gets the goods and makes the invoice; the customer pays the clerk and gets the goods. Easy to manage: the clerk controls access. Encapsulation.
DAD's raw data: the customer comes to the store, takes what he wants, fills out an invoice, and leaves money for the goods. Easy to build: no clerks.
34 The OO Points So Far
Objects are software CyberBricks.
Object interconnect standards are emerging.
CyberBricks become federated systems.
Put processing close to data.
Next point: do parallel processing.
35 Parallelism: the OTHER Half of Super-Servers
Clusters of machines allow two kinds of parallelism:
Many little jobs: online transaction processing (TPC-A, B, C, ...)
A few big jobs: data search & analysis (TPC-D, DSS, OLAP)
Both give automatic parallelism.
36 Why Parallel Access To Data? BANDWIDTH
At 10 MB/s: 1.2 days to scan a terabyte.
1,000x parallel: a 100-second SCAN.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
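The slide's bandwidth arithmetic can be checked directly. A quick sketch (1 TB and 10 MB/s are the slide's numbers; the variable names are ours):

```python
# Scanning 1 TB at 10 MB/s serially, versus with 1,000-way parallelism.

TB = 10**12            # bytes to scan
MB_PER_S = 10 * 10**6  # one disk's sequential scan rate, bytes/s

serial_seconds = TB / MB_PER_S            # 100,000 s
parallel_seconds = serial_seconds / 1000  # 1,000 disks scanning at once

print(f"serial:   {serial_seconds / 86400:.1f} days")  # 1.2 days
print(f"parallel: {parallel_seconds:.0f} seconds")     # 100 seconds
```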
37 Kinds of Parallel Execution
[diagram: pipeline parallelism, where one sequential program's output feeds the next sequential program; and partition parallelism, where inputs merge M ways, outputs split N ways, and any sequential program runs on each partition]
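A toy contrast of the two kinds of parallel execution from the diagram. Sequential Python generators stand in for operators here; a real system would run the pipeline stages and the partitions concurrently.

```python
# Pipeline vs partition parallelism, sketched with generators.

def scan(rows):               # stage 1: produce tuples
    yield from rows

def filter_even(stream):      # stage 2: consumes stage 1's output as it arrives
    for r in stream:
        if r % 2 == 0:
            yield r

# Pipeline: scan -> filter, one tuple at a time through both stages.
pipeline_result = list(filter_even(scan(range(10))))

# Partition: split the input N ways, run the SAME operators on each
# partition, then merge the outputs.
def split(rows, n):
    parts = [[] for _ in range(n)]
    for i, r in enumerate(rows):
        parts[i % n].append(r)   # round-robin split
    return parts

partition_result = []
for part in split(range(10), 3):
    partition_result.extend(filter_even(scan(part)))  # merge outputs

print(sorted(pipeline_result) == sorted(partition_result))  # True
```

Both plans compute the same answer; partitioning just lets N copies of the sequential program run at once.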
38 Why are Relational Operators Successful for Parallelism?
The relational data model: uniform operators on uniform data streams, closed under composition.
Each operator consumes 1 or 2 input streams.
Each stream is a uniform collection of data.
Sequential data in and out: pure dataflow.
Partitioning some operators (e.g. aggregates, non-equi-join, sort, ..) requires innovation.
AUTOMATIC PARALLELISM
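The "partitioning aggregates requires innovation" point is usually solved with the two-phase trick: compute a partial aggregate per partition, then merge the partials. A minimal sketch with SUM (any commutative, associative combiner partitions the same way; the function names are ours):

```python
# Two-phase parallel aggregation: per-partition partials, then a merge.
from functools import reduce

def parallel_sum(partitions):
    partials = [sum(p) for p in partitions]          # phase 1: local, parallel
    return reduce(lambda a, b: a + b, partials, 0)   # phase 2: merge partials

data = list(range(100))
parts = [data[0::4], data[1::4], data[2::4], data[3::4]]  # 4-way partition
print(parallel_sum(parts) == sum(data))  # True
```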
39 Database Systems "Hide" Parallelism
Automate system management via tools: data placement, data organization (indexing), periodic tasks (dump / recover / reorganize).
Automatic fault tolerance: duplex & failover, transactions.
Automatic parallelism: among transactions (locking), within a transaction (parallel execution).
40 SQL: a Non-Procedural Programming Language
SQL is a functional programming language that describes the answer set.
The optimizer picks the best execution plan: the data-flow web (pipeline), the degree of parallelism (partitioning), and other execution parameters (process placement, memory, ...).
[diagram: GUI and schema feed the optimizer's plan to an execution monitor, which runs executors connected by rivers]
41 Automatic Data Partitioning
Split a SQL table across a subset of the nodes & disks.
Partition within the set by:
Range: good for equijoins, range queries, group-by.
Hash: good for equijoins.
Round robin: good to spread load.
Shared disk and shared memory are less sensitive to partitioning; shared nothing benefits from "good" partitioning.
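The three placement strategies reduce to three tiny routing functions. A hypothetical sketch (the function names and the boundary list are illustrative, not from any particular system):

```python
# Range, hash, and round-robin partitioning of keys across N nodes.
import bisect

def range_partition(key, boundaries):
    """Good for equijoins, range queries, group-by: nearby keys co-locate."""
    return bisect.bisect_left(boundaries, key)

def hash_partition(key, n_nodes):
    """Good for equijoins: equal keys always land on the same node."""
    return hash(key) % n_nodes

def round_robin_partition(seq_no, n_nodes):
    """Good to spread load: ignores the key entirely."""
    return seq_no % n_nodes

boundaries = [100, 200, 300]   # 4 nodes: (-inf,100), [100,200), [200,300), [300,inf)
print(range_partition(150, boundaries))   # 1
print(round_robin_partition(7, 4))        # 3
```

Range keeps related keys together (so a range query touches few nodes); hash and round robin trade that locality for even load, which is why a shared-nothing optimizer cares which one a table uses.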
42 N x M way Parallelism: N inputs, M outputs, no bottlenecks.
43 Parallel Objects?
How does all this DB parallelism connect to hardware/software CyberBricks?
To scale to large client sets you need lots of independent parallel execution. That comes for free from the ORB.
To scale to large data sets you need intra-program parallelism (like parallel DBs). That requires some invention.
44 Outline
Hardware CyberBricks: all nodes are very intelligent
Software CyberBricks: standard way to interconnect intelligent nodes
What next? Processing migrates to where the power is.
Disk, network, and display controllers have full-blown OSes.
Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them.
The computer is a federated distributed system.
Parallel execution is important.
45 MORE SLIDES but there is only so much time. Too bad
46 The Disk Farm On a Card
The 100 GB disc card: an array of discs (14").
Can be used as 100 discs, 1 striped disc, 10 fault-tolerant discs, .... etc.
LOTS of accesses/second and bandwidth.
Processors are cheap; it's the peripherals that cost ya (a 10 k$ disc card).
47 Parallelism: Performance is the Goal
The goal is to get 'good' performance: trade time for money.
Law 1: a parallel system should be faster than the serial system.
Law 2: a parallel system should give near-linear scaleup or near-linear speedup or both.
Parallel DBMSs obey these laws.
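The two laws are usually stated as metrics: speedup (the same job on more hardware) and scaleup (an N-times-bigger job on N-times-more hardware). A small sketch with illustrative timings:

```python
# Speedup and scaleup, the two "laws" of parallel performance.

def speedup(t_serial, t_parallel):
    """How much faster the same job runs on the parallel system."""
    return t_serial / t_parallel

def scaleup(t_small_on_small, t_big_on_big):
    """1.0 means an N-times-bigger job on N-times-more hardware
    takes the same elapsed time: linear scaleup."""
    return t_small_on_small / t_big_on_big

print(speedup(1000.0, 12.5))   # 80.0 (near-linear on, say, 100 processors)
print(scaleup(60.0, 60.0))     # 1.0  (perfect, linear scaleup)
```

Near-linear means speedup close to the processor count and scaleup close to 1.0; queuing, skew, and coordination are what pull real systems below those lines.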
48 Success Stories
Online transaction processing (many little jobs): SQL systems support 50 k tpm-C (44 cpu, 600 disk, 2 node).
Batch, decision support and utility (a few big jobs, parallelism inside): scan data at 100 MB/s.
Linear scaleup to 1,000 processors (transactions/sec and records/sec vs hardware).
49 The New Law of Computing
Grosch's Law: 2x $ is 4x performance (1 MIPS for 1 $; 1,000 MIPS for 32 $, i.e. .03 $/MIPS).
Parallel Law: 2x $ is 2x performance (1 MIPS for 1 $; 1,000 MIPS for 1,000 $).
Needs linear speedup and linear scaleup. Not always possible.
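The contrast is easy to state as formulas: under Grosch's law performance grows as the square of cost, under the parallel law it grows linearly. In this sketch the constants are chosen to reproduce the slide's numbers (1 $ buys 1 MIPS at the low end), not taken from any real price list:

```python
# Grosch's law (performance ~ cost squared) vs the parallel law
# (performance ~ cost), normalized so 1 $ buys 1 MIPS.

def grosch_mips(dollars):
    return dollars ** 2   # 2x money -> 4x performance

def parallel_mips(dollars):
    return dollars        # 2x money -> 2x performance

print(grosch_mips(32))     # 1024: ~1,000 MIPS for 32 $ (~.03 $/MIPS)
print(parallel_mips(1000)) # 1000: 1,000 MIPS for 1,000 $
```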
50 Clusters Being Built
Teradata: 1,000 nodes (30 k$/slice)
Tandem, VMScluster: 150 nodes (100 k$/slice)
Intel: 9,000 nodes @ 55 M$ (6 k$/slice)
IBM: 512-node ASCI @ 100 M$ (200 k$/slice)
Teradata, Tandem, DEC moving to NT + low slice price
PC clusters (bare-handed) at dozens of nodes: web servers (msn, PointCast, ...), DB servers
THE KEY TECHNOLOGY HERE IS THE APPS: apps distribute data, apps distribute execution.
51 Great Debate: Shared What? SMP or Cluster?
Shared memory (SMP): easy to program, difficult to build, difficult to scale up. Sequent, SGI, Sun.
Shared disk: VMScluster, Sysplex.
Shared nothing (network): hard to program, easy to build, easy to scale up. Tandem, Teradata, SP2.
The winner will be a synthesis of these ideas.
Distributed shared memory (DASH, Encore) blurs the distinction between network and bus (locality is still important), but gives shared memory at message cost.
52 BOTH SMP and Cluster?
Grow up with SMP: 4xP6 is now standard.
Grow out with cluster: a cluster has inexpensive parts.
[diagram: cluster of PCs]
53 Clusters Have Advantages
Clients and servers are made from the same stuff.
Inexpensive: built with commodity components.
Fault tolerance: spare modules mask failures.
Modular growth: grow by adding small modules.
54 Meta-Message: Technology Ratios Are Important
If everything gets faster & cheaper at the same rate, THEN nothing really changes.
Things getting MUCH BETTER: communication speed & cost (1,000x), processor speed & cost (100x), storage size & cost (100x).
Things staying about the same: speed of light (more or less constant), people (10x more expensive), storage speed (only 10x better).
55 Storage Ratios Changed
10x better access time
10x more bandwidth
4,000x lower media price
DRAM/disk price ratio: 100:1 to 10:1 to 50:1
56 Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[two charts vs access time (10^-9 to 10^3 seconds): typical system capacity (bytes) and price ($/MB) for cache, main memory, secondary (disc), online tape, nearline tape, and offline tape]
57 Network Speeds
The speed of light did not change, but link bandwidth grew 60% / year.
WAN speeds are limited by politics: if voice is X $/minute, how much is video?
Gbps to the desktop today! A 10 Gbps channel is coming. 3 Tbps fibers in the laboratory through parallelism (WDM).
Paradox: the WAN link has 40 Gbps; the processor bus is Gbps-scale.
[chart: processors (i/s) and LANs & WANs (b/s), 1e3 to 1e9, 1960-2000]
58 MicroProcessor Speeds Went Up
Clock rates went from 10 KHz to 400 MHz.
Processors are now 6x issue.
SPECInt fits in cache, so it tracks cpu speed.
Peak Advertised Performance (PAP) is 1.2 BIPS; Real Application Performance (RAP) is 100 MIPS.
Similar curves for DEC VAX & Alpha, HP PA, IBM RS/6000 / PowerPC, MIPS & SGI, SUN.
[chart: Intel microprocessor speeds (mips), 0.1 to 1000, 1980-2000: 8088, 286, 386, 486, Pentium, P6; source: Intel]
59 Performance = Storage Accesses, not Instructions Executed
In the "old days" we counted instructions and IOs. Now we count memory references: processors wait most of the time.
[chart: where the time goes, clock ticks used by AlphaSort components: sort, disc wait, OS, memory wait, D-cache miss, I-cache miss, B-cache data miss]
AlphaSort runs at 70 MIPS; "real" apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not.
60 Storage Latency: How Far Away is the Data?
61 Tape Farms for Tertiary Storage, Not Mainframe Silos
One 10 K$ robot: 14 tapes, 500 GB, 5 MB/s, 20 $/GB, 30 Maps; 27 hr scan.
100 robots: 1 M$, 50 TB, 50 $/GB, 3K Maps; scan in 27 hours.
Many independent tape robots (like a disc farm).
62 The Metrics: Disk and Tape Farms Win
[chart, log scale 0.01 to 1,000,000: GB/K$, Kaps, Maps, and SCANS/day for a 1000x disc farm, an STC tape robot, and a 100x DLT tape farm (6,000 tapes, 8 readers)]
Data Motel: data checks in, but it never checks out.
63 Tape & Optical: Beware of the Media Myth
Optical is "cheap": 200 $/platter, 2 GB/platter => 100 $/GB (2x cheaper than disc).
Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc).
64 Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot (10 k$ and up): tapes at 20 GB each => 20 $/GB and up (1x...10x cheaper than disc).
Optical needs a robot (100 k$): 100 platters = 200 GB (TODAY) => 400 $/GB (more expensive than mag disc).
Robots have poor access times. Not good for the Library of Congress (25 TB).
Data motel: data checks in but it never checks out!
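The media-myth arithmetic generalizes to a one-line cost model: system $/GB = (robot + media cost) / total capacity. A sketch using the slide's tape prices; the 14-tape, 10 K$ robot configuration is borrowed from the earlier tape-farm slide:

```python
# Media price vs system price per GB: the robot dominates small configs.

def system_cost_per_gb(robot_dollars, n_tapes, tape_dollars=30, gb_per_tape=20):
    total_gb = n_tapes * gb_per_tape
    return (robot_dollars + n_tapes * tape_dollars) / total_gb

media_only = 30 / 20                          # 1.5 $/GB: the "media myth"
with_robot = system_cost_per_gb(10_000, 14)   # small robot, 14 tapes
print(f"{media_only:.1f} $/GB media only")    # 1.5 $/GB
print(f"{with_robot:.1f} $/GB with robot")    # ~37 $/GB
```

Amortizing the same robot over more tapes pulls the system price back toward the media price, which is exactly the tape-farm argument.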
65 The Access Time Myth
The myth: seek or pick time dominates.
The reality: (1) queuing dominates, (2) transfer dominates for BLOBs, (3) disk seeks are often short.
Implication: many cheap servers are better than one fast expensive server: shorter queues, parallel transfer, lower cost/access and cost/byte.
This is now obvious for disk arrays. It will be obvious for tape arrays.
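The "queuing dominates" claim can be illustrated with the standard M/M/1 mean response time, T = S / (1 - rho): response time explodes as utilization rho nears 1, which is why many lightly loaded cheap servers beat one heavily loaded fast one. (The 10 ms access time and the utilizations below are illustrative, not from the slide.)

```python
# M/M/1 mean response time: service time inflated by queuing delay.

def mm1_response_time(service_time_s, utilization):
    assert 0 <= utilization < 1
    return service_time_s / (1.0 - utilization)

seek = 0.010  # 10 ms device access time
print(f"{mm1_response_time(seek, 0.90):.3f} s at 90% load")  # 0.100 s: 10x the seek
print(f"{mm1_response_time(seek, 0.45):.3f} s at 45% load")  # 0.018 s
```

At 90% utilization the queue multiplies the 10 ms access by 10x; splitting the same load across two servers at 45% each nearly recovers the raw device time.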
66 Billions Of Clients
Every device will be "intelligent": doors, rooms, cars, ...
Computing will be ubiquitous.
67 Billions Of Clients Need Millions Of Servers
All clients are networked to servers; clients may be nomadic or on-demand.
Fast clients want faster servers.
Servers provide shared data, control, coordination, communication.
[diagram: mobile and fixed clients connected to servers and superservers]
68 1987: 256 tps Benchmark
A 14 M$ computer (Tandem): a 32-node processor array, a 40 GB disk array (80 drives), false floor, 2 rooms of machines; simulated 25,600 clients.
A dozen people: manager, admin expert, hardware experts, auditor, network expert, performance expert, OS expert, DB expert.
69 1988: DB2 + CICS Mainframe, 65 tps
IBM 4391, a 2 M$ computer: refrigerator-sized CPU, 2 x 3725 network controllers, 16 GB disk farm (4 x 8 x .5 GB).
Simulated a network of 800 clients. A staff of 6 to do the benchmark.
70 1997: 10 Years Later. 1 Person and 1 Box = 1,250 tps
1 breadbox ~ 5x the 1987 machine room; 23 GB is hand-held.
4 x 200 MHz cpu, 1/2 GB DRAM, 12 x 4 GB disk (3 x 7 x 4 GB disk arrays).
One person does all the work: hardware expert, OS expert, net expert, DB expert, app expert.
Cost/tps is 1,000x less: 25 micro-dollars per transaction.
71 What Happened?
Moore's law: things get 4x better every 3 years (applies to computers, storage, and networks).
New economics: commodity pricing. [table: class, price/mips, software $/mips, k$/year, for mainframe, minicomputer, microcomputer]
GUI: the human/computer tradeoff: optimize for people, not computers.
[chart: price vs time for mainframe, mini, micro]
72 What Happens Next?
Last 10 years: 1,000x improvement. Next 10 years: ????
[chart: performance, 1985-2005]
Today: text and image servers are free (25 m$/hit => advertising pays for them).
Future: video, audio, ... servers are free. "You ain't seen nothing yet!"
73 Smart Cards
Then (1979): Bull CP8 two-chip card; first public demonstration 1979.
Now (1997): EMV card with dynamic authentication (EMV = Europay, MasterCard, Visa standard); door key, vending machines, photocopiers.
Courtesy of Dennis Roberson, NCR.
74 Smart Card Memory Capacity: 16 KB today, but growing super-exponentially
[chart: memory size (bits), 3 K to 300 M, 1990-2004, "You are here"; source: PIN/Card-Tech, courtesy of Dennis Roberson, NCR]
Applications: cards will be able to store data (e.g. medical), books, movies, ... money.
Speaker notes: One of the factors limiting smart card deployment is the limited memory size that can be stored on the card. Smart cards today, with 3 to 10 kilobytes of storage, have advantages over magnetic-stripe cards but are limited in their ability to carry massive amounts of application data. As memory costs continue to improve and miniaturization of the chips continues, smart cards will move to a few hundred megabytes, giving them the ability to store sufficient data to perform practical applications. The past ten years of smart card evolution have taught us that no single application is sufficiently strong to drive market acceptance. Multifunction cards with massive memory capabilities can and will change that.