
1 Building PetaByte Servers
Jim Gray, Microsoft Research
Kilo 10^3   Mega 10^6   Giga 10^9   Tera 10^12 (today, we are here)   Peta 10^15   Exa 10^18

2 Outline
The challenge: Building GIANT data stores
–for example, the EOS/DIS 15 PB system
Conclusion 1:
–Think about MOX and SCANS
Conclusion 2:
–Think about Clusters
–SMP report
–Cluster report

3 The Challenge -- EOS/DIS
Antarctica is melting -- 77% of fresh water liberated
–sea level rises 70 meters
–Chico & Memphis are beach-front property
–New York, Washington, SF, LA, London, Paris
Let's study it! Mission to Planet Earth
EOS: Earth Observing System (17 B$ => 10 B$)
–50 instruments on 10 satellites
–Landsat (added later)
EOS DIS: Data Information System:
–3-5 MB/s raw, MB/s processed
–4 TB/day
–15 PB by year 2007

4 The Process Flow
Data arrives and is pre-processed.
–instrument data is calibrated, gridded, averaged
–geophysical data is derived
Users ask for stored data OR ask to analyze and combine data.
Can make the pull-push split dynamically.
(diagram: Pull Processing, Push Processing, Other Data)

5 Designing EOS/DIS
Expect that millions will use the system (online). Three user categories:
–NASA: funded by NASA to do science
–Global Change (10 K): other dirt bags
–Internet (20 M): everyone else, e.g. grain speculators, Environmental Impact Reports
New applications => discovery & access must be automatic
Allow anyone to set up a peer node (DAAC & SCF)
Design for ad hoc queries, not standard data products.
If push is 90%, then 10% of data is read (on average).
=> A failure: no one uses the data; in DSS, push is 1% or less.
=> Computation demand is enormous (pull:push is 100:1).

6 Obvious Points: EOS/DIS will be a cluster of SMPs
It needs 16 PB of storage
–= 1 M disks in current technology
–= 500 K tapes in current technology
It needs 100 TeraOps of processing
–= 100 K processors (current technology)
–and ~100 Terabytes of DRAM
1997 requirements are 1000x smaller
–smaller data rate
–almost no re-processing work
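
These device counts are simple division; here is a back-of-envelope check in Python, assuming roughly 16 GB disks, 32 GB tapes, and 1 GOPS processors (circa-1997 figures the counts imply; the per-device sizes are not stated on the slide).

```python
# Back-of-envelope check of the slide's device counts.
# Assumed per-device capacities are illustrative, not from the slide.
PB = 10**15

storage_need  = 16 * PB        # 16 PB of storage
disk_capacity = 16 * 10**9     # ~16 GB per disk (assumption)
tape_capacity = 32 * 10**9     # ~32 GB per tape (assumption)

ops_need    = 100 * 10**12     # 100 TeraOps
ops_per_cpu = 10**9            # ~1 GOPS per processor (assumption)

print(f"disks:      {storage_need / disk_capacity:,.0f}")   # ~1,000,000
print(f"tapes:      {storage_need / tape_capacity:,.0f}")   # ~500,000
print(f"processors: {ops_need / ops_per_cpu:,.0f}")         # ~100,000
```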

7 The Architecture
2+N data center design
Scaleable OR-DBMS
Emphasize Pull vs Push processing
Storage hierarchy
Data Pump
Just-in-time acquisition

8 2+N Data Center Design
Duplex the archive (for fault tolerance)
Let anyone build an extract (the +N)
Partition data by time and by space (store 2 or 4 ways).
Each partition is a free-standing OR-DBMS (similar to Tandem, Teradata designs).
Clients and partitions interact via standard protocols
–OLE-DB, DCOM/CORBA, HTTP, …
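
A minimal sketch of the time-and-space partitioning idea, in Python. The monthly time buckets and 10-degree spatial grid are invented for illustration; the slide does not specify granularities.

```python
# Route an observation to a partition key by time and space.
# Bucket sizes (monthly, 10-degree grid) are illustrative assumptions.
from datetime import datetime

def partition_key(timestamp: datetime, lat: float, lon: float) -> tuple:
    """Map an observation to (time_bucket, space_bucket)."""
    time_bucket = (timestamp.year, timestamp.month)      # one bucket per month
    space_bucket = (int(lat // 10), int(lon // 10))      # 10-degree grid cells
    return (time_bucket, space_bucket)

# Each key would map to a free-standing partition, stored 2 (or 4) ways
# across the duplexed data centers for fault tolerance.
key = partition_key(datetime(2007, 6, 1), lat=-77.8, lon=166.7)
print(key)   # ((2007, 6), (-8, 16))
```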

9 Hardware Architecture
2 Huge Data Centers
Each has 50 to 1,000 nodes in a cluster
–Each node has about 25…250 TB of storage
–SMP: 0.5 Bips to 50 Bips, 20 K$
–DRAM: 50 GB to 1 TB, 50 K$
–100 disks: 2.3 TB to 230 TB, 200 K$
–10 tape robots: 25 TB to 250 TB, 200 K$
–2 interconnects: 1 GBps to 100 GBps, 20 K$
Node costs 500 K$
Data Center costs 25 M$ (capital cost)
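
The node and data-center figures roll up from the per-component prices listed above; a small Python check (taking the 50-node low end of the range for the 25 M$ figure):

```python
# Roll up node cost from the slide's component prices (in K$).
components_k = {
    "SMP (0.5 to 50 Bips)":        20,
    "DRAM (50 GB to 1 TB)":        50,
    "100 disks (2.3 to 230 TB)":  200,
    "10 tape robots (25-250 TB)": 200,
    "2 interconnects":             20,
}

node_cost_k = sum(components_k.values())
print(f"node cost: {node_cost_k} K$")              # 490 K$, i.e. the slide's ~500 K$

nodes_per_center = 50                               # low end of 50..1,000
center_cost_m = nodes_per_center * node_cost_k / 1000
print(f"data center: ~{center_cost_m:.1f} M$")      # 24.5, i.e. the slide's ~25 M$
```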

10 Scaleable OR-DBMS
Adopt cluster approach (Tandem, Teradata, VMScluster, DB2/PE, Informix, ...)
System must scale to many processors, disks, links
OR-DBMS based on standard object model
–CORBA or DCOM (not vendor specific)
Grow by adding components
System must be self-managing

11 Storage Hierarchy
Cache hot 10% (1.5 PB) on disk.
Keep cold 90% on near-line tape.
Remember recent results on speculation.
(diagram: 15 PB of tape robot (4x1,000 robots), 1 PB of disk (10,000 drives), 10-TB RAM, 500 nodes)
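
The diagram's totals and device counts imply per-device sizes; a quick Python check that just divides one by the other:

```python
# Implied per-device sizes from the hierarchy diagram on the slide.
PB, TB, GB = 10**15, 10**12, 10**9

disk_total, n_drives = 1 * PB, 10_000
tape_total, n_robots = 15 * PB, 4 * 1_000
ram_total,  n_nodes  = 10 * TB, 500

print(f"per drive: {disk_total / n_drives / GB:.0f} GB")    # ~100 GB
print(f"per robot: {tape_total / n_robots / TB:.2f} TB")    # ~3.75 TB
print(f"per node:  {ram_total / n_nodes / GB:.0f} GB RAM")  # ~20 GB
```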

12 Data Pump
Some queries require reading ALL the data (for reprocessing).
Each Data Center scans the data every 2 weeks.
–Data rate 10 PB/day = 10 TB/node/day = 120 MB/s
Compute on demand small jobs:
–less than 1,000 tape mounts
–less than 100 M disk accesses
–less than 100 TeraOps
–(less than 30 minute response time)
For BIG JOBS scan entire 15 PB database.
Queries (and extracts) snoop this data pump.
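
The per-node bandwidth is straight arithmetic; a quick Python check (the ~1,000-node count is what 10 PB/day over 10 TB/node/day implies; it is not stated on the slide):

```python
# Check the data pump bandwidth arithmetic from the slide.
TB = 10**12
MB = 10**6
seconds_per_day = 24 * 3600

per_node_per_day = 10 * TB                       # 10 TB/node/day
rate = per_node_per_day / seconds_per_day / MB   # MB/s
print(f"{rate:.0f} MB/s per node")               # ~116 MB/s, i.e. the slide's ~120 MB/s

# 10 PB/day over 10 TB/node/day implies ~1,000 nodes sharing the scan.
print(10_000 * TB // per_node_per_day, "nodes")  # 1000
```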

13 Just-in-Time Acquisition
Hardware prices decline 20%-40%/year
So buy at last moment
Buy best product that day: commodity
Depreciate over 3 years so that facility is fresh.
(after 3 years, cost is 23% of original)
(chart: 1996 EOS DIS Disk Storage Size and Cost; Storage Cost in M$ vs Data Need in TB, assuming 40% price decline/year; with a 60% decline, cost peaks at ~10 M$)
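
The "23% of original after 3 years" figure is compound price decline; a small Python check (0.6 cubed for the 40%/year case, with the 20%/year case for comparison):

```python
# Residual price after n years of compound decline.
def residual(annual_decline: float, years: int) -> float:
    return (1.0 - annual_decline) ** years

print(f"40%/year decline, 3 years: {residual(0.40, 3):.0%}")  # ~22%, the slide's ~23%
print(f"20%/year decline, 3 years: {residual(0.20, 3):.0%}")  # ~51%
```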

14 Problems
HSM
Design and meta-data
Ingest
Data discovery, search, and analysis
Reorg-reprocess
Disaster recovery
Cost

15 Trends: New Applications
The paperless office
Library of Congress online (on your campus)
All information comes electronically: entertainment, publishing, business
Information Network, Knowledge Navigator, Information at Your Fingertips
Multimedia: text, voice, image, video, ...
The Old World:
–Millions of objects
–100-byte objects
The New World:
–Billions of objects
–Big objects (1 MB)

16 What's a Terabyte?
1 Terabyte =
–1,000,000,000 business letters (150 miles of bookshelf)
–100,000,000 book pages (15 miles of bookshelf)
–50,000,000 FAX images (7 miles of bookshelf)
–10,000,000 TV pictures, mpeg (10 days of video)
–4,000 LandSat images
Library of Congress (in ASCII) is 25 TB
1980: 200 M$ of disc (10,000 discs); 5 M$ of tape silo (10,000 tapes)
1994: 1 M$ of magnetic disc (120 discs); 500 K$ of optical disc robot (250 platters); 50 K$ of tape silo (50 tapes)
Terror Byte! 0.1% of a PetaByte!
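
The equivalences imply average object sizes; a quick Python sanity check dividing one terabyte by each count (the bytes-per-object figures are implied averages, not stated on the slide):

```python
# Implied average object sizes behind the "what's a terabyte" equivalences.
TB = 10**12
equivalences = {
    "business letter":  1_000_000_000,
    "book page":          100_000_000,
    "FAX image":           50_000_000,
    "TV picture (mpeg)":   10_000_000,
    "LandSat image":            4_000,
}
for name, count in equivalences.items():
    print(f"{name:18s} ~{TB / count:,.0f} bytes each")
# ~1 KB per letter, ~10 KB per page, ~20 KB per fax,
# ~100 KB per mpeg picture, ~250 MB per LandSat image
```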

17 The Cost of Storage & Access
File cabinet:
–cabinet (4 drawer): 250$
–paper (24,000 sheets): 250$
–space (10$/ft2): 180$
–total: 700$, 3.0 ¢/sheet
Disk:
–disk (9 GB): 2,000$
–ASCII: 5 M pages, 0.04 ¢/sheet (100x cheaper than paper)
–Image: 200 K pages, 1 ¢/sheet (similar to paper)
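
The per-sheet costs divide out directly from the prices above; a small Python check:

```python
# Cost per sheet/page for the storage options on the slide.
cabinet_total  = 250 + 250 + 180          # cabinet + paper + space, in $
cabinet_sheets = 24_000
disk_cost      = 2_000                    # 9 GB disk, in $
ascii_pages    = 5_000_000                # ASCII pages per disk
image_pages    = 200_000                  # image pages per disk

print(f"file cabinet: {100 * cabinet_total / cabinet_sheets:.1f} cents/sheet")  # ~2.8, the slide's ~3
print(f"disk, ASCII:  {100 * disk_cost / ascii_pages:.2f} cents/page")          # 0.04
print(f"disk, image:  {100 * disk_cost / image_pages:.0f} cents/page")          # 1
```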

18 Standard Storage Metrics
Capacity:
–RAM: MB and $/MB: today at 100 MB & 10 $/MB
–Disk: GB and $/GB: today at 10 GB and 200 $/GB
–Tape: TB and $/TB: today at .1 TB and 100 K$/TB (nearline)
Access time (latency):
–RAM: 100 ns
–Disk: 10 ms
–Tape: 30 second pick, 30 second position
Transfer rate:
–RAM: 1 GB/s
–Disk: 5 MB/s (arrays can go to 1 GB/s)
–Tape: 3 MB/s (not clear that striping works)

19 New Storage Metrics: KOXs, MOXs, GOXs, SCANs?
KOX: How many kilobyte objects served per second
–the file server, transaction processing metric
MOX: How many megabyte objects served per second
–the Mosaic metric
GOX: How many gigabyte objects served per hour
–the video & EOSDIS metric
SCANS: How many scans of all the data per day
–the data mining and utility metric
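
A minimal sketch of what these metrics might look like for a single device, assuming a simple seek-plus-transfer service model; the formula is an illustration, not something defined on the slide:

```python
# Rough KOX/MOX/GOX/SCANS estimates for one device under a
# seek-plus-transfer model (an illustrative simplification).
def objects_per_second(access_time_s, bandwidth_Bps, object_bytes):
    return 1.0 / (access_time_s + object_bytes / bandwidth_Bps)

def metrics(access_time_s, bandwidth_Bps, capacity_bytes):
    kox   = objects_per_second(access_time_s, bandwidth_Bps, 10**3)
    mox   = objects_per_second(access_time_s, bandwidth_Bps, 10**6)
    gox   = objects_per_second(access_time_s, bandwidth_Bps, 10**9) * 3600
    scans = bandwidth_Bps * 86_400 / capacity_bytes   # sequential scans per day
    return kox, mox, gox, scans

# A 9 GB disk with 10 ms access and 5 MB/s transfer (slide 18's figures):
print(metrics(0.010, 5 * 10**6, 9 * 10**9))
# -> ~98 KOX, ~4.8 MOX, ~18 GOX/hour, ~48 scans/day
```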

20 Summary (of new ideas)
Storage accesses are the bottleneck
Accesses are getting larger (MOX, GOX, SCANS)
Capacity and cost are improving
BUT
Latencies and bandwidth are not improving much
SO
Use parallel access (disk and tape farms)

21 How To Get Lots of MOX, GOX, SCANS
Parallelism: use many little devices in parallel
Beware of the media myth
Beware of the access time myth
At 10 MB/s: 1.2 days to scan 1 Terabyte
1,000x parallel: 1.5 minute SCAN
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
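
The scan-time arithmetic checks out; a short Python calculation of serial versus 1,000-way-parallel scan time:

```python
# Serial vs. parallel scan time for 1 TB at 10 MB/s per device.
TB, MBps = 10**12, 10 * 10**6

serial_s   = TB / MBps                 # one device
parallel_s = serial_s / 1_000          # 1,000 devices in parallel

print(f"serial:   {serial_s / 86_400:.1f} days")      # ~1.2 days
print(f"parallel: {parallel_s / 60:.1f} minutes")     # ~1.7 minutes, the slide rounds to 1.5
```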

22 Meta-Message: Technology Ratios Are Important
If everything gets faster & cheaper at the same rate, then nothing really changes.
Some things getting MUCH BETTER:
–communication speed & cost: 1,000x
–processor speed & cost: 100x
–storage size & cost: 100x
Some things staying about the same:
–speed of light (more or less constant)
–people (10x worse)
–storage speed (only 10x better)

23 Outline
The challenge: Building GIANT data stores
–for example, the EOS/DIS 15 PB system
Conclusion 1:
–Think about MOX and SCANS
Conclusion 2:
–Think about Clusters
–SMP report
–Cluster report

24 Scaleable Computers: BOTH SMP and Cluster
Grow UP with SMP: 4xP6 is now standard
Grow OUT with Cluster: cluster has inexpensive parts
(diagram: SMP super server, departmental server, personal system; cluster of PCs)

25 TPC-C Current Results
Best performance is 30,390 tpmC at $305/tpmC (Oracle/DEC)
Best price/perf. is 7,693 tpmC at $43.5/tpmC (MS SQL/Dell)
Graphs show:
–UNIX high price
–UNIX scaleup diseconomy

26 Compare SMP Performance

27 TPC-C improved fast
40% hardware, 100% software, 100% PC Technology

28 Where the money goes

29 What does this mean?
PC Technology is 3x cheaper than high-end SMPs
PC node performance is 1/2 that of high-end SMPs
–4xP6 vs 20xUltraSparc
Peak performance is a cluster
–Tandem 100-node cluster
–DEC Alpha 4x8 cluster
Commodity solutions WILL come to this market

30 Cluster: Shared What?
Shared Memory Multiprocessor
–multiple processors, one memory
–all devices are local
–DEC, SGI, Sun, Sequent nodes
–easy to program, not commodity
Shared Disk Cluster
–an array of nodes
–all share common disks
–VAXcluster + Oracle
Shared Nothing Cluster
–each device local to a node
–ownership may change
–Tandem, SP2, Wolfpack

31 Clusters being built
Teradata: 1,500 nodes + 24 TB disk (50 K$/slice)
Tandem, VMScluster: 150 nodes (100 K$/slice)
Intel: 9,000, 55 M$ (6 K$/slice)
Teradata, Tandem, DEC moving to NT + low slice price
IBM: … M$ (200 K$/slice)
PC clusters (bare-handed) at dozens of nodes: web servers (msn, PointCast, …), DB servers
KEY TECHNOLOGY HERE IS THE APPS.
–Apps distribute data
–Apps distribute execution

32 Cluster Advantages
Clients and servers made from the same stuff.
–Inexpensive: built with commodity components
Fault tolerance:
–Spare modules mask failures
Modular growth
–grow by adding small modules
Parallel data search
–use multiple processors and disks

33 Clusters are winning the high end
You saw that a 4x8 cluster has best TPC-C performance
This year, a 95xUltraSparc cluster won the MinuteSort Speed Trophy (see NOWsort at …)
Ordinal 16x on SGI Origin is close (but the loser!).

34 Clusters (Plumbing)
Single system image
–naming
–protection/security
–management/load balance
Fault tolerance
–Wolfpack demo
Hot-pluggable hardware & software

35 So, What's New?
When slices cost 50 K$, you buy 10 or 20.
When slices cost 5 K$, you buy 100 or 200.
Manageability, programmability, usability become key issues (total cost of ownership).
PCs are MUCH easier to use and program.
(diagram: the MPP vicious cycle: new MPP & new OS, new app, no customers; versus the CP/commodity virtuous cycle: standard OS & hardware, apps, customers; standards allow progress and investment protection)

36 Windows NT Server Clustering: High Availability On Standard Hardware
Standard API for clusters on many platforms; no special hardware required.
Resource Group is the unit of failover
–Typical resources: shared disk, printer, ..., IP address, NetName, services (Web, SQL, File, Print, Mail, MTS)
API to define resource groups, dependencies, resources; GUI administrative interface
A consortium of 60 HW & SW vendors (everybody who is anybody)
2-node cluster in beta test now; available 97H1; >2 nodes is next
SQL Server and Oracle demo on it today
Key concepts
–System: a node
–Cluster: systems working together
–Resource: hard/soft-ware module
–Resource dependency: resource needs another
–Resource group: fails over as a unit
–Dependencies do not cross group boundaries
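
A minimal sketch, in Python, of the "resource group as the unit of failover" idea with dependencies that stay inside the group. This is purely illustrative; it is not the cluster API, and the class and method names are invented:

```python
# Illustrative model: resources with dependencies, grouped into a
# failover unit. Not the actual cluster API; names are hypothetical.
class Resource:
    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = list(depends_on)   # dependencies stay inside the group

class ResourceGroup:
    """Fails over as a unit: all resources move to the new node together."""
    def __init__(self, name, resources):
        self.name, self.resources, self.owner = name, resources, None

    def online(self, node):
        self.owner = node
        started = set()
        def start(r):                        # bring dependencies online first
            for dep in r.depends_on:
                start(dep)
            if r.name not in started:
                started.add(r.name)
                print(f"  online {r.name} on {node}")
        for r in self.resources:
            start(r)

    def failover(self, new_node):
        print(f"{self.name}: failing over {self.owner} -> {new_node}")
        self.online(new_node)

disk    = Resource("shared disk")
ip      = Resource("IP address")
netname = Resource("NetName", depends_on=[ip])
sql     = Resource("SQL service", depends_on=[disk, netname])
group   = ResourceGroup("SQL group", [disk, ip, netname, sql])

group.online("Alice")
group.failover("Betty")
```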

37 Wolfpack NT Clusters 1.0
Two-node file and print failover
GUI admin interface
(diagram: two nodes, Alice and Betty, each with private disks, sharing SCSI disk strings; clients connect to both nodes)

38 What is Wolfpack?
(architecture diagram: Cluster Management Tools reach the Cluster Service through the Cluster API DLL over RPC; the Cluster Service contains the Node Manager, Database Manager, Event Processor, Global Update Manager, Failover Manager, Resource Manager, and a Communication Manager linking to other nodes; Resource Monitors drive physical, logical, and app resource DLLs through the Resource Management Interface, whose entry points are Open, Online, IsAlive, LooksAlive, Offline, and Close; cluster-aware and non-aware apps sit above the app resource DLLs)
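
The diagram names the entry points a resource DLL exposes to its Resource Monitor (Open, Online, LooksAlive, IsAlive, Offline, Close); below is a hedged Python sketch of how a monitor might drive them, with the resource behavior and polling details invented for illustration:

```python
# Illustrative resource-DLL contract and a monitor loop that polls it.
# The entry-point names come from the slide; everything else is a sketch.
import time

class FileShareResource:
    """Stand-in for an app resource DLL."""
    def open(self):        print("open: allocate resource state")
    def online(self):      print("online: bring the file share up")
    def looks_alive(self): return True        # cheap, frequent health check
    def is_alive(self):    return True        # thorough, less frequent check
    def offline(self):     print("offline: take the file share down")
    def close(self):       print("close: release resource state")

def monitor(resource, cycles=3, poll_interval=1.0):
    resource.open()
    resource.online()
    for i in range(cycles):
        ok = resource.looks_alive() if i % 3 else resource.is_alive()
        if not ok:                             # a failure here would trigger
            print("resource failed")           # failover of its group
            break
        time.sleep(poll_interval)
    resource.offline()
    resource.close()

monitor(FileShareResource())
```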

39 Where We Are Today
Clusters moving fast
–OLTP
–Sort
–WolfPack
Technology ahead of schedule
–cpus, disks, tapes, wires, ...
OR Databases are evolving
Parallel DBMSs are evolving
HSM still immature

40 Outline
The challenge: Building GIANT data stores
–for example, the EOS/DIS 15 PB system
Conclusion 1:
–Think about MOX and SCANS
Conclusion 2:
–Think about Clusters
–SMP report
–Cluster report

41 Building PetaByte Servers
Jim Gray, Microsoft Research
Kilo 10^3   Mega 10^6   Giga 10^9   Tera 10^12 (today, we are here)   Peta 10^15   Exa 10^18

