
1 Building Peta-Byte Servers
Jim Gray, Microsoft Research
Scale of prefixes: Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12 (today, we are here), Peta 10^15, Exa 10^18

2 Outline
The challenge: Building GIANT data stores
– for example, the EOS/DIS 15 PB system
Conclusion 1:
– Think about Maps and SCANS
Conclusion 2:
– Think about Clusters

3 The Challenge -- EOS/DIS
Antarctica is melting -- 77% of fresh water liberated
– sea level rises 70 meters
– Chico & Memphis are beach-front property
– New York, Washington, SF, SB, LA, London, Paris
Let's study it! Mission to Planet Earth
EOS: Earth Observing System (17B$ => 10B$)
– 50 instruments on 10 satellites
– Landsat (added later)
EOS DIS: Data Information System:
– 3-5 MB/s raw, … MB/s cooked
– 4 TB/day
– 15 PB by year 2007

4 The Process Flow
Data arrives and is pre-processed.
– instrument data is calibrated, gridded, averaged
– geophysical data is derived
Users ask for stored data OR ask to analyze and combine data.
Can make the pull-push split dynamically
[Diagram: push processing of arriving data, pull processing of user requests, other data sources]

5 Designing EOS/DIS (for success)
Expect that millions will use the system (online). Three user categories:
– NASA: funded by NASA to do science
– Global Change: 10 k – other dirt bags
– Internet: 20 m – everyone else: grain speculators, Environmental Impact Reports, school kids
New applications => discovery & access must be automatic
Allow anyone to set up a peer-node (DAAC & SCF)
Design for ad hoc queries, not just standard data products
If push is 90%, then 10% of data is read (on average).
=> A failure: no one uses the data; in DSS, push is 1% or less.
=> Computation demand is enormous (pull:push is 100:1)

6 The (UC alternative) Architecture
2+N data center design
Scaleable DBMS to manage the data
Emphasize Pull vs Push processing
Storage hierarchy
Data Pump
Just-in-time acquisition

7 2+N Data Center Design
Duplex the archive (for fault tolerance)
Let anyone build an extract (the +N)
Partition data by time and by space (store 2 or 4 ways).
Each partition is a free-standing DBMS (similar to Tandem, Teradata designs).
Clients and partitions interact via standard protocols
– DCOM/CORBA, OLE-DB, HTTP, …
Use the (Next Generation) Internet

8 Obvious Point: EOS/DIS will be a Cluster of SMPs
It needs 16 PB of storage
– = 1 M disks in current technology
– = 500 K tapes in current technology
It needs 100 TeraOps of processing
– = 100 K processors (current technology) and ~100 TB of DRAM
1997 requirements are 1000x smaller
– smaller data rate
– almost no re-processing work
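A rough check of those device counts, as a minimal sketch; the per-device capacities below are assumptions about late-1990s hardware, not figures from the slide:

```python
# How many devices a 16 PB store implies at assumed late-1990s capacities.
PB = 10**15
GB = 10**9

store_bytes = 16 * PB
disk_gb = 16      # assumed per-disk capacity of the era
tape_gb = 32      # assumed per-tape capacity (a later slide quotes ~20 GB DLT cartridges)

print(f"disks needed: ~{store_bytes / (disk_gb * GB) / 1e6:.0f} million")
print(f"tapes needed: ~{store_bytes / (tape_gb * GB) / 1e3:.0f} thousand")
```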

9 Hardware Architecture
2 huge data centers
Each has 50 to 1,000 nodes in a cluster
– each node has about 25…250 TB of storage (FY00 prices)
– SMP: 0.5 Bips to 50 Bips, 20 K$
– DRAM: 50 GB to 1 TB, 50 K$
– 100 disks: 2.3 TB to 230 TB, 200 K$
– 10 tape robots: 50 TB to 500 TB, 100 K$
– 2 interconnects: 1 GBps to 100 GBps, 20 K$
Node costs 500 K$
Data center costs 25 M$ (capital cost)
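A back-of-the-envelope roll-up of those figures, just a sketch using the slide's rough component prices (the rounding to 500 K$ per node is the slide's, not a computed result):

```python
# Cost roll-up for one EOS/DIS data-center node, using the slide's rough estimates.
node_components_k_usd = {
    "SMP (0.5-50 Bips)": 20,
    "DRAM (50 GB - 1 TB)": 50,
    "100 disks (2.3-230 TB)": 200,
    "10 tape robots (50-500 TB)": 100,
    "2 interconnects (1-100 GBps)": 20,
}

component_total_k = sum(node_components_k_usd.values())   # ~390 K$; the slide rounds up to ~500 K$
nodes_per_center = 50                                      # low end of the 50-1,000 node range
center_cost_m = nodes_per_center * 500 / 1000              # 500 K$/node, the slide's rounded figure

print(f"component total per node: {component_total_k} K$ (slide rounds to ~500 K$)")
print(f"capital cost of a 50-node center: ~{center_cost_m:.0f} M$")
```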

10 Scaleable DBMS
Adopt the cluster approach (Tandem, Teradata, VMScluster, …)
System must scale to many processors, disks, links
Organize data as a database, not a collection of files
– SQL rather than FTP as the metaphor
– add object types unique to EOS/DIS (Object Relational DB)
DBMS based on a standard object model
– CORBA or DCOM (not vendor specific)
Grow by adding components
System must be self-managing

11 Storage Hierarchy
Cache the hot 10% (1.5 PB) on disk.
Keep the cold 90% on near-line tape.
Remember recent results on speculation; research challenge: how to trade push + store vs. pull.
(more on this later: Maps & SCANS)
[Diagram: 500 nodes with 10 TB of RAM, 1 PB of disk on 10,000 drives, 15 PB of tape in 4x1,000-tape robots]

12 Data Pump
Some queries require reading ALL the data (for reprocessing)
Each data center scans the data every 2 days.
– data rate 10 PB/day = 10 TB/node/day = 120 MB/s
Compute on demand for small jobs:
– less than 1,000 tape mounts
– less than 100 M disk accesses
– less than 100 TeraOps
– (less than 30 minute response time)
For BIG JOBS, scan the entire 15 PB database.
Queries (and extracts) snoop this data pump.
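A quick sanity check of the pump's per-node rate, a sketch assuming decimal units and the high end (1,000 nodes) of the cluster size quoted earlier:

```python
# Sanity-check the data-pump rates quoted on the slide.
PB = 10**15
TB = 10**12
MB = 10**6
SECONDS_PER_DAY = 24 * 3600

scan_per_day = 10 * PB        # the slide's quoted scan rate per data center
nodes = 1000                  # assumed: high end of the 50-1,000 node range

per_node_per_day = scan_per_day / nodes
per_node_rate = per_node_per_day / SECONDS_PER_DAY / MB

print(f"per node per day: {per_node_per_day / TB:.0f} TB")     # ~10 TB
print(f"sustained per-node rate: {per_node_rate:.0f} MB/s")    # ~116 MB/s, i.e. the slide's ~120 MB/s
```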

13 Just-in-time acquisition: 30%
Hardware prices decline 20%-40%/year
So buy at the last moment
Buy the best product that day: commodity
Depreciate over 3 years so that the facility stays fresh.
(after 3 years, cost is 23% of original)
[Chart: 1996 EOS DIS disk storage size and cost -- storage cost (M$) vs. data need (TB), assuming a 40% price decline/year; at a 60% decline, cost peaks at 10 M$]
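The depreciation arithmetic behind the "23% of original" figure, a minimal sketch assuming a steady 40%/year price decline:

```python
# If hardware prices fall ~40%/year, equivalent capacity bought 3 years later
# costs only a fraction of the original purchase price.
decline_per_year = 0.40   # assumed steady decline; real price curves vary
years = 3

residual = (1 - decline_per_year) ** years
print(f"cost of equivalent gear after {years} years: {residual:.0%} of the original")  # ~22%; the slide rounds to 23%
```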

14 Just-in-time acquisition: 50%!!!!!!!
Hardware prices have declined 50%/year lately
The PC revolution!
It's amazing!

15 TPC-C improved fast (250%/year!)
40% hardware, 100% software, 100% PC technology

16 Problems
HSM (hierarchical storage management)
Design and meta-data
Ingest
Data discovery, search, and analysis
Reorganize-reprocess
Disaster recovery
Management/operations cost

17 [image-only slide; no transcript text]

18 Outline
The challenge: Building GIANT data stores
– for example, the EOS/DIS 15 PB system
Conclusion 1:
– Think about Maps and SCANS
Conclusion 2:
– Think about Clusters

19 Meta-Message: Technology Ratios Are Important
If everything gets faster & cheaper at the same rate, THEN nothing really changes.
Things getting MUCH BETTER:
– communication speed & cost: 1,000x
– processor speed & cost: 100x
– storage size & cost: 100x
Things staying about the same:
– speed of light (more or less constant)
– people (10x more expensive)
– storage speed (only 10x better)

20 Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Charts: typical system size (bytes) vs. access time (seconds), and price ($/MB) vs. access time (seconds), across cache, main memory, secondary (disc), online tape, nearline tape, and offline tape]

21 Storage Ratios Changed
10x better access time
10x more bandwidth
4,000x lower media price
DRAM/disk price ratio went from 100:1 to 10:1 to 50:1

22 What's a Terabyte?
1 Terabyte is roughly:
– 1,000,000,000 business letters (150 miles of bookshelf)
– 100,000,000 book pages (15 miles of bookshelf)
– 50,000,000 FAX images (7 miles of bookshelf)
– 10,000,000 TV pictures, mpeg (10 days of video)
– 4,000 LandSat images
The Library of Congress (in ASCII) is 25 TB
1980: 200 M$ of disc (10,000 discs); 5 M$ of tape silo (10,000 tapes)
1997: 200 K$ of magnetic disc (120 discs); 250 K$ of optical disc robot (200 platters); 25 K$ of tape silo (25 tapes)
Terror Byte! A terabyte is only 0.1% of a PetaByte!

23 The Cost of Storage & Access
File cabinet: cabinet (4 drawer) 250$, paper (24,000 sheets) 250$, space (@ 10$/ft2) 180$; total 700$ = 3 ¢/sheet
Disk: disk (9 GB) 2,000$; ASCII: 5 m pages, 0.2 ¢/sheet (15x cheaper than paper)
Image: 200 k pages, 1 ¢/sheet (similar to paper)

24 Trends: Application Storage Demand Grew
The New World:
– billions of objects
– big objects (1 MB)
The Old World:
– millions of objects
– 100-byte objects

25 Trends: New Applications
The paperless office
The Library of Congress online (on your campus)
All information comes electronically: entertainment, publishing, business
Information Network, Knowledge Navigator, Information at Your Fingertips
Multimedia: text, voice, image, video, …

26 Thesis: Performance = Storage Accesses, not Instructions Executed
In the old days we counted instructions and I/Os
Now we count memory references
Processors wait most of the time

27 The Pico Processor
1 M SPECmarks
10^6 clocks per fault to bulk RAM
Event-horizon on chip
VM reincarnated
Multi-program cache
Terror Bytes!

28 Storage Latency: How Far Away is the Data?
If the registers are in my head (1 min away), then:
– on-chip cache is this room
– on-board cache is this campus (10 min)
– memory is Sacramento (1.5 hr)
– disk is Pluto (2 years)
– a tape/optical robot is Andromeda (2,000 years)

29 The Five Minute Rule
Trade DRAM for disk accesses
Cost of an access: DriveCost / Accesses_per_second
Cost of a DRAM page: ($/MB) / pages_per_MB
Break-even has two terms:
– a technology term and an economic term
Page size grew to compensate for changing ratios.
Still about 5 minutes for random access, 1 minute for sequential.
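A worked instance of the break-even calculation, a sketch only; the device prices and rates below are rough 1997-era assumptions, not figures from the slide:

```python
# Five-minute-rule break-even: how soon a page must be re-referenced for
# keeping it in DRAM to be cheaper than re-reading it from disk.
pages_per_mb_dram = 128        # 8 KB pages (assumed page size)
accesses_per_sec_disk = 64     # random accesses one drive delivers (assumed)
drive_cost_usd = 2000          # assumed drive price
dram_cost_per_mb_usd = 15      # assumed DRAM price

technology_term = pages_per_mb_dram / accesses_per_sec_disk
economic_term = drive_cost_usd / dram_cost_per_mb_usd

break_even_s = technology_term * economic_term
print(f"break-even reference interval: {break_even_s:.0f} s (~{break_even_s / 60:.0f} minutes)")
```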

30 [Chart: shows the best index page size, ~16 KB]

31 Standard Storage Metrics
Capacity:
– RAM: MB and $/MB: today at 10 MB & 100 $/MB
– Disk: GB and $/GB: today at 10 GB and 200 $/GB
– Tape: TB and $/TB: today at 0.1 TB and 25 k$/TB (nearline)
Access time (latency):
– RAM: 100 ns
– Disk: 10 ms
– Tape: 30 second pick, 30 second position
Transfer rate:
– RAM: 1 GB/s
– Disk: 5 MB/s (arrays can go to 1 GB/s)
– Tape: 5 MB/s (striping is problematic)

32 New Storage Metrics: Kaps, Maps, SCAN?
Kaps: how many kilobyte objects served per second
– the file server, transaction processing metric
– this is the OLD metric
Maps: how many megabyte objects served per second
– the multi-media metric
SCAN: how long to scan all the data
– the data mining and utility metric
And: Kaps/$, Maps/$, TBscan/$
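A minimal sketch of how these metrics fall out of a device's access time and transfer rate; the disk parameters below are illustrative assumptions, not any particular product:

```python
# Rough Kaps / Maps / SCAN estimates for a single device, ignoring queuing.
def storage_metrics(access_s, transfer_mb_s, capacity_gb):
    kaps = 1.0 / (access_s + 0.001 / transfer_mb_s)            # 1 KB objects per second
    maps = 1.0 / (access_s + 1.0 / transfer_mb_s)              # 1 MB objects per second
    scan_hours = (capacity_gb * 1000 / transfer_mb_s) / 3600   # time to read everything once
    return kaps, maps, scan_hours

# Illustrative 1997-ish disk: 10 ms access, 5 MB/s, 9 GB (assumed numbers).
kaps, maps, scan_h = storage_metrics(access_s=0.010, transfer_mb_s=5, capacity_gb=9)
print(f"disk: {kaps:.0f} Kaps, {maps:.1f} Maps, SCAN in {scan_h:.1f} hours")
```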

33 For the Record (good 1997 devices)
[Table of Kaps, Maps, and SCAN figures for good 1997 devices; the table did not survive in the transcript]

34 How To Get Lots of Maps, SCANs
Parallelism: use many little devices in parallel.
Beware of the media myth.
Beware of the access time myth.
At 10 MB/s it takes 1.2 days to scan a terabyte; 1,000-way parallel, the SCAN takes 100 seconds.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
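The arithmetic behind that claim, as a small sketch (decimal units assumed):

```python
# Scan time for 1 TB at a single-drive rate vs. spread across many drives.
TB = 10**12
MB = 10**6

def scan_time_seconds(bytes_total, mb_per_s, drives=1):
    return bytes_total / (mb_per_s * MB * drives)

single = scan_time_seconds(TB, 10)            # one 10 MB/s stream
parallel = scan_time_seconds(TB, 10, 1000)    # 1,000 drives in parallel

print(f"one drive:    {single / 86400:.1f} days")     # ~1.2 days
print(f"1,000 drives: {parallel:.0f} seconds")        # ~100 seconds
```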

35 The Disk Farm On a Card
The 100 GB disc card: an array of discs
Can be used as:
– 100 discs
– 1 striped disc
– 10 fault-tolerant discs
– … etc
LOTS of accesses/second and bandwidth
(14" form factor)
Life is cheap, it's the accessories that cost ya.
Processors are cheap, it's the peripherals that cost ya (a 10 k$ disc card).

36 Tape Farms for Tertiary Storage -- Not Mainframe Silos
Many independent tape robots (like a disc farm):
– one 10 K$ robot: 14 tapes, 500 GB, 5 MB/s, 20 $/GB, 30 Maps
– 100 robots (1 M$): 50 TB, 50 $/GB, 3 K Maps, scan in 27 hours
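A quick check of the farm's scan time, a sketch using the slide's figures and decimal units (an assumption):

```python
# Scan time for the 100-robot tape farm sketched above.
TB = 10**12
MB = 10**6

robots = 100
per_robot_mb_s = 5
farm_capacity_tb = 50

aggregate_mb_s = robots * per_robot_mb_s
scan_hours = farm_capacity_tb * TB / (aggregate_mb_s * MB) / 3600
print(f"aggregate bandwidth: {aggregate_mb_s} MB/s, full scan in ~{scan_hours:.0f} hours")  # ~28 h, the slide's "27 hr"
```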

37 The Metrics: Disk and Tape Farms Win
[Chart comparing GB/K$, Kaps, Maps, and SCANS/day for a 1,000x disc farm, an STC tape robot (6,000 tapes, 8 readers), and a 100x DLT tape farm]
Data Motel: data checks in, but it never checks out.

38 Tape & Optical: Beware of the Media Myth
Optical is cheap: 200 $/platter, 2 GB/platter => 100 $/GB (2x cheaper than disc)
Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc).

39 Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot (10 k$ … 3 m$)
– tapes at 20 GB each => 20 $/GB or more, depending on robot size (1x … 10x cheaper than disc)
Optical needs a robot (100 k$)
– 100 platters = 200 GB (TODAY) => 400 $/GB (more expensive than magnetic disc)
Robots have poor access times
Not good for the Library of Congress (25 TB)
Data motel: data checks in but it never checks out!

40 The Access Time Myth
The myth: seek or pick time dominates
The reality:
(1) queuing dominates
(2) transfer dominates for BLOBs
(3) disk seeks are often short
Implication: many cheap servers are better than one fast, expensive server
– shorter queues
– parallel transfer
– lower cost/access and cost/byte
This is now obvious for disk arrays
This will be obvious for tape arrays

41 Outline
The challenge: Building GIANT data stores
– for example, the EOS/DIS 15 PB system
Conclusion 1:
– Think about Maps and SCANs & the 5 minute rule
Conclusion 2:
– Think about Clusters

42 Scaleable Computers: BOTH SMP and Cluster
Grow UP with SMP: 4xP6 is now standard
Grow OUT with a cluster: a cluster is built from inexpensive parts
[Diagram: SMP super server, departmental server, personal system; a cluster of PCs]

43 What do TPC results say?
Mainframes do not compete on performance or price; they have great legacy code (MVS)
PC nodes' performance is 1/3 of high-end UNIX nodes
– 6xP6 vs 48xUltraSparc
PC technology is 3x cheaper than high-end UNIX
Peak performance is a cluster
– Tandem 100-node cluster
– DEC Alpha 4x8 cluster
Commodity solutions WILL come to this market

44 Cluster Advantages
Clients and servers made from the same stuff
– inexpensive: built with commodity components
Fault tolerance:
– spare modules mask failures
Modular growth
– grow by adding small modules
Parallel data search
– use multiple processors and disks

45 Clusters being built
Teradata: 500 nodes (50 k$/slice)
Tandem, VMScluster: 150 nodes (100 k$/slice)
Intel: 9,000 nodes, 55 M$ (6 k$/slice)
Teradata, Tandem, DEC moving to NT + low slice price
IBM: 512 nodes, 100 M$ (200 k$/slice)
PC clusters (bare-handed) at dozens of nodes: web servers (msn, PointCast, …), DB servers
KEY TECHNOLOGY HERE IS THE APPS.
– Apps distribute data
– Apps distribute execution

46 Clusters are winning the high end
Until recently a 4x8 cluster had the best TPC-C performance
Clusters have the best data mining story (TPC-D)
This year, a 32xUltraSparc cluster won the MinuteSort

47 Clusters (Plumbing)
Single system image
– naming
– protection/security
– management/load balance
Fault tolerance
Hot-pluggable hardware & software

48 So, What's New?
When slices cost 50 k$, you buy 10 or 20.
When slices cost 5 k$, you buy 100 or 200.
Manageability, programmability, usability become key issues (total cost of ownership).
PCs are MUCH easier to use and program.
[Diagram: the MPP vicious cycle (new MPP & new OS => new app => no customers) vs. the commodity virtuous cycle (standard OS & hardware => apps => customers); standards allow progress and investment protection]

49 Windows NT Server Clustering: High Availability On Standard Hardware
Standard API for clusters on many platforms; no special hardware required.
Key concepts:
– System: a node
– Cluster: systems working together
– Resource: a hardware or software module
– Resource dependency: one resource needs another
– Resource group: fails over as a unit; dependencies do not cross group boundaries
The resource group is the unit of failover.
Typical resources: shared disk, printer, IP address, NetName, service (Web, SQL, File, Print, Mail, MTS, …)
API to define resource groups, dependencies, and resources; GUI administrative interface
A consortium of 60 HW & SW vendors (everybody who is anybody)
2-node cluster in beta test now; available 97H1; >2 nodes is next
SQL Server and Oracle demo on it today

50 Where We Are Today
Clusters moving fast
– OLTP
– Sort
– WolfPack
Technology ahead of schedule
– cpus, disks, tapes, wires, …
Databases are evolving
Parallel DBMSs are evolving
Operations (batch) has a long way to go on Unix/PC.

51 Outline
The challenge: Building GIANT data stores
– for example, the EOS/DIS 15 PB system
Conclusion 1:
– Think about Maps and SCANs & the 5 minute rule
Conclusion 2:
– Think about Clusters
Slides & paper: December SIGMOD RECORD

