Presentation on theme: "Building Peta-Byte Servers"— Presentation transcript:
1 Building Peta-Byte Servers
Jim Gray, Microsoft Research
Kilo 10^3
Mega 10^6
Giga 10^9
Tera 10^12 -- today, we are here
Peta 10^15
Exa 10^18
2 Outline
The challenge: Building GIANT data stores
  for example, the EOS/DIS 15 PB system
Conclusion 1: Think about Maps and SCANs
Conclusion 2: Think about Clusters
3 The Challenge -- EOS/DIS
Antarctica is melting -- 77% of fresh water liberated
  sea level rises 70 meters
  Chico & Memphis are beach-front property
  New York, Washington, SF, SB, LA, London, Paris
Let's study it! Mission to Planet Earth
EOS: Earth Observing System (17B$ => 10B$)
  50 instruments on 10 satellites
  Landsat (added later)
EOS DIS: Data Information System:
  3-5 MB/s raw, … MB/s cooked
  4 TB/day, 15 PB by year 2007
4 The Process Flow
Data arrives and is pre-processed:
  instrument data is calibrated, gridded, averaged
  geophysical data is derived
Users ask for stored data OR to analyze and combine data.
Can make the pull-push split dynamically.
[Diagram: Other Data feeding both Push Processing and Pull Processing]
5 Designing EOS/DIS (for success)
Expect that millions will use the system (online).
Three user categories:
  NASA -- funded by NASA to do science
  Global Change 10 k -- other dirt bags
  Internet 20 m -- everyone else:
    grain speculators, Environmental Impact Reports, school kids
New applications => discovery & access must be automatic.
Allow anyone to set up a peer node (DAAC & SCF).
Design for ad hoc queries, not just standard data products.
  If push is 90%, then 10% of data is read (on average).
  => A failure: no one uses the data. In DSS, push is 1% or less.
  => computation demand is enormous (pull:push is 100:1)
6 The (UC alternative) Architecture
2+N data center design
Scaleable DBMS to manage the data
Emphasize Pull vs Push processing
Storage hierarchy
Data Pump
Just-in-time acquisition
7 2+N Data Center Design
Duplex the archive (for fault tolerance).
Let anyone build an extract (the +N).
Partition data by time and by space (store 2 or 4 ways).
Each partition is a free-standing DBMS (similar to Tandem, Teradata designs).
Clients and partitions interact via standard protocols:
  DCOM/CORBA, OLE-DB, HTTP, ...
Use the (Next Generation) Internet.
8 Obvious Point: EOS/DIS will be a Cluster of SMPs
It needs 16 PB storage
  = 1 M disks in current technology
  = 500 K tapes in current technology
It needs 100 TeraOps of processing
  = 100 K processors (current technology)
  and ~100 terabytes of DRAM
1997 requirements are 1000x smaller:
  smaller data rate
  almost no re-processing work
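A back-of-the-envelope check of these counts (a minimal sketch; the 16 GB/disk, 32 GB/tape, and 1 GOPS/processor figures are assumptions inferred from the slide's own totals, roughly late-1990s technology):

```python
PB = 10**15
GB = 10**9

storage_need = 16 * PB            # total archive
disk_capacity = 16 * GB           # assumed per-drive capacity
tape_capacity = 32 * GB           # assumed per-cartridge capacity

print(storage_need / disk_capacity)  # 1e6 disks, as the slide says
print(storage_need / tape_capacity)  # 5e5 tapes

ops_need = 100e12                 # 100 TeraOps
per_cpu = 1e9                     # assumed ~1 GOPS per processor
print(ops_need / per_cpu)         # 1e5 processors
```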
9 Hardware Architecture
2 huge data centers.
Each has 50 to 1,000 nodes in a cluster.
Each node has about 25...250 TB of storage (FY00 prices):
  SMP              … to 50 Bips         …K$
  DRAM             50 GB to 1 TB        …K$
  100 disks        … TB to 230 TB       200K$
  10 tape robots   50 TB to 500 TB      100K$
  2 interconnects  1 GBps to 100 GBps   20K$
Node costs 500K$.
Data Center costs 25M$ (capital cost).
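The node and data-center totals are mutually consistent (a minimal sketch; the SMP and DRAM prices were garbled in the transcript, so the sketch only derives what the stated totals imply for them):

```python
# Known per-node component costs from the slide (K$):
disks = 200          # 100 disks
tape_robots = 100    # 10 tape robots
interconnect = 20    # 2 interconnects
node_total = 500     # slide's stated node cost

# The remainder must cover the SMP and DRAM:
smp_plus_dram = node_total - (disks + tape_robots + interconnect)
print(smp_plus_dram)             # 180 K$ implied for SMP + DRAM

data_center = 25_000             # 25 M$, in K$
print(data_center / node_total)  # 50 nodes -- the low end of "50 to 1,000"
```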
10 Scaleable DBMS
Adopt cluster approach (Tandem, Teradata, VMScluster, ...).
System must scale to many processors, disks, links.
Organize data as a database, not a collection of files:
  SQL rather than FTP as the metaphor
  add object types unique to EOS/DIS (Object Relational DB)
DBMS based on standard object model:
  CORBA or DCOM (not vendor specific)
Grow by adding components.
System must be self-managing.
11 Storage Hierarchy
Cache hot 10% (1.5 PB) on disk.
Keep cold 90% on near-line tape.
Remember recent results on speculation.
Research challenge: how to trade push+store vs. pull
  (more on this later: Maps & SCANs)
[Diagram: 500 nodes with 10 TB of RAM; 1 PB of disk on 10,000 drives; 15 PB of tape in 4x1,000 robots]
12 Data Pump
Some queries require reading ALL the data (for reprocessing).
Each data center scans the data every 2 days:
  data rate 10 PB/day = 10 TB/node/day = 120 MB/s
Compute on demand small jobs:
  less than 1,000 tape mounts
  less than 100 M disk accesses
  less than 100 TeraOps
  (less than 30 minute response time)
For BIG JOBS, scan the entire 15 PB database.
Queries (and extracts) "snoop" this data pump.
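A quick sanity check of the scan rate (a minimal sketch; the 1,000-node count is an assumption implied by dividing 10 PB/day by 10 TB/node/day):

```python
PB = 10**15
MB = 10**6
DAY = 86_400  # seconds

scan_rate = 10 * PB / DAY   # bytes/s across the whole data center
nodes = 1_000               # assumed: 10 PB/day / 10 TB/node/day
per_node = scan_rate / nodes
print(per_node / MB)        # ~116 MB/s -- the slide's ~120 MB/s per node
```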
13 Just-in-time acquisition
Hardware prices decline 20%-40%/year,
so buy at the last moment.
Buy the best product that day: commodity.
Depreciate over 3 years so that the facility is fresh
  (after 3 years, cost is 23% of original).
[Chart: EOS DIS disk storage size and cost, 1994-2008 -- data need (TB) and storage cost (M$), assuming a 40% price decline/year; curves for 30% and 60% declines also shown, with cost peaking at 10M$]
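The residual-value arithmetic behind "23% of original" (a minimal sketch; a flat 40% annual decline gives about 22%, and the slide's 23% corresponds to roughly a 39% decline):

```python
# Residual value after n years at a constant annual price decline.
def residual(decline_per_year: float, years: int = 3) -> float:
    return (1.0 - decline_per_year) ** years

print(residual(0.40))  # 0.216 -> ~22% of original cost after 3 years
print(residual(0.39))  # 0.227 -> ~23%, matching the slide's figure
print(residual(0.20))  # 0.512 -> at the slow end, half the cost remains
```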
18 Outline
The challenge: Building GIANT data stores
  for example, the EOS/DIS 15 PB system
Conclusion 1: Think about Maps and SCANs
Conclusion 2: Think about Clusters
19 Meta-Message: Technology Ratios Are Important
If everything gets faster & cheaper at the same rate, THEN nothing really changes.
Things getting MUCH BETTER:
  communication speed & cost 1,000x
  processor speed & cost 100x
  storage size & cost 100x
Things staying about the same:
  speed of light (more or less constant)
  people (10x more expensive)
  storage speed (only 10x better)
20 Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Charts: Size vs Speed and Price vs Speed -- typical system capacity (bytes) and $/MB plotted against access time (10^-9 to 10^3 seconds) for registers, cache, main memory, secondary disc, online tape, nearline tape, and offline tape]
21 Storage Ratios Changed
10x better access time
10x more bandwidth
4,000x lower media price
DRAM/disk price ratio: 100:1, then 10:1, now 50:1
22 What's a Terabyte?
Terror Byte!! 0.1% of a PetaByte!!!
  1,000,000,000 business letters   150 miles of bookshelf
  100,000,000 book pages           15 miles of bookshelf
  50,000,000 FAX images            7 miles of bookshelf
  10,000,000 TV pictures (mpeg)    10 days of video
  4,000 LandSat images
Library of Congress (in ASCII) is 25 TB.
1980: 200 M$ of disc (…,000 discs); 5 M$ of tape silo (…,000 tapes)
1997: … K$ of magnetic disc (… discs); 250 K$ of optical disc robot (… platters); 25 K$ of tape silo (… tapes)
23 The Cost of Storage & Access
File cabinet:
  cabinet (4 drawer)     250$
  paper (24,000 sheets)  250$
  space (10$/ft2)        180$
  total                  680$  => ~3¢/sheet
Disk:
  disk (9 GB)            …,000$
  ASCII: … M pages       …¢/sheet (15x cheaper)
  Image: 200 K pages     …¢/sheet (similar to paper)
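The per-sheet arithmetic (a minimal sketch; the drive price and bytes-per-page below are assumptions, since the transcript garbled those figures -- 2,000$ for the 9 GB drive and ~10 KB per formatted ASCII page):

```python
# Paper: cost per sheet in a file cabinet.
cabinet = 250 + 250 + 180          # cabinet + paper + space, $
print(100 * cabinet / 24_000)      # ~2.8 cents/sheet

# Disk: assumed 2,000$ for 9 GB, ~10 KB per ASCII page.
drive = 2_000                      # $ (assumption)
ascii_pages = 9e9 / 10e3           # ~900,000 pages
print(100 * drive / ascii_pages)   # ~0.22 cents/sheet, ~13x cheaper
print(100 * drive / 200_000)       # images: ~1 cent/sheet, like paper
```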
24 Trends: Application Storage Demand Grew
The New World:
  billions of objects
  big objects (1 MB)
The Old World:
  millions of objects
  100-byte objects
25 Trends: New Applications
Multimedia: text, voice, image, video, ...
  the paperless office
  Library of Congress online (on your campus)
  all information comes electronically:
    entertainment, publishing, business
Information Network, Knowledge Navigator, Information at Your Fingertips
26 Thesis: Performance = Storage Accesses, not Instructions Executed
In the "old days" we counted instructions and IOs.
Now we count memory references.
Processors wait most of the time.
27 The Pico Processor
1 M SPECmarks
10^6 clocks per fault to bulk RAM
Event horizon on chip
VM reincarnated
Multi-program cache
Terror Bytes!
28 Storage Latency: How Far Away is the Data?
  (clocks)                      (analogy)
  1     Registers               My Head       1 min
  2     On-Chip Cache           This Room
  10    On-Board Cache          This Campus   10 min
  100   Memory                  Sacramento    1.5 hr
  10^6  Disk                    Pluto         2 years
  10^9  Tape/Optical Robot      Andromeda     2,000 years
29 The Five Minute Rule: Trade DRAM for Disk Accesses
Cost of an access: DriveCost / Accesses_per_second
Cost of a DRAM page: $_per_MB / Pages_per_MB
Break-even has two terms: a technology term and an economic term.
Grew page size to compensate for changing ratios.
Still at 5 minutes for random access, 1 minute sequential.
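The break-even interval multiplies the two terms the slide names (a minimal sketch of the Gray-Putzolu formula; the 1997-era device prices and rates below are illustrative assumptions):

```python
# Five Minute Rule: keep a page in DRAM if it is re-referenced
# more often than the break-even interval.
def break_even_seconds(pages_per_mb_ram: float,
                       accesses_per_sec_per_disk: float,
                       price_per_disk: float,
                       price_per_mb_ram: float) -> float:
    technology = pages_per_mb_ram / accesses_per_sec_per_disk
    economics = price_per_disk / price_per_mb_ram
    return technology * economics

# Illustrative numbers (assumptions): 8 KB pages (128 per MB),
# 64 accesses/s per disk, 2,000$ drives, 15$/MB DRAM.
print(break_even_seconds(128, 64, 2000, 15))  # ~267 s -- about 5 minutes
```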
31 Standard Storage Metrics
Capacity:
  RAM: MB and $/MB -- today at 10 MB & 100 $/MB
  Disk: GB and $/GB -- today at 10 GB and 200 $/GB
  Tape: TB and $/TB -- today at 0.1 TB and 25 k$/TB (nearline)
Access time (latency):
  RAM: 100 ns
  Disk: … ms
  Tape: … second pick, 30 second position
Transfer rate:
  RAM: … GB/s
  Disk: … MB/s (arrays can go to 1 GB/s)
  Tape: … MB/s (striping is problematic)
32 New Storage Metrics: Kaps, Maps, SCAN?
Kaps: how many kilobyte objects served per second
  the file server, transaction-processing metric
  this is the OLD metric
Maps: how many megabyte objects served per second
  the multi-media metric
SCAN: how long to scan all the data
  the data mining and utility metric
And: Kaps/$, Maps/$, TBscan/$
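These metrics are easy to estimate for a single device (a minimal sketch; the drive parameters -- 10 ms access, 10 MB/s transfer, 9 GB capacity -- are illustrative assumptions):

```python
# Estimate Kaps, Maps, and SCAN time for one drive.
access_time = 0.010   # s per random access (assumption)
bandwidth = 10e6      # bytes/s sustained transfer (assumption)
capacity = 9e9        # bytes on the drive (assumption)

kaps = 1 / (access_time + 1e3 / bandwidth)   # 1 KB objects served/s
maps = 1 / (access_time + 1e6 / bandwidth)   # 1 MB objects served/s
scan_hours = capacity / bandwidth / 3600     # time to read everything

print(round(kaps), round(maps, 1), round(scan_hours, 2))
# ~99 Kaps, ~9.1 Maps, 0.25 hours to scan 9 GB: transfer time,
# not seek time, is what limits Maps.
```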
34 How To Get Lots of Maps, SCANs
Parallelism: use many little devices in parallel.
  Beware of the media myth.
  Beware of the access time myth.
At 10 MB/s, 1.2 days to scan a terabyte;
1,000x parallel: 100 second SCAN.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
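The arithmetic behind the 1,000x claim (a minimal sketch; the 1 TB dataset size is inferred from 10 MB/s x 1.2 days):

```python
TB = 10**12
MBps = 10**6

data = 1 * TB
rate = 10 * MBps

serial = data / rate     # 100,000 s on one drive
print(serial / 86_400)   # ~1.16 days -- the slide's "1.2 days"
print(serial / 1_000)    # 100 s with 1,000 drives scanning in parallel
```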
35 The Disk Farm On a Card
The 100 GB disc card: an array of discs.
Can be used as:
  100 discs
  1 striped disc
  10 fault-tolerant discs
  ...etc
LOTS of accesses/second, bandwidth.
[Figure: a 14" disc card]
Life is cheap, it's the accessories that cost ya.
Processors are cheap, it's the peripherals that cost ya
  (a 10 k$ disc card).
36 Tape Farms for Tertiary Storage -- Not Mainframe Silos
Many independent tape robots (like a disc farm):
  10 K$ robot: 14 tapes, 500 GB, 5 MB/s, 20 $/GB, 30 Maps, 27 hr scan
  100 robots: 1 M$, 50 TB, 50 $/GB, 3 K Maps, scan in 27 hours
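Checking the per-robot numbers (a minimal sketch using the figures on the slide):

```python
GB = 10**9
capacity = 500 * GB   # per robot
bandwidth = 5e6       # 5 MB/s per robot

scan_s = capacity / bandwidth
print(scan_s / 3600)  # ~27.8 hours -- the "27 hr scan"
print(10_000 / 500)   # 20 $/GB for the 10 K$ robot

# 100 robots scale capacity and Maps 100x, but the scan time
# stays 27 hours: each robot still reads only its own 500 GB.
```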
37 The Metrics: Disk and Tape Farms Win
Data Motel: data checks in, but it never checks out.
[Chart: log-scale comparison (0.01 to 1,000,000) of GB/K$, Kaps, Maps, and SCANs/day for a 1,000x disc farm, an STC tape robot (6,000 tapes, 8 readers), and a 100x DLT tape farm]
38 Tape & Optical: Beware of the Media Myth
Optical is cheap: 200 $/platter, 2 GB/platter
  => 100 $/GB (2x cheaper than disc)
Tape is cheap: 30 $/tape, 20 GB/tape
  => 1.5 $/GB (100x cheaper than disc)
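The media-only arithmetic is straight division of the slide's prices (a minimal sketch):

```python
print(200 / 2)   # optical: 100 $/GB
print(30 / 20)   # tape: 1.5 $/GB
# The myth: these media prices ignore the robot and drives
# that dominate the cost of a real system (next slide).
```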
39 Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot (10 k$ ... … m$)
  … tapes (at 20 GB each) => 20 $/GB ... … $/GB
  (1x ... 10x cheaper than disc)
Optical needs a robot (100 k$)
  100 platters = 200 GB (TODAY) => 400 $/GB
  (… more expensive than mag disc)
Robots have poor access times.
Not good for Library of Congress (25 TB).
Data motel: data checks in but it never checks out!
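System-level $/GB once the robot is included (a minimal sketch; the tape line reuses the 14-tape, 500 GB robot from the tape-farm slide, since the transcript garbled this slide's own figures):

```python
# Optical: a 100 k$ robot holding 100 platters of 2 GB each.
optical_robot = 100_000                  # $
optical_gb = 100 * 2
print(optical_robot / optical_gb)        # 500 $/GB, same order as the slide's 400 $/GB

# Tape: the 10 k$, 14-tape robot (500 GB) from slide 36.
tape_robot = 10_000
media = 14 * 30                          # 30 $/tape from the previous slide
print((tape_robot + media) / 500)        # ~21 $/GB -- the robot, not the
                                         # 1.5 $/GB media, sets the price
```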
40 The Access Time Myth
The myth: seek or pick time dominates.
The reality:
  (1) queueing dominates
  (2) transfer dominates BLOBs
  (3) disk seeks are often short
Implication: many cheap servers are better than one fast expensive server
  shorter queues
  parallel transfer
  lower cost/access and cost/byte
This is now obvious for disk arrays.
This will be obvious for tape arrays.
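Why queueing dominates (a minimal sketch using the textbook M/M/1 waiting-time formula; the 15 ms service time and the arrival rates are illustrative assumptions):

```python
# M/M/1 response time: R = S / (1 - utilization).
def response_time(service_s: float, arrivals_per_s: float) -> float:
    utilization = service_s * arrivals_per_s
    assert utilization < 1, "queue is unstable"
    return service_s / (1 - utilization)

service = 0.015            # 15 ms per disk access (assumption)
for load in (10, 40, 60):  # requests/second
    print(load, round(1000 * response_time(service, load), 1), "ms")
# At 60 req/s the disk is 90% busy and response time is ~150 ms,
# ten times the raw access time. Two half-loaded cheap disks beat
# one busy fast one.
```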
41 Outline
The challenge: Building GIANT data stores
  for example, the EOS/DIS 15 PB system
Conclusion 1: Think about Maps and SCANs & the 5 minute rule
Conclusion 2: Think about Clusters
42 Scaleable Computers: BOTH SMP and Cluster
Grow UP with SMP: 4xP6 is now standard.
Grow OUT with Cluster: cluster has inexpensive parts.
[Diagram: SMP super server; departmental server; personal system -- a cluster of PCs]
43 What do TPC results say?
Mainframes do not compete on performance or price.
  They have great legacy code (MVS).
PC node performance is 1/3 of high-end UNIX nodes
  (6xP6 vs 48xUltraSparc).
PC technology is 3x cheaper than high-end UNIX.
Peak performance is a cluster:
  Tandem 100-node cluster
  DEC Alpha 4x8 cluster
Commodity solutions WILL come to this market.
44 Cluster Advantages
Clients and servers made from the same stuff.
Inexpensive: built with commodity components.
Fault tolerance: spare modules mask failures.
Modular growth: grow by adding small modules.
Parallel data search: use multiple processors and disks.
45 Clusters being built
Teradata: 500 nodes (50 k$/slice)
Tandem, VMScluster: 150 nodes (100 k$/slice)
Intel: 9,000 nodes, 55 M$ (6 k$/slice)
Teradata, Tandem, DEC moving to NT + low slice price
IBM: 512 nodes ASCI @ 100 m$ (200 k$/slice)
PC clusters (bare-handed) at dozens of nodes:
  web servers (msn, PointCast, ...), DB servers
KEY TECHNOLOGY HERE IS THE APPS:
  apps distribute data
  apps distribute execution
46 Clusters are winning the high end
Until recently, a 4x8 cluster had the best TPC-C performance.
Clusters have the best data mining story (TPC-D).
This year, a 32xUltraSparc cluster won the MinuteSort.
47 Clusters (Plumbing)
Single system image:
  naming
  protection/security
  management/load balance
Fault tolerance
Hot-pluggable hardware & software
48 So, What's New?
When slices cost 50 k$, you buy 10 or 20.
Manageability, programmability, usability become key issues (total cost of ownership).
PCs are MUCH easier to use and program.
[Diagram: the MPP vicious cycle -- new MPP & new OS, few apps, no customers -- vs. the commodity virtuous cycle: standard OS & hardware, apps, customers; standards allow progress and investment protection]
49 Windows NT Server Clustering: High Availability On Standard Hardware
Standard API for clusters on many platforms; no special hardware required.
Resource Group is the unit of failover.
  Typical resources: shared disk, printer, ...; IP address, NetName; service (Web, SQL, File, Print, Mail, MTS, ...)
API to define resource groups, dependencies, resources; GUI administrative interface.
A consortium of 60 HW & SW vendors (everybody who is anybody).
2-node cluster in beta test now; available 97H1. >2 node is next.
SQL Server and Oracle demo on it today.
Key concepts:
  System: a node
  Cluster: systems working together
  Resource: hard/soft-ware module
  Resource dependency: one resource needs another
  Resource group: fails over as a unit
  Dependencies do not cross group boundaries

[Speaker notes] The Wolfpack program has three goals: (1) to be the most reliable way to run Windows NT Server, (2) to be the most cost-effective high-availability platform, and (3) to be the easiest platform for developing cluster-aware solutions. Let's look at each of those three in more detail.

Wolfpack will be the most reliable way to run Windows NT Server. Out of the box, it will provide automatic recovery for file sharing, printer sharing, and Internet/Intranet services. It will be able to provide basic recovery services for virtually any existing server application without coding changes, and will feature an administrator's console that makes it easy to take a server off-line for maintenance without disrupting your mission-critical business applications. The other server can deliver services while one is being changed.

Wolfpack will run on standard servers from many vendors. It can use many interconnects, ranging from standard Ethernet to specialized high-speed ones like Tandem ServerNet. It works with a wide range of disk drives and controllers, including standard SCSI drives. This broad hardware support means flexibility, choice, and competitive pricing. Wolfpack clustering technology allows all nodes in the cluster to do useful work -- there's no wasted "standby" server sitting idle waiting for a failure, as there is with server-mirroring solutions. And, of course, because it's Windows software, it will have a familiar and easy-to-use graphical interface for the administrator.

SQL Server will use Wolfpack's Clustering API to provide high availability via disk and IP-address failover. SQL Server continues its close integration with NT and its unmatched ease of use. SQL Server 7.0 will provide a GUI configuration and management wizard to make it easy to configure high-availability databases.
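The resource-group concepts above can be illustrated with a toy model (a minimal sketch only -- this is NOT the Wolfpack API; every name and class here is hypothetical):

```python
# Toy model of resource groups, dependencies, and unit failover.
from dataclasses import dataclass, field

@dataclass
class Resource:
    name: str                          # e.g. "shared disk", "IP address"
    depends_on: list = field(default_factory=list)

@dataclass
class ResourceGroup:
    """Fails over as a unit; dependencies may not cross groups."""
    resources: list
    node: str                          # system currently hosting the group

    def fail_over(self, spare: str) -> None:
        # Bring resources up on the spare node in dependency order.
        for r in self._dependency_order():
            print(f"starting {r.name} on {spare}")
        self.node = spare

    def _dependency_order(self):
        done, order = set(), []
        def visit(r):
            if r.name in done:
                return
            done.add(r.name)
            for d in r.depends_on:
                assert d in self.resources, "dependency crosses group boundary"
                visit(d)
            order.append(r)
        for r in self.resources:
            visit(r)
        return order

disk = Resource("shared disk")
ip = Resource("IP address")
sql = Resource("SQL Server", depends_on=[disk, ip])
group = ResourceGroup([disk, ip, sql], node="node-A")
group.fail_over("node-B")   # disk and IP start before SQL Server
```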
50 Where We Are Today
Clusters are moving fast:
  OLTP, Sort, WolfPack
Technology is ahead of schedule:
  cpus, disks, tapes, wires, ...
Databases are evolving.
Parallel DBMSs are evolving.
Operations (batch) has a long way to go on Unix/PC.
51 Outline
The challenge: Building GIANT data stores
  for example, the EOS/DIS 15 PB system
Conclusion 1: Think about Maps and SCANs & the 5 minute rule
Conclusion 2: Think about Clusters
Slides & paper: December SIGMOD RECORD