Presentation on theme: "Building PetaByte Servers"— Presentation transcript:
1 Building PetaByte Servers. Jim Gray, Microsoft Research. Kilo 10^3; Mega 10^6; Giga 10^9; Tera 10^12 (today, we are here); Peta 10^15; Exa 10^18.
2 Outline. The challenge: Building GIANT data stores (for example, the EOS/DIS 15 PB system). Conclusion 1: Think about MOX and SCANS. Conclusion 2: Think about Clusters (SMP report, Cluster report).
3 The Challenge -- EOS/DIS. Antarctica is melting -- 77% of fresh water liberated; sea level rises 70 meters; Chico & Memphis are beach-front property (New York, Washington, SF, LA, London, Paris). Let’s study it! Mission to Planet Earth. EOS: Earth Observing System (17B$ => 10B$): 50 instruments on 10 satellites, Landsat (added later). EOS DIS: Data Information System: 3-5 MB/s raw, MB/s processed; 4 TB/day; 15 PB by year 2007.
4 The Process Flow. Data arrives and is pre-processed: instrument data is calibrated, gridded, averaged; geophysical data is derived. Users ask for stored data OR to analyze and combine data. Can make the pull-push split dynamically. (Diagram labels: Pull Processing, Push Processing, Other Data.)
5 Designing EOS/DIS. Expect that millions will use the system (online). Three user categories: NASA - funded by NASA to do science; Global Change 10 k - other dirt bags; Internet 20 m - everyone else (grain speculators, Environmental Impact Reports). New applications => discovery & access must be automatic. Allow anyone to set up a peer-node (DAAC & SCF). Design for Ad Hoc queries, Not Standard Data Products. If push is 90%, then 10% of data is read (on average) => a failure: no one uses the data; in DSS, push is 1% or less => computation demand is enormous (pull:push is 100:1).
6 Obvious Points: EOS/DIS will be a cluster of SMPs. It needs 16 PB storage = 1 M disks in current technology = 500K tapes in current technology. It needs 100 TeraOps of processing = 100K processors (current technology) and ~100 Terabytes of DRAM. 1997 requirements are 1000x smaller: smaller data rate, almost no re-processing work.
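The sizing on this slide implies per-device figures that can be checked with simple division. The sketch below assumes decimal units (1 PB = 10^15 bytes); the variable names are illustrative and not from the talk.

```python
# Back-of-envelope check of the slide's sizing (decimal units assumed).
PB = 10**15
TB = 10**12
GB = 10**9

storage_bytes = 16 * PB      # "16 PB storage"
disks = 1_000_000            # "1 M disks"
tapes = 500_000              # "500K tapes"
total_ops = 100 * 10**12     # "100 TeraOps"
processors = 100_000         # "100K processors"
dram_bytes = 100 * TB        # "~100 Terabytes of DRAM"

gb_per_disk = storage_bytes / disks / GB            # implied disk capacity
gb_per_tape = storage_bytes / tapes / GB            # implied tape capacity
gigaops_per_cpu = total_ops / processors / 10**9    # implied processor speed
dram_gb_per_cpu = dram_bytes / processors / GB      # implied DRAM per processor

print(gb_per_disk, gb_per_tape, gigaops_per_cpu, dram_gb_per_cpu)
```

Under these assumptions the slide's "current technology" works out to roughly a 16 GB disk, a 32 GB tape, and a 1 GigaOp processor with 1 GB of DRAM.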
7 The architecture. 2+N data center design; Scaleable OR-DBMS; Emphasize Pull vs Push processing; Storage hierarchy; Data Pump; Just in time acquisition.
8 2+N data center design. Duplex the archive (for fault tolerance). Let anyone build an extract (the +N). Partition data by time and by space (store 2 or 4 ways). Each partition is a free-standing OR-DBMS (similar to Tandem, Teradata designs). Clients and Partitions interact via standard protocols: OLE-DB, DCOM/CORBA, HTTP,…
9 Hardware Architecture. 2 Huge Data Centers. Each has 50 to 1,000 nodes in a cluster. Each node has about 25…250 TB of storage:
  SMP: Bips to 50 Bips, K$
  DRAM: 50GB to 1 TB, K$
  100 disks: TB to 230 TB, 200K$
  10 tape robots: 25 TB to 250 TB, 200K$
  2 Interconnects: 1GBps to 100 GBps, 20K$
Node costs 500K$. Data Center costs 25M$ (capital cost).
10 Scaleable OR-DBMS. Adopt cluster approach (Tandem, Teradata, VMScluster, DB2/PE, Informix, ...). System must scale to many processors, disks, links. OR DBMS based on standard object model: CORBA or DCOM (not vendor specific). Grow by adding components. System must be self-managing.
11 Storage Hierarchy. Cache hot 10% (1.5 PB) on disk. Keep cold 90% on near-line tape. Remember recent results on speculation. (Diagram labels: 15 PB of Tape Robot; 1 PB of Disk; 10-TB RAM; 500 nodes; 10,000 drives; 4x1,000 robots.)
12 Data Pump. Some queries require reading ALL the data (for reprocessing). Each Data Center scans the data every 2 weeks. Data rate 10 PB/day = 10 TB/node/day = 120 MB/s. Compute on demand small jobs: less than 1,000 tape mounts, less than 100 M disk accesses, less than 100 TeraOps (less than 30 minute response time). For BIG JOBS scan entire 15PB database. Queries (and extracts) “snoop” this data pump.
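The per-node rate on this slide can be reproduced by unit conversion. This is a minimal sketch assuming decimal units and a constant rate over a 24-hour day; the names are illustrative.

```python
# Convert the data pump's daily volume per node into a sustained rate.
SECONDS_PER_DAY = 24 * 60 * 60
PB = 10**15
TB = 10**12
MB = 10**6

center_per_day = 10 * PB     # "10 PB/day" for the whole data center
node_per_day = 10 * TB       # "10 TB/node/day"

node_rate_mb_s = node_per_day / SECONDS_PER_DAY / MB   # sustained MB/s per node
nodes_implied = center_per_day // node_per_day         # nodes needed at that rate

print(f"{node_rate_mb_s:.1f} MB/s per node, {nodes_implied} nodes")
```

The result is about 116 MB/s per node, which the slide rounds to 120 MB/s, and it implies 1,000 nodes, the upper end of the "50 to 1,000 nodes" range given on the hardware slide.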
13 Just-in-time acquisition. Hardware prices decline 20%-40%/year, so buy at last moment. Buy best product that day: commodity. Depreciate over 3 years so that facility is fresh (after 3 years, cost is 23% of original). [Chart: EOS DIS Disk Storage Size and Cost, 1994 to 2008; axes: Data Need (TB) and Storage Cost (M$); assumes 40% price decline/year; annotations: 30%; 60% decline peaks at 10M$.]
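The depreciation argument is compound decline: if prices fall by a fixed fraction each year, hardware bought today costs only a fraction of that after a few years. A minimal sketch, assuming a constant annual decline; `residual_fraction` is a hypothetical helper name.

```python
def residual_fraction(annual_decline: float, years: int) -> float:
    """Fraction of the original price that the same hardware costs after `years`."""
    return (1.0 - annual_decline) ** years


# The slide's range of 20%-40% per year, evaluated over a 3-year depreciation.
for decline in (0.20, 0.30, 0.40):
    print(f"{decline:.0%}/year -> {residual_fraction(decline, 3):.1%} after 3 years")
```

At a 40% annual decline the 3-year figure is about 22%, close to the 23% quoted on the slide; at 20% it is about 51%.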
14 Problems. HSM; Design and Meta-data; Ingest; Data discovery, search, and analysis; reorg-reprocess; disaster recovery; cost.
15 Trends: New Applications. The Old World: millions of objects, 100-byte objects. The New World: billions of objects, big objects (1MB). Multimedia: text, voice, image, video, ... The paperless office. Library of Congress online (on your campus). All information comes electronically: entertainment, publishing, business. Information Network, Knowledge Navigator, Information at Your Fingertips.
16 What's a Terabyte. Terror Byte !! .1% of a PetaByte!! One terabyte holds:
  1,000,000,000 business letters (150 miles of bookshelf)
  100,000,000 book pages (15 miles of bookshelf)
  50,000,000 FAX images (7 miles of bookshelf)
  10,000,000 TV pictures (mpeg) (10 days of video)
  4,000 LandSat images
Library of Congress (in ASCII) is 25 TB.
1980: 200 M$ of disc, ,000 discs; 5 M$ of tape silo, ,000 tapes.
1994: M$ of magnetic disc, discs; 500 K$ of optical disc robot, platters; 50 K$ of tape silo, tapes.
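Dividing one terabyte by each object count gives the object size the slide is implicitly assuming. A small sketch, taking 1 TB as 10^12 bytes (an assumption; the slide does not say whether it means decimal or binary units):

```python
# Implied size of each kind of object if one terabyte holds the stated count.
TB = 10**12

objects_per_terabyte = {
    "business letters": 1_000_000_000,
    "book pages": 100_000_000,
    "FAX images": 50_000_000,
    "TV pictures (mpeg)": 10_000_000,
    "LandSat images": 4_000,
}

bytes_per_object = {name: TB // count for name, count in objects_per_terabyte.items()}

for name, size in bytes_per_object.items():
    print(f"{name}: {size:,} bytes each")
```

This reads as about 1 KB per letter, 10 KB per book page, 20 KB per FAX image, 100 KB per TV picture, and 250 MB per LandSat image.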
17 The Cost of Storage & Access.
  File Cabinet: cabinet (4 drawer) 250$; paper (24,000 sheets) 250$; space (10$/ft2) 180$; total $, ¢/sheet.
  Disk: disk (9 GB =) ,000$. ASCII: m pages, ¢/sheet (100x cheaper). Image: 200 k pages, ¢/sheet (similar to paper).
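The slide's own total and per-sheet figures for the file cabinet did not survive in this transcript, but the three line items did. The sketch below simply sums those surviving items and divides by the sheet count; it is a reader's calculation and may differ from the rounded figures on the original slide.

```python
# Cost per sheet for the file cabinet, from the line items that survive above.
cabinet_dollars = 250     # 4-drawer cabinet
paper_dollars = 250       # 24,000 sheets of paper
space_dollars = 180       # floor space
sheets = 24_000

total_dollars = cabinet_dollars + paper_dollars + space_dollars
cents_per_sheet = 100 * total_dollars / sheets

print(f"total {total_dollars}$, {cents_per_sheet:.1f} cents/sheet")
```

That comes to a little under 3 cents per sheet, which is the baseline the disk figures are compared against ("100x cheaper" for ASCII, "similar to paper" for images).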
18 Standard Storage Metrics.
  Capacity: RAM: MB and $/MB: today at 100 MB & 10 $/MB. Disk: GB and $/GB: today at 10 GB and 200 $/GB. Tape: TB and $/TB: today at .1 TB and 100 k$/TB (nearline).
  Access time (latency): RAM: 100 ns. Disk: ms. Tape: second pick, 30 second position.
  Transfer rate: RAM: GB/s. Disk: MB/s (arrays can go to 1GB/s). Tape: MB/s (not clear that striping works).
19 New Storage Metrics: KOXs, MOXs, GOXs, SCANs? KOX: How many kilobyte objects served per second (the file server, transaction processing metric). MOX: How many megabyte objects served per second (the Mosaic metric). GOX: How many gigabyte objects served per hour (the video & EOSDIS metric). SCANS: How many scans of all the data per day (the data mining and utility metric).
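One way to see why these metrics differ is a simple service-time model in which each access pays a fixed latency plus the transfer time for the object. The model and the device numbers below (10 ms latency, 10 MB/s bandwidth) are illustrative assumptions, not figures from the talk.

```python
def objects_per_second(object_bytes: float, latency_s: float, bandwidth_bps: float) -> float:
    """Objects served per second by one device: 1 / (latency + transfer time)."""
    return 1.0 / (latency_s + object_bytes / bandwidth_bps)


# Hypothetical single device: 10 ms access latency, 10 MB/s transfer rate.
LATENCY_S = 0.010
BANDWIDTH_BPS = 10 * 10**6

kox = objects_per_second(10**3, LATENCY_S, BANDWIDTH_BPS)                  # KB objects/second
mox = objects_per_second(10**6, LATENCY_S, BANDWIDTH_BPS)                  # MB objects/second
gox_per_hour = 3600 * objects_per_second(10**9, LATENCY_S, BANDWIDTH_BPS)  # GB objects/hour

print(f"KOX {kox:.1f}/s, MOX {mox:.2f}/s, GOX {gox_per_hour:.1f}/hour")
```

In this model kilobyte objects are dominated by latency, while megabyte and gigabyte objects are dominated by bandwidth, which is why the larger-object metrics reward transfer rate and parallelism rather than fast seeks.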
20 Summary (of new ideas). Storage accesses are the bottleneck. Accesses are getting larger (MOX, GOX, SCANS). Capacity and cost are improving, BUT latencies and bandwidth are not improving much, SO use parallel access (disk and tape farms).
21 How To Get Lots of MOX, GOX, SCANS. Parallelism: use many little devices in parallel. Beware of the media myth. Beware of the access time myth. 1 Terabyte at 10 MB/s: 1.2 days to scan. 1,000 x parallel: 1.5 minute SCAN. Parallelism: divide a big problem into many smaller ones to be solved in parallel.
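The scan times on this slide follow from dividing the data size by the transfer rate. A minimal sketch, assuming decimal units and perfect 1,000-way parallelism with no coordination overhead:

```python
# Time to scan 1 TB on one 10 MB/s device versus 1,000 such devices in parallel.
TB = 10**12
MB = 10**6

data_bytes = 1 * TB
device_rate = 10 * MB        # bytes per second for one device
devices = 1_000

scan_seconds = data_bytes / device_rate
scan_days = scan_seconds / 86_400
parallel_minutes = scan_seconds / devices / 60

print(f"serial: {scan_days:.2f} days, parallel: {parallel_minutes:.2f} minutes")
```

The serial scan takes about 1.2 days and the parallel scan under two minutes, in line with the slide's rounded figures.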
22 Meta-Message: Technology Ratios Are Important. If everything gets faster & cheaper at the same rate then nothing really changes. Some things getting MUCH BETTER: communication speed & cost 1,000x; processor speed & cost 100x; storage size & cost 100x. Some things staying about the same: speed of light (more or less constant); people (10x worse); storage speed (only 10x better).
23 Outline. The challenge: Building GIANT data stores (for example, the EOS/DIS 15 PB system). Conclusion 1: Think about MOX and SCANS. Conclusion 2: Think about Clusters (SMP report, Cluster report).
24 Scaleable Computers: BOTH SMP and Cluster. Grow Up with SMP: 4xP6 is now standard. Grow Out with Cluster: Cluster has inexpensive parts. (Diagram labels: Personal System, Departmental Server, SMP Super Server, Cluster of PCs.)
25 TPC-C Current Results. Best Performance is 30,390 tpmC at $305/tpmC (Oracle/DEC). Best Price/Perf. is 7,693 tpmC at $43.5/tpmC (MS SQL/Dell). Graphs show: UNIX high price; UNIX scaleup diseconomy.
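Because TPC-C price/performance is the total system price divided by throughput, multiplying the two figures back together gives the approximate price of each benchmarked system. The sketch below reads the slide's numbers as throughput in tpmC and price/performance in $/tpmC; the variable names are illustrative.

```python
# Implied total system price = throughput (tpmC) x price/performance ($/tpmC).
best_perf_tpmc = 30_390
best_perf_dollars_per_tpmc = 305.0        # Oracle/DEC

best_value_tpmc = 7_693
best_value_dollars_per_tpmc = 43.5        # MS SQL/Dell

best_perf_price = best_perf_tpmc * best_perf_dollars_per_tpmc
best_value_price = best_value_tpmc * best_value_dollars_per_tpmc

print(f"best performance system: ~{best_perf_price / 1e6:.1f} M$")
print(f"best price/perf system:  ~{best_value_price / 1e3:.0f} K$")
```

On this reading the top-performing configuration costs roughly 9 M$ and the best price/performance configuration roughly a third of a million dollars.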
29 What does this mean? PC Technology is 3x cheaper than high-end SMPs. PC node performance is 1/2 of high-end SMPs (4xP6 vs 20xUltraSparc). Peak performance is a cluster: Tandem 100 node cluster, DEC Alpha 4x8 cluster. Commodity solutions WILL come to this market.
30 Cluster: Shared What? Shared Memory Multiprocessor: multiple processors, one memory; all devices are local; DEC, SGI, Sun, Sequent nodes; easy to program, not commodity. Shared Disk Cluster: an array of nodes; all share common disks; VAXcluster + Oracle. Shared Nothing Cluster: each device local to a node; ownership may change; Tandem, SP2, Wolfpack.
31 Clusters being built. Teradata: 1500 nodes + 24 TB disk (50k$/slice). Tandem, VMScluster: 150 nodes (100k$/slice). Intel: 9,000 nodes, 55M$ (6k$/slice). Teradata, Tandem, DEC moving to NT + low slice price. IBM: m$ (200k$/slice). PC clusters (bare handed) at dozens of nodes: web servers (msn, PointCast,…), DB servers. KEY TECHNOLOGY HERE IS THE APPS: apps distribute data; apps distribute execution.
32 Cluster Advantages. Clients and Servers made from the same stuff. Inexpensive: built with commodity components. Fault tolerance: spare modules mask failures. Modular growth: grow by adding small modules. Parallel data search: use multiple processors and disks.
33 Clusters are winning the high end. You saw that a 4x8 cluster has best TPC-C performance. This year, a 95xUltraSparc cluster won the MinuteSort Speed Trophy (see NOWsort). Ordinal 16x on SGI Origin is close (but the loser!).
34 Clusters (Plumbing). Single system image: naming, protection/security, management/load balance. Fault Tolerance: Wolfpack Demo. Hot Pluggable hardware & Software.
35 So, What’s New? When slices cost 50k$, you buy 10 or 20. Manageability, programmability, usability become key issues (total cost of ownership). PCs are MUCH easier to use and program. (Diagram labels: MPP Vicious Cycle: New MPP & New OS, New App, No Customers! CP/Commodity Virtuous Cycle: Standard OS & Hardware, Apps, Customers; standards allow progress and investment protection.)
36 Windows NT Server Clustering: High Availability On Standard Hardware. Standard API for clusters on many platforms. No special hardware required. Resource Group is unit of failover. Typical resources: shared disk, printer, ...; IP address, NetName; Service (Web, SQL, File, Print, Mail, MTS). API to define resource groups, dependencies, resources. GUI administrative interface. A consortium of 60 HW & SW vendors (everybody who is anybody). 2-Node Cluster in beta test now; available 97H1; >2 node is next. SQL Server and Oracle demo on it today. Key concepts: System: a node. Cluster: systems working together. Resource: hard/software module. Resource dependency: resource needs another. Resource group: fails over as a unit. Dependencies: do not cross group boundaries.
The Wolfpack program has three goals: (1) to be the most reliable way to run Windows NT Server, (2) to be the most cost-effective high-availability platform, and (3) to be the easiest platform for developing cluster-aware solutions. Let’s look at each of those three in more detail.
Wolfpack will be the most reliable way to run Windows NT Server. Out of the box, it will provide automatic recovery for file sharing, printer sharing, and Internet/Intranet services. It will be able to provide basic recovery services for virtually any existing server application without coding changes, and will feature an administrator’s console that makes it easy to take a server off-line for maintenance without disrupting your mission-critical business applications. The other server can deliver services while one is being changed.
Wolfpack will run on standard servers from many vendors. It can use many interconnects ranging from standard Ethernet to specialized high-speed ones like Tandem ServerNet. It works with a wide range of disk drives and controllers including standard SCSI drives. This broad hardware support means flexibility, choice, and competitive pricing.
Wolfpack clustering technology allows all nodes in the cluster to do useful work -- there’s no wasted “standby” server sitting idle waiting for a failure as there is with server mirroring solutions. And, of course, because it’s Windows software, it will have a familiar and easy to use graphical interface for the administrator.
SQL Server will use Wolfpack’s Clustering API to provide high availability via disk and IP address failover. SQL Server continues its close integration with NT and its unmatched ease-of-use. SQL Server 7.0 will provide a GUI configuration and management wizard to make it easy to configure high availability databases.
37 Wolfpack NT Clusters 1.0. Two node file and print failover. GUI admin interface. (Diagram labels: Clients; nodes Alice and Betty, each with Private Disks; Shared SCSI Disk Strings.)
Wolfpack NT Clusters 1.0 supports clusters containing two nodes, affectionately called Alice and Betty. Alice and Betty have some private devices and some shared SCSI strings. At any instant, each SCSI disk is “owned” by either Alice or Betty. The SCSI II commands are used to implement this ownership. Most modern SCSI controllers and disks correctly implement these commands. Microsoft is qualifying many controller and disk vendors for Wolfpack.
In configuring Wolfpack NT Clusters, the operator assigns shared devices to one or another failover Resource Group. During normal operation one failover group is served by Alice and the other group is served by Betty. In case one node fails, the other node takes ownership of the shared devices in that resource group and starts serving them. When the failed node returns to service, it can resume ownership of the resource group.
Resources in the group can be disks, services, IP addresses, SQL databases, and other resources. The Wolfpack API allows any application to declare itself as a resource and participate in a Resource Group.
The cluster administration interface provides a graphical way to define resource groups and resources. It also provides a way to monitor and control the resource groups.
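The failover behavior described in these notes can be sketched as a toy model. This is not the Wolfpack API: the class names, the resource names, and the automatic failback on rejoin are all assumptions made for illustration (the notes say only that a returning node "can" resume ownership).

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ResourceGroup:
    """A set of resources that always moves between nodes as one unit."""
    name: str
    resources: List[str]
    preferred_owner: str
    owner: str = ""

    def __post_init__(self) -> None:
        # A group starts out on its preferred node.
        self.owner = self.owner or self.preferred_owner


@dataclass
class TwoNodeCluster:
    nodes: Tuple[str, str]
    groups: List[ResourceGroup]
    up: Set[str] = field(init=False)

    def __post_init__(self) -> None:
        self.up = set(self.nodes)

    def fail(self, node: str) -> None:
        """Mark a node down and move every group it owned to the survivor."""
        self.up.discard(node)
        survivors = [n for n in self.nodes if n in self.up]
        if not survivors:
            return  # both nodes down: nothing can take over
        for group in self.groups:
            if group.owner == node:
                group.owner = survivors[0]

    def rejoin(self, node: str) -> None:
        """Bring a node back and return the groups that prefer it (failback)."""
        self.up.add(node)
        for group in self.groups:
            if group.preferred_owner == node:
                group.owner = node


cluster = TwoNodeCluster(
    nodes=("Alice", "Betty"),
    groups=[
        ResourceGroup("files", ["shared disk 1", "file share", "IP address"], "Alice"),
        ResourceGroup("print", ["shared disk 2", "print spooler"], "Betty"),
    ],
)
cluster.fail("Alice")
owners_after_failure = {g.name: g.owner for g in cluster.groups}
cluster.rejoin("Alice")
owners_after_rejoin = {g.name: g.owner for g in cluster.groups}
print(owners_after_failure, owners_after_rejoin)
```

Keeping a disk, its IP address, and the service that uses them in one group is what lets them move together, which is why the slide says dependencies do not cross group boundaries.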
38 What is Wolfpack? (Architecture diagram labels: Cluster Management Tools; Cluster API DLL; RPC; Cluster Service, containing Global Update Manager, Database Manager, Node Manager, Event Processor, Failover Mgr, Resource Mgr, Communication Manager; Other Nodes; Resource Monitors; Resource Management Interface; Resource DLL entry points Open, Online, IsAlive, LooksAlive, Offline, Close; Physical Resource DLL, Logical Resource DLL, App Resource DLL; Non Aware App; Cluster Aware App.)
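The six entry points named in the diagram (Open, Online, IsAlive, LooksAlive, Offline, Close) suggest a small resource life-cycle interface. The sketch below models that idea in Python; it is hypothetical and not the real resource DLL interface. In particular, treating LooksAlive as a cheap check and IsAlive as a thorough one is an assumption the slide itself does not state, and `FakeService` and `poll` are invented for illustration.

```python
from abc import ABC, abstractmethod


class Resource(ABC):
    """Hypothetical life-cycle interface mirroring the six names on the slide."""

    @abstractmethod
    def open(self) -> None: ...
    @abstractmethod
    def online(self) -> None: ...
    @abstractmethod
    def looks_alive(self) -> bool: ...
    @abstractmethod
    def is_alive(self) -> bool: ...
    @abstractmethod
    def offline(self) -> None: ...
    @abstractmethod
    def close(self) -> None: ...


class FakeService(Resource):
    """Stand-in resource whose health can be toggled for demonstration."""

    def __init__(self) -> None:
        self.state = "closed"
        self.healthy = True

    def open(self) -> None:
        self.state = "offline"

    def online(self) -> None:
        self.state = "online"

    def looks_alive(self) -> bool:
        return self.state == "online"                     # cheap check (assumed)

    def is_alive(self) -> bool:
        return self.state == "online" and self.healthy    # thorough check (assumed)

    def offline(self) -> None:
        self.state = "offline"

    def close(self) -> None:
        self.state = "closed"


def poll(resource: Resource, thorough: bool) -> bool:
    """One monitor tick: always run the cheap check, optionally the thorough one."""
    if not resource.looks_alive():
        return False
    return resource.is_alive() if thorough else True


svc = FakeService()
svc.open()
svc.online()
svc.healthy = False                       # fault that only a thorough check sees
print(poll(svc, thorough=False), poll(svc, thorough=True))
```

A monitor built this way could run the cheap check often and the thorough check less often, reporting a failure to the failover logic when either one fails.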
39 Where We Are Today. Clusters moving fast: OLTP, Sort, WolfPack. Technology ahead of schedule: cpus, disks, tapes, wires, ... OR Databases are evolving. Parallel DBMSs are evolving. HSM still immature.
40 Outline. The challenge: Building GIANT data stores (for example, the EOS/DIS 15 PB system). Conclusion 1: Think about MOX and SCANS. Conclusion 2: Think about Clusters (SMP report, Cluster report).
41 Building PetaByte Servers. Jim Gray, Microsoft Research. Kilo 10^3; Mega 10^6; Giga 10^9; Tera 10^12 (today, we are here); Peta 10^15; Exa 10^18.