Building Peta-Byte Servers

Jim Gray
Microsoft Research
Kilo 10^3
Mega 10^6
Giga 10^9
Tera 10^12 (today, we are here)
Peta 10^15
Exa 10^18

Outline
The challenge: Building GIANT data stores
for example, the EOS/DIS 15 PB system
Conclusion 1: Think about Maps and SCANs
Conclusion 2: Think about Clusters

The Challenge -- EOS/DIS
Antarctica is melting -- 77% of fresh water liberated
sea level rises 70 meters
Chico & Memphis are beach-front property
New York, Washington, SF, SB, LA, London, Paris
Let’s study it! Mission to Planet Earth
EOS: Earth Observing System (17B$ => 10B$)
50 instruments on 10 satellites
Landsat (added later)
EOS DIS: Data Information System:
3-5 MB/s raw, MB/s cooked.
4 TB/day,
15 PB by year 2007

The Process Flow
Data arrives and is pre-processed.
instrument data is calibrated, gridded, averaged
Geophysical data is derived
Users ask for stored data OR to analyze and combine data.
Can make the pull-push split dynamically
[Figure: process flow showing Pull Processing, Push Processing, and Other Data]

Designing EOS/DIS (for success)
Expect that millions will use the system (online)
Three user categories:
NASA - funded by NASA to do science
Global Change 10 k - other dirt bags
Internet 20 m - everyone else
Grain speculators
Environmental Impact Reports
school kids
New applications => discovery & access must be automatic
Allow anyone to set up a peer-node (DAAC & SCF)
Design for Ad Hoc queries, Not Just Standard Data Products
If push is 90%, then 10% of data is read (on average).
=> A failure: no one uses the data; in DSS, push is 1% or less.
=> computation demand is enormous (pull:push is 100:1)

The (UC alternative) Architecture
2+N data center design
Scaleable DBMS to manage the data
Emphasize Pull vs Push processing
Storage hierarchy
Data Pump
Just in time acquisition

2+N Data Center Design
Duplex the archive (for fault tolerance)
Let anyone build an extract (the +N)
Partition data by time and by space (store 2 or 4 ways).
Each partition is a free-standing DBMS (similar to Tandem, Teradata designs).
Clients and Partitions interact via standard protocols
DCOM/CORBA, OLE-DB, HTTP,…
Use the (Next Generation) Internet

Obvious Point: EOS/DIS will be a Cluster of SMPs
It needs 16 PB storage
= 1 M disks in current technology
= 500K tapes in current technology
It needs 100 TeraOps of processing
= 100K processors (current technology)
and ~ 100 Terabytes of DRAM
1997 requirements are 1000x smaller
smaller data rate
almost no re-processing work
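The sizing above implies per-device figures that are easy to check. A minimal sketch, assuming decimal units (1 PB = 10^15 bytes) and an even spread over devices; the variable names are illustrative and not from the slides:

```python
# Back-of-envelope check of the slide's round numbers (decimal units assumed).
PB = 10**15
GB = 10**9

storage_bytes = 16 * PB            # "16 PB storage"
disks = 1_000_000                  # "1 M disks"
tapes = 500_000                    # "500K tapes"
total_ops_per_sec = 100 * 10**12   # "100 TeraOps"
processors = 100_000               # "100K processors"


def per_unit(total, units):
    """Spread a total evenly over a number of identical units."""
    return total / units


gb_per_disk = per_unit(storage_bytes, disks) / GB
gb_per_tape = per_unit(storage_bytes, tapes) / GB
gigaops_per_cpu = per_unit(total_ops_per_sec, processors) / 10**9

print(f"{gb_per_disk:.0f} GB per disk")
print(f"{gb_per_tape:.0f} GB per tape")
print(f"{gigaops_per_cpu:.0f} GigaOps per processor")
```

Under these assumptions the slide's totals correspond to roughly 16 GB per disk, 32 GB per tape, and 1 GigaOps per processor.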

Hardware Architecture
2 Huge Data Centers
Each has 50 to 1,000 nodes in a cluster
Each node has about 25…250 TB of storage (FY00 prices)
SMP: Bips to 50 Bips, K$
DRAM: 50GB to 1 TB, K$
100 disks: TB to 230 TB, 200K$
10 tape robots: 50 TB to 500 TB, 100K$
2 Interconnects: 1GBps to 100 GBps, 20K$
Node costs 500K$
Data Center costs 25M$ (capital cost)

Scaleable DBMS
Adopt cluster approach (Tandem, Teradata, VMScluster,..)
System must scale to many processors, disks, links
Organize data as a Database, not a collection of files
SQL rather than FTP as the metaphor
add object types unique to EOS/DIS (Object Relational DB)
DBMS based on standard object model
CORBA or DCOM (not vendor specific)
Grow by adding components
System must be self-managing

Storage Hierarchy
Cache hot 10% (1.5 PB) on disk.
Keep cold 90% on near-line tape.
Remember recent results on speculation
research challenge: how to trade push + store vs. pull.
(more on this later: Maps & SCANS)
[Figure: storage pyramid; labels: 15 PB of Tape Robot, 1 PB of Disk, 10-TB RAM; 500 nodes, 10,000 drives, 4x1,000 robots]

Data Pump
Some queries require reading ALL the data (for reprocessing)
Each Data Center scans the data every 2 days.
Data rate 10 PB/day = 10 TB/node/day = 120 MB/s
Compute on demand small jobs
less than 1,000 tape mounts
less than 100 M disk accesses
less than 100 TeraOps.
(less than 30 minute response time)
For BIG JOBS scan entire 15PB database
Queries (and extracts) “snoop” this data pump.
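The data-pump rate can be reproduced with a short calculation. This sketch assumes 1,000 nodes (the count implied by 10 PB/day = 10 TB/node/day) and decimal units; the function name is illustrative:

```python
PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60


def per_node_rate_mb_s(total_bytes_per_day, nodes):
    """Sustained MB/s each node must deliver if the scan is split evenly."""
    bytes_per_node_per_day = total_bytes_per_day / nodes
    return bytes_per_node_per_day / SECONDS_PER_DAY / 10**6


# 1,000 nodes is an assumption inferred from the slide's own ratio.
rate = per_node_rate_mb_s(10 * PB, nodes=1000)
print(f"{rate:.1f} MB/s per node")
```

The result is a little under 116 MB/s per node, which the slide rounds to 120 MB/s.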

Just-in-time acquisition 30%
Hardware prices decline 20%-40%/year
So buy at last moment
Buy best product that day: commodity
Depreciate over 3 years so that facility is fresh.
(after 3 years, cost is 23% of original).
60% decline peaks at 10M$
[Figure: EOS DIS Disk Storage Size and Cost, 1994 to 2008, assuming 40% price decline/year; series: Data Need (TB) and Storage Cost (M$)]
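The depreciation argument is compound decline. A minimal sketch, assuming a constant annual decline rate applied once per year; the function name is illustrative:

```python
def residual_fraction(annual_decline, years):
    """Fraction of today's price that equivalent hardware costs after `years`."""
    return (1.0 - annual_decline) ** years


for decline in (0.20, 0.40, 0.50):
    print(f"{decline:.0%}/year decline: "
          f"{residual_fraction(decline, 3):.1%} of original price after 3 years")
```

At 40% per year this gives about 22% after three years, close to the slide's 23%. At 20% per year the residual is about 51%, and at 50% per year about 12.5%, so the case for buying at the last moment strengthens as the decline rate grows.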

Just-in-time acquisition 50%!!!!!!!
Hardware prices decline 50%/year lately
The PC revolution!
It's amazing!

TPC-C improved fast (250%/year!)
40% hardware, 100% software, 100% PC Technology

Problems
HSM (hierarchical storage management)
Design and Meta-data
Ingest
Data discovery, search, and analysis
reorganize-reprocess
disaster recovery
management/operations cost

for example, the EOS/DIS 15 PB system
Conclusion 1: Think about Maps and SCANs
Conclusion 2: Think about Clusters

Meta-Message: Technology Ratios Are Important
If everything gets faster & cheaper at the same rate THEN nothing really changes.
Things getting MUCH BETTER:
communication speed & cost 1,000x
processor speed & cost 100x
storage size & cost 100x
Things staying about the same:
speed of light (more or less constant)
people (10x more expensive)
storage speed (only 10x better)

Today’s Storage Hierarchy : Speed & Capacity vs Cost Tradeoffs
[Figure: two log-scale charts plotted against Access Time (seconds): Size vs Speed (Typical System, bytes) and Price vs Speed ($/MB); levels shown: Cache, Main, Secondary (Disc), Online Tape, Nearline Tape, Offline Tape]

Storage Ratios Changed
10x better access time
10x more bandwidth
4,000x lower media price
DRAM/DISK 100:1 to 10:10 to 50:1

What's a Terabyte
Terror Byte !! .1% of a PetaByte!!!!!!!!!!!!!!!!!!
1,000,000,000 business letters (150 miles of bookshelf)
100,000,000 book pages (15 miles of bookshelf)
50,000,000 FAX images (7 miles of bookshelf)
10,000,000 TV pictures (mpeg) (10 days of video)
4,000 LandSat images
Library of Congress (in ASCII) is 25 TB
1980: 200 M$ of disc ,000 discs
5 M$ of tape silo ,000 tapes
1997: K$ of magnetic disc discs
250 K$ of optical disc robot platters
25 K$ of tape silo tapes

The Cost of Storage & Access
File Cabinet:
cabinet (4 drawer) 250$
paper (24,000 sheets) 250$
space (10$/ft2) 180$
total $ ¢/sheet
Disk:
disk (9 GB =) ,000$
ASCII: m pages ¢/sheet (15x cheaper)
Image: 200 k pages ¢/sheet (similar to paper)

Trends: Application Storage Demand Grew
The New World:
Billions of objects
Big objects (1MB)
The Old World:
Millions of objects
100-byte objects

Trends: New Applications
Multimedia: Text, voice, image, video, ...
The paperless office
Library of Congress online (on your campus)
All information comes electronically
entertainment
publishing
business
Information Network, Knowledge Navigator, Information at Your Fingertips

Thesis: Performance = Storage Accesses, not Instructions Executed
In the “old days” we counted instructions and IO’s
Now we count memory references
Processors wait most of the time

The Pico Processor
1 M SPECmarks
10^6 clocks/fault to bulk ram
Event-horizon on chip.
VM reincarnated
Multi-program cache
Terror Bytes!

Storage Latency: How Far Away is the Data?
Level                Relative access time   Analogy       Scaled time
Registers            1                      My Head       1 min
On Chip Cache        2                      This Room
On Board Cache       10                     This Campus   10 min
Memory               100                    Sacramento    1.5 hr
Disk                 10^6                   Pluto         2 Years
Tape/Optical Robot   10^9                   Andromeda     2,000 Years

The Five Minute Rule
Trade DRAM for Disk Accesses
Cost of an access (DriveCost / Access_per_second)
Cost of a DRAM page ($/MB / pages_per_MB)
Break even has two terms:
Technology term and an Economic term
Grew page size to compensate for changing ratios.
Still at 5 minute for random, 1 minute sequential
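The two cost expressions above combine into a break-even interval: keep a page in DRAM if it is re-read more often than this. A minimal sketch; the input values are illustrative late-1990s assumptions (8 KB pages, so 128 pages/MB; 64 accesses/second/disk; 2,000$ per drive; 15$/MB of DRAM) and do not appear on the slide:

```python
def break_even_seconds(pages_per_mb, accesses_per_sec_per_disk,
                       price_per_disk, price_per_mb_dram):
    """Re-reference interval at which caching a page costs the same as
    re-reading it from disk."""
    technology_term = pages_per_mb / accesses_per_sec_per_disk
    economic_term = price_per_disk / price_per_mb_dram
    return technology_term * economic_term


# Illustrative inputs only (assumptions, not figures from the slide).
t = break_even_seconds(pages_per_mb=128, accesses_per_sec_per_disk=64,
                       price_per_disk=2000, price_per_mb_dram=15)
print(f"break-even interval: {t:.0f} s (about {t / 60:.1f} minutes)")
```

With these inputs the interval comes out near 267 seconds, in line with the "5 minute for random" statement. Doubling the page size halves `pages_per_mb` and so halves the interval, which is why page size interacts with the rule.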

Shows Best Index Page Size ~16KB

Standard Storage Metrics
Capacity:
RAM: MB and $/MB: today at 10MB & 100$/MB
Disk: GB and $/GB: today at 10 GB and 200$/GB
Tape: TB and $/TB: today at .1TB and 25k$/TB (nearline)
Access time (latency)
RAM: 100 ns
Disk: ms
Tape: second pick, 30 second position
Transfer rate
RAM: GB/s
Disk: MB/s; arrays can go to 1GB/s
Tape: MB/s; striping is problematic

New Storage Metrics: Kaps, Maps, SCAN?
Kaps: How many kilobyte objects served per second
The file server, transaction processing metric
This is the OLD metric.
Maps: How many megabyte objects served per second
The Multi-Media metric
SCAN: How long to scan all the data
the data mining and utility metric
And Kaps/$, Maps/$, TBscan/$
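One way to estimate the three metrics for a single device is from its access time and transfer rate. This is a sketch under a simple model (one request at a time, no queuing, each object costs one access plus its transfer time). The `Device` class and the disk figures (10 GB, 10 ms, 5 MB/s) are illustrative assumptions, not values from the slides:

```python
from dataclasses import dataclass


@dataclass
class Device:
    capacity_bytes: float
    access_time_s: float          # seek/pick latency per request
    transfer_bytes_per_s: float   # sustained transfer rate

    def objects_per_second(self, object_bytes):
        """Objects served per second: one access plus one transfer each."""
        return 1.0 / (self.access_time_s
                      + object_bytes / self.transfer_bytes_per_s)

    def kaps(self):
        return self.objects_per_second(1_000)

    def maps(self):
        return self.objects_per_second(1_000_000)

    def scan_seconds(self):
        """Time to read the whole device sequentially."""
        return self.capacity_bytes / self.transfer_bytes_per_s


# Hypothetical disk used only to exercise the formulas.
disk = Device(capacity_bytes=10e9, access_time_s=0.010,
              transfer_bytes_per_s=5e6)
print(f"Kaps: {disk.kaps():.1f}")
print(f"Maps: {disk.maps():.2f}")
print(f"SCAN: {disk.scan_seconds():.0f} s")
```

In this model access time dominates Kaps, while transfer time dominates Maps and SCAN, which is why the newer metrics reward bandwidth rather than seek speed.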

For the Record (good 1997 devices)

How To Get Lots of Maps, SCANs
parallelism: use many little devices in parallel
Beware of the media myth
Beware of the access time myth
At 10 MB/s: 1.2 days to scan
1,000 x parallel: 100 seconds SCAN.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
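The scan figures follow from dividing the data evenly across devices. A minimal sketch, assuming a 1 TB data set (the size implied by 1.2 days at 10 MB/s) and perfect partitioning with no coordination overhead:

```python
TB = 10**12
MB = 10**6


def scan_seconds(total_bytes, bytes_per_s_per_device, devices=1):
    """Elapsed scan time when the data is split evenly over `devices`."""
    return total_bytes / (bytes_per_s_per_device * devices)


serial = scan_seconds(1 * TB, 10 * MB)
parallel = scan_seconds(1 * TB, 10 * MB, devices=1000)
print(f"1 device: {serial / 86400:.2f} days")
print(f"1,000 devices: {parallel:.0f} seconds")
```

One device needs about 1.16 days, which the slide rounds to 1.2; a thousand devices working in parallel finish in 100 seconds.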

The Disk Farm On a Card
The 100GB disc card
An array of discs
Can be used as
100 discs
1 striped disc
10 Fault Tolerant discs
....etc
LOTS of accesses/second bandwidth
[Figure: disc card, labeled 14"]
Life is cheap, it's the accessories that cost ya.
Processors are cheap, it’s the peripherals that cost ya (a 10k$ disc card).

Tape Farms for Tertiary Storage
Not Mainframe Silos
100 robots: 1M$, 50TB, 50$/GB, 3K Maps, 27 hr Scan
Each robot: 10K$, 14 tapes, 500 GB, 5 MB/s, 20$/GB, 30 Maps
Scan in 27 hours.
many independent tape robots (like a disc farm)
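Several of the tape-farm figures can be cross-checked from the per-robot numbers. A sketch assuming decimal units and that the 10K$ robot price covers its 500 GB of tape; it does not attempt to reproduce the Maps figures or the 50$/GB figure:

```python
GB = 10**9
MB = 10**6
TB = 10**12

robot_price_usd = 10_000
robot_capacity_bytes = 500 * GB
robot_bandwidth_bytes_per_s = 5 * MB
robots = 100


def scan_hours(capacity_bytes, bandwidth_bytes_per_s):
    """Hours to read the full capacity at the given sustained bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600


robot_usd_per_gb = robot_price_usd / (robot_capacity_bytes / GB)
farm_price_usd = robots * robot_price_usd
farm_capacity_tb = robots * robot_capacity_bytes / TB
# Robots scan independently, so capacity and bandwidth scale together.
farm_scan_hours = scan_hours(robots * robot_capacity_bytes,
                             robots * robot_bandwidth_bytes_per_s)

print(f"{robot_usd_per_gb:.0f} $/GB per robot")
print(f"farm: {farm_price_usd / 1e6:.0f} M$, {farm_capacity_tb:.0f} TB, "
      f"scan in {farm_scan_hours:.1f} hours")
```

Because every robot scans its own 500 GB, the farm's scan time stays just under 28 hours no matter how many robots are added.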

The Metrics: Disk and Tape Farms Win
Data Motel: Data checks in, but it never checks out
[Figure: log-scale chart of GB/K$, Kaps, Maps, and SCANS/Day for 1000x Disc Farm, STC Tape Robot, and 100x DLT Tape Farm; annotation: 6,000 tapes, 8 readers]

Tape & Optical: Beware of the Media Myth
Optical is cheap: 200 $/platter, 2 GB/platter => 100$/GB (2x cheaper than disc)
Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc).

Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot (10 k$ m$)
tapes (at 20GB each) => 20$/GB $/GB
(1x…10x cheaper than disc)
Optical needs a robot (100 k$)
100 platters = 200GB (TODAY) => 400 $/GB
(more expensive than mag disc)
Robots have poor access times
Not good for Library of Congress (25TB)
Data motel: data checks in but it never checks out!

The Access Time Myth
The Myth: seek or pick time dominates
The reality:
(1) Queuing dominates
(2) Transfer dominates BLOBs
(3) Disk seeks often short
Implication: many cheap servers better than one fast expensive server
shorter queues
parallel transfer
lower cost/access and cost/byte
This is now obvious for disk arrays
This will be obvious for tape arrays

for example, the EOS/DIS 15 PB system
Conclusion 1: Think about Maps and SCAN & 5 minute rule
Conclusion 2: Think about Clusters

Scaleable Computers BOTH SMP and Cluster
Grow Up with SMP
4xP6 is now standard
Grow Out with Cluster
Cluster has inexpensive parts
[Figure: scale-up and scale-out; labels: Personal System, Departmental Server, SMP Super Server, Cluster of PCs]

What do TPC results say?
Mainframes do not compete on performance or price
They have great legacy code (MVS)
PC nodes performance is 1/3 of high-end UNIX nodes
6xP6 vs 48xUltraSparc
PC Technology is 3x cheaper than high-end UNIX
Peak performance is a cluster
Tandem 100 node cluster
DEC Alpha 4x8 cluster
Commodity solutions WILL come to this market

Cluster Advantages
Clients and Servers made from the same stuff.
Inexpensive: Built with commodity components
Fault tolerance: Spare modules mask failures
Modular growth: grow by adding small modules
Parallel data search: use multiple processors and disks

Clusters being built
Teradata 500 nodes (50k$/slice)
Tandem, VMScluster 150 nodes (100k$/slice)
Intel, 9,000 nodes @ 55M$ (6k$/slice)
Teradata, Tandem, DEC moving to NT + low slice price
IBM: 512 nodes ASCI @ 100m$ (200k$/slice)
PC clusters (bare handed) at dozens of nodes
web servers (msn, PointCast,…), DB servers
KEY TECHNOLOGY HERE IS THE APPS.
Apps distribute data
Apps distribute execution

Clusters are winning the high end
Until recently a 4x8 cluster has best TPC-C performance
Clusters have best data mining story (TPC-D)
This year, a 32xUltraSparc cluster won the MinuteSort

Clusters (Plumbing)
Single system image
naming
protection/security
management/load balance
Fault Tolerance
Hot Pluggable hardware & Software

So, What’s New?
When slices cost 50k$, you buy 10 or 20.
Manageability, programmability, usability become key issues (total cost of ownership).
PCs are MUCH easier to use and program
MPP Vicious Cycle: No Customers!
CP/Commodity Virtuous Cycle: Standards allow progress and investment protection
[Figure: two cycle diagrams; labels: New MPP & New OS, App, Apps, Standard OS & Hardware, Customers]

Windows NT Server Clustering
High Availability On Standard Hardware
Standard API for clusters on many platforms
No special hardware required.
Resource Group is unit of failover
Typical resources:
shared disk, printer, ...
IP address, NetName
Service (Web, SQL, File, Print, Mail, MTS …)
API to define resource groups, dependencies, resources
GUI administrative interface
A consortium of 60 HW & SW vendors (everybody who is anybody)
2-Node Cluster in beta test now.
Available 97H1
>2 node is next
SQL Server and Oracle Demo on it today
Key concepts
System: a node
Cluster: systems working together
Resource: hard/soft-ware module
Resource dependency: resource needs another
Resource group: fails over as a unit
Dependencies: do not cross group boundaries
The Wolfpack program has three goals: (1) To be the most reliable way to run Windows NT Server, (2) to be the most cost-effective high-availability platform, and (3) to be the easiest platform for developing cluster-aware solutions. Let’s look at each of those three in more detail.
Wolfpack will be the most reliable way to run Windows NT Server. Out of the box, it will provide automatic recovery for file sharing, printer sharing, and Internet/Intranet services. It will be able to provide basic recovery services for virtually any existing server application without coding changes, and will feature an administrator’s console that makes it easy to take a server off-line for maintenance without disrupting your mission-critical business applications. The other server can deliver services while one is being changed.
Wolfpack will run on standard servers from many vendors. It can use many interconnects ranging from standard Ethernet to specialized high-speed ones like Tandem ServerNet. It works with a wide range of disk drives and controllers including standard SCSI drives. This broad hardware support means flexibility, choice, and competitive pricing.
Wolfpack clustering technology allows all nodes in the cluster to do useful work -- there’s no wasted “standby” server sitting idle waiting for a failure as there is with server mirroring solutions. And, of course, because it’s Windows software, it will have a familiar and easy to use graphical interface for the administrator.
SQL Server will use Wolfpack’s Clustering API to provide high-availability via disk and IP address failover. SQL Server continues its close integration with NT and its unmatched ease-of-use. SQL Server 7.0 will provide a GUI configuration and management wizard to make it easy to configure high availability databases.
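The key concepts (resources, dependencies, and groups that fail over as a unit) can be modeled in a few lines. This is a hypothetical illustration only: the class, function, and resource names are invented here and are not the Wolfpack or Windows NT clustering API. It uses the standard-library `graphlib` module (Python 3.9 or later):

```python
from graphlib import TopologicalSorter


class ResourceGroup:
    """A set of resources that fails over as a unit (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.needs = {}  # resource name -> set of resources it depends on

    def add(self, resource, needs=()):
        # A dependency must already belong to this group, so
        # dependencies cannot cross group boundaries.
        for dep in needs:
            if dep not in self.needs:
                raise ValueError(
                    f"{resource} needs {dep}, which is not in group {self.name}")
        self.needs[resource] = set(needs)

    def start_order(self):
        """Order in which to bring resources online: dependencies first."""
        return list(TopologicalSorter(self.needs).static_order())


def fail_over(placement, failed_node, survivor):
    """Move every group owned by failed_node to survivor, whole groups only."""
    return {group: (survivor if node == failed_node else node)
            for group, node in placement.items()}


sql = ResourceGroup("sql")
sql.add("shared disk")
sql.add("IP address")
sql.add("NetName", needs=["IP address"])
sql.add("SQL service", needs=["shared disk", "NetName"])

placement = {"sql": "node1", "web": "node2"}
print(sql.start_order())
print(fail_over(placement, "node1", "node2"))
```

Keeping dependencies inside one group is what makes the group a safe unit of failover: everything a service needs moves with it, and the start order can be computed from the group alone.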

Where We Are Today
Clusters moving fast
OLTP
Sort
WolfPack
Technology ahead of schedule
cpus, disks, tapes, wires,..
Databases are evolving
Parallel DBMSs are evolving
Operations (batch) has a long way to go on Unix/PC.

for example, the EOS/DIS 15 PB system
Conclusion 1: Think about Maps and SCANs & 5 minute rule
Conclusion 2: Think about Clusters
Slides & paper: December SIGMOD RECORD
