1 Yotta Zetta Exa Peta Tera Giga Mega Kilo Data Centric Computing Jim Gray Microsoft Research Research.Microsoft.com/~Gray/talks FAST 2002 Monterey, CA, 14 Oct 1999
2 Put Everything in Future (Disk) Controllers (it's not if, it's when?) Jim Gray Microsoft Research FAST 2002 Monterey, CA, 14 Oct 1999 Acknowledgements: Dave Patterson explained this to me long ago. Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments
3 First Disk 1956 IBM 305 RAMAC 4 MB 50x24 disks 1200 rpm 100 ms access 35k$/y rent Included computer & accounting software (tubes not transistors)
4 10 years later 1.6 meters
5 Disk Evolution Capacity:100x in 10 years 1 TB 3.5 drive in GB as 1 micro-drive System on a chip High-speed SAN Disk replacing tape Disk is super computer! Kilo Mega Giga Tera Peta Exa Zetta Yotta
6 Disks are becoming computers Smart drives Camera with micro-drive Replay / Tivo / Ultimate TV Phone with micro-drive MP3 players Tablet Xbox Many more… Disk Ctlr + 1 GHz cpu + 1 GB RAM Comm: Infiniband, Ethernet, radio… Applications Web, DBMS, Files OS
7 Data Gravity Processing Moves to Transducers smart displays, microphones, printers, NICs, disks Storage Network Display ASIC Today: P=50 mips M= 2 MB In a few years P= 500 mips M= 256 MB Processing decentralized Moving to data sources Moving to power sources Moving to sheet metal ? The end of computers ?
8 It's Already True of Printers Peripheral = CyberBrick You buy a printer You get –several network interfaces –a PostScript engine: cpu, memory, software, a spooler (soon) –and… a print engine.
9 The (absurd?) consequences of Moore's Law 256 way NUMA? Huge main memories: now: 500MB - 64GB memories then: 10GB - 1TB memories Huge disks now: GB 3.5 disks then: TB disks Petabyte storage farms –(that you can't back up or restore). Disks >> tapes –Small disks: One platter one inch 10GB SAN convergence 1 GBps point to point is easy 1 GB RAM chips MAD at 200 Gbpsi Drives shrink one quantum 10 GBps SANs are ubiquitous 1 bips cpus for 10$ 10 bips cpus at high end
10 The Absurd Design? Further segregate processing from storage Poor locality Much useless data movement Amdahl's laws: bus: 10 B/ips io: 1 b/ips Processors Disks ~ 1 Tips RAM ~ 1 TB ~ 100TB 100 GBps 10 TBps
11 What's a Balanced System? (40+ disk arms / cpu) System Bus PCI Bus
12 Amdahl's Balance Laws Revised Laws right, just need interpretation (imagination?) Balanced System Law: A system needs 8 MIPS/MBpsIO, but instruction rate must be measured on the workload. –Sequential workloads have low CPI (clocks per instruction), –random workloads tend to have higher CPI. Alpha (the MB/MIPS ratio) is rising from 1 to 6. This trend will likely continue. One random IO per 50k instructions. Sequential IOs are larger: one sequential IO per 200k instructions
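The balance rules on this slide are simple ratios, so they can be sketched as a few lines of arithmetic. The function and constant names below are hypothetical; only the three ratios (8 MIPS per MBps of IO, 50k instructions per random IO, 200k per sequential IO) come from the slide, and the 25 MBps and 1 bips inputs are example values borrowed from other slides in this deck.

```python
# Sketch of the revised balance rules quoted on the slide.
# Names are illustrative; the three ratios are the slide's.

MIPS_PER_MBPS_IO = 8            # "A system needs 8 MIPS/MBpsIO"
INSTR_PER_RANDOM_IO = 50_000    # one random IO per 50k instructions
INSTR_PER_SEQ_IO = 200_000      # one sequential IO per 200k instructions


def mips_needed(io_mbps):
    """Instruction rate (MIPS) that balances a given IO bandwidth (MB/s)."""
    return MIPS_PER_MBPS_IO * io_mbps


def ios_per_second(mips, instr_per_io):
    """IO rate implied by a processor speed at a fixed instructions-per-IO ratio."""
    return mips * 1_000_000 / instr_per_io


if __name__ == "__main__":
    print(mips_needed(25))                              # MIPS to balance a 25 MBps stream
    print(ios_per_second(1000, INSTR_PER_RANDOM_IO))    # random IO/s for a 1 bips cpu
    print(ios_per_second(1000, INSTR_PER_SEQ_IO))       # sequential IO/s for a 1 bips cpu
```

Read this way, a 1 bips processor doing random IO would keep many disk arms busy, which is the point of the "40+ disk arms / cpu" remark on the previous slide.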
13 Observations re TPC C, H systems More than ½ the hardware cost is in disks Most of the mips are in the disk controllers 20 mips/arm is enough for tpcC 50 mips/arm is enough for tpcH Need 128MB to 256MB/arm Ref: –Gray& Shenoy: Rules of Thumb… –Keeton, Riedel, Uysal, PhD thesis. ? The end of computers ?
14 Disks / cpu TPC systems Normalize for CPI (clocks per instruction) –TPC-C has about 7 ins/byte of IO –TPC-H has 3 ins/byte of IO TPC-H needs ½ as many disks, sequential vs random Both use 9GB 10 krpm disks (need arms, not bytes) [Table: columns MHz/cpu, CPI, mips, KB/IO, IO/s/disk, Disks, MB/s/cpu, Ins/IO Byte; rows Amdahl, TPC-C (random), TPC-H (sequential)]
15 TPC systems: What's alpha (=MB/MIPS)? Hard to say: –Intel 32 bit addressing (= 4GB limit). Known CPI. –IBM, HP, Sun have 64 GB limit. Unknown CPI. –Look at both, guess CPI for IBM, HP, Sun Alpha is between 1 and 6
System | Mips | Memory | Alpha
Amdahl | 1 | 1 | 1
tpcC Intel | 8x262 = 2Gips | 4GB | 2
tpcH Intel | 8x458 = 4Gips | 4GB | 1
tpcC IBM | 24 cpus ?= 12 Gips | 64GB | 6
tpcH HP | 32 cpus ?= 16 Gips | 32 GB | 2
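Alpha is just memory divided by instruction rate, so the figures on this slide can be rechecked directly. The sketch below assumes 1 GB = 1000 MB and 1 Gips = 1000 MIPS; the dictionary and function names are illustrative. With those assumptions the IBM row comes out near 5.3, so the slide's 6 is presumably a rounded estimate.

```python
# Sketch: recompute alpha (MB of memory per MIPS) for the systems on the slide.
# Assumes 1 GB = 1000 MB and 1 Gips = 1000 MIPS; names are illustrative.

SYSTEMS = {
    # name: (memory in GB, instruction rate in Gips), as listed on the slide
    "tpcC Intel": (4, 2),
    "tpcH Intel": (4, 4),
    "tpcC IBM": (64, 12),
    "tpcH HP": (32, 16),
}


def alpha(memory_gb, gips):
    """MB of memory per MIPS."""
    return (memory_gb * 1000) / (gips * 1000)


if __name__ == "__main__":
    for name, (memory_gb, gips) in SYSTEMS.items():
        print(f"{name}: alpha = {alpha(memory_gb, gips):.1f}")
```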
16 When each disk has 1bips, no need for cpu
17 Implications Conventional: Offload device handling to NIC/HBA; higher level protocols: I2O, NASD, VIA, IP, TCP…; SMP and Cluster parallelism is important. Radical: Move app to NIC/device controller; higher-higher level protocols: CORBA / COM+; Cluster parallelism is VERY important. [Diagram labels: Central Processor & Memory, Terabyte/s Backplane]
18 Interim Step: Shared Logic Brick with 8-12 disk drives 200 mips/arm (or more) 2xGbpsEthernet General purpose OS (except NetApp ) 10k$/TB to 50k$/TB Shared –Sheet metal –Power –Support/Config –Security –Network ports Snap ~1TB 12x80GB NAS NetApp ~.5TB 8x70GB NAS Maxstor ~2TB 12x160GB NAS
19 Next step in the Evolution Disks become supercomputers –Controller will have 1bips, 1 GB ram, 1 GBps net –And a disk arm. Disks will run full-blown app/web/db/os stack Distributed computing Processors migrate to transducers.
20 Gordon Bell's Seven Price Tiers 10$: wrist watch computers 100$: pocket/palm computers 1,000$: portable computers 10,000$: personal computers (desktop) 100,000$: departmental computers (closet) 1,000,000$: site computers (glass house) 10,000,000$: regional computers (glass castle) Super-Server: Costs more than 100,000$ Mainframe: Costs more than 1M$ Must be an array of processors, disks, tapes, comm ports
22 NAS vs SAN Network Attached Storage –File servers –Database servers –Application servers –(it's a slippery slope: as Novell showed) Storage Area Network –A lower life form –Block server: get block / put block –Wrong abstraction level (too low level) –Security is VERY hard to understand. (who can read that disk block?) SCSI and iSCSI are popular. High level Interfaces are better
23 How Do They Talk to Each Other? Each node has an OS Each node has local resources: A federation. Each node does not completely trust the others. Nodes use RPC to talk to each other –WebServices/SOAP? CORBA? COM+? RMI? –One or all of the above. Huge leverage in high-level interfaces. Same old distributed system story. SAN SIO streams datagrams RPC? Applications SIO streams datagrams RPC? Applications
24 Basic Argument for x-Disks Future disk controller is a super-computer. –1 bips processor –256 MB dram –1 TB disk plus one arm Connects to SAN via high-level protocols –RPC, HTTP, SOAP, COM+, Kerberos, Directory Services,…. –Commands are RPCs –management, security,…. –Services file/web/db/… requests –Managed by general-purpose OS with good dev environment Move apps to disk to save data movement –need programming environment in controller
25 The Slippery Slope If you add function to server Then you add more function to server Function gravitates to data. Nothing = Sector Server Everything = App Server Something = Fixed App Server
26 Why Not a Sector Server? (let's get physical!) Good idea, that's what we have today. But –cache added for performance –Sector remap added for fault tolerance –error reporting and diagnostics added –SCSI commands (reserve,..) are growing –Sharing problematic (space mgmt, security,…) Slipping down the slope to a 2-D block server
27 Why Not a 1-D Block Server? Put A LITTLE on the Disk Server Tried and true design –HSC - VAX cluster –EMC –IBM Sysplex (3980?) But look inside –Has a cache –Has space management –Has error reporting & management –Has RAID 0, 1, 2, 3, 4, 5, 10, 50,… –Has locking –Has remote replication –Has an OS –Security is problematic –Low-level interface moves too many bytes
28 Why Not a 2-D Block Server? Put A LITTLE on the Disk Server Tried and true design –Cedar -> NFS –file server, cache, space,.. –Open file is many fewer msgs Grows to have –Directories + Naming –Authentication + access control –RAID 0, 1, 2, 3, 4, 5, 10, 50,… –Locking –Backup/restore/admin –Cooperative caching with client
29 Why Not a File Server? Put a Little on the 2-D Block Server Tried and true design –NetWare, Windows, Linux, NetApp, Cobalt, SNAP,... WebDav Yes, but look at NetWare –File interface grew –Became an app server Mail, DB, Web,…. –NetWare had a primitive OS Hard to program, so optimized wrong thing
30 Why Not Everything? Allow Everything on Disk Server (thin clients) Tried and true design –Mainframes, Minis,... –Web servers,… –Encapsulates data –Minimizes data moves –Scalable It is where everyone ends up. All the arguments against are short-term.
31 The Slippery Slope If you add function to server Then you add more function to server Function gravitates to data. Nothing = Sector Server Everything = App Server Something = Fixed App Server
32 Disk = Node has magnetic storage (1TB?) has processor & DRAM has SAN attachment has execution environment OS Kernel SAN driverDisk driver File SystemRPC,... ServicesDBMS Applications
33 Hardware Homogeneous machines lead to quick response through reallocation HP desktop machines, 320MB RAM, 3u high, 4 100GB IDE Drives $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB 3 weeks from ordering to operational Slide courtesy of Brewster Archive.org
34 Disk as Tape Tape is unreliable, specialized, slow, low density, not improving fast, and expensive Using removable hard drives to replace tape's function has been successful When a tape is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used. Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good. Slide courtesy of Brewster Archive.org
35 Disk As Tape: What format? Today I send NTFS/SQL disks. But that is not a good format for Linux. Solution: Ship NFS/CIFS/ODBC servers (not disks) Plug disk into LAN. –DHCP then file or DB server via standard interface. –Web Service in long term
36 Some Questions Will the disk folks deliver? What is the product? How do I manage 1,000 nodes (disks)? How do I program 1,000 nodes (disks)? How does RAID work? How do I backup a PB? How do I restore a PB?
37 Will the disk folks deliver? Maybe! Hard Drive Unit Shipments Source: DiskTrend/IDC Not a pretty picture (lately)
38 Most Disks are Personal 85% of disks are desktop/mobile (not SCSI) Personal media is AT LEAST 50% of the problem. How to manage your shoebox of: –Documents –Voic –Photos –Music –Videos
39 What is the Product? (see next section on media management) Concept: Plug it in and it works! Music/Video/Photo appliance (home) Game appliance PC File server appliance Data archive/interchange appliance Web appliance appliance Application appliance Router appliance power network
40 Auto Manage Storage 1980 rule of thumb: –A DataAdmin per 10GB, SysAdmin per mips 2000 rule of thumb –A DataAdmin per 5TB –SysAdmin per 100 clones (varies with app). Problem: –5TB is 50k$ today, 5k$ in a few years. –Admin cost >> storage cost !!!! Challenge: –Automate ALL storage admin tasks
41 How do I manage 1,000 nodes? You can't manage 1,000 x (for any x). They manage themselves. –You manage exceptional exceptions. Auto Manage –Plug & Play hardware –Auto-load balance & placement storage & processing –Simple parallel programming model –Fault masking Some positive signs: –Few admins at Google 10k nodes 2 PB, Yahoo! ? nodes, 0.3 PB, Hotmail 10k nodes, 0.3 PB
42 How do I program 1,000 nodes? You can't program 1,000 x (for any x). They program themselves. –You write embarrassingly parallel programs –Examples: SQL, Web, Google, Inktomi, HotMail,…. –PVM and MPI prove it must be automatic (unless you have a PhD)! Auto Parallelism is ESSENTIAL
43 Plug & Play Software RPC is standardizing: (SOAP/HTTP, COM+, RMI/IIOP) –Gives huge TOOL LEVERAGE –Solves the hard problems: naming, security, directory service, operations,... Commoditized programming environments –FreeBSD, Linux, Solaris,…+ tools –NetWare + tools –WinCE, WinNT,…+ tools –JavaOS + tools Apps gravitate to data. General purpose OS on dedicated ctlr can run apps.
44 It's Hard to Archive a Petabyte It takes a LONG time to restore it. At 1GBps it takes 12 days! Store it in two (or more) places online (on disk?). A geo-plex Scrub it continuously (look for errors) On failure, –use other copy until failure repaired, –refresh lost copy from safe copy. Can organize the two copies differently (e.g.: one by time, one by space)
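The "12 days" figure follows from dividing a petabyte by the restore bandwidth. A minimal sketch, assuming decimal units (1 PB = 10^15 bytes, 1 GBps = 10^9 bytes/s); the function name is illustrative.

```python
# Sketch: how long does it take to move a petabyte at a given bandwidth?
# Assumes decimal units; names are illustrative.

PETABYTE = 10**15          # bytes
GBPS = 10**9               # bytes per second
SECONDS_PER_DAY = 86_400


def transfer_days(num_bytes, bytes_per_second):
    """Days needed to stream num_bytes at a sustained rate."""
    return num_bytes / bytes_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # About 11.6 days, which the slide rounds to 12.
    print(f"{transfer_days(PETABYTE, GBPS):.1f} days")
```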
45 Disk vs Tape Disk: –160 GB –25 MBps –5 ms seek time –3 ms rotate latency –2$/GB for drive, 1$/GB for ctlrs/cabinet –4 TB/rack Tape: –100 GB –10 MBps –30 sec pick time –Many minute seek time –5$/GB for media, 10$/GB for drive+library –10 TB/rack The price advantage of tape is narrowing, and the performance advantage of disk is growing. Guesstimates. Cern: 200 TB 3480 tapes 2 col = 50GB Rack = 1 TB = 20 drives
46 I'm a disk bigot I hate tape, tape hates me. Unreliable hardware Unreliable software Poor human factors Terrible latency, bandwidth Disk –Much easier to use –Much faster –Cheaper! –But needs new concepts
47 Disk as Tape Challenges Offline disk (safe from virus) Trivialize Backup/Restore software –Things never change –Just object versions Snapshot for continuous change (databases) RAID in a SAN –(cross-disk journaling) –Massive replication (a la Farsite)
48 Summary Disks will become supercomputers Compete in Linux appliance space Build best NAS software (compete with NetApp,..) Auto-manage huge storage farms FarSite, SQL autoAdmin++,… Build world's best disk-based backup system Including Geoplex (compete with Veritas,..) Push faster on 64-bit
49 Storage capacity beating Moore's law 2 k$/TB today (raw disk) 1k$/TB by end of 2002
50 Trends: Magnetic Storage Densities Amazing progress Ratios have changed Capacity grows 60%/y Access speed grows 10x more slowly
51 Trends: Density Limits The end is near! Products: 23 Gbpsi Lab: 50 Gbpsi Limit: 60 Gbpsi But limit keeps rising & there are alternatives [Figure: Bit density (b/µm² and Gb/in²) vs time for CD, DVD, ODD; Wavelength Limit; Superparamagnetic Limit; ?: NEMS, Fluorescent? Holographic, DNA?] Figure adapted from Franco Vitaliano, The NEW new media: the growing attraction of nonmagnetic storage, Data Storage, Feb 2000, pp 21-32
52 CyberBricks Disks are becoming supercomputers. Each disk will be a file server then SOAP server Multi-disk bricks are transitional Long-term brick will have OS per disk. Systems will be built from bricks. There will also be –Network Bricks –Display Bricks –Camera Bricks –….
54 Communications Excitement!! Point-to-Point Broadcast Immediate Time Shifted conversation money lecture concert mail book newspaper NetWork + DB DataBase It's ALL going electronic Information is being stored for analysis (so ALL database) Analysis & Automatic Processing are being added Slide borrowed from Craig Mundie
55 Information Excitement! But comm just carries information Real value added is –information capture & render speech, vision, graphics, animation, … –Information storage retrieval, –Information analysis
56 Information At Your Fingertips All information will be in an online database (somewhere) You might record everything you –read: 10MB/day, 400 GB/lifetime (5 disks today) –hear: 400MB/day, 16 TB/lifetime (2 disks/year today) –see: 1MB/s, 40GB/day, 1.6 PB/lifetime (150 disks/year maybe someday) Data storage, organization, and analysis is challenge. text, speech, sound, vision, graphics, spatial, time… Information at Your Fingertips –Make it easy to capture –Make it easy to store & organize & analyze –Make it easy to present & access
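The lifetime totals on this slide are the daily rates multiplied by a lifetime. The slide does not state the lifetime it uses; the ratios (400 GB / 10 MB, 16 TB / 400 MB, 1.6 PB / 40 GB) all imply about 40,000 days, so that value is an inferred assumption in the sketch below, as are the names.

```python
# Sketch: lifetime storage from a daily capture rate.
# LIFETIME_DAYS is inferred from the slide's own ratios, not stated on it.

MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15
LIFETIME_DAYS = 40_000     # assumption: roughly 110 years

DAILY_BYTES = {
    "read": 10 * MB,
    "hear": 400 * MB,
    "see": 40 * GB,
}


def lifetime_bytes(daily_bytes, days=LIFETIME_DAYS):
    """Total bytes accumulated over a lifetime at a constant daily rate."""
    return daily_bytes * days


if __name__ == "__main__":
    for activity, daily in DAILY_BYTES.items():
        print(f"{activity}: {lifetime_bytes(daily) / TB:g} TB per lifetime")
```

The "see" row also implies about 40,000 seconds (roughly 11 hours) of recording per day, since 1 MB/s over a full 24 hours would be closer to 86 GB.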
57 How much information is there? Soon everything can be recorded and indexed Most bytes will never be seen by humans. Data summarization, trend detection, anomaly detection are key technologies See Mike Lesk: How much information is there: See Lyman & Varian: How much information [Figure: scale from Kilo to Yotta (Kilo Mega Giga Tera Peta Exa Zetta Yotta) marking A Photo, A Book, A Movie, All LoC books (words), All Books MultiMedia, Everything Recorded!] Small prefixes: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
58 Why Put Everything in Cyberspace? Low rent min $/byte Shrinks time now or later Shrinks space here or there Automate processing knowbots Point-to-Point OR Broadcast Immediate OR Time Delayed Locate Process Analyze Summarize
59 Disk Storage Cheaper than Paper File Cabinet: cabinet (4 drawer) 250$, paper (24,000 sheets) 250$, space (10$/ft²) 180$, total 700$ = 3 ¢/sheet Disk: disk (160 GB) 300$ ASCII: 100 m pages ¢/sheet (10,000x cheaper) Image: 1 m photos 0.03 ¢/sheet (100x cheaper) Store everything on disk
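The per-sheet comparison can be rechecked from the totals on the slide: 700$ for 24,000 paper sheets versus a 300$ disk holding 100 million ASCII pages or 1 million photos. A sketch with illustrative names; the slide's "10,000x" and "100x" use the rounded 3 ¢/sheet figure, so the computed ratios come out slightly lower.

```python
# Sketch: cost per stored sheet, paper versus disk, using the slide's totals.
# Names are illustrative.

def cents_per_sheet(total_dollars, sheets):
    """Storage cost in cents for one sheet."""
    return 100 * total_dollars / sheets


PAPER = cents_per_sheet(700, 24_000)            # file cabinet, paper, floor space
DISK_ASCII = cents_per_sheet(300, 100_000_000)  # 100 m ASCII pages on one disk
DISK_IMAGE = cents_per_sheet(300, 1_000_000)    # 1 m photos on one disk

if __name__ == "__main__":
    print(f"paper: {PAPER:.2f} cents/sheet")
    print(f"ASCII: {DISK_ASCII:.4f} cents/sheet, {PAPER / DISK_ASCII:,.0f}x cheaper")
    print(f"image: {DISK_IMAGE:.2f} cents/sheet, {PAPER / DISK_IMAGE:,.0f}x cheaper")
```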
60 Gordon Bell's MainBrain Digitize Everything A BIG shoebox? Scans: 20 k pages 300 dpi 1 GB Music: 2 k tracks 7 GB Photos: 13 k images 2 GB Video: 10 hrs 3 GB Docs: 3 k (ppt, word,..) 2 GB Mail: 50 k messages 1 GB Total: 16 GB
61 Gary Starkweather Scan EVERYTHING 400 dpi TIFF 70k pages ~ 14GB OCR all scans (98% recognition ocr accuracy) All indexed (5 second access to anything) All on his laptop.
62 Q: What happens when the personal terabyte arrives? A: Things will run SLOWLY…. unless we add good software
63 Summary Disks will morph to appliances Main barriers to this happening –Lack of Cool Apps –Cost of Information management
64 The Absurd Disk 2.5 hr scan time (poor sequential access) 1 aps / 5 GB (VERY cold data) Its a tape! 1 TB 100 MB/s 200 Kaps
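Both headline numbers on this slide fall out of the three disk parameters. The sketch below assumes decimal units and reads "200 Kaps" as 200 accesses per second, which is the reading consistent with "1 aps / 5 GB"; the names are illustrative.

```python
# Sketch: why a 1 TB, 100 MB/s, 200 accesses/s disk behaves like a tape.
# Assumes decimal units and treats "200 Kaps" as 200 accesses per second.

CAPACITY_BYTES = 10**12        # 1 TB
BANDWIDTH_BPS = 100 * 10**6    # 100 MB/s sequential
ACCESSES_PER_SECOND = 200


def scan_hours(capacity_bytes, bandwidth_bps):
    """Time to read the whole disk sequentially, in hours."""
    return capacity_bytes / bandwidth_bps / 3600


def gb_per_access_per_second(capacity_bytes, accesses_per_second):
    """How many GB each access/second of arm capacity has to cover."""
    return capacity_bytes / 10**9 / accesses_per_second


if __name__ == "__main__":
    print(f"scan time: {scan_hours(CAPACITY_BYTES, BANDWIDTH_BPS):.1f} hours")
    print(f"{gb_per_access_per_second(CAPACITY_BYTES, ACCESSES_PER_SECOND):g} GB per access/s")
```

The computed scan time is about 2.8 hours, in line with the slide's rounded 2.5 hr.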
65 Crazy Disk Ideas Disk Farm on a card: surface mount disks Disk (magnetic store) on a chip: (micro machines in Silicon) Full Apps (e.g. SAP, Exchange/Notes,..) in the disk controller (a processor with 128 MB dram) ASIC The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail Clayton M. Christensen.ISBN:
66 The Disk Farm On a Card The 500GB disc card An array of discs Can be used as 100 discs 1 striped disc 50 Fault Tolerant discs....etc LOTS of accesses/second bandwidth 14"
67 Trends: promises NEMS (Nano Electro Mechanical Systems) (http://www.nanochip.com/) also Cornell, IBM, CMU,… 250 Gbpsi by using tunneling electron microscope Disk replacement Capacity: 180 GB now, 1.4 TB in 2 years Transfer rate: 100 MB/sec R&W Latency: 0.5msec Power: 23W active, .05W Standby 10k$/TB now, 2k$/TB in 2004
68 Trends: Gilder's Law: 3x bandwidth/year for 25 more years Today: –40 Gbps per channel (λ) –12 channels per fiber (wdm): 500 Gbps –32 fibers/bundle = 16 Tbps/bundle In lab 3 Tbps/fiber (400 x WDM) In theory 25 Tbps per fiber 1 Tbps = USA 1996 WAN bisection bandwidth Aggregate bandwidth doubles every 8 months! 1 fiber = 25 Tbps
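The "doubles every 8 months" claim is the same statement as "3x per year", and the per-fiber and per-bundle figures are products of the channel numbers. A sketch with illustrative names; the computed 480 Gbps and 15.4 Tbps are what the slide rounds to 500 Gbps and 16 Tbps.

```python
# Sketch: doubling time implied by 3x/year growth, and the fiber arithmetic.
# Names are illustrative; inputs are the slide's.
import math


def doubling_time_months(growth_per_year):
    """Months needed to double at a constant annual growth factor."""
    return 12 * math.log(2) / math.log(growth_per_year)


GBPS_PER_CHANNEL = 40
CHANNELS_PER_FIBER = 12
FIBERS_PER_BUNDLE = 32

fiber_gbps = GBPS_PER_CHANNEL * CHANNELS_PER_FIBER
bundle_tbps = fiber_gbps * FIBERS_PER_BUNDLE / 1000

if __name__ == "__main__":
    print(f"doubling time: {doubling_time_months(3):.1f} months")
    print(f"per fiber: {fiber_gbps} Gbps, per bundle: {bundle_tbps:.1f} Tbps")
```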
69 Technology Drivers: What if Networking Was as Cheap As Disk IO? TCP/IP –Unix/NT 100% 40MBps Disk –Unix/NT 8% 40MBps Why the Difference? Host Bus Adapter does SCSI packetizing, checksum,… flow control DMA Host does TCP/IP packetizing, checksum,… flow control small buffers
70 Gbps Ethernet: 110 MBps SAN: Standard Interconnect PCI: 70 MBps UW Scsi: 40 MBps FW scsi: 20 MBps scsi: 5 MBps LAN faster than memory bus? 1 GBps links in lab. 100$ port cost soon Port is computer RIP FDDI RIP ATM RIP SCI RIP SCSI RIP FC RIP ?
71 Building a Petabyte Store EMC ~ 500k$/TB = 500M$/PB plus FC switches plus…800M$/PB TPC-C SANs (Dell 18GB/…) 62 M$/PB Dell local SCSI, 3ware 20M$/PB Do it yourself: 5M$/PB
72 The Cost of Storage (heading for 1K$/TB soon) 12/1/1999 9/1/2000 9/1/2001
73 Cheap Storage or Balanced System Low cost storage (2 x 1.5k$ servers) 6K$ TB 2x (1K$ system + 8x80GB disks + 100MbEthernet) Balanced server (7k$/.5 TB) –2x800MHz (2k$) –256 MB (400$) –8 x 80 GB drives (2K$) –Gbps Ethernet + switch (1k$) –11k$ TB, 22K$/RAIDED TB 2x800 MHz 256 MB
GB, 2k$ (now) 4x80 GB IDE (2 hot pluggable) –(1,000$) SCSI-IDE bridge –200k$ Box –500 MHz cpu –256 MB SRAM –Fan, power, Enet –700$ Or 8 disks/box 640 GB for ~3K$ ( or 300 GB RAID)
76 Hot Swap Drives for Archive or Data Interchange 25 MBps write (so can write N x 160 GB in 3 hours) 160 GB/overnite = ~N x $/nite
77 Data delivery costs 1$/GB today Rent for big customers: 300$/megabit per second per month Improved 3x in last 6 years (!). That translates to 1$/GB at each end. You can mail a 160 GB disk for 20$. –That's 16x cheaper –If overnight it's 3 MBps. 3x160 GB ~ ½ TB
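The network and mail figures on this slide can be rederived from the rent and postage quoted. The sketch below assumes a 30-day month and decimal units; the names are illustrative. It treats the 16x figure as comparing 2$/GB (1$/GB at each end) against the mailed disk.

```python
# Sketch: cost per GB over a rented link versus a mailed disk.
# Assumes a 30-day month and decimal units; names are illustrative.

SECONDS_PER_MONTH = 30 * 86_400
RENT_PER_MBPS_MONTH = 300.0    # $ per megabit/s per month
DISK_GB = 160
POSTAGE = 20.0                 # $ to mail one disk


def gb_per_month(mbps):
    """GB that a fully used link of the given megabit/s rate moves in a month."""
    return mbps * 1e6 / 8 * SECONDS_PER_MONTH / 1e9


network_dollars_per_gb = RENT_PER_MBPS_MONTH / gb_per_month(1)   # one end only
mail_dollars_per_gb = POSTAGE / DISK_GB
# Hours in transit for which a mailed disk averages 3 MBps.
hours_at_3_mbytes_per_s = DISK_GB * 1e9 / 3e6 / 3600

if __name__ == "__main__":
    print(f"network: {network_dollars_per_gb:.2f} $/GB per end")
    print(f"mail:    {mail_dollars_per_gb:.3f} $/GB")
    print(f"mail is {2 * 1.0 / mail_dollars_per_gb:.0f}x cheaper than 1$/GB at each end")
    print(f"3 MBps corresponds to {hours_at_3_mbytes_per_s:.1f} hours in transit")
```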
78 Data on Disk Can Move to RAM in 8 years 30:1 6 years
79 Storage Latency: How Far Away is the Data? Registers On Chip Cache On Board Cache Memory Disk Tape /Optical Robot Springfield This Campus This Room My Head 10 min 1.5 hr 2 Years 1 min Pluto 2,000 Years Andromeda
80 More Kaps and Kaps/$ but…. Disk accesses got much less expensive Better disks Cheaper disks! But: disk arms are expensive the scarce resource 1 hour Scan vs 5 minutes in GB 30 MB/s
81 Backup: 3 scenarios Disaster Recovery: Preservation through Replication Hardware Faults: different solutions for different situations –Clusters, –load balancing, –replication, –tolerate machine/disk outages –(Avoided RAID and expensive, low volume solutions) Programmer Error: versioned duplicates (no deletes)
82 Online Data Can build 1PB of NAS disk for 5M$ today Can SCAN ( read or write ) entire PB in 3 hours. Operate it as a data pump: continuous sequential scan Can deliver 1PB for 1M$ over Internet –Access charge is 300$/Mbps bulk rate Need to Geoplex data (store it in two places). Need to filter/process data near the source, –To minimize network costs.
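A rough check of what the 3-hour scan and the 1M$ delivery figure imply. The sketch assumes the petabyte is built from 160 GB drives (the drive size quoted on the Disk vs Tape slide) and that delivery costs 1$/GB (the figure on the data delivery slide); both are assumptions carried over from other slides, and the names are illustrative.

```python
# Sketch: aggregate and per-disk bandwidth needed to scan 1 PB in 3 hours,
# plus the delivery cost at 1$/GB. Drive size and $/GB are carried over from
# other slides as assumptions.

PETABYTE = 10**15
DISK_BYTES = 160 * 10**9
SCAN_SECONDS = 3 * 3600
DOLLARS_PER_GB = 1.0

disks = PETABYTE // DISK_BYTES
aggregate_gb_per_s = PETABYTE / SCAN_SECONDS / 1e9
per_disk_mb_per_s = DISK_BYTES / SCAN_SECONDS / 1e6
delivery_dollars = PETABYTE / 1e9 * DOLLARS_PER_GB

if __name__ == "__main__":
    print(f"{disks} disks scanned in parallel")
    print(f"aggregate: {aggregate_gb_per_s:.1f} GB/s, per disk: {per_disk_mb_per_s:.1f} MB/s")
    print(f"delivery over the Internet: {delivery_dollars:,.0f} $")
```

Because every disk is scanned in parallel, the scan time depends only on one disk's capacity and bandwidth, not on the size of the farm.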