Presentation on theme: "Data Centric Computing"— Presentation transcript:
1 Data Centric Computing. Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo. Jim Gray, Microsoft Research, Research.Microsoft.com/~Gray/talks. FAST 2002, Monterey, CA, 14 Oct 1999
2 Put Everything in Future (Disk) Controllers (it’s not “if”, it’s “when?”). Jim Gray, Microsoft Research. FAST, Monterey, CA, 14 Oct. Acknowledgements: Dave Patterson explained this to me long ago. Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments. BARC started in 1995 with Jim Gray and Gordon Bell. We are part of Microsoft Research with a focus on Scaleable Servers (Gray, Barrera, Barclay, Slutz, VanIngen) and Telepresence (Bell, Gemmell). In 1996 we grew to a staff of 6 and moved to our current location in downtown San Francisco (at the east end of Silicon Gulch). We have close ties to the SQL, MTS, NT, PowerPoint, and NetMeeting groups. We also collaborate with UC Berkeley, Cornell, and Wisconsin on Scaleable computing, and with UC Berkeley and U. Virginia on Telepresence. Each summer we host two interns. BARC is located at 301 Howard St, #830, San Francisco CA 94105. Humor: our next door neighbor is the Justice Department (Environmental Division), so the sign in the lobby reads: Microsoft 830 <= Justice Department 870 =>
3 First Disk, 1956: IBM 305 RAMAC. 4 MB; 50 x 24” disks; 1200 rpm; 100 ms access; 35k$/y rent. Included computer & accounting software (tubes not transistors).
5 Disk Evolution (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta). Capacity: 100x in 10 years. 1 TB 3.5” drive in GB as 1” micro-drive. System on a chip. High-speed SAN. Disk replacing tape. Disk is super computer!
6 Disks are becoming computers. Smart drives: camera with micro-drive; Replay / Tivo / Ultimate TV; phone with micro-drive; MP3 players; tablet; Xbox; many more… Stack: Applications (Web, DBMS, Files); OS; Disk Ctlr + 1 GHz cpu + 1 GB RAM; Comm: Infiniband, Ethernet, radio…
7 Data Gravity: Processing Moves to Transducers (smart displays, microphones, printers, NICs, disks). Processing decentralized: moving to data sources; moving to power sources; moving to sheet metal. ? The end of computers ? ASIC (Storage, Network, Display). Today: P = 50 mips, M = 2 MB. In a few years: P = 500 mips, M = 256 MB.
8 It’s Already True of Printers: Peripheral = CyberBrick. You buy a printer. You get several network interfaces, a Postscript engine (cpu, memory, software, a spooler (soon)), and… a print engine.
9 The (absurd?) consequences of Moore’s Law. 256 way NUMA? Huge main memories: now 500MB - 64GB memories, then 10GB - 1TB memories. Huge disks: now GB 3.5” disks, then TB disks. Petabyte storage farms (that you can’t back up or restore). Disks >> tapes. “Small” disks: one platter, one inch, 10GB. SAN convergence: 1 GBps point to point is easy. 1 GB RAM chips. MAD at 200 Gbpsi. Drives shrink one quantum. 10 GBps SANs are ubiquitous. 1 bips cpus for 10$. 10 bips cpus at high end.
10 The Absurd Design? Further segregate processing from storage. Poor locality. Much useless data movement. Amdahl’s laws: bus: 10 B/ips; io: 1 b/ips. Diagram labels: Processors ~ 1 Tips; RAM ~ 1 TB; Disks ~ 100TB; links 10 TBps and 100 GBps.
11 What’s a Balanced System? (40+ disk arms / cpu). Diagram labels: System Bus; PCI Bus.
12 Amdahl’s Balance Laws Revised. Laws right, just need “interpretation” (imagination?). Balanced System Law: a system needs 8 MIPS/MBps IO, but instruction rate must be measured on the workload. Sequential workloads have low CPI (clocks per instruction); random workloads tend to have higher CPI. Alpha (the MB/MIPS ratio) is rising from 1 to 6. This trend will likely continue. One random IO per 50k instructions. Sequential IOs are larger: one sequential IO per 200k instructions.
13 Observations re TPC C, H systems. More than ½ the hardware cost is in disks. Most of the mips are in the disk controllers. 20 mips/arm is enough for tpcC. 50 mips/arm is enough for tpcH. Need 128MB to 256MB/arm. Ref: Gray & Shenoy: “Rules of Thumb…”; Keeton, Riedel, Uysal, PhD thesis. ? The end of computers ?
14 TPC systems. Normalize for CPI (clocks per instruction). TPC-C has about 7 ins/byte of IO. TPC-H has 3 ins/byte of IO. TPC-H needs ½ as many disks, sequential vs random. Both use 9GB 10 krpm disks (need arms, not bytes).
Columns: MHz/cpu, CPI, mips, KB/IO, IO/s/disk, Disks, Disks/cpu, MB/s/cpu, Ins/IO Byte
Amdahl: 1, 1, 1, 6, -, -, -, -, 8
TPC-C = random: 550, 2.1, 262, 8, 100, 397, 50, 40, 7
TPC-H = sequential: 550, 1.2, 458, 64, 100, 176, 22, 141, 3
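The derived columns of the TPC table can be re-computed from its measured columns. The sketch below is illustrative only: the function names are invented here, 1 MB is taken as 1000 KB, and the inputs are the slide's own figures.

```python
# Sketch: re-derive mips, MB/s/cpu and ins/IO-byte from the slide's inputs.
# Function names are hypothetical; 1 MB = 1000 KB is an assumption.

def mips(mhz: float, cpi: float) -> float:
    """Millions of instructions per second for one cpu."""
    return mhz / cpi

def mb_per_s_per_cpu(disks_per_cpu: float, io_per_s_per_disk: float,
                     kb_per_io: float) -> float:
    """IO bandwidth driven by one cpu, in MB/s."""
    return disks_per_cpu * io_per_s_per_disk * kb_per_io / 1000.0

def ins_per_io_byte(mips_value: float, mbps: float) -> float:
    """Instructions executed per byte of IO (mips / MBps)."""
    return mips_value / mbps

WORKLOADS = {
    "TPC-C": dict(mhz=550, cpi=2.1, kb_per_io=8, io_per_s_per_disk=100, disks_per_cpu=50),
    "TPC-H": dict(mhz=550, cpi=1.2, kb_per_io=64, io_per_s_per_disk=100, disks_per_cpu=22),
}

def balance(name: str) -> tuple:
    """Return (mips, MB/s/cpu, ins/IO byte) for one workload."""
    w = WORKLOADS[name]
    m = mips(w["mhz"], w["cpi"])
    bw = mb_per_s_per_cpu(w["disks_per_cpu"], w["io_per_s_per_disk"], w["kb_per_io"])
    return m, bw, ins_per_io_byte(m, bw)

if __name__ == "__main__":
    for name in WORKLOADS:
        m, bw, ipb = balance(name)
        print(f"{name}: {m:.0f} mips, {bw:.0f} MB/s/cpu, {ipb:.1f} ins/IO byte")
```

Rounded, the results match the table's 262 and 458 mips, 40 and 141 MB/s/cpu, and about 7 and 3 instructions per IO byte.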
15 TPC systems: What’s alpha (= MB/MIPS)? Hard to say: Intel 32 bit addressing (= 4GB limit), known CPI. IBM, HP, Sun have 64 GB limit, unknown CPI. Look at both, guess CPI for IBM, HP, Sun. Alpha is between 1 and 6.
Columns: Mips, Memory, Alpha
Amdahl: 1
tpcC Intel: 8x262 = 2 Gips, 4GB, 2
tpcH Intel: 8x458 = 4 Gips
tpcC IBM: 24 cpus ? = 12 Gips, 64GB, 6
tpcH HP: 32 cpus ? = 16 Gips, 32 GB
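Alpha is simply memory divided by instruction rate. A minimal sketch, using only the rows for which the slide gives both Mips and Memory; the function name is illustrative and decimal units are assumed (1 GB = 1000 MB, 1 Gips = 1000 mips).

```python
# Sketch: alpha = MB of memory per MIPS, for rows with both values present.
# Decimal units are an assumption, so GB / Gips gives MB / MIPS directly.

def alpha(memory_gb: float, gips: float) -> float:
    """Memory-to-instruction-rate ratio in MB per MIPS."""
    return memory_gb / gips

ROWS = {
    "tpcC Intel": (4, 8 * 262 / 1000),  # 4 GB, 8 cpus x 262 mips
    "tpcC IBM": (64, 12),               # 64 GB, guessed 12 Gips
    "tpcH HP": (32, 16),                # 32 GB, guessed 16 Gips
}

if __name__ == "__main__":
    for name, (gb, gips) in ROWS.items():
        print(f"{name}: alpha = {alpha(gb, gips):.1f} MB/MIPS")
```

The values come out near 1.9, 5.3 and 2.0, inside the slide's "between 1 and 6" range; the slide's Alpha column appears to round these.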
17 Implications. Conventional: offload device handling to NIC/HBA; higher level protocols: I2O, NASD, VIA, IP, TCP…; SMP and Cluster parallelism is important. Radical: move app to NIC/device controller; higher-higher level protocols: CORBA / COM+; Cluster parallelism is VERY important. Diagram labels: Central Processor & Memory; Terabyte/s Backplane.
18 Interim Step: Shared Logic. Brick with 8-12 disk drives. 200 mips/arm (or more). 2 x Gbps Ethernet. General purpose OS (except NetApp). 10k$/TB to 50k$/TB. Shared: sheet metal, power, support/config, security, network ports. Examples: Snap™ ~1TB, 12x80GB NAS; NetApp™ ~.5TB, 8x70GB NAS; Maxstor™ ~2TB, 12x160GB NAS.
19 Next step in the Evolution. Disks become supercomputers. Controller will have 1 bips, 1 GB ram, 1 GBps net, and a disk arm. Disks will run full-blown app/web/db/os stack. Distributed computing. Processors migrate to transducers.
20 Gordon Bell’s Seven Price Tiers. 10$: wrist watch computers. 100$: pocket/palm computers. 1,000$: portable computers. 10,000$: personal computers (desktop). 100,000$: departmental computers (closet). 1,000,000$: site computers (glass house). 10,000,000$: regional computers (glass castle). Super-Server: costs more than 100,000$. “Mainframe”: costs more than 1M$. Must be an array of processors, disks, tapes, comm ports.
22 NAS vs SAN. High level interfaces are better. Network Attached Storage: file servers, database servers, application servers (it’s a slippery slope: as Novell showed). Storage Area Network: a lower life form. Block server: get block / put block. Wrong abstraction level (too low level). Security is VERY hard to understand (who can read that disk block?). SCSI and iSCSI are popular.
23 How Do They Talk to Each Other? Each node has an OS. Each node has local resources: a federation. Each node does not completely trust the others. Nodes use RPC to talk to each other: WebServices/SOAP? CORBA? COM+? RMI? One or all of the above. Huge leverage in high-level interfaces. Same old distributed system story. Diagram labels: on each node, Applications over RPC, streams, and datagrams, over SIO, joined by the SAN.
24 Basic Argument for x-Disks. Future disk controller is a super-computer: 1 bips processor, 256 MB dram, 1 TB disk plus one arm. Connects to SAN via high-level protocols: RPC, HTTP, SOAP, COM+, Kerberos, Directory Services,… Commands are RPCs: management, security,… Services file/web/db/… requests. Managed by general-purpose OS with good dev environment. Move apps to disk to save data movement: need programming environment in “controller”.
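The "move apps to disk to save data movement" argument can be illustrated with a toy comparison. This is a sketch under stated assumptions, not anything from the talk: `BlockServer`, `SmartDisk`, the record layout, and the `is_hot` predicate are all hypothetical. It counts the bytes that cross the interconnect when the client filters after fetching blocks versus when the filter runs at the disk.

```python
from typing import Callable, List

def make_blocks(n_records: int = 1000, per_block: int = 50) -> List[bytes]:
    """Build blocks of newline-separated records (hypothetical layout)."""
    records = [f"id={i},temp={i % 100}".encode() for i in range(n_records)]
    return [b"\n".join(records[i:i + per_block])
            for i in range(0, n_records, per_block)]

def is_hot(record: bytes) -> bool:
    """Example predicate: keep records whose temp field is at least 99."""
    return int(record.rsplit(b"=", 1)[1].decode()) >= 99

class BlockServer:
    """Low-level interface: get block. Every block crosses the network."""
    def __init__(self, blocks: List[bytes]) -> None:
        self.blocks = blocks
        self.bytes_sent = 0

    def get_block(self, i: int) -> bytes:
        self.bytes_sent += len(self.blocks[i])
        return self.blocks[i]

class SmartDisk:
    """High-level interface: the predicate runs next to the data."""
    def __init__(self, blocks: List[bytes]) -> None:
        self.blocks = blocks
        self.bytes_sent = 0

    def select(self, predicate: Callable[[bytes], bool]) -> List[bytes]:
        hits = [r for b in self.blocks for r in b.split(b"\n") if predicate(r)]
        self.bytes_sent += sum(len(r) for r in hits)
        return hits

def compare():
    """Run the same query both ways; return hits and bytes moved."""
    blocks = make_blocks()
    low = BlockServer(blocks)
    client_hits = [r for i in range(len(blocks))
                   for r in low.get_block(i).split(b"\n") if is_hot(r)]
    high = SmartDisk(blocks)
    disk_hits = high.select(is_hot)
    return client_hits, disk_hits, low.bytes_sent, high.bytes_sent

if __name__ == "__main__":
    _, hits, low_bytes, high_bytes = compare()
    print(f"{len(hits)} hits; block interface moved {low_bytes} bytes, "
          f"smart disk moved {high_bytes} bytes")
```

Both paths return the same records; only the number of bytes shipped differs, which is the point of putting a programming environment in the controller.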
25 The Slippery Slope. If you add function to server, then you add more function to server. Function gravitates to data. Nothing = Sector Server; Something = Fixed App Server; Everything = App Server.
26 Why Not a Sector Server? (let’s get physical!) Good idea, that’s what we have today. But: cache added for performance; sector remap added for fault tolerance; error reporting and diagnostics added; SCSI commands (reserve,…) are growing; sharing problematic (space mgmt, security,…). Slipping down the slope to a 2-D block server.
27 Why Not a 1-D Block Server? Put A LITTLE on the Disk Server. Tried and true design: HSC - VAX cluster; EMC; IBM Sysplex (3980?). But look inside: has a cache; has space management; has error reporting & management; has RAID 0, 1, 2, 3, 4, 5, 10, 50,…; has locking; has remote replication; has an OS. Security is problematic. Low-level interface moves too many bytes.
28 Why Not a 2-D Block Server? Put A LITTLE on the Disk Server. Tried and true design: Cedar -> NFS; file server, cache, space,… Open file is many fewer msgs. Grows to have: directories + naming; authentication + access control; RAID 0, 1, 2, 3, 4, 5, 10, 50,…; locking; backup/restore/admin; cooperative caching with client.
29 Why Not a File Server? Put a Little on the 2-D Block Server. Tried and true design: NetWare, Windows, Linux, NetApp, Cobalt, SNAP,… WebDav. Yes, but look at NetWare: file interface grew; became an app server (Mail, DB, Web,…). NetWare had a primitive OS: hard to program, so optimized wrong thing.
30 Why Not Everything? Allow Everything on Disk Server (thin clients). Tried and true design: mainframes, minis,… web servers,… Encapsulates data. Minimizes data moves. Scaleable. It is where everyone ends up. All the arguments against are short-term.
31 The Slippery Slope. If you add function to server, then you add more function to server. Function gravitates to data. Nothing = Sector Server; Something = Fixed App Server; Everything = App Server.
32 Disk = Node. Has magnetic storage (1TB?), has processor & DRAM, has SAN attachment, has execution environment. Stack: Applications; Services, DBMS, RPC,…; File System; SAN driver, Disk driver; OS Kernel.
33 Hardware. Homogeneous machines lead to quick response through reallocation. HP desktop machines, 320MB RAM, 3u high, 4 100GB IDE drives. $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB. 3 weeks from ordering to operational. Slide courtesy of Brewster, Archive.org.
34 Disk as Tape. Tape is unreliable, specialized, slow, low density, not improving fast, and expensive. Using removable hard drives to replace tape’s function has been successful. When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used. Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good. Slide courtesy of Brewster, Archive.org.
35 Disk As Tape: What format? Today I send NTFS/SQL disks. But that is not a good format for Linux. Solution: ship NFS/CIFS/ODBC servers (not disks). Plug “disk” into LAN. DHCP, then file or DB server via standard interface. Web Service in long term.
36 Some Questions. Will the disk folks deliver? What is the product? How do I manage 1,000 nodes (disks)? How do I program 1,000 nodes (disks)? How does RAID work? How do I backup a PB? How do I restore a PB?
37 Will the disk folks deliver? Maybe! Hard Drive Unit Shipments (Source: DiskTrend/IDC). Not a pretty picture (lately).
38 Most Disks are Personal. 85% of disks are desktop/mobile (not SCSI). Personal media is AT LEAST 50% of the problem. How to manage your shoebox of: Documents, Voic, Photos, Music, Videos.
39 What is the Product? (see next section on media management) Concept: Plug it in and it works! Music/Video/Photo appliance (home); Game appliance; “PC”; File server appliance; Data archive/interchange appliance; Web appliance; appliance; Application appliance; Router appliance. Diagram labels: network, power.
40 Auto Manage Storage. 1980 rule of thumb: a DataAdmin per 10GB, SysAdmin per mips. 2000 rule of thumb: a DataAdmin per 5TB, SysAdmin per 100 clones (varies with app). Problem: 5TB is 50k$ today, 5k$ in a few years. Admin cost >> storage cost !!!! Challenge: automate ALL storage admin tasks.
41 How do I manage 1,000 nodes? You can’t manage 1,000 x (for any x). They manage themselves. You manage exceptional exceptions. Auto Manage: plug & play hardware; auto-load balance & placement of storage & processing; simple parallel programming model; fault masking. Some positive signs: few admins at Google (10k nodes, 2 PB), Yahoo! (? nodes, 0.3 PB), Hotmail (10k nodes, 0.3 PB).
42 How do I program 1,000 nodes? You can’t program 1,000 x (for any x). They program themselves. You write embarrassingly parallel programs. Examples: SQL, Web, Google, Inktomi, HotMail,… PVM and MPI prove it must be automatic (unless you have a PhD)! Auto Parallelism is ESSENTIAL.
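An embarrassingly parallel program has one per-partition function and a trivial combine step; placement and scheduling are left to the runtime. A minimal sketch, assuming threads stand in for nodes and that the data is a list of text lines (both are illustrative choices, and the function names are invented here):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def partition(items: List[str], n_nodes: int) -> List[List[str]]:
    """Deal items round-robin into n_nodes independent partitions."""
    return [items[i::n_nodes] for i in range(n_nodes)]

def count_matches(part: List[str], word: str = "disk") -> int:
    """Per-node work: needs no data from any other partition."""
    return sum(1 for line in part if word in line)

def parallel_count(lines: List[str], n_nodes: int = 4) -> int:
    """Run the same function on every partition and add up the results."""
    parts = partition(lines, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        return sum(pool.map(count_matches, parts))

if __name__ == "__main__":
    data = ["disk arm", "tape", "smart disk", "cpu"] * 250
    print(parallel_count(data))
```

The programmer writes only `count_matches`; changing `n_nodes` changes the degree of parallelism without touching the per-partition logic.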
43 Plug & Play Software. RPC is standardizing (SOAP/HTTP, COM+, RMI/IIOP). Gives huge TOOL LEVERAGE. Solves the hard problems: naming, security, directory service, operations,… Commoditized programming environments: FreeBSD, Linux, Solaris,… + tools; NetWare + tools; WinCE, WinNT,… + tools; JavaOS + tools. Apps gravitate to data. General purpose OS on dedicated ctlr can run apps.
44 It’s Hard to Archive a Petabyte. It takes a LONG time to restore it. At 1 GBps it takes 12 days! Store it in two (or more) places online (on disk?): a geo-plex. Scrub it continuously (look for errors). On failure, use other copy until failure repaired, refresh lost copy from safe copy. Can organize the two copies differently (e.g.: one by time, one by space).
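The 12-day figure is a direct division of size by bandwidth. A small sketch, assuming decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and a sustained transfer rate with no overhead; the function name is illustrative.

```python
# Sketch: time to restore an archive at a sustained rate.
# Decimal units and zero protocol overhead are assumptions.

PB = 10**15
GB = 10**9
SECONDS_PER_DAY = 86_400

def restore_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to move size_bytes at rate_bytes_per_s."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"1 PB at 1 GBps: {restore_days(PB, GB):.1f} days")
```

One petabyte at 1 GBps is a million seconds, a little under 12 days, which is why the slide argues for keeping a second online copy rather than restoring.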
45 Disk vs Tape. Disk: 160 GB; 25 MBps; 5 ms seek time; 3 ms rotate latency; 2$/GB for drive, 1$/GB for ctlrs/cabinet; 4 TB/rack. Tape: 100 GB; 10 MBps; 30 sec pick time; many minute seek time; 5$/GB for media, 10$/GB for drive+library; 10 TB/rack. Guestimates: Cern: 200 TB; 3480 tapes; 2 col = 50GB; rack = 1 TB = 20 drives. The price advantage of tape is narrowing, and the performance advantage of disk is growing.
46 I’m a disk bigot. I hate tape, tape hates me: unreliable hardware; unreliable software; poor human factors; terrible latency, bandwidth. Disk: much easier to use; much faster; cheaper! But needs new concepts.
47 Disk as Tape Challenges. Offline disk (safe from virus). Trivialize backup/restore software: things never change, just object versions; snapshot for continuous change (databases). RAID in a SAN (cross-disk journaling). Massive replication (a la Farsite).
48 Summary. Disks will become supercomputers. Compete in Linux appliance space. Build best NAS software (compete with NetApp,…). Auto-manage huge storage farms: FarSite, SQL autoAdmin++,… Build world’s best disk-based backup system, including Geoplex (compete with Veritas,…). Push faster on 64-bit.
49 Storage capacity beating Moore’s law. 2 k$/TB today (raw disk). 1 k$/TB by end of 2002.
50 Trends: Magnetic Storage Densities. Amazing progress. Ratios have changed. Capacity grows 60%/y. Access speed grows 10x more slowly.
51 Trends: Density Limits. Figure: bit density (b/µm2 and Gb/in2) vs time, showing the superparamagnetic limit and the wavelength limit (ODD: DVD, CD). The end is near! Products: 23 Gbpsi; Lab: Gbpsi; “limit”: Gbpsi. But limit keeps rising & there are alternatives: NEMS, Fluorescent? Holographic, DNA? Figure adapted from Franco Vitaliano, “The NEW new media: the growing attraction of nonmagnetic storage”, Data Storage, Feb 2000, pp 21-32.
52 CyberBricks. Disks are becoming supercomputers. Each disk will be a file server, then a SOAP server. Multi-disk bricks are transitional. Long-term brick will have OS per disk. Systems will be built from bricks. There will also be Network Bricks, Display Bricks, Camera Bricks,…
54 Communications Excitement!! Immediate: point-to-point (conversation, money), broadcast (lecture, concert). Time shifted: point-to-point (mail), broadcast (book, newspaper). Network + DataBase. It’s ALL going electronic. Information is being stored for analysis (so ALL database). Analysis & automatic processing are being added. Slide borrowed from Craig Mundie.
55 Information Excitement! But comm just carries information. Real value added is: information capture & render (speech, vision, graphics, animation,…); information storage & retrieval; information analysis.
56 Information At Your Fingertips. All information will be in an online database (somewhere). You might record everything you read: 10MB/day, 400 GB/lifetime (5 disks today); hear: 400MB/day, 16 TB/lifetime (2 disks/year today); see: 1MB/s, 40GB/day, 1.6 PB/lifetime (150 disks/year maybe someday). Data storage, organization, and analysis is the challenge: text, speech, sound, vision, graphics, spatial, time… Information at Your Fingertips: make it easy to capture; make it easy to store & organize & analyze; make it easy to present & access.
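The lifetime totals follow from the daily rates once a lifetime length is fixed. The slide does not state one; the 40,000 days used below is an assumption inferred from 400 GB / (10 MB/day), and decimal units are assumed throughout.

```python
# Sketch: lifetime storage from a daily recording rate.
# LIFETIME_DAYS is inferred from the slide's read figures, not stated there.

MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15
LIFETIME_DAYS = 40_000  # assumption: 400 GB / (10 MB/day)

def lifetime_bytes(bytes_per_day: int) -> int:
    """Total bytes recorded over the assumed lifetime."""
    return bytes_per_day * LIFETIME_DAYS

def seconds_per_day_recorded(bytes_per_day: int, bytes_per_s: int) -> float:
    """Seconds of recording per day implied by a daily volume and a rate."""
    return bytes_per_day / bytes_per_s

if __name__ == "__main__":
    print("read:", lifetime_bytes(10 * MB) / GB, "GB")
    print("hear:", lifetime_bytes(400 * MB) / TB, "TB")
    print("see: ", lifetime_bytes(40 * GB) / PB, "PB")
    print("viewing s/day:", seconds_per_day_recorded(40 * GB, 1 * MB))
```

With that one assumption all three lifetime figures on the slide are reproduced, and 40 GB/day at 1 MB/s corresponds to about 11 hours of recorded vision per day.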
57 How much information is there? Soon everything can be recorded and indexed. Most bytes will never be seen by humans. Data summarization, trend detection, and anomaly detection are key technologies. See Mike Lesk: How much information is there. See Lyman & Varian: How much information. Scale, top to bottom: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo. Marked points, top to bottom: Everything! Recorded; All Books MultiMedia; All LoC books (words); Movie; A Photo; A Book. Small prefixes: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli.
58 Why Put Everything in Cyberspace? Low rent: min $/byte. Shrinks time: now or later. Shrinks space: here or there. Automate processing: knowbots. Point-to-Point OR Broadcast. Immediate OR Time Delayed. Locate, Process, Analyze, Summarize.
59 Disk Storage Cheaper than Paper. File cabinet: cabinet (4 drawer) 250$; paper (24,000 sheets) 250$; space (10$/ft2) 180$; total 700$; ¢/sheet. Disk: disk (160 GB =) $. ASCII: 100 m pages, ¢/sheet (10,000x cheaper). Image: 1 m photos, ¢/sheet (100x cheaper). Store everything on disk.
60 Gordon Bell’s MainBrain™: Digitize Everything. A BIG shoebox? Scans: k “pages” 300 dpi, 1 GB. Music: 2 k “tracks”, 7 GB. Photos: 13 k images, 2 GB. Video: 10 hrs, 3 GB. Docs: 3 k (ppt, word,…), 2 GB. Mail: k messages, 1 GB. Total: 16 GB.
61 Gary Starkweather: Scan EVERYTHING. 400 dpi TIFF. 70k “pages” ~ 14GB. OCR all scans (98% recognition accuracy). All indexed (5 second access to anything). All on his laptop.
62 Q: What happens when the personal terabyte arrives? A: Things will run SLOWLY… unless we add good software.
63 Summary. Disks will morph to appliances. Main barriers to this happening: lack of cool apps; cost of information management.
64 The “Absurd” Disk: 1 TB, 100 MB/s, 200 Kaps. 2.5 hr scan time (poor sequential access). 1 aps / 5 GB (VERY cold data). It’s a tape!
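Both ratios on this slide are simple quotients of the drive's three headline numbers. The sketch below assumes decimal units and reads "200 Kaps" as 200 kilobyte-sized accesses per second, an interpretation that is consistent with the slide's 1 aps per 5 GB; the function names are illustrative.

```python
# Sketch: scan time and access density for the 1 TB "absurd" disk.
# Reading "200 Kaps" as 200 accesses/second is an assumption.

TB, GB, MB = 10**12, 10**9, 10**6

def scan_hours(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Hours to read the whole disk sequentially."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600

def gb_per_access_per_second(capacity_bytes: float, accesses_per_s: float) -> float:
    """GB of stored data behind each access/second the arm can deliver."""
    return capacity_bytes / accesses_per_s / GB

if __name__ == "__main__":
    print(f"scan time: {scan_hours(1 * TB, 100 * MB):.1f} hours")
    print(f"density: 1 aps per {gb_per_access_per_second(1 * TB, 200):.0f} GB")
```

The scan comes out at roughly 2.8 hours, in line with the slide's rounded 2.5 hr, and the access density matches its 1 aps per 5 GB.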
65 Crazy Disk Ideas. Disk farm on a card: surface mount disks. Disk (magnetic store) on a chip (micro machines in silicon). Full apps (e.g. SAP, Exchange/Notes,…) in the disk controller (a processor with 128 MB dram): ASIC. See: Clayton M. Christensen, The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail.
66 The Disk Farm On a Card. The 500GB disc card (14"): an array of discs. Can be used as 100 discs, 1 striped disc, 50 fault tolerant discs,… etc. LOTS of accesses/second and bandwidth.
67 Trends: promises. NEMS (Nano Electro Mechanical Systems) (http://www.nanochip.com/), also Cornell, IBM, CMU,… 250 Gbpsi by using tunneling electronic microscope. Disk replacement. Capacity: 180 GB now, TB in 2 years. Transfer rate: 100 MB/sec R&W. Latency: 0.5 msec. Power: 23W active, .05W standby. 10k$/TB now, 2k$/TB in 2004.
68 Trends: Gilder’s Law: 3x bandwidth/year for 25 more years. Today: 40 Gbps per channel (λ); 12 channels per fiber (wdm): 500 Gbps; 32 fibers/bundle = 16 Tbps/bundle. In lab: 3 Tbps/fiber (400 x WDM). In theory: 25 Tbps per fiber. 1 Tbps = USA 1996 WAN bisection bandwidth. Aggregate bandwidth doubles every 8 months! 1 fiber = 25 Tbps.
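The fiber figures and the growth claim can be checked against each other with a few multiplications. A small sketch using the slide's own numbers; the function names are illustrative.

```python
# Sketch: per-fiber and per-bundle bandwidth, and the yearly growth factor
# implied by a doubling time. All inputs are the slide's figures.

def fiber_gbps(gbps_per_channel: float, channels: int) -> float:
    """Bandwidth of one fiber carrying several WDM channels."""
    return gbps_per_channel * channels

def bundle_tbps(gbps_per_fiber: float, fibers: int) -> float:
    """Bandwidth of a bundle of fibers, in Tbps."""
    return gbps_per_fiber * fibers / 1000

def yearly_growth(doubling_months: float) -> float:
    """Growth factor per year for a given doubling time in months."""
    return 2 ** (12 / doubling_months)

if __name__ == "__main__":
    print(fiber_gbps(40, 12), "Gbps per fiber (slide rounds to 500)")
    print(bundle_tbps(500, 32), "Tbps per bundle")
    print(f"{yearly_growth(8):.2f}x per year")
```

Twelve 40 Gbps channels give 480 Gbps, which the slide rounds to 500; doubling every 8 months works out to about 2.8x per year, consistent with the headline 3x.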
69 Technology Drivers: What if Networking Was as Cheap As Disk IO? TCP/IP: Unix/NT 100% cpu @ 40MBps. Disk: Unix/NT 8% cpu @ 40MBps. Why the difference? Host Bus Adapter does SCSI packetizing, checksum,… flow control, DMA. Host does TCP/IP packetizing, small buffers.
70 SAN: Standard Interconnect. Gbps Ethernet: 110 MBps. PCI: 70 MBps. UW SCSI: 40 MBps. FW SCSI: 20 MBps. SCSI: 5 MBps. LAN faster than memory bus? 1 GBps links in lab. 100$ port cost soon. Port is computer. RIP: FDDI, ATM, SCI, SCSI, FC, ?
71 Building a Petabyte Store. EMC ~ 500k$/TB = 500M$/PB, plus FC switches plus… 800M$/PB. TPC-C SANs (Dell 18GB/…): M$/PB. Dell local SCSI, 3ware: M$/PB. Do it yourself: M$/PB.
72 The Cost of Storage (heading for 1K$/TB soon). Snapshots: 12/1/1999, 9/1/2000, 9/1/2001.
73 Cheap Storage or Balanced System. Low cost storage (2 x 1.5k$ servers): 6K$ TB; 2x (1K$ system + 8x80GB disks + 100Mb Ethernet). Balanced server (7k$/.5 TB): 2x800 MHz (2k$); 256 MB (400$); 8 x 80 GB drives (2K$); Gbps Ethernet + switch (1k$); 11k$ TB, 22K$/RAIDED TB.
74 320 GB, 2k$ (now). 4x80 GB IDE (2 hot pluggable) (1,000$); SCSI-IDE bridge (200$); box: 500 MHz cpu, 256 MB SRAM, fan, power, Enet (700$). Or 8 disks/box: 640 GB for ~3K$ (or 300 GB RAID).
76 Hot Swap Drives for Archive or Data Interchange. 25 MBps write (so can write N x 160 GB in 3 hours). 160 GB/overnite = ~N x 4 MB/second @ 19.95$/nite.
77 Data delivery costs 1$/GB today. Rent for “big” customers: 300$/megabit per second per month. Improved 3x in last 6 years (!). That translates to 1$/GB at each end. You can mail a 160 GB disk for 20$. That’s 16x cheaper. If overnight it’s 3 MBps. 3x160 GB ~ ½ TB.
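The 1$/GB figure comes from converting a monthly bandwidth rent into a per-gigabyte price, and the 16x comes from comparing that with postage. A sketch under stated assumptions: a 30-day month, a fully used link, and decimal units; the function names are illustrative.

```python
# Sketch: network cost per GB from a monthly rent, and the mail comparison.
# A 30-day month and 100% link utilisation are assumptions.

GB = 10**9
SECONDS_PER_MONTH = 30 * 86_400

def gb_per_month(mbps: float = 1.0) -> float:
    """GB moved in a month by a link running flat out at mbps megabits/s."""
    return mbps * 1e6 / 8 * SECONDS_PER_MONTH / GB

def dollars_per_gb(rent_per_mbps_month: float) -> float:
    """Price per GB at one end of the connection."""
    return rent_per_mbps_month / gb_per_month(1.0)

def network_vs_mail(disk_gb: float, per_gb_each_end: float, postage: float) -> float:
    """How many times cheaper mailing a disk is than sending it, paying at both ends."""
    return 2 * per_gb_each_end * disk_gb / postage

if __name__ == "__main__":
    print(f"{gb_per_month():.0f} GB/month per Mbps")
    print(f"{dollars_per_gb(300):.2f} $/GB at each end")
    print(f"mail is {network_vs_mail(160, 1.0, 20):.0f}x cheaper")
```

A 1 Mbps link moves about 324 GB in a month, so 300$ buys roughly 0.93$/GB, which the slide rounds to 1$/GB; at 1$/GB on each end, a 160 GB transfer costs 320$ against 20$ of postage.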
78 Data on Disk Can Move to RAM in 8 years 30:16 years
79 Storage Latency: How Far Away is the Data? Registers (1) = My Head (1 min). On Chip Cache (2) = This Room. On Board Cache (10) = This Campus (10 min). Memory (100) = Springfield (1.5 hr). Disk (10^6) = Pluto (2 years). Tape/Optical Robot (10^9) = Andromeda (2,000 years).
80 More Kaps and Kaps/$ but… Disk accesses got much less expensive: better disks, cheaper disks! But: disk arms are the expensive, scarce resource. 1 hour scan (100 GB at 30 MB/s) vs 5 minutes in 1990.
81 Backup: 3 scenarios. Disaster recovery: preservation through replication. Hardware faults: different solutions for different situations: clusters, load balancing, replication, tolerate machine/disk outages (avoided RAID and expensive, low volume solutions). Programmer error: versioned duplicates (no deletes).
82 Online Data. Can build 1PB of NAS disk for 5M$ today. Can SCAN (read or write) entire PB in 3 hours. Operate it as a data pump: continuous sequential scan. Can deliver 1PB for 1M$ over Internet: access charge is 300$/Mbps bulk rate. Need to Geoplex data (store it in two places). Need to filter/process data near the source, to minimize network costs.
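A 3-hour petabyte scan is only possible because every disk scans its own share in parallel. The sketch below works out the aggregate bandwidth that implies and checks it against a per-disk figure. The 160 GB, 25 MBps disk is taken from the talk's Disk vs Tape slide and applied here as an assumption; decimal units are assumed and the function names are illustrative.

```python
import math

# Sketch: what a 3-hour scan of 1 PB implies when all disks read in parallel.
# The 160 GB / 25 MBps per-disk figures are an assumed building block.

PB, GB, MB = 10**15, 10**9, 10**6

def aggregate_gb_per_s(size_bytes: float, hours: float) -> float:
    """Total bandwidth needed to read size_bytes in the given time."""
    return size_bytes / (hours * 3600) / GB

def disks_needed_for_bandwidth(size_bytes: float, hours: float,
                               per_disk_bytes_per_s: float) -> int:
    """Minimum number of disks whose combined rate meets the deadline."""
    return math.ceil(size_bytes / (hours * 3600 * per_disk_bytes_per_s))

def per_disk_scan_hours(disk_bytes: float, per_disk_bytes_per_s: float) -> float:
    """Time for one disk to read itself end to end."""
    return disk_bytes / per_disk_bytes_per_s / 3600

if __name__ == "__main__":
    print(f"aggregate: {aggregate_gb_per_s(PB, 3):.1f} GB/s")
    print("disks for bandwidth:", disks_needed_for_bandwidth(PB, 3, 25 * MB))
    print("disks for capacity: ", PB // (160 * GB))
    print(f"each disk scans in {per_disk_scan_hours(160 * GB, 25 * MB):.2f} hours")
```

Under these assumptions capacity, not bandwidth, sets the disk count: 6,250 disks hold the petabyte, and each reads its 160 GB in under 2 hours, comfortably inside the 3-hour scan.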