Data Centric Computing
1 Data Centric Computing
Yotta Zetta Exa Peta Tera Giga Mega Kilo
Jim Gray, Microsoft Research, Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA

2 Sub-Title: Put Everything in Future (Disk) Controllers (it's not "if", it's "when?")
Jim Gray, Microsoft Research. FAST 2002, Monterey, CA.
Acknowledgements: Dave Patterson explained this to me long ago; Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.
Speaker notes: BARC started in 1995 with Jim Gray and Gordon Bell. We are part of Microsoft Research, with a focus on Scaleable Servers (Gray, Barrera, Barclay, Slutz, Van Ingen) and Telepresence (Bell, Gemmell). In 1996 we grew to a staff of 6 and moved to our current location in downtown San Francisco (at the east end of Silicon Gulch). We have close ties to the SQL, MTS, NT, PowerPoint, and NetMeeting groups. We also collaborate with UC Berkeley, Cornell, and Wisconsin on scaleable computing, and with UC Berkeley and U. Virginia on Telepresence. Each summer we host two interns. BARC is located at 301 Howard St, #830, San Francisco, CA 94105. Humor: our next-door neighbor is the Justice Department (Environmental Division), so the sign in the lobby reads: Microsoft 830 <=  Justice Department 870 =>.

3 First Disk: 1956 IBM 305 RAMAC
4 MB, fifty 24" disks, 1200 rpm, 100 ms access, 35 k$/year rent
Included computer & accounting software (tubes, not transistors)

4 10 years later 1.6 meters

5 Disk Evolution (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta)
Capacity: 100x in 10 years; 1 TB 3.5" drives; GB-scale 1" micro-drives
System on a chip; high-speed SAN
Disk replacing tape
The disk is a supercomputer!

6 Disks are becoming computers
Smart drives: camera with micro-drive, Replay/TiVo/UltimateTV, phone with micro-drive, MP3 players, Tablet, Xbox, many more…
Drive stack: Applications (Web, DBMS, Files); OS; Disk Ctlr + 1 GHz cpu + 1 GB RAM
Comm: Infiniband, Ethernet, radio…

7 Data Gravity: Processing Moves to Transducers
(smart displays, microphones, printers, NICs, disks)
Processing is decentralized: moving to data sources, to power sources, to sheet metal. The end of computers?
ASIC today: P = 50 mips, M = 2 MB. In a few years: P = 500 mips, M = 256 MB. (Diagram: Storage, Network, Display)

8 It’s Already True of Printers Peripheral = CyberBrick
You buy a printer You get a several network interfaces A Postscript engine cpu, memory, software, a spooler (soon) and… a print engine.

9 The (absurd?) consequences of Moore's Law
256-way NUMA?
Huge main memories: now 500 MB - 64 GB, then 10 GB - 1 TB
Huge disks: now GB-scale 3.5" disks, then TB disks
Petabyte storage farms (that you can't back up or restore)
Disks >> tapes
"Small" disks: one platter, one inch, 10 GB
SAN convergence: 1 GBps point-to-point is easy; 10 GBps SANs are ubiquitous
1 GB RAM chips
MAD (magnetic areal density) at 200 Gbpsi; drives shrink one quantum
1 bips cpus for 10$; 10 bips cpus at the high end

10 The Absurd Design? Further segregate processing from storage
Poor locality; much useless data movement
Amdahl's laws: bus: 10 B/ips; io: 1 b/ips
Diagram: Processors (~1 Tips), 10 TBps to RAM (~1 TB), 100 GBps to Disks (~100 TB)

11 What’s a Balanced System? (40+ disk arms / cpu)
(Diagram: cpus and RAM on the System Bus; many disks attached via PCI buses.)

12 Amdahl’s Balance Laws Revised
Laws right, just need “interpretation” (imagination?) Balanced System Law: A system needs 8 MIPS/MBpsIO, but instruction rate must be measured on the workload. Sequential workloads have low CPI (clocks per instruction), random workloads tend to have higher CPI. Alpha (the MB/MIPS ratio) is rising from 1 to 6. This trend will likely continue. One Random IO’s per 50k instructions. Sequential IOs are larger One sequential IO per 200k instructions
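A quick back-of-the-envelope sketch of what these ratios imply, using only the figures quoted on this slide; the 1-bips processor is a hypothetical example, not a measurement:

```python
# Amdahl's balance laws, revised (figures quoted on this slide).
# A balanced system needs ~8 MIPS per MBps of IO, measured on the workload.

def io_bandwidth_needed(mips, ins_per_byte=8):
    """MBps of IO needed to keep `mips` busy at `ins_per_byte` instructions per byte."""
    return mips / ins_per_byte

def random_ios_per_sec(mips, ins_per_io=50_000):
    """Random IO rate implied by 'one random IO per 50k instructions'."""
    return mips * 1e6 / ins_per_io

cpu_mips = 1000                       # a hypothetical 1-bips processor
print(io_bandwidth_needed(cpu_mips))  # -> 125.0 MBps of IO
print(random_ios_per_sec(cpu_mips))   # -> 20000.0 random IOs/s
# At ~100 random IOs/s per disk arm, 1 bips of random work wants ~200 arms;
# sequential work (1 IO per 200k instructions, with larger IOs) needs far fewer.
```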

13 Observations on TPC-C and TPC-H systems
More than half the hardware cost is in disks.
Most of the mips are in the disk controllers: 20 mips/arm is enough for TPC-C; 50 mips/arm is enough for TPC-H.
Need 128 MB to 256 MB per arm.
Ref: Gray & Shenoy, "Rules of Thumb…"; the PhD theses of Keeton, Riedel, and Uysal.
? The end of computers ?

14 TPC systems: normalize for CPI (clocks per instruction)
TPC-C has about 7 instructions per byte of IO; TPC-H has about 3.
TPC-H needs half as many disks (sequential vs random). Both use 9 GB 10 krpm disks (they need arms, not bytes).

                     MHz/cpu  CPI  mips  KB/IO  IO/s/disk  Disks  Disks/cpu  MB/s/cpu  Ins/Byte of IO
Amdahl (ideal)          -      1     1     -        -        -        -          -            8
TPC-C (random)         550    2.1   262    8       100      397      50         40            7
TPC-H (sequential)     550    1.2   458   64       100      176      22        141            3
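The derived columns of the table follow arithmetically from the measured ones; a small sketch of that recomputation, using the values quoted above:

```python
# Recompute the derived columns of the TPC table above.
for name, mhz, cpi, kb_per_io, ios_per_disk, disks_per_cpu in [
    ("TPC-C (random)",     550, 2.1,  8, 100, 50),
    ("TPC-H (sequential)", 550, 1.2, 64, 100, 22),
]:
    mips = mhz / cpi                                              # ~262 and ~458
    mbps_per_cpu = disks_per_cpu * ios_per_disk * kb_per_io / 1000  # ~40 and ~141
    ins_per_byte = mips / mbps_per_cpu                            # ~6.5 (rounds to 7) and ~3.3 (rounds to 3)
    print(f"{name}: {mips:.0f} mips, {mbps_per_cpu:.0f} MB/s/cpu, "
          f"{ins_per_byte:.1f} ins/byte of IO")
```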

15 TPC systems: what's alpha (= MB/MIPS)?
Hard to say: Intel has 32-bit addressing (= 4 GB limit) and known CPI; IBM, HP, and Sun have a 64 GB limit and unknown CPI. Look at both and guess the CPI for IBM, HP, Sun. Alpha is between 1 and 6.

              Mips                   Memory   Alpha
Amdahl         -                        -       1
tpcC Intel    8 x 262 ≈ 2 Gips        4 GB      2
tpcH Intel    8 x 458 ≈ 4 Gips        4 GB      1
tpcC IBM      24 cpus ?≈ 12 Gips     64 GB      6
tpcH HP       32 cpus ?≈ 16 Gips     32 GB      2
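The alpha column is just memory divided by instruction rate; a minimal sketch using the table's own mips guesses (the IBM and HP mips values are the slide's "?=" estimates, not measurements):

```python
# Alpha = MB of memory per MIPS, from the table above.
systems = {
    "tpcC Intel": (8 * 262,  4_000),   # (mips, MB of memory)
    "tpcH Intel": (8 * 458,  4_000),
    "tpcC IBM":   (12_000,  64_000),   # mips is the slide's guess ("?=")
    "tpcH HP":    (16_000,  32_000),
}
for name, (mips, mb) in systems.items():
    print(name, round(mb / mips, 1))   # the alphas fall between ~1 and ~6
```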

16 When each disk has 1 bips, there is no need for a 'cpu'

17 Implications
Conventional: offload device handling to the NIC/HBA; higher-level protocols: I2O, NASD, VIA, IP, TCP…; SMP and cluster parallelism is important. (Central processor & memory.)
Radical: move the app to the NIC/device controller; higher-higher-level protocols: CORBA / COM+; cluster parallelism is VERY important. (Terabyte/s backplane.)

18 Interim Step: Shared Logic
Brick with 8-12 disk drives, 200 mips/arm (or more), 2 x Gbps Ethernet
General-purpose OS (except NetApp), 10 k$/TB to 50 k$/TB
Shared: sheet metal, power, support/config, security, network ports
Examples: Snap™ ~1 TB (12 x 80 GB NAS); NetApp™ ~0.5 TB (8 x 70 GB NAS); Maxstor™ ~2 TB (12 x 160 GB NAS)

19 Next Step in the Evolution
Disks become supercomputers: the controller will have 1 bips, 1 GB RAM, a 1 GBps network link, and a disk arm.
Disks will run the full-blown app/web/db/OS stack.
Distributed computing: processors migrate to the transducers.

20 Gordon Bell’s Seven Price Tiers
10$: wrist watch computers 100$: pocket/ palm computers 1,000$: portable computers 10,000$: personal computers (desktop) 100,000$: departmental computers (closet) 1,000,000$: site computers (glass house) 10,000,000$: regional computers (glass castle) Super-Server: Costs more than 100,000 $ “Mainframe” Costs more than 1M$ Must be an array of processors, disks, tapes comm ports

21 Bell’s Evolution of Computer Classes
Technology enable two evolutionary paths: 1. constant performance, decreasing cost 2. constant price, increasing performance ?? Time Mainframes (central) Minis (dep’t.) PCs (personals) Log Price WSs 1.26 = 2x/3 yrs x/decade; 1/1.26 = .8 1.6 = 4x/3 yrs --100x/decade; 1/1.6 = .62

22 NAS vs SAN
High-level interfaces are better.
Network Attached Storage (NAS): file servers, database servers, application servers (it's a slippery slope, as Novell showed).
Storage Area Network (SAN): a lower life form. Block server: get block / put block. Wrong abstraction level (too low). Security is VERY hard to understand (who can read that disk block?). SCSI and iSCSI are popular.

23 How Do They Talk to Each Other?
Each node has an OS and local resources: a federation. Each node does not completely trust the others.
Nodes use RPC to talk to each other: Web Services/SOAP? CORBA? COM+? RMI? One or all of the above. There is huge leverage in high-level interfaces. Same old distributed-system story.
Diagram: two stacks (Applications over RPC / streams / datagrams over SIO) connected by a SAN.

24 Basic Argument for x-Disks
The future disk controller is a super-computer: a 1 bips processor, 256 MB DRAM, 1 TB of disk, plus one arm.
It connects to the SAN via high-level protocols: RPC, HTTP, SOAP, COM+, Kerberos, Directory Services, ….
Commands are RPCs: management, security, …. It services file/web/db/… requests, managed by a general-purpose OS with a good development environment.
Moving apps to the disk saves data movement, so we need a programming environment in the "controller". A minimal sketch of the idea follows.
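To make "the disk services file/web requests via high-level protocols" concrete, here is a purely illustrative sketch: a disk-brick controller exposing its storage through a standard HTTP file service using Python's stock http.server. The root path and port are made-up placeholders, not anything from the talk; a real brick would speak SOAP/RPC with security and management services as well.

```python
# Hypothetical sketch: a disk brick answering high-level (HTTP) requests
# instead of low-level "read sector N" requests. Path and port are illustrative only.
import http.server
import socketserver

BRICK_ROOT = "/srv/brick0"   # assumed mount point of the brick's local volume
PORT = 8080                  # assumed service port

class BrickHandler(http.server.SimpleHTTPRequestHandler):
    # Serve files from the brick's own storage rather than the process's cwd.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, directory=BRICK_ROOT, **kwargs)

if __name__ == "__main__":
    # Each brick runs its own full OS + service stack; clients never see blocks.
    with socketserver.TCPServer(("", PORT), BrickHandler) as httpd:
        httpd.serve_forever()
```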

25 The Slippery Slope
If you add function to the server, then you add more function to the server: function gravitates to data.
Diagram: Nothing = sector server -> Something = fixed-app server -> Everything = app server.

26 Why Not a Sector Server? (let's get physical!)
Good idea; that's what we have today. But: a cache was added for performance; sector remap was added for fault tolerance; error reporting and diagnostics were added; SCSI commands (reserve, …) are growing; sharing is problematic (space mgmt, security, …).
Slipping down the slope to a 2-D block server.

27 Why Not a 1-D Block Server? Put A LITTLE on the Disk Server
Tried and true design: HSC (VAX cluster), EMC, IBM Sysplex (3980?).
But look inside: it has a cache, space management, error reporting & management, RAID 0, 1, 2, 3, 4, 5, 10, 50, …, locking, remote replication, and an OS.
Security is problematic; the low-level interface moves too many bytes.

28 Why Not a 2-D Block Server? Put A LITTLE on the Disk Server
Tried and true design: Cedar -> NFS (file server, cache, space, …). Open-file is many fewer messages.
Grows to have: directories + naming; authentication + access control; RAID 0, 1, 2, 3, 4, 5, 10, 50, …; locking; backup/restore/admin; cooperative caching with the client.

29 Why Not a File Server? Put a Little on the 2-D Block Server
Tried and true design: NetWare, Windows, Linux, NetApp, Cobalt, SNAP, …, WebDAV.
Yes, but look at NetWare: the file interface grew and it became an app server (mail, DB, Web, …). NetWare had a primitive OS: hard to program, so it optimized the wrong thing.

30 Why Not Everything? Allow Everything on the Disk Server (thin clients)
Tried and true design: mainframes, minis, … web servers, ….
Encapsulates data, minimizes data moves, scaleable.
It is where everyone ends up. All the arguments against are short-term.

31 The Slippery Slope
If you add function to the server, then you add more function to the server: function gravitates to data.
Diagram: Nothing = sector server -> Something = fixed-app server -> Everything = app server.

32 Disk = Node
Has magnetic storage (1 TB?), a processor & DRAM, a SAN attachment, and an execution environment.
Software stack: Applications; Services (DBMS, RPC, …); File System; SAN driver; Disk driver; OS Kernel.

33 Hardware: homogeneous machines lead to quick response through reallocation
HP desktop machines: 320 MB RAM, 3U high, 4 x 100 GB IDE drives
4 k$/TB (street), 2.5 processors/TB, 1 GB RAM/TB
3 weeks from ordering to operational
Slide courtesy of Brewster Kahle, Archive.org

34 Disk as Tape
Tape is unreliable, specialized, slow, low-density, not improving fast, and expensive.
Using removable hard drives to replace tape's function has been successful: when a "tape" is needed, the drive is put in a machine and it is online; there is no need to copy from tape before it is used.
Portable, durable, fast, dense, media cost ≈ raw tape. Unknown longevity: suspected good.
Slide courtesy of Brewster Kahle, Archive.org

35 Disk as Tape: What Format?
Today I send NTFS/SQL disks, but that is not a good format for Linux.
Solution: ship NFS/CIFS/ODBC servers (not bare disks). Plug the "disk" into the LAN; after DHCP it is a file or DB server via a standard interface, and a Web Service in the long term.

36 Some Questions Will the disk folks deliver? What is the product?
How do I manage 1,000 nodes (disks)? How do I program 1,000 nodes (disks)? How does RAID work? How do I back up a PB? How do I restore a PB?

37 Will the disk folks deliver? Maybe!
Chart: hard-drive unit shipments (source: DiskTrend/IDC). Not a pretty picture (lately).

38 Most Disks are Personal
85% of disks are desktop/mobile (not SCSI). Personal media is AT LEAST 50% of the problem.
How do you manage your shoebox of: documents, voicemail, photos, music, videos?

39 What Is the Product? (see next section on media management)
Concept: plug it in and it works!
Music/video/photo appliance (home); game appliance; "PC"; file-server appliance; data archive/interchange appliance; web appliance; email appliance; application appliance; router appliance
(Diagram: network, power)

40 Auto-Manage Storage: admin cost >> storage cost!
1980 rule of thumb: a DataAdmin per 10 GB, a SysAdmin per mips.
2000 rule of thumb: a DataAdmin per 5 TB, a SysAdmin per 100 clones (varies with the app).
Problem: 5 TB is 50 k$ today and 5 k$ in a few years, so admin cost >> storage cost!
Challenge: automate ALL storage admin tasks.
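A rough sketch of why the admin cost dominates; the admin salary here is an assumed figure (~100 k$/year), not something from the slide:

```python
# Admin cost vs raw storage cost, using the 2000 rule of thumb above.
admin_salary = 100_000          # $/year -- assumed, not from the slide
tb_per_admin = 5                # one DataAdmin per ~5 TB (rule of thumb above)
storage_cost_per_tb = 10_000    # $ -- from "5 TB is 50 k$ today"

admin_cost_per_tb_per_year = admin_salary / tb_per_admin   # -> 20,000 $/TB/year
print(admin_cost_per_tb_per_year, storage_cost_per_tb)
# Even today the yearly admin cost exceeds the one-time storage cost,
# and the gap widens as 5 TB drops toward 5 k$.
```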

41 How do I manage 1,000 nodes? You can’t manage 1,000 x (for any x).
They manage themselves; you manage exceptional exceptions.
Auto-manage: plug & play hardware; auto load-balance & placement of storage & processing; a simple parallel programming model; fault masking.
Some positive signs: few admins at Google (10k nodes, 2 PB), Yahoo! (? nodes, 0.3 PB), Hotmail (10k nodes, 0.3 PB).

42 How do I program 1,000 nodes? You can’t program 1,000 x (for any x).
They program themselves; you write embarrassingly parallel programs.
Examples: SQL, Web, Google, Inktomi, HotMail, ….
PVM and MPI prove it must be automatic (unless you have a PhD)! Auto-parallelism is ESSENTIAL.

43 Plug & Play Software
RPC is standardizing (SOAP/HTTP, COM+, RMI/IIOP). This gives huge TOOL LEVERAGE and solves the hard problems: naming, security, directory service, operations, ….
Commoditized programming environments: FreeBSD, Linux, Solaris, … + tools; NetWare + tools; WinCE, WinNT, … + tools; JavaOS + tools.
Apps gravitate to data. A general-purpose OS on a dedicated controller can run apps.

44 It’s Hard to Archive a Petabyte It takes a LONG time to restore it.
At 1GBps it takes 12 days! Store it in two (or more) places online (on disk?). A geo-plex Scrub it continuously (look for errors) On failure, use other copy until failure repaired, refresh lost copy from safe copy. Can organize the two copies differently (e.g.: one by time, one by space)
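The "12 days" figure is just bandwidth arithmetic:

```python
# How long does it take to move a petabyte at 1 GBps?
petabyte = 1e15          # bytes
rate = 1e9               # bytes/second (1 GBps)
days = petabyte / rate / 86_400
print(f"{days:.1f} days")   # -> ~11.6 days, i.e. about 12 days
```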

45 Disk vs Tape
Disk: 160 GB, 25 MBps, 5 ms seek time, 3 ms rotate latency, 2$/GB for the drive, 1$/GB for ctlrs/cabinet, 4 TB/rack.
Tape: 100 GB, 10 MBps, 30 sec pick time, many-minute seek time, 5$/GB for media, 10$/GB for drive + library, 10 TB/rack.
(Guesstimates. CERN: 200 TB of 3480 tapes; 2 col = 50 GB; rack = 1 TB = 20 drives.)
The price advantage of tape is narrowing, and the performance advantage of disk is growing.

46 I’m a disk bigot I hate tape, tape hates me. Disk Much easier to use
Unreliable hardware Unreliable software Poor human factors Terrible latency, bandwidth Disk Much easier to use Much faster Cheaper! But needs new concepts

47 Disk as Tape: Challenges
Offline disk (safe from viruses). Trivialize backup/restore software. Things never change, just object versions. Snapshots for continuous change (databases). RAID in a SAN (cross-disk journaling). Massive replication (a la Farsite).

48 Summary: Disks will become supercomputers
Compete in the Linux appliance space. Build the best NAS software (compete with NetApp, …). Auto-manage huge storage farms (FarSite, SQL AutoAdmin++, …). Build the world's best disk-based backup system, including geoplex (compete with Veritas, …). Push faster on 64-bit.

49 Storage capacity beating Moore’s law
2 k$/TB today (raw disk); 1 k$/TB by the end of 2002.

50 Trends: Magnetic Storage Densities
Amazing progress, but the ratios have changed: capacity grows 60%/year, while access speed grows 10x more slowly.

51 Trends: Density Limits
Chart: bit density vs time (b/µm² and Gb/in²). The end is near! Products: 23 Gbpsi; lab demos and the superparamagnetic "limit" are higher still, but the limit keeps rising and there are alternatives: NEMS, fluorescent?, holographic, DNA? Also shown: the wavelength limit for optical media (CD, DVD, ODD).
Figure adapted from Franco Vitaliano, "The NEW new media: the growing attraction of nonmagnetic storage", Data Storage, Feb 2000, pp 21-32.

52 CyberBricks: disks are becoming supercomputers.
Each disk will be a file server, then a SOAP server. Multi-disk bricks are transitional; the long-term brick will have an OS per disk. Systems will be built from bricks.
There will also be network bricks, display bricks, camera bricks, ….

53 Data Centric Computing
Yotta Zetta Exa Peta Tera Giga Mega Kilo
Jim Gray, Microsoft Research, Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA

54 Communications Excitement!!
                   Point-to-Point             Broadcast
Immediate          conversation, money        lecture, concert
Time-Shifted       mail (network + DB)        book, newspaper (database)
It's ALL going electronic. Information is being stored for analysis (so it is ALL database). Analysis & automatic processing are being added.
Slide borrowed from Craig Mundie

55 Information Excitement!
But comm just carries information; the real value added is information capture & rendering (speech, vision, graphics, animation, …), information storage & retrieval, and information analysis.

56 Information At Your Fingertips
All information will be in an online database (somewhere). You might record everything you:
read: 10 MB/day, 400 GB/lifetime (5 disks today)
hear: 400 MB/day, 16 TB/lifetime (2 disks/year today)
see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (150 disks/year, maybe someday)
Data storage, organization, and analysis are the challenge: text, speech, sound, vision, graphics, spatial, time, ….
Information at your fingertips: make it easy to capture; easy to store, organize & analyze; easy to present & access.
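The lifetime figures follow from the daily rates; a sketch, assuming a ~110-year span (an assumption, but it is the span the slide's own totals imply):

```python
# Lifetime data volumes implied by the daily rates above.
DAYS = 110 * 365   # assumed ~110-year lifetime, matching the slide's totals
for what, per_day_mb in [("read", 10), ("hear", 400), ("see", 40_000)]:
    tb = per_day_mb * 1e6 * DAYS / 1e12
    print(f"{what}: {tb:,.1f} TB over a lifetime")
# -> read ~0.4 TB (400 GB), hear ~16 TB, see ~1,600 TB (1.6 PB), as on the slide.
```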

57 How Much Information Is There?
Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta (and downward: milli 10^-3, micro 10^-6, nano 10^-9, pico 10^-12, femto 10^-15, atto 10^-18, zepto 10^-21, yocto 10^-24).
Soon everything can be recorded and indexed. Most bytes will never be seen by humans. Data summarization, trend detection, and anomaly detection are key technologies.
Chart: everything recorded > all multimedia > all LoC books (words) > a movie > a book > a photo.
See Mike Lesk, "How Much Information Is There?", and Lyman & Varian, "How Much Information?".

58 Why Put Everything in Cyberspace?
Low rent: min $/byte. Shrinks time: now or later. Shrinks space: here or there. Automate processing: knowbots.
Point-to-point OR broadcast; immediate OR time-delayed. Locate, process, analyze, summarize.

59 Disk Storage Cheaper than Paper
File cabinet: cabinet (4 drawer) 250$, paper (24,000 sheets) 250$, floor space (@ ~10$/ft2) 180$; total ~700$, about 3 ¢/sheet.
Disk: a 160 GB drive. ASCII: ~100 m pages, roughly 10,000x cheaper per sheet. Image: ~1 m photos, roughly 100x cheaper per sheet.
Store everything on disk.
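The per-sheet arithmetic, with the drive price as an assumption (~200$ for a 160 GB drive, in line with the cost-of-storage slides later in the deck):

```python
# Paper vs disk, cost per "sheet". Drive price is assumed (~200$ for 160 GB).
paper_total = 250 + 250 + 180           # cabinet + paper + floor space, in $
sheets = 24_000
print(paper_total / sheets * 100)        # -> ~2.8 cents/sheet

drive_cost = 200                         # $ -- assumption
ascii_pages = 100e6                      # "ASCII: ~100 m pages"
image_pages = 1e6                        # "Image: ~1 m photos"
print(drive_cost / ascii_pages * 100)    # -> ~0.0002 cents/sheet  (~10,000x cheaper)
print(drive_cost / image_pages * 100)    # -> ~0.02 cents/sheet    (~100x cheaper)
```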

60 Gordon Bell’s MainBrain™ Digitize Everything A BIG shoebox?
Scans k “pages” 300 dpi 1 GB Music: 2 k “tacks” 7 GB Photos: 13 k images 2 GB Video: 10 hrs 3 GB Docs: 3 k (ppt, word,..) 2 GB Mail: k messages 1 GB 16 GB

61 Gary Starkweather: Scan EVERYTHING
400 dpi TIFF, 70k "pages", ~14 GB. OCR on all scans (98% recognition accuracy). All indexed (5-second access to anything). All on his laptop.

62 Q: What happens when the personal terabyte arrives?
A: Things will run SLOWLY… unless we add good software.

63 Summary: Disks will morph into appliances.
Main barriers to this happening: lack of cool apps; cost of information management.

64 The "Absurd" Disk
1 TB, 100 MB/s, 200 Kaps
2.5 hr scan time (poor sequential access)
1 access per second per 5 GB (VERY cold data)
It's a tape!
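Both figures on this slide are simple ratios of the drive's own parameters:

```python
# The "absurd" 1 TB disk: scan time and access density.
capacity = 1e12          # bytes
bandwidth = 100e6        # bytes/s
accesses_per_sec = 200   # the 200 Kaps quoted above

print(capacity / bandwidth / 3600)          # -> ~2.8 hours to scan the whole disk
print(accesses_per_sec / (capacity / 5e9))  # -> 1.0 access/s per 5 GB of data
```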

65 Crazy Disk Ideas
Disk farm on a card: surface-mount disks.
Disk (magnetic store) on a chip: micro-machines in silicon.
Full apps (e.g., SAP, Exchange/Notes, …) in the disk controller (a processor with 128 MB DRAM plus ASIC).
Reference: Clayton M. Christensen, The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail.

66 The Disk Farm On a Card
The 500 GB disc card: an array of discs on a 14" card. Can be used as 100 discs, 1 striped disc, 50 fault-tolerant discs, etc. LOTS of accesses/second and bandwidth.

67 Trends: promises of NEMS (Nano Electro Mechanical Systems)
(see also Cornell, IBM, CMU, …)
250 Gbpsi by using a tunneling electron microscope. A disk replacement.
Capacity: 180 GB now, TB-class in 2 years. Transfer rate: 100 MB/sec read & write. Latency: 0.5 msec. Power: 23 W active, 0.05 W standby. 10 k$/TB now, 2 k$/TB in 2004.

68 Trends: Gilder’s Law: 3x bandwidth/year for 25 more years
Today: 40 Gbps per channel (λ) 12 channels per fiber (wdm): 500 Gbps 32 fibers/bundle = 16 Tbps/bundle In lab 3 Tbps/fiber (400 x WDM) In theory 25 Tbps per fiber 1 Tbps = USA 1996 WAN bisection bandwidth Aggregate bandwidth doubles every 8 months! 1 fiber = 25 Tbps
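The per-fiber and per-bundle figures are just multiplication of the numbers above:

```python
# Aggregate fiber bandwidth quoted on this slide.
per_lambda_gbps = 40
lambdas_per_fiber = 12
fibers_per_bundle = 32

per_fiber = per_lambda_gbps * lambdas_per_fiber      # -> 480 Gbps (~500 Gbps)
per_bundle = per_fiber * fibers_per_bundle / 1000    # -> ~15.4 Tbps (~16 Tbps)
print(per_fiber, "Gbps per fiber;", per_bundle, "Tbps per bundle")
```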

69 Technology Drivers: What if Networking Were as Cheap as Disk IO?
Chart: moving 40 MBps over TCP/IP on Unix/NT uses ~100% of a cpu; moving 40 MBps from disk uses ~8%.
Why the difference? The host bus adapter does the SCSI packetizing, checksums, flow control, and DMA; the host does the TCP/IP packetizing with small buffers.

70 SAN: Standard Interconnect
Gbps Ethernet: 110 MBps (a LAN faster than the memory bus?); 1 GBps links in the lab; 100$ port cost soon; the port is a computer.
PCI: 70 MBps. UW SCSI: 40 MBps. FW SCSI: 20 MBps. SCSI: 5 MBps.
RIP: FDDI, ATM, SCI, SCSI, FC, ?

71 Building a Petabyte Store
EMC: ~500 k$/TB = 500 M$/PB, plus FC switches, plus … ≈ 800 M$/PB. TPC-C SANs (Dell 18GB/…): … M$/PB. Dell local SCSI, 3ware: … M$/PB. Do it yourself: … M$/PB.

72 The Cost of Storage (heading for 1K$/TB soon)
Chart: disk price per TB at 12/1/1999, 9/1/2000, and 9/1/2001.

73 Cheap Storage or Balanced System
Low-cost storage (2 x ~1.5 k$ servers): ~6 k$/TB; 2 x (1 k$ system + 8 x 80 GB disks + 100 Mb Ethernet).
Balanced server (7 k$ / ~0.5 TB): 2 x 800 MHz cpus (2 k$), 256 MB RAM (400$), 8 x 80 GB drives (2 k$), Gbps Ethernet + switch (1 k$); 11 k$/TB, 22 k$/RAIDed TB.

74 320 GB for 2 k$ (now)
4 x 80 GB IDE (2 hot-pluggable): 1,000$. SCSI-IDE bridge: 200$. Box (500 MHz cpu, 256 MB SRAM, fan, power, Enet): 700$.
Or 8 disks/box: 640 GB for ~3 k$ (or 300 GB RAID).

75

76 Hot-Swap Drives for Archive or Data Interchange
25 MBps write (so you can write N x 160 GB in about 3 hours).
Shipping 160 GB overnight ≈ N x 4 MB/second, at 19.95$/night.
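The "~4 MB/second" figure is the effective bandwidth of shipping the drive, taking "overnight" as roughly half a day (an assumption):

```python
# Effective bandwidth of an overnight-shipped 160 GB drive.
capacity = 160e9                 # bytes
hours = 12                       # "overnight" taken as ~12 hours -- an assumption
print(capacity / (hours * 3600) / 1e6)   # -> ~3.7 MB/s per drive, ~N x 4 MB/s for N drives
```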

77 Data Delivery Costs about 1$/GB Today
Rent for "big" customers: 300$ per megabit-per-second per month (improved 3x in the last 6 years!). That translates to about 1$/GB at each end.
You can mail a 160 GB disk for 20$: that's roughly 16x cheaper, and if shipped overnight it is about 3 MBps. Three 160 GB disks ≈ ½ TB.
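Where the 1$/GB comes from, as a small sketch of the rate arithmetic:

```python
# Cost per GB of WAN delivery at 300 $/Mbps/month (bulk rate).
mbps_month_dollars = 300
seconds_per_month = 30 * 86_400
gb_per_mbps_month = 1e6 / 8 * seconds_per_month / 1e9   # -> ~324 GB moved per Mbps-month
print(mbps_month_dollars / gb_per_mbps_month)            # -> ~0.9 $/GB at each end
# Mailing a 160 GB disk for ~20$ is ~0.12 $/GB: roughly 8x cheaper per end,
# ~16x counting both ends.
```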

78 Data on Disk Can Move to RAM in ~8 Years
Chart: RAM vs disk price/capacity trends; the ratio is roughly 30:1, closing over about 6 years.

79 Storage Latency: How Far Away Is the Data?
(clock ticks)              (analogy)
Registers: 1               my head (1 min)
On-chip cache: 2           this room
On-board cache: 10         this campus (10 min)
Memory: 100                Springfield (1.5 hr)
Disk: 10^6                 Pluto (2 years)
Tape/optical robot: 10^9   Andromeda (2,000 years)

80 More Kaps and Kaps/$, but…
Disk accesses got much less expensive: better disks, cheaper disks!
But disk arms are expensive, the scarce resource: a full scan now takes about 1 hour (100 GB at 30 MB/s) vs about 5 minutes in 1990.

81 Backup: 3 Scenarios
Disaster recovery: preservation through replication.
Hardware faults: different solutions for different situations; clusters, load balancing, replication, tolerate machine/disk outages (avoided RAID and expensive, low-volume solutions).
Programmer error: versioned duplicates (no deletes).

82 Online Data
Can build 1 PB of NAS disk for 5 M$ today, and can SCAN (read or write) the entire PB in about 3 hours. Operate it as a data pump: a continuous sequential scan.
Can deliver 1 PB for about 1 M$ over the Internet (access charge ~300$/Mbps bulk rate).
Need to geoplex the data (store it in two places), and need to filter/process data near the source to minimize network costs.
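Both figures can be sanity-checked the same way; the per-disk drive size and sequential bandwidth used for the scan estimate are assumptions, not numbers from this slide:

```python
# 1 PB NAS farm: scan time and WAN delivery cost.
pb = 1e15
disks = pb / 80e9                    # ~12,500 disks of 80 GB each -- assumed drive size
scan_seconds = pb / (disks * 10e6)   # assume ~10 MB/s of sequential bandwidth per disk
print(scan_seconds / 3600)           # -> ~2.2 hours, i.e. roughly "3 hours"

month_seconds = 30 * 86_400
mbps_needed = pb * 8 / month_seconds / 1e6   # -> ~3,100 Mbps to push 1 PB in a month
print(mbps_needed * 300 / 1e6)               # x 300 $/Mbps/month -> ~0.9 M$
```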

