Computer Technology Forecast Jim Gray Microsoft Research


Reality Check
Good news:
– In the limit, processing, storage, and networking are free.
– Processing and networking are infinitely fast.
Bad news:
– Most of us live in the present.
– People are getting more expensive: management/programming cost exceeds hardware cost.
– The speed of light is not improving.
– WAN prices have not changed much in the last 8 years.

Interesting Topics
I'll talk about server-side hardware.
What about client hardware?
– Displays, cameras, speech, …
What about software?
– Databases, data mining, PDB, OODB
– Objects / class libraries …
– Visualization
– The Open Source movement

How Much Information Is There?
Soon everything can be recorded and indexed. Most data will never be seen by humans. The precious resource is human attention: auto-summarization and auto-search are the key technologies.
(Slide scale: Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta, Exa, Zetta, Yotta; examples range from a photo and a book/movie through all LoC books (words) and all books plus multimedia up to everything ever recorded. Going down: milli 10^-3, micro 10^-6, nano 10^-9, pico 10^-12, femto 10^-15, atto 10^-18, zepto 10^-21, yocto 10^-24.)

Moore's Law
Performance/price doubles every 18 months: 100x per decade. Progress in the next 18 months = ALL previous progress:
– New storage = sum of all old storage (ever).
– New processing = sum of all old processing.
(For comparison: E. coli doubles every 20 minutes!)
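The doubling arithmetic above can be checked directly; a minimal sketch, assuming the slide's 18-month doubling period:

```python
# A minimal sketch of the slide's doubling arithmetic (18-month period assumed).
doubling_months = 18
factor_per_decade = 2 ** (120 / doubling_months)
print(f"growth per decade: {factor_per_decade:.0f}x")  # ~100x

# "New storage = sum of all old storage": each doubling exceeds the sum of
# everything that came before, since 1 + 2 + ... + 2^(n-1) = 2^n - 1 < 2^n.
previous_total = sum(2 ** i for i in range(10))
print(2 ** 10 > previous_total)  # True
```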

Trends: ops/s/$ Had Three Growth Phases
– Mechanical, relay: 7-year doubling
– Tube, transistor: 2.3-year doubling
– Microprocessor: 1.0-year doubling

What's a Balanced System? (System bus, PCI bus)

Storage capacity is beating Moore's law: 5 k$/TB today (raw disk).

Cheap Storage
Disks are getting cheap: 7 k$/TB (25 × 40 GB drives at 230$ each).

Cheap Storage or Balanced System
Low-cost storage (2 × 1.5 k$ servers): 7 K$/TB
– 2 × (1 K$ system + 8 × 60 GB disks + 100 Mb Ethernet)
Balanced server (7 k$ / 0.5 TB):
– 2 × 800 MHz CPUs (2 k$)
– 256 MB RAM (400$)
– 8 × 60 GB drives (3 K$)
– Gbps Ethernet + switch (1.5 k$)
– 14 k$/TB; 28 K$ per RAIDed TB
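The balanced-server numbers can be totted up; a sketch of the arithmetic, using the component prices listed on the slide:

```python
# Sketch of the balanced-server price arithmetic (component prices from the slide).
cpu, ram, disks, net = 2000, 400, 3000, 1500   # $
total = cpu + ram + disks + net                # 6,900 $ -- the slide's "7 k$"
capacity_tb = 8 * 60 / 1000                    # 8 x 60 GB = 0.48 TB ("0.5 TB")
print(f"{total / capacity_tb:,.0f} $/TB")           # 14,375 $/TB (slide: 14 k$)
print(f"{2 * total / capacity_tb:,.0f} $/RAIDed TB")  # mirrored: 28,750 (slide: 28 K$)
```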

The Absurd Disk (1 TB, 100 MB/s, 200 Kaps)
– 2.5-hour scan time (poor sequential access)
– 1 access per second per 5 GB (VERY cold data)
It's a tape!
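The scan-time and access-rate claims follow from simple division; a sketch using the slide's capacity and bandwidth figures:

```python
# Sketch of the scan-time and access-rate arithmetic behind "it's a tape".
capacity = 1e12          # 1 TB
bandwidth = 100e6        # 100 MB/s
print(capacity / bandwidth / 3600)   # ~2.8 h for a full scan (slide rounds to 2.5)
# At 1 access/second per 5 GB of data, the whole disk sustains only:
print(capacity / 5e9)                # 200 accesses/second over the full 1 TB
```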

Hot-Swap Drives for Archive or Data Interchange
– 25 MBps write (so N × 60 GB can be written in 40 minutes)
– 60 GB shipped overnight = ~N × $/night
(Prices from the slide graphic: 17$, 260$.)

240 GB, 2 k$ (now); 300 GB by year end
– 4 × 60 GB IDE (2 hot-pluggable) (1,100$)
– SCSI-IDE bridge (200$)
– Box: 500 MHz CPU, 256 MB SRAM, fan, power, Enet (700$)
Or 8 disks/box: 600 GB for ~3 K$ (or 300 GB RAIDed)

Hot Swap Drives for Archive or Data Interchange 25 MBps write (so can write N x 74 GB in 3 hours) 74 GB/overnite = ~N x $/nite

It's Hard to Archive a Petabyte
It takes a LONG time to restore: at 1 GBps it takes 12 days! So store it in two (or more) places online (on disk?): a geo-plex.
– Scrub it continuously (look for errors).
– On failure, use the other copy until the failure is repaired, then refresh the lost copy from the safe copy.
– The two copies can be organized differently (e.g., one by time, one by space).
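The 12-day figure is straight division; a sketch of the restore-time arithmetic:

```python
# Sketch: the restore-time arithmetic from the slide.
petabyte = 1e15          # bytes
rate = 1e9               # 1 GB/s
days = petabyte / rate / 86400
print(f"{days:.1f} days")   # 11.6 days -- the slide's "12 days"
```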

Disk vs Tape (guesstimates)
Disk: 60 GB; 30 MBps; 5 ms seek; 3 ms rotational latency; 7$/GB drive + 3$/GB controllers/cabinet; 4 TB/rack; 1-hour scan.
Tape: 40 GB; 10 MBps; 10 s pick time; seek time measured in seconds; 2$/GB media + 8$/GB drive+library; 10 TB/rack; 1-week scan.
The price advantage of tape is narrowing, and the performance advantage of disk is growing. At 10 K$/TB, disk is competitive with nearline tape.
(CERN: 200 TB; 3480 tapes; 2 col = 50 GB; rack = 1 TB = 20 drives.)

Trends: Gilder's Law
3x bandwidth/year for 25 more years. Today:
– 10 Gbps per channel
– 4 channels per fiber: 40 Gbps
– 32 fibers/bundle = 1.2 Tbps/bundle
In the lab: 3 Tbps/fiber (400 × WDM). In theory: 25 Tbps per fiber (1 fiber = 25 Tbps). 1 Tbps = USA 1996 WAN bisection bandwidth. Aggregate bandwidth doubles every 8 months!
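The channel arithmetic and the growth-rate claim can both be checked; a small sketch:

```python
# Sketch of the slide's channel arithmetic and growth-rate claim.
gbps_per_channel = 10
channels_per_fiber = 4
fibers_per_bundle = 32
print(gbps_per_channel * channels_per_fiber * fibers_per_bundle / 1000)  # 1.28 Tbps
# "Doubles every 8 months" is the same claim as roughly 3x per year:
print(f"{2 ** (12 / 8):.2f}x per year")   # 2.83x, which the slide rounds to 3x
```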

Sense of Scale: How Fat Is Your Pipe?
– 300 MBps: OC48 = G2, or memcpy()
– 94 MBps: coast to coast
– 90 MBps: PCI
– 20 MBps: disk / ATM / OC3
The fattest pipe on the MS campus is the WAN!

The route: Redmond/Seattle, WA – San Francisco, CA – New York – Arlington, VA: 5626 km, 10 hops. Partners: Information Sciences Institute, Microsoft, Qwest, University of Washington, Pacific Northwest Gigapop, HSCC (High Speed Connectivity Consortium), DARPA.

The Path DC -> SEA (C:\> tracert -d; IP addresses lost in transcription)
Tracing route over a maximum of 30 hops:
 0  DELL 4400 Win2K WKS, Arlington, Virginia, ISI (Alteon GbE)
 1  16 ms, <10 ms, <10 ms  Juniper M40 GbE, Arlington, Virginia, ISI (interface ISIe)
 2  <10 ms, <10 ms, <10 ms  Cisco GSR OC48, Arlington, Virginia, Qwest DC Edge
 3  <10 ms, <10 ms, <10 ms  Cisco GSR OC48, Arlington, Virginia, Qwest DC Core
 4  <10 ms, <10 ms, 16 ms  Cisco GSR OC48, New York, New York, Qwest NYC Core
 5  62 ms, 63 ms, 62 ms  Cisco GSR OC48, San Francisco, CA, Qwest SF Core
 6  78 ms, 78 ms, 78 ms  Cisco GSR OC48, Seattle, Washington, Qwest Sea Core
 7  78 ms, 78 ms, 94 ms  Juniper M40 OC48, Seattle, Washington, Qwest Sea Edge
 8  78 ms, 79 ms, 78 ms  Juniper M40 OC48, Seattle, Washington, PNW Gigapop
 9  78 ms, 78 ms, 94 ms  Cisco GSR OC48, Redmond, Washington, Microsoft
10  78 ms, 94 ms  Compaq SP750 Win2K WKS, Redmond, Washington, Microsoft (SysKonnect GbE)

PetaBumps
751 Mbps for 300 seconds (~28 GB): single-thread, single-stream TCP/IP, desktop-to-desktop, out-of-the-box performance*.
5626 km × 751 Mbps = ~4.2e15 bit-meters/second = ~4.2 peta bmps.
Multi-stream is 952 Mbps, ~5.2 peta bmps.
(4470-byte MTUs were enabled on all routers; 20 MB window size.)
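The bit-meters-per-second figure of merit is just distance times rate; a sketch for the single-stream number:

```python
# Sketch: the bit-meters-per-second arithmetic from the slide.
distance_m = 5626e3            # 5626 km in meters
rate_bps = 751e6               # single-stream 751 Mbps in bits/s
print(f"{distance_m * rate_bps / 1e15:.1f} peta bmps")   # 4.2
```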

The Promise of SAN/VIA: 10x in 2 Years
Yesterday:
– 10 MBps (100 Mbps Ethernet)
– ~20 MBps TCP/IP saturates 2 CPUs
– round-trip latency ~250 µs
Now:
– Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, …
– Fast user-level communication: TCP/IP ~100 MBps at 10% CPU
– round-trip latency is 15 µs
– 1.6 Gbps demoed on a WAN

Pointers
– The single-stream submission: Windows2000_I2_land_Speed_Contest_Entry_(Single_Stream_mail).htm
– The multi-stream submission: Windows2000_I2_land_Speed_Contest_Entry_(Multi_Stream_mail).htm
– The code: speedy.h, speedy.c
– A PowerPoint presentation about it: Windows2000_WAN_Speed_Record.ppt

Networking
WANs are getting faster than LANs: G8 = OC192 = 8 Gbps is standard, and link bandwidth improves 4x per 3 years. The speed of light is fixed (60 ms round trip in the US). Software stacks have always been the problem:
Time = SenderCPU + ReceiverCPU + Bytes/Bandwidth
The CPU terms have been the problem.
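The slide's cost model can be written down directly; a sketch, where the CPU overheads are made-up illustrative values, not from the deck:

```python
# A sketch of the slide's cost model (CPU times here are made-up illustrations).
def transfer_time(nbytes, bandwidth_bps, sender_cpu_s, receiver_cpu_s):
    """Time = SenderCPU + ReceiverCPU + Bytes/Bandwidth."""
    return sender_cpu_s + receiver_cpu_s + nbytes * 8 / bandwidth_bps

# 1 GB over an 8 Gbps link with 0.5 s of protocol work on each end:
print(transfer_time(1e9, 8e9, 0.5, 0.5))   # 2.0 s -- half the time is software
```

The point of the model: as bandwidth grows, the wire term shrinks but the two CPU terms do not, so software overhead dominates.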

Rules of Thumb in Data Engineering
– Moore's law: an address bit every 18 months.
– Storage grows 100x/decade (except 1000x last decade!).
– Disk data of 10 years ago now fits in RAM (iso-price).
– Device bandwidth grows 10x/decade, so we need parallelism.
– RAM:disk:tape price ratio is 1:10:30, going to 1:10:10.
– Amdahl's speedup law: S/(S+P).
– Amdahl's IO law: a bit of IO per instruction/second (a TBps for 10 teraops! 50,000 disks per 10 teraops: 100 M$).
– Amdahl's memory law: a byte per instruction/second (going to 10) (1 TB RAM per teraop: 1 TeraDollars). PetaOps anyone?
– Gilder's law: aggregate bandwidth doubles every 8 months.
– 5-minute rule: cache disk data that is reused within 5 minutes.
– Web rule: cache everything!
MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc
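The 5-minute rule falls out of a break-even calculation between RAM cost and disk-arm cost; a sketch, where the numbers are illustrative late-1990s values (8 KB pages assumed), not figures from this deck:

```python
# Sketch: the break-even interval behind the 5-minute rule. The numbers are
# illustrative late-1990s values (pages_per_mb assumes 8 KB pages).
pages_per_mb_ram = 128
price_per_disk = 2000            # $ per disk drive
accesses_per_sec_per_disk = 64
price_per_mb_dram = 15           # $

break_even_s = (pages_per_mb_ram * price_per_disk) / (
    accesses_per_sec_per_disk * price_per_mb_dram)
print(f"cache pages re-referenced within ~{break_even_s / 60:.0f} minutes")
```

Pages re-referenced more often than the break-even interval are cheaper to keep in RAM than to re-fetch from disk.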

Dealing With Terabytes (Petabytes) Requires Parallelism
Parallelism: use many little devices in parallel. At 10 MB/s it takes 1.2 days to scan a terabyte; 1,000-way parallel does the scan in 100 seconds. Use 100 processors and 1,000 disks.
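The scan arithmetic above, sketched directly:

```python
# Sketch: the scan-time arithmetic from the slide.
terabyte = 1e12
per_disk = 10e6                      # 10 MB/s per device
print(terabyte / per_disk / 86400)   # ~1.16 days on one disk ("1.2 days")
print(terabyte / (1000 * per_disk))  # 100.0 seconds with 1,000 disks in parallel
```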

Parallelism Must Be Automatic
There are thousands of MPI programmers; there are hundreds of millions of people using parallel database search. Parallel programming is HARD! Find design patterns and automate them. Data search/mining has parallel design patterns.

Scalability: Up and Out
Scale OUT: clones & partitions
– Use commodity servers
– Add clones & partitions as needed
Scale UP:
– Use big iron (SMP)
– Cluster into packs for availability

Everyone Scales Out. What's the Brick?
– 1 M$/slice: IBM S390? Sun E10000?
– 100 K$/slice: HPUX / AIX / Solaris / IRIX / EMC
– 10 K$/slice: Utel / Wintel 4x
– 1 K$/slice: Beowulf / Wintel 1x

Terminology for Scaleability
Farms of servers:
– Clones: identical; scaleability + availability
– Partitions: scaleability
– Packs: partition availability via fail-over
GeoPlex for disaster tolerance.
(Slide diagram: Farm → Clone {shared-nothing, shared-disk}; Partition → Pack {shared-nothing, active-active, active-passive}.)

(Slide diagram: shared-nothing clones vs shared-disk clones; partitions vs packed partitions. Taxonomy: Farm → Clone {shared-nothing, shared-disk}; Partition → Pack {shared-nothing, active-active, active-passive}.)

Unpredictable Growth
The TerraServer story:
– We expected 5 M hits per day
– We got 50 M hits on day 1
– We peak at M hpd on a hot day
– Average 5 M hpd after 1 year
Most of us cannot predict demand:
– Must be able to deal with NO demand
– Must be able to deal with HUGE demand

An Architecture for Internet Services? Need to be able to add capacity –New processing –New storage –New networking Need continuous service –Online change of all components (hardware and software) –Multiple service sites –Multiple network providers Need great development tools –Change the application several times per year. –Add new services several times per year.

Premise: Each Site is a Farm Buy computing by the slice (brick): –Rack of servers + disks. Grow by adding slices –Spread data and computation to new slices Two styles: –Clones: anonymous servers –Parts+Packs: Partitions fail over within a pack In both cases, remote farm for disaster recovery

Clones: Availability + Scalability
Some applications are:
– Read-mostly
– Low consistency requirements
– Modest storage requirement (less than 1 TB)
Examples:
– HTML web servers (IP sprayer/sieve + replication)
– LDAP servers (replication via gossip)
Replicate the app at all nodes (clones); spray requests across nodes; grow by adding clones. Fault tolerance: stop sending to a failed clone. Growth: add a clone.
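The spray/fail/grow cycle can be sketched as a toy round-robin router; the `Sprayer` class and clone names below are invented for illustration, not from the deck:

```python
# Toy request sprayer for anonymous clones (illustrative; names are invented).
class Sprayer:
    def __init__(self, clones):
        self.clones = list(clones)
        self._next = 0

    def route(self, request):
        """Round-robin the request to the next live clone."""
        clone = self.clones[self._next % len(self.clones)]
        self._next += 1
        return clone

    def fail(self, clone):
        """Fault tolerance: stop sending to a failed clone."""
        self.clones.remove(clone)

    def grow(self, clone):
        """Growth: add a clone."""
        self.clones.append(clone)

s = Sprayer(["web1", "web2", "web3"])
print([s.route(f"req{i}") for i in range(4)])  # ['web1', 'web2', 'web3', 'web1']
```

Because clones are identical and stateless, any clone can serve any request, so the router needs no affinity logic.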

Two Clone Geometries
– Shared-nothing: exact replicas
– Shared-disk (state stored in the server)

Facilities Clones Need Automatic replication –Applications (and system software) –Data Automatic request routing –Spray or sieve Management: –Who is up? –Update management & propagation –Application monitoring. Clones are very easy to manage: –Rule of thumb: 100s of clones per admin

Partitions for Scalability
Clones are not appropriate for some apps:
– Stateful apps do not replicate well
– High update rates do not replicate well
Examples:
– Mail / chat / …
– Databases
Partition state among servers. Scalability (online): partition split/merge; partitioning must be transparent to the client.
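State partitioning can be sketched with a stable hash over a routing key. This is an illustrative scheme, not the deck's: real systems prefer range or directory partitioning precisely so partitions can split and merge without rehashing everything.

```python
# Illustrative hash partitioning of stateful data (e.g. mailboxes) across packs.
import zlib

def partition_for(key: str, n_partitions: int) -> int:
    """Map a routing key to a partition index. The client never calls this
    directly -- a routing layer in front keeps partitioning transparent."""
    return zlib.crc32(key.encode()) % n_partitions

packs = ["pack0", "pack1", "pack2", "pack3"]
print(packs[partition_for("mailbox:alice", len(packs))])
```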

Partitioned/Clustered Apps
– Mail servers: perfectly partitionable
– Business object servers: partition by set of objects
– Parallel databases: transparent access to partitioned tables; parallel query

Packs for Availability
Each partition may fail (independent of the others); partitions migrate to a new node via fail-over (fail-over in seconds). A pack is the set of nodes supporting a partition:
– VMS Cluster
– Tandem Process Pair
– SP2 HACMP
– Sysplex
– WinNT MSCS (Wolfpack)
Cluster-in-a-box is now commodity. Partitions typically grow in packs.

What Parts+Packs Need Automatic partitioning (in dbms, mail, files,…) –Location transparent –Partition split/merge –Grow without limits (100x10TB) Simple failover model –Partition migration is transparent –MSCS-like model for services Application-centric request routing Management: –Who is up? –Automatic partition management (split/merge) –Application monitoring.

Partitions and Packs: packs for availability. (Slide diagram: partitions vs packed partitions.)

GeoPlex: Farm Pairs
Two farms; changes from one are sent to the other. When one farm fails, the other provides service. Masks:
– Hardware/software faults
– Operations tasks (reorganize, upgrade, move)
– Environmental faults (power failure)

Services on Clones & Partitions Application provides a set of services If cloned: –Services are on subset of clones If partitioned: –Services run at each partition System load balancing routes request to –Any clone –Correct partition. –Routes around failures.

Cluster Scenarios: 3-Tier Systems
A simple web site: web clients → load balance → front end (clones for availability) → SQL temp state, web file store, and SQL database (packs for availability).

Cluster Scale-Out Scenarios: the FARM, Clones and Packs of Partitions
Web clients → load balance → cloned front ends (firewall, sprayer, web server) → cloned/packed file servers (web file store A and B, with replication), SQL temp state, and the SQL database as packed partitions (SQL partitions 1, 2, 3) for database transparency.

Terminology for Scaleability (recap)
Farms of servers:
– Clones: identical; scaleability + availability
– Partitions: scaleability
– Packs: partition availability via fail-over
GeoPlex for disaster tolerance.
(Slide diagram: Farm → Clone {shared-nothing, shared-disk}; Partition → Pack {shared-nothing, active-active, active-passive}.)

What We Have Been Doing with SDSS
Helping move the data to SQL:
– Database design
– Data loading
Experimenting with queries on a 4 M object DB:
– 20 questions like "find gravitational lens candidates"
– Queries use parallelism; most run in a few seconds (auto-parallel)
– Some run in hours (neighbors within 1 arcsec)
– EASY to ask questions
Helping with an outreach website: SkyServer.
Personal goal: try data-mining techniques to re-discover astronomy.

References (.doc or .pdf)
– Technology forecast: MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc
– Gbps experiments
– Disk experiments (10 K$/TB)
– Scaleability terminology