1 Dependability in the Internet Era. 2 Outline The glorious past (Availability Progress) The dark ages (current scene) Some recommendations.

J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)
1 Dependability in the Internet Era Jim Gray Microsoft Research High Dependability Computing Consortium Conference Santa Cruz, CA 7 May 2001 REVISED: 13.
Presentation transcript:

1 Dependability in the Internet Era

2 Outline The glorious past (Availability Progress) The dark ages (current scene) Some recommendations

3 Preview
The Last 5 Years: Availability Dark Ages. Ready for a Renaissance?
Things got better, then things got a lot worse!
[Chart: availability axis (9%, 99%, 99.9%, 99.99% and beyond); series: Computer Systems, Telephone Systems, Cell phones, Internet]

4 DEPENDABILITY: The 3 ITIES
RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1)
AVAILABILITY: Does it now. (also 1 >> MTTR / (MTTF + MTTR))
System Availability: if 90% of terminals are up and 99% of the DB is up, then 0.90 × 0.99 ≈ 89% of transactions are serviced on time.
Holistic vs. Reductionist view
[Diagram labels: Security, Integrity, Reliability, Availability]
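The availability arithmetic on this slide can be checked in a few lines of Python. This is a minimal sketch: the function names and the 1,000-hour / 1-hour module are illustrative assumptions, and the series rule assumes the parts fail independently.

```python
def availability(mttf: float, mttr: float) -> float:
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def series_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up.

    Assumes the parts fail independently, so availabilities multiply.
    """
    result = 1.0
    for a in parts:
        result *= a
    return result


# Slide example: 90% of terminals up and 99% of the database up.
system = series_availability(0.90, 0.99)
print(f"{system:.1%}")  # 89.1%

# Hypothetical module: fails every 1,000 hours, takes 1 hour to repair.
print(f"{availability(1000.0, 1.0):.4%}")
```

The product rule is why the end-to-end number is always worse than the weakest component.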

5 Fail-Fast is Good, Repair is Needed
Improving either MTTR or MTTF gives benefit.
Simple redundancy does not help much.
Lifecycle of a module: fail-fast gives short fault latency.
High Availability is low UN-Availability:
Unavailability ~ MTTR / MTTF
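A small sketch of why improving either MTTR or MTTF helps: with unavailability approximately MTTR / MTTF, halving the repair time buys the same improvement as doubling the time between failures. The baseline of 10,000 hours MTTF and 2 hours MTTR is a made-up value for illustration only.

```python
def unavailability(mttf: float, mttr: float) -> float:
    """Exact fraction of time down: MTTR / (MTTF + MTTR)."""
    return mttr / (mttf + mttr)


def unavailability_approx(mttf: float, mttr: float) -> float:
    """Approximation MTTR / MTTF, reasonable when MTTF >> MTTR."""
    return mttr / mttf


# Hypothetical baseline, in hours.
base = unavailability(10_000.0, 2.0)
faster_repair = unavailability(10_000.0, 1.0)   # halve MTTR
rarer_failure = unavailability(20_000.0, 2.0)   # double MTTF

print(base, faster_repair, rarer_failure)
print(unavailability_approx(10_000.0, 2.0))
```

Both improvements roughly halve the unavailability, and the approximation differs from the exact value by a tiny fraction when MTTF dominates.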

6 Fault Model
Failures are independent, so single fault tolerance is a big win.
Hardware fails fast (dead disk, blue-screen).
Software fails fast (or goes to sleep).
Software often repaired by reboot:
  – Heisenbugs
Operations tasks: major source of outage
  – Utility operations
  – Software upgrades

7 Disks (RAID): the BIG Success Story
Duplex or Parity: masks faults
Disk MTTF: 1M hours (~100 years)
But
  – controllers fail and
  – have 1,000s of disks.
Duplexing or parity, and dual path, gives "perfect disks".
Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
Only software/operations mistakes are left.
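The numbers on this slide explain why a site sees regular disk failures even though each disk is rated at about a century. The sketch below assumes independent failures (as the fault model slide does). The `mirrored_pair_mttf` formula, MTTF² / (2 × MTTR), is the standard textbook approximation for a duplexed pair, not something stated on the slide, and the 24-hour repair time is an assumed value.

```python
HOURS_PER_YEAR = 8760


def fleet_mttf(disk_mttf_hours: float, n_disks: int) -> float:
    """Mean time between failures somewhere in a fleet of independent disks."""
    return disk_mttf_hours / n_disks


def mirrored_pair_mttf(disk_mttf_hours: float, repair_hours: float) -> float:
    """Approximate mean time until both halves of a mirror are down at once.

    Standard approximation MTTF^2 / (2 * MTTR); assumes independent failures
    and that the failed half is repaired in `repair_hours`.
    """
    return disk_mttf_hours ** 2 / (2 * repair_hours)


disk_mttf = 1_000_000.0  # hours, from the slide

print(disk_mttf / HOURS_PER_YEAR)              # one disk, in years
print(fleet_mttf(disk_mttf, 1000))             # hours between failures with 1,000 disks
print(mirrored_pair_mttf(disk_mttf, 24.0) / HOURS_PER_YEAR)  # mirrored pair, in years
```

With 1,000 disks a failure is expected roughly every 1,000 hours, which matches "hundreds of failures" at a large site, while duplexing pushes the pair's time to data loss far beyond the life of the system.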

8 Fault Tolerance vs Disaster Tolerance
Fault-Tolerance: masks local faults
  – RAID disks
  – Uninterruptible Power Supplies
  – Cluster Failover
Disaster Tolerance: masks site failures
  – Protects against fire, flood, sabotage, ...
  – Redundant system and service at remote site.

9 Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March (trans: Eiichi Watanabe).
1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES.

Outage cause                      MTTF
Vendor (hardware and software)    5 months
Application software              9 months
Communications lines              1.5 years
Operations                        2 years
Environment                       2 years
All causes                        10 weeks

[Pie chart of outage shares: 42%, 12%, 25%, 9.3%, 11.2%; labels: Vendor, Environment, Operations, Application Software, Tele Comm lines]
To Get 10 Year MTTF, Must Attack All These Areas.
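The overall "~10 weeks" figure follows from the per-cause MTTFs if the causes are treated as independent, because failure rates (1 / MTTF) add. A minimal sketch, with the per-cause values taken from the slide and the helper name `combined_mttf` chosen here for illustration:

```python
# Per-cause MTTF from the survey, in months.
mttf_months = {
    "vendor (hardware and software)": 5.0,
    "application software": 9.0,
    "communications lines": 18.0,
    "operations": 24.0,
    "environment": 24.0,
}


def combined_mttf(mttfs):
    """Independent failure causes add as rates: 1/MTTF = sum(1/MTTF_i)."""
    return 1.0 / sum(1.0 / m for m in mttfs)


total_months = combined_mttf(mttf_months.values())
total_weeks = total_months * 52.0 / 12.0
print(round(total_months, 2), round(total_weeks, 1))  # 2.22 9.6

# Availability implied by a ~10-week MTTF and ~90-minute outages.
mttf_hours = 10 * 7 * 24
mttr_hours = 1.5
avail = mttf_hours / (mttf_hours + mttr_hours)
print(f"{avail:.3%}")
```

Because the rates add, removing any single cause only moves the total a little, which is the point of "must attack all these areas".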

10 Case Studies - Tandem Trends
MTTF improved.
Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%).
NOTE: Systematic under-reporting of Environment, Operations errors, and Application Software.

11 Dependability Status circa 1995
~4-year MTTF => 5 9s for well-managed sys.
Fault Tolerance Works.
Hardware is GREAT (maintenance and MTTF).
Software masks most hardware faults.
Many hidden software outages in operations:
  – New Software.
  – Utilities.
Make all hardware/software changes ONLINE.
Software seems to define a 30-year MTTF ceiling.
Reasonable Goal: 100-year MTTF.
class 4 today => class 6 tomorrow.
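To make "4-year MTTF => 5 9s" concrete, the sketch below works out the repair-time budget that claim implies. It assumes that "class N" means N nines of availability and uses the approximation unavailability ~ MTTR / MTTF. The function names are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def nines(mttf_minutes: float, mttr_minutes: float) -> float:
    """Number of nines of availability: -log10(unavailability)."""
    unavail = mttr_minutes / (mttf_minutes + mttr_minutes)
    return -math.log10(unavail)


def mttr_for_class(mttf_minutes: float, target_nines: int) -> float:
    """Repair-time budget for a target class, using unavailability ~ MTTR / MTTF."""
    return mttf_minutes * 10.0 ** -target_nines


four_years = 4 * MINUTES_PER_YEAR
budget = mttr_for_class(four_years, 5)
print(round(budget, 1))                     # 21.0 (minutes per outage)
print(round(nines(four_years, budget), 2))  # 5.0
```

Under these assumptions a system that fails once every four years must be back within about 21 minutes to stay at five nines. Moving from class 4 to class 6 means cutting unavailability by a factor of 100.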

12 What's Happened Since Then?
Hardware got better.
Software got better (even though it is more complex).
RAID is standard, snapshots coming standard.
Cluster in a box: commodity failover.
Remote replication is standard.

13 Availability
[Chart: availability increasing by management level]
Un-managed
Well-managed nodes
  – Masks some hardware failures
Well-managed packs & clones
  – Masks hardware failures, operations tasks (e.g. software upgrades)
  – Masks some software failures
Well-managed GeoPlex
  – Masks site failures (power, network, fire, move, …)
  – Masks some operations failures

14 Outline The glorious past (Availability Progress) The dark ages (current scene) Some recommendations

15 Progress?
MTTF improved from
MTTR has not improved much since 1970 failover
Hardware and Software online change (pNp) is now standard
Then the Internet arrived:
  – No project can take more than 3 months.
  – Time to market is everything
  – Change is good.

16 The Internet Changed Expectations
1990
  – Phones delivered %
  – ATMs delivered 99.99%
  – Failures were front-page news.
  – Few hackers
  – Outages last an hour
2000
  – Cellphones deliver 90%
  – Web sites deliver 98%
  – Failures are business-page news
  – Many hackers.
  – Outages last a day
This is progress?
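Converting these percentages into hours of downtime per year shows how large the gap is. This is a minimal sketch that assumes a 365-day year. The labels are taken from the slide and the helper name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


figures = {
    "ATMs (1990)": 0.9999,
    "Web sites (2000)": 0.98,
    "Cellphones (2000)": 0.90,
}

for name, a in figures.items():
    print(f"{name}: {downtime_hours_per_year(a):.1f} hours/year")
```

On these numbers, 99.99% is under an hour a year, 98% is about a week, and 90% is more than a month.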

17 Why (1) Complexity
Internet sites are MUCH more complex.
  – NAP
  – Firewall/proxy/ipsprayer
  – Web
  – DMZ
  – App server
  – DB server
  – Links to other sites
  – tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os…
Skill level is much reduced

18 One of the Data Centers (500 servers)

19 A Schematic of HotMail
~7,000 servers
100 backend stores with 120TB (cooked)
3 data centers
Links to
  – Passport
  – Ad-rotator
  – Internet Mail gateways
  – …
~1B messages per day
150M mailboxes, 100M active
~400,000 new per day.

20 Why (2) Velocity
No project can take more than 13 weeks.
Time to market is everything
Functionality is everything
Faster, cheaper, badder
[Diagram labels: Schedule, Quality, Functionality, trend]

21 Why (3) Hackers
Hackers are a new, increased threat.
Any site can be attacked from anywhere.
Motives include ego, malice, and greed.
Complexity makes it hard to protect sites.
Concentration of wealth makes an attractive target:
"Why did you rob banks?" Willie Sutton: "'Cause that's where the money is!"
Note: Eric Raymond's "How to Become a Hacker" is the positive use of the term; here I mean malicious and anti-social hackers.

22 How Bad Is It? Connectivity is poor.

23 How Bad Is It? Median monthly % ping packet loss for 2/99

24 Microsoft.Com
Operations mis-configured a router
Took a day to diagnose and repair.
DOS attacks cost a fraction of a day.
Regular security patches.

25 BackEnd Servers are More Stable
Generally deliver 99.99%
TerraServer for example:
  – Single back-end failed after 2.5 y.
  – Went to 4-node cluster
  – Fails every 2 mo. Transparent failover in 30 sec.
  – Online software upgrades
  – So… % in backend…
Year 1 Through 18 Months:
  – Down 30 hours in July (hardware stop, auto restart failed, operations failure)
  – Down 26 hours in September (Backplane failure, I/O Bus failure)
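The TerraServer figures can be turned into availability estimates. The sketch below makes two assumptions the slide does not state. First, the two listed outages are the only downtime in the 18-month window. Second, the 30-second failover is the only downtime each time the cluster fails. The results are back-of-the-envelope arithmetic, not numbers reported on the slide.

```python
HOURS_PER_MONTH = 365 * 24 / 12  # 730


def availability_from_outages(outage_hours, period_hours):
    """Availability over a period, given the durations of its outages."""
    return 1.0 - sum(outage_hours) / period_hours


# First 18 months: 30 hours down in July, 26 hours down in September.
single = availability_from_outages([30.0, 26.0], 18 * HOURS_PER_MONTH)
print(f"{single:.2%}")

# 4-node cluster: a failure every 2 months, masked by a 30-second failover.
failover_seconds = 30.0
interval_seconds = 2 * HOURS_PER_MONTH * 3600
cluster = 1.0 - failover_seconds / interval_seconds
print(f"{cluster:.5%}")
```

Under these assumptions the early outages put the first 18 months in the two-nines range, while a cluster that loses only 30 seconds every two months lands at about five nines.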

26 eBay: A very honest site
Publishes operations log.
Has 99% of scheduled uptime.
Schedules about 2 hours/week down.
Has had some operations outages.
Has had some DOS problems.
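Scheduled downtime and uptime within the schedule combine multiplicatively, so the wall-clock availability is lower than the 99% headline. A minimal sketch using the slide's two figures, with an illustrative function name:

```python
HOURS_PER_WEEK = 168.0


def overall_availability(scheduled_down_hours_per_week: float,
                         uptime_within_schedule: float) -> float:
    """Wall-clock availability given planned downtime and uptime during scheduled hours."""
    scheduled_up = HOURS_PER_WEEK - scheduled_down_hours_per_week
    return uptime_within_schedule * scheduled_up / HOURS_PER_WEEK


a = overall_availability(2.0, 0.99)
print(f"{a:.2%}")  # 97.82%
```

Counting the planned two hours a week, the site is reachable a little under 98% of the time.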

27 Outline The glorious past (Availability Progress) The dark ages (current scene) Some recommendations

28 Not to throw stones but…
Everyone has a serious problem.
The BEST people publish their stats.
The others HIDE their stats (check Netcraft to see who I mean).
We have good NODE-level availability: 5-9s is reasonable.
We have TERRIBLE system-level availability: 2-9s is the goal.

29 Recommendation #1
Continue progress on back-ends.
  – Make management easier (AUTOMATE IT!!!)
  – Measure
  – Compare best practices
  – Continue to look for better algorithms.
Live in fear
  – We are at 10,000 node servers
  – We are headed for 1,000,000 node servers

30 Recommendation #2
Current security approach is unworkable:
  – Anonymous clients
  – Firewall is clueless
  – Incredible complexity
We can't win this game!
So change the rules (redefine the problem):
  – No anonymity
  – Unified authentication/authorization model
  – Single-function devices (with simple interfaces)
  – Only one kind of interface (uddi/wsdl/soap/…).

31 References
Adams, E. (1984). Optimizing Preventative Service of Software Products. IBM Journal of Research and Development. 28(1):
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems
Gray, J. (1990). A Census of Tandem System Availability between 1985 and IEEE Transactions on Reliability. 39(4):
Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS
Long, D.D., J. L. Carroll, and C.J. Park (1991). A study of the reliability of Internet sites. Proc 10th Symposium on Reliable Distributed Systems, pp , Pisa, September
Darrell Long, Andrew Muir and Richard Golding, "A Longitudinal Study of Internet Host Reliability," Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany: IEEE, September 1995, p. 2-9
They have even better for-fee data as well, but for-free is really excellent.
eBay is an excellent benchmark of best Internet practices.
Network traffic/quality report, dated, but the others have died off!