Microsoft Office 365 Service Reliability and Disaster Recovery. Kumar Venkateswar, Sr. Program Manager, Microsoft Corporation. OSP324.

2 Microsoft Office 365 Service Reliability and Disaster Recovery. Kumar Venkateswar, Sr. Program Manager, Microsoft Corporation. OSP324

4 Agenda: What is the uptime in Office 365? Why is it good? What does Microsoft do to make sure it is? How do these numbers translate into my organization? What happens when I have an outage? How does our approach differ from our competitors? What’s next? How does Microsoft make sure it keeps getting better?

8 Hotmail 1997; Windows Update 1995; Bing / MSN search 1998; Xbox Live 2002; Exchange Hosted Services (now part of Office 365)

9 Backed by the most responsive support available and a comprehensive 99.9% financially backed SLA
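
As a rough illustration of what a 99.9% commitment allows, the sketch below converts an uptime percentage into a monthly downtime budget. The function name `allowed_downtime_minutes` and the 30-day month are assumptions made for this example; they are not taken from the SLA text.

```python
def allowed_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Return the minutes of downtime a month can contain at the given uptime level."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # 99.9% is the SLA figure quoted on the slide; the 30-day month is an assumption.
    budget = allowed_downtime_minutes(99.9)
    print(f"{budget:.1f} minutes of downtime per 30-day month")
```

This treats the service as one unit. The later slides explain that the SLA weights downtime by the users affected, so the per-user picture can differ.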

10 Resilience: Reconfiguration, Absorption, Restoration, Anticipation

11 As described by Madni and Jackson: Functional redundancy, Physical redundancy, Reorganization, Human backup, “Human-in-loop”, Predictability, Complexity avoidance, Context spanning, Graceful degradation, Drift correction, “Neutral” state, Inspectability, Intent awareness, Learning/Adaptation

12 Functional Redundancy: Online and offline functionality to provide continued functionality even in the face of failures.
Physical Redundancy: Office 365 provides physical redundancy at multiple levels to protect against hardware failures: data in transit and at rest; network and hardware redundancy; facilities and power redundancy; at least 2 datacenters per region.

13 Reorganization: Active load balancing to restructure the system against rare extreme load conditions. Response to hardware failures.

14 Human Backup and “Human-in-Loop”: The monitoring system attempts automated recovery actions and alerts the 24x7 on-call engineer when recovery does not succeed. On-call engineers are core product group members (dev, test, and PM) in the area relevant to the alert, which enables rapid response and collection of relevant information.
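
The recover-then-escalate pattern described on this slide can be sketched as follows. This is a minimal illustration, not Microsoft's monitoring system: `handle_alert`, the recovery callables, and the paging callback are all hypothetical names introduced here.

```python
from typing import Callable, Iterable


def handle_alert(
    recovery_actions: Iterable[Callable[[], bool]],
    page_on_call: Callable[[str], None],
    alert_name: str,
) -> bool:
    """Try each automated recovery action in order; page a human if none succeeds.

    Returns True when automation resolved the alert, False when it was escalated.
    """
    for action in recovery_actions:
        try:
            if action():
                return True  # automated recovery worked, no human needed
        except Exception:
            # A recovery action that raises is treated like one that did not succeed.
            continue
    page_on_call(alert_name)
    return False


if __name__ == "__main__":
    pages = []
    resolved = handle_alert(
        recovery_actions=[lambda: False, lambda: False],  # hypothetical actions that fail
        page_on_call=pages.append,
        alert_name="example-alert",  # hypothetical alert name
    )
    print(resolved, pages)
```

Keeping the human as the fallback rather than the first responder matches the slide's ordering: automation runs first, and the on-call engineer is alerted only when it does not succeed.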

15 Inspectability and Predictability: Detailed logging and tracing to avoid unnecessary assumptions by on-call engineers. Deviations from normal behavior deliver alerts to on-call engineers, enabling relevant information collection and rapid resolution.

16 Complexity Avoidance: Standardized hardware and automated deployment model.
Context Spanning: Recovery across “failure domains” tested regularly, including regional disasters. Service component isolation to avoid failure cascades.
Graceful Degradation and Drift Correction: Built-in workload management mechanisms to avoid catastrophic impact and degrade gracefully.

18 The Office 365 service level agreement expresses uptime in this way: (formula image not captured in this transcript). Why? Longer outages have a greater impact on the percentage. Outages that affect a greater number of users have a greater impact. More severe outages, in terms of users or duration, lead to greater deviations from 100%, which is desirable because the deviation determines the service credits offered as a remedy.
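
The formula on this slide did not survive in the transcript. The sketch below assumes the user-minutes form that Microsoft's published Online Services SLA has used: (user minutes - downtime) / user minutes x 100, with downtime counted as incident duration times users affected. The names `Incident` and `monthly_uptime_percent`, and all figures, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Incident:
    duration_minutes: float
    users_affected: int


def monthly_uptime_percent(
    total_users: int, minutes_in_month: int, incidents: Sequence[Incident]
) -> float:
    """Uptime percentage weighted by user-minutes (assumed form of the SLA formula)."""
    user_minutes = total_users * minutes_in_month
    downtime = sum(i.duration_minutes * i.users_affected for i in incidents)
    return (user_minutes - downtime) / user_minutes * 100


if __name__ == "__main__":
    month = 30 * 24 * 60  # assumed 30-day month
    # Same one-hour outage, hitting 10% of users versus all users.
    partial = monthly_uptime_percent(1000, month, [Incident(60, 100)])
    full = monthly_uptime_percent(1000, month, [Incident(60, 1000)])
    print(f"partial: {partial:.4f}%  full: {full:.4f}%")
```

Under this form, both bullet points on the slide hold: doubling either the duration or the number of affected users doubles the downtime term and moves the percentage further from 100%.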

19 The aggregate uptime of service components can also be expressed similarly: (formula image not captured in this transcript). The objective is to describe the risk of outage to an individual customer based on the aggregate uptime of the service. CAUTION: This does not capture the full risk picture!

20 What are the caveats with this aggregate uptime number, particularly when it is used to compare different services? Downtime that is not dependent on the number of users is still adjusted by the number of users. The aggregate uptime is heavily dependent on the definition of downtime. Different cloud services provide different functionality, making uptime hard to compare. Productivity loss due to downtime differs by service.
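
One reading of the first caveat is that an outage confined to a small organization barely moves the service-wide number, even though that organization was fully down. The sketch below illustrates this under assumptions: the user-minutes weighting, the tenant names, and all figures are hypothetical.

```python
from typing import Dict, Tuple

MINUTES_IN_MONTH = 30 * 24 * 60  # assumed 30-day month


def uptime_percent(user_minutes: float, downtime_user_minutes: float) -> float:
    return (user_minutes - downtime_user_minutes) / user_minutes * 100


def per_tenant_and_aggregate(
    tenants: Dict[str, Tuple[int, float]], minutes_in_month: int = MINUTES_IN_MONTH
) -> Tuple[Dict[str, float], float]:
    """tenants maps name -> (users, minutes of outage that hit every user in the tenant)."""
    per_tenant = {}
    total_user_minutes = 0.0
    total_downtime = 0.0
    for name, (users, outage_minutes) in tenants.items():
        tenant_user_minutes = users * minutes_in_month
        tenant_downtime = users * outage_minutes
        per_tenant[name] = uptime_percent(tenant_user_minutes, tenant_downtime)
        total_user_minutes += tenant_user_minutes
        total_downtime += tenant_downtime
    return per_tenant, uptime_percent(total_user_minutes, total_downtime)


if __name__ == "__main__":
    # Hypothetical: a 50-user tenant is down for 4 hours; a 9,950-user tenant is unaffected.
    per_tenant, aggregate = per_tenant_and_aggregate(
        {"small-org": (50, 240), "large-org": (9950, 0)}
    )
    print(per_tenant, aggregate)
```

In this made-up case the small tenant falls below 99.9% for the month while the aggregate stays above it, which is why the aggregate alone does not describe an individual customer's risk.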

22 Office 365 Service Communication Experiences
Planned Maintenance: Notification of planned service maintenance, including transitions/upgrades, repair, and update scenarios.
Service Alteration: Notification about changes to service features, capabilities, or business terms of service.
Service Incident: Notification regarding major service-interrupting incidents.
Account Lifecycle: Notification of milestones in the subscription lifecycle.
Experiences when Customers’ Access to Services Is Impacted.

23 Backed by the most responsive support available and a comprehensive 99.9% financially backed SLA

24 Service Incident Communication Flow: Incident Identification, Ongoing Communication, Post Incident Wrap Up

25 Facebook, Twitter, Service Health Dashboard, Community, Post Incident Review, RSS Feed, Additional Actions

27 Taxonomy for Service Incident Status (Status: Description; SHD icons omitted)
Investigating: Monitors have indicated a service anomaly, and/or we have received reports of a potential service incident, and we are currently investigating the reports.
Service Interruption: We have confirmed that the normal services are being impacted. We are taking immediate action to understand the cause of the failure and determine the best course of action to restore service(s).
Degraded Service: The services are currently experiencing degraded performance due to a service incident. Services are still active, but service responsiveness and/or delivery times may be slower than usual. We are currently working to restore normal service responsiveness.
Restoring Service: We have isolated the likely cause of the incident and are in the process of restoring normal services.
Extended Recovery: System services are restored. Due to an existing backlog of items, the services may be slower than usual while the backlog clears.
Service Restored: Normal system services have been restored.
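
For anyone tracking these statuses in their own tooling, the taxonomy maps naturally onto an enumeration. The sketch below is an assumption-laden illustration: the labels come from the slide, but `IncidentStatus`, `parse_status`, and `is_resolved` are hypothetical names, and no real Service Health Dashboard API is implied.

```python
from enum import Enum


class IncidentStatus(Enum):
    INVESTIGATING = "Investigating"
    SERVICE_INTERRUPTION = "Service Interruption"
    DEGRADED_SERVICE = "Degraded Service"
    RESTORING_SERVICE = "Restoring Service"
    EXTENDED_RECOVERY = "Extended Recovery"
    SERVICE_RESTORED = "Service Restored"


def parse_status(label: str) -> IncidentStatus:
    """Map a status label to the enum, ignoring case and extra whitespace."""
    normalized = " ".join(label.split()).casefold()
    for status in IncidentStatus:
        if status.value.casefold() == normalized:
            return status
    raise ValueError(f"unknown incident status: {label!r}")


def is_resolved(status: IncidentStatus) -> bool:
    """Treat only 'Service Restored' as closed; 'Extended Recovery' may still be slow."""
    return status is IncidentStatus.SERVICE_RESTORED


if __name__ == "__main__":
    status = parse_status("  degraded   service ")
    print(status.name, is_resolved(status))
```

Treating "Extended Recovery" as unresolved follows the slide's description that services may still be slower than usual while the backlog clears.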

28 O365 Service Incident Notification Process: Incident occurs; Service Health Dashboard (SHD) updated to “investigating”; incident status posted to SHD; SHD updated until service restoration; closure summary posted to SHD; Post Incident Review posted to SHD within 5 business days.
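
The "within 5 business days" commitment can be turned into a concrete due date. The sketch below assumes the clock starts on the day the incident is closed and that business days are Monday to Friday with no holiday calendar; neither assumption is stated on the slide, and `add_business_days` is a hypothetical helper.

```python
from datetime import date, timedelta


def add_business_days(start: date, business_days: int) -> date:
    """Return the date that falls the given number of weekdays after start."""
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4; weekends are skipped
            remaining -= 1
    return current


if __name__ == "__main__":
    closed_on = date(2024, 1, 5)  # hypothetical closure date (a Friday)
    print("Post Incident Review due by", add_business_days(closed_on, 5))
```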

29 Office 365 Planned Maintenance Communication
Type: Description (Channel)
Planned Maintenance Update: 5 day prior notification of planned service maintenance that falls within approved maintenance timeframes. (Service Health Dashboard)
Planned Maintenance Update (Outside Window): Notification of planned service maintenance that falls outside the approved maintenance timeframes. (Service Health Dashboard, To: Customer)
Transitions / Upgrades: Notification of service transitions and/or upgrades. (Service Health Dashboard, To: Customer)
Status: Description (SHD icons omitted)
Scheduled (5 business days advance notice): The planned maintenance activity has been scheduled.
In Progress: The planned maintenance activity is in progress. Please see the details for the expected time for completion.
Completed: The planned maintenance activity is complete.
Postponed: The planned maintenance activity has been postponed. Please see the details regarding the updated schedule.
Cancelled: The planned maintenance activity has been cancelled.

31 Microsoft is the company that businesses look to for the software they need to boost productivity and operate with efficiency, effectiveness, and intelligence. Microsoft Office Division, the division that produces Office 365, produced over half of Microsoft’s operating income. For some of our competitors, productivity is a minor side business at best.

32 To Network Interruptions… To Cloud Disruptions… To The Realities Of Business Life

33 The Office 365 service level agreement covers all services, with no exceptions! The definition of downtime for Office 365 is more than the “server-side error rate”: it covers real functionality, counting the times when users are unable to read, write, access, send, or receive data.

34 The Office 365 service level agreement refers to all end users, not just those exceeding a particular threshold.

35 The recovery time objective (RTO) and recovery point objective (RPO) are based on regular verification and what we believe we can deliver in a real disaster. Some of our competitors claim a zero RTO and zero RPO, even if they have needed to restore from tape in the past!
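
To make "regular verification" concrete, the sketch below checks the measurements from a recovery drill against stated objectives. It is an illustration under assumptions: `DrillResult`, `meets_objectives`, and every timestamp and objective value are hypothetical, and the slide does not publish Office 365's actual RTO or RPO figures.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    failure_time: datetime           # when the simulated disaster began
    last_replicated_write: datetime  # newest data present at the recovery site
    service_restored_time: datetime  # when users could work again


def meets_objectives(drill: DrillResult, rto: timedelta, rpo: timedelta) -> bool:
    """RTO bounds time to restore service; RPO bounds the window of lost data."""
    recovery_time = drill.service_restored_time - drill.failure_time
    data_loss_window = drill.failure_time - drill.last_replicated_write
    return recovery_time <= rto and data_loss_window <= rpo


if __name__ == "__main__":
    drill = DrillResult(
        failure_time=datetime(2024, 1, 10, 12, 0),
        last_replicated_write=datetime(2024, 1, 10, 11, 58),
        service_restored_time=datetime(2024, 1, 10, 12, 45),
    )
    # Hypothetical objectives: restore within 1 hour, lose at most 5 minutes of data.
    print(meets_objectives(drill, rto=timedelta(hours=1), rpo=timedelta(minutes=5)))
```

A claimed zero RTO and zero RPO would require both measured intervals to be zero in every drill, which is the slide's point about claims that a restore from tape would contradict.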

39 Biweekly service updates. Feature and capability releases every 90 days. Major feature and capability releases every months. Anomaly Detection, Improvement, Development.

40 Microsoft Office 365 will continue to offer the most resilient and predictable service availability experience for the cloud, backed by the most responsive support available and the most comprehensive financially backed SLA, to reflect our commitment to meeting your service availability needs.
