ARC301 BUT… we are here FromTo Chicago Cheyenne Dublin Amsterdam Hong Kong Singapore San Antonio Microsoft has datacenter capacity around the.

Presentation transcript:

1

2 ARC301

3

4

5 BUT…

6

7 we are here

8 From To

9 Microsoft has datacenter capacity around the world…and we’re growing
Locations: Chicago, Cheyenne, Dublin, Amsterdam, Hong Kong, Singapore, San Antonio, Boydton, Shanghai, Quincy, Des Moines, Brazil
35+ factors in site selection:
- Proximity to customers
- Energy, Fiber Infrastructure
- Skilled workforce
"Data Centers have become as vital to the functioning of society as power stations." The Economist

10

11

12 Office365.com
Our surface area is too big/partitioned to manage sanely
Service management is largely done via our Datacenter Service Fabric
North America 1, North America n, Europe 1
DATACENTER AUTOMATION

13 Diagram labels: Big Data, External Signals, System Signals, Machine Learning, Access Approval, Auditing, Compliance, Changes, Safety, Orchestration, Repair
We simplify by focusing all our work along the three pillars; these work in tandem to create a great service fabric
Allows us to create a virtuous automation system that is SAFE and DATA DRIVEN while being AGILE at very high scale

14 DATACENTER AUTOMATION
- Orchestration: Central Admin (CA), the change/task engine for the service
- Deployment/Patching: Build, System orchestration (CA) + specialized system and server setup
- Monitoring: eXternal Active Monitoring (XAM): outside-in probes; Local Active Monitoring (LAM/MA): server probes and recovery; Data Insights (DI): system health assessment/analysis
- Diagnostics, Perf: Extensible Diagnostics Service (EDS): perf counters; Watson (per server)
- Data (Big, Streaming): Cosmos, Data Pumpers/Schedulers, Data Insights streaming analysis
- On-call Interfaces: Office Service Portal, Remote PowerShell admin access
- Notification/Alerting: Smart Alerts (phone, email alerts), on-call scheduling, automated alerts
- Provisioning/Directory: Service Account Forest Model (SAFM) via AD and tenant/user addition/updates via Provisioning Pipeline
- Networking: Routers, Load Balancers, NATs
- New Capacity Pipeline: Fully automated server/device/capacity deployment

15

16

17 Multi-signal analysis
Data driven automation
Confidence in data
Actions: communicate, snooze, recover, block
AUTO AUTO

18

19 Office365.com (diagram labels: PARTITION, NETWORK)
- Each scenario tests each DB worldwide (WW) roughly every 5 minutes, ensuring near-continuous verification of availability
- Tests run from two or more locations to ensure accuracy and redundancy in the system
- 250 million test transactions per day verify the service
- Synthetics create a robust “baseline” or heartbeat for the service
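To get a feel for the scale on this slide, the two stated figures (250 million test transactions per day, a test cycle of about 5 minutes) can be turned into rough rates. This is only back-of-the-envelope arithmetic: the slide gives the cycle as approximate, and the function names below are illustrative, not part of any Microsoft tooling.

```python
# Rough rates implied by the slide's figures. The 5-minute cycle is
# approximate in the source, so treat the results as order-of-magnitude.
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60

TRANSACTIONS_PER_DAY = 250_000_000  # from the slide
CYCLE_MINUTES = 5                   # "~5mins" on the slide


def transactions_per_second(per_day: int) -> float:
    """Average synthetic transactions per second over a full day."""
    return per_day / SECONDS_PER_DAY


def cycles_per_day(cycle_minutes: int) -> int:
    """Number of complete test cycles that fit in one day."""
    return MINUTES_PER_DAY // cycle_minutes


def transactions_per_cycle(per_day: int, cycle_minutes: int) -> float:
    """Average synthetic transactions issued in one test cycle."""
    return per_day / cycles_per_day(cycle_minutes)


if __name__ == "__main__":
    print(round(transactions_per_second(TRANSACTIONS_PER_DAY)))               # 2894
    print(cycles_per_day(CYCLE_MINUTES))                                      # 288
    print(round(transactions_per_cycle(TRANSACTIONS_PER_DAY, CYCLE_MINUTES))) # 868056
```

So the stated volume works out to roughly 2,900 synthetic transactions per second, or on the order of 870,000 per 5-minute cycle.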

20

21

22

23 Deviation from normal means something might be wrong
Methodology for data computed:
- 99.5% and 0.5% historical thresholds
- Moving Average +/- 2 Standard Deviations
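The two checks named on this slide can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the slide does not say how the window is chosen, whether the sample or population standard deviation is used, or how the two checks are combined, so the window size, the use of the sample standard deviation, the nearest-rank percentile, and all function names here are assumptions.

```python
import math
from statistics import mean, stdev


def moving_band(values, window, k=2.0):
    """Band of mean +/- k sample standard deviations over the last `window` values."""
    recent = values[-window:]
    if len(recent) < 2:
        raise ValueError("need at least two values to compute a deviation")
    centre = mean(recent)
    spread = stdev(recent)  # sample standard deviation (an assumption)
    return centre - k * spread, centre + k * spread


def percentile(values, pct):
    """Nearest-rank percentile of `values` (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_deviation(history, current, window=10):
    """Report whether `current` falls outside each of the two thresholds."""
    low_band, high_band = moving_band(history, window)
    low_hist = percentile(history, 0.5)
    high_hist = percentile(history, 99.5)
    return {
        "outside_moving_band": current < low_band or current > high_band,
        "outside_historical": current < low_hist or current > high_hist,
    }


if __name__ == "__main__":
    # Hypothetical metric history, for illustration only.
    history = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
    print(check_deviation(history, 100))
    print(check_deviation(history, 150))
```

Returning the two flags separately leaves the combination policy (alert on either, or only on both) to the caller, since the slide does not specify it.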

24 4:46 PM is when the alert was raised
This is 4:46 PM!
- Allows us to inform customers in real time
- Keeps engineers focused on recovery
- Improves transparency with support and others who keep customers happy

25

26

27 CAPACITY

28

29

30

31

32

33
1) Run a simple patching cmd to initiate patching: Request-CAReplaceBinary
2) CA creates a patching approval request email with all relevant details
3) CA applies patching in stages (scopes) and notifies the requestor
4) Approver reviews scopes and determines if the patch should be approved
5) Once approved, CA will start staged rollout (first scope only contains 2 servers)
6) CA moves to the next stage ONLY if the previous scope has been successfully patched AND health index is high
7) Supports “Persistent Patching” mode
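The gating logic in steps 5 and 6 can be sketched in a few lines. This is a hypothetical illustration and does not reflect Central Admin's real interface: only the two-server first scope and the "patched successfully AND health index is high" gate come from the slide. The growth factor for later scopes, the 0.95 health threshold, and every function name are assumptions.

```python
from typing import Callable, List, Sequence


def build_scopes(servers: Sequence[str], first_scope_size: int = 2,
                 growth: int = 4) -> List[List[str]]:
    """Split servers into stages. The first stage has 2 servers (from the
    slide); multiplying each later stage by `growth` is an assumption."""
    scopes, start, size = [], 0, first_scope_size
    while start < len(servers):
        scopes.append(list(servers[start:start + size]))
        start += size
        size *= growth
    return scopes


def staged_rollout(scopes: List[List[str]],
                   patch: Callable[[str], bool],
                   health_index: Callable[[], float],
                   approved: bool,
                   min_health: float = 0.95) -> List[List[str]]:
    """Patch scope by scope and return the scopes that were fully patched.

    The rollout does not start without approval, and it moves to the next
    scope only if the previous one patched successfully AND the health
    index is still at or above `min_health` (threshold is an assumption).
    """
    completed: List[List[str]] = []
    if not approved:
        return completed
    for scope in scopes:
        if not all(patch(server) for server in scope):
            break  # a server failed to patch: halt the rollout
        completed.append(scope)
        if health_index() < min_health:
            break  # scope patched but service health dropped: halt
    return completed


if __name__ == "__main__":
    servers = [f"srv{i}" for i in range(11)]  # hypothetical server names
    scopes = build_scopes(servers)
    done = staged_rollout(scopes, patch=lambda s: True,
                          health_index=lambda: 1.0, approved=True)
    print([len(s) for s in done])
```

Starting with a very small scope limits the blast radius of a bad patch: a failure or a health drop in the first stage stops the rollout after at most two servers have been touched.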

34

35 Old network design…To new:

36

37

38 What we are today is a mix of experimentation, learning from others and industry trends (and making a lot of mistakes!)

39 [Diagram labels: Product Team, Service, Operations, Tier 1 Operations, Tier 2 Operations, Software Aided Processes, Support, Other; marker: "we are here"]

40 September People Impact
1. 176 unique on-calls were paged
2. 33 of them got > 15 pages (40% of pages)
3. 30 got >= 8 and <= 15 (35%)
4. 113 < 8 pages (15% of pages)

41

42

43

44

45

46

47

