What Marketing likes to show off… …how we actually get there.


1

2

3 What Marketing likes to show off… …how we actually get there

4 BUT…

5

6 Microsoft has datacenter capacity around the world…and we’re growing: 1 million+ servers; 100+ datacenters in over 40 countries; 1,500 network agreements and 50 Internet connections. (*Operated by 21Vianet)

7

8 Conversation for every service provider and customer has always been:

9 Low Mean Time To Resolution (MTTR) is only possible via: Data Driven, Secure, Automated. Big Data; External Signals; System Signals; Access Approval; Auditing; Compliance; Changes; Safety; Orchestration; Repair; Machine Learning

10 From Complex…To Simpler…

11

12

13 Office365.com: North America 1, North America n, Europe 1. DATACENTER AUTOMATION

14 DATACENTER AUTOMATION …is made up of a lot of stuff
Orchestration: Central Admin (CA), the change/task engine for the service
Deployment/Patching: Build, System orchestration (CA) + specialized system and server setup
Monitoring: eXternal Active Monitoring (XAM): outside-in probes; Local Active Monitoring (LAM/MA): server probes and recovery; Data Insights (DI): system health assessment/analysis
Diagnostics, Perf: Extensible Diagnostics Service (EDS): perf counters; Watson (per server)
Data (Big, Streaming): Cosmos, Data Pumpers/Schedulers, Data Insights streaming analysis
On-call Interfaces: Office Service Portal, Remote PowerShell admin access
Notification/Alerting: Smart Alerts (phone, email alerts), on-call scheduling, automated alerts
Provisioning/Directory: Service Account Forest Model (SAFM) via AD and tenant/user addition/updates via Provisioning Pipeline
Networking: Routers, Load Balancers, NATs
New Capacity Pipeline: Fully automated server/device/capacity deployment

15

16

17

18 Evolution of Monitoring == Data Analysis. Multi-signal analysis; Data driven automation; Confidence in data. Actions: communicate, snooze, recover, block (AUTO)

19 Data Insights Engine: 100s of millions of users; 100k+ servers; 15TB/day; dozens of components; dozens of datacenters, many regions; 500M+ events per hour; millions of organizations

20 Each scenario tests each DB worldwide (WW) about every 5 minutes, ensuring near-continuous verification of availability. Tests run from two+ locations to ensure accuracy and redundancy in the system. 250 million test transactions per day verify the service. Synthetics create a robust “baseline” or heartbeat for the service. (Diagram labels: Office365.com, NETWORK, PARTITION)
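A minimal sketch of how synthetic probe results from several locations might be combined. The slide only says probes run from two or more locations "to ensure accuracy"; the rule used here (a database counts as failing only when probes from at least two distinct locations fail) is an assumption, and the names `runs_per_day` and `confirmed_failures` are hypothetical.

```python
from collections import defaultdict


def runs_per_day(interval_seconds=300):
    """Probe rounds per day for one scenario/DB/location at a fixed interval."""
    return 86_400 // interval_seconds


def confirmed_failures(probe_results, min_locations=2):
    """Return databases that failed from at least `min_locations` locations.

    probe_results: iterable of (database, location, ok) tuples from one round.
    Requiring agreement between locations is an assumed rule, not taken
    from the slides.
    """
    failing = defaultdict(set)
    for database, location, ok in probe_results:
        if not ok:
            failing[database].add(location)
    return sorted(db for db, locs in failing.items() if len(locs) >= min_locations)


round_results = [
    ("db1", "west", False), ("db1", "east", False),  # fails everywhere
    ("db2", "west", False), ("db2", "east", True),   # one location only
    ("db3", "west", True), ("db3", "east", True),    # healthy
]
print(runs_per_day())                     # probe rounds per day at ~5 minutes
print(confirmed_failures(round_results))  # databases with confirmed failures
```

A single-location failure (db2 above) is treated as a possible network or probe problem rather than a service outage, which is one way to read "accuracy and redundancy".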

21 System handles 50K transactions/sec, over 1 billion “user” records per day

22

23

24 Deviation from normal means something might be wrong. Methodology for data computed: 99.5% and 0.5% historical thresholds; Moving Average +/- 2 Standard Deviations
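The two methods named on this slide can be sketched as follows. This is an illustration only: the window length, the nearest-rank percentile method, and the function names are assumptions, not details from the presentation.

```python
import math
from statistics import mean, stdev


def band_anomalies(values, window=12, k=2.0):
    """Indices whose value lies outside moving average +/- k standard deviations.

    The band for point i is computed from the `window` points before it.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if abs(values[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged


def percentile_thresholds(history, low=0.5, high=99.5):
    """Lower and upper historical thresholds using nearest-rank percentiles."""
    ordered = sorted(history)

    def pct(p):
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    return pct(low), pct(high)


# Hypothetical metric: steady around 100 with one spike at index 13.
series = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 101, 99, 100, 150, 100]
print(band_anomalies(series))
print(percentile_thresholds(range(1, 1001)))
```

The moving band adapts to recent behaviour, while the fixed percentile thresholds catch values that are extreme relative to the whole history; a real system would likely combine both with seasonality handling.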

25 4:46 PM is when the alert was raised. This is 4:46 PM!

26

27

28 CAPACITY

29

30

31

32

33 Tickets Opened: 1278; Tickets Closed: 1431; Tickets Currently Active: 196; % Automated Found: 77%; Average time to complete (hrs): 9.43; 95th Percentile (hrs): 28.73

34
1) Run a simple patching cmd to initiate patching: Request-CAReplaceBinary
2) CA creates a patching approval request email with all relevant details
3) CA applies patching in stages (scopes) and notifies the requestor
4) Approver reviews scopes and determines if the patch should be approved
5) Once approved, CA will start staged rollout (first scope only contains 2 servers)
6) CA moves to the next stage ONLY if the previous scope has been successfully patched AND health index is high
7) Supports “Persistent Patching” mode
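The gating logic in steps 5 and 6 can be sketched roughly as below. This is not Central Admin's implementation: the scope growth factor, the `min_health` threshold, and every function and parameter name are hypothetical. Only the two-server first scope and the "previous scope patched AND health index high" gate come from the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    patched: List[str] = field(default_factory=list)
    halted_at_scope: Optional[int] = None  # None means every scope completed


def build_scopes(servers, first_scope_size=2, growth=4):
    """Split servers into stages; the first scope holds only 2 servers.

    The growth factor for later scopes is an assumption.
    """
    scopes, start, size = [], 0, first_scope_size
    while start < len(servers):
        scopes.append(servers[start:start + size])
        start += size
        size *= growth
    return scopes


def staged_rollout(scopes, patch: Callable[[str], bool],
                   health_index: Callable[[], float],
                   approved: bool, min_health: float = 0.95) -> RolloutResult:
    """Patch scope by scope; stop at the first failure or low health index."""
    result = RolloutResult()
    if not approved:  # nothing is patched without approval (step 5)
        result.halted_at_scope = 0
        return result
    for index, scope in enumerate(scopes):
        scope_ok = True
        for server in scope:
            if patch(server):
                result.patched.append(server)
            else:
                scope_ok = False
        # Step 6: advance only if the scope succeeded AND health is high.
        if not scope_ok or health_index() < min_health:
            result.halted_at_scope = index
            break
    return result


servers = [f"srv{i}" for i in range(10)]
scopes = build_scopes(servers)
outcome = staged_rollout(scopes, patch=lambda s: True,
                         health_index=lambda: 0.99, approved=True)
print(len(outcome.patched), outcome.halted_at_scope)
```

Keeping the first scope tiny limits the blast radius of a bad patch: a failure or health drop stops the rollout before the larger scopes are touched.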

35

36

37 What we are today is a mix of experimentation, learning from others and industry trends (and making a lot of mistakes!)

38
Traditional IT: Highly skilled, domain specific IT (not true Tier 1). Success depends on static, predictable systems.
Tiered IT: Progressive escalations (tier-to-tier). “80/15/5” goal.
Direct Support: Tier 1 used for routing/escalation only. 10-12 engineering teams provide direct support of service 24x7.
DevOps: Direct escalations. Operations applied to specific problem spaces (e.g., deployment). Emphasize software and automation over human processes.
(Diagram labels: Service, IT, Product Team, Operations, Tier 1, Tier 2, Support, Other, Software Aided Processes; marker: “we are here”)

39 Provide any additional feedback on your on call experience.
“When I am on-call, I hate my job and consider switching. It's only that I work with great people when I'm not on-call”
“Build Freshness report has had bugs more than once. Need better reliability before deploying changes.”
“Too many alerts”
“It was much easier this time around than last.”

40

41

42

43

44

