EarthLink and Micromuse: Growing up Together Doug McClure EarthLink Operations Sr. Manager, Fault and Performance Mgmt June 3, 2004.

1 EarthLink and Micromuse: Growing up Together Doug McClure EarthLink Operations Sr. Manager, Fault and Performance Mgmt June 3, 2004

2 Fault & Performance Mgmt 1 Overview
One of the Nation's Largest ISPs
Headquarters in Atlanta, GA
–Key facilities in Dallas, TX; Pasadena and San Jose, CA; Knoxville, TN; and Seattle, WA
Profitable, strong balance sheet
Largest DSL footprint
First-to-market with products that provide the best possible Internet experience
Customer Advocacy: Fighting SPAM with technical solutions, litigation, legislative support, industry collaboration and consumer education
–Howard Carmack, aka the "Buffalo Spammer," was sentenced to 3-1/2 to seven years in prison on May 27th after EarthLink received a $16.4M civil judgment in May 2003
10th Anniversary (1994-2004)
–http://www.redefineyourworld.com

3 Overview
5.25M Customers
~4M Dialup (Premium ~3.5M, Value ~500K)
~1.2M Broadband (Cable, xDSL)
~160K Web Hosting (Unix, Windows)
~50K Wireless (Blackberry, PDA, Laptops, Wi-Fi)
Dial Access Coverage
> 90% of US Population
~16K Local Dial Access Numbers
~500K Active Modem Ports (~50% ELNK, ~50% Outsourced)
~400 PoPs (18 Core Backbone PoPs, four data centers)
Broadband Coverage
~200 Markets with Broadband Offerings
Large and Diverse Infrastructure
2300 Network Elements
1500 Server Elements
Thousands of Access Circuits, Hundreds of WAN Circuits

4 Overview
Access Technology Innovation
Premium and Value Dial-up
Broadband (Cable, xDSL, Satellite)
Voice (Converged Devices, VoIP)
Wireless (Wi-Fi, CDMA, Blackberry, PDA)
Broadband over Power Lines (BPL)
Value Added Service and Product Innovation
Blocker Family: spamBlocker, POP-UP Blocker, ScamBlocker, Virus Blocker, Spyware Blocker
Parental Controls
Webmail
Web Accelerator

5 Overview
Exceptional Customer Service
2003 PC Magazine Readers' Choice Awards for both high-speed and dial-up services
2003 highest ranking in customer satisfaction, for the second year in a row, for high-speed Internet service by J.D. Power and Associates in its Internet Service Provider Residential Customer Satisfaction Study(SM)
2003 CNET Editors' Choice award

6 Innovation = Constant Change
Drivers
Speed to Market, Competition – Do more, faster
Quality, Performance, Support Costs
Compliance - Sarbanes-Oxley
Operational Challenges
Release Management
Change Management
Service Level Management

7 Operations Maturity: Growing Up
Production Improvement Program (PIP)
Foundation in IT Service Management, ITIL, COBIT
Focusing on four main areas: Service Level Mgmt, Change Mgmt, Release Mgmt, and Production Security
–Over 10% of Operations staff have now attended ITIL Foundation Training
1 Master Level Certified (more planned)
9 Practitioner Level Trained in CCR Quadrant (pending certification results)
114 Foundation Level Trained (most pending certification results)

8 Operations Maturity: Growing Up
Service Level Management
NOC, Help Desk
Set and manage expectations internal/external to Operations
Change Management
Provide oversight and control of the production environment
Minimize risk and impact from change activities
Release Management
Development Operations
Minimize poor quality production releases
Enterprise Security
Compliance, control, audit

9 EarthLink and Micromuse Facts
Very Early Netcool Adopter
EarthLink (Mindspring) was Micromuse's first US customer
–Began evaluating Micromuse Netcool in 1996, official customer April 1997
Early Innovation
Early joint innovation and development helped build the foundation for many of Micromuse's key products
–EarthLink and Micromuse are revitalizing joint development projects with emerging service and business activity monitoring products
Driving 3rd Party Vendor Integration & Partnerships
EarthLink requires detailed integration with Micromuse suite – much more than just sending SNMP TRAPs
–Quest Software, Compuware, PeopleSoft, Remedy, Cisco Systems, Arbor Networks
Current Deployment
Netcool OMNIbus, Internet Service Monitors, Desktop Clients, Webtop, Impact, numerous Gateways, Probes, Data Source Adaptors
–Two Senior System Engineers, Three System Engineers, Two System Analysts devoted to Fault and Performance Management (Netcool + Other)
Services provided for NOC (3 shifts, 6 per shift), Systems Administration (3 shifts, 10 per shift), Network Engineering

10 Moving Beyond MoM and Apple Pie
EarthLink's Early Micromuse Netcool Deployment
Focused on Netcool as the "Manager of Managers" or MoM
Needed during EarthLink's rapid growth and expansion
Enabled event management, eliminated "swivel chair" NOC
"Apple Pie" is Event Correlation and Deduplication
The Netcool sweet spot was providing EarthLink with event correlation and deduplication
–Able to reduce the event stream from 100,000s to 1,000s per week
–Further reduction expected to 100s per week through use of advanced Netcool/Impact policies and deployment of Netcool/Precision
Enables NOC and support staffs to operate efficiently
Focus now on End-to-End Service Management
Netcool Suite allows EarthLink to manage entire service
–Understand service relationships, service levels, perform service modeling and service discovery
Enables impact assessment, prioritization, understanding service delivery chain
Eliminates "needle in the haystack" approach of event management
–"This is the problem that needs attention now" (compared to "I think this is the event causing problems")
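
The deduplication idea on this slide can be illustrated with a small sketch. It is not Netcool code: the `Event` fields, the `node:summary` identifier format, and the sample events are all assumptions made for illustration. Repeated events that share an identifier collapse into one row whose tally counts the repeats, which is how a large raw stream shrinks to a short list of distinct problems.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical event row; field names are illustrative only.
    identifier: str
    node: str
    summary: str
    first_seen: int
    last_seen: int
    tally: int = 1


def deduplicate(raw_events):
    """Collapse repeated (node, summary) events into one row per identifier."""
    table = {}
    for node, summary, timestamp in raw_events:
        identifier = f"{node}:{summary}"  # assumed identifier scheme
        row = table.get(identifier)
        if row is None:
            table[identifier] = Event(identifier, node, summary, timestamp, timestamp)
        else:
            # A repeat: bump the counter instead of adding a new row.
            row.tally += 1
            row.last_seen = max(row.last_seen, timestamp)
    return list(table.values())


raw = [
    ("rtr1", "Link down", 100),
    ("rtr1", "Link down", 105),
    ("web3", "High CPU", 110),
    ("rtr1", "Link down", 120),
]
rows = deduplicate(raw)
for row in rows:
    print(row.identifier, row.tally, row.first_seen, row.last_seen)
```

Four raw events become two rows here; the operator sees one "Link down" entry with a repeat count rather than three separate alarms.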

11 Service Management Complexity
[Figure: Good Customer Experience? Performance? Infrastructure Events to Netcool. Source: EarthLink Product Group]

12 Service Management Complexity
[Figure: Number of Components, Time (24x7x365), System Changes, Infrastructure Events. Source: EarthLink Product Group]
Identify key service elements
Instrument those elements
Consolidate & analyze data
Develop service model and SLAs
Dealing with EarthLink Service Complexity:
The complexity and amount of data generated from end-to-end service management are enormous
Networks, Firewalls, Servers, Applications, Switches, Routers, Load Balancers, Databases, etc.
Netcool/ObjectServer is a must have for EarthLink to effectively manage and understand EarthLink's service event stream from end-to-end
Impact 3.0's cluster capability will enable EarthLink to analyze, enrich, suppress, and manage event stream regardless of our growth
[Figure labels: RAD (future), Impact, Precision (future), ISM, System Agents, SNMP, ObjectServer]

13 The Customer IS Important
Customer Experience Monitoring and Management
The Micromuse Netcool Suite enables proactive, real-time monitoring of the customer's experience for core EarthLink services
–Over 14K Internet Service Monitor (ISM) instances in operation covering all key services (HTTP, HTTPS, SMTP, POP3, IMAP) and dedicated customers (ICMP)
Allows for customer experience monitoring information to be correlated, analyzed, and presented in real-time
–Micromuse Netcool/ISMs, Keynote, Compuware Client Vantage, Quest Foglight
–External/Internal synthetic testing, system & network element monitoring, system and network port monitoring
Immediate notification to support groups when the customer's experience degrades

14 The Business IS Important
Business Activity Monitoring and Management
Expands IT Operations visibility vertically and horizontally
Ties IT Operations data and Business data together
–System Downtime vs. Contact Center Call Volume
–Real-Time Customer Subscriptions vs. Sales Forecasts
Enables Real Time Monitoring and Management of Business and IT processes
–Change and Downtime Management
–Customer Registration Management

15 Production Improvement Program
[Diagram labels: Release Planning; Dev / Procurement; Release Design, Build; Release Acceptance; Roll-out Planning; Comm, Prep, Training; Distribution/Installation; Policy, Procedures, Standards & Guidelines; Security Consulting; Security Assessment; Security Monitoring; STATUS CHANGE (1) Prioritization, Risk Assessment and Forward Schedule of Change; STATUS CHANGE (2) Change Approval and Proj. Service Availability; STATUS CHANGE (3) Final Change Approval and Implementation; Metrics & Reporting; Corp Project; Ops Project; Non-Project; Prod Sec; REQUEST FOR CHANGE (RFC); CLOSED RFC; STATUS CHANGE (4) Review Changes; Security Test & Sign off; Release Mgt; Change Mgt. Source: EarthLink SLM Group]
Mutual Benefit from EarthLink's Innovation and Advanced Use of Micromuse Products
Micromuse OMNIbus, Impact, Webtop, RAD, NFSM

16 Business Activity Monitoring
Managing the Impact of Change and Downtime Activities on the Business and Operations

17 Overview
Drivers
Adoption of ITIL/COBIT Best Practices for Change Management
–Production Improvement Program (PIP), SOX Compliance, etc.
–Significant change for many groups – Fear, Uncertainty, Doubt (FUD)
No Real-Time Visibility into Change/Downtime Management Activities
–Business Process – Who, What, When, Where, Why, and How, Cost, Risk, and Impact
–Workflow – Monitor Lifecycle, SLAs, Bottlenecks – Is the process enabling Operations or is it a bottleneck?
–Impact on Infrastructure – False Positives, Contact Center Call Volume (COGS)
Drive out False Positives from Production Monitoring Systems
–Huge burden on NOC and other support staff
Desire to have Automated Remedy Trouble Ticket Creation
–Reduce time to address problems, reduce MTTR

18 Overview
Solution
Provide Real-Time Visibility into Change/Downtime Process
–"There are 12 pending and 24 scheduled change requests for tonight, 6 are underway and 8 start in 15 minutes or less"
Create Actionable Information
–"Dept. 828 has five outstanding major change requests, attention is needed"
Ensure Business Rules are Guiding/Enabling the Process – Not Hindering It
–Eliminate FUD
Report (dashboards, reports) on Process and Impact
–NOC and other support groups know what's happening during change and downtime windows
–Management has oversight and visibility
–Business understands impact of change and downtime activity
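
The kind of real-time status line quoted on this slide ("12 pending and 24 scheduled ... 8 start in 15 minutes or less") can be derived from a list of change requests with a few counters. The sketch below is only an illustration: the record layout, the status names, and the `summarize_changes` helper are assumptions, not part of the EarthLink RFC system.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_changes(change_requests, now, soon=timedelta(minutes=15)):
    """Count change requests per status, plus scheduled ones starting within `soon`."""
    by_status = Counter(req["status"] for req in change_requests)
    starting_soon = sum(
        1
        for req in change_requests
        if req["status"] == "scheduled" and now <= req["start"] <= now + soon
    )
    return {
        "pending": by_status["pending"],
        "scheduled": by_status["scheduled"],
        "underway": by_status["underway"],
        "starting_soon": starting_soon,
    }


# Hypothetical sample data for one evening's change window.
now = datetime(2004, 6, 3, 22, 0)
change_requests = [
    {"id": "RFC-1", "status": "pending", "start": now + timedelta(hours=3)},
    {"id": "RFC-2", "status": "scheduled", "start": now + timedelta(minutes=10)},
    {"id": "RFC-3", "status": "scheduled", "start": now + timedelta(hours=2)},
    {"id": "RFC-4", "status": "underway", "start": now - timedelta(minutes=30)},
]
summary = summarize_changes(change_requests, now)
print(summary)
```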

19 Implementation
Micromuse Netcool/OMNIbus
–Custom integration with Request for Change (RFC) and Downtime Management System
–ObjectServer flexibility allows for definition of important business and IT data in each event to capture: Change/Downtime Status, Service Impact, Business Impact, Customer Impact, SLA, Restoral Priority, Escalation Path, etc.
Micromuse Netcool/Impact 3.0
–Impact policies build lists in real time for all nodes listed in change/downtime request
–As change/downtime activity progresses through its lifecycle, the change/downtime Netcool event changes states
–Change/Downtime event suppression policy updates all incoming events that match node list during the maintenance window with Suppression Status and Change/Downtime Reference Number
Micromuse Netcool/Webtop 1.2 – RAD 2.0
–Process owner (Change/Downtime Management Group) dashboard for monitoring and managing the overall end-to-end process, workflow, and business impact
–Business group dashboards for monitoring change/downtime activities within area of control (Network Engineering, MIS, etc.)
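
The suppression policy described on this slide can be sketched in a few lines. This is plain Python rather than Netcool/Impact policy language, and every name in it (`ChangeWindow`, `apply_suppression`, the event dictionary keys, the reference number) is a hypothetical stand-in. The logic follows the slide: an incoming event whose node is on a change request's node list, and whose timestamp falls inside that maintenance window, is tagged with a suppression status and the change/downtime reference number.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeWindow:
    # Hypothetical record built from a change/downtime request.
    reference: str
    nodes: frozenset
    start: datetime
    end: datetime


def apply_suppression(event, windows):
    """Return a copy of the event tagged with suppression status and reference."""
    for window in windows:
        in_node_list = event["node"] in window.nodes
        in_window = window.start <= event["time"] <= window.end
        if in_node_list and in_window:
            return {**event, "suppressed": True, "change_ref": window.reference}
    return {**event, "suppressed": False, "change_ref": None}


windows = [
    ChangeWindow(
        reference="RFC-1001",
        nodes=frozenset({"rtr1", "sw2"}),
        start=datetime(2004, 6, 4, 1, 0),
        end=datetime(2004, 6, 4, 3, 0),
    )
]

during = apply_suppression({"node": "rtr1", "time": datetime(2004, 6, 4, 1, 30)}, windows)
other_node = apply_suppression({"node": "web3", "time": datetime(2004, 6, 4, 1, 30)}, windows)
after = apply_suppression({"node": "rtr1", "time": datetime(2004, 6, 4, 4, 0)}, windows)
print(during["suppressed"], other_node["suppressed"], after["suppressed"])
```

Tagging rather than discarding the event keeps it available for reporting, which matches the slide's description of updating events instead of dropping them.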

20 Webtop 1.2 Presentation

21 RAD 2.0 Presentation

22 Netcool Event Management
[Screenshot callouts: Change/Downtime Request Events; Suppressed Change/Downtime Activity Events; Change/Downtime Status; Event Suppressed by Change/Downtime; Change/Downtime ID]

23 Future Enhancements
Planned Netcool/Impact Policies
COGS Impact
–Assess support cost impact due to change and downtime activities within Operations and Customer Support in Real-Time
Data Gap Management
–A common question: "Why does my chart or graph have gaps?"
–The solution: Annotate graphs, charts, portals, etc. with the reason for data gaps caused by planned change/downtime activities
–How: Integrate change and downtime event information with all performance, utilization, and capacity monitoring solutions via Impact 3.0
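
A minimal sketch of the data gap annotation idea, under stated assumptions: samples are plain integer timestamps in minutes, change windows are `(reference, start, end)` tuples, and the helper names `find_gaps` and `annotate_gaps` are invented for this example. A gap is any pair of consecutive samples further apart than the polling interval, and it is annotated with the reference of the first change window that overlaps it.

```python
def find_gaps(timestamps, interval):
    """Return (start, end) pairs where consecutive samples exceed the interval."""
    gaps = []
    for previous, current in zip(timestamps, timestamps[1:]):
        if current - previous > interval:
            gaps.append((previous, current))
    return gaps


def annotate_gaps(gaps, windows):
    """Attach the reference of the first overlapping change window, if any."""
    annotated = []
    for gap_start, gap_end in gaps:
        reason = None
        for reference, window_start, window_end in windows:
            # Two intervals overlap when each starts before the other ends.
            if window_start < gap_end and gap_start < window_end:
                reason = reference
                break
        annotated.append((gap_start, gap_end, reason))
    return annotated


# Hypothetical 5-minute polling data with two holes in it.
samples = [0, 5, 10, 30, 35, 60]
windows = [("RFC-1001", 12, 28)]

gaps = find_gaps(samples, interval=5)
annotated = annotate_gaps(gaps, windows)
print(annotated)
```

A gap with no matching window (reason `None`) is the interesting case for operators, since it was not caused by planned work.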

24 Business Activity Monitoring
EarthLink Customer Registration, Provisioning, and Fulfillment Dashboards

25 RAD 2.0 Joint Development
Business Activity Monitoring: Real-Time Customer Registration Dashboard

26 RAD 2.0 Joint Development
Business Activity Monitoring: Real-Time Customer Registration Dashboard

27 Continuous Improvement
Building better Network and Systems Management
Founded Atlanta Network and Systems Management Technical User Group (ANSMTUG) in January 2004
–http://www.ansmtug.org
–Metro-Atlanta Fortune 100, Service Providers, Enterprise, Media, and Emerging Technology Companies
Bell South, The Home Depot, EarthLink, Southern Company, N2 Broadband, eDeltacom, Delta, CNN, Cingular, E*Trade, Knology Broadband, Cox Communications
Customers helping Customers
–Use Micromuse and other NSM products better
–Collectively drive product requirements and features into Micromuse and other NSM vendors
Special Interest Groups (SIG) Forming
–Best practices for NSM using Micromuse Netcool Suite
–Aligning NSM solutions to ITIL, MOF, COBIT, etc.

28 Challenges facing Micromuse
Product Development, Focus, and Release Cycle
–Business * Monitoring (BAM, BSM, BI, BTI, B-I-N-G-O)
–Performance Monitoring & Management Solution
–Features vs. New Product – Finding the Right Balance
–Licensing – Needs Review and Simpler Approach
–Support New Technologies Sooner Across Core Products
–Uniform Release Cycle (core architecture components and capabilities)
Discovery, Root Cause Analysis (RCA), Next-Gen Polling
–Emerging Competition
–Service / Application Discovery & RCA
–Universal Poller Concept
Out of the Box Functionality and Updates
–Appearance of Requiring Too Much Customization
Competition is focusing on this
Many customers have product still on the shelf
–Ease of Use
More out of the box, templates, examples, plug and play, wizards; Tools and Utilities section on Support website is a start
–Improving Documentation

29 Closing and Q&A
Closing
Q&A
Doug McClure
Sr. Manager, Fault and Performance Mgmt
EarthLink Operations
404-748-7665 (W)
678-362-7712 (C)
