AT&T Proprietary (Internal Use Only) Not for use or disclosure outside the AT&T companies except under written agreement
How to Manage Data Collection in a Large Environment
Paul K Merline & Mike Badaczewski
November 15, 2011
Page 2 Which is greater…the average attendance at Busch Stadium or the number of servers we collect data on every night?
Page 3 Answer
  AT&T systems collected nightly = 38,353
  Busch Stadium average nightly attendance = 38,196 (source: ESPN.com, as of 10-18-11)
Page 4 Data Collection Goals
  Provide consistent, standard and meaningful resource usage data for all servers to support Capacity Planning.
  Establish and maintain an environment capable of supporting data collection for 40,000 servers with the existing staff.
  Have the previous day's data available by 08:00 local time.
Page 5 Data Collection Overview
  Number of metrics collected and retention period are based on criticality of server (service levels).
  Separate data collection based on platform, e.g. UNIX, Windows, etc.
  Spread workload across several centralized data collection servers (Consoles).
  Stagger data collection across time zones.
  Analyzed data output is sent to the database server for Visualizer database loads.
Page 6 Data Collection Strategy
  Tier and Tier Level assigned based on:
    server criticality (MCA, normal production)
    status (production, test, development)
    in-service indicator
  Service Level assigned based on Tier and Tier Level, which determines:
    metrics collected
    retention period of metrics
  Collect and retain only the metrics necessary based on the criticality of the server
Page 7 Tier Mapping to Service Levels
Page 8 Data Collection Service Levels
  PLATINUM | GOLD | SILVER | BRONZE
  Tier 1 | Tier 2-1 | Tier 2-2 or 2-4
  MCA requiring extra care | MCA or on frame w/ MCA | Non-MCA production | Test and development
  Custom workload char | Standard workload char | No workload char
  35 UNIX metrics | 21 UNIX metrics | 10 UNIX metrics
  15 min. data summary | 60 min. data summary
  6 mos. data retention | 1 mo. data retention | 7 day data retention
  200 UNIX | 7,400 UNIX, 400 Win | 6,000 UNIX, 14,000 Win | 4,000 UNIX, 3,000 Win
Page 9 Data Collection Process
  Servers are grouped into collection domains based on:
    Service Level: Gold, Silver, Bronze
    Region: East, Central, Mountain, Pacific, Alaska, Hawaii
    Platform: UNIX, Windows, VMWare
    Frame: Frames, Non-Frames
  Target is 25 servers per domain for performance reasons.
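The grouping described on this slide can be sketched in a few lines. This is only an illustration, not the presenters' tooling: the `Server` record, its field names, and `build_domains` are hypothetical, and treating the 25-server target as a hard cap on each domain is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple

class Server(NamedTuple):
    # Hypothetical record; field names are illustrative only.
    name: str
    service_level: str  # e.g. "Gold", "Silver", "Bronze"
    region: str         # e.g. "East", "Central", "Pacific"
    platform: str       # e.g. "UNIX", "Windows", "VMWare"
    on_frame: bool

TARGET_DOMAIN_SIZE = 25  # target from the slide, used here as a cap (assumption)

def build_domains(servers, size=TARGET_DOMAIN_SIZE):
    """Group servers by the four attributes, then split each group into domains."""
    groups = defaultdict(list)
    for s in servers:
        key = (s.service_level, s.region, s.platform, s.on_frame)
        groups[key].append(s.name)
    domains = []
    for key in sorted(groups):
        names = sorted(groups[key])
        # Chunk each group so that no domain exceeds the size limit.
        for i in range(0, len(names), size):
            domains.append((key, names[i:i + size]))
    return domains

servers = [Server(f"gold{i:03d}", "Gold", "East", "UNIX", False) for i in range(60)]
servers += [Server(f"silver{i}", "Silver", "Pacific", "Windows", True) for i in range(3)]
domains = build_domains(servers)
print([len(names) for _, names in domains])  # [25, 25, 10, 3]
```

Sixty servers sharing the same attributes end up in three domains, while the three servers with different attributes get a domain of their own.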
Page 10 UNIX Metric Groups
Page 11 Data Retentions
Page 12 Current Data Collection Counts
  Totals by Service Level
  Service Level | UNIX Systems | Windows Systems | VMWare Systems | Total Systems | % of Total
  Gold | 6,091 | 911 | 269 | 7,271 | 19%
  Silver | 7,786 | 8,714 | ---- | 16,500 | 43%
  Bronze | 8,413 | 2,022 | 4,147 | 14,582 | 38%
  Total | 22,290 | 11,647 | 4,416 | 38,353 | 100%
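The totals and percentages on this slide can be re-derived from the per-platform counts. The sketch below reads the Bronze UNIX entry as 8,413 (the only reading consistent with the printed totals) and treats the Silver VMWare entry, shown as "----", as zero; both readings are assumptions.

```python
# Per-platform counts from the slide; Silver VMWare ("----") is taken as 0.
counts = {
    "Gold":   {"UNIX": 6091, "Windows": 911,  "VMWare": 269},
    "Silver": {"UNIX": 7786, "Windows": 8714, "VMWare": 0},
    "Bronze": {"UNIX": 8413, "Windows": 2022, "VMWare": 4147},
}

# Row totals, column totals, and the grand total.
level_totals = {level: sum(row.values()) for level, row in counts.items()}
platform_totals = {
    platform: sum(row[platform] for row in counts.values())
    for platform in ("UNIX", "Windows", "VMWare")
}
grand_total = sum(level_totals.values())

# Share of all collected systems per service level, rounded to a whole percent.
percent_of_total = {
    level: round(100 * total / grand_total) for level, total in level_totals.items()
}

print(level_totals)      # {'Gold': 7271, 'Silver': 16500, 'Bronze': 14582}
print(platform_totals)   # {'UNIX': 22290, 'Windows': 11647, 'VMWare': 4416}
print(grand_total)       # 38353
print(percent_of_total)  # {'Gold': 19, 'Silver': 43, 'Bronze': 38}
```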
Page 13 Data Collection Tools
  The BMC Performance Assurance product family offers a complete solution for performance management of UNIX and Windows systems. It delivers the following critical functions for managing distributed systems:
    Real-time monitoring
    Modeling and predicting
    Graphical performance analysis
Page 15 BMC Consoles and Visualizer Database
  Number of Servers Collected from Nightly Automation
    Console A: 8,566 servers, 476 domains
    Console B: 8,743 servers, 475 domains
    Console C: 11,970 servers, 485 domains
    Console D: 9,074 servers, 489 domains
  Visualizer Database Schemas
    Visualizer database is 2.3 TB in size and divided into 92 schemas by:
      Platform
      Time Zone
      Service Level (limit to 1,000 servers per schema for performance)
    UNIX: 62 schemas
      East | Central | Pacific
      Gold 6, Silver 8, Bronze 6 | Gold 6, Silver 8, Bronze 7 | Gold 5, Silver 7, Bronze 4
      All Other: 5
    Windows: 26 schemas
      East | Central | Pacific
      Gold 1, Silver 5, Bronze 2 | Gold 1, Silver 6, Bronze 2 | Gold 1, Silver 3, Bronze 1
      All Other: 4
    VMWare: 4 schemas
      East | Central | Pacific
      Gold 1 | Gold 1
      All Other: 1
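A quick cross-check of the figures on this slide against the stated limits (a 25-server target per domain, 1,000 servers per schema) can be sketched as follows. Pairing each (servers, domains) figure with a console letter by listing order is an assumption, as is comparing simple averages against the limits.

```python
# (servers, domains) per console; the letter-to-figure pairing follows
# the order in which the figures appear on the slide (assumption).
consoles = {
    "A": (8566, 476),
    "B": (8743, 475),
    "C": (11970, 485),
    "D": (9074, 489),
}
schemas_by_platform = {"UNIX": 62, "Windows": 26, "VMWare": 4}

total_servers = sum(servers for servers, _ in consoles.values())
total_domains = sum(domains for _, domains in consoles.values())
total_schemas = sum(schemas_by_platform.values())

# Averages only; individual domains or schemas may sit above or below these.
avg_per_domain = total_servers / total_domains
avg_per_schema = total_servers / total_schemas

print(total_servers, total_domains, total_schemas)  # 38353 1925 92
print(f"{avg_per_domain:.1f} servers per domain on average (target: 25)")
print(f"{avg_per_schema:.0f} servers per schema on average (limit: 1,000)")
```

The console totals add up to the 38,353 servers and 1,925 domains quoted elsewhere in the deck, and the schema counts add up to 92.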
Page 16 Data Collection Process
  Perform binaries are laid down with the Patrol installation on the server (node).
  A collector runs on each server (node) and writes data to disk periodically (currently set to 15 minutes).
  The data is pulled by the Perform Console and processed nightly (hourly summarization), creating VIS files.
  Nightly automation consists of 3 processes:
    Retrieve
    Analyze
    Populate
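The three nightly steps can be pictured as a small pipeline. The sketch below is not the BMC Perform implementation: `retrieve`, `analyze`, and `populate` are hypothetical stand-ins, the database is a plain dict, and summarizing the 15-minute samples by a simple hourly mean is an assumption about what "hourly summarization" does.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def retrieve(collected):
    """Stand-in for Retrieve: pull each node's 15-minute samples."""
    return {node: list(samples) for node, samples in collected.items()}

def analyze(raw):
    """Stand-in for Analyze: average each node's samples into hourly buckets."""
    hourly = {}
    for node, samples in raw.items():
        buckets = defaultdict(list)
        for ts, value in samples:
            # Truncate the timestamp to the top of its hour.
            buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
        hourly[node] = {hour: mean(vals) for hour, vals in sorted(buckets.items())}
    return hourly

def populate(database, hourly):
    """Stand-in for Populate: load hourly summaries into a dict-based store."""
    for node, rows in hourly.items():
        database.setdefault(node, {}).update(rows)
    return database

def nightly_automation(collected, database):
    return populate(database, analyze(retrieve(collected)))

# Five 15-minute CPU samples for one hypothetical node.
collected = {
    "node1": [
        (datetime(2011, 11, 14, 0, 0), 10),
        (datetime(2011, 11, 14, 0, 15), 20),
        (datetime(2011, 11, 14, 0, 30), 30),
        (datetime(2011, 11, 14, 0, 45), 40),
        (datetime(2011, 11, 14, 1, 0), 50),
    ]
}
db = nightly_automation(collected, {})
print(db["node1"][datetime(2011, 11, 14, 0, 0)])  # 25
```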
Page 17 Nightly Automation Scheduling
Page 18 Monitoring Environment Results
  Nightly automation stats:
    7 time zones
    39 states
    256 cities
    1,925 domains
    1,947 VIS files
    38,353 servers
    621,615 UDR files
  13.5 new servers added per day over the last year (4,947)
Page 19 AT&T Capacity Planning Database
Page 20 AT&T Application Reporter
Page 21 Bonus Material BMC 7.5
Page 22 Performance Assurance Release 7.5 New Features and Functionality
  New Virtualization Support
    SUN Solaris Logical Domains (LDoms)
    SUN Chip Multi-Threading (CMT) technology
    IBM AIX Live Partition Mobility
    IBM AIX Workload Partitions (WPARs)
    IBM PowerVM
    HP Integrity Virtual Machines (IVM)
    Microsoft 2008 Virtualization Server (Hyper-V)
  Enhanced VMware Virtualization Support
    Cluster, resource pool, disk and datastore metrics
    Info on relationships between servers, virtual machines, pools
    Perceiver support for cluster, resource pool and disk views
  Improvements to proxy data collector
  Complete re-design of Visualizer tables and relationships
Page 23 Performance Assurance Release 7.5 New Features and Functionality (cont)
  Console Operations
  Improvements to Manager for recovery and reprocessing of data
  Manager exception reports
  Officially supported Service Levels
  New General Manager web application to manage Perform and Perceiver (daily operation and exceptions)
  UDR Transfer Utility
  Changes to management of Hardware table for performance ratings
  Changes to the Visualizer database structures
  Problem resolutions and enhancement implementations
Page 24 7.5 Migration Issues
  Some Visualizer tables have been re-designed to accommodate metrics for virtual servers (current metrics may have moved to new tables).
  The changes in Visualizer require migrating all data from the old 7.4 schemas to the new 7.5 schemas.
  If multiple Consoles update the same Visualizer schema, all Consoles must be migrated to release 7.5 at the same time.
  The Visualizer database migrations must be done at the same time the Consoles are migrated to release 7.5.
  Therefore, in our environment, all Consoles and all Visualizer databases must be migrated to 7.5 simultaneously.
  Per BMC, very large Visualizer schemas may take longer than a day to migrate to 7.5 (we have 90+ Visualizer schemas).
  Per BMC, the most significant problems they have seen with the new release involve database migrations.