Modeling Retail Applications @ a Major Telecom Company
Predictive Analysis in a Multi-Tier Infrastructure
John Slobodnik
October 21, 2008
CMG Canada
Preparation for Modeling
- Get an application infrastructure diagram.
- Turn on Solaris process accounting.
- Install TeamQuest Manager.
- Install TeamQuest View.
- Gather the key performance indicator.
- Perform workload characterization.
- Perform predictive analysis using TeamQuest Model.
Infrastructure Diagram
- It is important to get this diagram to understand the infrastructure the multi-tier application resides on.
- Typically, an application support team is responsible for keeping these diagrams up to date.
Turn on Solaris Process Accounting
- Minimal additional CPU overhead, since the kernel is already collecting the data.
- Allows short-running tasks to be captured for workload characterization; normally, tasks under 0.5 seconds get grouped together.
- Applications with thousands of short tasks are prime candidates for this extra level of accuracy.
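As a concrete sketch, enabling it boils down to one command. The function below tries the Solaris 10+ `acctadm` route first and falls back to the classic SVR4 `turnacct`; the command names are the standard Solaris ones, but availability and the output path should be checked against your release's man pages:

```shell
#!/bin/sh
# Sketch: turn on Solaris process accounting so per-process records
# (including sub-0.5-second tasks) are captured for characterization.
enable_pacct() {
    if command -v acctadm >/dev/null 2>&1; then
        # Solaris 10+: extended per-process accounting in exacct format
        acctadm -e extended -f /var/adm/exacct/proc process
    elif command -v turnacct >/dev/null 2>&1; then
        # Older releases: classic SVR4 accounting, records in /var/adm/pacct
        turnacct on
    else
        echo "no Solaris accounting tool found" >&2
        return 1
    fi
    echo "process accounting enabled"
}
```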
Install TeamQuest Manager
- Install TeamQuest Manager on at least one server from each tier of the application architecture; at least one agent was installed in each of the 4 tiers.
- Customize the TQ database on each server:
  - Changed retention of 10-minute data to 2 weeks.
  - Changed retention of 1-minute data to 1 week.
  - Deactivated reductions.
  - Keep process information for 7 days (requires process accounting to be turned on).
- Created a silent install script to install the agent, plus a script (using tqdbu) to customize the database with the settings above.
- Record the silent install script with the syntax: install.sh -r silentinstallscriptnamehere tqmgr
Install TeamQuest Manager
- Create a specifications-file backup for each TQ database daily; this makes rebuilding the DB easier in case of disaster.
  - To create a specifications file called "productionDBspec": teamquesthomedirectory/bin/tqdbu -o productionDBspec
  - To recreate a new database from that file: teamquesthomedirectory/bin/tqdbu -c productionDBspec
- Put disk free-space monitoring in place. With process accounting on, a lot of data was gathered on our Oracle server; there was barely enough space to keep a week's worth of data in the existing filesystem.
- The monitor alerts us when there is less than 20% free space in the filesystem used by the TQ DB.
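The daily safeguard can be scripted; a minimal sketch, where `TQHOME` and the DB filesystem path are placeholder assumptions (only the `tqdbu -o`/`-c` flags are the ones quoted above):

```shell
#!/bin/sh
# Daily TQ database safeguard: specifications-file backup plus a
# free-space check on the filesystem holding the TQ DB.
TQHOME=${TQHOME:-/opt/teamquest}     # placeholder install path

backup_spec() {
    # Restore later with: "$TQHOME/bin/tqdbu" -c productionDBspec
    "$TQHOME/bin/tqdbu" -o productionDBspec
}

# Warn when the given filesystem is over 80% used, i.e. under the
# 20%-free alert threshold described on the slide.
check_space() {
    used=$(df -k "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
    if [ "$used" -gt 80 ]; then
        echo "WARN: TQ DB filesystem ${used}% used"
        return 1
    fi
    echo "OK: TQ DB filesystem ${used}% used"
}
```

Both functions would typically run from a daily cron entry; `check_space` assumes the mount point fits on one `df` output line.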
Install TeamQuest View
- TQ View was used to verify consistent performance across each server; this tells us the workload is consistent and reliable to use for modeling.
- Data for a whole week was analyzed to come up with the best time frame to use for modeling.
Gather Key Performance Indicator
- We asked the business what their key performance indicator (or main business driver metric) was.
- They were already tracking these sales numbers by hour in an Oracle database, using a customized SQL query.
- That query can be turned into a TeamQuest "custom metric", with historical reports created against it.
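For illustration, the hourly rollup itself is trivial; this hypothetical awk version stands in for the customized Oracle SQL query (the CSV layout, with an ISO timestamp in field 1, is an assumption):

```shell
#!/bin/sh
# Sketch: roll raw order timestamps up into the hourly sales KPI.
# Reads CSV on stdin; field 1 is assumed to be an ISO timestamp
# like 2008-10-21T12:01:09, so its first 13 chars name the hour.
orders_per_hour() {
    awk -F, '{ hour = substr($1, 1, 13); n[hour]++ }
             END { for (h in n) print h, n[h] }' | sort
}
```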
Workload Characterization
- Purpose: to uniquely identify application-related work that runs on each server. This is a prerequisite for modeling.
- Used TeamQuest View to list all processes that run on each server, then sorted processes into unique workloads.
  - This is the most labor-intensive part of the whole exercise; it can take days or weeks depending on the level of co-operation.
  - It requires co-operation from the application experts to identify which processes belong to their application.
  - Keep the number of workloads as small as possible. Our goal was two workloads per server: one for application-related work, and OTHER.
- Defined the workloads using TeamQuest Manager: on each server we created a new "Workload Set" containing a new "Workload" definition that uniquely identifies application-related activity.
  - We left the default "Example" workload set alone.
  - The selector "Login =" uniquely identified application-related work on our Web Services, authentication, WebLogic, and Oracle servers.
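The process-to-workload sorting can be prototyped outside TeamQuest before committing to workload definitions. A sketch, where the substring match on "login" merely stands in for the "Login =" selector (which is TeamQuest's own syntax) and the input is assumed to be `ps -eo comm,pcpu`-style output:

```shell
#!/bin/sh
# Sketch: split processes into the two workloads (APP vs OTHER)
# and total CPU% per workload. Reads a header line plus one
# "command cpu%" pair per line on stdin.
split_workloads() {
    awk 'NR > 1 { wl = (tolower($1) ~ /login/) ? "APP" : "OTHER"
                  cpu[wl] += $2 }
         END { for (w in cpu) printf "%s %.1f\n", w, cpu[w] }' | sort
}
```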
Using TeamQuest Model
- The most important decision for modeling is: "What timeframe do I base my model upon?" The answer depends on the application's peak usage from both a system-resource and a business-sales perspective.
- I use a combination of busiest CPU, busiest I/O, and highest sales to choose the timeframe. This has worked successfully for me using a 1-hour timeframe (and a 5-hour timeframe as well). Stay away from "problem" times.
- We then apply a growth percentage to that timeframe, equal to the business's estimated peak volume at their busiest time of year.
- We frame the growth percentage below and above 50%: if the model showed no weakness in the infrastructure at 50% growth, we created another model with enough growth applied to find a weakness.
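The growth what-if rests on simple queueing intuition: applying growth $g$ scales utilization, and response time climbs nonlinearly as utilization approaches saturation. Shown here for a single open queue with service time $S$ (TeamQuest Model solves the full multi-tier queueing network; this is only the shape of the curve):

$$U' = U\,(1+g), \qquad R' = \frac{S}{1-U'}$$

This is why a tier can absorb 50% growth quietly and then, once more growth pushes $U'$ toward 1, show a dramatic response-time jump.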
Using TeamQuest Model
- Outcome: we successfully identified the need for an additional Oracle node in the infrastructure.
- Other outcomes have been along the lines of: "Your infrastructure is sufficient to make it through this year's peak period; however, once growth from the current state hits 300%, the Web Services tier will be the bottleneck. Adding 2 more servers of the same build is recommended before that time."
Select data to build the Model
- Select "Generate Input File".
- Fill out time and date and click "Next".
- Confirm the Workload Set, click "Next".
- Click "Create Model Input File".
- "Save" the file: choose a filename, then save.
- Confirmation.
TQ Model - Assumptions
- TeamQuest was not installed on all the systems in the environment, so in the absence of that data we assume:
  - External web servers: the 4 Sun servers are load-balanced.
  - WebLogic tier: the 3 Sun servers are load-balanced, and the 2 Sun WebLogic instances perform twice the work of a single WebLogic instance on the larger Sun server.
  - Applications such as iPlanet, WebLogic, and Oracle are well instrumented.
  - The orders are coming from the external web server.
TQ Model - Findings, Recommendations & Results
- Findings for the multi-tier application environment:
  - The number of orders on mm/dd/yyyy from noon until 5 pm was n.
  - At 300% growth (nn orders from noon until 5 pm), the CPU in the UNIX web services iPlanet tier is maxed out, and response time is significantly higher than for n orders: 382.4% higher.
- Recommendations:
  - Add 2 additional nodes to the external web tier; plan to add the additional servers in 2009.
- Results:
  - TeamQuest time spent on the model: less than 2 hours.
TQ View - CPU Utilization
- Looking at CPU utilization of all the systems, no one day stands out as different from any other for CPU & I/O, so we chose the afternoon of mm/dd/yyyy, 12:00-17:00.
- We divided the work between application and non-application workloads.
What if we add 2 servers to the external web server tier?
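As a back-of-envelope companion to this what-if (illustrative numbers only; the real answer comes from TeamQuest Model, which also accounts for queueing), spreading the same tier-wide work over extra load-balanced servers scales per-server utilization by n/(n+add):

```shell
#!/bin/sh
# Naive balanced-tier what-if: per-server utilization after growth
# and after adding servers. All argument values are illustrative.
#   $1 = current per-server CPU busy fraction
#   $2 = current server count   $3 = servers added   $4 = growth fraction
whatif_util() {
    awk -v util="$1" -v n="$2" -v add="$3" -v g="$4" \
        'BEGIN { printf "%.3f\n", util * (1 + g) * n / (n + add) }'
}
```

For example, `whatif_util 0.30 4 2 1.0` prints 0.400: a tier at 30% busy per server absorbs 100% growth when the work is spread over six servers instead of four.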
What if we model the external web server on its own?
Frequency of Modeling
- During the peak time of year for the application, and again 6 months later (at a minimum).
- Prior to and after any major hardware change to the infrastructure.
- After any major software change to the infrastructure: application code changes, or vendor version changes such as a new version of WebLogic, a new OS level, or the latest version of Oracle.
- Software changes happen more frequently; it is not realistic (in my life) to re-do the exercise monthly.