Planning and Auditing Your Firm’s Capacity Planning Efforts By Ron “The Hammer” Kaminski


1 Planning and Auditing Your Firm’s Capacity Planning Efforts By Ron “The Hammer” Kaminski Ronald.R.Kaminski@kcc.com ron@kaminski-family.com

2 Foreign speaker rules Please feel free to stop me to ask any questions –Raise your hand or clap if I am going too fast or if my Mississippi accent becomes impossible for y’all to understand This is not rude, and I will not take it that way –The paper and all slides will be furnished to my hosts

3 Introduction Over the past 20 years, I’ve started and expanded capacity planning groups at dozens of firms; the most recent is now 15 months old – You learn things in that process – CMG is the place to share this information – I look forward to your presentation on this topic in a few years! Today’s goal is to give you “planning and audit points” that you can use to review how you do capacity planning, and maybe persuade you that other methods might be more productive, or at least worth a shot! There will also be “How to” information that may have you adding some “to do” items to your list If you have a question, ask it! – I like nothing better than surfing off on a tangent that helps the class Story Times! New risks © Ron Kaminski 2010, All Rights Reserved

4 Introduction In the next few hours, we will cover – Defining your mission – Picking the right vendor partners – Going “Extra-Product” – Avoiding the “IT Mindset Traps” – The politics of capacity planning in organizations, the key factor in your eventual success, or failure – Reporting, what you should and surprisingly should not do – Classic capacity planning question descriptions and proper answering techniques

5 Introduction In the next few hours, we will cover – How clouds and “software as a service” will still need capacity tracking and planning tools, and what new kinds you will need – Modeling when all of the cards are stacked against you, or “Tricks of the trade” – Goals to work towards – An audit list to compare to your systems Capacity planning done well can change the fortunes of a company and help all of our careers. Come sharpen your methods and learn tricks that will make you part of your firm’s future productive assets, and not an expense to be controlled

6 Ron’s Rules You can ask anything, at any time – Sometimes the answer is coming up soon in the examples, and in that case I’ll tell you so Quick Survey – Does anyone here already have… A network queuing theory based modeling package? Regular, automated process and workload pathology detection? Fast web reporting of resource consumption by business useful workloads? By the end of this talk, I hope that you will realize that workload characterized views of consumption, web accessible, over business useful time spans are a must have part of the best run IT shops – Let’s see why…

7 Defining your mission Every site has their own “Hot button!” issues – “We are buying a new $23 million computer room every 6 months!” Attack server sprawl with data, not words – “I don’t know why we hired a capacity planner, we just…” – “Our critical applications are slowing down!” Use relative response times and historical information to show why – Chargeback used to be a big draw but it has really faded away in the post-dot-com world It shows you when you are talking to an old vendor – The ITIL push and reality when facing outsourcing or “ZOG” ITIL takes a back seat to cost control, at least in the states – “We need better reporting!” Be careful to be holistic in what you deliver, cover everything that they can buy, historically and ideally with business cycle peaks When you start hearing terms like “focus on business priority” and “really look at travel expenses” realize that cost cutting is in your future and report in ways that enable them to cut power and machines

8 Defining your mission You might think that all that variation would lead to very different solutions, and you’d be wrong! – All effective capacity planning systems are based on having: Efficient data collection, regrouping, reduction and storage Effective graphical reporting of business meaningful spans of time Components of workload response time that lead to diagnosis Solving the desire for answers to “What if…?” questions Problematic consumption diagnosis, reporting and ticketing – Some capacity planning product “features” marketed by vendors to the naïve are actually seldom used in the real world, and for good reasons Linear Trending, when what you really need is business cycle discovery and planning – The retail cycle at grocery chains and web payment system vendors Real Time Monitors, when you might want to go home or on vacation some day. Remember, problems happen 24 X 7, and humans won’t be watching “twitch monitors” that consistently. - The mission control room story Top 10 is often used to focus a newbie on peak consumption, which may all be valid

9 Defining your mission Who is doing the reporting? – Vendor supplied reports Tend to be single metric Often don’t include contextual information Are often “generate on demand” and therefore any useful span of time takes beyond the allowable attention span Often have serious contextual clarity problems – Workloads change colors as » the number present changes » You switch machines » Use black outlines that swamp the colors for small workloads – The “I’m only using vendor reports this time” and hit count story Can take unimaginable resources to produce – Set yourself a consumption budget and manage to it – You want to trade more bonds? Stop looking at it! May focus on reporting “right now” data rather than long term useful decision support information Seldom contain “disturbance to the status quo” notation capabilities

10 Defining your mission Who is doing the reporting? – Write your own reports Can be anything that you dream up (and can deliver the code for) There are multiple “free” languages and infrastructure to pick from – We’ve used perl, PHP, java and a whole lot more Can be tailored for your firm’s decision maker’s specific needs Can use “generate ahead” and other techniques to speed web reporting Writing your own can also have “down sides” – Staff turnover and the “Who is going to maintain this ___?” issues – Some staff are not gifted visual communicators – If the information used changes formats, (and over time they all do) someone is going to have to maintain that stuff

11 Defining your mission What do you want to present? – Workload characterized subdivisions of consumption over time? – Long term historical context for decision makers over multiple natural business cycles? – Information subdivided into audience specific groupings for ease of use by subgroups – Integration into your firm’s CMDB Ticketing systems Software development life cycle – Totals over time The sparklines counter-argument

12 Why sparklines of totals can be really useful These are sparklines of total CPU used, average CPU used and the average CPU used by all nodes in that O/S Is there one in particular that draws your eye, that makes you want to probe deeper?

13 Why sparklines of totals can be really useful If you are like me, ustca102 has you wondering, “What made it step up like that?” On our system, clicking on the tiny sparkline brings up a “zoomed in” image, which really gets you wondering: Clicking on that graphic brings up our normal web reporting system:
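A minimal sketch of what such a sparkline page boils down to, with invented node names and CPU series (real systems would render tiny images, and the click-to-zoom behavior obviously is not reproducible in text):

```python
# Map each day's total CPU to one of eight block characters, scaled to the
# series peak, so a step change stands out on a page of hundreds of nodes.
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Render a numeric series as a one-line Unicode sparkline."""
    peak = max(values) or 1  # avoid dividing by zero on an all-idle node
    return "".join(BLOCKS[min(7, int(v / peak * 8))] for v in values)

# Invented daily CPU totals; ustca102's step up mimics the slide's example.
daily_cpu = {
    "ustca101": [12, 14, 13, 15, 14, 13, 14],
    "ustca102": [11, 12, 11, 48, 52, 55, 54],
}

for node, series in daily_cpu.items():
    print(f"{node}: {sparkline(series)}")
```

Scaling each line to its own peak keeps the shape visible regardless of machine size; scaling to one global peak instead would let you compare absolute loads across nodes.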

14 Why sparklines of totals can be really useful

15 Why sparklines of totals can be really useful OK, sometimes totals are useful – Sometimes they can draw your eye to issues – They can quickly dispel rumors that “All of our machines are maxed out!” For example, our applications specialists were consistently maintaining that all of their machines were barely big enough to make month end, and they would argue mightily whenever we might suggest that there was room for consolidation I brought the chart on the next slide to the next meeting, and suddenly their tune changed…

16 Why sparklines of totals can be really useful

17 Why sparklines of totals can be really useful What happened after the meeting? – In the next 9 months, using extremely conservative criteria, we Virtualized 230 machines ($1,521,000) Retired 55 machines ($390,553) – “Oh! You can just turn that off!”, or, “See steam come out of the operations folks’ ears” stories Planned 10 machines ($40,000) Potential 28 machines ($112,000) – We then plan on going back over with slightly less conservative criteria and finding a couple million more – We will also be doing more “application stacking” where it makes more sense Sort of makes capacity planning tools look cheap, doesn’t it?

18 Why sparklines of totals can be really useful A DBA pal of mine asked for a review of memory on a box, asking for an increase to add caching and improve performance – I didn’t really detect a memory shortage:

19 Why sparklines of totals can be really useful Still, people don’t usually mention issues unless there is an underlying cause. So, as a capacity planner, you have to always look deeper and always check all of the following: – CPU – Disk I/O – Memory – Network – Response time for key workloads If you don’t always check everything, something can sneak by – Here is what I found when I followed the “always check everything” rule When I looked at CPU, I saw:

20 Why sparklines of totals can be really useful

21 Why sparklines of totals can be really useful

22 Update! They’ve since added 2 more CPUs and the issue continues unabated – Some issues are not based in physics and data!

23 New, new update, Just for St. Louis!

24 New, new update, Just for India! In the end, someone looked at what was running, and decided most was waste! Look at what happened after Feb 22nd!

25 Why sparklines of totals can be really useful Now you see several reasons why longer term sparklines can be pretty useful – Do you currently have ways to generate them? – If not, do you want to get ways to generate them? – Don’t you all think that your vendor ought to provide them, in group and zoomed in formats? So let’s start asking them to… Do you also see why you should always check everything and then sit back and ask yourself: – “If I had asked that question and then got this response, what would I ask next?”

26 Defining your mission Anticipate the “next questions” and always answer them before being asked –The unanswered “next question” can be a huge time waster, often a stall technique used by the politically astute –It raises temporary doubt in your findings, and builds their case for swift purchase, before you answer their question –Often a way for the old guard to show that they still are the “top dogs” to management Impatient or frightened management might run off and buy something! The undeclared war between Project Managers and Capacity Planners The “project manager weasel who never lost” story

27 Defining your mission –If you are going to shoot down someone’s hypothesis that lack of CPU was the cause of a problem, you’d better find out what really caused the problem before the meeting! –Your goal: One meeting or phone call per issue! –They may say “We just want a quick and dirty answer” but they never really do! –Always cover at least: CPU Memory Disk I/O Workload response time changes For web-centric systems, network distances and loads

28 Defining your mission Cultural differences are real and might affect your workload choices –Some cultures avoid direct blame or information that would cause someone to “lose face” –Any workloads are better than none –The “No personal pronouns” story Be consistent! –Always use the same groupings on all similar nodes Use the same colors if you can! –Reduce the burden on your audience –Multiply the value of your workload creation efforts –Use consistent precedence order to decide where to put a process that meets the criteria to be in several different workloads

29 Defining your mission Whatever you decide: – Track your own tools usage! There are multiple great freeware web usage reports that will tell you if folks are using or snoozing your data (We use webalizer: http://www.mrunix.net/webalizer/ ) Unviewed information is wasted time and effort – Use speed tests If there are multiple ways to do something (CSV files versus a performance database) code for both and have a race – Will your web users want the slower one? – The capacity planning reporting challenge story – Don’t settle, always seek new audiences and better reports Add new functions – Sadly, there is no shortage of bad vendor reporting on expensive infrastructure » Anyone here ever seen a great graphical historical display in business useful terms of SAN information or LAN usage by segment? – Your firm may have business specific information that might be really useful to decision makers if overlaid on or graphically reported alongside IT resource consumption
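The “code for both and have a race” advice can be taken literally: time the competing implementations on the same query and let the numbers decide. The two functions below are invented stand-ins (a linear scan standing in for rereading flat CSV extracts, a keyed lookup standing in for an indexed performance database), not real report backends:

```python
import timeit

data = list(range(100_000))
lookup = set(data)

def scan_list(target):
    return target in data    # linear scan, like rereading a flat file

def probe_set(target):
    return target in lookup  # hashed lookup, like a keyed database

# Race them on the same worst-case query and report wall-clock times.
for name, fn in (("linear scan", scan_list), ("indexed probe", probe_set)):
    elapsed = timeit.timeit(lambda: fn(99_999), number=100)
    print(f"{name}: {elapsed:.4f}s")
```

If the slower path is the one your web reports currently use, the race itself becomes the argument for switching.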

30 Our site’s web usage:

31 Our site’s web usage:

32 Our site’s web usage:

33 Our shared long term mission When you innovate and come up with new report ideas, share them at CMG! – Or at least send me examples in email and I’ll do it for you! – Share code in this or other user groups where it makes sense We should all work together in user groups, public forums, on the web, etc., to push all of our vendor partners to address these needs – The more they do for us, the less we carry the “home brew code” weight We should also all work to reduce the volume, impact and long term storage requirements of our solutions – I have yet to encounter a vendor that isn’t carrying around a lot of extra metrics in the bowels of their systems that will never be used We should have a CMG sponsored “help wanted” section for capacity planning specialist positions in the various countries

34 Picking the right vendor partners I believe that all capacity planning efforts should have tools that include: – Efficient resource usage and process consumption collectors – Network queuing theory based “what if…?” modeling based on workloads, not total consumption The bulge trap – Efficient, speedy web-based historical consumption data display Ideally your chosen vendor would – support most or all of your differing operating systems and devices – have ample training and consultants available, there is nothing better than a co-pilot when you are starting out – participate in and support CMG!

35 Picking the right vendor partners In the not too distant future, the best vendors should be: – Offering efficient “low impact” “cloud deployable wrappers” that run with your applications in a cloud – “We don’t have to worry, it’s in a cloud” is nonsensical Are you going to generate fake transactions and time them? When you get a long time back, or significant variance, are you going to have enough information to know why? I think that in time people will realize this need, and want it in their contracts Don’t you want to know the overhead of encryption and decryption in the process, and its response time effects? Stupidity is infinitely scalable, as long as you aren’t getting the bill – If nobody cares to make their code efficient, because they just send it to the cloud, how good is that code going to be? – Will it be running on the same machine as you tested? – Will it impact your users?

36 Picking the right vendor partners In the not too distant future, the best vendors should be: – Offering efficient “low impact” “cloud deployable wrappers” that run with your applications in a cloud (continued) The internet will continue to grow exponentially – So those clouds could get mighty full, mighty quick – How do you want to find out that it is too full? » Do you want your customers telling you? » Or do you want your own reports based on scientifically accurately collected consumption data? Social media sites are becoming valuable business tools – Businesses “tweet” and have Facebook pages! – Do you think that a free application originally designed to let 14-year-olds share photos is designed for high performance business needs? – How will you be sure?

37 Picking the right vendor partners In the not too distant future, the best vendors should be: – Thinking about SaaS user tools as well. Sure, SaaS vendors maintain the code and pay if it is a hog, but are they: running maintenance activities like backups and virus scans that slow things down right during prime time for Australia in your globally distributed firm? suffering from office hours peaks of consumption that impact your users’ response times? taking outages to horizontally scale that might impact your firm’s ability to ship product? – Without your own data, you will never know What responsibility do you have to your firm’s users? Why is this network queuing theory based modeling stuff so important? – Let’s understand what it means and then see an example…

38 Modeling Norms Most modeling packages assume Poisson or Chi-squared distributions of the arrival rate of transactions Some simpler, yet often quite elegant systems like Dr. Neil Gunther’s PDQ modeling just use a quadratic and forget the tails – They aren’t all that different despite what we modeling junkies might say! Don’t focus on the distribution selected, focus on whether they use queuing theory models and give you relative response times

39 Why network queuing theory based modeling? These concepts are also often illustrated with simple queue graphics like the one at the right An important implied assumption is that all requests are served, none are lost Response time is the sum of Queuing Time plus Service Time

40 Why network queuing theory based modeling? Methods do differ, but queues for interactive workloads are usually computed based on load percentage using a formula like: Q = U/(1-U) – where: – Q = Expected Queue – U = Utilization Response time is the sum of Queuing Time plus Service Time
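The slide's formula drops straight into code. This is my own sketch, not any vendor's implementation; the service time of 1.0 is an arbitrary basis, since for relative response times only the ratio matters:

```python
def expected_queue(utilization):
    """Q = U / (1 - U) for an open single-server queue; U in [0, 1)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)

def relative_response(utilization, service_time=1.0):
    """Response = queuing time + service time = service * (1 + Q)."""
    return service_time * (1.0 + expected_queue(utilization))

# The hockey stick: response time explodes as utilization nears 100%.
for u in (0.50, 0.80, 0.90, 0.95):
    print(f"U={u:.2f}  Q={expected_queue(u):5.2f}  R={relative_response(u):6.2f}")
```

Note how going from 80% to 95% busy multiplies the queue several times over; this nonlinearity is exactly what linear trending graphics hide.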

41 Why network queuing theory based modeling? So, as a workload competes for resources throughout a day, its response time is likely to vary Computed relative response times show us both the variations and the reason The Y Axis metric does not matter! – Just pick a basis, the ratio is the important part!

42 Why network queuing theory based modeling? A workload’s typical transaction is likely to rely on several resources Imagine a workload running on a machine with four CPUs, six disks and some network IO on one card Note that when technologies differ, service times can differ

43 Why network queuing theory based modeling? Now do you see where a graph like this can come from? If the warehouse folks are complaining about response times at 3:00 AM, should you upgrade the CPU? – When do you suspect that the backups are running? – Would a CPU upgrade help daytime response? But it also might make demand for I/Os faster and really slow down the warehouse at 3:00 AM too, so you had better address the I/O issue!
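A hedged sketch of the arithmetic behind that graph, under the same Q = U/(1-U) assumption: a transaction's total response is the sum of each resource's queue-inflated service time. The service times and utilizations below are invented to mimic the 3:00 AM story, with the CPU idle overnight while backups saturate the disks:

```python
def resource_response(service_time, utilization):
    # One resource's contribution: service time inflated by its queue.
    return service_time * (1.0 + utilization / (1.0 - utilization))

def transaction_response(resources):
    # Sum the (service_time, utilization) contributions across resources.
    return sum(resource_response(s, u) for s, u in resources)

daytime = [(0.010, 0.70), (0.030, 0.30)]  # (seconds, util) for CPU, disk
night   = [(0.010, 0.05), (0.030, 0.90)]  # backups hammer the disks

print(f"daytime: {transaction_response(daytime):.3f}s")
print(f"3:00 AM: {transaction_response(night):.3f}s")
```

Rerun the night case with a faster CPU (halve the 0.010) and the 3:00 AM total barely moves: the disk term dominates, which is why the upgrade to buy is an I/O one.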

44 Picking the right vendor partners In my experience, network queuing theory based tools move folks quickest to actionable answers – Once you understand relative response times, most issues are quick and easy to diagnose If a new vendor harps on linear “trending” graphics and projections, don’t expect them to be around for very long If a monitoring or other product vendor keeps adding “and you can use this for capacity studies” it is probably because the salesperson heard that you were looking for capacity planning tools! – Stick with network queuing theory based packages and you won’t go wrong! – Dozens of “And we can do capacity planning too!” stories

45 Ron Goes Off on VMware VMware is not a capacity solution VMware is a “symptom” of no capacity management

46 Ron Goes Off on VMware VMware is the single biggest indictment of the poor way most firms have done capacity planning in the Windows space – The lack of workload characterized views of consumption is why folks bought a server for each functional part – “We don’t want to stack multiple applications on one server!” So we VMware them! …which is just stacking with the added joy of paying for not only extra copies of the OS and tools, but $900+ for VMware as well And in the end, the code is running on the same box! – VMware’s so-called capacity planning tool is proof that they never attended a CMG! It is as near useless as any marketed tool that I have ever seen, but at least it is expensive…

47 Going “Extra-Product” Once you get used to your vendor’s product, if you are like me, you’ll start wishing for more functions tailored to your specific needs – In the old days, a grey haired expert would whip out a spreadsheet or other mathematical package and start creating some “home-brew” solution – I use perl and GD::Graph, PHP, JavaScript and anything else that I can think of, you can use what makes sense to you – Check out old CMG papers, they are laced with great ideas In other words, don’t feel limited to what your vendor does “out of the box” – Find buddies that use the same vendor and start sharing ideas and code – Things that you will see later in this presentation are shared among dozens of firms and they wouldn’t live without them – You don’t have to agree 100%, take what fits best and leave the rest

48 Going “Extra-Product” There are a whole group of us running many of the extensions that we’ve developed over time – Some of our extensions have made it into some products, but nowhere near enough of them! We probably get 50% of our firm’s benefit from the tools from our own extensions We regularly meet with the vendors and implore them to add the features that we like Having more singing from the same hymnal might just get through to them! Come join us! The best ideas might be in your head! Share!

49 Avoiding the “IT Mindset Traps” Capacity planners come in several flavors, because people from several different camps end up in this role – Scientists - Scientifically minded users of network queuing theory tools and simulation models that want to subdivide consumption into different behavioral groups and analyze them – Application specialists – application subject matter experts who “know the application,” are trusted by management, and care deeply about its success. They often come from the application side of the firm – Old Timers – They know everybody, have worked on everything and have connections and favors to call in to get things done. They often come from the operations side of the firm Each of these can be successful, but some are more prone to certain behaviors that can limit your capacity planning effectiveness and raise the costs of doing it Let’s look at the typical pros, cons and peccadilloes of each

50 The Scientists The Scientist capacity planner – loves to get data from everywhere and everything that they can – Willingly tackles huge tasks as long as there is a possible learning benefit – Will constantly tweak the automation to be able to get yet more data – Will go “extra product” and build tools for specific functions without fear, because they are used to building things from scratch and being successful Pros – No fear, they view no problem as intractable and are sure that if they can get real data into a scientifically designed framework, business useful learning will result – No agenda, all applications and systems are equally important to them, they will not lobby for one application to get resources instead of another, preferring instead a rising tide that raises all boats – Willing to try new methods and tools in search of solutions

51 The Scientists Cons – Scientists can be viewed as “remote” or “doesn’t know the business” by some in management, particularly application development – They may want some really expensive and/or tricky software, and on every machine, and these tools produce copious amounts of data that needs to be processed, graphed and stored – The volume of tools and special case software that they accumulate over time can be hard to support by others – Good ones are relatively rare, ones that can teach/mentor others are extremely scarce Mindset Traps – Scientists can go off on tangents, they really need a manager who can Help them get the most productive subset of tools working first translate their outputs into terms understandable to the business help keep them focused on what the business deems most valuable – Their pursuit of the “one scientifically superior way” left unchecked can lead to ongoing high costs

52 The Application Specialist The application specialist in the capacity planning role – Will often drop everything else to don their fire-fighter jacket and “save the firm” by working on emergencies – Will rely strictly on simple O/S tools and minimal data, often just totals because “that was all we needed when we started this thing, and look how far we’ve come” – Seldom tracks historical consumption data over time, or if they do, seldom presents it in a format that is easily understood by others Pros – They really do know the application, the folks who are powerful, and they have a lot of chips at the bargaining table when it comes time to get things negotiated – Their application specific knowledge can really come in handy when strange behaviors are noticed – Their continuing drive to make an application succeed and the lengths that they go to are often very favorably viewed by non-technical management

53 The Application Specialist Cons – EGO! Our conference rooms are named after comic book super heroes!

54 The Application Specialist Cons – Their self confidence can lead to large egos, they dismiss opposing views of how to address issues other than “the way that we’ve always done it” – Their extreme willingness to join in every fire-fight eats a lot of time and delays the deployment of tools and systems (like long term historical consumption tracking) that would help others understand and make better decisions – Tend to enjoy being the “go to guy” and thus seldom share the basis for their decisions This is sometimes covering up the fact that the basis for their decisions is gut feel, not data – They will commit in public forums where management is present to supporting the scientists to get some application specific technical need, and then fail to do so in a timely manner, if ever – They really know their silo, but they are very uncomfortable when asked to go outside of it

55 The Application Specialist Mindset traps – These folks’ career successes have been built on “thinking on their feet” as issues occur, so they seldom take the time to build data collection and reporting structures that lead to well informed decisions “When you need to know something, just ask me.” They may even resist or delay deployment of capacity planning systems, calling them “costly, unnecessary and not our application’s highest priority” They will resist changes to their sacred “architectures” from the 1980s – They can be initially really interested in capacity planning information about their application, and use it to point out the positive impacts of their past decisions and successes …but don’t expect them to mention immense over capacity – Often their interest stops immediately at the edge of their application When there are issues larger than one application, they view it as their duty to “defend their applications turf” and will move to segregate the environments into “us” and “them” groupings that need not share any infrastructure – They think that “The vendor will tell us when to…”

56 The Old Timers The old timers in the capacity planning role – Are a calming presence in meetings – Have stories of a time when we faced something similar – Have the best jokes – Know and address the VPs as “Phil” and “Sandy” – Have capacity tracking systems that tend to the super-inclusive, when asked, they alone can root out data about darn near anything, but they have to be asked Pros – They have the trust and respect of nearly everyone, because everyone has worked successfully with them over time – When they need tools or space to get or keep their data, they just go ask “Phil” or “Sandy” – Are among the few to have worked on many of the systems, not just one or two, and so they understand deeply the inter-reliance of many of the systems and how an issue in one can manifest elsewhere

57 The Old Timers Cons – Old timers are often tired of learning. They seldom want to embrace radical new methods when they are retiring in a few years – Old timers are survivalists, or they wouldn’t be old timers. They have a great political sense of when “not to rock the boat” and “who not to mess with” that can prevent or delay the introduction of useful new information Mindset Traps – They approach capacity planning like they approached most of the IT issues that they’ve faced in their long careers “Let’s start with a database with thousands of metrics! You never know what will come in handy”, so they resist deleting them while disk can still be purchased – Their reporting systems evolved over a long time, hence can be hopeless for someone new to decipher or change They can be based on large tables of numbers that only a select few can successfully use

58 Avoiding the “IT Mindset Traps” So what do we do? – How do we get the “pros” of each type and minimize the downsides? You must build a “matrix-ed” team containing some of each type – The team concept must have support from the highest levels – It must have priority from each of their respective management – They must be charged with: enabling the scientists to integrate new tools into the environment getting graphical reporting working that management can understand maintaining just enough information to provide long term historical context for decisions, but no more – Sometimes, you’ll have to bring in outside expertise, and the only way that will succeed is to have “friends in high places” It is critical to put this under an excellent manager – Each of the three types has useful and less useful behavior patterns – You need a manager that all can respect, who doesn’t try to be the expert, rather one who coaches each to be part of a well functioning whole

59 The politics of capacity planning in organizations Organizational politics are often the key factor in your capacity planning group's eventual success or failure Long experience has taught many of us the importance of – Friends in high places Try to get the capacity planning effort instigated by a knowledgeable VP or at least a director Often a major initial stumbling block is even getting permission to install collectors on production systems, much less the physics of actually doing it, and there is nothing better than having their boss's boss saying, "Yes, you must do this, it is a priority" – Determining and rating the skills and power balances in your organization, usually by O/S – Managerial chaos can be a severe issue – Diagnosing and surmounting the barriers to success Describing the type Their common barriers and techniques to surmount them

60 Identifying and surmounting barriers Barrier: The "not invented here" über-geek – Identification clues They are often early members of a firm They usually position themselves as masters of several related technologies, but can be rather sparse on details The younger the firm, the more often you find them; internet firms in high-growth areas are full of them They are convinced that "If we didn't need it then, we don't need it now!" – Their typical barrier methods "This is not an organizational priority" "This collector code is not proven on our sensitive production systems" – Techniques to surmount their barriers Friends in high places compel them Share credit for successes with them to their management Involve them in the model setup, ideally model alongside them, letting them suggest probable growth steps

61 Identifying and surmounting barriers Barrier: "The high priests of the old tool set" – Identification clues They like "twitch monitoring" and often have built an extensive installation of displays with impressive-sounding names like "the war room" or "mission control" – Whenever you enter it during non-emergencies, notice how few people are actually using the displays They prefer current "totals" like total CPU because they've never had consumption by business-identifiable sub-groupings They react to brief workload peaks by demanding upgrades – Their typical barrier methods Stalling: they ask streams of technical questions, and each answer that you give prompts another Requests to integrate: new capacity tools must feed information to their "war room" – Techniques to surmount their barriers Ask them to put long-term, workload-characterized consumption on their displays Have them tasked to help address pathologies automatically detected (that their monitors did not seem to surface)

62 Identifying and surmounting barriers Barrier: The application architects – Identification clues They rigorously defend their current multi-node spread as vital for – The organization – Uptime – Scalability 90% of their machines will be empty or nearly so The architecture was set in stone a decade ago, and was designed to solve the issues of that time: minuscule PCs – Their typical barrier methods Lecturing you on how their way is the "only way" – "Don't you realize that these are business-critical systems?" is used to justify all manner of excessive purchasing – They will lecture you on availability and scalability at the drop of a hat – Techniques to surmount their barriers Show them the serious speedups possible by collapsing application layers onto fewer machines and removing network time from chatty applications Ask them for estimates of just how much more their application will need to scale, given that it is 7 years old and already in use firm-wide

63 Identifying and surmounting barriers Barrier: The entrenched fire-fighting squad – Identification clues They offer to work with you, but not today, as there is an emergency They position themselves as "the experts" in an application They are hyper-sensitive to any changes in the environment; they view them as "dangerous" "Our conference rooms are named after comic book super heroes!" revisited: when you fly in to interview, everyone is fighting a fire – Their typical barrier methods They position themselves as "must have" team members and then are never available Beware their commitments to make data or specifics available; they will often be "too busy" later to do it in a timely manner, if at all – Techniques to surmount their barriers Agree to work with them as valued members of the team, then ignore them in your plans, as they will always be too busy to help anyway Never trust them to come through with a key item; always plan for another way to get what they promise that does not involve them Over time, train them that many of the "time consuming fires" that they fight are simple pile-ups of multiple pathologies that won't bite if addressed in a timely manner

64 Identifying and surmounting barriers Barrier: The overwhelmed, outsource-able and scared – Identification clues They have single functions, often somewhat amorphous, and difficult to tag a dollar value on They are not in politically savvy management structures – Their typical barrier methods They stall, seemingly frightened to take on any task without exact instructions from their management They view tasks related to capacity planning as "not their priority" They view all new functions as threats They seem to ignore all information not generated by their own function – Techniques to surmount their barriers These are politically weak people in politically weak areas; stay away from them so as not to have to rely on them If forced to work with them, work with their manager to emphasize that capacity planning is an important priority that they cannot stall Help the good ones get out of that group

65 Identifying and surmounting barriers Barrier: "This is a database server only" DBAs – Identification clues They claim that "In order to save the firm database license money, we are concentrating the databases from multiple applications on just a few servers" and "nothing else can run on these servers" – Their typical barrier methods Outright refusal to try collapsing micro-applications onto database servers Claiming that the remaining capacity on the 1/3-used database server is "for growth", while being very hard to pin down for specifics, usually because there aren't any – Techniques to surmount their barriers Try to get them to allow/install only a certain small percentage of application code on their machines due to "a network emergency"; that seems tiny and reasonable – Use a number like 10% to 20%. They don't need to know that that was all of the applications that you ever dreamed of doing. Show them how your automated process pathology code works, to ease their fears about rogue applications eating their machines alive and harming other applications Praise them to their boss as "innovative and balanced problem solvers"

66 Identifying and surmounting barriers Barrier: Lying, manipulative project leaders – Identification clues You are originally asked to model 400 users from a sample of 30. Later they say, "Oh no! We meant 1000 users!" – Their typical barrier methods Some project leaders view themselves as risk minimizers. Sadly, they often feel that 60% excess hardware is a properly sized "cushion", so they inflate their usage estimate by 60% to make the modelers justify excess hardware for them They took 3 extra months to get all these whacky features in, way past their deadline, but now time is an emergency and they need their results immediately, or they just need to buy hardware right away because they have no time to test properly – Techniques to surmount their barriers Speed: you can model this stuff far faster than they can get a load test to work without half of those whacky features blowing up Ask more people how many users really are going to be there

67 Identifying and surmounting barriers Barrier: Enthusiastic but "We went to Load Runner class and we absolutely have to run huge saturation load tests" drones – Identification clues They don't understand that a mesa test plus modeling is all that is needed. Even if you can get a decent mesa test out of them, they still want to do a saturated load test anyway They REALLY BELIEVE two seemingly counter-intuitive things: 1. Your operations group must run out and buy exactly the machine and memory that they dreamed up from dubious research for their tests 2. They do not have to run against realistic data volumes with similar indexes and size as intended production. They will NEVER create a statistically relevant data source. They will frankly state: "It is impossible!"

68 Identifying and surmounting barriers Barrier: Enthusiastic but "We went to Load Runner class and we absolutely have to run huge saturation load tests" drones – Their typical barrier methods No matter how many times you say not to, they will always strive to ramp up users at the start and ramp down afterward. Get ready to lose your first and last measurement periods If you can get a realistic transaction mix from them, they will still strive to run it too fast – The 30-second contract review, 8 hours a day story – Techniques to surmount their barriers Always question their user think times, then adjust your model to deal with the silliness that you uncover. Maybe 20% of the samples that I get have realistic transaction arrival rates, so beware Be consistent; over a series of tests you will wear them down, or get them fired

69 A mail message to a new fleet of "Load Runner" enthused contractor drones The purposes of load tests can be manifold: to test functionality, capacity, and "feel". Modeling based on a sample does the same things and more, and usually much faster and cheaper. If you choose to run a load test, be sure to run a "realistic transaction mix" with the expected blend of all commands, not just one kind. If you are limited to simulating a subset of intended loads by physics (we don't recommend simulating above 20 users per load-running PC, for accuracy), we can then take that load and model much higher ones and any alternate hardware that you might dream of. We have these caveats to improve accuracy: 1. Perform the tests on real, not virtual, servers for measurement accuracy 2. Run a proper "mesa test" for sampling, which includes: A. Make sure that the CPP group has a collector on your intended test machine days before the test B. Start your test precisely on an hour boundary C. Do not, repeat, DO NOT "ramp up" or "ramp down" users. Just start and go; 20 users per Load Runner box will not overwhelm anything. Ramping is not required for models, indeed it is wrong to do it. D. Stop precisely on an hour boundary E. Send mail to us telling us I. how many users you simulated II. the precise timings III. how many more users we should add in the models IV. anything else pertinent

70 A mail message to a new fleet of "Load Runner" enthused contractor drones 3. The purpose of the test is to produce a flat-topped "mesa" of usage that depicts your users acting normally. A graph of CPU consumption should look like a rectangle with a flat, steady top, nowhere near saturated. We then take that sample of happy, unconstrained users and model what hardware is needed for more happy, unconstrained users. 4. Do a "practice run" several days before your real test to flush out issues, and tell us so we can see how well you followed the mesa instructions 5. DO NOT do any of the following, which will waste your time, ruin the data and cause rework: A. DO NOT "ramp up" or "ramp down" usage at the start or end of your tests. It just makes us throw out that data B. DO NOT try to "saturate the machine". The models will find that saturation load; don't waste your time. Concentrate on producing an unsaturated load of happy users getting great response times C. DO NOT try to simulate hundreds of users from one PC with one network card. It will fail, or worse, produce incorrect data leading to massive errors D. DO NOT create loads with unrealistically fast "think times". If the user is likely to do a transaction, then wait 5 minutes reading or processing it, then set the inter-transaction wait time to 5 minutes, not 30 seconds. Remember, your goal is to be realistic, not to have high, unrealistic loads. Mesa tests may seem odd at first, but in time you will learn to love mesa tests and their time and cost savings to projects. After a few of them, you'll never load test the old way ever again. Questions? Please ask, or invite us to your team meetings for a confab!
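The mesa criteria in the mail above (flat top, unsaturated, no ramps) are easy to check mechanically before accepting a test run. Here is a minimal Python sketch; it is not the speaker's actual tooling, and the function name and thresholds are illustrative assumptions:

```python
def is_good_mesa(cpu_samples, capacity_pct=100.0,
                 saturation_limit=0.80, flatness_limit=0.15):
    """Rough sanity check that a load test's CPU series looks like a
    'mesa': a flat, steady, unsaturated plateau with no ramp-up or
    ramp-down. Returns (ok, reason).

    cpu_samples: CPU%-used measurements taken during the test window.
    capacity_pct: total CPU available (100 per core, so 400 for 4 cores).
    """
    if len(cpu_samples) < 4:
        return False, "too few samples to judge"
    mean = sum(cpu_samples) / len(cpu_samples)
    if mean <= 0:
        return False, "no load measured"
    if max(cpu_samples) > saturation_limit * capacity_pct:
        return False, "saturated: models need unconstrained, happy users"
    # A ramp at either end (or any wobble) shows up as a wide spread
    # relative to the mean; a true mesa has a tight one.
    spread = (max(cpu_samples) - min(cpu_samples)) / mean
    if spread > flatness_limit:
        return False, "top is not flat: did you ramp users up or down?"
    return True, "looks like a mesa"
```

Running this against the collector's hourly samples right after a "practice run" gives the testers fast feedback on whether they ramped.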

71 The politics of capacity planning in organizations How to win friends and influence people in the operations group – Set up "being on the capacity planning team" as an aspirational goal, a promotion path, for the operations folks – Try to find an operations or O/S expert at the top of their game and get them assigned to the capacity planning effort These are often the best acolytes and really take well to capacity planning – As the operations staff start to use the capacity planning reporting and pathology detection systems Praise their efforts and successes to management Coach their failures privately – Get them (and their management) to realize that keeping process pathology counts down reduces emergencies and call-outs, and greatly contributes to system stability – Train them on the tools so they start to use them and build new skills If the only users of the capacity planning reports are on the capacity planning team, you are doing something wrong!

72 The politics of capacity planning in organizations How to win friends and influence people in the application development group – In addition to the barriers presented previously, you may also encounter The earnest improver, who takes the time to learn about new technologies and tries to integrate their benefits into the software development lifecycle The non-technical manager, who may never understand all of the math and formulas, but who will be far better at the political skills required for success External vendors whose future profits hinge on success – Try to become an asset to each of these groups: make sure that they see you as a willing partner in their success, work late on their models, help them succeed and get the resources that they need when they need them – Send mail when you work early, late or on the weekends (and CC your boss, of course); it shows that you are really trying to help

73 The politics of capacity planning in organizations How to win over and influence your boss – There are several types of bosses The experienced true believer The unbeliever The unconvinced cost counter – There are techniques to deal with each Your goal is to convert the last two into the first one! – Keeping all happy will involve deploying collectors, generating workload-characterized historical consumption web pages and "What if..?" models of future consumption The key is to survive long enough to – get proper network queuing theory model-based software purchased in sufficient quantity to make a difference – get some applications leadership on your side – keep the last two from canning you before you start to get meaningful results on a large scale

74 The experienced true believer Usually you have worked with or for this boss before, so they already know – How expensive the tools can be, so they are not shocked – What a reasonable time for results is – How to help enable your success – What battles to fight, and what battles to avoid My last 4 gigs have been for someone whom I had either consulted for or worked for – Delivering results delivers career options for you! Characteristics of the experienced believer – Patience – Helps get the software quickly – Helps break through organizational politics to get your collectors quickly deployed – Projects confidence in meetings with other management

75 The unbeliever These folks (often with a development background) are distrustful of fancy methods like network queuing theory – This is often based on insecurity: they don't understand complex tools and thus distrust them – They have made their careers by betting on simple solutions and extrapolating linearly – They are often in their position due to management turmoil In several gigs I've had non-believers in the management structure above me Characteristics of the non-believer – Initial open contempt for scientific capacity planning methods – Demands for results before they will help you get in place the collectors needed to answer with a historical basis – Often will throw CPU and memory at disk I/O slowness – Can be turned, but wow, it sure takes patience!

76 The unconvinced cost counter These can be great bosses in time, because, like scientists, they demand proof before supporting you, but once they have it, they will be true believers They either have no experience with sophisticated capacity planning, or have had running the group forced on them by higher-ups who have Characteristics of the unconvinced cost counter – Repeated references early in the process to how much your group and your software cost, and lots of implying that savings results had better surpass that soon – Caution early on, before they will spend the time with other departments getting them to go along with you – Thrive on informational updates, so show steady progress You don't have to be perfect, just constantly getting better – You'll know when they switch to true believers: They start buying you more licenses! They stop complaining about costs The "We need to show results!" to "Do you need more licenses?" conversion

77 Reporting There are a lot of tragically bad business graphics, and especially capacity planning reports, out there. Issues include: – Graphics that distort the viewer's perceptions Quasi-3D Black outlines around bar charts Non-calendar displays of long spans of time No color consistency – Foolish consistency may be the hobgoblin of little minds, but it is also the key to getting management to use your site for decision making (don't pay attention to "little minds" and "management" appearing in the same sentence…) Lots of chrome, little content – Tufte: "Question every pixel. Basically, any pixel that isn't conveying new data, get rid of it!"

78 Reporting Other issues that limit effectiveness – Multi-page reports that nobody ever reads If your answer is so complex that it requires that much evidence, start over on a new one They paid $10,000! It has to hit the desk with a thud! The "same thud" lives on! – Relying on the untrained user to wade in and find the answers themselves Some you can train; most you can't If any correlation of graphics requiring memory is needed, forget it – Ron's position: non-web presentations in general are useless relics of a bygone age. Most of your readers' data comes in hyperlinked form, so get with it or be left behind – Web reports of all nodes in the firm Most users really appreciate ways to see only their span of control

79 Reporting There are also some "must haves" – Automated context that graphically highlights when something is out of the ordinary (managers love this stuff) – Automated business and hardware context, ideally driven by your CMDB, that includes Hardware and software specifics Business purpose Business owner Primary and backup technical contacts Ideally a text description of its business function Other helpful links

80 The Zen of Great Reporting Seek minimalism in all parts of it Reduce graphic clutter Reduce user-perceived complexity – Workload color consistency is a simple "must-have" Reduce user choices and actions – If the user needs to know 4 things to make a decision, they had better be close together on the same web page Add extra information that lets the user more fully understand odd behaviors and situations – Sorting it by date is nice too Don't restrict yourself to measured quantities – Workload response time detail is one of the most powerful graphics that you can use

81 Reporting Examples

82 Reporting Examples (UNIX)

83 Reporting Examples (Windows)

84 Reporting Examples (Windows) Tangent, Multiple Memory Leaks Here is an example of a rather severe repeating set of memory leaks – See the saw-toothing memory? – See the climbing Commit Bytes in a different sequence?

85 Reporting Examples (Windows) Tangent, Multiple Memory Leaks When you dig deeper, you can see memory totals by process owner – People often want to "blame someone" – Alas, sometimes the "someone" is harder to pin down by just username

86 Reporting Examples (Windows) Tangent, Multiple Memory Leaks When you dig deeper, you can see the individual process names leaking – In time you'll find the best way to keep them unique; we use process start date/time and PID – You can show these to the Fake_Name vendor, and then it is hard for Fake_Name to deny a memory leak – I believe that java is Finnish for "memory leak"
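The "unique by process start date/time and PID" trick above also makes automated leak spotting straightforward. A hedged Python sketch, assuming you already collect per-process memory readings in time order (the function name and thresholds are mine, not the actual detection code):

```python
def find_leak_suspects(samples, min_points=6, growth_ratio=0.9):
    """Flag processes whose memory grows almost monotonically over time.

    samples: dict keyed by (process_name, start_time, pid) -- the unique
    process identity trick from the slide -- mapping to a list of memory
    readings (bytes) in time order. A process is a leak suspect when
    nearly every interval-to-interval change is an increase.
    """
    suspects = []
    for key, mem in samples.items():
        if len(mem) < min_points:
            continue  # too short a history to judge
        ups = sum(1 for a, b in zip(mem, mem[1:]) if b > a)
        if ups / (len(mem) - 1) >= growth_ratio and mem[-1] > mem[0]:
            suspects.append(key)
    return suspects
```

A normally behaving process wobbles up and down, so it fails the growth-ratio test; a leaker climbs nearly every interval until it crashes or is restarted.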

87 Reporting Examples (Windows) Tangent, Multiple Memory Leaks Well, it is hard to deny a leak, but some Fake_Name vendor might want raw data, so… – Since you already have it, put out some csv files to be easily mailed to the vendor, eliminating one of their stall tactics

88 Reporting Examples (Windows) Tangent, Multiple Memory Leaks The right way to convey the message – We detected the issue and sent mail to the application owner, stating The exact processes with the issue That they can expect to keep crashing every day or so until they get the vendor to fix it Offers to help with data or technical calls We get no response at all Three weeks later, we get a request to add memory to the machine… – The owner "can't get the vendor to respond quickly" and wants to reduce outage counts in the meantime Don't get mad… – Stay positive and helpful in tone; they are just trying to help their users have fewer outages… – but continue to urge them to turn up the heat on their vendors, and do it in a nice way…

89 Reporting Examples

90 Reporting Examples

91 New! Reporting Examples Windows

92 New! Reporting Examples UNIX

93 Reporting Examples

94 Classic capacity planning question descriptions and proper answering techniques Capacity issues are usually an emergency to someone Roughly 93% of the requests for upgrades are nonsensical if you have any historical workload-based resource consumption information – So you have to say no in a way that makes the evidence clear What to expect when you say no: – The 5 stages of grief (also called the Kübler-Ross model) http://en.wikipedia.org/wiki/K%C3%BCbler-Ross_model Denial Anger Bargaining Depression Acceptance Always give them a way to succeed along with your "no"; remember that they may still have a real problem! – "No, you don't need CPU or memory, but you are doing 5500 I/Os a second to your slow, locally attached C drive Can you turn down logging? Can you send those I/Os to fast SAN or RAM drives? Can you get help from your DBA pals?" – "No, you don't need more CPUs, you need to fix those looping processes."

95 Classic capacity planning question descriptions and proper answering techniques Here is the pattern for this next section: – Real quotes from the users (disguised, slightly) – The evidence – The answer – What happened I want some interaction on these; if you did it better, speak up! Share! That is what CMG is for! The graphs used in the examples are all homebrew Perl and GD::Graph, and they are used at several firms – Yes, I will share the code if you want it, but sheesh, you can do better! You are going to want some form of screen graphics capture tool – I use freeware ZScreen, downloadable from many sources; it is fabulous

96 Classic capacity planning question descriptions and proper answering techniques User quote – "We are keeping these machines rather heavily loaded." but they won't tell you why The evidence

97 Classic capacity planning question descriptions and proper answering techniques The answer – It turns out that this application was on three nodes, two heavily used and one lightly used They wanted a review of each – Is ustca027 too empty? – Is ustwa007 too full? – Is ustca031 too full? Let's use Relative Response Time by hour to answer them
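Relative response time is the classic "stretch factor" from queueing theory: time spent at the resources (service plus wait) divided by service time alone, so 1.0 means a totally uncontended machine. Whether the speaker's tool computes it exactly this way is an assumption; this Python sketch just shows the idea:

```python
def relative_response_time(service_time, wait_time):
    """Stretch factor: how many times longer work takes than it would on
    an empty machine. 1.0 means no contention at all."""
    if service_time <= 0:
        raise ValueError("service_time must be positive")
    return (service_time + wait_time) / service_time

def hourly_rrt(samples):
    """samples: list of (hour, cpu_service, cpu_wait, io_service, io_wait)
    in consistent time units; returns [(hour, rrt), ...] for graphing."""
    out = []
    for hour, cs, cw, ios, iow in samples:
        out.append((hour, relative_response_time(cs + ios, cw + iow)))
    return out
```

Graphing this per workload, per hour, is what lets the business see "your work takes 3x longer than it should at 10am" instead of an opaque CPU total.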

98 Is ustwa007 too full?

99 Is ustca031 too full?

100 Classic capacity planning question descriptions and proper answering techniques What happened – The users are initially shocked to see that the capacity planners, whom they view as machine stealers for VMs, are recommending that they get more hardware! – Once they started to understand relative response time graphs, they became quite sophisticated at moving workloads around – You'll know that you've converted them when they e-mail you asking if their IO_Wait could be solved by splitting files over more drives or better RAID choices The morals of the story – Any vendor can show totals – Favor vendors that show workload-characterized historical views of consumption – Favor vendors that can show you workload relative response times, so that your answers make sense to the business

101 Classic capacity planning question descriptions and proper answering techniques We started getting warnings from our automated checks: 10/03/23 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used up to 392.920% of an available 400% from 2010/03/23 at 0200 until 2300. 10/03/26 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used up to 394.572% of an available 400% from 2010/03/26 at 0000 until 2300. 10/03/27 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used up to 396.000% of an available 400% from 2010/03/27 at 0000 until 2300. 10/03/28 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used up to 392.920% of an available 400% from 2010/03/23 at 0300 until 2300. The evidence (here's what the sparkline looked like):
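Warnings like these are simple to generate once you have hourly CPU totals per node. A Python sketch in the spirit of the messages above (the 95% threshold, argument names, and exact wording are assumptions, not the actual checker):

```python
def saturation_warning(node, os_name, date, hourly_cpu_pct, ncpus,
                       threshold=0.95):
    """Emit a warning line in the style of the automated checks when a
    node's CPU use runs close to its total capacity.

    hourly_cpu_pct: dict mapping hour (0-23) to total CPU% used, where
    capacity is 100% per core (so a 4-way box tops out at 400%).
    Returns the warning string, or None if the node never got close.
    """
    capacity = ncpus * 100.0
    hot = sorted(h for h, pct in hourly_cpu_pct.items()
                 if pct >= threshold * capacity)
    if not hot:
        return None
    peak = max(hourly_cpu_pct[h] for h in hot)
    return ("CPU_SATURATION_WARNING: %s node %s used up to %.3f%% of an "
            "available %d%% from %s at %02d00 until %02d00"
            % (os_name, node, peak, capacity, date, hot[0], hot[-1]))
```

One such line per node per day, mailed automatically, is how problems like this one surface without anyone scanning thousands of graphs.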

102 Classic capacity planning question descriptions and proper answering techniques More evidence:

103 Classic capacity planning question descriptions and proper answering techniques My initial suspicions were "code improvement opportunities", so I contacted my DBA pals:

104 Classic capacity planning question descriptions and proper answering techniques Those CPU graphs with response time increases due to CPU_Wait when they hit the "knee in the curve":

105 Classic capacity planning question descriptions and proper answering techniques The answer from my DBA pals:

106 Classic capacity planning question descriptions and proper answering techniques What happened (the changes went in on Mar 29th):

107 Classic capacity planning question descriptions and proper answering techniques What about the charts, Ron?

108 Classic capacity planning question descriptions and proper answering techniques Things to learn from this example: – Not all code "innovations" work as efficiently as desired SQL developed in far-flung places for even farther-flung places is especially suspect "When the answer is correct, the code is done", well, maybe not… – Not all innovations will go through a rigorous capacity planning review You need either automated warnings, or to take the time to scan thousands of graphs often, to detect and correct these You need fast graphical evidence to get fast reactions – You need to go out of your way to be nice to DBAs; they will save your firm millions if you let them and if you only ring them up when there is real evidence of mayhem Always ask their boss to praise their efforts; those memos come in handy at review time

109 Classic capacity planning question descriptions and proper answering techniques Many of you will be deploying virtual terminal environments to hundreds of users – What if something goes a little wrong? The evidence:

110 Classic capacity planning question descriptions and proper answering techniques The answer: – We started ticketing suspicious CPU-consuming VMware slices on Feb 3rd – Most of it was Bezier curve screen savers! We banned them What happened: – We got back more than half of our VMware farm!

111 Classic capacity planning question descriptions and proper answering techniques User quote: I was wondering if we could get the memory increased on our Exchange 2007 CAS servers USTCAX100 and USTWAX100? Right now both servers are running 4.25GB and I would like to move them to 8GB. We are seeing performance issues with those servers and we are noticing that RAM usage is at 80%-90% or higher all of the time. Users are starting to notice this with Communicator. Due to the fact that it can't get a response quick enough from CAS, it is putting an exclamation point on the communicator alerting them to address book issues. If we are not able to increase the memory, the only other option would be to add more CAS servers in the environment to balance the load. We also are going to be increasing the load on these servers with the 2000 users we will be adding to the North America environment from the XYZ Co. acquisition and moving South American users to North America servers. Please let me know if this is feasible or not?

112 Classic capacity planning question descriptions and proper answering techniques The evidence: First, look to see if anything has gone wrong recently They might be reacting to a recent problem, but don't stop there

113 Classic capacity planning question descriptions and proper answering techniques The evidence: Looking deeper, we don't see a memory shortage (there is evidence of a slight leak): paging is very low, CommitBytes isn't anywhere near CommitLimit, but… CPU seems in short supply, and the CPU Wait component of relative response time is huge Their short-term performance issue is due to a CPU shortage, not memory!

114 Classic capacity planning question descriptions and proper answering techniques The answer: Along with the graphs from the previous page (and getting them to address the lsass loop), we added two virtual processors to this VMware slice Note that if you disagree with their solution, give them an alternative that fixes the present issues We may give them more memory later, when they've earned it

115 Classic capacity planning question descriptions and proper answering techniques What happened: The CPU Wait disappeared immediately. The user’s immediate issues were solved. The users now know that decisions will be based on evidence and that the results will be real, and they like it! Hardware in use for a growing application will grow, but slowly.

116 Classic capacity planning question descriptions and proper answering techniques Hey folks, there is still one more issue, with the imjpmig process, the Input Method Editor, which lets you use Japanese characters. It is looping regularly: 10/01/15 LOOP_PROBLEM: 3444 running imjpmig CPU looped from Jan 15 04:59:54 until Jan 15 23:54:53 and may still be looping. 10/01/16 LOOP_PROBLEM: 3444 running imjpmig CPU looped from Jan 16 00:07:48 until Jan 16 23:54:58 and may still be looping. 10/01/21 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 21 13:59:59 until Jan 21 23:54:58 and may still be looping. 10/01/22 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 22 00:01:27 until Jan 22 23:54:56 and may still be looping. 10/01/23 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 23 00:01:25 until Jan 23 23:54:53 and may still be looping. I changed the workload to just highlight the Input Method Editor by itself. I also found a bunch of patches available: http://search.microsoft.com/Results.aspx?q=imjpmig+downloads&mkt=en-US&FORM=QBME1&l=1&refradio=0&qsc0=0 Sometimes your own systems detect problems, so answer in a way that provides all required information
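The deck shows the LOOP_PROBLEM output but not the detection code itself, so here is one way such a detector could work: flag any process that stays pegged near 100% of a CPU for several collection intervals in a row. The sample format, thresholds, and interval count are all assumptions for illustration.

```python
# A sketch of loop detection over per-process CPU samples; not the actual
# pathology-detection code behind the LOOP_PROBLEM warnings above.

def find_cpu_loops(intervals, pct_threshold=95.0, min_intervals=4):
    """intervals: list of collection intervals, each a list of
    (pid, name, cpu_pct) tuples. Flags (pid, name) pairs that stay above
    pct_threshold for min_intervals consecutive intervals."""
    runs = {}       # (pid, name) -> current consecutive-interval streak
    looping = set()
    for interval in intervals:
        seen = set()
        for pid, name, cpu_pct in interval:
            key = (pid, name)
            seen.add(key)
            if cpu_pct >= pct_threshold:
                runs[key] = runs.get(key, 0) + 1
                if runs[key] >= min_intervals:
                    looping.add(key)
            else:
                runs[key] = 0
        # a process absent from this interval loses its streak
        for key in list(runs):
            if key not in seen:
                runs[key] = 0
    return looping

# Five intervals where imjpmig is pegged and svchost is idle:
intervals = [[(3444, "imjpmig", 99.0), (800, "svchost", 3.0)]] * 5
print(find_cpu_loops(intervals))  # {(3444, 'imjpmig')}
```

The consecutive-streak requirement is the important design choice: a process that spikes for one sample is normal, one that never comes down for hours is a pathology worth a ticket.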

117 Classic capacity planning question descriptions and proper answering techniques What happened? Eventually they got the fix migrated to production and everything worked fine from then on – Don’t get discouraged if folks don’t always do what you want immediately – Change controls, priority conflicts and other issues may stall the fix – With enough graphical evidence, eventually you will win!

118 Classic capacity planning question descriptions and proper answering techniques Ron logs in on a Saturday to work on slides for UKCMG (“Again! And what do you get paid to do this?” asks my dear wife) and sees the following: The evidence (from my pathology detection code’s morning mail) CPU saturation found: CPU_SATURATION_WARNING: Windows2000 node ustca337 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300. CPU_SATURATION_WARNING: Windows2003 node ustwasbx16 used up to 99.000% of an available 100% from 2010/03/12 at 1400 until 2300. CPU_SATURATION_WARNING: Windows2003 node uktcas06 used up to 99.000% of an available 100% from 2010/03/12 at 0300 until 2300. CPU_SATURATION_WARNING: Windows2003 node ustca227 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300. CPU_SATURATION_WARNING: Windows2003 node ustca724 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300. CPU_SATURATION_WARNING: Windows2003 node ustcas44 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300. CPU_SATURATION_WARNING: Windows2003 node ustcas54 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300. CPU_SATURATION_WARNING: Windows2003 node ustca088 used up to 99.000% of an available 100% from 2010/03/12 at 0800 until 2300. © Ron Kaminski 2010, All Rights Reserved118

119 Classic capacity planning question descriptions and proper answering techniques The evidence continued – Whenever a whole bunch of bad things happen synchronized over many machines, think global tool

120 Classic capacity planning question descriptions and proper answering techniques The evidence continued – Whenever a whole bunch of bad things happen synchronized over many machines, think global tool

121 Classic capacity planning question descriptions and proper answering techniques This is really bad news: a Business Sensitive / Critical production server doing its normal real sqlservr workload, with a tool process going on a CPU binge and causing excessive response times due to CPU_Wait

122 Classic capacity planning question descriptions and proper answering techniques The answer – A new piece of monitoring code was installed, BREAKING THE NO NEW CODE INSTALLS ON A FRIDAY rule! What happened – The code creator had deployed a new script, and he reviewed it after getting mail about all of the warnings: “This was a bug in a script update that I made; we should be seeing this behavior on most of the attached server list. ______ is pushing out an update to the script now; once this is done we’ll have to log into each of the affected servers, verify the looping process is running sqlcheck.vbs, and kill it.” – We were able to swiftly detect and fix the issue How would your site do this?

123 Classic capacity planning question descriptions and proper answering techniques What we saw: – We started getting Commit_Bytes approaching Commit_Limit warnings: 10/04/05 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 5 18:00:00 until Apr 5 23:59:00 and may still be. 10/04/06 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be. 10/04/07 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 7 00:00:00 until Apr 7 23:59:00 and may still be. 10/04/09 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 9 00:00:00 until Apr 9 23:59:00 and may still be. 10/04/10 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 10 00:00:00 until Apr 10 23:59:00 and may still be. 10/04/11 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 11 00:00:00 until Apr 11 23:59:00 and may still be. 10/04/12 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 12 00:00:00 until Apr 12 23:59:00 and may still be. 10/04/13 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 13 00:00:00 until Apr 13 23:59:00 and may still be. © Ron Kaminski 2010, All Rights Reserved123

124 Classic capacity planning question descriptions and proper answering techniques We investigated, seeing rising total memory:

125 Classic capacity planning question descriptions and proper answering techniques The evidence, memory by user:

126 Classic capacity planning question descriptions and proper answering techniques The evidence, memory by leaking process:

127 Classic capacity planning question descriptions and proper answering techniques The evidence, for the spreadsheet inclined:

128 Classic capacity planning question descriptions and proper answering techniques The answer: – Clearly this application has a memory leak in its jlaunch processes (run by the SAPServicePRG user) – You have two options: Get them to patch/fix the application, or Get them to reboot the machine periodically so that you don’t start paging hard and affect performance – So you notify the project leader: Hi all, If you look at memory usage over the last few months on these three servers, you’ll see steady and/or repeating ramps. http://ustwu002.kcc.com/node_reports/ustca146/memory.html http://ustwu002.kcc.com/node_reports/ustca147/memory.html http://ustwu002.kcc.com/node_reports/ustca148/memory.html This leads eventually to warnings like these: COMMIT_BYTES_PROBLEM: On ustca146, Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be. COMMIT_BYTES_PROBLEM: On ustca147, Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be. COMMIT_BYTES_PROBLEM: On ustca148, Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be. …and after that, when commit bytes hits commit limit, you can experience rather severe application slowdowns. In every case, the major rising memory consumer seems to be jlaunch processes run by SAPServicePRG. Most recently: PID 6160 on ustca146 started Mar 2 20:54:58 PID 3772 on ustca147 started Mar 2 20:54:50 PID 8032 on ustca148 started Mar 2 20:54:56 Could someone take a look at these to see if a fix is possible? If not, could we recycle these jlaunch processes, perhaps weekly, to keep memory usage down? Thanks for looking!
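The "steady ramp" evidence above can be turned into a days-until-limit estimate with a simple least-squares fit over daily commit-byte totals. This is a sketch with made-up numbers, not the deck's actual reporting code; a real leak is rarely this clean, so treat the projection as a conversation starter, not a deadline.

```python
# A sketch: fit a line to daily commit bytes and project the day it
# crosses the commit limit. All numbers below are hypothetical.

def days_until_limit(daily_commit_bytes, commit_limit):
    """Ordinary least squares of commit bytes vs. day index; returns the
    projected day the fitted line reaches commit_limit, or None if flat."""
    n = len(daily_commit_bytes)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_commit_bytes) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, daily_commit_bytes))
    slope = sxy / sxx                  # bytes leaked per day
    if slope <= 0:
        return None                    # no upward trend, nothing to forecast
    intercept = mean_y - slope * mean_x
    return (commit_limit - intercept) / slope

# Hypothetical server: 8 GB of commit at day 0, leaking 0.5 GB/day,
# 16 GB commit limit -> the line crosses the limit around day 16.
gb = 2**30
history = [8 * gb + d * gb // 2 for d in range(14)]
print(round(days_until_limit(history, 16 * gb)))  # 16
```

Pairing a projection like this with the graphs makes the "recycle weekly until the upgrade lands" negotiation on the next slide much easier.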

129 Classic capacity planning question descriptions and proper answering techniques What happened: Hi Ron, Thank you for keeping an eye on these servers! You are right, there is a steady growth of memory usage by the SAP PRG processes on these application servers. This is not a surprise. There are several known issues regarding memory leaks with the current version of the Java hibernate libraries being used in the fake_name application and old fake_product. We have worked with the application vendor, fake_name, to resolve some of the more significant issues that were causing regular outages. Fake_vendor has not resolved some of the less-severe issues. There are plans to upgrade the entire application suite and change the underlying application execution platform from fake_product to new fake product. The application upgrade includes new libraries for hibernate, and the memory leak issues related to hibernate with fake_product have not appeared in new fake product. The landscape upgrade is currently scheduled for June. We will go ahead and schedule a recycle of the old fake product to recycle the Jlaunch processes you mentioned below. We will schedule regular process recycles until the system is upgraded. Please let me know if you have any additional questions or concerns. Thank you! © Ron Kaminski 2010, All Rights Reserved129

130 Classic capacity planning question descriptions and proper answering techniques What happened: – Memory leaks, key points to remember: Graphics help get their attention; CSV files are there for the whackos who demand the real data – Sometimes they say that they need it “to prove to the vendor” » Believe me, the vendor usually knows all too well… – It is easy to do and nips their evasions in the bud – Remember the “stall techniques”? Sometimes they can’t, or aren’t going to, fix it – Welcome to big corporations and “priorities” – Then you need to get them to reboot periodically to get the leaked memory back Do you have the graphs and data quickly available to discover, document and communicate this?

131 We have this really cool way to see all of the servers’ disk space for the last 90 days

132 Classic capacity planning question descriptions and proper answering techniques The evidence: Subject: Possible disk space issue looming on ustca479 Hi All, Here is a view of total disk space and disk space used on ustca479: Perhaps some purge/delete/cleanup is in order? Ron Kaminski

133 Classic capacity planning question descriptions and proper answering techniques The answer: Subject: RE: Possible disk space issue looming on ustca479 Ron, Thank you for the heads up. The increased disk space utilization is partially due to enhanced logging that we have enabled over the past few months. I have cleaned up some old logs and we will continue to monitor the disk utilization to determine if additional disk space is required. Thanks, Matt

134 Classic capacity planning question descriptions and proper answering techniques What happened: Well, it was a start! But alas, note the inexorable rise beginning again after the cleanup.

135 An update from Friday… Note that the max space has grown considerably, from 83 to 112 GB

136 Classic capacity planning question descriptions and proper answering techniques The best way to deal with these is to avoid them proactively by making great, workload-characterized consumption information available to all – Train your firm to use the capacity reporting and pathology detection systems You have automated pathology detection, all the way through ticketing issues, haven’t you? – Think graphics, not tables of numbers – If only a secret club knows the capacity data, you are making a big mistake – Train OS support folks to use the “What if…?” models

137 Break Time! Please be back at

138 What I said about clouds and SAAS last year: Say goodbye to your data centers and your privileges, folks! – Cloudy days are coming, and this is good Paying people in each firm to worry about OS, backup, security, and staying current was always expensive, and now it is ridiculous – Change firms a few times and note how wildly different “It has to be this way!” is Our capacity planning needs, and tools, will have to change too – Instead of vendors selling you software, many will sell the service running on their cloud This is great! Let the vendor maintain their own code! They are naturally the cheapest way; the expertise needed is naturally concentrated Having had a year more to search for and find issues, I see a few potential storm clouds in some firms’ sunny plans! – Let’s dig into why…

139 Clouds and “Software as a Service” Definition: Clouds = Running our stuff on someone else’s computers, plus whatever else will be needed for the new demands that will place on us, like: Encryption, so we can run sensitive corporate data over the world wide web safely – Note that this is done on both sides, the user’s machine and in the cloud. This may be an unpleasant surprise for firms that have replaced those expensive desktop processors (and all that excess capacity) with “light desktops” running virtual machines on shared hardware Exhaustive disk cleansing when we delete files or parts of files Network lag measuring tools, because there will be slowdowns and our users will want to direct their wrath Increased firm internet firewall bandwidth needed Increased firm internet bandwidth needed

140 What will those loads look like?

141 Cloud issues “We’ll just run everything in someone else’s cloud, so we won’t need capacity planning any more. It will be the cloud vendor’s problem!” – Clouds will place new, different, and often resource-intensive demands on our firm’s computing infrastructure – Capacity concerns will become very important, and historical records of what consumed what will be paramount for figuring things out Someone is going to have to pay for all of that extra processing, and it won’t be the vendor! The “Mushroom Cloud” will be appearing at firms that ignore these risks

142 Clouds and “Software as a Service” Definition: Software as a Service = Letting someone else run their code on their machines to serve us, but undoubtedly with our data, plus whatever else will be needed for the new demands that will place on us – If there is customer-identifiable information, we will need all of that encrypt/decrypt overhead again – Disk cleansing will be less of a priority, as no one can run “disk scrapers”, unlike in the cloud – Network lag measuring tools, because there will be slowdowns and our users will want to direct their wrath – Increased firm internet firewall bandwidth needed – Increased firm internet bandwidth needed

143 Other Cloud and SaaS issues The key thing to remember is that cloud and SaaS vendors will have to eventually operate at a profit! – This will drive them to the same attempts to economize that your firms are trying now Big and cheap IO devices, which are of course much slower Virtualization will be a certainty; you will never know what fraction of what hardware you will be on Architectural choices of the firm’s past won’t make sense any more – What hardware largess do you tolerate now for “Mission critical applications”? » Hot spares? » N+1 copies of data? – Will your cloud vendor leave enough excess capacity for your theoretical worst case? » How will you be sure?

144 Other Cloud and SaaS issues – And remember the graph: they have to run it on 2X to 3X+ the hardware for the same loads! Unless your firm’s Data Processing division is utterly ridiculous in their spending (and many are), how can clouds be cheaper? Clouds and SaaS only make sense when the non-hardware savings exceed the hardware and network costs, or provide other business-useful opportunities – Perhaps outsourcing a staff-intensive application to a SaaS vendor is still a really great idea

145 The moral of the story: Eventually businesses may evolve into partial cloud and SaaS users when the overhead of extra processing is smaller than some fraction (I’ll go out on a limb and say half) of the average resources needed to run the application and the security demands are low, and/or the total function cost is lower – Quick! Think of a low-security function at your firm that you would be happy to have some greasy-haired geek intercept, and put that in a low-security cloud I couldn’t think of any as an example! Can anyone here? – Almost all real corporate work will demand far more internal resources to run externally than to run internally Be sure to add those costs to your cloud and SaaS plans!

146 The story continues… Go out and repeat my analysis at your firm on one of your firm’s attempts to do it Publish a paper via CMG or elsewhere where you outline the specific true costs in consumption and hosting spend If your costs come out like mine did, i.e. “This doesn’t make a lot of sense!”… – expect a flood of analyst calls from “consulting groups” wanting you to expound on your “cloud computing experiences” – expect some wholehearted chuckling and agreement that it is nuts I think that many firms are “acting like” consulting groups when they are in fact trying to gather data to beat down internal pushes to “go cloud” or to “go SaaS” Or they are consulting to potential cloud providers and giving them a less than rosy view… For now, I would proceed very slowly…

147 Clouds, last words I used to live not far from here in Oviedo, FL – Every summer day, a lot of sunlight hitting the swampy ground would create a lot of hot rising moist air, so we had clouds and thunderstorms about 3 PM each day In IT journals and analysts’ sessions, there is a lot of hot, moist, hype-filled air rising – Maybe that is why they see clouds!

148 Last words: Don’t trust the vendor (or yourselves)

149 In the interests of time, we are going to skip some here But you all have the slides!

150 Modeling when all of the cards are stacked against you In a perfect world, when new code is written – There is a comprehensive test plan to verify functionality – All issues are corrected prior to the capacity planning tests – The capacity planning tests are performed on real (non-virtual) hardware with known characteristics – The testing group are old pros who know how to run a proper “mesa test”, which has… An hour of nothing An unsaturated hour of a realistic transaction mix, with realistic think times, on a realistic indexed database, without excessive application logging or extraneous monitors adding logging loads to key disks Followed by an hour of nothing – And finally, a truthful estimate of expected user loads
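The quiet-hour / busy-hour / quiet-hour shape described above can even be checked mechanically before you trust a sample. This is a sketch with assumed thresholds (5% "quiet", 10-80% "steady and unsaturated") and one-minute samples; real tooling would tune these per workload.

```python
# A sketch of validating the "mesa" shape of a 3-hour test sample:
# a quiet hour, a steady unsaturated hour, a quiet hour.

def is_mesa(cpu_pct_by_minute, quiet_max=5.0, busy_min=10.0, busy_max=80.0):
    """cpu_pct_by_minute: 180 one-minute CPU% samples (3 hours).
    Returns True only if the sample has the mesa shape."""
    if len(cpu_pct_by_minute) != 180:
        return False
    before = cpu_pct_by_minute[:60]
    test = cpu_pct_by_minute[60:120]
    after = cpu_pct_by_minute[120:]

    def quiet(hour):
        return max(hour) <= quiet_max

    # the middle hour must be busy but unsaturated, with no idle dips
    steady = all(busy_min <= s <= busy_max for s in test)
    return quiet(before) and steady and quiet(after)

good = [1.0] * 60 + [55.0] * 60 + [1.0] * 60
no_quiet_tail = [1.0] * 60 + [55.0] * 120   # testers never stopped
print(is_mesa(good), is_mesa(no_quiet_tail))  # True False
```

Rejecting a sample automatically, before modeling from it, is much cheaper than calibrating a model against a test that never had its quiet hours.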

151 Modeling when all of the cards are stacked against you In the old days, we would be able to justify… – A great testing team with experience – Separate testing hardware (which was also used as a test bed for getting the “promote to production” process working well) – Real change control with handoffs – No deployment to production without a model (my preference) or load test at projected rates In the modern firm… – a lot of those folks and skills were eliminated or outsourced The “Hardware is so cheap, we’ll just buy more!” mantra We’ll just run on someone else’s cloud! – Outsourced code production teams may be “lacking in experience” – Severe pressures to “Get it on the web now!” lead to some decisions that are frankly dubious when viewed dispassionately – If you are lucky enough to have modeling or load testing tools at all, the folks running them are nowhere near as skilled as your old testing team, but they do have a shorter deadline! So what do you do?

152 Modeling when all of the cards are stacked against you You make some concessions to reality Realities of this decade – People will be running tests on virtual machines Worse yet, they will be running tests on virtual machines that do not know that they are VMs, so their performance numbers will be way goofy at times – Often you will encounter “shared development servers” and severe problems getting folks to log off and stop using your test machines for the period of your mesa test – Your uptight project leaders are going to be oblivious to proper testing techniques and/or the realities of physics – Ron’s theorem: The chance of getting a clean unsaturated mesa test sample with a realistic transaction mix and a realistic arrival rate decreases to 1/(number of continents that team communications span) So, you are going to get all sorts of messed-up tests…

153 Modeling when all of the cards are stacked against you A general rule is: – “A careful modeler, who pays attention to calibration issues, can still get useful information from a sample done on VMware-hosted O/Ses” So, let’s start with the sample: – Clearly there was no “quiet hour” of IIS_w3wp before and after our test, so we have to adjust for the fact that our sample consumption is likely also running extra processing – At least the background “hum” of IIS_w3wp seems consistent

154 Modeling when all of the cards are stacked against you – Other issues: detecting when tests will not go as planned (key “warning trigger words” highlighted) All of the following have been said to me by real people in the last year: – We are planning to go live next Thursday, and we haven’t run a test yet while also saying: » I plan on simulating 10,000 users logging on the system per second from my laptop Tests I’ve run indicate that with modern office web connections to computer rooms you can accurately simulate around 20 users per test machine » We’ve only got 10 machine licenses of the load testing software, and I think that I can only use 3 Smart load testing firms have clouds of testing machines (think dozens or hundreds+) and specialists experienced with the tools (modeling from a small test still gives a better answer!)

155 Modeling when all of the cards are stacked against you » I plan on simulating 10,000 users logging on the system per second from my laptop What in blazes are you running? I called a very large and successful web payment system vendor where a pal of mine works, and he said that their maximum updates per second ever seen (on Cyber Monday, no less) are 416 per second, hosted by a co-located cloud of over 900 machines. You are going to get 10,000 moms per second putting in their address and personal information on a diaper web site? Further research showed that 1,000,000 moms had registered in one year on the previous site That means all of the moms on earth will re-register in 100 minutes! That must be some diaper coupon! © Ron Kaminski 2010, All Rights Reserved155
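The sanity check above is just arithmetic: compare the claimed arrival rate to the average rate implied by the site's own history (1,000,000 registrations in a year, per the slide). A tiny helper makes the gap impossible to argue with.

```python
# The back-of-the-envelope check above: how many times bigger is the
# claimed rate than the rate the previous site actually sustained?
SECONDS_PER_YEAR = 365 * 24 * 3600

def overstatement_factor(claimed_per_sec, historical_count, historical_seconds):
    """Ratio of the claimed arrival rate to the historically observed one."""
    implied_per_sec = historical_count / historical_seconds
    return claimed_per_sec / implied_per_sec

# 10,000 logons/second claimed vs. 1,000,000 registrations in a year:
factor = overstatement_factor(10_000, 1_000_000, SECONDS_PER_YEAR)
print(round(factor))  # 315360 -> the claim is ~315,000x the site's own history
```

Five lines of arithmetic like this, shown to the project leader, usually ends the 10,000-users-per-second conversation on the spot.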

156 Modeling when all of the cards are stacked against you » I’ve never used this load testing software before Rational testers start load testing midway through a project to get info on what parts are slow so they can fix them pre-launch, and then test weekly thereafter as fixes go in to verify » We are only going to test a few functions of the customer web site, that will be enough for capacity planning purposes, right? A realistic transaction mix, ideally taken from a production site with real users acting normally, is the “ultimate get” for accuracy. All else is supposition, and in my experience those “untested sections” can sometimes be really intense » Since production isn’t there yet, I’m going to have 20 real people from the team act like users for an hour straight Long experience with project teams lets me know that whatever happens, those 20 users will not act like 20 real users, no matter what…

157 Modeling when all of the cards are stacked against you More common issues with real people testers: Since they know the system, they start going “ninja fast” during the sample period, generating way more load per user than normal “I thought that we were supposed to ‘stress the system’” They know the system, so they type in perfect addresses, so the verification rules are never tripped, like they almost certainly would be in the real world After about 15 minutes, they get bored and stop for coffee, or to complain to co-testers about the project leader, their boss, their stupid firm, the weather…

158 Modeling when all of the cards are stacked against you More common issues with real people testers: Instead of a “realistic transaction mix”, the programmer pressed into service will check each and every function (also known as a functionality test) until every single function is tested, and then start all over again They “can’t be bothered” to create a realistic test user population’s data volumes, so they add their mom, sisters, cousins, and their kids’ baseball coach, and that data will not match production data characteristics Will a database of 35 users perform like 1,000,000 distinct moms in a database? Think about the indexes, caching, etc… Whatever they say they will do, the first time that you work with a new project team your chances of getting a good “mesa test” hour with realistic transaction mixes and consistent usage in all 60 minutes is about 0%

159 Modeling when all of the cards are stacked against you The ever-present unrealistic “think time” issue: – I was still suspicious of the consumption that was attributed to just 20 users of a contract review system. So I asked some questions: I’m modeling away, and I have one more question: Do you feel that this is a realistic rate of user activity, or much more than they would do in an hour? What I mean is, we simulated 20 users for an hour, working somewhat continuously. Is that how we expect them to act, or was this really 20 “ninja users” working much harder than normal? Let me know. As it stands, 20 users used 77% of one processor, so 400 will use 1547.6%, or almost 16 processors just for your web serving, not counting the OS, etc. I have a feeling that while they were 20 users, they were 20 really busy users, and that might be a bit distorted. Let me know how you feel, or call.

160 Modeling when all of the cards are stacked against you What I heard back: – What is the pacing rate for the _____ script? ________ wrote this and set the pacing to a fixed delay of 30 seconds. Sometimes that can turn into a simulation of super-aggressive users So I wrote back: – OK team, so now the question is: Will your users really do all of that work every 30 seconds? Don’t they ever get some coffee, take a call, or something? Maybe they get together to plot a coup against the manager who works them this hard? I can model it this way if you believe that this is realistic. Let me know. What they replied was key: – All the tasks are normal for a user accessing the Contract Browse. They wouldn’t complete all those tasks in 30 seconds, however; it’s probably over 3-5 minutes. – So I split the difference and called it 4 minutes per contract, or 1/8th the load generated

161 Modeling when all of the cards are stacked against you Spreadsheet tricks – Determine the noise percentage – Make scenario plans around their estimate Note that the first scenario is labeled 160 users, or 8 times the 20 they told us. Note also that we actually had to shrink the CPU used to account for the noise in the sample to get to 160 and even 200 users The formula: ((((Test*(SP_users/S_users))/(Test+Noise))-1)*100)
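Here is the spreadsheet formula from this slide as a worked example, reading Test as the CPU% attributable to the test workload and Noise as the background consumption captured alongside it. The 77% and the 160-user scenario come from the slides; the 10% noise figure is a made-up number for illustration.

```python
# A worked example of the slide's spreadsheet formula:
#   ((((Test*(SP_users/S_users))/(Test+Noise))-1)*100)

def scenario_growth_pct(test, noise, sp_users, s_users):
    """Percent growth to apply to the whole measured (test + noise) sample
    so the test portion scales to sp_users while the noise stays fixed."""
    return ((test * (sp_users / s_users)) / (test + noise) - 1) * 100

# 20 sampled users burned 77% of a CPU on top of an assumed 10% background
# noise; plan for 160 users (the 8x scenario from the slide):
print(round(scenario_growth_pct(77.0, 10.0, 160, 20), 1))  # 608.0
```

Note the effect described on the slide: scaling naively would say "grow by 700%", but because the noise does not multiply with the users, the growth actually applied to the measured sample is smaller (about 608% here).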

162 Modeling when all of the cards are stacked against you Then deliver the modeled results, with plans for future improvements

163 Modeling when all of the cards are stacked against you Now get ready for the end run… – After getting the modeled results, the project leader now said that they planned on having 1000 users Do we believe them? – Of course not! Then why would they say that? – Because all project leaders like 150% surplus hardware to cover up their cruddy code or other all-too-common mishaps – So what do we do? Model it, and show them how it dies on disk I/O! As soon as you start making requests of them, their appetite for surplus will suddenly and mysteriously disappear…

164 Modeling when all of the cards are stacked against you So in the end, we did get some useful information to the project – We highlighted a probable looming disk I/O issue – We got them to deploy on 4 CPUs instead of a probably-too-small 2 CPU system – We started their journey to being more precise forecasters of hardware – We detected and deflected an “end run” attempt at hardware piggery The morals of the story: – Sometimes you have to make do, but make do using as much science and care as you can – Expect an excessive amount of CPU_Wait on VMware-based models, and really watch your model’s calibration

165 Know a bad method when you see it Many vendors are in the consolidation business these days – zLinux – VMware – Other flavors Very few firms have great workload-characterized capacity planning information Many firms use outsourced or “vendor” estimates for sizing – If you know that a firm doesn’t have proper capacity planning information, what is a vendor going to use to make their estimates? Total CPU by hour? “Industry estimates”? – All of these typically grossly oversize machines Isn’t that how we got in this “too much hardware” mess to begin with? The biggest hour is never…

166 Know a bad method when you see it I have a management pal who says that every estimate that he has ever received based on total consumption is at least 2.5 times bigger than it needs to be – Are your resource-intensive maintenance activities “synched” across many machines? If so, any method that adds up the same hours across all machines will recommend a lot of CPU Or, you could bust up your anti-virus runs over the whole evening and get back several hundred CPUs… – Each well-meaning analyst who sees an estimate based on “shaky” logic typically adds a 20% “fudge factor” as a safety net If your estimate has been reviewed in sequence by several analysts, at some point it is mostly “fudge” Stick to real consumption figures sampled from real systems, taken in hours without excessive maintenance activities
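The "mostly fudge" effect above compounds, because each reviewer pads the already-padded number. A two-line calculation makes the point; the 100-CPU base figure is hypothetical.

```python
# Compounding fudge factors: each analyst in the review chain pads a
# shaky estimate by 20%.

def padded_estimate(base, reviewers, fudge=0.20):
    """Estimate after `reviewers` sequential reviewers each add `fudge`."""
    return base * (1 + fudge) ** reviewers

base = 100.0  # CPUs genuinely needed (hypothetical)
for n in (1, 3, 5):
    print(n, round(padded_estimate(base, n), 1))
```

After five reviews, a 100-CPU need has become an order for roughly 249 CPUs, with about 60% of the number being fudge rather than measured consumption.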

167 Know a bad method when you see it “Canary Machines” is the highly dubious practice of putting higher loads on a small subset of machines, and then upgrading the others when the canaries fall dead – That method guarantees that some users will have slowdowns or failures, and you will have ongoing monitoring costs – If it is a highly dynamic or surge prone environment, will you be able to get hardware purchased, shipped, installed and working before the next surge? What if the surge lasts weeks? When people don’t have proper tools or training that enable them to see things before they happen, they may choose to find problems by letting them happen to someone – You don’t have to…

168 Goals to work towards Get collections of some kind, on every node and device that your firm uses. That includes: – Machines – Disk arrays – Routers/switches/network stuff, including firewall servers and the like – Your boss’s iPhone – Anything and everything! Get graphical ways to display all of that data – On the web, all of the time, no exceptions – “We’ve been having disk issues with the vendor, get us graphical evidence of issues…” Plan for needing (and producing) tabular and/or CSV data – “We want to get a quote for zLinux, could you give us every machine that we have, CPU for the last 30 days in 15 minute intervals?” – It comes in real handy when you get a new process pathology idea and want to test it
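A minimal sketch of the kind of tabular export that zLinux quote request describes: roll raw per-minute CPU samples up to 15-minute averages and emit CSV. The sample data, field names, and function name are all illustrative assumptions, not from any real collector.

```python
# Hypothetical export: aggregate (timestamp, cpu_pct) samples into
# 15-minute buckets and write them out as CSV text.
import csv
import io
from datetime import datetime, timedelta

def to_15min_csv(samples):
    """samples: iterable of (datetime, cpu_pct) pairs; returns CSV text."""
    buckets = {}
    for ts, cpu in samples:
        # Truncate each timestamp down to the start of its 15-minute bucket
        bucket = ts.replace(minute=(ts.minute // 15) * 15, second=0, microsecond=0)
        buckets.setdefault(bucket, []).append(cpu)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["interval_start", "avg_cpu_pct"])
    for bucket in sorted(buckets):
        vals = buckets[bucket]
        writer.writerow([bucket.isoformat(), round(sum(vals) / len(vals), 2)])
    return out.getvalue()

# Thirty fake one-minute samples spanning two 15-minute intervals
start = datetime(2010, 4, 14, 9, 0)
samples = [(start + timedelta(minutes=i), 40 + i % 5) for i in range(30)]
print(to_15min_csv(samples))
```

Multiply that by every machine in the estate and you have the “real handy” data set the slide wants on tap for pathology hunting too.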

169 Goals to work towards Train multiple people on how to use your capacity planning tools – That way someone can go on vacation! – …or leave the firm and the team isn’t left in the lurch Find bright young folks and make new capacity planners out of them – I like to get a young system administrator at the top of their game, who is maybe starting to get a little bored with OS tasks These are often your best protégés They also come in real handy when automating the installs of thousands of collectors! – Install collectors on 2000 machines? No problem! Use their natural technical curiosity to “hook them” and successful models to “reel them in” – Drag them to CMG and introduce them to Adam Grummit, Dr. Buzen, Debbie Sheetz and all of your CMG pals that you call when you are stuck Because they will need them too!

170 Goals to work towards Train project leaders about capacity planning before the last days of a late project when they “Need to buy hardware right away or we won’t make our date!” Get capacity planning in your firm’s defined software development lifecycle, or find out what development paradigm is in force in your firm and get modeling and capacity planning into it Get a simple rule enforced: “No new hardware without a model or review” Model new projects early, way before they are “almost done” – The earlier an issue is found, the cheaper it is to fix – Saving a project leader from disaster a few times is a great way to get future cooperation Realize that sometimes you can’t do it all yourself – Bring in some great technical capacity planning staff to help (Freudian slip – it is a blast when they are used to different presentation styles…)

171 Things Not To Do While having lots of data and lots of experience can be quite powerful, it can also be scary to others without it Being seen as a “know it all” harms your effectiveness by undermining your communications – The smartest person in a firm that nobody can stand to listen to is not effective Ron’s hard won “tips” – Take the time to be seen listening – Never take credit, always share successes – When someone does some great work to fix a problem that you uncovered, send a note to their boss (and maybe a few higher in the chain) asking them to thank the fixer and praising the timeliness and quality of their work – Avoid personal pronouns when describing a problem Don’t say “I noticed…” Say instead “The data shows…” – Always use graphics to deliver bad news, people can argue words – Ask your boss and other old-timers to help and offer advice

172 An audit list Does your firm staff for Capacity Planning success? – Management sponsors high enough in the organization to compel behavior changes? – Gurus to design and ensure proper collection, data reduction and presentation systems Ideally there should be at least one per decade or two and perhaps per major O/S – Great systems and installation staff to deploy collectors quickly – Experienced pals at other firms who have already done it? – Can project leaders and key application technical personnel be trained and maintained so that they can effectively use the capacity planning reporting systems independently? – Is there sufficient continuity in key “outsourced or co-sourced” support team members to benefit from the firm specific knowledge available?

173 An audit list Does your firm: – base all hardware purchase decisions on workload characterized resource consumption over historically relevant periods and/or realistic transaction mixes within relevantly sized test databases? Not just on “total CPU” or memory – have speedy (ideally web delivered and graphical) ways to check all of the “sacred five” (CPU, DISK IO, Memory, Network and Response Time for key workloads)? – have key workload specific “drill down” capabilities when needed to break down giant workloads into meaningful sub-groupings of interest to applications folks? – have regular, automated pathology detection and notification, ideally with automatic ticketing to drive down incidents of needless consumption and chaos?

174 An audit list Does your firm: – adequately fund an independent testing group with: dozens of servers with licenses to run your load generation software to generate realistic loads? trained testing staff who can: – run proper mesa tests? – design realistic load scripts with proper think times? sufficient disk storage to make realistically sized test data sets for multiple simultaneous development efforts? In a world where more and more web page interactions will not be on PC browsers, but instead from handhelds, can you generate significant percentages of handheld browsers in your favorite load generation tool? – Our future growth is in_____ and what browsers are they likely using? – make capacity planning a known, taught and enforced part of their documented software development life cycle that all projects must follow? – make outsourced partners follow the same rules? – invest in staff training, including conferences to keep skills current?

175 An audit list Does your firm’s management: – enforce proper capacity planning methods on both new and established applications? – work with development management to help them prove the cost and time savings available from proper capacity and performance management practices? – buy enough collectors? – have firm-wide instrumentation standards and enforcement? – protect the capacity planners from powerful applications developers who used to get all the hardware that they ever dreamed of, and now are angry at the “machine stealing” capacity planners?

176 An audit list Is your firm’s management prone to the hype storms in IT media? – Do they still justify SaaS and/or “Cloud Computing” efforts using the false assumptions that “in firm” hardware costs will be reduced? – Do they believe disk array vendors that tell them that they no longer need to worry about regular defragmentation? disk layout and RAID choices? – The “Big cache” effect of too much hardware versus the disk vendor’s promises story – The “Your defragger isn’t!” story – Do they still believe that virtualization is the only answer? Or, even worse, required? Do they understand the benefits of “stacking” different applications on one machine? Do they understand the benefits of “collapsing” an application from multiple nodes to far fewer or just one? – Do they think that 24 x 7 “twitch monitoring” of computer systems by staff is effective and efficient? – Do they outsource IT systems and blindly trust and expect good results? All of these are provably false, yet popular, and many are still receiving venture funding right now…

177 An audit list Does your firm have automated recovery procedures in place for their capacity planning systems? – It is a simple fact: machines crash, including the ones that you will depend on for capacity planning. Make sure that you have automated and tested recovery procedures for: Retrieving all that remotely collected data All of your processing of data into workloads All of your web report file creation All of your pathology detection and ticketing – Ideally you will test all of this in several distinct parts of the process so that you can recover swiftly (and as automatically as possible) when it happens to you And it will happen to you… maybe when you need to work on slides…
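One cheap way to notice a dead pipeline stage before your users do is a freshness check over each stage’s latest output. This is a hypothetical sketch only: the paths, stage names, and the 26-hour threshold are illustrative assumptions, not anything from the presentation.

```python
# Hypothetical freshness check for the four pipeline stages the
# slide lists.  Flags any stage whose output file is missing or
# older than the cutoff, so recovery can start automatically.
import os
import time

STAGES = {
    "collection":  "/capdata/raw/latest.dat",        # remote data retrieval
    "workloads":   "/capdata/workloads/latest.wkl",  # data reduced into workloads
    "web_reports": "/capdata/web/index.html",        # web report file creation
    "pathology":   "/capdata/tickets/latest.log",    # pathology detection/ticketing
}

def stale_stages(stages, max_age_hours=26, now=None):
    """Return stage names whose output file is missing or older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = max_age_hours * 3600
    stale = []
    for name, path in stages.items():
        if not os.path.exists(path) or now - os.path.getmtime(path) > cutoff:
            stale.append(name)
    return stale

if __name__ == "__main__":
    for name in stale_stages(STAGES):
        print(f"ALERT: stage '{name}' has no fresh output - run its recovery procedure")
```

Run it from cron and feed the alerts into the same ticketing system the pathology detectors use, and the pipeline starts reporting its own outages.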

178 An audit list The best audit results your system can get – When a project team member sends you and all of his group links for all of their project’s machines on your web reporting system saying that they review it weekly to find performance issues and track the effects of changes – It is even better when they find issues on their own and use the web reporting system to highlight them!

179 An actual user’s letter… Hi Folks, I try to do checks on the Secret Server health on a weekly basis. This week there are a few anomalies showing (they may have been temporary and gone now). It might be worth taking a look at them to ensure there are no underlying problems. http://ustwu002.kcc.com/node_reports/ustcca038a/memory.html PAW Ustcca038a memory use seems to have jumped significantly since the 11th April. It’s not at dangerous levels right now, but it may be worth looking into what is utilizing it, just in case something is up. http://ustwu002.kcc.com/node_reports/ustcca038b/wcpu.html PAW ustcca038b CPU utilization went up significantly on the 14th. Looked like it peaked and started to go down again, but worth a check. Thanks. North Atlantic Data Warehousing Support Secret Name Kimberly-Clark Corp. IT Services: North Atlantic Data Warehousing

180 Summary Proper capacity planning isn’t just running a product that produces screens of output to check off against a requirements list – Proper capacity planning data, properly presented, is useful to all strata of IT and even the end users While it does require investment and knowledge, the rewards can be immense Keep coming to CMG to stay current and see past the hype! – Get better at this than me and get up here and share how you do it with everyone else! – Write a paper for next year’s conference!

181 The Book Shelf The Visual Display of Quantitative Information, by Edward R. Tufte, published by The Graphics Press – You will never look at vendor resource consumption graphics products the same way again – Join me in the “Ban 3-D and bouncing lines with oversize dots to represent sampled quantities” movement! – If you don’t already have it, run from the room right now and go buy it, it is that good Envisioning Information, by Edward R. Tufte, published by The Graphics Press – More great ways to think about human perception of numbers Handouts from any of Tufte’s lectures http://www.edwardtufte.com/tufte/ The Visual Miscellaneum: A Colorful Guide to the World's Most Consequential Trivia by David McCandless published by Collins Design

182 The Book Shelf Learning Perl, by Schwartz & Christiansen, published by O’Reilly & Associates Inc., the llama book (learn Perl in a few cross-country flights!) Programming Perl, by Wall, Christiansen & Orwant, published by O’Reilly & Associates Inc., the camel book (This book is a must have) Graphics Programming with Perl, by Verbruggen, published by Manning Publications (This book really helped me a lot when I decided to make my own reporting system, check out the online sample chapters) http://www.cpan.org

183 The Book Shelf Cascading Style Sheets, The Definitive Guide, by Meyer, published by O’Reilly & Associates Inc., the salmon book HTML & XHTML, The Definitive Guide, Musciano & Kennedy, published by O’Reilly & Associates Inc., the koala book Perl Graphics Programming, Wallace, published by O’Reilly & Associates Inc., the colobus monkey book CGI Programming with Perl, Guelich, Gundavaram & Birzniks, published by O’Reilly & Associates Inc., the mouse book

184 The Book Shelf New Operating Systems on your plate? – Windows® Internals: Including Windows Server 2008 and Windows Vista, Fifth Edition, Russinovich, Solomon & Ionescu, published by Microsoft Press (June 17, 2009) – Unix Power Tools, Third Edition, Powers, Peek, O’Reilly & Loukides, published by O’Reilly & Associates Inc., the power drill book

185 General Questions? Rules: – No “Which vendor…” questions! All vendors do great things, often in different ways, and in ways that change over time. Effective use comes from deep understanding of their methods. CMG is a great place to ask the vendors those questions and keep current! That is what the nightly drinking time is for! ;^)) – No “Which client did that…” questions! They may be in the audience!

186 Stump The Modeler! I’ve dumped a lot of material on you today, and some among you have some great questions that may bring a nuance into focus – Ask them! Or you might see things differently than I do – Write your own course! I’ll come and see it! Make sure to use a lot of graphs!

187 Give Blood! I have, regularly, since I turned 16, every 8 weeks, it is good for you and society! – http://en.wikipedia.org/wiki/Blood_donation – http://www.redcross.org/donate/give/

188 Oh Boy! Legalese Any process names, product names, lyrics, trademarks or commercial products mentioned are the property of their respective owners All opinions expressed are those of the author, not any of the author’s present or past employers Any ideas from this paper implemented by the reader are done at their own risk. The author and/or his present or past employers assume no liability or risk arising from activities suggested in this paper. Work safe, and have a good time!

189 Thank You So Much For Listening! Write A Paper For CMG!

