Capacity Management for Web Operations John Allspaw Operations Engineering.

Presentation transcript:

1 Capacity Management for Web Operations John Allspaw Operations Engineering

2 the book I'm writing

3 ???

4 Rules of Thumb Planning/Forecasting Stupid Capacity Tricks (with some Flickr statistics sprinkled in)

5 Things that can cause downtime: bugs (disguised as capacity problems); edge cases (disguised as capacity problems); security incidents; real capacity problems* (*should be the last thing you need to worry about)

6 Capacity != Performance. Forget about performance for right now. Measure what you have right NOW. Don't count on it getting any better.

7 Thank You HPC Industry! Automated Stuff. Scalable Metric Collection/Display. A lot of great deployment and management tricks come from them, adopted by web ops.

8 Good Measurement Tools: record and store metrics in/out; custom metrics; easily compare; lightweight-ish

9 Clouds need planning too. Makes deployment and procurement easy and quick. But clouds are still resources with costs and limits, just like your own stuff. Black boxes: you may need to pay even more attention than before.

10 Metrics System Statistics

11 Metrics Application Level (photos processed per minute) (average processing time per photo) (apache requests) (concurrent busy apache procs)

12 Metrics App-level meets system-level here, total CPU = ~1.12 * # busy apache procs (ymmv)

13 2400 photos per minute being uploaded right NOW (Tuesday afternoon)

14 Ceilings: the maximum amount of work your resources will allow before degradation or failure

15 Forget Benchmarking

16 Find your ceilings (chart labels: "The End", "what you have left")

17 Use real live production data to find ceilings. Production: it's like a lab, but bigger!

18 Like: database ceilings replication lag: bad!

19 Ceilings: waiting on disk too much. Sustained disk I/O wait >40% creates slave lag* (*for us, YMMV)
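One way to turn the "sustained I/O wait >40%" ceiling on slide 19 into a check is sketched below. The 40% figure is from the slide; the five-sample window and the sample readings are assumptions made up for illustration, since the talk does not say how long "sustained" is.

```python
def sustained_above(samples, threshold=40.0, window=5):
    """Return True if the last `window` samples all exceed `threshold`.

    `threshold` defaults to the 40% I/O wait figure from the slide;
    `window` is a hypothetical choice for what "sustained" means.
    """
    if len(samples) < window:
        return False  # not enough history to call it sustained
    return all(s > threshold for s in samples[-window:])


# Hypothetical I/O wait readings (percent), oldest first.
iowait_pct = [12.0, 35.0, 44.0, 47.5, 52.0, 41.0, 45.0]
print(sustained_above(iowait_pct))
```

A single spike above the threshold does not trip the check; only a full window of high readings does.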

20 35,000 photo requests per second on a Tuesday peak

21 Safety Factors

22 Ceiling * Factor of Safety = UR LIMITZ

23 Safety Factors webserver!

24 Safety Factors: safe ceiling @85% CPU. 85% total CPU = ~76 busy apache procs (chart label: "what you have left")
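The 85% CPU ceiling and the ~76 busy-process figure follow from the ratio on slide 12 (total CPU = ~1.12 * busy apache procs). A minimal sketch of that arithmetic, assuming the ratio is in percent of total CPU per busy process and that rounding to the nearest process is acceptable (the function name is hypothetical):

```python
CPU_PCT_PER_BUSY_PROC = 1.12  # from slide 12; "ymmv"


def safe_busy_procs(cpu_ceiling_pct, cpu_pct_per_proc=CPU_PCT_PER_BUSY_PROC):
    """Busy Apache processes that correspond to a given total-CPU ceiling.

    Rounds to the nearest whole process to match the slide's "~76";
    a more conservative version would round down instead.
    """
    return round(cpu_ceiling_pct / cpu_pct_per_proc)


print(safe_busy_procs(85))  # 85 / 1.12 is about 75.9
```

The ratio has to be measured on your own servers; only the shape of the calculation carries over.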

25 Safety Factors: Yahoo Front Page link to Chinese New Year photos (photo requests/second) (8% spike)

26 Forecasting

27 Fictional Example: webservers

28 Forecasting Fictional example: 15 webservers. 1 week. peak of the week

29 Forecasting ...bigger sample, 6 weeks... isolate the peaks...

30 Forecasting ...add a trendline with some decent correlation... not too shabby now
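The trendline on slide 30 is an ordinary least-squares line through the weekly peaks. A sketch of that fit in plain Python follows; the weekly peak values are invented for illustration (the deck does not list its data points), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical weekly peaks of busy Apache procs across the fleet.
weeks = [1, 2, 3, 4, 5, 6]
peaks = [500, 540, 585, 620, 665, 700]
slope, intercept = fit_line(weeks, peaks)
print(f"growth per week: {slope:.1f} procs, intercept: {intercept:.1f}")
```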

31 Forecasting: 15 servers @ 76 busy apache proc limit = 1140 total procs (the ceiling). When is this? The trendline will tell you when it is. (chart labels: "ceiling", "what you have left")

32 Forecasting: (1140 - 726) / 42.751 = 9.68 (week #10, duh)
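The arithmetic on slides 31 and 32 can be sketched as follows. The numbers (15 servers, 76 procs per server, intercept 726, slope 42.751 per week) are from the slides; the function name is hypothetical, and the sketch assumes the linear trend keeps holding.

```python
def weeks_until_ceiling(ceiling, intercept, slope_per_week):
    """Solve ceiling = intercept + slope * weeks for weeks.

    Returns None when usage is flat or shrinking, because the
    ceiling is then never reached under a linear trend.
    """
    if slope_per_week <= 0:
        return None
    return (ceiling - intercept) / slope_per_week


servers = 15
per_server_limit = 76                   # busy apache procs at the safe ceiling
ceiling = servers * per_server_limit    # total procs the fleet can absorb
weeks = weeks_until_ceiling(ceiling, intercept=726, slope_per_week=42.751)
print(f"ceiling = {ceiling} procs, reached after {weeks:.2f} weeks")
```

Rounding 9.68 up gives week 10, which is the week the slide calls out.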

33 Forecasting Automation: Writing Excel macros is boring. All we want is days remaining, so all we need is the curve-fit. Use http://fityk.sf.net to automate the curve-fit.

34 Forecasting Fictional Example: storage consumption

35 Forecasting Automation: actual flickr storage consumption from early 2005, in GB (ceiling is fictional). The curve-fit will tell you when the ceiling is reached.

36 Forecasting Automation: cmd line script output

    [jallspaw:~]$ cfityk ./fit-storage.fit
    1> # Fityk script. Fityk version: 0.8.2
    2> @0 < '/home/jallspaw/storage-consumption.xy'
    15 points. No explicit std. dev. Set as sqrt(y)
    3> guess Quadratic
    New function %_1 was created.
    4> fit
    Initial values: lambda=0.001 WSSR=464.564
    #1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
    #2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
    #3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
    #4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
    Fit converged.
    Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
    5> info formula in @0
    # storage-consumption
    14147.4+146.657*x+0.786854*x^2
    6> quit
    bye...

37 Forecasting Automation (SAME). fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84). Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84)
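Once the quadratic is known, "when do we hit the ceiling" is a root-finding problem. The sketch below solves it with the quadratic formula using the fityk coefficients from slide 36. The 20,000 GB ceiling is an assumption chosen for illustration (the deck only says its ceiling is fictional), and x is in whatever time unit the input data file used, which the slides do not state.

```python
import math


def time_to_ceiling(a, b, c, ceiling):
    """Smallest non-negative x where a*x^2 + b*x + c equals `ceiling`.

    Returns None if the fitted curve never reaches the ceiling.
    """
    if a == 0:  # degenerate case: the fit is a straight line
        if b == 0:
            return None
        x = (ceiling - c) / b
        return x if x >= 0 else None
    disc = b * b - 4 * a * (c - ceiling)
    if disc < 0:
        return None
    root = math.sqrt(disc)
    candidates = [(-b - root) / (2 * a), (-b + root) / (2 * a)]
    future = [x for x in candidates if x >= 0]
    return min(future) if future else None


# Coefficients from the fityk output; the ceiling is hypothetical.
x = time_to_ceiling(a=0.786854, b=146.657, c=14147.4, ceiling=20000.0)
print(f"ceiling reached at x = {x:.1f}")
```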

38 Capacity Health: 12,629 nagios checks; 1,314 hosts; 6 datacenters; 4 photo farms (farm = 2 DCs, east/west)

39 High and Low Water Marks: per server, squid requests per second. High water mark: alert if higher. Low water mark: alert if lower.
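A high/low water mark check like the one on slide 39 can be sketched in a few lines. The threshold values and server readings below are hypothetical; the slide gives none. A plausible reason for the low mark, not stated in the deck, is that a server handling far fewer requests than its peers may have fallen out of rotation.

```python
def check_water_marks(value, low, high):
    """Classify a reading against low and high water marks."""
    if value > high:
        return "HIGH"
    if value < low:
        return "LOW"
    return "OK"


# Hypothetical per-server squid requests/second and thresholds.
LOW_MARK, HIGH_MARK = 200, 900
readings = {"squid1": 650, "squid2": 950, "squid3": 40}
for host, rps in sorted(readings.items()):
    print(host, rps, check_water_marks(rps, LOW_MARK, HIGH_MARK))
```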

40 A good dashboard looks something like... (yes, fictional numbers)

41 Diagonal Scaling: vertically scaling your already horizontal nodes. Image processing machines: replace Dell PE860s with HP DL140 G3s.

42 Diagonal Scaling example: image processing. 4 cores vs. 8 cores (about the same CPU usage per box)

43 Diagonal Scaling example: image processing throughput. ~45 images/min @ peak vs. ~140 images/min @ peak (same CPU usage, but ~3x more work). "Processing" means making 4 sizes from originals.

44 Diagonal Scaling example: image processing !!! Went from: 23 Dell PE860s (3008.4 Watts, 1035 photos/min, 23U rack) to: 8 HP DL140 G3s (1036.8 Watts, 1120 photos/min, 8U rack) (75% faster, even)
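The before/after figures on slide 44 can be reduced to per-box and per-watt numbers, which is where the "diagonal scaling" win shows up. The inputs below are from the slide; the `Fleet` class and its field names are hypothetical, and the derived ratios are simple division rather than anything the talk states.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    name: str
    boxes: int
    watts: float           # total power draw for the fleet
    photos_per_min: float  # total throughput for the fleet
    rack_units: int

    @property
    def photos_per_min_per_box(self):
        return self.photos_per_min / self.boxes

    @property
    def photos_per_min_per_watt(self):
        return self.photos_per_min / self.watts


old = Fleet("Dell PE860", boxes=23, watts=3008.4, photos_per_min=1035, rack_units=23)
new = Fleet("HP DL140 G3", boxes=8, watts=1036.8, photos_per_min=1120, rack_units=8)

for fleet in (old, new):
    print(f"{fleet.name}: {fleet.photos_per_min_per_box:.0f} photos/min/box, "
          f"{fleet.photos_per_min_per_watt:.2f} photos/min/watt")

power_saved = 1 - new.watts / old.watts
print(f"power saved: {power_saved:.1%}, rack space saved: "
      f"{old.rack_units - new.rack_units}U")
```

The per-box results line up with the ~45 and ~140 images/min peaks quoted on slide 43.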

45 3.52 terabytes will be consumed today (on a Tuesday)

46 2nd Order Effects (beware the wandering bottleneck) running hot, so add more

47 2nd Order Effects (beware the wandering bottleneck) running great now, so more traffic! now these run hot

48 Stupid Capacity Tricks

49 Stupid Capacity Tricks: quick and dirty management. DSH http://freshmeat.net/projects/dsh

    [root@netmon101 ~]# cat group.of.servers
    www100
    www118
    dbcontacts3
    admin1
    admin2

50 Stupid Capacity Tricks: quick and dirty management

    [root@netmon101 ~]# dsh -N group.of.servers
    dsh> date
    executing 'date'
    www100: Mon Jun 23 14:14:53 UTC 2008
    www118: Mon Jun 23 14:14:53 UTC 2008
    dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
    admin1: Mon Jun 23 14:14:53 UTC 2008
    admin2: Mon Jun 23 14:14:53 UTC 2008
    dsh>

51 Stupid Capacity Tricks: Turn Stuff OFF. Disable heavy-ish features of the site (on/off switches). We have 195 different things to disable in case of emergency.

52 Stupid Capacity Tricks: Turn Stuff OFF. uploads (photo); uploads (video); uploads by email; various API things; various mobile things; various search things; etc., etc.
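The on/off switches from slides 51 and 52 could look something like the sketch below. This is not Flickr's implementation, which the talk does not show; the class, the switch names, and the response strings are all hypothetical, and a real system would keep the switch state somewhere every web server can read it.

```python
class FeatureSwitches:
    """In-memory registry of features that can be turned off in an emergency."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def _set(self, name, value):
        if name not in self._enabled:
            raise KeyError(f"unknown feature switch: {name}")
        self._enabled[name] = value

    def turn_off(self, name):
        self._set(name, False)

    def turn_on(self, name):
        self._set(name, True)

    def is_enabled(self, name):
        return self._enabled[name]


def handle_video_upload(switches):
    """Hypothetical request handler that checks its switch first."""
    if not switches.is_enabled("video_uploads"):
        return "503 video uploads are temporarily disabled"
    return "200 accepted"


switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])
switches.turn_off("video_uploads")
print(handle_video_upload(switches))
```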

53 Stupid Capacity Tricks: Outages Happen. Host your outage/status/blog page in more than one datacenter. Tell your users WTF is going on, they'll appreciate it.

54 Stupid Capacity Tricks: Hit the Pause Button. Bake the dynamic into static. Some Y! properties have a big red button to instantly bake (and un-bake) at will.

55 thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/

56 We're Hiring! flickr.com/jobs Come see me!

57 questions?

