CPSC 531: Output Data Analysis. Instructor: Anirban Mahanti. Office: ICT 745. Class Location: TRB.


1 CPSC 531: Output Data Analysis
Instructor: Anirban Mahanti
Office: ICT 745
Email: mahanti@cpsc.ucalgary.ca
Class Location: TRB 101
Lectures: TR 15:30 - 16:45 hours
Slides primarily adapted from: “The Art of Computer Systems Performance Analysis” by Raj Jain, Wiley 1991. [Chapters 12, 13, and 25]

2 Outline
- Measures of Central Tendency
  - Mean, Median, Mode
- How to Summarize Variability?
- Comparing Systems Using Sample Data
- Comparing Two Alternatives
- Transient Removal

3 Measures of Central Tendency (1)
- Sample mean: sum of all observations divided by the total number of observations
  - Always exists and is unique
  - Mean gives equal weight to all observations
  - Mean is strongly affected by outliers
- Sample median: list the observations in increasing order; the observation in the middle of the list is the median
  - Even # of observations: mean of the middle two values
  - Always exists and is unique
  - Resistant to outliers (compared to mean)

4 Measures of Central Tendency (2)
- Sample mode: plot a histogram of the observations and find the bucket with peak frequency; the midpoint of this bucket is the mode
  - Mode may not exist (e.g., all samples have equal weight)
  - More than one mode may exist (e.g., bimodal)
  - If there is only one mode, the distribution is unimodal
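
The three indices can be compared on a small sample. The sketch below is only illustrative: the response times and the bucket width of 5 are hypothetical, and `histogram_modes` is a helper name introduced here, not something from the course material.

```python
from collections import Counter
from statistics import mean, median


def histogram_modes(data, bucket_width):
    """Return the midpoints of the histogram buckets with peak frequency."""
    # Bucket k covers the half-open range [k * width, (k + 1) * width).
    counts = Counter(int(x // bucket_width) for x in data)
    peak = max(counts.values())
    # More than one bucket can reach the peak (e.g., a bimodal sample).
    return sorted((k + 0.5) * bucket_width for k, c in counts.items() if c == peak)


# Hypothetical response times in ms; 95 is an outlier.
times = [10, 12, 11, 13, 12, 14, 11, 12, 95]

print("mean   =", mean(times))                # pulled upward by the outlier
print("median =", median(times))              # stays near the bulk of the data
print("modes  =", histogram_modes(times, 5))  # midpoint(s) of the fullest bucket
```

The mode found this way depends on the chosen bucket width, so it is worth trying more than one width before reporting it.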

5 Measures of Central Tendency (3)
- Is the data categorical?
  - Yes: use mode
  - e.g., most used resource in a system
- Is the total of interest?
  - Yes: use mean
  - e.g., total response time for Web requests
- Is the distribution skewed?
  - Yes: use median. The median is less influenced by outliers than the mean.
  - No: use mean. Why?

6 Common Misuses of Means (1)
- Usefulness of the mean depends on the number of observations and the variance
  - E.g., two response time samples: 10 ms and 1000 ms. Mean is 505 ms! Correct index but useless.
- Using the mean without regard to skewness

             System A   System B
             10         5
             9          5
             11         5
             10         4
             10         31
  Mean:      10         10
  Mode:      10         5
  Min, Max:  [9, 11]    [4, 31]

7 Common Misuses of Means (2)
- Mean of a product by multiplying means
  - Mean of a product equals the product of the means if the two random variables are independent.
  - If x and y are correlated, E(xy) != E(x)E(y)
  - Avg. users in system: 23; avg. processes/user: 2. Avg. # of processes in system? Is it 46?
  - No! The number of processes spawned by users depends on the load.
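
A small numeric sketch of this pitfall follows. The four paired measurements are hypothetical, chosen only so that the averages match the 23 users and 2 processes/user quoted above while processes per user fall as load rises.

```python
# Hypothetical paired measurements: users in the system and processes per
# user observed at the same instants. Processes per user drop as load rises.
users = [11, 20, 26, 35]
procs_per_user = [3.0, 2.5, 1.5, 1.0]


def average(values):
    return sum(values) / len(values)


# Multiplying the two averages ignores the correlation between them.
product_of_means = average(users) * average(procs_per_user)
# Averaging the per-instant products gives the actual mean process count.
mean_of_product = average([u * p for u, p in zip(users, procs_per_user)])

print("product of means:", product_of_means)
print("mean of product: ", mean_of_product)
```

With this negatively correlated data the product of the means overstates the average number of processes.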

8 Outline
- Measures of Central Tendency
- How to Summarize Variability?
- Comparing Systems Using Sample Data
- Comparing Two Alternatives
- Transient Removal

9 Summarizing Variability
- Summarizing by a single number is rarely enough.
  - Given two systems with the same mean, we generally prefer the one with less variability
[Figure: two response-time histograms, each with mean = 2 s. In the first, 80% of responses take 1.5 s and 20% take 4 s. In the second, 60% take ~0.001 s and 40% take ~5 s.]
- Indices of dispersion: range, variance, 10- and 90-percentiles, semi-interquartile range, and mean absolute deviation

10 Range
- Easy to calculate; range = max - min
- In many scenarios, not very useful:
  - Min may be zero
  - Max may be an “outlier”
  - With more samples, max may keep increasing and min may keep decreasing → no “stable” point
- Range is useful if system performance is bounded

11 Variance and Standard Deviation
- Given a sample of n observations $\{x_1, x_2, \ldots, x_n\}$, the sample variance is calculated as:
  $$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
- Sample variance: $s^2$ (square of the unit of observation)
- Sample standard deviation: $s$ (in the unit of observation)
- Note the (n-1) in the variance computation
  - Only (n-1) of the n differences $(x_i - \bar{x})$ are independent
  - Given (n-1) differences, the nth difference can be computed
  - The number of independent terms is the degrees of freedom (df)

12 Standard Deviation (SD)
- Standard deviation and mean have the same units
  - Preferred!
  - E.g. a) Mean = 2 s, SD = 2 s; high variability?
  - E.g. b) Mean = 2 s, SD = 0.2 s; low variability?
- Another widely used measure: the coefficient of variation (C.O.V.)
  - C.O.V. = ratio of standard deviation to mean
  - C.O.V. does not have any units
  - C.O.V. shows the magnitude of variability
  - C.O.V. in (a) is 1 and in (b) is 0.1
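
The sample variance (with its n-1 divisor), the standard deviation, and the C.O.V. can be sketched in a few lines. The data values and the function names below are hypothetical and serve only to illustrate the definitions.

```python
import math


def sample_variance(xs):
    """Sample variance with the (n - 1) divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Sample standard deviation, in the unit of the observations."""
    return math.sqrt(sample_variance(xs))


def cov(xs):
    """Coefficient of variation: SD divided by mean (unitless)."""
    return sample_sd(xs) / (sum(xs) / len(xs))


# Hypothetical response times in seconds.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print("variance =", sample_variance(xs))
print("SD       =", sample_sd(xs))
print("C.O.V.   =", cov(xs))
```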

13 Percentiles, Quantiles, Quartiles
- Lower and upper bounds expressed in percents or as fractions
  - 90-percentile → 0.9-quantile
  - $\alpha$-quantile: sort and take the $[(n-1)\alpha + 1]$th observation, where [ ] means round to the nearest integer
  - Quartiles divide the data into parts at 25%, 50%, 75% → quartiles (Q1, Q2, Q3)
  - 25% of the observations ≤ Q1 (the first quartile)
  - Second quartile Q2 is also the median
- The range (Q3 - Q1) is the interquartile range
  - (Q3 - Q1)/2 is the semi-interquartile range (SIQR)
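
The sort-and-index rule for quantiles can be sketched as follows. The observations are hypothetical, and rounding ties upward is an assumption made here because the rule above does not say how ties are resolved.

```python
import math


def quantile(xs, alpha):
    """alpha-quantile: the [(n - 1) * alpha + 1]th sorted observation."""
    ordered = sorted(xs)
    n = len(ordered)
    # Round to the nearest integer; ties round up here, which is an
    # assumption because the slide does not say how ties are handled.
    position = math.floor((n - 1) * alpha + 1 + 0.5)
    return ordered[position - 1]  # positions are 1-based


def siqr(xs):
    """Semi-interquartile range: (Q3 - Q1) / 2."""
    return (quantile(xs, 0.75) - quantile(xs, 0.25)) / 2


# Hypothetical observations.
xs = [12, 15, 11, 18, 14, 20, 13, 17, 16]
print("Q1, Q2, Q3 =", quantile(xs, 0.25), quantile(xs, 0.5), quantile(xs, 0.75))
print("0.9-quantile =", quantile(xs, 0.9))
print("SIQR =", siqr(xs))
```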

14 Mean Absolute Deviation
- Mean absolute deviation is calculated as:
  $$\text{Mean absolute deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$$
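
A minimal sketch of the mean absolute deviation, assuming the usual definition (average of the absolute differences from the sample mean). The sample is hypothetical and contains one outlier; the sample SD is printed alongside only for comparison.

```python
import math


def mean_absolute_deviation(xs):
    """Average absolute difference between each observation and the mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum(abs(x - x_bar) for x in xs) / n


def sample_sd(xs):
    """Sample standard deviation with the (n - 1) divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))


# Hypothetical sample with one outlier (31).
xs = [5, 5, 5, 4, 31]
print("mean absolute deviation =", mean_absolute_deviation(xs))
print("sample SD               =", sample_sd(xs))
```

Because the differences are not squared, the outlier inflates the mean absolute deviation less than it inflates the SD.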

15 Influence of Outliers
- Range: considerably
- Sample variance: considerably, but less than range
- Mean absolute deviation: less than variance
  - Doesn’t square (i.e., magnify) the outliers
- SIQR: very resistant
- Use SIQR as the index of dispersion whenever the median is used as the index of central tendency

16 Outline
- Measures of Central Tendency
- How to Summarize Variability?
- Comparing Systems Using Sample Data
  - Sample vs. Population
  - Confidence Interval for Mean
- Comparing Two Alternatives
- Transient Removal

17 Comparing Systems Using Sample Data
- The words “sample” and “example” have a common root: “essample” (French)
- One sample does not prove a theory; a sample is just an example
- The point is that a definite statement cannot be made about the characteristics of all systems.
- However, probabilistic statements about the range of most systems can be made
- Confidence interval concept as a building block

18 Sample versus Population
- Generate 1 million random numbers
  - with mean $\mu$ and SD $\sigma$, and put them in an urn
- Draw a sample of n observations
  - $\{x_1, x_2, \ldots, x_n\}$ has mean $\bar{x}$ and standard deviation $s$
- $\bar{x}$ is likely different from $\mu$!
- The population mean $\mu$ is unknown or impossible to obtain in many real-world scenarios
  - Therefore, obtain an estimate of $\mu$ from $\bar{x}$

19 Confidence Interval for the Mean
- Define bounds $c_1$ and $c_2$ such that: $\text{Prob}\{c_1 < \mu < c_2\} = 1 - \alpha$
  - $(c_1, c_2)$ is the confidence interval
  - $\alpha$ is the significance level
  - $100(1-\alpha)$ is the confidence level
- Typically a small $\alpha$ is desired
  - confidence level 90%, 95% or 99%
- One approach (shown for a 90% interval): take k samples, find the sample means, sort them, and take the $[1+0.05(k-1)]$th as $c_1$ and the $[1+0.95(k-1)]$th as $c_2$

20 Central Limit Theorem
- We do not need many samples. Confidence intervals can be determined from one sample because $\bar{x} \sim N(\mu, \sigma/\sqrt{n})$
- The SD of the sample mean, $\sigma/\sqrt{n}$, is called the standard error
- Using the CLT, a $100(1-\alpha)\%$ confidence interval for a population mean is
  $$\left(\bar{x} - z_{1-\alpha/2}\, s/\sqrt{n},\ \bar{x} + z_{1-\alpha/2}\, s/\sqrt{n}\right)$$
  - $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of a unit normal variate (and is obtained from a table!)
  - $s$ is the sample SD

21 Confidence Interval Example
- CPU times obtained by repeating an experiment 32 times. The sorted set consists of
  - {1.9, 2.7, 2.8, 2.8, 2.8, 2.9, 3.1, 3.1, 3.2, 3.2, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 3.9, 4.1, 4.1, 4.2, 4.2, 4.4, 4.5, 4.5, 4.8, 4.9, 5.1, 5.1, 5.3, 5.6, 5.9}
  - Mean $\bar{x}$ = 3.90, standard deviation (s) = 0.95, n = 32
- For a 90% confidence interval, $z_{1-\alpha/2}$ = 1.645, and we get $3.90 \mp (1.645)(0.95)/\sqrt{32} = (3.62, 4.17)$
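
The interval can be recomputed from the summary statistics alone. This is a sketch under the large-sample normal approximation; the function name is introduced here for illustration, and the z quantile comes from Python's `statistics.NormalDist` rather than a table. With the rounded inputs 3.90 and 0.95 the bounds come out near 3.62 and 4.18, so the last digit of the upper bound can differ slightly from the figure quoted above, most likely a rounding effect.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(x_bar, s, n, confidence=0.90):
    """Two-sided CI for the mean using the normal quantile (large n)."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # (1 - alpha/2)-quantile
    half_width = z * s / math.sqrt(n)            # z times the standard error
    return x_bar - half_width, x_bar + half_width


# Summary statistics from the CPU time example.
low, high = mean_confidence_interval(x_bar=3.90, s=0.95, n=32, confidence=0.90)
print(f"90% CI: ({low:.3f}, {high:.3f})")
```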

22 Meaning of Confidence Interval
[Figure: an interval from $\bar{x} - c$ to $\bar{x} + c$; there is a 90% chance that this interval contains $\mu$.]
- What does this mean? With 90% confidence, we can say the population mean is within the above bounds; that is, the chance of error is 10%.
  - E.g., take 100 samples and construct CIs. In about 10 cases, the interval will not contain the population mean

23 Length of Confidence Interval
- Let $z_{1-\alpha/2}\, s/\sqrt{n} = c$
- Then, $z_{1-\alpha/2} = c\sqrt{n}/s$
  - Larger s implies a wider confidence interval
  - Larger n implies a shorter confidence interval
    → with more observations, we are better able to predict the population mean
    → the square-root-of-n relationship implies that increasing the observations by a factor of 4 only cuts the confidence interval by a factor of 2
- The confidence interval computation as described here works for n ≥ 30.

24 What if n is not large?
- For smaller samples, confidence intervals can be constructed only if the observations come from a normally distributed population:
  $$\left(\bar{x} - t_{[1-\alpha/2;\,n-1]}\, s/\sqrt{n},\ \bar{x} + t_{[1-\alpha/2;\,n-1]}\, s/\sqrt{n}\right)$$
  - $t_{[1-\alpha/2;\,n-1]}$ is the $(1-\alpha/2)$-quantile of a t-variate with (n-1) degrees of freedom

25 Testing for a Zero Mean
- Check if the measured value is significantly different from zero
- Determine the confidence interval
- Then check if zero is inside the interval.
- The procedure is applicable to any other value a
[Figure: confidence intervals for the mean drawn against the value 0, labeled “Mean is zero” and “Mean is nonzero”.]

26 Outline
- Measures of Central Tendency
- How to Summarize Variability?
- Comparing Systems Using Sample Data
- Comparing Two Alternatives
- Transient Removal

27 Comparing Two Alternatives
- Often interested in comparing systems
  - “naïve” VOD vs. “batching” VOD (assignment 3)
  - “SJF” vs. “FIFO” request scheduling (assignment 1)
- Statistical techniques for such comparisons:
  - Paired Observations
  - Unpaired Observations (we will omit this!)
  - Approximate Visual Test
- Did you use any of these in your assignments?

28 Paired Observations (1)
- n experiments with a one-to-one correspondence between the test on system A and the test on system B
  - no correspondence => unpaired
  - This test uses the zero mean idea…
- Treat the two samples as one sample of n pairs
- For each pair, compute the difference
- Construct a confidence interval for the difference
  - CI includes zero => systems not significantly different

29 Paired Observations (2)
- Six similar workloads used on two systems: {(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)}. Is one system better?
- The performance differences are {-13.7, 13.1, -2.8, -1.1, -3.0, 5.6}
- Sample mean = -0.32, sample SD = 9.03
- CI = $-0.32 \mp t\sqrt{81.62/6} = -0.32 \mp t(3.69)$
- The 0.95-quantile of t with 5 degrees of freedom is 2.015
- 90% confidence interval = (-7.75, 7.11)
- The systems are not significantly different, as zero is inside the CI
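
The paired-observation calculation can be reproduced step by step. This is a sketch using only the Python standard library; the t quantile 2.015 is copied from the example rather than computed, and the variable names are illustrative.

```python
import math
from statistics import mean, stdev

# (system A, system B) measurements for the six workloads in the example.
pairs = [(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)]

# Treat the two samples as one sample of n differences.
diffs = [a - b for a, b in pairs]
n = len(diffs)
d_bar = mean(diffs)   # sample mean of the differences
s_d = stdev(diffs)    # sample SD of the differences, (n - 1) divisor

# 0.95-quantile of t with n - 1 = 5 degrees of freedom, taken from the example.
t_quantile = 2.015
half_width = t_quantile * s_d / math.sqrt(n)
ci = (d_bar - half_width, d_bar + half_width)

# Zero-mean test: the systems differ only if the CI excludes zero.
significantly_different = not (ci[0] <= 0.0 <= ci[1])

print(f"mean diff = {d_bar:.2f}, SD = {s_d:.2f}")
print(f"90% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("significantly different:", significantly_different)
```

Small differences in the last digit relative to the example can appear because the example rounds the mean and standard error before combining them.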

30 Approximate Visual Test
- Compute the confidence intervals for the means
- If the CIs don’t overlap, one system is better than the other
- Three cases (each illustrated with a pair of CIs around the means):
  - CIs do not overlap => alternatives are different
  - CIs overlap and the mean of one is in the CI of the other => not significantly different
  - CIs overlap but the mean of one is not in the CI of the other => need more testing
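
The three cases translate directly into a small decision function. This is a sketch; the function name, the returned strings, and the example intervals are all hypothetical.

```python
def visual_test(mean_a, ci_a, mean_b, ci_b):
    """Classify two alternatives from their means and confidence intervals."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    # Case 1: the intervals do not overlap at all.
    if high_a < low_b or high_b < low_a:
        return "different"
    # Case 2: they overlap and one mean lies inside the other's interval.
    if low_b <= mean_a <= high_b or low_a <= mean_b <= high_a:
        return "not significantly different"
    # Case 3: they overlap, but neither mean lies in the other's interval.
    return "need more testing"


# Hypothetical means and confidence intervals.
print(visual_test(10.0, (9.0, 11.0), 15.0, (14.0, 16.0)))
print(visual_test(10.0, (8.0, 12.0), 11.0, (9.0, 13.0)))
print(visual_test(10.0, (8.0, 12.0), 13.5, (11.0, 16.0)))
```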

31 Determining Sample Size
- Goal: find the smallest sample size n that provides the desired confidence in the results
- Method:
  - take a small set of preliminary measurements
  - estimate the variance from the measurements
  - use the estimate to determine the sample size needed for the desired accuracy
- r% accuracy => $\pm r\%$ at $100(1-\alpha)\%$ confidence
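
One way to turn this method into a calculation, assuming the normal-approximation interval for the mean: requiring the half-width $z_{1-\alpha/2}\, s/\sqrt{n}$ to equal $r\%$ of $\bar{x}$ gives $n = \left(100\, z_{1-\alpha/2}\, s / (r\, \bar{x})\right)^2$. The sketch below implements that formula; the function name and the preliminary mean and SD are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(x_bar, s, r_percent, confidence=0.95):
    """Smallest n whose CI half-width is within r% of the mean."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    # Solve z * s / sqrt(n) = (r / 100) * x_bar for n, then round up.
    return math.ceil((100.0 * z * s / (r_percent * x_bar)) ** 2)


# Hypothetical preliminary measurements: mean 20 s, SD 5 s.
n_needed = required_sample_size(x_bar=20.0, s=5.0, r_percent=5.0, confidence=0.95)
print("observations needed:", n_needed)
```

Because the variance estimate comes from a small preliminary sample, the resulting n is itself only an estimate and may need revisiting once more data is collected.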

32 Outline
- Measures of Central Tendency
- How to Summarize Variability?
- Comparing Systems Using Sample Data
- Comparing Two Alternatives
- Transient Removal

33 Transient Removal
- In many simulations, we are interested in steady-state performance
  - Remove the initial transient state
- However, defining exactly what constitutes the end of the transient state is difficult!
- Several heuristics have been developed:
  - Long runs
  - Proper initialization
  - Truncation
  - Initial data deletion
  - Moving average of replications
  - Batch means

34 Long Runs
- Use very long runs
- Impact of the transient state becomes negligible
- Wasteful use of resources
- How long is “long enough”?
- The Raj Jain text recommends that this method not be used in isolation

35 Batch Means
- Run the simulation for a long duration
- Divide the observations (N) into m batches, each of size n
- Compute the variance of the batch means using the procedure shown, for n = 2, 3, 4, 5, …
- Plot variance vs. batch size
[Figure: variance of batch means plotted against batch size n, with labels “Ignore” and “Transient interval”.]
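
The variance-versus-batch-size curve can be computed as sketched below. The function names and the synthetic observation series are hypothetical; incomplete trailing batches are simply dropped, which is an assumption made here for simplicity.

```python
from statistics import mean, variance


def variance_of_batch_means(observations, batch_size):
    """Split the run into complete batches and return the variance of their means."""
    m = len(observations) // batch_size  # number of complete batches
    if m < 2:
        raise ValueError("need at least two complete batches")
    batch_means = [
        mean(observations[i * batch_size:(i + 1) * batch_size]) for i in range(m)
    ]
    return variance(batch_means)


def variance_curve(observations, max_batch_size):
    """Variance of batch means for batch sizes n = 2 .. max_batch_size."""
    return {
        n: variance_of_batch_means(observations, n)
        for n in range(2, max_batch_size + 1)
    }


# Hypothetical simulation output: a decaying transient plus a small cycle.
observations = [10.0 / (i + 1) + (i % 3) for i in range(60)]
for n, var in variance_curve(observations, 10).items():
    print(n, round(var, 4))
```

Plotting the returned values against n gives the variance-versus-batch-size curve described above.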

