Last lecture summary Five numbers summary, percentiles, mean Box plot, modified box plot Robust statistic – mean, median, trimmed mean outlier Measures.


1 Last lecture summary
Five-number summary, percentiles, mean
Box plot, modified box plot
Robust statistics – mean, median, trimmed mean
Outliers
Measures of variability – range, IQR

2 MEASURES OF VARIABILITY

3 Problem with IQR
[Figure: three distributions, labeled normal, bimodal, and uniform]

4 Options for measuring variability
1. Find the average distance between all pairs of data values.
2. Find the average distance between each data value and either the max or the min.
3. Find the average distance between each data value and the mean.

5 Preventing cancellation
How can we prevent the negative and positive deviations from cancelling each other out?
1. Take the absolute value of each deviation.
2. Square each deviation.
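A minimal sketch of why cancellation is a problem, using the ten-value sample from the worked example on the following slides. The variable names are illustrative only.

```python
# Sample values from the worked example on the following slides.
sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]

mean = sum(sample) / len(sample)

# Signed deviations from the mean: positive and negative values cancel.
deviations = [x - mean for x in sample]
total_deviation = sum(deviations)

print(mean)             # 6.0
print(total_deviation)  # 0.0
```

Because the signed deviations always sum to zero, their plain average says nothing about spread, which motivates the two fixes listed above.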

6 Average absolute deviation
(sample mean = 6)
Sample   Deviation   Absolute deviation
10         4           4
 5        -1           1
 3        -3           3
 2        -4           4
19        13          13
 1        -5           5
 7         1           1
11         5           5
 1        -5           5
 1        -5           5
avg. absolute deviation = 4.6
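The table can be checked with a few lines of Python. This is only a sketch; the names are chosen for illustration.

```python
sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]
mean = sum(sample) / len(sample)

# Drop the sign of each deviation so that nothing cancels.
absolute_deviations = [abs(x - mean) for x in sample]
avg_abs_dev = sum(absolute_deviations) / len(sample)

print(avg_abs_dev)  # 4.6
```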

7 Average absolute deviation
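In symbols, the quantity computed on the previous slide can be written as follows. The notation is an assumption: $x_i$ are the data values, $\bar{x}$ is their mean, and $n$ is the number of values.

```latex
\text{avg. absolute deviation} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \bar{x}\right|
```

For the sample above this gives $46 / 10 = 4.6$.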

8 Squared deviations
Sample   Deviation
10         4
 5        -1
 3        -3
 2        -4
19        13
 1        -5
 7         1
11         5
 1        -5
 1        -5

9 Squared deviations
Sample   Deviation   Squared deviation
10         4           16
 5        -1            1
 3        -3            9
 2        -4           16
19        13          169
 1        -5           25
 7         1            1
11         5           25
 1        -5           25
 1        -5           25
avg. squared deviation = 31.2
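A short sketch that reproduces the average squared deviation by hand and cross-checks it against `statistics.pvariance` from the Python standard library, which divides by n.

```python
import statistics

sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]
mean = sum(sample) / len(sample)

# Squaring removes the sign and weights large deviations more heavily.
squared_deviations = [(x - mean) ** 2 for x in sample]
avg_sq_dev = sum(squared_deviations) / len(sample)

print(avg_sq_dev)                    # 31.2
print(statistics.pvariance(sample))  # same quantity from the standard library
```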

10 Variance The average squared deviation has a special name – variance (Czech: rozptyl).

11 Standard deviation
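The standard deviation is the square root of the variance, which brings the measure back to the units of the data. A sketch in assumed notation ($x_i$ for the data values, $\bar{x}$ for their mean, $n$ for the number of values), dividing by $n$ as in the preceding slides:

```latex
\text{variance} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
\text{s.d.} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

For the ten-value sample used earlier, the variance is $31.2$, so the standard deviation is $\sqrt{31.2} \approx 5.59$.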

12 What is so great about the standard deviation?
Why don’t we just find the average absolute deviation?
More on absolute vs. standard deviation: http://www.leeds.ac.uk/educol/documents/00003759.htm
Empirical rule (for approximately normal data):
about 68% of values lie within 1 s.d. of the mean
about 95% lie within 2 s.d. of the mean
about 99.7% lie within 3 s.d. of the mean
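The empirical rule can be illustrated by simulation. The sketch below draws values from a standard normal distribution with Python's `random` module and counts how many fall within 1, 2, and 3 standard deviations of the mean. The seed, the sample size, and the helper name `fraction_within` are arbitrary choices made for this example.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 100_000
data = [rng.gauss(0.0, 1.0) for _ in range(n)]

mean = sum(data) / n
sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5

def fraction_within(k):
    """Share of values lying within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / n

fractions = {k: fraction_within(k) for k in (1, 2, 3)}
for k, share in fractions.items():
    print(f"within {k} s.d.: {share:.3f}")
```

The printed shares should come out close to 0.68, 0.95, and 0.997. For data that are far from normal the rule can fail noticeably.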

13 Empirical rule The interval within 1 s.d. of the mean covers 273 data values, 66.8%.

14 Empirical rule

15 Statistical inference
The goal of statistical work: make rational conclusions or decisions based on the incomplete information we have in our data. This process is known as statistical inference.
In inferential statistics we want to be able to answer the question: “If I see something in my data, say a difference between two groups or a relationship between two variables, could this be simply due to chance? Or is it a real difference or relationship?”

16 Statistical inference If we get results that we think are not just due to chance we'd like to know what broader conclusions we can make. Can we generalize them to a larger group or even perhaps the whole world? And when we see a relationship between two variables, we'd like to know if one variable causes the other to change. The methods we use to do so and the correctness of the conclusions that we can make all depend on how the data were collected.

17 Statistical inference
Fundamental feature of data: variability. How can we picture this variation and how can we quantify it?
Population – the group we are interested in making conclusions about.
Census – a collection of data on the entire population.
Sample – if we can’t conduct a census, we collect data from a sample of the population. Goal: make conclusions about that population.

18 Statistical inference
A statistic is a value calculated from our observed data (sample). A parameter is a value that describes the population.
We want to be able to generalize what we observe in our data to our population. In order to do this, the sample needs to be representative.
How to select a representative sample? Use randomization.

19

20
population (census) vs. sample
parameter (population) vs. statistic (sample)

21 Random sampling
Simple Random Sampling (SRS) – each possible sample of a given size is equally likely to be selected from the population.
Stratified sampling – a simple random sample is taken from each subgroup of the population; subgroups: gender, age groups, …
Cluster sampling – divide the population into non-overlapping groups (clusters); the sample is a randomly chosen cluster. Example: the population is all students in an area; randomly select schools and create the sample from the students of the selected schools.
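The three sampling schemes can be sketched with Python's standard `random` module. Everything about the population below (100 students, 5 schools, the field names, the sample sizes) is hypothetical and only there to make the example runnable.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed for repeatability

# Hypothetical population: 100 students spread evenly over 5 schools.
population = [
    {"id": i, "school": f"school_{i % 5}", "gender": "F" if i % 2 else "M"}
    for i in range(100)
]

def simple_random_sample(pop, k):
    """SRS: every subset of size k is equally likely."""
    return rng.sample(pop, k)

def stratified_sample(pop, key, k_per_stratum):
    """Take a simple random sample from each subgroup (stratum)."""
    strata = defaultdict(list)
    for unit in pop:
        strata[unit[key]].append(unit)
    result = []
    for members in strata.values():
        result.extend(rng.sample(members, k_per_stratum))
    return result

def cluster_sample(pop, key, n_clusters):
    """Pick whole clusters at random and keep every unit inside them."""
    clusters = sorted({unit[key] for unit in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [unit for unit in pop if unit[key] in chosen]

srs = simple_random_sample(population, 10)
stratified = stratified_sample(population, "gender", 5)
clustered = cluster_sample(population, "school", 2)
```

Here `stratified` always contains exactly five students of each gender, while `clustered` contains every student from two randomly chosen schools.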

22 Bias
If a sample is not representative, it can introduce bias into our results. (Czech: bias – zkreslení, odchylka.)
A sample is biased if it differs from the population in a systematic way.
The Literary Digest poll, 1936 U.S. presidential election:
surveyed 10 million people – subscribers or owners of cars or telephones
2.3 million responded
predicted (3:2) that the Republican candidate would win
the Democratic candidate won
What went wrong?
only wealthy people were surveyed (selection bias)
the survey relied on voluntary response (nonresponse bias) – respondents tend to be angry people or people who want a change

23 Bessel’s correction
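The usual statement of Bessel's correction is that the sample variance divides the sum of squared deviations by $n - 1$ instead of $n$. The notation is assumed here: $x_i$ are the sample values, $\bar{x}$ is the sample mean, $n$ is the sample size, and $s$ is the sample standard deviation.

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}
```

For the ten-value sample used earlier, the sum of squared deviations is $312$, so $s^2 = 312 / 9 \approx 34.67$ compared with $312 / 10 = 31.2$ when dividing by $n$.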

24 Sample vs. population SD
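Python's standard `statistics` module exposes both versions, which makes the difference easy to see on the ten-value sample from the earlier slides. This is a sketch for illustration; the variable names are arbitrary.

```python
import statistics

sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]

sd_n = statistics.pstdev(sample)         # divides by n
sd_n_minus_1 = statistics.stdev(sample)  # divides by n - 1 (Bessel's correction)

print(round(sd_n, 2), round(sd_n_minus_1, 2))  # 5.59 5.89
```

The corrected value is always the larger of the two, and the gap shrinks as the sample size grows.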

25

26 SRS
Sampling with replacement
Generates independent samples. Two sample values are independent if what we get on the first one doesn't affect what we get on the second.
Sampling without replacement
Deliberately avoids choosing any member of the population more than once. This type of sampling is not independent; however, it is more common. The error is small as long as
1. the sample is large
2. the sample size is no more than 10% of the population size
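In Python's `random` module the two schemes correspond to `choices` (with replacement) and `sample` (without replacement). The population of 20 labeled members and the sample size of 5 are made up for this sketch.

```python
import random

rng = random.Random(1)  # fixed seed for repeatability

population = list(range(1, 21))  # hypothetical population of 20 labeled members

# With replacement: each draw is independent, so a member may repeat.
with_replacement = rng.choices(population, k=5)

# Without replacement: every member appears at most once.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```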

27 Bessel’s game
Population of all cards in a bag: 0, 2, 4
Now list all possible samples of 2 cards. Calculate sample averages. Now, half of you calculate the sample variance using /n, and half of you using /(n-1). Then average all sample variances.
Table columns to fill in: Sample, Sample average
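The game can be played out exhaustively in code. The sketch below rests on two assumptions that the slide does not state outright: the bag holds the cards 0, 2, and 4, and the two cards are drawn with replacement, so there are nine ordered samples. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

cards = [0, 2, 4]  # assumed population of cards in the bag

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((x - mean) ** 2 for x in values) / divisor

population_variance = variance(cards, len(cards))

# All ordered samples of 2 cards, drawn with replacement (assumption).
samples = list(product(cards, repeat=2))

avg_var_n = sum(variance(s, 2) for s in samples) / len(samples)          # divide by n
avg_var_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)  # divide by n - 1

print(population_variance)  # 8/3
print(avg_var_n)            # 4/3
print(avg_var_n_minus_1)    # 8/3
```

Under these assumptions, averaging the /n variances underestimates the population variance, while averaging the /(n-1) variances recovers it exactly.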

28 Measuring spread – summary
median = $112 000
mean = $518 000
trimmed median = $112 000
trimmed mean = $128 000
Data: 33 750, 44 000, 45 566, 65 000, 95 000, 103 500, 112 495, 138 188, 141 666, 181 500, 185 000, 190 000, 194 375, 195 000, 205 000, 292 500, 301 999, 4 600 000, 5 600 000

29 Measuring spread – summary
          original data   trimmed data   robust
median    $112 000        $112 000       yes
mean      $518 000        $128 000       no
range     $5 566 000      $268 000       no
IQR       $150 000        $146 000       yes
s.d.      $1 389 000      $84 000        no
Data: 33 750, 44 000, 45 566, 65 000, 95 000, 103 500, 112 495, 138 188, 141 666, 181 500, 185 000, 190 000, 194 375, 195 000, 205 000, 292 500, 301 999, 4 600 000, 5 600 000

