
1 Basic concepts of statistics: measures of central tendency; measures of dispersion & variability

2 Measures of central tendency. Arithmetic mean (= simple average): \(\mu = \frac{\sum_{i=1}^{N} X_i}{N}\), where \(\Sigma\) denotes summation, \(X_i\) is a measurement in the population, and \(i\) is the index of the measurement. The best estimate of the population mean is the sample mean, \(\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}\), where n is the sample size.

3 Measures of variability. All describe how “spread out” the data are. 1. Sum of squares: the sum of squared deviations from the mean. For a sample, \(SS = \sum_{i=1}^{n} (X_i - \bar{X})^2\).

4 2. Average or mean sum of squares = variance, s². For a sample, \(s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}\). Why n − 1?

5 n − 1 represents the degrees of freedom, ν (Greek letter “nu”), or the number of independent quantities in the estimate s². Because the n deviations from the mean must sum to zero, once n − 1 of the deviations are specified, the last deviation is already determined.

6 3. Standard deviation, s. For a sample, \(s = \sqrt{s^2}\). Variance has squared measurement units – to regain the original units, take the square root.

7 4. Standard error of the mean. For a sample, \(SE_{\bar{X}} = \frac{s}{\sqrt{n}}\). The standard error of the mean is a measure of variability among the means of repeated samples from a population.
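A minimal Python sketch of these four measures, computed directly from the formulas above (the sample values are hypothetical):

import math

sample = [44, 45, 44, 42, 43]                # hypothetical body weights (kg)
n = len(sample)
mean = sum(sample) / n                       # arithmetic mean, X-bar
ss = sum((x - mean) ** 2 for x in sample)    # sum of squared deviations
variance = ss / (n - 1)                      # sample variance s^2, n - 1 df
sd = math.sqrt(variance)                     # standard deviation s
sem = sd / math.sqrt(n)                      # standard error of the mean
print(mean, ss, variance, sd, sem)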

8 A population of values: body weight data (kg). N = 28, μ = 44, σ² = 1.214. [Grid of 28 individual weights, including values such as 44 45 44 42 43 46 42 44 45.]

9–26 Repeated random sampling from this population of values (body weight data, kg), each sample with size n = 5 values. For example: sample 1 = {43, 44, 45, 44, 44}, sample 2 = {46, 44, 46, 45, 44}, sample 3 = {42, 42, 43, 45, 43}, and so on.

27 For a large enough number of large samples, the frequency distribution of the sample means (= the sampling distribution) approaches a normal distribution.

28 Normal distribution: bell-shaped curve
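A short simulation sketch of this idea (numpy assumed available; the population parameters echo the body-weight example but are otherwise invented):

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=44, scale=1.1, size=10_000)  # hypothetical kg values

# draw many samples of n = 5 and keep each sample mean
means = [rng.choice(population, size=5, replace=False).mean()
         for _ in range(2_000)]

print(np.mean(means))          # close to the population mean, 44
print(np.std(means, ddof=1))   # close to sigma / sqrt(5), the standard error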

29 Testing statistical hypotheses between 2 means. 1. State the research question in terms of statistical hypotheses. Always start with a statement that hypothesizes “no difference”, called the null hypothesis, H₀. E.g., H₀: mean bill length of female hummingbirds is equal to mean bill length of male hummingbirds.

30 Then we formulate a statement that must be true if the null hypothesis is false, called the alternative hypothesis, Hₐ. E.g., Hₐ: mean bill length of female hummingbirds is not equal to mean bill length of male hummingbirds. If we reject H₀ as a result of sample evidence, then we conclude that Hₐ is true.

31 2. Choose an appropriate statistical test that would allow you to reject H₀ if H₀ were false. E.g., Student’s t test for hypotheses about means. [Portrait: William Sealy Gosset (a.k.a. “Student”).]

32 t statistic: \(t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}\), where \(\bar{X}_1\) and \(\bar{X}_2\) are the means of samples 1 and 2, and \(s_{\bar{X}_1 - \bar{X}_2}\) is the standard error of the difference between the sample means. To estimate \(s_{\bar{X}_1 - \bar{X}_2}\), we must first know the relation between the two populations.

33 How to evaluate the success of this experimental design class:
- Compare each student's scores in Statistics and in Experimental Design
- Compare the Experimental Design scores of students from two serial classes
- Compare the Experimental Design scores of students from two different classes

34 Comparing the Statistics and Experimental Design scores of several students: the same students → dependent populations with identical variance; different students → independent populations, with either identical or non-identical variance.

35 Comparing the Experimental Design scores of students from two serial classes: different students → independent populations, with either identical or non-identical variance.

36 Comparing the Experimental Design scores of students from two different classes: different students → independent populations, with either identical or non-identical variance.

37 Relation between populations: dependent populations, or independent populations with 1. identical (homogeneous) variance or 2. non-identical (heterogeneous) variance.

38 Dependent populations (paired samples). Null hypothesis: the mean difference is equal to μ₀ (often 0). Test statistic: \(t = \frac{\bar{d} - \mu_0}{s_d / \sqrt{n}}\), where \(\bar{d}\) is the mean of the pairwise differences. Null distribution: t with n − 1 df, where n is the number of pairs compared. How unusual is this test statistic? If P < 0.05, reject H₀; if P > 0.05, fail to reject H₀.
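A sketch of this paired test with scipy, whose ttest_rel routine implements exactly the t with n − 1 df (the score pairs below are invented):

from scipy import stats

before = [69, 64, 70, 67, 69, 69]       # hypothetical first scores
after = [72, 66, 71, 70, 68, 73]        # hypothetical second scores, paired

t, p = stats.ttest_rel(before, after)   # paired t-test, df = n - 1 = 5
print(t, p)                             # reject H0 if p < 0.05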

39 Independent populations with homogeneous variances. Pooled variance: \(s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\). Then \(t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2 (1/n_1 + 1/n_2)}}\), with n₁ + n₂ − 2 df.

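A sketch of the pooled-variance computation, checked against scipy's equal-variance t-test (the group scores are invented):

from math import sqrt
from scipy import stats

g1 = [12.1, 11.4, 12.8, 11.9, 12.3]        # hypothetical group 1
g2 = [13.0, 12.6, 13.4, 12.9, 13.1, 12.7]  # hypothetical group 2
n1, n2 = len(g1), len(g2)
m1, m2 = sum(g1) / n1, sum(g2) / n2
s1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)   # sample variances
s2 = sum((x - m2) ** 2 for x in g2) / (n2 - 1)

sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)   # pooled variance
t = (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))           # df = n1 + n2 - 2
print(t)
print(stats.ttest_ind(g1, g2))   # same t: scipy pools variances by default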

41 When sample sizes are small, the sampling distribution is described better by the t distribution than by the standard normal (Z) distribution. The shape of the t distribution depends on the degrees of freedom, ν = n − 1.

42 [Figure: t distributions for ν = 1, 5, and 25; as ν → ∞, t(ν) coincides with the standard normal Z.]

43 The distribution of a test statistic is divided into an area of acceptance and an area of rejection. [Figure: two-tailed t distribution for α = 0.05, with areas of rejection (0.025 in each tail) beyond the lower and upper critical values and an area of acceptance (0.95) between them.]

44 Critical t for a test about equality: \(t_{\alpha(2),\nu}\), the two-tailed critical value with ν degrees of freedom.

45 Independent populations with heterogeneous variances: \(t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}\) (Welch's t-test), with degrees of freedom given by the Welch–Satterthwaite approximation.
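In scipy the heterogeneous-variance version is one flag away, a sketch (same invented groups as above):

from scipy import stats

g1 = [12.1, 11.4, 12.8, 11.9, 12.3]
g2 = [13.0, 12.6, 13.4, 12.9, 13.1, 12.7]
t, p = stats.ttest_ind(g1, g2, equal_var=False)   # Welch's t-test,
print(t, p)                                       # Welch-Satterthwaite df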

46 Analysis of Variance (ANOVA)

47 Independent t-test: compares the means of one variable for TWO groups of cases. Statistical formula: compare the ‘standardized’ mean difference. But this is limited to two groups. What if there are more than 2 groups? Options: pairwise t-tests (previous example) or ANOVA (ANalysis Of Variance).

48 From t-test to ANOVA. 1. Pairwise t-tests. If you compare three or more groups using t-tests at the usual 0.05 level of significance, you have to compare each pair (A to B, A to C, B to C), so the chance of getting at least one wrong result would be 1 − (0.95 × 0.95 × 0.95) ≈ 14.3%. Multiple t-tests inflate the false-alarm (Type I error) rate.
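The same arithmetic for any number of groups, as a sketch:

from math import comb

alpha, k = 0.05, 3             # significance level, number of groups
m = comb(k, 2)                 # pairwise comparisons: 3 for k = 3
print(1 - (1 - alpha) ** m)    # familywise error rate, ~0.143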

49 From t-test to ANOVA. 2. Analysis of variance. In the t-test, the mean difference is used. Similarly, in ANOVA the observed variance among the group means is compared. The logic behind ANOVA: if the groups are from the same population, the variance among the means will be small (note that the group means are not exactly the same); if the groups are from different populations, the variance among the means will be large.

50 What is ANOVA? ANOVA (Analysis of Variance) is a procedure designed to determine if the manipulation of one or more independent variables in an experiment has a statistically significant influence on the value of the dependent variable. Assumptions: each independent variable is categorical (nominal scale); independent variables are called factors and their values are called levels; the dependent variable is numerical (ratio scale). The basic idea is that the “variance” of the dependent variable given the influence of one or more independent variables (the expected sum of squares for a factor) is checked to see if it is significantly greater than the “variance” of the dependent variable assuming no influence of the independent variables (also known as the mean square error, MSE).

51 Pair-t-Test example data (note the unequal group sizes, n = 6 vs n = 7, so an independent-samples comparison applies here):
Class A: Amir 69, Abas 64, Abi 70, Aura 67, Ana 69, Anis 69 (average 68, n = 6, sample variance 4.8)
Class B: Budi 82, Berta 78, Bambang 82, Banu 81, Betty 82, Bagus 77, Berth 78 (average 80, n = 7, sample variance 5.0)
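Running an independent two-sample test on this table, a sketch:

from scipy import stats

class_a = [69, 64, 70, 67, 69, 69]         # Amir ... Anis
class_b = [82, 78, 82, 81, 82, 77, 78]     # Budi ... Berth
t, p = stats.ttest_ind(class_a, class_b)   # pooled-variance t-test
print(t, p)                                # large |t|, tiny p: means differ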

52 ANOVA table of 2 populations:
Source of variation | SS | DF | Mean square (MS)
Between populations | SS_between | 1 | MSB = SSB / DFB
Within populations | SS_within | (r1 − 1) + (r2 − 1) | MSW = SSW / DFW
Total | SS_total | r1 + r2 − 1 | s²

53 Rationale for ANOVA: we can break the total variance in a study into meaningful pieces that correspond to treatment effects and error. That's why we call this analysis of variance. Notation: \(\bar{X}\), the grand mean, taken over all observations; \(\bar{X}_j\), the mean of any group; \(\bar{X}_1\), the mean of a specific group (group 1 in this case); \(X_{ij}\), the observation or raw datum for the i-th subject in group j.

54 The ANOVA model: \(X_{ij} = \mu + \tau_j + \epsilon_{ij}\), i.e. trial i = the grand mean + a treatment effect + error. Note: SS_Total = SS_Treatment + SS_Error.

55 Analysis of Variance. ANOVA can be used to test for the equality of three or more population means using data obtained from observational or experimental studies. Use the sample results to test the following hypotheses: H₀: μ₁ = μ₂ = μ₃ = … = μ_k; Hₐ: not all population means are equal. If H₀ is rejected, we cannot conclude that all population means are different; rejecting H₀ means that at least two population means have different values.

56 Assumptions for analysis of variance:
- For each population, the response variable is normally distributed.
- The variance of the response variable, denoted σ², is the same for all of the populations.
- The effect of the independent variable is additive.
- The observations must be independent.

57 Analysis of variance: testing for the equality of k population means. Topics: between-treatments estimate of population variance; within-treatments estimate of population variance; comparing the variance estimates: the F test; the ANOVA table.

58 Between-treatments estimate of population variance. A between-treatments estimate of σ² is called the mean square due to treatments (MSTR): MSTR = SSTR / (k − 1). The numerator of MSTR is called the sum of squares due to treatments (SSTR); the denominator, k − 1, is the degrees of freedom associated with SSTR.

59 Within-treatments estimate of population variance. The estimate of σ² based on the variation of the sample observations within each treatment is called the mean square due to error (MSE): MSE = SSE / (n_T − k). The numerator of MSE is called the sum of squares due to error (SSE); the denominator, n_T − k, is the degrees of freedom associated with SSE.

60 Comparing the variance estimates: the F test. If the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of MSTR/MSE is an F distribution with MSTR d.f. equal to k − 1 and MSE d.f. equal to n_T − k. If the means of the k populations are not equal, the value of MSTR/MSE will be inflated because MSTR overestimates σ². Hence, we will reject H₀ if the resulting value of MSTR/MSE appears to be too large to have been selected at random from the appropriate F distribution.

61 Test for the equality of k population means. Hypotheses: H₀: μ₁ = μ₂ = μ₃ = … = μ_k; Hₐ: not all population means are equal. Test statistic: F = MSTR/MSE.

62 Test for the equality of k population means. Rejection rule: using the test statistic, reject H₀ if F > F_α; using the p-value, reject H₀ if p-value < α, where the value of F_α is based on an F distribution with k − 1 numerator degrees of freedom and n_T − k denominator degrees of freedom.
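Both forms of the rule in scipy, a sketch (k, n_T, and the observed F are placeholders):

from scipy import stats

k, n_T, alpha = 3, 15, 0.05
F_crit = stats.f.ppf(1 - alpha, k - 1, n_T - k)   # critical value F_alpha
print(F_crit)                                     # ~3.89 for (2, 12) df

F_obs = 9.55                                      # an observed test statistic
p = stats.f.sf(F_obs, k - 1, n_T - k)             # upper-tail p-value
print("reject H0" if p < alpha else "fail to reject H0")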

63 Sampling distribution of MSTR/MSE. The figure below shows the rejection region associated with a level of significance equal to α, where F_α denotes the critical value. [Figure: F distribution with “do not reject H₀” to the left of the critical value F_α and “reject H₀” to its right.]

64 ANOVA table:
Source of variation | Sum of squares | Degrees of freedom | Mean square | F
Treatment | SSTR | k − 1 | MSTR | MSTR/MSE
Error | SSE | n_T − k | MSE |
Total | SST | n_T − 1 | |
SST divided by its degrees of freedom, n_T − 1, is simply the overall sample variance that would be obtained if we treated the entire n_T observations as one data set.

65 What does ANOVA tell us? ANOVA will tell us whether we have sufficient evidence to say that measurements from at least one treatment differ significantly from at least one other. It will not tell us which ones differ, or how many differ.

66 ANOVA vs t-test:
- ANOVA is like a t-test among multiple data sets simultaneously; t-tests can only be done between two data sets, or between one set and a “true” value.
- ANOVA uses the F distribution instead of the t distribution.
- ANOVA assumes that all of the data sets have equal variances; use caution on close decisions if they don't.

67 ANOVA – a hypothesis test. H₀: there is no significant difference among the results provided by the treatments. Hₐ: at least one of the treatments provides results significantly different from at least one other.

68 Linear model: \(Y_{ij} = \mu + \tau_j + \epsilon_{ij}\). By definition, \(\sum_{j=1}^{t} \tau_j = 0\). The experiment produces (r × t) Y_ij data values. The analysis produces estimates of μ, τ₁, τ₂, …, τ_t. (We can then get estimates of the ε_ij by subtraction.)

69 Data layout, r rows by t columns:
Y_11 Y_12 Y_13 … Y_1t
Y_21 Y_22 Y_23 … Y_2t
Y_31 Y_32 Y_33 … Y_3t
⋮
Y_r1 Y_r2 Y_r3 … Y_rt
Column means: Ȳ_.1, Ȳ_.2, Ȳ_.3, …, Ȳ_.t

70 Y =  Y j / t = “GRAND MEAN” (assuming same # data points in each column) (otherwise, Y = mean of all the data) j=1 t

71 Model: \(Y_{ij} = \mu + \tau_j + \epsilon_{ij}\). \(\bar{\bar{Y}}\) estimates μ; \(\bar{Y}_{.j} - \bar{\bar{Y}}\) estimates τ_j (= μ_j − μ), for all j. These estimates are based on Gauss' (1796) principle of least squares and on common sense.

72 Model: \(Y_{ij} = \mu + \tau_j + \epsilon_{ij}\). If you insert the estimates into the model, (1) \(Y_{ij} = \bar{\bar{Y}} + (\bar{Y}_{.j} - \bar{\bar{Y}}) + \hat{\epsilon}_{ij}\), and it follows that our estimate of ε_ij is (2) \(\hat{\epsilon}_{ij} = Y_{ij} - \bar{Y}_{.j}\).

73 Then \(Y_{ij} = \bar{\bar{Y}} + (\bar{Y}_{.j} - \bar{\bar{Y}}) + (Y_{ij} - \bar{Y}_{.j})\), or (3) \((Y_{ij} - \bar{\bar{Y}}) = (\bar{Y}_{.j} - \bar{\bar{Y}}) + (Y_{ij} - \bar{Y}_{.j})\): total variability in Y = variability in Y associated with X + variability in Y associated with all other factors.

74 If you square both sides of (3) and double-sum both sides (over i and j), you get, after some unpleasant algebra but lots of terms which “cancel”: \(\sum_{j=1}^{t} \sum_{i=1}^{r} (Y_{ij} - \bar{\bar{Y}})^2 = r \sum_{j=1}^{t} (\bar{Y}_{.j} - \bar{\bar{Y}})^2 + \sum_{j=1}^{t} \sum_{i=1}^{r} (Y_{ij} - \bar{Y}_{.j})^2\), i.e. TSS (total sum of squares) = SSB_C (sum of squares between columns) + SSW, also called SSE (sum of squares within columns).
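A numerical check of this identity, sketched with numpy on an invented r × t table:

import numpy as np

Y = np.array([[48, 73, 51],        # r = 5 rows, t = 3 columns
              [54, 63, 63],        # (values invented for illustration)
              [57, 66, 61],
              [54, 64, 54],
              [62, 74, 56]])
r, t = Y.shape
grand = Y.mean()                   # grand mean
col = Y.mean(axis=0)               # column means

tss = ((Y - grand) ** 2).sum()
ssb = r * ((col - grand) ** 2).sum()
ssw = ((Y - col) ** 2).sum()       # broadcasting subtracts each column mean
print(tss, ssb + ssw)              # equal: TSS = SSB_C + SSW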

75 ANOVA table:
Source of variation | SS | DF | Mean square (MS)
Between columns (due to brand) | SSB_C | t − 1 | MSB_C = SSB_C / (t − 1)
Within columns (due to error) | SSW | (r − 1)t | MSW = SSW / ((r − 1)t)
Total | TSS | tr − 1 |

76 Hypotheses: H₀: τ₁ = τ₂ = … = τ_t = 0; H₁: not all τ_j = 0. Or, equivalently: H₀: μ₁ = μ₂ = … = μ_t (all column means are equal); H₁: not all μ_j are equal.

77 Assuming H₀ is true, the probability law of MSB_C / MSW = “F_calc” is the F distribution with (t − 1, (r − 1)t) degrees of freedom. [Figure: F distribution with rejection region of area α beyond the table value.]

78 Example: Reed Manufacturing. Reed would like to know if the mean number of hours worked per week is the same for the department managers at her three manufacturing plants (Buffalo, Pittsburgh, and Detroit). A simple random sample of 5 managers from each of the three plants was taken, and the number of hours worked by each manager in the previous week was recorded.

79 Example: Reed Manufacturing. Sample data:
Observation | Plant 1 (Buffalo) | Plant 2 (Pittsburgh) | Plant 3 (Detroit)
1 | 48 | 73 | 51
2 | 54 | 63 | 63
3 | 57 | 66 | 61
4 | 54 | 64 | 54
5 | 62 | 74 | 56
Sample mean | 55 | 68 | 57
Sample variance | 26.0 | 26.5 | 24.5

80 Example: Reed Manufacturing. Hypotheses: H₀: μ₁ = μ₂ = μ₃; Hₐ: not all the means are equal, where μ₁, μ₂, and μ₃ are the mean numbers of hours worked per week by the managers at Plants 1, 2, and 3, respectively.

81 Example: Reed Manufacturing. Mean square due to treatments: since the sample sizes are all equal, the grand mean = (55 + 68 + 57)/3 = 60; SSTR = 5(55 − 60)² + 5(68 − 60)² + 5(57 − 60)² = 490; MSTR = 490/(3 − 1) = 245. Mean square due to error: SSE = 4(26.0) + 4(26.5) + 4(24.5) = 308; MSE = 308/(15 − 3) = 25.667.

82 Example: Reed Manufacturing. F test: if H₀ is true, the ratio MSTR/MSE should be near 1 because both MSTR and MSE are estimating σ². If Hₐ is true, the ratio should be significantly larger than 1 because MSTR tends to overestimate σ².

83 Rejection rule: using the test statistic, reject H₀ if F > 3.89; using the p-value, reject H₀ if p-value < .05, where F.05 = 3.89 is based on an F distribution with 2 numerator degrees of freedom and 12 denominator degrees of freedom.

84 Example: Reed Manufacturing. Test statistic: F = MSTR/MSE = 245/25.667 = 9.55. Conclusion: F = 9.55 > F.05 = 3.89, so we reject H₀. The mean number of hours worked per week by department managers is not the same at each plant.

85 Example: Reed Manufacturing. ANOVA table:
Source of variation | Sum of squares | Degrees of freedom | Mean square | F
Treatments | 490 | 2 | 245 | 9.55
Error | 308 | 12 | 25.667 |
Total | 798 | 14 | |
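The whole Reed calculation reproduces in a few lines with scipy (data from slide 79):

from scipy import stats

buffalo = [48, 54, 57, 54, 62]
pittsburgh = [73, 63, 66, 64, 74]
detroit = [51, 63, 61, 54, 56]

F, p = stats.f_oneway(buffalo, pittsburgh, detroit)   # one-way ANOVA
print(F, p)    # F = 9.55, p = .0033 < .05, so reject H0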

86 Using Excel's Anova: Single Factor tool. Step 1: select the Tools pull-down menu. Step 2: choose the Data Analysis option. Step 3: choose Anova: Single Factor from the list of analysis tools.

87 Using Excel's Anova: Single Factor tool. Step 4: when the Anova: Single Factor dialog box appears, enter B1:D6 in the Input Range box; select Grouped By Columns; select Labels in First Row; enter .05 in the Alpha box; select Output Range; enter A8 (your choice) in the Output Range box; click OK.

88 Using Excel's Anova: Single Factor tool. [Screenshot: value worksheet, top portion.]

89 Using Excel's Anova: Single Factor tool. [Screenshot: value worksheet, bottom portion.]

90 Using the p-value: the value worksheet shows that the p-value is .00331. The rejection rule is “Reject H₀ if p-value < .05”. Thus, we reject H₀ because the p-value = .00331 < α = .05. We conclude that the mean number of hours worked per week by the managers differs among the three plants.

