
1 ANOVA and non-parametric test
Introduction to Statistical Methods for Measuring "Omics" and Field Data

2 Overview
Measure of central tendency (summary statistics)
Measure of variability
Hypothesis testing
ANOVA

3 Why study statistics
We need to understand and use data to make decisions. To do this, we must:
- Decide whether existing data are adequate.
- If necessary, collect more data.
- Summarize the available data into useful information.
- Analyze the available data.
- Draw conclusions and make decisions.

4 Two types of statistics
Descriptive statistics
- data used for descriptive purposes; no predictions
- use tables and graphs
Inferential statistics
- employ data to draw inferences (conclusions)
- sample data employed to make inferences about populations
- sample must be random

5 Measures of central tendency
A single value that attempts to describe a set of data by identifying its central position. The most common measures of central tendency are:
- Mean
- Median
- Mode

6 What is the mean?
The "mean" is the average score or value; for example, the average age of participants in this workshop.
Inferential mean of a sample: x̄ = (ΣX)/n
Mean of a population: μ = (ΣX)/N

7 Mean of students
Student (group 1)   Age (group 1)
Schmuggles          28
Bopsey              31
Pallitto            49
Homer               42
Schnickerson        30
Levin               39
Honkey-Doorey       32
Zingers             80
Amy Boehmer         48
Queenie             40
Mean: group 1 = 41.9; group 2 = 36.9
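A quick check of the group 1 mean in R, typing the ages in from the table above:

ages <- c(28, 31, 49, 42, 30, 39, 32, 80, 48, 40)
mean(ages)   # 41.9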

8 The median
Because the mean (average) can be sensitive to extreme values, the median is sometimes more useful and more representative. The median is simply the middle value among the ordered scores of a variable (there is no standard formula for its computation).

9 What is the median?
Student         Weight
Schmuggles      165
Bopsey          213
Pallitto        189
Homer           187
Schnickerson    165
Levin           148
Honkey-Doorey   251
Zingers         308
Boehmer         151
Queenie         132
Googles-Boop    199
Calzone         227
Mean            194.6
Rank order the weights and choose the middle value:
132 148 151 165 165 187 189 199 213 227 251 308
If the count is even, average the two values in the middle: (187 + 189)/2 = 188
Median = 188

10 The Mode The most frequent response or value for a variable.
Multiple modes are possible: bimodal or multimodal.

11 Figuring the Mode
Student         Weight
Schmuggles      165
Bopsey          213
Pallitto        189
Homer           187
Schnickerson    165
Levin           148
Honkey-Doorey   251
Zingers         308
Boehmer         151
Queenie         132
Googles-Boop    199
Calzone         227
What is the mode? Answer: 165, the only weight that appears twice.
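The median and mode can be verified in R. A quick sketch; the weight vector is typed in from the table, with Schnickerson's 165 reconstructed from the stated mean (194.6) and mode (165):

weights <- c(165, 213, 189, 187, 165, 148, 251, 308, 151, 132, 199, 227)
median(weights)                    # 188
table(weights)                     # 165 appears twice; every other value once
names(which.max(table(weights)))   # "165", the mode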

12 RSTUDIO
Mean is calculated using the function: mean
Median is calculated using the function: median
Mode can be found using the function: table

13 R-script
# R Script for Data Management and GBS Workshop Thursday, 6/8/17
# Script for calculating mean, median, mode, standard deviation and variance.
# Dataset of students' weight named Mean with 12 observations and 2 variables.
# Variable names are: Student and Weight.

# Load the data into RStudio
Mean <- read.csv("/Desktop_2017/GBS/Data_Management/Thursday/Mean.csv")

# Show the dimensions of the data
dim(Mean)

# Calculate the mean, median and mode of weight.
# Store the results under new names so the Mean data frame is not overwritten.
mean_wt <- mean(Mean$Weight)
mean_wt
median_wt <- median(Mean$Weight)
median_wt
# table() tallies each weight; the most frequent value is the mode
MODE <- table(Mean$Weight)
MODE

14 Variability
Measures of central tendency give only partial information about a data set. It is also important to describe how much the observations differ from one another. Measures of dispersion give us information about how much our variables vary from the mean; without this information it is difficult to infer anything from the data. Dispersion is also known as the spread or range of variability.

15 Measure of dispersion
Measures of dispersion tell us about variability in the data. The basic question: how much do values of a variable differ from the minimum to the maximum, and how far apart are the scores in between? We use:
- Range
- Standard deviation
- Variance


18 The standard deviation
A standardized measure of distance from the mean. Very useful, and something you will often see used when making predictions or other statements about the data.

19 Formula for standard deviation
=square root =sum (sigma) X=score for each point in data _ X=mean of scores for the variable n=sample size (number of observations or cases

20 Measure of Variability
Standard deviation is a measure of how spread out scores are.
Example: 79, 56, 74, 22, 35, 87, 60, 61, 62, 21, 23
Start by calculating the mean.
Sample No   Score X
1           21
2           22
3           23
4           35
5           56
6           60
7           61
8           62
9           74
10          79
11          87
Total       580
Mean = 580/11 = 52.7

21 Measure of Variability
Standard deviation is a measure of how spread out scores are.
Example: 79, 56, 74, 22, 35, 87, 60, 61, 62, 21, 23
Deviations from the mean:
Score X   X − 52.7
21        −31.7
22        −30.7
23        −29.7
35        −17.7
56        3.3
60        7.3
61        8.3
62        9.3
74        21.3
79        26.3
87        34.3
Mean = 580/11 = 52.7

22 Measure of Variability
Square each deviation:
Score X   X − 52.7   (X − 52.7)²
21        −31.7      1004.8
22        −30.7      942.5
23        −29.7      882.1
35        −17.7      313.3
56        3.3        10.9
60        7.3        53.3
61        8.3        68.9
62        9.3        86.5
74        21.3       453.7
79        26.3       691.7
87        34.3       1176.5
Mean = 580/11 = 52.7

23 Measure of Variability
Same table as above; summing the squared deviations gives
Σ(X − 52.7)² = 1004.8 + 942.5 + 882.1 + 313.3 + 10.9 + 53.3 + 68.9 + 86.5 + 453.7 + 691.7 + 1176.5 = 5684.2

24 Measure of Variability
Σ(X − 52.7)² = 5684.2, with mean = 580/11 = 52.7
Standard deviation = sqrt(5684.2/(11 − 1)) = 23.84
Variance = standard deviation × standard deviation = 23.84 × 23.84 = 568.3
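The whole calculation can be checked in R with the built-in functions, using the 11 scores from the example:

x <- c(79, 56, 74, 22, 35, 87, 60, 61, 62, 21, 23)
sum((x - mean(x))^2)                          # 5684.2, sum of squared deviations
sqrt(sum((x - mean(x))^2) / (length(x) - 1))  # 23.84, identical to sd(x)
var(x)                                        # 568.4 (the slide's 568.3 squares the rounded SD)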

25 Standard error of the mean
Standard error of the mean is a measure that estimates how close a calculated mean is likely to be to the true mean of that population. The standard error is the standard deviation of a data set divided by the square root of the number of observations. The standard error of the example scores = 23.84/√11 = 7.19.
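In R, continuing with the same scores:

sd(x) / sqrt(length(x))   # 23.84 / sqrt(11), about 7.19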

26 RSTUDIO
Mean is calculated using the function: mean
Median is calculated using the function: median
Mode can be found using the function: table
Standard deviation is calculated using: sd
Variance is calculated using: var

27 R-script
# Add standard deviation and variance to the script from slide 13
SD <- sd(Mean$Weight)
SD
VAR <- var(Mean$Weight)
VAR

28 Hypothesis testing

29 Hypothesis testing
A hypothesis is a claim or statement about the value of a single population characteristic or the values of several population characteristics. A test of hypotheses is a method that uses sample data to decide between two competing claims (hypotheses) about a population characteristic. The null hypothesis (Ho) is a claim about a population characteristic that is initially assumed to be true. The alternative hypothesis (Ha) is the competing claim.

30 Hypothesis testing
The form of the null hypothesis is:
- Ho: population characteristic = hypothesized value.
The alternative hypothesis will have one of the following three forms:
- Ha: population characteristic > hypothesized value.
- Ha: population characteristic < hypothesized value.
- Ha: population characteristic ≠ hypothesized value.

31 Hypothesis testing
Methods used in hypothesis testing:
- Z-test
- T-test
- ANOVA
- Non-parametric statistics

32 ANOVA

33 Completely randomized design
Subjects are randomly assigned to treatments. This design relies on randomization to control for the effects of extraneous variables.
25 plots, five treatments:
- T1 = Blue
- T2 = Red
- T3 = Army green
- T4 = Yellow
- T5 = Purple
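Such an assignment can be generated in R. A small sketch; the treatment labels follow the slide, while the grid layout is an assumption:

# Five treatments, five plots each, randomly placed on 25 plots
set.seed(42)                  # for a reproducible randomization
treatments <- rep(c("T1", "T2", "T3", "T4", "T5"), each = 5)
field <- sample(treatments)   # random permutation over the 25 plots
matrix(field, nrow = 5)       # view as a 5 x 5 grid of plots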

34 Randomized complete block design
Subjects are divided into subgroups called blocks, such that the variability within blocks is less than the variability between blocks.
[Diagram: field laid out in Block 1 through Block 5]
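A blocked analysis in R simply adds the block term to the model. A minimal sketch with a hypothetical data frame; Yield, Treatment, and Block are illustrative names and the response values are simulated:

# 5 treatments randomized within each of 5 blocks
rcbd <- data.frame(Block = factor(rep(1:5, each = 5)),
                   Treatment = factor(rep(c("T1", "T2", "T3", "T4", "T5"), times = 5)))
set.seed(7)
rcbd$Yield <- rnorm(25, mean = 50, sd = 5)  # placeholder response values
# Including Block removes between-block variability from the error term
summary(aov(Yield ~ Treatment + Block, data = rcbd))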

35 Definition
ANOVA (ANALYSIS OF VARIANCE) is the analysis of variation in an experimental outcome, especially of a statistical variance, in order to determine the contributions of given factors or variables to the variance.
Remember: Variance = the square of the standard deviation.

36 Introduction
Any data set has variability. Variability exists within groups and between groups. The question that ANOVA allows us to answer: is this variability significant, or merely due to chance?

37 ANOVA model assumptions
Independence (the response variables y_i are independent).
Normality (the response variables are normally distributed).
Homoscedasticity (the response variables have the same variance).

38 ANOVA model assumptions
Test for normality: Shapiro-Wilk test
Testing variances: Levene's test for equality of variances
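Both checks can be run in R once the model is fitted. A minimal sketch, assuming the AOV fit and ANOVA data frame defined on the ANOVA-in-R slide below; leveneTest() comes from the car package, not base R:

# Shapiro-Wilk test of normality on the model residuals
shapiro.test(residuals(AOV))
# Levene's test for equality of variances across groups
install.packages("car")
library(car)
leveneTest(Data ~ Group, data = ANOVA)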

39 What does ANOVA do?
At its simplest, ANOVA tests the following hypotheses:
H0: The means of all the groups are equal.
Ha: Not all the means are equal.
ANOVA doesn't say how or which means differ; we can follow up with "multiple comparisons".

40 What ANOVA Cannot Do
Tell which groups are different: a post-hoc test of mean differences is required.
Compare multiple parameters for multiple groups (so it cannot be used for multiple response variables).

41 ANOVA variations
One-way ANOVA
Factorial (two-way, three-way) ANOVA
Repeated measures ANOVA
Nested ANOVA
MANOVA (multivariate analysis of variance): multiple response variables
Two-way ANOVA: 2 independent variables (2 factors), e.g., seed type and fertilizer type
MANOVA: can analyze multiple dependent (response) variables simultaneously
ANCOVA: combines simple linear regression and one-way ANOVA

42 Summary
ANOVA:
- allows us to test whether variability in a data set is between groups or merely within groups
- is more versatile than the t-test; it can compare multiple groups at once
- cannot process multiple response variables
- does not indicate which groups are different

43 One-Way ANOVA
One factor (manipulated variable)
One response variable
Two or more groups to compare

44 Between and within groups
Group 1 (5 students), Group 2 (5 students), Group 3 (5 students), Group 4 (5 students)
Four groups: Group 1 = Blue, Group 2 = Red, Group 3 = Army green, Group 4 = Yellow

45 How ANOVA works
ANOVA measures two sources of variation in the data and compares their relative sizes:
- Variation BETWEEN groups: for each data value, look at the difference between its group mean and the overall mean.
- Variation WITHIN groups: for each data value, look at the difference between that value and the mean of its group.

46 Between and within groups
Group 1 (5 students), Group 2 (5 students), Group 3 (5 students), Group 4 (5 students)
[Diagram: between-groups vs. within-groups variation]

47 F-Ratio
The ANOVA F-statistic is the ratio of the between-group variation to the within-group variation:
F = (between-group variation) / (within-group variation)
A large F is evidence against H0, since it indicates that there is more difference between groups than within groups.

48 F-Ratio = MS_between / MS_within
If the ratio of between-groups MS to within-groups MS is LARGE: reject H0; there is a difference between groups.
If the ratio of between-groups MS to within-groups MS is SMALL: do not reject H0; there is no evidence of a difference between groups.

49 p-values
Use the F-table to determine p.
Use the df for the numerator and denominator.
Choose a level of significance.
If the calculated F value > critical value, reject the null hypothesis (for a one-tailed test).

50 F-table

51 Computing ANOVA F statistic
Overall mean: 6.44
MSG = 5.106/(3 − 1) = 2.553
MSW = 1.757/(10 − 3) = 0.251
F = 2.553/0.251 = 10.17
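R can look up the critical value and the p-value for this F statistic directly; a quick sketch using the degrees of freedom from this slide (2 and 7):

# Critical value at alpha = .05 for df = (2, 7)
qf(0.95, df1 = 2, df2 = 7)                      # about 4.74
# p-value for the observed F = 10.17
pf(10.17, df1 = 2, df2 = 7, lower.tail = FALSE) # about 0.009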

52 F-table

53 ANOVA in R
ANOVA is calculated using the functions aov and lm.
ANOVA <- read.csv("/Desktop_2017/GBS/Data_Management/Thursday/ANOVA.csv")
# We need to convert the variable Group to a factor
ANOVA$Group <- as.factor(ANOVA$Group)
# Use the aov function
AOV <- aov(Data ~ Group, data = ANOVA)
summary(AOV)
# Use the lm function
LM <- lm(Data ~ Group, data = ANOVA)
anova(LM)

54 R ANOVA Output
            Df  Sum Sq  Mean Sq  F value  Pr(>F)
Group        …       …        …        …     **
Residuals    …       …        …
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Group df is 1 less than the number of groups. Residuals df is the number of data values minus the number of groups (equals the df for each group added together).

55 R ANOVA Output
Analysis of Variance
Source      DF   SS   MS   F   P
treatment    …    …    …   …   …
Error        …    …    …
Total        …    …
SS stands for sum of squares; ANOVA splits this into 3 parts (treatment, error, and total).

56 R ANOVA Output
Analysis of Variance
Source      DF   SS   MS   F   P
treatment    …    …    …   …   …
Error        …    …    …
Total        …    …
MSG = SSG / DFG, MSE = SSE / DFE
F = MSG / MSE
The p-value comes from F(DFG, DFE). (P-values for the F statistic are in the F-table.)

57 R² Statistic
R² gives the percent of variance due to between-group variation.
We will discuss R² more when we study regression.
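For the one-way fit above, R² can be pulled straight from the lm object; a sketch using the LM fit from the ANOVA-in-R slide:

# Proportion of total variation explained by between-group differences
summary(LM)$r.squared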

58 Where’s the Difference?
Once ANOVA indicates that the groups do not all appear to have the same means, what do we do? Post-hoc comparisons of treatments.

59 Post-hoc Comparisons of Treatments
If differences in group means are detected with the F-test, researchers want to compare pairs of groups. Three popular methods include:
- Fisher's LSD: upon rejecting the null hypothesis of no differences in group means, the LSD method is equivalent to doing pairwise comparisons among all pairs of groups.
- Tukey's method: specifically compares all t(t − 1)/2 pairs of groups, where t is the number of groups.
- Bonferroni's method: adjusts individual comparison error rates so that all conclusions will be correct at the desired confidence/significance level. Any number of comparisons can be made. A very general approach that can be applied to any inferential problem.

60 LSD and Tukey using RStudio
LSD: function LSD.test
Tukey: function HSD.test
install.packages("agricolae")
library(agricolae)
# Calculate LSD
AOV <- aov(Data ~ Group, data = ANOVA)
LSD_out <- LSD.test(AOV, "Group", group = TRUE)
# Calculate HSD
HSD_out <- HSD.test(AOV, "Group", group = TRUE)

61 Factorial ANOVA

62 Factorial ANOVA
ANOVA in which there is more than one independent variable (factor).
Used much more often than one-way (simple) ANOVA.
Seldom uses more than three or four factors.

63 Factorial ANOVA
A two-way ANOVA can investigate the main effects of each of two independent factor variables, as well as the effect of their interaction.
Appropriate data: two-way data, that is, one dependent variable measured across two independent factor variables.
The dependent variable is interval/ratio and continuous.
Each independent variable is a factor with two or more levels, that is, two or more groups for each independent variable.

64 Factorial ANOVA
Hypotheses:
Null hypothesis 1: the means of the dependent variable for each group in the first independent variable are equal.
Alternative hypothesis 1 (two-sided): the means of the dependent variable for each group in the first independent variable are not equal.
Similar hypotheses are considered for the other independent variable, and for the interaction effect.

65 Factorial ANOVA
Interpretation:
Reporting significant results for the effect of one independent variable as "Significant differences were found among means for groups." is acceptable. Alternatively: "A significant effect for independent variable on dependent variable was found."
Reporting significant results for the interaction effect as "A significant effect for the interaction of independent variable 1 and independent variable 2 on dependent variable was found." is acceptable.
Reporting significant results for mean-separation post-hoc tests as "Mean of variable Y for group A was different than that for group B." is acceptable.

66 Components of Factorial ANOVA
Interaction: present when the differences between the groups of one independent variable on the dependent variable vary according to the level of a second independent variable (e.g., crop yield with fertilizer and varieties).
Main effects: the test of each independent variable (IV) when all other independent variables are held constant.

67 Questions of Interest
Generally, the questions of interest here (i.e., the hypotheses to be tested) concern three questions regarding the potential effects of the factors on the response variable.
Question 1: Do the effects that factors a and b have on the response variable interact? That is, is there a significant interaction between factors a and b?

68 Questions of Interest
Question 1: If we conclude there is a significant interaction, then we conclude the effects of both factors a and b are significant! When we have an interaction we cannot consider the effect of either factor independently of the other; therefore both factors matter.

69 Questions of Interest
If there is not a significant interaction effect, then we can consider the main effects separately, i.e., we ask the following:
Question 2: Does factor a alone have a significant effect?
Question 3: Does factor b alone have a significant effect?

70 Tests of Hypotheses Just as we had Sums of Squares and Mean Squares in One-way ANOVA, we have the same in Two-way ANOVA: Recall, Mean Squares are measures of variability across the levels of the relevant factor of interest. In balanced Two-way ANOVA, we measure the overall variability in the data.

71 Two-way ANOVA Table
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-ratio            P-value
Factor A              a − 1                SSA              MSA           FA = MSA / MSE     tail area
Factor B              b − 1                SSB              MSB           FB = MSB / MSE     tail area
Interaction           (a − 1)(b − 1)       SSAB             MSAB          FAB = MSAB / MSE   tail area
Error                 ab(n − 1)            SSE              MSE
Total                 abn − 1              SST
Our initial focus is the p-value for Question 1: is there an interaction effect?

72 Tests of Hypotheses If the interaction is not statistically significant (i.e. p-value > 0.05) then we conclude the main effects (if present) are independent of one another. We can then test for significance of the main effects separately, again using an F-test. If a main effect is significant we can then use multiple comparison procedures as usual to compare the mean response for different levels of the factor while holding the other factor fixed.

73 Two-way ANOVA Example
Cell and marginal totals of sleep-onset scores (4 observations per cell):
                 Dose (A)
Gender (B)       10 mg        40 mg        70 mg         Total
Male             A1B1 = 58    A2B1 = 42    A3B1 = 97     B1 = 197
Female           A1B2 = 128   A2B2 = 97    A3B2 = 157    B2 = 382
Total            A1 = 186     A2 = 139     A3 = 254      G = 579

74 Two-way ANOVA Example: Make Decisions
Source        df   SS        MS        Fobt    Fcrit
Dose          2    835.75    417.88    5.92    3.55
Gender        1    1426.22   1426.22   20.21   4.41
Interaction   2    14.58     7.29      0.10    3.55
Within        18   1270.26   70.57
Total         23   3546.81
For dose, Fobt > Fcrit; therefore, reject H0.
For gender, Fobt > Fcrit; therefore, reject H0.
For the interaction, Fobt < Fcrit; therefore, fail to reject H0.

75 Two-way ANOVA Main Effect
If a main effect is significant, it means that differences exist in one factor, regardless of the levels of the other factor. In our last example, the gender main effect was significant. This means that, regardless of the dose, females took significantly longer to fall asleep than males.

76 RStudio
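Since the code from this slide is not reproduced in the transcript, here is a minimal sketch of how the dose-by-gender analysis could be run in R. The data frame sleepdat, its column names, and the simulated scores are assumptions, not the workshop's actual data:

# Hypothetical data: 4 replicates per dose-by-gender cell (24 rows, matching the example's df)
sleepdat <- expand.grid(rep = 1:4,
                        Dose = factor(c("10 mg", "40 mg", "70 mg")),
                        Gender = factor(c("Male", "Female")))
set.seed(1)
sleepdat$Time <- rnorm(nrow(sleepdat), mean = 24, sd = 8)  # placeholder sleep-onset times
# Two-way ANOVA with main effects and the Dose x Gender interaction
AOV2 <- aov(Time ~ Dose * Gender, data = sleepdat)
summary(AOV2)
# Interaction plot: roughly parallel lines suggest no interaction
interaction.plot(sleepdat$Dose, sleepdat$Gender, sleepdat$Time)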

77 Reporting analyses, results, and assumptions

78 Reporting analyses, results, and assumptions
Procedures/Methods:
A description of the analysis. Give the test used, indicate the dependent and independent variables, and describe the experimental design or model in as much detail as is appropriate. Mention the experimental units, dates, and location.
Model assumption checks: describe what assumptions about the data or residuals are made by the test and how you assessed your data's conformity with these assumptions.
Cite the software and packages you used. Cite a reference for statistical procedures or code.

79 Reporting analyses, results, and assumptions
A measure of the central tendency or location of the data, such as the mean or median for groups or populations.
A measure of the variation for the mean or median of each group, such as standard deviation, standard error, first and third quartiles, or confidence intervals.
Size of the effect. Often this is conveyed by presenting means and medians in a plot or table.

80 Reporting

81 Results Figure

82 Results Table

83 Kruskal-Wallis H-Test
Tests the equality of more than two (p) population probability distributions.
Corresponds to ANOVA for more than two means.
Uses the χ² distribution with p − 1 df.

84 Kruskal-Wallis H-Test for Comparing k Probability Distributions
H0: The k probability distributions are identical.
Ha: At least two of the k probability distributions differ in location.
Test statistic:
H = [12 / (n(n + 1))] Σ (Rj² / nj) − 3(n + 1)

85 Kruskal-Wallis H-Test for Comparing k Probability Distributions
where
nj = number of measurements in sample j
Rj = rank sum for sample j, where the rank of each measurement is computed according to its relative magnitude in the totality of data for the k samples
R̄j = Rj/nj = mean rank sum for the jth sample
R̄ = mean of all ranks = (n + 1)/2
n = total sample size = n1 + n2 + … + nk

86 Kruskal-Wallis H-Test for Comparing k Probability Distributions
Rejection region: H > χ²α with (k − 1) degrees of freedom.
Ties: assign tied measurements the average of the ranks they would receive if they were unequal but occurred in successive order. For example, if the third-ranked and fourth-ranked measurements are tied, assign each a rank of (3 + 4)/2 = 3.5. The number of ties should be small relative to the total number of observations.
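The H statistic can be computed directly from this formula in R; a minimal sketch (the function name kw_H and its inputs are illustrative):

# x: vector of all measurements; g: parallel vector of group labels
kw_H <- function(x, g) {
  r  <- rank(x)               # ranks over the pooled data; ties get averaged ranks
  n  <- length(x)
  Rj <- tapply(r, g, sum)     # rank sum for each sample
  nj <- tapply(r, g, length)  # number of measurements in each sample
  12 / (n * (n + 1)) * sum(Rj^2 / nj) - 3 * (n + 1)
}
# kruskal.test(x, g) reports the same statistic (with a correction when there are ties)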

87 Conditions Required for the Validity of the Kruskal-Wallis H-Test
1. The k samples are random and independent.
2. The k probability distributions from which the samples are drawn are continuous.
Normality is not required.

88 Kruskal-Wallis H-Test Example
A production manager wants to see if three filling machines have different filling times. He assigns 15 similarly trained and experienced workers, 5 per machine, to the machines. At the .05 level of significance, is there a difference in the distribution of filling times?
[Table: filling times for Mach1, Mach2, and Mach3]

89 Kruskal-Wallis H-Test Solution
Ha:  = df = Critical Value(s): Identical Distrib. At Least 2 Differ .05 p – 1 = 3 – 1 = 2 c 2 5.991  = .05

90 Kruskal-Wallis H-Test Solution
[Slides 90 through 95 build up a table: the raw filling times for Mach1, Mach2, and Mach3 are ranked across all 15 observations, and the ranks are then totaled for each machine.]


96 Kruskal-Wallis H-Test Solution
[Computation of the test statistic from the rank sums; the result is H = 11.58.]

97 Kruskal-Wallis H-Test Solution
Ha:  = df = Critical Value(s): Identical Distrib. At Least 2 Differ Test Statistic: Decision: Conclusion: H = 11.58 .05 p – 1 = 3 – 1 = 2 c 2 Reject at  = .05 5.991  = .05 There is evidence population distrib. are different

98 Kruskal test and post-hoc test
The Kruskal-Wallis test uses a function called kruskal.test.
Post-hoc test: pairwise Mann-Whitney U-tests for multiple comparisons.
Post-hoc test using the Nemenyi test.

99 R-script
Kruskal <- read.csv("/Volumes/STEPHEN_A/Kruskal.csv")
# Carry out Kruskal-Wallis to test the hypothesis that the treatments have different effects
KRS <- kruskal.test(Protein ~ Group, data = Kruskal)
KRS
# Carry out post-hoc test using the Nemenyi test
install.packages("PMCMR")
library(PMCMR)
posthoc.kruskal.nemenyi.test(x = Kruskal$Protein, g = Kruskal$Group, dist = "Tukey")

