STATISTICAL ANALYSIS. Your introduction to statistics should not be like drinking water from a fire hose!!
What do you mean by data??

Nature of the Data
Two main types: categorical or continuous

1. Categorical:
   Nominal (unordered, unequal categories). E.g.: Female = 1 and male = 2
   Ordinal (ordered, unequal or ranked categories). E.g.: 1 = SD, 2 = D, 3 = N, 4 = A, 5 = SA
2. Continuous:
   Interval (ordered, equal intervals, no true zero). E.g.: 5-point Likert scale with equal intervals, or IQ score
   Ratio (ordered, equal intervals with absolute zero). E.g.: raw scores; class attendance (in days); age (in years)

Descriptive statistics: Procedures used for summarizing the data in both numerical and graphic form. They include frequencies, distributions, percents, cumulative percents, pie charts, bar graphs, histograms and scatter plots. (Cross-tabulations summarize the relationship between two variables, like a scatter plot but in table form.)

Measures of central tendency:
   Mean: arithmetic average (interval and ratio data only)
   Mode: most frequent value; a distribution can be bimodal or multimodal (all types of data)
   Median: midpoint, with half of the values above and half below (ordinal, interval and ratio data)
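As a small illustration of the three measures of central tendency, here is a sketch using Python's standard `statistics` module. The attendance figures are made up for the example and are not from the slides.

```python
import statistics

# Hypothetical ratio-scale data: class attendance in days for seven students.
attendance = [12, 15, 15, 18, 20, 22, 40]

mean = statistics.mean(attendance)        # arithmetic average
median = statistics.median(attendance)    # middle value of the sorted data
modes = statistics.multimode(attendance)  # list of the most frequent value(s)

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
```

The single large value (40) pulls the mean above the median, which is one reason to compare the two.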
Statistics 101!!

Statistics
– Measures of location: mean vs. median, and why
– Measures of scale: range, interquartile range, standard deviation (and variance)
– Measures of position: percentiles, deciles, quartiles, median

Note. For categorical variables, we use proportions as the descriptive statistics.
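The measures of scale and position listed above can be sketched with the standard `statistics` module. The scores are invented for illustration, and the "inclusive" quartile method is only one of several conventions; other software may give slightly different quartiles.

```python
import statistics

# Hypothetical raw scores (ratio data).
scores = [2, 4, 4, 5, 7, 9, 10, 12]

# Measures of scale
data_range = max(scores) - min(scores)
sd = statistics.stdev(scores)      # sample standard deviation
var = statistics.variance(scores)  # sample variance (sd squared)

# Measures of position: the three quartiles split the data into four parts.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1                      # interquartile range

print("range:", data_range)
print("standard deviation:", round(sd, 3), "variance:", round(var, 3))
print("quartiles:", q1, q2, q3, "IQR:", iqr)
```

The second quartile is the median, so the measures of position and location overlap at that point.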
Why does lack of normality cause problems?

When we calculate the p-value for an inference test, we find the probability that a result at least as extreme as the one in our sample would arise from sampling variability alone if the null hypothesis were true. Basically, we are trying to see whether a recorded value could have occurred by chance and chance alone. When we look for a p-value, we are assuming that the means of all samples of the given sample size are normally distributed around the population mean. This is why the test statistic, which is the number of standard errors the sample mean lies from the hypothesized population mean, can be used. Therefore, without normality, a p-value found this way cannot be trusted.
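To make the role of the normality assumption concrete, here is a minimal sketch of a one-sided p-value computed from a test statistic. All numbers are hypothetical, and the population standard deviation is assumed known so that the standard normal curve can be used directly.

```python
import math
from statistics import NormalDist

# Hypothetical setup (not from the slides).
population_mean = 50.0  # mean under the null hypothesis
population_sd = 10.0    # assumed known
n = 100                 # sample size
sample_mean = 52.0      # observed sample mean

# Test statistic: how many standard errors the sample mean lies
# from the hypothesized population mean.
standard_error = population_sd / math.sqrt(n)
z = (sample_mean - population_mean) / standard_error

# One-sided p-value: area under the standard normal curve to the right of z.
# This step is only valid if sample means really are normally distributed.
p_value = 1.0 - NormalDist().cdf(z)

print("z =", z, "p-value =", round(p_value, 4))
```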
There are non-parametric tests which are similar to the parametric tests. The following table shows how some of the tests match up.

| Parametric Test | Goal for Parametric Test | Non-Parametric Test | Goal for Non-Parametric Test |
|---|---|---|---|
| Two Sample T-Test | To see if two samples have identical population means | Wilcoxon Rank-Sum Test | To see if two samples have identical population medians |
| One Sample T-Test | To test a hypothesis about the mean of the population a sample was taken from | Wilcoxon Signed Ranks Test | To test a hypothesis about the median of the population a sample was taken from |
| Chi-Squared Test for Goodness of Fit | To see if a sample fits a theoretical distribution, such as the normal curve | Kolmogorov-Smirnov Test | To see if a sample could have come from a certain distribution |
| ANOVA | To see if two or more sample means are significantly different | Kruskal-Wallis Test | To test if two or more sample medians are significantly different |
What is different about Non-Parametric Statistics?

Sometimes statisticians use what is called “ordinal” data. This data is obtained by taking the raw data and giving each observation a rank. These ranks are then used to create test statistics. In non-parametric statistics, one deals with the median rather than the mean. Since a mean can be easily influenced by outliers or skewness, and we are not assuming normality, a mean no longer makes sense. The median is another measure of location, which makes more sense in a non-parametric test. The median is considered the center of a distribution.
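Turning raw data into ranks can be sketched as follows. The helper `average_ranks` is a hypothetical name written for this illustration, and giving tied values the average of their positions is one common convention, assumed here.

```python
def average_ranks(values):
    """Return 1-based ranks for values; tied values share the average rank."""
    # Indices of the observations, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of observations tied with position 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Average of the 1-based positions pos+1 .. end+1.
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


# Hypothetical raw data with one tie and one extreme outlier.
raw = [3.1, 250.0, 2.4, 3.1, 5.0]
ranks = average_ranks(raw)
print(ranks)
```

The outlier 250.0 simply receives the highest rank, so it cannot dominate a rank-based test statistic the way it would dominate a mean.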
Drawing a histogram: the good, the bad and the downright ugly!!

Many modern introductory texts confuse frequency graphs, relative frequency graphs, and histograms.

[Figure panels: Good, Bad]
What's the difference between a bar chart & a Histogram??
Critical Values

For a given number of degrees of freedom, the properties of the t-distribution tell us how large the t-statistic must be in order to reject the null hypothesis. We call that number the “critical value” of the t-statistic; it is typically looked up in a table of the t-distribution. If the value of the t-statistic calculated from the data is greater than this critical value, then we “reject the null hypothesis.” This is because, for t-statistics greater than the critical value, our probability of falsely rejecting the null hypothesis is very small (no larger than the chosen significance level).
Example

Suppose our null hypothesis is that the mean of X is less than or equal to 0. The sample mean is 3; the sample standard deviation is 2; there are 121 observations.

Step 1. We need to establish our “critical value.” We wish to reject the null hypothesis only at the 5% significance level (loosely, if we are 95% certain that it is false). For 121 observations and a “one-tailed test,” the critical value is 1.66 (which we look up in the table; this corresponds to a significance level of .05 with 120 degrees of freedom).

Step 2. The t-statistic = (3 - 0) / (2 / √121) = 3 / 0.182 ≈ 16.5.

Step 3. Compare the t-statistic with the critical value. If the t-statistic is greater than the critical value, then you can reject the null hypothesis. In this case, 16.5 is greater than 1.66, so we can reject the null hypothesis that the mean of X is less than or equal to zero.
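The three steps of this example can be sketched in a few lines of Python. The critical value 1.66 is taken from the slide (a table lookup) rather than computed. Without intermediate rounding the t-statistic comes out at about 16.5; rounding the standard error to 0.18 first gives about 16.7.

```python
import math

# Values from the example.
sample_mean = 3.0
hypothesized_mean = 0.0
sample_sd = 2.0
n = 121

# Step 1: critical value, looked up in a t table
# (one-tailed test, significance level .05, 120 degrees of freedom).
critical_value = 1.66

# Step 2: t-statistic = (sample mean - hypothesized mean) / standard error.
standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - hypothesized_mean) / standard_error

# Step 3: compare the t-statistic with the critical value.
reject_null = t_stat > critical_value

print("standard error:", round(standard_error, 3))
print("t-statistic:", round(t_stat, 1))
print("reject null hypothesis:", reject_null)
```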
Example

The table below is a sample “cross-tab.” Your research hypothesis is that dog ownership and gender are related. How do you test this hypothesis?

| | Dog Owners | No Pets | Totals |
|---|---|---|---|
| Men | 100 | 400 | 500 |
| Women | 50 | 450 | 500 |
| Totals | 150 | 850 | 1,000 |
Hypothesis Tests about tables

Step 1. Define null and research hypotheses. The null hypothesis will usually be that there is no relationship between the rows and the columns.
Step 2. Determine your tolerance for falsely rejecting the null hypothesis of no relationship.
Step 3. Empirically analyse the data to determine if there is a relationship.
Example

To test for independence:
1) Identify the number of respondents in each internal cell of the table.
2) Calculate the number of respondents who would be in each cell if rows and columns were independent (the number in parentheses in each cell), e.g.
   cell 1,1 = .5 * .15 * 1000 = 75
   cell 1,2 = .5 * .85 * 1000 = 425
3) Compute the chi-squared test statistic (next slide).

| | Dog Owners | No Pets | Totals |
|---|---|---|---|
| Men | 100 (75) | 400 (425) | 500 |
| Women | 50 (75) | 450 (425) | 500 |
| Totals | 150 | 850 | 1,000 |
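Step 2 can be sketched in Python as follows. The expected count for a cell is (row total × column total) / grand total, which is the same as the row proportion times the column proportion times 1,000 used above. The variable names are chosen for this illustration.

```python
# Observed counts from the cross-tab: rows are Men, Women;
# columns are Dog Owners, No Pets.
observed = [
    [100, 400],
    [50, 450],
]

row_totals = [sum(row) for row in observed]        # 500, 500
col_totals = [sum(col) for col in zip(*observed)]  # 150, 850
grand_total = sum(row_totals)                      # 1000

# Expected count under independence for every cell.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for label, row in zip(["Men", "Women"], expected):
    print(label, row)
```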
The Chi-Square Test Statistic

To test for independence:
3) Compute the chi-squared test statistic.

The chi-squared test statistic is simply:

$$\chi^2 = \sum_{\text{rows}} \sum_{\text{columns}} \frac{(\text{Observed}_{\text{row,column}} - \text{Expected}_{\text{row,column}})^2}{\text{Expected}_{\text{row,column}}}$$

The chi-squared statistic follows a chi-squared distribution with degrees of freedom = (rows - 1) × (columns - 1).
Example

If we look at our table of the χ² distribution with 1 degree of freedom, the critical value for our test statistic is 3.84.

$$\chi^2 = \frac{(100 - 75)^2}{75} + \frac{(400 - 425)^2}{425} + \frac{(50 - 75)^2}{75} + \frac{(450 - 425)^2}{425} = 19.6$$

In this case, we reject the null hypothesis that gender and dog ownership are statistically independent, because our test statistic is greater than our critical value.

| | Dog Owners | No Pets | Totals |
|---|---|---|---|
| Men | 100 (75) | 400 (425) | 500 |
| Women | 50 (75) | 450 (425) | 500 |
| Totals | 150 | 850 | 1,000 |
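The calculation above can be checked with a short Python sketch. The observed and expected counts and the critical value 3.84 are copied from the slide; the critical value is a table lookup, not computed here.

```python
# Observed counts and the expected counts under independence
# (rows: Men, Women; columns: Dog Owners, No Pets).
observed = [[100, 400], [50, 450]]
expected = [[75, 425], [75, 425]]

# Sum (observed - expected)^2 / expected over every internal cell.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom = (rows - 1) * (columns - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 3.84  # chi-squared table, 1 degree of freedom
reject_independence = chi_squared > critical_value

print("chi-squared:", round(chi_squared, 1))
print("degrees of freedom:", degrees_of_freedom)
print("reject independence:", reject_independence)
```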