# Data analysis.


The first step in any data analysis strategy is to calculate summary measures to get a general feel for the data. Summary measures for a data set are often referred to as descriptive statistics. Descriptive statistics fall into three main categories: measures of position (or central tendency), measures of variability, and measures of skewness.

The purpose of descriptive statistics is to describe the data.
The type of data will determine which descriptive statistic is appropriate. Specifically, one can only calculate a mean with interval or ratio data, whereas a mode can be calculated with nominal, ordinal, interval or ratio data.

Measures of Position
Measures of position (or central tendency) describe where the data are concentrated.

Mean: The mean is simply the arithmetic average of the data. The mean provides a quick way of describing your data and is probably the most used measure of central tendency. However, the mean is greatly influenced by outliers. For example, consider a data set of ten observations in which the final observation is much larger than the rest: while the mean for this data set is 18.7, nine out of ten of the observations lie below the mean because of the large final observation. Consequently, the mean is not always the best measure of central tendency.
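The effect of a single outlier on the mean can be sketched in a few lines of Python. The data below are hypothetical (they are not the data set from the original slide); they only illustrate how one large observation pulls the mean above almost every other value.

```python
from statistics import mean, median

# Hypothetical data: nine small observations and one large outlier.
scores = [2, 3, 3, 4, 5, 5, 6, 7, 8, 150]

avg = mean(scores)
# Count how many observations fall below the mean.
below = sum(1 for x in scores if x < avg)

print(f"mean = {avg}, median = {median(scores)}, observations below the mean = {below}")
```

Here nine of the ten observations lie below the mean, while the median stays close to the bulk of the data.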

Median: The median is the middle observation in a data set. That is, 50% of the observations are above the median and 50% are below the median (for sets with an even number of observations, the median is the average of the middle two observations). The median is often used when a data set is not symmetrical, or when there are outlying observations. For example, median income is generally reported rather than mean income because of the outlying observations.

To get the median, first put your numbers in ascending or descending order. Then check which of the following two rules applies:

Rule One. If you have an odd number of numbers, the median is the center number (e.g., three is the median for the numbers 1, 1, 3, 4, 9).

Rule Two. If you have an even number of numbers, the median is the average of the two innermost numbers (e.g., 2.5 is the median for the numbers 1, 2, 3, 7).
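The two rules can be sketched as a small Python function. The function name `median` is chosen here for illustration; Python's standard library also offers `statistics.median`, which behaves the same way for these inputs.

```python
def median(values):
    """Return the median using the odd/even rules described above."""
    ordered = sorted(values)          # step 1: put the numbers in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # Rule One: odd count, take the center number
        return ordered[mid]
    # Rule Two: even count, average the two innermost numbers
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 1, 3, 4, 9]))  # the odd-count example from the text
print(median([1, 2, 3, 7]))     # the even-count example from the text
```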

Mode: The mode is the value around which the greatest number of observations are concentrated, or quite simply the most common observation. The mode is often used with nominal data, but is not the preferred measure for other types of data.

The mean, median, and mode are affected differently by skewness (i.e., lack of symmetry) in the data.

When a variable is normally distributed, the mean, median, and mode are the same number.

When the variable is skewed to the left (i.e., negatively skewed), the mean is pulled to the left the most, the median is pulled to the left the second most, and the mode is the least affected. Therefore, mean < median < mode.

When the variable is skewed to the right (i.e., positively skewed), the mean is pulled to the right the most, the median is pulled to the right the second most, and the mode is the least affected. Therefore, mean > median > mode.
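As a rough illustration of these orderings, the sketch below uses two small hypothetical data sets, one with a long right tail and its mirror image with a long left tail. The values are invented for the example.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: a few large values stretch the right tail.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]
# Mirror image of the same data, which gives a long left tail instead.
left_skewed = [21 - x for x in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}, mode = {mode(data)}")
```

For the right-skewed set the mean exceeds the median, which exceeds the mode; for the left-skewed set the order is reversed.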

Measures of Variability
While measures of position describe where the data points are concentrated, measures of variability measure the dispersion (or spread) of the data set.

Range: The range is the difference between the largest and the smallest observations in the data set. However, this is a limited measure because it depends on only two of the numbers in the data set. Using the above data set again, the range is 149, but that does not provide any information regarding the concentration of the data at the low end of the scale. Another limitation of the range is that it is affected by the number of observations in the data set. Generally, the more observations there are, the more spread out they will be. One use of the range in everyday life is in newspaper stock market summaries, which give the day's high and low numbers.

Measures of variability tell you how "spread out" a set of numbers is, that is, how much variability is present. For example, which of the following sets of numbers appears to be the most spread out?

Set A. 93, 96, 98, 99, 99, 99, 100

Set B. 10, 29, 52, 69, 87, 92, 100

Right! The numbers in set B are more "spread out." One crude indicator of variability is the range (i.e., the difference between the highest and lowest numbers).
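The range of the two sets can be checked with a short Python sketch; the helper name `value_range` is an illustrative choice.

```python
set_a = [93, 96, 98, 99, 99, 99, 100]
set_b = [10, 29, 52, 69, 87, 92, 100]


def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


print("Range of set A:", value_range(set_a))
print("Range of set B:", value_range(set_b))
```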

Two commonly used indicators of variability are the variance and the standard deviation.
Variance: Unlike the range, the variance takes into consideration all the data points in the data set. If all the observations are the same, the variance is zero. The more spread out the observations are, the larger the variance. The variance tells you the average squared deviation from the mean, so it is expressed in "squared units."

Standard Deviation: The standard deviation is the positive square root of the variance, and is the most common measure of variability. The standard deviation indicates how far the numbers tend to lie from the mean. The larger the standard deviation, the more variation there is in the data set. (If the standard deviation is 7, then the numbers tend to be about 7 units from the mean. If the standard deviation is 1500, then the numbers tend to be about 1500 units from the mean.)
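A minimal sketch of both indicators, using Sets A and B from the range example and Python's standard `statistics` module. It assumes the population formulas (dividing by N), which match the "average squared deviation" description; sample formulas divide by N − 1 instead (`statistics.variance` and `statistics.stdev`).

```python
import math
from statistics import pstdev, pvariance

set_a = [93, 96, 98, 99, 99, 99, 100]
set_b = [10, 29, 52, 69, 87, 92, 100]
constant = [3, 3, 3, 3, 3, 3]

for name, data in [("Set A", set_a), ("Set B", set_b), ("Constant", constant)]:
    var = pvariance(data)   # average squared deviation from the mean
    sd = pstdev(data)       # positive square root of the variance
    print(f"{name}: variance = {var:.2f}, standard deviation = {sd:.2f}")
```

Set B, the more spread-out set, yields the larger variance and standard deviation, and the constant data yield zero for both.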

Virtually everyone in education is already familiar with the normal curve.
An easy rule applying to data that follow the normal curve is the "68, 95, 99.7 percent rule." That is:

- Approximately 68% of the cases will fall within one standard deviation of the mean.
- Approximately 95% of the cases will fall within two standard deviations of the mean.
- Approximately 99.7% of the cases will fall within three standard deviations of the mean.
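These percentages can be reproduced from the normal distribution itself. The sketch below relies on the standard identity that, for a normal variable, the probability of falling within k standard deviations of the mean equals erf(k/√2); the function name `within_k_sd` is an illustrative choice.

```python
import math


def within_k_sd(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {within_k_sd(k) * 100:.1f}%")
```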

Higher values of the variance and the standard deviation stand for a larger amount of variability. Zero stands for no variability at all (e.g., for the data 3, 3, 3, 3, 3, 3, the variance and standard deviation will equal zero).

Frequency Distributions
One useful way to view information in a variable is to construct a frequency distribution (i.e., an arrangement in which the frequencies, and sometimes percentages, of the occurrence of each unique data value are shown). When a variable has a wide range of values, you may prefer using a grouped frequency distribution (i.e., where the data values are grouped into intervals such as 0-9, 10-19, 20-29, etc., and the frequencies of the intervals are shown).
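Both kinds of frequency distribution can be sketched with `collections.Counter`. The scores are hypothetical, and the interval width of 10 mirrors the 0-9, 10-19 grouping mentioned above; the helper name `grouped_frequencies` is an illustrative choice.

```python
from collections import Counter

# Hypothetical scores.
scores = [3, 7, 12, 15, 15, 18, 21, 24, 24, 24]

# Ungrouped: frequency and percentage of each unique value.
freq = Counter(scores)
for value in sorted(freq):
    pct = 100 * freq[value] / len(scores)
    print(f"{value:>3}: {freq[value]} ({pct:.0f}%)")


def grouped_frequencies(values, width=10):
    """Count values per interval, e.g. 0-9, 10-19, ... for width=10."""
    groups = Counter((v // width) * width for v in values)
    return {f"{lo}-{lo + width - 1}": groups[lo] for lo in sorted(groups)}


grouped = grouped_frequencies(scores)
print(grouped)
```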

Graphic Representations of Data
Another excellent way to clearly describe your data (especially for visually oriented learners) is to construct graphical representations of the data (i.e., pictorial representations of the data in two-dimensional space).

A bar graph uses vertical bars to represent the data. The height of the bars usually represents the frequencies for the categories shown on the X axis (i.e., the horizontal axis). (By the way, the Y axis is the vertical axis.)

A line graph uses one or more lines to depict information about one or more variables.
A simple line graph might be used to show a trend over time (e.g., with the years on the X axis and the population sizes on the Y axis). Line graphs are used for many different purposes in research. For example, a line graph can show a frequency distribution, with GPA on the X axis and frequency on the Y axis.

A scatterplot is used to depict the relationship between two quantitative variables.
Typically, the independent or predictor variable is represented by the X axis (i.e., on the horizontal axis) and the dependent variable is represented by the Y axis (i.e., on the vertical axis).

The relationship is not always positive
The correlation coefficient ranges between -1 and +1. Interpretation of Pearson r:

- Close to +1: highly positively correlated
- Close to -1: highly negatively correlated
- Close to zero: no correlation

Correlation does not necessarily indicate causation
A correlation of +.82 between two tests tells us that a person with an average score on one test will probably obtain an average score on the other test.

How to Interpret the Values of Correlations.
The correlation coefficient (r) represents the linear relationship between two variables. If the correlation coefficient is squared, then the resulting value (r², the coefficient of determination) will represent the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship). In order to evaluate the correlation between variables, it is important to know this "magnitude" or "strength" as well as the significance of the correlation.
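A minimal sketch of how r and r² can be computed from paired observations. The function name `pearson_r` and the data are hypothetical; the formula is the usual sum of cross-products divided by the square root of the product of the sums of squares.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired scores.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
```

For these made-up data r is about .77, so r² is .60: roughly 60% of the variation is shared by the two variables.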

Outliers. Outliers are atypical (by definition), infrequent observations. Outliers have a profound influence on the slope of the regression line and consequently on the value of the correlation coefficient. A single outlier is capable of considerably changing the slope of the regression line and, consequently, the value of the correlation, as demonstrated in the following example.
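The influence of a single outlier can be illustrated with a short Python sketch. The data are hypothetical (this is not the example from the original slide): five weakly related points, then the same points plus one extreme observation.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with almost no linear relationship.
x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 2, 3]
r_without = pearson_r(x, y)

# Add one extreme observation at (20, 20).
r_with = pearson_r(x + [20], y + [20])

print(f"r without outlier = {r_without:.2f}")
print(f"r with outlier    = {r_with:.2f}")
```

One added point is enough to move the correlation from near zero to above .9 in this made-up case.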

Analyses for Comparison
- Nominal Data: Chi-Square
- Interval Data: t-Test
- Interval Data: One-Way ANOVA
- Interval Data: Factorial ANOVA

Analyses for Association

- Interval Data: Pearson Product-Moment Correlation (r)
- Nominal Data: Phi Coefficient
- Ordinal Data: Spearman Rank-Order Correlation

| Parametric Methods | Nonparametric Methods |
|---|---|
| t-test for independent samples | Mann-Whitney U test |
| ANOVA/MANOVA (multiple groups) | Kruskal-Wallis analysis of ranks and the Median test |
| t-test for dependent samples (two variables measured in the same sample) | Sign test and Wilcoxon's matched pairs test |

t-test for independent samples
Purpose, Assumptions. The t-test is the most commonly used method to evaluate the differences in means between two groups. For example, the t-test can be used to test for a difference in test scores between a group of patients who were given a drug and a control group who received a placebo. Theoretically, the t-test can be used even if the sample sizes are very small (e.g., as small as 10; some researchers claim that even smaller n's are possible), as long as the variables are normally distributed within each group and the variation of scores in the two groups is not reliably different.

The normality assumption can be evaluated by looking at the distribution of the data (via histograms) or by performing a normality test. The equality of variances assumption can be verified with the F test, or you can use the more robust Levene's test. If these conditions are not met, then you can evaluate the differences in means between two groups using one of the nonparametric alternatives to the t-test (see Nonparametrics).

Independent sample t test
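Group statistics (the standard error of the mean, i.e., the standard deviation of the sample means, is Sx = SD/√15):

| Talk | Mean | N | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|
| Low stress | 45.20 | 15 | 24.97 | 6.45 |
| High stress | 22.07 | 15 | 27.14 | 7.01 |

Independent samples test (F and the first Sig. column belong to Levene's test for equality of variances):

| Talk | F | Sig. | t | df | Sig. (2-tailed) | Mean diff. | Std. error diff. |
|---|---|---|---|---|---|---|---|
| Equal variances assumed | .023 | .881 | 2.43 | 28 | .022 | | |
| Equal variances not assumed | | | 2.430 | 27.808 | | | |

Levene's test for equality of variances is tested at α = .05. Here you want the variances to be equal, so you want a small F: the larger the F value, the more dissimilar the variances are. In this case, the variances are similar (F = .023, Sig. = .881).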

An independent-samples t test was conducted to evaluate the hypothesis that students talk differently (in amount of talking) under different stress conditions. The test was significant, t(28) = 2.43, p = .022. Students in the high-stress condition talked less (M = 22.07, SD = 27.14) than students in the low-stress condition (M = 45.20, SD = 24.97).
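The reported t value can be approximately reproduced from the summary statistics alone. The sketch below assumes the pooled-variance (equal variances assumed) formula, uses the low-stress mean of 45.20 given in the write-up, and adds the Welch-Satterthwaite degrees of freedom for the "equal variances not assumed" row; the function names are illustrative.

```python
import math


def independent_t_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Pooled-variance t statistic and df from group means, SDs, and sizes."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se_diff, n1 + n2 - 2


def welch_df(sd1, n1, sd2, n2):
    """Welch-Satterthwaite df, used when equal variances are not assumed."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


# Low stress vs. high stress, values taken from the output above.
t, df = independent_t_from_summary(45.20, 24.97, 15, 22.07, 27.14, 15)
df_welch = welch_df(24.97, 15, 27.14, 15)

print(f"t({df}) = {t:.2f}")
print(f"Welch df = {df_welch:.3f}")
```

Because both groups have the same size, the pooled and unpooled standard errors coincide, which is why the two rows of the output show the same t value.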

t-test for dependent samples (paired-sample t-test)
This test is used when the two groups of observations to be compared are based on the same sample of subjects who were tested twice (e.g., before and after a treatment).

Paired samples statistics (the standard error of the mean, i.e., the standard deviation of the sample means, is Sx = SD/√30):

| | Mean | N | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|
| Pay | 5.67 | 30 | 1.49 | .27 |
| Security | 4.50 | 30 | 1.83 | .33 |

Paired differences:

| | Mean | Std. Dev. | Std. Err. | Lower | Upper | t | df | Sig. (2-tailed) |
|---|---|---|---|---|---|---|---|---|
| Pay - Security | 1.17 | 2.26 | .41 | .32 | 2.01 | 2.827 | 29 | .008 |

A paired-sample t test was conducted to evaluate whether employees were more concerned with pay or job security. The results indicated that the mean concern for pay (M = 5.67, SD = 1.49) was significantly greater than the mean concern for security (M = 4.50, SD = 1.83), t(29) = 2.83, p = .008.
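The paired t statistic and the confidence interval of the difference can be roughly reconstructed from the paired-differences row. This is a sketch under stated assumptions: the function name is illustrative, and the critical value 2.045 (two-tailed, α = .05, df = 29) is taken from a standard t table rather than computed. Because the inputs are rounded to two decimals, the result differs slightly from the reported 2.827.

```python
import math


def paired_t_from_summary(mean_diff, sd_diff, n, t_crit):
    """t statistic, df, and confidence interval from paired-difference summaries."""
    se = sd_diff / math.sqrt(n)          # standard error of the mean difference
    t = mean_diff / se
    ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)
    return t, n - 1, ci


T_CRIT_DF29 = 2.045  # assumed table value: two-tailed .05 critical t for df = 29

t, df, (lower, upper) = paired_t_from_summary(1.17, 2.26, 30, T_CRIT_DF29)
print(f"t({df}) = {t:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```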

Marija J. Norusis suggests that, when reporting your results, you give the exact observed significance level. It will help the reader evaluate your findings.

E.g., p = .008 means there are 8 chances in 1000 that you would observe a difference this large between the two samples if there were no true difference.

E.g., p = .08 means there are 8 chances in 100, but you have set in advance that you will only accept a result if there are at most 5 chances in 100.

Pearson Chi-square. The Pearson Chi-square is the most common test for significance of the relationship between categorical variables. This measure is based on the fact that we can compute the expected frequencies in a two-way table (i.e., frequencies that we would expect if there was no relationship between the variables). For example, suppose we ask 20 males and 20 females to choose between two brands of jeans (brands A and B). If there is no relationship between preference and gender, then we would expect about an equal number of choices of brand A and brand B for each sex. The Chi-square test becomes increasingly significant as the numbers deviate further from this expected pattern; that is, the more this pattern of choices for males and females differs.

The Goodness of Fit test is used to find out whether the population under study follows a specified distribution.

Ho: the population distribution is uniform, that is, each brand of cola drink is preferred by an equal percentage of the population.

Ha: the population distribution is not uniform, that is, each brand of cola drink is not preferred by an equal percentage of the population.

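| Brand | O | E | O-E | (O-E)² | (O-E)²/E |
|---|---|---|---|---|---|
| A | 50 | 60 | -10 | 100 | 1.67 |
| B | 65 | 60 | 5 | 25 | 0.42 |
| C | 45 | 60 | -15 | 225 | 3.75 |
| D | 70 | 60 | 10 | 100 | 1.67 |
| E | 70 | 60 | 10 | 100 | 1.67 |
| Total | 300 | 300 | | | 9.18 |

χ² (df = 4) = 9.18. Let us say the critical value at α = .05 is 9.49. Because 9.18 is smaller than 9.49, Ho cannot be rejected, and we cannot say that the cola brands are preferred by unequal percentages of the population. For a goodness-of-fit test with k categories, df = k - 1; for a contingency table, df = (r - 1)(c - 1).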
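The goodness-of-fit statistic can be recomputed with a few lines of Python. The sketch assumes five brands with an expected count of 60 each (300/5) and a fifth brand with 70 observed choices, the value implied by the total of 300; the critical value 9.49 corresponds to df = 4 at α = .05.

```python
# Observed counts for brands A-E; the fifth count (70) is an assumption
# inferred from the stated total of 300.
observed = [50, 65, 45, 70, 70]
expected = [sum(observed) / len(observed)] * len(observed)  # uniform: 60 each

# Chi-square = sum of (O - E)^2 / E over all categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

CRITICAL_VALUE = 9.49  # chi-square critical value for df = 4, alpha = .05
reject_h0 = chi_square > CRITICAL_VALUE

print(f"chi-square({df}) = {chi_square:.2f}, reject Ho: {reject_h0}")
```

The unrounded statistic is about 9.17; summing the terms after rounding each to two decimals gives 9.18. Either way it falls just short of 9.49.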

Test of independence (we can test the relationship between nominal variables):

- The data are obtained from a random sample.
- We use count data (frequencies).
- We want to test whether perception of life is independent of gender, that is, whether men and women find life equally exciting.

| Life excitement | Male | Female | Total |
|---|---|---|---|
| Excited | 300 | 384 | 684 |
| Not excited | 296 | 481 | 777 |
| Total | 596 | 865 | 1461 |

Chi-square = 4.76, df = 1, p = .0290. What can you conclude?
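The test of independence for this 2 × 2 table can be sketched as follows. The function names are illustrative. The p-value uses the identity that, for df = 1, the upper-tail chi-square probability equals erfc(√(χ²/2)). The reported value of 4.76 appears to correspond to the continuity-corrected (Yates) statistic, which is an inference from the numbers rather than something the slides state; the uncorrected Pearson statistic comes out near 5.00.

```python
import math

# Rows: excited, not excited; columns: male, female.
observed = [[300, 384],
            [296, 481]]


def chi_square_2x2(table, yates=False):
    """Pearson chi-square for a two-way table, optionally Yates-corrected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected frequency if the two variables were independent.
            exp = row_totals[i] * col_totals[j] / n
            diff = abs(obs - exp)
            if yates:
                diff = max(diff - 0.5, 0.0)
            chi2 += diff**2 / exp
    return chi2


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with df = 1."""
    return math.erfc(math.sqrt(chi2 / 2))


pearson = chi_square_2x2(observed)
corrected = chi_square_2x2(observed, yates=True)
p_corrected = p_value_df1(corrected)

print(f"Pearson chi-square   = {pearson:.2f}")
print(f"Corrected chi-square = {corrected:.2f}, p = {p_corrected:.4f}")
```

Since p is below .05, the hypothesis that life excitement is independent of gender would be rejected at α = .05.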