Data analysis.

The first step in any data analysis strategy is to calculate summary measures to get a general feel for the data. Summary measures for a data set are often referred to as descriptive statistics. Descriptive statistics fall into three main categories: measures of position (or central tendency), measures of variability, and measures of skewness.

The purpose of descriptive statistics is to describe the data. The type of data will determine which descriptive statistic is appropriate. Specifically, one can only calculate a mean with interval or ratio data, whereas a mode can be calculated with nominal, ordinal, interval or ratio data.

Measures of Position

Measures of position (or central tendency) describe where the data are concentrated.

Mean: The mean is simply the mathematical average of the data. The mean provides a quick way of describing your data and is probably the most used measure of central tendency. However, the mean is greatly influenced by outliers. For example, consider the following set: 1 1 2 4 5 5 6 6 7 150. While the mean for this data set is 18.7, nine out of ten of the observations lie below the mean because of the large final observation. Consequently, the mean is not always the best measure of central tendency.
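A short Python sketch, using the example data set above, makes the outlier effect concrete:

```python
from statistics import mean

data = [1, 1, 2, 4, 5, 5, 6, 6, 7, 150]   # the example set above

m = mean(data)
print(m)                          # 18.7
print(sum(x < m for x in data))   # 9: nine of the ten observations lie below the mean
```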

Median: The median is the middle observation in a data set. That is, 50% of the observations are above the median and 50% are below it (for sets with an even number of observations, the median is the average of the middle two observations). The median is often used when a data set is not symmetrical, or when there are outlying observations. For example, median income is generally reported rather than mean income because of outlying observations.

To get the median, first put your numbers in ascending or descending order. Then check which of the following two rules applies. Rule One: if you have an odd number of numbers, the median is the center number (e.g., 3 is the median for the numbers 1, 1, 3, 4, 9). Rule Two: if you have an even number of numbers, the median is the average of the two innermost numbers (e.g., 2.5 is the median for the numbers 1, 2, 3, 7).
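The two rules translate directly into a few lines of Python (a minimal sketch; the standard library's statistics.median does the same job):

```python
def median(numbers):
    """Median via the two rules: sort, then take the center number (odd n)
    or the average of the two innermost numbers (even n)."""
    ordered = sorted(numbers)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                  # Rule One: odd count
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2    # Rule Two: even count

print(median([1, 1, 3, 4, 9]))   # 3
print(median([1, 2, 3, 7]))      # 2.5
```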

Mode: The mode is the value around which the greatest number of observations are concentrated, or quite simply the most common observation. The mode is often used with nominal data, but it is not the preferred measure for other types of data.

The mean, median, and mode are affected differently by skewness (i.e., lack of symmetry) in the data.

When a variable is normally distributed, the mean, median, and mode are the same number.  

When the variable is skewed to the left (i.e., negatively skewed), the mean is pulled to the left the most, the median is pulled to the left the second most, and the mode is affected the least. Therefore, mean < median < mode.

When the variable is skewed to the right (i.e., positively skewed), the mean is pulled to the right the most, the median is pulled to the right the second most, and the mode is affected the least. Therefore, mean > median > mode.
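The ordering for positive skew can be checked on a small, deliberately right-skewed data set (illustrative values, not from the slides):

```python
from statistics import mean, median, mode

# A small right-skewed (positively skewed) data set: one large value
# in the right tail pulls the mean upward.
skewed_right = [1, 2, 2, 3, 4, 12]

print(mean(skewed_right))    # 4.0: pulled furthest toward the tail
print(median(skewed_right))  # 2.5
print(mode(skewed_right))    # 2: least affected
# mean > median > mode, as described for positive skew
```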

Measures of Variability

While measures of position describe where the data points are concentrated, measures of variability measure the dispersion (or spread) of the data set.

Range: The range is the difference between the largest and the smallest observations in the data set. However, this is a limited measure because it depends on only two of the numbers in the data set. Using the data set above again, the range is 149, but that does not provide any information about the concentration of the data at the low end of the scale. Another limitation of the range is that it is affected by the number of observations in the data set: generally, the more observations there are, the more spread out they will be. One use of the range in everyday life is in newspaper stock market summaries, which give the day's high and low numbers.

Measures of variability tell you how "spread out" the numbers are, that is, how much variability is present in a set of numbers. For example, which of the following sets of numbers appears to be the most spread out? Set A: 93, 96, 98, 99, 99, 99, 100. Set B: 10, 29, 52, 69, 87, 92, 100. Right! The numbers in Set B are more "spread out." One crude indicator of variability is the range (i.e., the difference between the highest and lowest numbers).

Two commonly used indicators of variability are the variance and the standard deviation. Variance: Unlike the range, the variance takes into consideration all the data points in the data set. If all the observations are the same, the variance is zero; the more spread out the observations are, the larger the variance. The variance is the average squared deviation from the mean, that is, the average deviation from the mean in "squared units."

Standard Deviation: The standard deviation is the positive square root of the variance, and it is the most common measure of variability. It indicates how far the numbers tend to fall from the mean: the larger the standard deviation, the more variation there is in the data set. (If the standard deviation is 7, the numbers tend to be about 7 units from the mean; if the standard deviation is 1500, they tend to be about 1500 units from the mean.)
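Python's statistics module can verify both definitions; here the population versions are used, and the set names follow Set A and Set B above:

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance

set_a = [93, 96, 98, 99, 99, 99, 100]    # Set A from above
set_b = [10, 29, 52, 69, 87, 92, 100]    # Set B from above

# Variance: the average squared deviation from the mean.
# The more spread-out Set B has the larger variance.
print(pvariance(set_a) < pvariance(set_b))             # True

# The standard deviation is the positive square root of the variance.
print(isclose(pstdev(set_a), sqrt(pvariance(set_a))))  # True

# No variability at all gives a variance and standard deviation of zero.
print(pstdev([3, 3, 3, 3, 3, 3]))                      # 0.0
```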

Virtually everyone in education is already familiar with the normal curve. An easy rule applying to data that follow the normal curve is the "68, 95, 99.7 percent" rule. That is: approximately 68% of the cases will fall within one standard deviation of the mean; approximately 95% of the cases will fall within two standard deviations of the mean; approximately 99.7% of the cases will fall within three standard deviations of the mean.
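The three percentages are not arbitrary; for a normal distribution, the proportion of cases within k standard deviations of the mean is erf(k/sqrt(2)), which the math module can evaluate directly:

```python
from math import erf, sqrt

# For a normal distribution, the proportion of cases within k standard
# deviations of the mean is erf(k / sqrt(2)).
for k in (1, 2, 3):
    print(f"within {k} SD: {erf(k / sqrt(2)) * 100:.1f}%")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```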

Higher values for both of these indicators stand for a larger amount of variability. Zero stands for no variability at all (e.g., for the data 3, 3, 3, 3, 3, 3, the variance and standard deviation will equal zero).

Frequency Distributions One useful way to view the information in a variable is to construct a frequency distribution (i.e., an arrangement in which the frequencies, and sometimes percentages, of the occurrence of each unique data value are shown). When a variable has a wide range of values, you may prefer using a grouped frequency distribution (i.e., where the data values are grouped into intervals, 0-9, 10-19, 20-29, etc., and the frequencies of the intervals are shown).
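A grouped frequency distribution takes only a few lines with collections.Counter; the scores here are hypothetical values chosen for illustration, not data from the slides:

```python
from collections import Counter

# Hypothetical test scores (illustrative values, not from the slides).
scores = [12, 7, 25, 22, 9, 14, 28, 3, 18, 21, 24, 11]

# Group values into the intervals 0-9, 10-19, 20-29, ... and count each.
grouped = Counter((s // 10) * 10 for s in scores)
for start in sorted(grouped):
    print(f"{start}-{start + 9}: {grouped[start]}")
# 0-9: 3
# 10-19: 4
# 20-29: 5
```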

Graphic Representations of Data Another excellent way to clearly describe your data (especially for visually oriented learners) is to construct graphical representations of the data (i.e., pictorial representations of the data in two-dimensional space). A bar graph uses vertical bars to represent the data. The height of the bars usually represents the frequencies for the categories shown on the X axis (i.e., the horizontal axis). (By the way, the Y axis is the vertical axis.)

A line graph uses one or more lines to depict information about one or more variables. A simple line graph might be used to show a trend over time (e.g., with the years on the X axis and the population sizes on the Y axis). Line graphs are used for many different purposes in research; for example, a line graph can show a distribution of grades (GPA on the X axis and frequency on the Y axis).

A scatterplot is used to depict the relationship between two quantitative variables. Typically, the independent or predictor variable is represented by the X axis (i.e., on the horizontal axis) and the dependent variable is represented by the Y axis (i.e., on the vertical axis).

The relationship is not always positive. The correlation coefficient ranges between -1 and +1. Interpretation of Pearson r: values near +1 indicate a strong positive correlation; values near -1 indicate a strong negative correlation; values close to zero indicate no correlation.

Correlation does not necessarily indicate causation. An r of +.82 tells us only that a person with an average score on one test will probably obtain an average score on the other test.

How to Interpret the Values of Correlations. The correlation coefficient (r) represents the linear relationship between two variables. If the correlation coefficient is squared, the resulting value (r², the coefficient of determination) represents the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship). In order to evaluate the correlation between variables, it is important to know this "magnitude" or "strength" as well as the significance of the correlation.
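Both r and r² can be computed from first principles; the paired scores below are illustrative values, not data from the slides:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative paired scores (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 3))      # 0.775: a fairly strong positive relationship
print(round(r * r, 3))  # 0.6: r-squared, the proportion of shared variation
```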

Outliers. Outliers are, by definition, atypical and infrequent observations. A single outlier can considerably change the slope of the regression line and, consequently, the value of the correlation coefficient.

Analyses for Comparison
Nominal Data: Chi-Square
Interval Data: t-Test
Interval Data: One-Way ANOVA
Interval Data: Factorial ANOVA

Analyses for Association
Interval Data: Pearson Product-Moment Correlation (r)
Nominal Data: Phi Coefficient
Ordinal Data: Spearman Rank-Order Correlation

Parametric methods and their nonparametric alternatives:
t-test for independent samples: Mann-Whitney U test
ANOVA/MANOVA (multiple groups): Kruskal-Wallis analysis of ranks; the median test
t-test for dependent samples (two variables measured in the same sample): sign test; Wilcoxon matched-pairs test

t-test for independent samples Purpose, Assumptions. The t-test is the most commonly used method to evaluate the differences in means between two groups. For example, the t-test can be used to test for a difference in test scores between a group of patients who were given a drug and a control group who received a placebo. Theoretically, the t-test can be used even if the sample sizes are very small (e.g., as small as 10; some researchers claim that even smaller n's are possible), as long as the variables are normally distributed within each group and the variation of scores in the two groups is not reliably different.

The normality assumption can be evaluated by looking at the distribution of the data (via histograms) or by performing a normality test. The equality of variances assumption can be verified with the F test, or you can use the more robust Levene's test. If these conditions are not met, then you can evaluate the differences in means between two groups using one of the nonparametric alternatives to the t-test.

Independent-samples t test

Group statistics:
Talk, low stress:  N = 15, Mean = 45.20, SD = 24.97, SE Mean = 6.45
Talk, high stress: N = 15, Mean = 22.07, SD = 27.14, SE Mean = 7.01
(The standard error of the mean is Sx = SD/√15.)

Levene's test for equality of variances (tested at α = .05): F = .023, Sig. = .881. Here you want the variances to be equal: a small F (large Sig.) means the variances are similar, and the larger the F value, the more dissimilar the variances are. In this case, the variances are similar.

t test for equality of means:
Equal variances assumed:     t = 2.43,  df = 28,     Sig. (2-tailed) = .022
Equal variances not assumed: t = 2.430, df = 27.808

An independent-samples t test was conducted to evaluate the hypothesis that students talk differently (in amount of talking) under different stress conditions. The test was significant, t(28) = 2.43, p = .022. Students in the high-stress condition talked less (M = 22.07, SD = 27.14) than students in the low-stress condition (M = 45.20, SD = 24.97).
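The reported t value can be reproduced from the summary statistics alone, using the pooled-variance formula (equal variances assumed, consistent with Levene's test above):

```python
from math import sqrt

# Summary statistics reported in the write-up above.
n1, m1, sd1 = 15, 45.20, 24.97    # low-stress condition
n2, m2, sd2 = 15, 22.07, 27.14    # high-stress condition

# Pooled variance (equal variances assumed).
sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
t = (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(round(t, 2), df)    # 2.43 28, matching the reported t(28) = 2.43
```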

t-test for dependent samples (paired-sample t test)

Here the two groups of observations to be compared are based on the same sample of subjects, who were tested twice (e.g., before and after a treatment).

Paired-sample statistics:
PAY:      Mean = 5.67, N = 30, SD = 1.49, SE Mean = .27
SECURITY: Mean = 4.50, N = 30, SD = 1.83, SE Mean = .33
(The standard error of the mean is Sx = SD/√30.)

Paired differences (pay minus security): Mean = 1.17, SD = 2.26, SE = .41, 95% CI = [.32, 2.01], t = 2.827, df = 29, Sig. (2-tailed) = .008.

A paired-sample t test was conducted to evaluate whether employees were more concerned with pay or job security. The results indicated that the mean concern for pay (M = 5.67, SD = 1.49) was significantly greater than the mean concern for security (M = 4.50, SD = 1.83), t(29) = 2.83, p = .008.
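The paired t statistic is simply the mean difference divided by its standard error. Using the rounded summary values above gives approximately 2.84, close to the reported 2.827 (which was computed from the unrounded data):

```python
from math import sqrt

# Rounded summary values from the paired-samples output above
# (mean difference, SD of the differences, number of pairs).
mean_diff, sd_diff, n = 1.17, 2.26, 30

se = sd_diff / sqrt(n)     # standard error of the mean difference
t = mean_diff / se
df = n - 1

print(round(t, 2), df)     # close to the reported t(29) = 2.827
```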

It was suggested (Marija J. Norusis) that when reporting your results you give the exact observed significance level; it will help the reader evaluate your findings. For example, p = .008 means there are 8 chances in 1000 that you would observe a difference this large by chance. By contrast, p = .08 means 8 chances in 100, which fails if you have set a criterion of 5 chances in 100 (α = .05).

Pearson Chi-square. The Pearson Chi-square is the most common test for significance of the relationship between categorical variables. This measure is based on the fact that we can compute the expected frequencies in a two-way table (i.e., frequencies that we would expect if there was no relationship between the variables). For example, suppose we ask 20 males and 20 females to choose between two brands of jeans (brands A and B). If there is no relationship between preference and gender, then we would expect about an equal number of choices of brand A and brand B for each sex. The Chi-square test becomes increasingly significant as the numbers deviate further from this expected pattern; that is, the more this pattern of choices for males and females differs.

The Goodness-of-Fit test is used to find out whether the population under study follows a hypothesized distribution. Ho: the population distribution is uniform, that is, each brand of cola drink is preferred by an equal percentage of the population. Ha: the population distribution is not uniform, that is, the brands are not preferred by equal percentages of the population.

Brand   O    E    O-E   (O-E)²   (O-E)²/E
A       50   60   -10   100      1.67
B       65   60     5    25      0.42
C       45   60   -15   225      3.75
D       70   60    10   100      1.67
E       70   60    10   100      1.67
Total  300  300                  9.17

χ²(df = 4) = 9.17. For a goodness-of-fit test, df = k - 1 = 5 - 1 = 4 (the df = (r-1)(c-1) formula applies to tests of independence instead). The critical value at α = .05 with df = 4 is 9.49. Since 9.17 < 9.49, Ho cannot be rejected, and we cannot say that the cola brands are preferred by unequal percentages of the population.
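The goodness-of-fit statistic is a one-line sum; brand E's observed count is taken as 70 here, inferred from the stated total of 300:

```python
# Observed preference counts for cola brands A-E; brand E's count (70)
# is inferred from the stated total of 300.
observed = [50, 65, 45, 70, 70]
expected = [300 / 5] * 5            # uniform distribution: 60 per brand

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1              # k - 1 for a goodness-of-fit test

print(round(chi2, 2), df)           # 9.17 4
```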

Test of independence: we can test the relationship between nominal variables. The data are obtained from a random sample, and we use count data (frequencies). We want to test whether perception of life is independent of gender, that is, whether men and women find life equally exciting.

Life excitement   Male   Female   Total
Excited            300    384      684
Not excited        296    481      777
Total              596    865     1461

Chi-square = 4.76, df = 1, p = .029. What can you conclude?
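The reported statistic can be reproduced from the table; the value closely matches the continuity-corrected (Yates) chi-square that statistical packages such as SPSS report for 2x2 tables, which is assumed here:

```python
# 2x2 contingency table from the slide:
# rows = excited / not excited, columns = male / female.
obs = [[300, 384],
       [296, 481]]

row_totals = [sum(r) for r in obs]           # 684, 777
col_totals = [sum(c) for c in zip(*obs)]     # 596, 865
n = sum(row_totals)                          # 1461

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        # Yates' continuity correction for a 2x2 table (an assumption:
        # it brings the result close to the slide's reported 4.76).
        chi2 += (abs(obs[i][j] - expected) - 0.5) ** 2 / expected

print(round(chi2, 2))   # close to the reported 4.76 (df = 1, p = .029)
```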