Introduction to Data Analysis


1 Introduction to Data Analysis
Chapter 11

2 11.1 Foundations of Data Analysis
Data analysis has important implications for conclusion validity
Data analysis has three main steps:
Data preparation
Descriptive statistics
Inferential statistics

3 11.2 Conclusion Validity The extent to which conclusions or inferences regarding relationships between the major variables in your research are warranted Related to internal validity, but is also independent of it Remember, internal validity is concerned with causal relationships Conclusion validity applies to all relationships, not just causal ones

4 11.2a Threats to Conclusion Validity
Type I Error: you conclude there is a relationship when there is not (a false positive)
Type II Error: you conclude there is not a relationship when there is one (a false negative)

5 11.2a Threats to Conclusion Validity – Type I Error
Finding a relationship when there is not one (or seeing things that aren't there)
Level of statistical significance (alpha)
Often set to .05, meaning five times out of 100 you will have a false positive
Fishing and the error rate problem
A problem that occurs as a result of conducting multiple analyses and treating each one as independent (see the sketch below)
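One common guard against the fishing problem is to adjust the significance level for the number of tests run. A minimal sketch in Python, using hypothetical p-values; the Bonferroni adjustment shown here is a standard technique, not one prescribed by the slide:

```python
# Bonferroni adjustment: divide the overall alpha by the number of tests
# so the family-wise false-positive rate stays near the nominal level.
p_values = [0.012, 0.049, 0.003, 0.21]  # hypothetical results from four separate analyses
alpha = 0.05
adjusted_alpha = alpha / len(p_values)  # 0.0125 for four tests

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"Test {i}: p = {p:.3f} -> {verdict} at adjusted alpha {adjusted_alpha:.4f}")
```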

6 11.2a Threats to Conclusion Validity – Type II Error
Finding no relationship when there is one (or, missing the needle in the haystack)
Small effect size: the ratio of signal to noise
Sources of noise:
Low reliability of measures
Poor reliability of treatment implementation
Random irrelevancies in the setting
Random heterogeneity of respondents
Weak signal: due to a low-strength intervention
Low reliability of measures: a threat to conclusion validity that occurs because measures with low reliability by definition have more noise than measures with higher reliability, and the greater noise makes it difficult to detect a relationship (i.e., to see the signal of the relationship relative to the noise).
Poor reliability of treatment implementation: a threat to conclusion validity that occurs in causal studies because treatments or programs with inconsistent or unreliable implementation introduce more noise than ones that are consistently carried out, and the greater noise makes it difficult to detect a relationship.
Random irrelevancies in the setting: a threat to conclusion validity that occurs when factors in the research setting that are irrelevant to the relationship being studied add noise to the environment, which makes it harder to detect a relationship between key variables, even if one is present.
Random heterogeneity of respondents: a threat to statistical conclusion validity. If you have a very diverse group of respondents, they are likely to vary more widely on your measures or observations. Some of that variety may be related to the phenomenon you are looking at, but at least part of it is likely to constitute individual differences that are irrelevant to the relationship being observed.
A low-strength intervention is a potential cause of a weak signal. For example, suppose you want to test the effectiveness of an after-school tutoring program on student test scores. One way to attain a strong signal, and thereby a larger effect size, is to design the intervention research with the strongest possible "dose" of the treatment.
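Effect size can be thought of as signal (the difference between group means) over noise (the variability within groups). A minimal sketch with hypothetical tutoring and control scores; Cohen's d is used here as one standard effect-size measure, not one named on the slide:

```python
import statistics

# Hypothetical post-test scores for a tutored group and a control group.
tutored = [78, 85, 82, 90, 74, 88, 81, 79]
control = [72, 75, 70, 80, 68, 77, 73, 71]

signal = statistics.mean(tutored) - statistics.mean(control)  # difference between means

# Pooled standard deviation serves as the "noise" term.
n1, n2 = len(tutored), len(control)
pooled_var = ((n1 - 1) * statistics.variance(tutored) +
              (n2 - 1) * statistics.variance(control)) / (n1 + n2 - 2)
noise = pooled_var ** 0.5

cohens_d = signal / noise
print(f"Effect size (Cohen's d): {cohens_d:.2f}")
```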

7 11.2a Threats to Conclusion Validity – Other Issues
Problems that can lead to either conclusion error:
Violated assumptions of statistical tests (see the sketch below)
Violated assumptions in qualitative research
Violated assumptions of statistical tests: the threat to conclusion validity that arises when key assumptions required by a specific statistical analysis are not met in the data.
Violated assumptions in qualitative research: there are assumptions, some of which you may not even realize, behind all qualitative methods. For instance, in interview situations you might assume that the respondents are free to say anything they wish. If that is not true (for example, if the respondent is under covert pressure from supervisors to respond in a certain way), you may erroneously see relationships in the responses that aren't real and/or miss ones that are.
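Many common statistical tests assume, for example, that scores are roughly normally distributed. A minimal sketch of checking that assumption before testing, using hypothetical measurements; the Shapiro-Wilk test from SciPy is one standard option, not something specified by the slide:

```python
from scipy import stats

scores = [3.1, 2.8, 4.0, 3.5, 3.7, 2.9, 3.3, 4.2, 3.0, 3.6]  # hypothetical measurements

# Shapiro-Wilk test: a small p-value suggests the normality assumption is violated.
statistic, p_value = stats.shapiro(scores)
if p_value < 0.05:
    print(f"Normality looks doubtful (p = {p_value:.3f}); consider a transformation or a nonparametric test.")
else:
    print(f"No strong evidence against normality (p = {p_value:.3f}).")
```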

8 11.2b Improving Conclusion Validity
Increase statistical power:
Increase the sample size
Increase the level of significance
Increase the effect size
Good implementation of the program (e.g., by using trained researchers)
Statistical power: the probability that you will conclude there is a relationship when in fact there is one.
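A minimal power-analysis sketch showing how sample size, significance level, and effect size interact; it uses TTestIndPower from statsmodels, and all of the numbers are hypothetical rather than drawn from the slide:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many participants per group are needed to detect a medium effect (d = 0.5)
# with 80% power at alpha = .05?
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"About {n_per_group:.0f} participants per group")

# With a larger effect size (a stronger "dose" of the treatment), fewer are needed.
n_large_effect = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.80)
print(f"About {n_large_effect:.0f} per group for a large effect")
```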

9 11.3 Data Preparation
Logging the data in
Checking the data for accuracy
Developing a database structure
Entering the data into the computer
Transforming the data

10 11.3a Logging the Data Use a computer program to log the data as it comes in MS Excel, Access SPSS, SAS, Minitab, Datadesk Retain and archive original data records Most researchers keep original data for 5-7 years IRB requires data be stored securely and anonymously

11 11.3b Checking the Data for Accuracy
Are the responses legible/readable?
Are all important questions answered?
Are the responses complete?
Is all relevant contextual information included (for example, date, time, place, and researcher)?

12 11.3c Developing a Database Structure
The database structure is the system you use to store the data for the study so that it can be accessed in subsequent data analyses
Develop a codebook:
Variable name, description, format (number, date, text), and location
Instrument/method of collection, date collected, respondent or group, and notes
Codebook: a written description of the data that describes each variable and indicates where and how it can be accessed.
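One lightweight way to keep a codebook is as a small table saved alongside the raw data. A minimal sketch using pandas; every variable name and value below is a hypothetical example, not part of the slide:

```python
import pandas as pd

# A minimal codebook: one row per variable in the study dataset.
codebook = pd.DataFrame([
    {"variable": "resp_id",   "description": "Unique respondent ID",      "format": "number",
     "location": "col 1", "instrument": "assigned at entry", "date_collected": "2024-01-15", "notes": ""},
    {"variable": "pretest",   "description": "Pre-test score (0-100)",    "format": "number",
     "location": "col 2", "instrument": "achievement test",  "date_collected": "2024-01-15", "notes": ""},
    {"variable": "condition", "description": "1 = tutoring, 0 = control", "format": "number",
     "location": "col 3", "instrument": "random assignment", "date_collected": "2024-01-15", "notes": ""},
])

print(codebook)
codebook.to_csv("codebook.csv", index=False)  # archive the codebook next to the raw data
```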

13 11.3d Entering the Data into the Computer
Enter the data into a spreadsheet, document, or statistical program
Have a unique ID number for each record
Write this ID number on the paper the data was collected on
Each case (record) will be on its own row
Each column represents a variable
Use double entry to check accuracy (see the sketch below)
Double entry: an automated method for checking data-entry accuracy in which you enter data once and then enter them a second time, with the software automatically stopping each time a discrepancy is detected until the data enterer resolves the discrepancy. This procedure assures extremely high rates of data-entry accuracy, although it requires twice as long for data entry.
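A minimal sketch of the comparison step in double entry, assuming the same hypothetical records have been keyed in twice; pandas' compare() is used here simply to surface cells where the two passes disagree:

```python
import pandas as pd

# Two hypothetical passes over the same paper records, keyed by the unique ID.
first_entry = pd.DataFrame({"id": [1, 2, 3], "age": [34, 27, 45], "score": [88, 72, 91]}).set_index("id")
second_entry = pd.DataFrame({"id": [1, 2, 3], "age": [34, 72, 45], "score": [88, 72, 91]}).set_index("id")

# compare() reports only the cells where the two passes differ,
# so each discrepancy can be resolved against the original paper record.
discrepancies = first_entry.compare(second_entry)
print(discrepancies)
```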

14 11.3e Data Transformations
Missing values
Item reversals
Scale and subscale totals
Categories
Variable transformations
Missing values: many analysis programs automatically treat blank values as missing.
Item reversals: on scales and surveys, the use of reversal items (see Chapter 6, "Scales, Tests and Indexes") can help reduce the possibility of a response set. When you analyze the data, you want all scores for questions or scale items to be in the same direction, where high scores mean the same thing and low scores mean the same thing. In such cases, you may have to reverse the ratings for some of the scale items to get them in the same direction as the others.
Scale and subscale totals: after you transform any individual scale items, you will often want to add or average across individual items to get scores for any subscales and a total score for the scale.
Categories: you may want to collapse one or more variables into categories. For instance, you may want to collapse income estimates (in dollar amounts) into income ranges.
Variable transformations: in order to meet the assumptions of certain statistical methods, we often need to transform particular variables, for example by expressing them in logarithm or square-root form. If data on a particular variable are skewed in the positive direction, taking the square root can make the distribution look closer to normal, a key assumption for many statistical analyses.
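A minimal sketch that runs through these transformations on a tiny hypothetical survey dataset; the item names, income bands, and scale range (1-5) are assumptions for illustration, not taken from the slide:

```python
import numpy as np
import pandas as pd

# Hypothetical 1-5 survey responses; item_3 is reverse-worded, income is in dollars.
df = pd.DataFrame({
    "item_1": [4, 5, np.nan, 3],   # NaN is treated as a missing value
    "item_2": [3, 4, 4, 2],
    "item_3": [2, 1, 2, 4],        # reversed item
    "income": [28000, 95000, 41000, 350000],
})

df["item_3"] = 6 - df["item_3"]                                    # item reversal on a 1-5 scale
df["scale_total"] = df[["item_1", "item_2", "item_3"]].sum(axis=1)  # scale total (missing values skipped)
df["income_band"] = pd.cut(df["income"],                            # collapse dollars into categories
                           bins=[0, 30000, 60000, 100000, np.inf],
                           labels=["low", "mid", "high", "very high"])
df["income_sqrt"] = np.sqrt(df["income"])                           # square-root transform for positive skew
print(df)
```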

15 11.4 Descriptive Statistics
Statistics used to describe the basic features of the data in a study
Three main measures:
The distribution
The central tendency
The dispersion (or variability)

16 11.4a The Distribution
Figure 11.2: A frequency distribution in table form
Figure 11.3: A frequency distribution bar chart
Distribution: the manner in which a variable takes different values in your data
Frequency distribution: a summary of the frequency of individual values or ranges of values for a variable
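A minimal sketch of building a frequency distribution in table form from hypothetical responses to a single categorical item:

```python
import pandas as pd

# Hypothetical responses to one survey question.
responses = pd.Series(["agree", "agree", "neutral", "disagree", "agree", "neutral", "agree"])

# Frequency distribution: each value and how often it occurs.
print(responses.value_counts())

# The same distribution expressed as percentages.
print(responses.value_counts(normalize=True).mul(100).round(1))
```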

17 11.4b Central Tendency
Mean: the average
Median: the centermost score (50th percentile)
Mode: the most frequently occurring score
If the distribution is normal, the mean, median, and mode are all equal
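A minimal sketch computing the three measures of central tendency on a small set of hypothetical scores:

```python
import statistics

scores = [2, 3, 3, 4, 5, 5, 5, 6, 7]  # hypothetical scores

print("Mean:",   statistics.mean(scores))    # the average
print("Median:", statistics.median(scores))  # the centermost score
print("Mode:",   statistics.mode(scores))    # the most frequently occurring score
```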

18 11.4c Dispersion or Variability
Dispersion: the spread of values around the central tendency
Range: the highest score minus the lowest score
Standard deviation: the variability of the scores around their average in a single sample
In a normal distribution, about 68% of scores fall within 1 SD of the mean
About 95% of scores fall within 2 SD of the mean
About 99.7% of scores fall within 3 SD of the mean
Variance: the spread of scores around the mean, in squared units
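A minimal sketch computing the three dispersion measures on hypothetical scores:

```python
import statistics

scores = [55, 60, 62, 65, 68, 70, 72, 75, 80, 93]  # hypothetical scores

print("Range:             ", max(scores) - min(scores))                  # highest minus lowest
print("Standard deviation:", round(statistics.stdev(scores), 2))         # spread around the mean
print("Variance:          ", round(statistics.variance(scores), 2))      # SD squared, in squared units
```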

19 11.4d Correlation
A single number that describes the degree of relationship between two variables
Positive correlations:
Fall between 0 and +1
As one value increases, so does the other
Negative correlations:
Fall between -1 and 0
As one value increases, the other decreases

20 11.4d Correlation
You can test whether a correlation happened by chance; this is the correlation's significance
You can also construct a correlation matrix
Table 11.5: Hypothetical correlation matrix for ten variables
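A minimal sketch that computes a correlation with its significance test and then a correlation matrix; the variables and data are simulated for illustration and are not the ten variables referenced in the table:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({"hours_studied": rng.normal(10, 2, 50)})
df["exam_score"] = 60 + 2.5 * df["hours_studied"] + rng.normal(0, 5, 50)  # related variable
df["shoe_size"] = rng.normal(9, 1, 50)                                    # unrelated variable

# Correlation and its significance (p-value) for one pair of variables.
r, p = stats.pearsonr(df["hours_studied"], df["exam_score"])
print(f"r = {r:.2f}, p = {p:.4f}")

# A correlation matrix across all variables at once.
print(df.corr().round(2))
```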

21 11.4d Types of Correlations
Pearson Product Moment Correlation: used when both variables are interval level
Spearman Rank Order Correlation (rho): used when both variables are ordinal level
Kendall Rank Order Correlation (tau): used when both variables are ordinal level
Point-Biserial Correlation: used when one variable is continuous and one is dichotomous
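A minimal sketch computing each type of correlation with SciPy on simulated data; the variables are hypothetical, and in practice the choice depends on the measurement level of your actual variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=40)                         # continuous variable
y = 0.6 * x + rng.normal(scale=0.8, size=40)    # continuous variable related to x
group = (rng.random(40) > 0.5).astype(int)      # dichotomous variable (0/1)

print("Pearson r:     ", round(stats.pearsonr(x, y)[0], 2))            # both interval
print("Spearman rho:  ", round(stats.spearmanr(x, y)[0], 2))           # both ordinal (ranks)
print("Kendall tau:   ", round(stats.kendalltau(x, y)[0], 2))          # both ordinal (ranks)
print("Point-biserial:", round(stats.pointbiserialr(group, y)[0], 2))  # one dichotomous, one continuous
```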

22 Discuss and Debate
Why should you be cautious when interpreting a correlation?
Discuss the relationship between a frequency distribution, measures of central tendency, and variability.
Correlations are not causal in nature; they are simply relationships between two variables. We should be careful when interpreting correlations because there is often a third variable that is causing the correlational relationship.
Researchers often begin with a frequency distribution to get an overall picture of the data. The measures of central tendency then focus on what is happening "in the middle" of this distribution, giving a numeric summary in the form of the mean (average), median (middle score), and mode (most frequently occurring score). The variability estimates of standard deviation and variance tell researchers how much spread there is in the overall distribution, that is, how far away, on average, any given score is from the mean, or center of the distribution.

