Assumptions of multiple regression


Assumptions of multiple regression Assumption of normality Assumption of linearity Assumption of homoscedasticity Script for testing assumptions Practice problems

Assumptions of Normality, Linearity, and Homoscedasticity Multiple regression assumes that the variables in the analysis satisfy the assumptions of normality, linearity, and homoscedasticity. (There is also an assumption of independence of errors but that cannot be evaluated until the regression is run.) There are two general strategies for checking conformity to assumptions: pre-analysis and post-analysis. In pre-analysis, the variables are checked prior to running the regression. In post-analysis, the assumptions are evaluated by looking at the pattern of residuals (errors or variability) that the regression was unable to predict accurately. The text recommends pre-analysis, the strategy we will follow.

Assumption of Normality The assumption of normality prescribes that the distribution of cases fit the pattern of a normal curve. It is evaluated for all metric variables included in the analysis, independent variables as well as the dependent variable. With multivariate statistics, the assumption is that the combination of variables follows a multivariate normal distribution. Since there is not a direct test for multivariate normality, we generally test each variable individually and assume that they are multivariate normal if they are individually normal, though this is not necessarily the case.

Assumption of Normality: Evaluating Normality There are both graphical and statistical methods for evaluating normality. Graphical methods include the histogram and normality plot. Statistical methods include diagnostic hypothesis tests for normality, and a rule of thumb that says a variable is reasonably close to normal if its skewness and kurtosis have values between –1.0 and +1.0. None of the methods is absolutely definitive. We will use the criterion that the skewness and kurtosis of the distribution both fall between –1.0 and +1.0.
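As a sketch of the same rule of thumb outside SPSS, the check can be scripted in Python with scipy. The data below is simulated and the variable names are hypothetical stand-ins for the slide's examples. One caveat: pandas' `Series.skew()`/`Series.kurt()` use the bias-corrected formulas SPSS reports, while scipy's defaults do not; for a ±1.0 rule of thumb the difference is rarely decisive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated stand-ins for the slide's two variables (hypothetical data):
prestige = rng.normal(45, 12, 200)       # roughly bell-shaped
email_hours = rng.exponential(4, 200)    # strongly right-skewed

def normal_by_rule_of_thumb(x):
    """True if both skewness and (excess) kurtosis fall between -1.0 and +1.0."""
    return bool(-1.0 <= stats.skew(x) <= 1.0 and -1.0 <= stats.kurtosis(x) <= 1.0)

print(normal_by_rule_of_thumb(prestige))     # bell-shaped sample passes
print(normal_by_rule_of_thumb(email_hours))  # skewed sample fails
```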

Assumption of Normality: Histograms and Normality Plots On the left side of the slide are the histogram and normality plot for occupational prestige, which could reasonably be characterized as normal. Time using email, on the right, is not normally distributed.

Assumption of Normality: Hypothesis test of normality The hypothesis test for normality tests the null hypothesis that the variable is normal, i.e. the actual distribution of the variable fits the pattern we would expect if it is normal. If we fail to reject the null hypothesis, we conclude that the distribution is normal. The distributions of both variables depicted on the previous slide are associated with low significance values that lead to rejecting the null hypothesis and concluding that neither occupational prestige nor time using email is normally distributed.

Assumption of Normality: Skewness, kurtosis, and normality Using the rule of thumb that a variable is reasonably close to normal if its skewness and kurtosis have values between –1.0 and +1.0, we would decide that occupational prestige is normally distributed and time using email is not. We will use this rule of thumb for normality in our strategy for solving problems.

Assumption of Normality: Transformations When a variable is not normally distributed, we can create a transformed variable and test it for normality. If the transformed variable is normally distributed, we can substitute it in our analysis. Three common transformations are: the logarithmic transformation, the square root transformation, and the inverse transformation. All of these change the measuring scale on the horizontal axis of a histogram to produce a transformed variable that is mathematically equivalent to the original variable.
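These three transformations can be sketched in Python as follows. The variable and the constants are hypothetical: adding 1 before taking the log or the inverse is one common convention for guarding against zeros, and the inverse is negated so that larger raw values remain larger after transformation.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed variable (e.g. hours using email)
email = pd.Series([0.5, 1, 1, 2, 2, 3, 4, 6, 10, 25])

transformed = pd.DataFrame({
    "log":     np.log10(email + 1),  # logarithmic transformation
    "sqrt":    np.sqrt(email),       # square root transformation
    "inverse": -1 / (email + 1),     # inverse transformation (negated)
})

# Re-check the skewness of each candidate against the rule of thumb:
print(transformed.apply(stats.skew).round(2))
```

If one of the transformed versions satisfies the normality criterion, it can be substituted for the original variable in the analysis.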

Assumption of Normality: Computing Transformations We will use SPSS scripts as described below to test assumptions and compute transformations. For additional details on the mechanics of computing transformations, see “Computing Transformations.”

Assumption of Normality: When transformations do not work When none of the transformations induces normality in a variable, including that variable in the analysis will reduce our effectiveness at identifying statistical relationships, i.e. we lose power. We do have the option of changing the way the information in the variable is represented, e.g. substitute several dichotomous variables for a single metric variable.
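The substitution of dichotomous variables for a metric variable can be sketched with pandas; the cut points and band labels below are purely illustrative.

```python
import pandas as pd

# Hypothetical metric variable that resists transformation
hours = pd.Series([0, 1, 2, 5, 8, 20, 40])

# Recode into ordered bands, then represent the bands as k-1 dummy variables
bands = pd.cut(hours, bins=[-1, 2, 10, 100], labels=["low", "medium", "high"])
dummies = pd.get_dummies(bands, prefix="hours", drop_first=True)
print(dummies.columns.tolist())  # ['hours_medium', 'hours_high']
```

Dropping the first category keeps the dummies linearly independent, with the omitted band serving as the reference group in the regression.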

Assumption of Normality: Computing “Explore” descriptive statistics To compute the statistics needed for evaluating the normality of a variable, select the Explore… command from the Descriptive Statistics menu.

Assumption of Normality: Adding the variable to be evaluated First, click on the variable to be included in the analysis to highlight it. Second, click on the right arrow button to move the highlighted variable to the Dependent List.

Assumption of Normality: Selecting statistics to be computed To select the statistics for the output, click on the Statistics… command button.

Assumption of Normality: Including descriptive statistics First, click on the Descriptives checkbox to select it. Clear the other checkboxes. Second, click on the Continue button to complete the request for statistics.

Assumption of Normality: Selecting charts for the output To select the diagnostic charts for the output, click on the Plots… command button.

Assumption of Normality: Including diagnostic plots and statistics First, click on the None option button on the Boxplots panel, since boxplots are not as helpful as other charts in assessing normality. Second, click on the Normality plots with tests checkbox to include normality plots and the hypothesis tests for normality. Third, click on the Histogram checkbox to include a histogram in the output. (You may want to examine the stem-and-leaf plot as well, though I find it less useful.) Finally, click on the Continue button to complete the request.

Assumption of Normality: Completing the specifications for the analysis Click on the OK button to complete the specifications for the analysis and request SPSS to produce the output.

Assumption of Normality: The histogram An initial impression of the normality of the distribution can be gained by examining the histogram. In this example, the histogram shows a substantial violation of normality caused by an extremely large value in the distribution.

Assumption of Normality: The normality plot The problem with the normality of this variable’s distribution is reinforced by the normality plot. If the variable were normally distributed, the red dots would fit the green line very closely. In this case, the red points in the upper right of the chart indicate the severe skewing caused by the extremely large data values.

Assumption of Normality: The test of normality Since the sample size is larger than 50, we use the Kolmogorov-Smirnov test. If the sample size were 50 or less, we would use the Shapiro-Wilk statistic instead. The null hypothesis for the test of normality states that the actual distribution of the variable is equal to the expected distribution, i.e., the variable is normally distributed. Since the probability associated with the test of normality (<0.001) is less than or equal to the level of significance (0.01), we reject the null hypothesis and conclude that total hours spent on the Internet is not normally distributed. (Note: we report the probability as <0.001 instead of .000 to be clear that the probability is not really zero.)
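The same decision rule can be sketched with scipy; the data is simulated and heavily skewed, so the test should reject. Note that SPSS applies the Lilliefors correction to its Kolmogorov-Smirnov test, which plain `scipy.stats.kstest` on standardized scores does not, so the p-values will not match SPSS exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
internet_hours = rng.lognormal(1.0, 0.9, 120)  # hypothetical, right-skewed

# Mirror the slide's rule: Shapiro-Wilk for n of 50 or less, otherwise K-S
n = len(internet_hours)
if n <= 50:
    stat, p = stats.shapiro(internet_hours)
else:
    z = (internet_hours - internet_hours.mean()) / internet_hours.std(ddof=1)
    stat, p = stats.kstest(z, "norm")

# Report very small probabilities as p < 0.001 rather than .000
print(f"n={n}, p={p:.2g}")
```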

Assumption of Normality: The rule of thumb for skewness and kurtosis Using the rule of thumb for evaluating normality with the skewness and kurtosis statistics, we look at the table of descriptive statistics. The skewness and kurtosis for the variable both exceed the rule of thumb criteria of 1.0. The variable is not normally distributed.

Assumption of Linearity Linearity means that the amount of change, or rate of change, between scores on two variables is constant for the entire range of scores for the variables. Linearity characterizes the relationship between two metric variables. It is tested for the pairs formed by the dependent variable and each metric independent variable in the analysis. There are relationships that are not linear. The relationship between learning and time may not be linear: learning a new subject shows rapid gains at first, then the pace slows down over time. This is often referred to as a learning curve. Population growth may not be linear; the pattern often shows growth at increasing rates over time.

Assumption of Linearity: Evaluating linearity There are both graphical and statistical methods for evaluating linearity. Graphical methods include the examination of scatterplots, often overlaid with a trendline. While commonly recommended, this strategy is difficult to implement. Statistical methods include diagnostic hypothesis tests for linearity, a rule of thumb that says a relationship is linear if the difference between the linear correlation coefficient (r) and the nonlinear correlation coefficient (eta) is small, and examining patterns of correlation coefficients.

Assumption of Linearity: Interpreting scatterplots The advice for interpreting linearity is often phrased as looking for a cigar-shaped band, which is very evident in this plot.

Assumption of Linearity: Interpreting scatterplots Sometimes, a scatterplot shows a clearly nonlinear pattern that requires transformation, like the one shown in the scatterplot.

Assumption of Linearity: Scatterplots that are difficult to interpret The correlations for both of these relationships are low. The linearity of the relationship on the right can be improved with a transformation; the plot on the left cannot. However, this is not necessarily obvious from the scatterplots.

Assumption of Linearity: Using correlation matrices Creating a correlation matrix for the dependent variable and the original and transformed variations of the independent variable provides us with a pattern that is easier to interpret. The information that we need is in the first column of the matrix which shows the correlation and significance for the dependent variable and all forms of the independent variable.

Assumption of Linearity: The pattern of correlations for no relationship The correlation between the two variables is very weak and statistically non-significant. If we viewed this as a hypothesis test for the significance of r, we would conclude that there is no relationship between these variables. Moreover, none of the significance tests for the correlations with the transformed versions of the independent variable are statistically significant. There is no relationship between these variables; the problem is not non-linearity.

Assumption of Linearity: Correlation pattern suggesting transformation The correlation between the two variables is very weak and statistically non-significant. If we viewed this as a hypothesis test for the significance of r, we would conclude that there is no relationship between these variables. However, the probability associated with the larger correlation for the logarithmic transformation is statistically significant, suggesting that this is a transformation we might want to use in our analysis.

Assumption of Linearity: Correlation pattern suggesting substitution Should it happen that the correlation between a transformed independent variable and the dependent variable is substantially stronger than the relationship between the untransformed independent variable and the dependent variable, the transformation should be considered even if the relationship involving the untransformed independent variable is statistically significant. A difference of +0.20 or -0.20, or more, would be considered substantial enough since a change of this size would alter our interpretation of the relationship.
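The correlation-pattern comparison can be sketched in Python. The data below is simulated so that the dependent variable tracks the log of the independent variable; the variable names and constants are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.uniform(1, 100, 200)                 # hypothetical independent variable
y = np.log10(x) + rng.normal(0, 0.15, 200)   # DV built to track log(x), not x

# Original variable alongside its common transformations for linearity
forms = pd.DataFrame({"x": x, "log_x": np.log10(x),
                      "sqrt_x": np.sqrt(x), "inv_x": -1 / x})
r = forms.corrwith(pd.Series(y))

# Slide's guideline: a transformation whose correlation is stronger by
# 0.20 or more is substantial enough to consider substituting.
print(r.round(3).sort_values(ascending=False))
```

Here the logarithmic form should show the strongest correlation with the dependent variable, matching the pattern the slide describes.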

Assumption of Linearity: Transformations When a relationship is not linear, we can transform one or both variables to achieve a relationship that is linear. Three common transformations to induce linearity are: the logarithmic transformation, the square root transformation, and the inverse transformation. All of these transformations produce a new variable that is mathematically equivalent to the original variable, but expressed in different measurement units, e.g. logarithmic units instead of decimal units.

Assumption of Linearity: When transformations do not work When none of the transformations induces linearity in a relationship, our statistical analysis will underestimate the presence and strength of the relationship, i.e. we lose power. We do have the option of changing the way the information in the variables is represented, e.g. substituting several dichotomous variables for a single metric variable. This bypasses the assumption of linearity while still attempting to incorporate the information about the relationship in the analysis.

Assumption of Linearity: Creating the scatterplot Suppose we are interested in the linearity of the relationship between "hours per day watching TV" and "total hours spent on the Internet". The most commonly recommended strategy for evaluating linearity is visual examination of a scatter plot. To obtain a scatter plot in SPSS, select the Scatter… command from the Graphs menu.

Assumption of Linearity: Selecting the type of scatterplot First, click on the thumbnail sketch of a simple scatterplot to highlight it. Second, click on the Define button to specify the variables to be included in the scatterplot.

Assumption of Linearity: Selecting the variables First, move the dependent variable netime to the Y Axis text box. Second, move the independent variable tvhours to the X Axis text box. Third, click on the OK button to complete the specifications for the scatterplot. If a problem statement mentions a relationship between two variables without clearly indicating which is the independent variable and which is the dependent variable, the first mentioned variable is taken to be the independent variable.

Assumption of Linearity: The scatterplot The scatterplot is produced in the SPSS output viewer. The points in a scatterplot are considered linear if they form a cigar-shaped elliptical band. The pattern in this scatterplot is not really clear.

Assumption of Linearity: Adding a trendline To try to determine if the relationship is linear, we can add a trendline to the chart. To add a trendline to the chart, we need to open the chart for editing. To open the chart for editing, double click on it.

Assumption of Linearity: The scatterplot in the SPSS Chart Editor The chart that we double clicked on is opened for editing in the SPSS Chart Editor. To add the trend line, select the Options… command from the Chart menu.

Assumption of Linearity: Requesting the fit line In the Scatterplot Options dialog box, we click on the Total checkbox in the Fit Line panel in order to request the trend line. Click on the Fit Options… button to request the r² coefficient of determination as a measure of the strength of the relationship.

Assumption of Linearity: Requesting r² First, the Linear regression thumbnail sketch should be highlighted as the type of fit line to be added to the chart. Second, click on the Display R-square in Legend checkbox to add this item to our output. Third, click on the Continue button to complete the options request.

Assumption of Linearity: Completing the request for the fit line Click on the OK button to complete the request for the fit line.

Assumption of Linearity: The fit line and r² The red fit line is added to the chart. The value of r² (0.0460) suggests that the relationship is weak.
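The fit line and r² can be reproduced outside SPSS with scipy's `linregress`. The data below is simulated to give a similarly weak relationship; the variable names follow the slides but the values are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
tvhours = rng.integers(0, 8, 150).astype(float)           # hypothetical predictor
netime = 1.5 + 0.4 * tvhours + rng.normal(0, 3.0, 150)    # mostly noise

fit = stats.linregress(tvhours, netime)
r_squared = fit.rvalue ** 2

# A small r-squared, like the slide's 0.0460, indicates a weak relationship
print(f"fit line: y = {fit.intercept:.2f} + {fit.slope:.2f}x,  r^2 = {r_squared:.3f}")
```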

Assumption of Linearity: Computing the transformations There are four transformations that we can use to achieve or improve linearity. The compute dialogs for these four transformations for linearity are shown.

Assumption of Linearity: Creating the scatterplot matrix To create the scatterplot matrix, select the Scatter… command in the Graphs menu.

Assumption of Linearity: Selecting type of scatterplot First, click on the Matrix thumbnail sketch to indicate which type of scatterplot we want. Second, click on the Define button to select the variables for the scatterplot.

Assumption of Linearity: Specifications for scatterplot matrix First, move the dependent variable, the independent variable and all of the transformations to the Matrix Variables list box. Second, click on the OK button to produce the scatterplot.

Assumption of Linearity: The scatterplot matrix The scatterplot matrix shows a thumbnail sketch of scatterplots for each independent variable or transformation with the dependent variable. The scatterplot matrix may suggest which transformations might be useful.

Assumption of Linearity: Creating the correlation matrix To create the correlation matrix, select the Correlate | Bivariate… command in the Analyze menu.

Assumption of Linearity: Specifications for correlation matrix First, move the dependent variable, the independent variable and all of the transformations to the Variables list box. Second, click on the OK button to produce the correlation matrix.

Assumption of Linearity: The correlation matrix The answers to the problems are based on the correlation matrix. Before we answer the question in this problem, we will use a script to produce the output.

Assumption of Homoscedasticity Homoscedasticity refers to the assumption that the dependent variable exhibits similar amounts of variance across the range of values for an independent variable. While it applies to independent variables at all three measurement levels, the methods that we will use to evaluate homoscedasticity require that the independent variable be non-metric (nominal or ordinal) and the dependent variable be metric (ordinal or interval). When both variables are metric, the assumption is evaluated as part of the residual analysis in multiple regression.

Assumption of Homoscedasticity: Evaluating homoscedasticity Homoscedasticity is evaluated for pairs of variables. There are both graphical and statistical methods for evaluating homoscedasticity. The graphical method is called a boxplot. The statistical method is the Levene statistic, which SPSS computes for the test of homogeneity of variances. Neither of the methods is absolutely definitive.

Assumption of Homoscedasticity: The boxplot Each red box shows the middle 50% of the cases for the group, indicating how spread out the group of scores is. If the variance across the groups is equal, the height of the red boxes will be similar across the groups. If the heights of the red boxes are different, the plot suggests that the variance across groups is not homogeneous. The married group is more spread out than the other groups, suggesting unequal variance.

Assumption of Homoscedasticity: Levene test of the homogeneity of variance The null hypothesis for the test of homogeneity of variance states that the variance of the dependent variable is equal across groups defined by the independent variable, i.e., the variance is homogeneous. Since the probability associated with the Levene statistic (<0.001) is less than or equal to the level of significance, we reject the null hypothesis and conclude that the variance is not homogeneous.
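The Levene test can be sketched with scipy on simulated data; the group names and spreads below are hypothetical, with the married group given a larger spread to echo the slide's boxplot. One detail worth noting: scipy's default is `center="median"` (the Brown-Forsythe variant), while the Levene statistic SPSS reports in One-Way ANOVA centers on the mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Hypothetical "highest degree" scores by marital status
married  = rng.normal(2.0, 1.8, 80)   # larger spread than the other groups
single   = rng.normal(2.0, 0.9, 80)
divorced = rng.normal(2.0, 0.9, 80)

# center="mean" matches the mean-centered Levene statistic SPSS reports
stat, p = stats.levene(married, single, divorced, center="mean")
print(f"Levene statistic = {stat:.2f}, p = {p:.2g}")
if p <= 0.01:
    print("reject H0: the variance is not homogeneous")
```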

Assumption of Homoscedasticity: Transformations When the assumption of homoscedasticity is not supported, we can transform the dependent variable and test it for homoscedasticity. If the transformed variable demonstrates homoscedasticity, we can substitute it in our analysis. We use the same three common transformations that we used for normality: the logarithmic transformation, the square root transformation, and the inverse transformation. All of these change the measuring scale on the horizontal axis of a histogram to produce a transformed variable that is mathematically equivalent to the original variable.

Assumption of Homoscedasticity: When transformations do not work When none of the transformations results in homoscedasticity for the variables in the relationship, including that variable in the analysis will reduce our effectiveness at identifying statistical relationships, i.e. we lose power.

Assumption of Homoscedasticity: Request a boxplot Suppose we want to test for homogeneity of variance: whether the variance in "highest academic degree" is homogeneous for the categories of "marital status." The boxplot provides a visual image of the distribution of the dependent variable for the groups defined by the independent variable. To request a boxplot, choose the BoxPlot… command from the Graphs menu.

Assumption of Homoscedasticity: Specify the type of boxplot First, click on the Simple style of boxplot to highlight it with a rectangle around the thumbnail drawing. Second, click on the Define button to specify the variables to be plotted.

Assumption of Homoscedasticity: Specify the dependent variable First, click on the dependent variable to highlight it. Second, click on the right arrow button to move the dependent variable to the Variable text box.

Assumption of Homoscedasticity: Specify the independent variable First, click on the independent variable to highlight it. Second, click on the right arrow button to move the independent variable to the Category Axis text box.

Assumption of Homoscedasticity: Complete the request for the boxplot To complete the request for the boxplot, click on the OK button.

Assumption of Homoscedasticity: The boxplot Each red box shows the middle 50% of the cases for the group, indicating how spread out the group of scores is. If the variance across the groups is equal, the height of the red boxes will be similar across the groups. If the heights of the red boxes are different, the plot suggests that the variance across groups is not homogeneous. The married group is more spread out than the other groups, suggesting unequal variance.

Assumption of Homoscedasticity: Request the test for homogeneity of variance To compute the Levene test for homogeneity of variance, select the Compare Means | One-Way ANOVA… command from the Analyze menu.

Assumption of Homoscedasticity: Specify the independent variable First, click on the independent variable to highlight it. Second, click on the right arrow button to move the independent variable to the Factor text box.

Assumption of Homoscedasticity: Specify the dependent variable First, click on the dependent variable to highlight it. Second, click on the right arrow button to move the dependent variable to the Dependent List text box.

Assumption of Homoscedasticity: The homogeneity of variance test is an option Click on the Options… button to open the options dialog box.

Assumption of Homoscedasticity: Specify the homogeneity of variance test First, mark the checkbox for the Homogeneity of variance test. All of the other checkboxes can be cleared. Second, click on the Continue button to close the options dialog box.

Assumption of Homoscedasticity: Complete the request for output Click on the OK button to complete the request for the homogeneity of variance test through the One-Way ANOVA procedure.

Assumption of Homoscedasticity: Interpreting the homogeneity of variance test The null hypothesis for the test of homogeneity of variance states that the variance of the dependent variable is equal across groups defined by the independent variable, i.e., the variance is homogeneous. Since the probability associated with the Levene statistic (<0.001) is less than or equal to the level of significance, we reject the null hypothesis and conclude that the variance is not homogeneous.

Using scripts The process of evaluating assumptions requires numerous SPSS procedures and outputs that are time consuming to produce. These procedures can be automated by creating an SPSS script. A script is a program that executes a sequence of SPSS commands. Though writing scripts is not part of this course, we can take advantage of scripts that I use to reduce the burdensome tasks of evaluating assumptions.
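The course script itself is SPSS-specific, but the idea of batching the checks can be sketched in Python. The function and report layout below are hypothetical, not the script's actual output.

```python
import pandas as pd
from scipy import stats

def screen_variables(df, dependent, metric_ivs):
    """Batch the skewness/kurtosis rule of thumb and DV correlations for a
    regression candidate list -- a rough analogue of running the script."""
    report = {}
    for name in [dependent] + metric_ivs:
        s = df[name].dropna()
        report[name] = {
            "skew": round(float(stats.skew(s)), 3),
            "kurtosis": round(float(stats.kurtosis(s)), 3),
            "normal_rule": bool(abs(stats.skew(s)) <= 1
                                and abs(stats.kurtosis(s)) <= 1),
        }
    for name in metric_ivs:
        pair = df[[dependent, name]].dropna()
        report[name]["r_with_dv"] = round(float(pair[dependent].corr(pair[name])), 3)
    return report

# Hypothetical miniature data set:
df = pd.DataFrame({"netime": [1, 2, 3, 4, 5, 6], "tvhours": [2, 4, 6, 8, 10, 12]})
print(screen_variables(df, "netime", ["tvhours"]))
```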

Using a script for evaluating assumptions The script “EvaluatingAssumptionsAndMissingData.exe” will produce all of the output we have used for evaluating assumptions. Navigate to the link “SPSS Scripts and Syntax” on the course web page. Download the script file “EvaluatingAssumptionsAndMissingData.exe” to your computer and install it, following the directions on the web page.

Open the data set in SPSS Before using a script, a data set should be open in the SPSS data editor.

Invoke the script in SPSS To invoke the script, select the Run Script… command in the Utilities menu.

Select the script First, navigate to the folder where you put the script. If you followed the directions, you will have a file with an ".SBS" extension in the C:\SW388R7 folder. If you only see a file with an “.EXE” extension in the folder, you should double click on that file to extract the script file to the C:\SW388R7 folder. Second, click on the script name to highlight it. Third, click on the Run button to start the script.

The script dialog The script dialog box acts similarly to SPSS dialog boxes. You select the variables to include in the analysis and choose options for the output.

Complete the specifications - 1 Move the dependent and independent variables from the list of variables to the list boxes. Metric and nonmetric variables are moved to separate lists so the computer knows how you want them treated. You must also indicate the level of measurement for the dependent variable. By default the metric option button is marked.

Complete the specifications - 2 Mark the option button for the type of output you want the script to compute. Select the transformations to be tested. Click on the OK button to produce the output.

The script finishes If your SPSS output viewer is open, you will see the output produced in that window. Since it may take a while to produce the output, and since there are times when it appears that nothing is happening, there is an alert to tell you when the script is finished. Unless you are absolutely sure something has gone wrong, let the script run until you see this alert. When you see this alert, click on the OK button.

Output from the script - 1 The script will produce lots of output. Additional descriptive material in the titles should help link specific outputs to specific tasks. Scroll through the output to locate the results needed to answer the question.

Closing the script dialog box The script dialog box does not close automatically because we often want to run another test right away. There are two methods for closing the dialog box. Click on the X close box to close the script. Click on the Cancel button to close the script.

Problem 1 In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.01 for evaluating missing data and assumptions. In pre-screening the data for use in a multiple regression of the dependent variable "total hours spent on the Internet" [netime] with the independent variables "age" [age], "sex" [sex], and "income" [rincom98], the evaluation of the assumptions of normality, linearity, and homogeneity of variance did not indicate any need for a caution to be added to the interpretation of the analysis. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Level of measurement Since we are pre-screening for a multiple regression problem, we should make sure we satisfy the level of measurement requirements before proceeding. "Total hours spent on the Internet" [netime] is interval, satisfying the metric level of measurement requirement for the dependent variable. "Age" [age] is interval, satisfying the metric or dichotomous level of measurement requirement for independent variables. "Sex" [sex] is dichotomous, satisfying the metric or dichotomous level of measurement requirement for independent variables. "Income" [rincom98] is ordinal, satisfying the metric or dichotomous level of measurement requirement for independent variables. Since some data analysts do not agree with this convention of treating an ordinal variable as metric, a note of caution should be included in our interpretation.

Run the script to test normality - 1 To run the script to test assumptions, choose the Run Script… command from the Utilities menu.

Run the script to test normality - 2 First, navigate to the SW388R7 folder on your computer. Second, click on the script name to select it: EvaluatingAssumptionsAndMissingData.SBS Third, click on the Run button to open the script.

Run the script to test normality - 3 First, move the variables to the list boxes based on the role that the variable plays in the analysis and its level of measurement. Second, click on the Normality option button to request that SPSS produce the output needed to evaluate the assumption of normality. Third, mark the checkboxes for the transformations that we want to test in evaluating the assumption. Fourth, click on the OK button to produce the output.

Normality of the dependent variable The dependent variable "total hours spent on the Internet" [netime] did not satisfy the criteria for a normal distribution. Both the skewness (3.532) and kurtosis (15.614) fell outside the range from -1.0 to +1.0.

Normality of transformed dependent variable Since "total hours spent on the Internet" [netime] did not satisfy the criteria for normality, we examine the skewness and kurtosis of each of the transformations to see if any of them satisfy the criteria. The "log of total hours spent on the Internet [LGNETIME=LG10(NETIME)]" satisfied the criteria for a normal distribution. The skewness of the distribution (-0.150) was between -1.0 and +1.0 and the kurtosis of the distribution (0.127) was between -1.0 and +1.0. The "log of total hours spent on the Internet [LGNETIME=LG10(NETIME)]" was substituted for "total hours spent on the Internet" [netime] in the analysis.
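The screening rule applied here (skewness and kurtosis both between -1.0 and +1.0, re-checked after a log transform) can be sketched outside SPSS. The snippet below uses Python's scipy on a simulated right-skewed "hours" variable; the data are illustrative, not the actual GSS values.

```python
# Sketch of the normality screen described above: skewness and excess
# kurtosis must both fall between -1.0 and +1.0, first for the raw
# variable and then for its base-10 log. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
netime = rng.lognormal(mean=1.0, sigma=1.0, size=200)  # right-skewed

def passes_criteria(x):
    # bias=False and fisher=True approximate the sample skewness and
    # excess kurtosis statistics that SPSS reports.
    s = stats.skew(x, bias=False)
    k = stats.kurtosis(x, fisher=True, bias=False)
    return (-1.0 <= s <= 1.0) and (-1.0 <= k <= 1.0), s, k

ok_raw, s_raw, k_raw = passes_criteria(netime)
ok_log, s_log, k_log = passes_criteria(np.log10(netime))

print(f"raw: skew={s_raw:.3f} kurtosis={k_raw:.3f} passes={ok_raw}")
print(f"log: skew={s_log:.3f} kurtosis={k_log:.3f} passes={ok_log}")
```

The raw variable fails the screen while its base-10 log passes, so the log transform would be substituted in the analysis, as on the slide.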

Normality of the independent variables - 1 The independent variable "age" [age] satisfied the criteria for a normal distribution. The skewness of the distribution (0.595) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.351) was between -1.0 and +1.0.

Normality of the independent variables - 2 The independent variable "income" [rincom98] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.686) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.253) was between -1.0 and +1.0.

Run the script to test linearity - 1 If the script was not closed after it was used for normality, we can take advantage of the specifications already entered. If the script was closed, re-open it as you would for normality. First, click on the Linearity option button to request that SPSS produce the output needed to evaluate the assumption of linearity. When the linearity option is selected, a default set of transformations to test is marked.

Run the script to test linearity - 2 Since we have already decided to use the log of the dependent variable to satisfy normality, that is the form of the dependent variable we want to evaluate with the independent variables. Mark this checkbox for the dependent variable and clear the others. Click on the OK button to produce the output.

Linearity test with age of respondent The assessment of the linear relationship between "log of total hours spent on the Internet [LGNETIME=LG10(NETIME)]" and "age" [age] indicated that the relationship was weak, rather than nonlinear. The statistical probabilities associated with the correlation coefficients measuring the relationship with the untransformed independent variable (r=0.074, p=0.483), the logarithmic transformation (r=0.119, p=0.257), the square root transformation (r=0.096, p=0.362), and the inverse transformation (r=0.164, p=0.116), were all greater than the level of significance for testing assumptions (0.01). There was no evidence that the assumption of linearity was violated.

Linearity test with respondent’s income The assessment of the linear relationship between "log of total hours spent on the Internet [LGNETIME=LG10(NETIME)]" and "income" [rincom98] indicated that the relationship was weak, rather than nonlinear. The statistical probabilities associated with the correlation coefficients measuring the relationship with the untransformed independent variable (r=-0.053, p=0.658), the logarithmic transformation (r=0.063, p=0.600), the square root transformation (r=0.060, p=0.617), and the inverse transformation (r=0.073, p=0.540), were all greater than the level of significance for testing assumptions (0.01). There was no evidence that the assumption of linearity was violated.
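The linearity screen used in both tests above (correlating the dependent variable with the raw independent variable and with its log, square-root, and inverse transformations, then comparing each p-value to 0.01) can be sketched as follows. The data are simulated, not the GSS values reported on the slides.

```python
# Sketch of the linearity screen described above: correlate the DV with
# the untransformed IV and with its standard transformations, comparing
# each p-value with the 0.01 level. Data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
iv = rng.uniform(1, 50, size=90)            # must be positive to transform
dv = 0.02 * iv + rng.normal(0, 1, size=90)  # weak linear relationship

forms = {
    "untransformed": iv,
    "log": np.log10(iv),
    "sqrt": np.sqrt(iv),
    "inverse": 1.0 / iv,
}
alpha = 0.01
for name, x in forms.items():
    r, p = stats.pearsonr(x, dv)
    verdict = "significant" if p <= alpha else "not significant"
    print(f"{name:13s} r={r:+.3f} p={p:.3f} {verdict}")
```

If no form reaches significance, the relationship is interpreted as weak rather than nonlinear, which is the conclusion drawn on the slides.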

Run the script to test homogeneity of variance - 1 If the script was not closed after it was used for normality, we can take advantage of the specifications already entered. If the script was closed, re-open it as you would for normality. First, click on the Homogeneity of variance option button to request that SPSS produce the output needed to evaluate the assumption of homogeneity. When the homogeneity of variance option is selected, a default set of transformations to test is marked.

Run the script to test homogeneity of variance - 2 In this problem, we have already decided to use the log transformation for the dependent variable, so it is the only one we need to test. Next, clear all of the transformation checkboxes except for Logarithmic. Finally, click on the OK button to produce the output.

Levene test of homogeneity of variance Based on the Levene test, the variance in "log of total hours spent on the Internet [LGNETIME=LG10(NETIME)]" was homogeneous for the categories of "sex" [sex]. The probability associated with the Levene statistic (0.166) was p=0.685, greater than the level of significance for testing assumptions (0.01). The null hypothesis that the group variances were equal was not rejected. The homogeneity of variance assumption was satisfied.

Answer 1 In pre-screening the data for use in a multiple regression of the dependent variable "total hours spent on the Internet" [netime] with the independent variables "age" [age], "sex" [sex], and "income" [rincom98], the evaluation of the assumptions of normality, linearity, and homogeneity of variance did not indicate any need for a caution to be added to the interpretation of the analysis. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic The logarithmic transformation of the dependent variable [LGNETIME=LG10(NETIME)] solved the only problem with normality that we encountered. In that form, the relationships with the metric independent variables were weak, but there was no evidence of nonlinearity. The variance of the log transform of the dependent variable was homogeneous for the categories of the nonmetric variable sex. No cautions were needed because of a violation of assumptions. A caution was needed because respondent’s income was ordinal level. The answer to the problem is true with caution.

Problem 2 In the dataset 2001WorldFactbook, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.01 for evaluating missing data and assumptions. In pre-screening the data for use in a multiple regression of the dependent variable "life expectancy at birth" [lifeexp] with the independent variables "population growth rate" [pgrowth], "percent of the total population who was literate" [literacy], and "per capita GDP" [gdp], the evaluation of the assumptions of normality, linearity, and homogeneity of variance did not indicate any need for a caution to be added to the interpretation of the analysis. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Level of measurement Since we are pre-screening for a multiple regression problem, we should make sure we satisfy the level of measurement requirements before proceeding. "Life expectancy at birth" [lifeexp] is interval, satisfying the metric level of measurement requirement for the dependent variable. "Population growth rate" [pgrowth], "percent of the total population who was literate" [literacy], and "per capita GDP" [gdp] are interval, satisfying the metric or dichotomous level of measurement requirement for independent variables.

Run the script to test normality - 1 To run the script to test assumptions, choose the Run Script… command from the Utilities menu.

Run the script to test normality - 2 First, navigate to the SW388R7 folder on your computer. Second, click on the script name to select it: EvaluatingAssumptionsAndMissingData.SBS Third, click on the Run button to open the script.

Run the script to test normality - 3 First, move the variables to the list boxes based on the role that the variable plays in the analysis and its level of measurement. Second, click on the Normality option button to request that SPSS produce the output needed to evaluate the assumption of normality. Third, mark the checkboxes for the transformations that we want to test in evaluating the assumption. Fourth, click on the OK button to produce the output.

Normality of the dependent variable The dependent variable "life expectancy at birth" [lifeexp] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.997) was between -1.0 and +1.0 and the kurtosis of the distribution (0.005) was between -1.0 and +1.0.

Normality of the first independent variable The independent variable "population growth rate" [pgrowth] did not satisfy the criteria for a normal distribution. Both the skewness (2.885) and kurtosis (22.665) fell outside the range from -1.0 to +1.0.

Normality of transformed independent variable Neither the logarithmic (skew=-0.218, kurtosis=1.277), the square root (skew=0.873, kurtosis=5.273), nor the inverse transformation (skew=-1.836, kurtosis=5.763) induced normality in the variable "population growth rate" [pgrowth]. A caution was added to the findings.

Normality of the second independent variable The independent variable "percent of the total population who was literate" [literacy] did not satisfy the criteria for a normal distribution. The kurtosis of the distribution (0.081) was between -1.0 and +1.0, but the skewness of the distribution (-1.112) fell outside the range from -1.0 to +1.0.

Normality of transformed independent variable Since the distribution was skewed to the left, it was necessary to reflect, or reverse code, the values for the variable before computing the transformation. The "square root of percent of the total population who was literate (using reflected values) [SQLITERA=SQRT(101-LITERACY)]" satisfied the criteria for a normal distribution. The skewness of the distribution (0.567) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.964) was between -1.0 and +1.0. The "square root of percent of the total population who was literate (using reflected values) [SQLITERA=SQRT(101-LITERACY)]" was substituted for "percent of the total population who was literate" [literacy] in the analysis.
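The reflect-then-transform step described here can be sketched outside SPSS: reverse-code a left-skewed variable as (maximum + 1 - x) so its long tail points right, then apply the square root. The "literacy" values below are simulated, not the actual 2001WorldFactbook data.

```python
# Sketch of the reflection transformation described above for a
# left-skewed percentage variable. Data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
literacy = 100 - rng.lognormal(mean=2.0, sigma=0.8, size=150)
literacy = np.clip(literacy, 1, 100)  # keep percentages in range

skew_before = stats.skew(literacy, bias=False)

# The maximum is 100, so the slide's formula is SQRT(101 - LITERACY)
sq_litera = np.sqrt(101 - literacy)
skew_after = stats.skew(sq_litera, bias=False)

print(f"skew before reflection:  {skew_before:.3f}")
print(f"skew after SQRT(101-x):  {skew_after:.3f}")
# Note: reflection reverses the direction of the variable, so the signs
# of its correlations with other variables flip after the substitution.
```

A practical consequence worth remembering: because the reflected variable runs in the opposite direction, the signs of its regression coefficients and correlations must be interpreted in reverse.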

Normality of the third independent variable The independent variable "per capita GDP" [gdp] did not satisfy the criteria for a normal distribution. The kurtosis of the distribution (0.475) was between -1.0 and +1.0, but the skewness of the distribution (1.207) fell outside the range from -1.0 to +1.0.

Normality of transformed independent variable The "square root of per capita GDP [SQGDP=SQRT(GDP)]" satisfied the criteria for a normal distribution. The skewness of the distribution (0.614) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.773) was between -1.0 and +1.0. The "square root of per capita GDP [SQGDP=SQRT(GDP)]" was substituted for "per capita GDP" [gdp] in the analysis.

Run the script to test linearity - 1 If the script was not closed after it was used for normality, we can take advantage of the specifications already entered. If the script was closed, re-open it as you would for normality. First, click on the Linearity option button to request that SPSS produce the output needed to evaluate the assumption of linearity. When the linearity option is selected, a default set of transformations to test is marked. Click on the OK button to produce the output.

Linearity test with population growth rate The assessment of the linearity of the relationship between "life expectancy at birth" [lifeexp] and "population growth rate" [pgrowth] indicated that the relationship could be considered linear because the probability associated with the correlation coefficient for the relationship (r=-0.262) was statistically significant (p<0.001) and none of the statistically significant transformations for population growth rate had a relationship that was substantially stronger. The relationship between the untransformed variables was assumed to satisfy the assumption of linearity.

Linearity test with population literacy The transformation "square root of percent of the total population who was literate (using reflected values) [SQLITERA=SQRT(101-LITERACY)]" was incorporated in the analysis in the evaluation of normality. Additional transformations for linearity were not considered.

Linearity test with per capita GDP The transformation "square root of per capita GDP [SQGDP=SQRT(GDP)]" was incorporated in the analysis in the evaluation of normality. Additional transformations for linearity were not considered.

Run the script to test homogeneity of variance - 1 There were no nonmetric variables in this analysis, so the test of homogeneity of variance was not conducted.

Answer 2 In pre-screening the data for use in a multiple regression of the dependent variable "life expectancy at birth" [lifeexp] with the independent variables "population growth rate" [pgrowth], "percent of the total population who was literate" [literacy], and "per capita GDP" [gdp], the evaluation of the assumptions of normality, linearity, and homogeneity of variance did not indicate any need for a caution to be added to the interpretation of the analysis. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic Two transformations were substituted to satisfy the assumption of normality: the "square root of percent of the total population who was literate (using reflected values) [SQLITERA=SQRT(101-LITERACY)]" for "percent of the total population who was literate" [literacy], and the "square root of per capita GDP [SQGDP=SQRT(GDP)]" for "per capita GDP" [gdp]. However, none of the transformations induced normality in the variable "population growth rate" [pgrowth], so a caution was added to the findings. The answer to the problem is false. A caution was needed because "population growth rate" [pgrowth] did not satisfy the assumption of normality and none of the transformations were successful in inducing normality.

Steps in evaluating assumptions: level of measurement The following is a guide to the decision process for answering problems about assumptions for multiple regression: Is the dependent variable metric and the independent variables metric or dichotomous? If no, the problem is an incorrect application of a statistic. If yes, proceed to the evaluation of the assumptions.

Steps in evaluating assumptions: assumption of normality for metric variable Does the metric variable satisfy the criteria for a normal distribution? If yes, the assumption is satisfied; use the untransformed variable in the analysis. If no, does one or more of the transformations satisfy the criteria for a normal distribution? If yes, the assumption is satisfied; use the transformed variable with the smallest skew. If no, the assumption is not satisfied; use the untransformed variable in the analysis and add a caution to the interpretation.
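This normality decision rule can be sketched as a small function. The -1.0 to +1.0 criteria and the smallest-skew tie-break follow the slides; the function name, the candidate transformations, and the simulated data are illustrative, and the variable must be strictly positive for the log and inverse forms.

```python
# Sketch of the normality decision rule above: accept the raw variable
# if skewness and excess kurtosis are within -1.0 to +1.0; otherwise
# try the standard transformations and keep the passing one with the
# smallest absolute skew; otherwise keep the raw variable with a caution.
import numpy as np
from scipy import stats

def screen_normality(x):
    """Return (values_to_use, label, caution_needed)."""
    def skew_kurt(v):
        return (stats.skew(v, bias=False),
                stats.kurtosis(v, fisher=True, bias=False))

    def passes(v):
        s, k = skew_kurt(v)
        return -1.0 <= s <= 1.0 and -1.0 <= k <= 1.0

    if passes(x):
        return x, "untransformed", False
    candidates = {"log": np.log10(x), "sqrt": np.sqrt(x), "inverse": 1.0 / x}
    passing = {name: v for name, v in candidates.items() if passes(v)}
    if passing:
        best = min(passing, key=lambda name: abs(skew_kurt(passing[name])[0]))
        return passing[best], best, False
    return x, "untransformed", True  # assumption not satisfied: add caution

rng = np.random.default_rng(4)
x = rng.lognormal(1.0, 1.0, 300)  # right-skewed, strictly positive
_, label, caution = screen_normality(x)
print(label, caution)
```

For this right-skewed example a transformation passes the screen, so no caution is raised; only when every transformation fails does the rule fall back to the raw variable with a caution.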

Steps in evaluating assumptions: assumption of linearity for metric variables If the dependent variable was transformed for normality, substitute the transformed dependent variable in the test for the assumption of linearity. Was the independent variable transformed for normality? If yes, skip this test. If no: Is the probability of the correlation (r) for the relationship between the IV and the DV less than or equal to the level of significance? If yes: is the probability of the correlation for any transformed IV significant AND its r greater than the r of the untransformed IV by 0.20? If yes, use the transformed variable with the highest r. If no, the assumption is satisfied; use the untransformed independent variable. If the untransformed relationship was not significant: is the probability of the correlation (r) for the relationship between any transformed IV and the DV less than or equal to the level of significance? If yes, use the transformed variable with the highest r. If no, interpret the relationship as weak, not nonlinear; no caution needed.
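The linearity decision rule for an independent variable that was not transformed for normality can likewise be sketched as a function. The 0.01 level and the 0.20 improvement threshold follow the slides; the function name, data, and transformations are illustrative assumptions.

```python
# Sketch of the linearity decision rule above. The IV must be strictly
# positive for the log and inverse forms; data are simulated.
import numpy as np
from scipy import stats

def screen_linearity(iv, dv, alpha=0.01, improvement=0.20):
    """Return (form_to_use, note) following the decision flow above."""
    forms = {
        "untransformed": iv,
        "log": np.log10(iv),
        "sqrt": np.sqrt(iv),
        "inverse": 1.0 / iv,
    }
    results = {name: stats.pearsonr(v, dv) for name, v in forms.items()}
    r_raw, p_raw = results["untransformed"]
    sig = {name: r for name, (r, p) in results.items()
           if name != "untransformed" and p <= alpha}
    if p_raw <= alpha:
        # Raw relationship is significant: switch only if a significant
        # transformation is stronger by the improvement threshold.
        better = {n: r for n, r in sig.items()
                  if abs(r) >= abs(r_raw) + improvement}
        if better:
            best = max(better, key=lambda n: abs(better[n]))
            return best, "use transformed variable with highest r"
        return "untransformed", "assumption satisfied"
    if sig:
        best = max(sig, key=lambda n: abs(sig[n]))
        return best, "use transformed variable with highest r"
    return "untransformed", "weak, not nonlinear; no caution needed"

rng = np.random.default_rng(5)
iv = rng.uniform(1, 100, size=200)
dv = 1.0 / iv + rng.normal(0, 0.05, size=200)  # inverse-shaped relationship
form, note = screen_linearity(iv, dv)
print(form, note)
```

Because the simulated relationship is inverse-shaped, a transformation is substantially stronger than the raw correlation and the rule recommends the transformed form.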

Steps in evaluating assumptions: homogeneity of variance for nonmetric variables If the dependent variable was transformed for normality, substitute the transformed dependent variable in the test for the assumption of homogeneity of variance. Is the probability of the Levene statistic less than or equal to the level of significance? If yes, the assumption is not satisfied; add a caution to the interpretation. If no, the assumption is satisfied.