1 Statistical Data Analysis
Moses Mugolo Kasolo, Training Co-ordinator, Statistician/Data Analyst (GIS). Presentation customized for a research workshop.

2 Training Objectives
By the end of the training, participants shall have acquired knowledge and basic skills in statistical data analysis. Participants should therefore be able to:
prepare data for analysis,
apply different data transformation techniques,
identify the various statistical tests and learn when to apply them in analyzing quantitative data,
create a data entry template and enter data (practical),
perform some basic analysis (practical).

3 What is Statistical Data Analysis?
After collecting data, the researcher becomes concerned with five things:
Checking the questionnaires/schedules (Data Cleaning*)
Sorting out and reducing the information collected (Reliability/Factor Analysis*)
Summarizing the data into tabular forms (Descriptive Statistics)
Analyzing findings to bring out salient features (Inferential Statistics)
Interpreting the results (narrating the story behind the numbers)
Overall, the technique of converting raw data into meaningful statements (including data editing, tabulation, disaggregation, graphing, interpretation and presentation) is what is commonly referred to as Data Analysis.

4 Organizing, Entering & Cleaning Data
After the data collection exercise, data must be organized properly to facilitate data entry and analysis. There are three stages that must be followed to organize your data:
Data Editing
Data Coding
Data Screening/Cleaning
Data editing is a process whereby errors in completed interview schedules, questionnaires, etc. are identified and eliminated whenever possible. Editing is carried out in three stages: in the field by the interviewer; in the office before data is coded; and using computer programs, which can edit the data before it is analyzed.

5 Data Editing continued…
Editing is done to check the following before analysis is done: completeness, accuracy, consistency, and inclusiveness (key variables). Data can thereafter be entered into the computer using the appropriate computer package. Some basic analysis can be done to help detect anomalies in the data entered. Questionnaires should always be numbered to help facilitate this exercise (for reference).

6 Data Coding This is the process whereby verbal information is converted into variables and categories of variables using numbers, so that the data can be easily entered into computers for purposes of analysis.
Variables and their categories/codes:
Gender: Male = 0, Female = 1
Age Groups: 18-25 = 1, 26-33 = 2, 34-41 = 3
Did you vote?: No = 0, Yes = 1
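For those following along in software, here is a minimal sketch of this coding step written in Python with pandas rather than SPSS, using a small hypothetical data frame that mirrors the coding scheme above (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical raw responses as recorded on the questionnaires.
raw = pd.DataFrame({
    "gender":    ["Male", "Female", "Female", "Male"],
    "age_group": ["18-25", "26-33", "34-41", "18-25"],
    "voted":     ["No", "Yes", "Yes", "No"],
})

# Apply the coding scheme from the slide: verbal categories become numeric codes.
coded = pd.DataFrame({
    "gender":    raw["gender"].map({"Male": 0, "Female": 1}),
    "age_group": raw["age_group"].map({"18-25": 1, "26-33": 2, "34-41": 3}),
    "voted":     raw["voted"].map({"No": 0, "Yes": 1}),
})
print(coded)
```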

7 Types of Questions & Answers
The Dichotomous Question The dichotomous question is generally a "yes/no" question. It may also be any question with only two possible responses/answers. Examples: (1) Have you ever purchased a product or service from our website? Yes/No (2) What is your gender? Male/Female Coding for purposes of data analysis: use 1 and 0 (NOT 1 and 2, as is the common practice).

8 Types of Questions & Answers
The Single-Choice Question The single-choice question consists of many possible responses, but you are required to choose only one. What is your highest level of education? Primary Secondary Ordinary Diploma Bachelor’s Master’s PhD Others, specify Coding for purposes of data analysis: start with 1, 2, 3 …

9 Types of Questions & Answers
The Multiple Choice Questions The multiple-choice question consists of many possible responses. They normally ask for multiple answers.  These questions can vary depending on how you state them. Examples: How did you first learn about our web site? Television Radio Newspaper Magazine Word-of-mouth Internet Other: Please Specify __________ Coding for purposes of data analysis: Start with 1, 2, 3…

10 Types of Questions & Answers
Name three different sources from which you learnt about our web site. Television Radio Newspaper Magazine Word-of-mouth Internet Other: Please Specify _______________ Coding for purposes of data analysis: Start with 1, 2, 3 … Create three variables in SPSS, name them Source1, Source2 and Source3

11 Types of Questions & Answers
Tick the sources from which you have learnt about our web site. Television Radio Newspaper Magazine Word-of-mouth Internet Other: Please Specify _______________ Coding for purposes of data analysis: Create a variable for each of the above sources, then enter 1 for sources ticked and 0 for non-sources.

12 Types of Questions & Answers
Rank Order Scaling Rank order scaling questions allow a certain set of brands or products to be ranked based upon a specific attribute or characteristic. Example: Please rank the following brands according to  their reliability. Place a "1" next to the brand that is most reliable, a "2" next to the brand that is next most reliable, and so on.  Honda Toyota Mazda Ford What are you interested in? And how do you code them?

13 Types of Questions & Answers
The Rating Scale A rating scale question requires a person to rate a product or brand along a well-defined, evenly spaced continuum. Rating scales are often used to measure the direction and intensity of attitudes. The following is an example of a comparative rating scale question: Example: How do you best describe your last experience purchasing a product or service on our website? Very pleasant Somewhat pleasant Neither pleasant nor unpleasant Somewhat unpleasant Very unpleasant How do you code the above?

14 Types of Questions & Answers
The Semantic Differential Scale The semantic differential scale asks a person to rate a product, brand, or company based upon a seven-point rating scale that has two bipolar adjectives, one at each end. The following is an example of a semantic differential scale question. Example: Would you say our web site is: (7) Very Attractive (6) (5) (4) (3) (2) (1) Very Unattractive Notice that unlike the rating scale, the semantic differential scale does not label a neutral or middle selection. A person must choose, to a certain extent, one or the other adjective.

15 Types of Questions & Answers
The Stapel Scale The Stapel scale asks a person to rate a brand, product, or service according to a certain characteristic on a scale from +5 to -5, indicating how well the characteristic describes the product or service. The following is an example of a Stapel scale question: When thinking about Global Information Systems Ltd (GIS), do you believe that the word "innovative" appropriately describes or poorly describes the company? On a scale of +5 to -5, with +5 being "a very good description of GIS" and -5 being "a poor description of GIS", how do you rate GIS according to the word "innovative"?

16 Types of Questions & Answers
(+5) Describes very well (+4) (+3) (+2) (+1) Innovative (-1) (-2) (-3) (-4) (-5) Poorly Describes

17 Types of Questions & Answers
The Constant Sum Question A constant sum question permits collection of "ratio" data, meaning that the data is able to express the relative value or importance of the options (option A is twice as important as option B).  Example: The following question asks you to divide 100 points between a set of options to show the value or importance you place on each option.  Distribute the 100 points giving the more important reasons a greater number of points.

18 Types of Questions & Answers
When thinking about the reasons you purchased our data mining software, please rate the following reasons according to their relative importance. Seamless integration with other software __________ User friendliness of software __________ Ability to manipulate algorithms      __________ Level of pre- and post-purchase service    __________ Level of value for the price                       __________ Convenience of purchase/quick delivery   __________ Others, specify __________ Total                                                           100 points

19 Types of Questions & Answers
The Open-Ended Question The open-ended question seeks to explore the qualitative, in-depth aspects of a particular topic or issue. It gives a person the chance to respond in detail. Although open-ended questions are important, they are time-consuming and should not be over-used if you intend to perform a quantitative analysis. Example: What products or services were you looking for that were not found on our website? Note: If you want to add an "Other" answer to a multiple-choice question, you would use branching instructions to route to an open-ended question to find out what "Other" really is.

20 Types of Questions & Answers
The Demographic Questions Demographic questions are an integral part of any questionnaire. They are used to identify characteristics such as age, gender, income, number of children, and so forth. Examples: How old are you? What is your yearly income? How many children are you responsible for? What type of data is this? And how do you code it?

21 Data Screening or Cleaning
Used to identify miscoded values (e.g. the only valid responses are Yes – 1, No – 0, so there cannot be another code). Used also to identify missing data (key variables should not have missing values). Used to identify messy or inconsistent data (e.g. Smoker = No, No. of cigars per day = 40). It helps to find possible outliers, non-normal distributions and other anomalies in the data. SPSS uses validation rules for data cleaning. There are three types of rules for validating a data set: Single-variable rules, Cross-variable rules, and Multi-case rules.

22 Single-Variable Rules
Validation rules that check internal inconsistencies, such as invalid values and cases within a variable, are known as Single-Variable Rules. These rules consist of a set of checks that can be applied to a variable. Normally, checks for out-of-range or invalid values and missing values are included in this category. For example, a value of 5 was entered for 'highest education level', whose valid codes are only 0, 1, 2 and 3. Similarly, single-variable rules can be used to check whether values other than 0 and 1 (or 'Male' and 'Female') are entered in the variable 'sex of respondent'.

23 Single-Variable Rules
There are four stages in the single-variable rule. Stage I: Obtain a list of valid values or ranges from the codebook. Stage II: Construct a frequency table for the variable under test. If there are no invalid values displayed, the variable under observation is 'valid' under the single-variable rule. Stage III (if there are invalid values): Extract all cases with invalid values for the variable in the data set. Stage IV: Identify the questionnaires where those erroneous cases come from.

24 Cross-Variable Rules Rules for checking inconsistencies in a variable through the values of other variables in the same case are called Cross-Variable Rules. Users have to use cross-tabulations to identify whether invalid cases exist or not, and to apply slightly different rules for conditional selection of invalid cases. For example, when you cross-tabulate Age and Highest Education Level, you may discover "suspicious" cases where respondents aged 12 have university education. Or: Ever fallen sick in the last week? – No, versus Action taken? – Visited a health centre! Note: You can cross-tabulate more than 2 variables at a time.

25 Multi-Case Rule A user-defined rule that can be applied to a single variable or a combination of variables in a group of cases is a Multi-Case Rule. Multi-case rules are defined by a procedure (sequence of logical expressions) that flags invalid cases. The most common and useful application of multi-case rules is checking whether there are duplicates in the data set, such as cases that have been entered more than once for a single respondent or household, or a household that has two heads, or two respondents who have the same opinion, attitude and perception about something under study.

26 Data Entering Raw data is NOT very useful for purposes of analysis. This data must be entered into appropriate computer software before actual analysis starts. Whether you have collected quantitative or qualitative data, it is important that you enter the data in a logical format that can be easily understood and analyzed. For quantitative data, you can use Microsoft Excel, Epi Info, Epi Data, Stata or SPSS (the most popular data analysis software).

27 Measurement Levels There is a need to identify the level of measurement associated with the quantitative data. The level of measurement has a lot of influence on the type of analysis you can use. There are four levels of measurement:
Nominal data: the number assigned to the data simply represents a category of object. There is no measured difference between the objects being measured, and the numbers assigned are not in any logical order. Common examples are assigning a number to Gender (Male – 1, Female – 0) or Marital status (married – 1, single – 2, divorced – 3, etc.). You are just assigning a number to something for purposes of analysis.
Ordinal data: the larger the number assigned to an object, the larger that object truly is in some sort of amount, value, importance or hierarchy. The numbers assigned have a logical order, but the differences between values are not constant or meaningful. Examples: T-shirt size (small – 1, medium – 2, large – 3), level of education (Primary – 1, Secondary – 2, University – 3).

28 Measurement Levels
Interval data: data is continuous and has a logical order; data has standardized differences between values, but no natural zero. A natural zero would mean non-existence of what is being measured. Example of interval data: Fahrenheit degrees. Zero degrees does not mean non-existence of temperature!
Ratio data: data is continuous, ordered, has standardized differences between values, and has a natural zero. Examples: height, weight, age, length, etc. Zero height, weight or age means non-existence of the quantity being measured.

29 Transformation of Data
Constructs are measured in very arbitrary ways. For example, height may be measured in feet, inches, centimeters or millimeters, while weight may be measured in kilograms, grams or pounds. These measurements can be converted from one to the other by a rule or formula. The measurement scale we use depends on a number of factors. Converting data from one scale into another is what is called Data Transformation.

30 Transformation of Data
In statistical practice there are a number of transformations that are commonly used. These include: Dichotomization Standardization Normalization Computation Aggregation

31 Transformation of Data
Dichotomization A variable that takes on only two values is a dichotomous variable. Examples: male/female, yes/no, agree/disagree, true/false, present/absent, less than/more than, lowest half/highest half, experimental group/control group are all examples of dichotomous variables. We can convert continuous measurements into a smaller number of categories by recoding the variable into two values: Dichotomization. Example 1: Convert height into below average or above average (call them 0 and 1, or 1 and 2). Example 2: Convert a Likert scale (Strongly Agree, Agree, Undecided, Disagree and Strongly Disagree) to Agree and Disagree, ignoring the Undecided responses (if their number is negligible).

32 Transformation of Data
Standardization Another useful transformation in statistics is standardization, sometimes called "converting to Z-scores" or "taking Z-scores". It has the effect of transforming the original distribution into one in which the mean becomes zero and the standard deviation becomes one. This helps you to compare two or more sets of data using a standard scale. A Z-score quantifies the original score in terms of the number of standard deviations that the score is from the mean of the distribution. The formula for converting from an original or "raw" score x to a Z-score is: z = (x − mean) / standard deviation.
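As an illustration only (the workshop itself uses SPSS), a minimal Python sketch of the Z-score transformation on a hypothetical set of exam scores:

```python
import pandas as pd

# Hypothetical raw exam scores.
scores = pd.Series([55, 62, 70, 48, 81, 66])

# Z-score: subtract the mean and divide by the standard deviation.
# The result has mean 0 and standard deviation 1.
z = (scores - scores.mean()) / scores.std()
print(z.round(2))
```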

33 Transformation of Data
Data Normalization A common requirement for parametric tests is that the population of scores from which the sample observations came should be normally distributed. Data which does not meet this requirement may therefore be normalized before subjecting it to any parametric tests. The most common normalization techniques include logarithmic, reciprocal, and square-root transformations.
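A hedged sketch of these three transformations in Python, applied to a hypothetical right-skewed variable (the variable name and values are invented; in practice you would apply whichever transformation best normalizes your own data):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable, e.g. household income.
income = pd.Series([120.0, 340.0, 560.0, 890.0, 1500.0, 9800.0])

log_income   = np.log(income)    # logarithmic transformation
recip_income = 1.0 / income      # reciprocal transformation
sqrt_income  = np.sqrt(income)   # square-root transformation

# Skewness before and after the log transform (closer to 0 is more symmetric).
print(round(income.skew(), 2), round(log_income.skew(), 2))
```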

34 Transformation of Data
Computation When data is collected, there are some variables that may be derived from the data already collected. Example 1: Age may be derived from date of birth. Example 2: Suppose 20 respondents are asked a question where the possible responses are: (1) Strongly Agree, (2) Agree, (3) Not Sure, (4) Disagree, (5) Strongly Disagree. You may compute a new variable showing the average response to the question and make generalizations about the respondents. You can also derive very complex variables depending on the kind of research you are doing (e.g. y = mx + c).
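A small Python sketch of both computations, using a hypothetical data frame (the column names, dates and the fixed reference date are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical data: date of birth and three Likert items (1-5) per respondent.
df = pd.DataFrame({
    "dob": pd.to_datetime(["1990-04-12", "1985-11-03", "2001-07-21"]),
    "q1": [5, 4, 2],
    "q2": [4, 4, 3],
    "q3": [5, 3, 2],
})

# Example 1: derive age (in whole years) from date of birth.
reference_date = pd.Timestamp("2024-01-01")   # assumed survey date
df["age"] = (reference_date - df["dob"]).dt.days // 365

# Example 2: compute the average response across the Likert items.
df["attitude_mean"] = df[["q1", "q2", "q3"]].mean(axis=1)
print(df)
```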

35 Transformation of Data
Aggregation In many cases data is collected variable by variable. Even variables that require multiple responses are still constructed as "single" variables. After collecting data that way, you may need to combine or aggregate some variables and create new ones. For example, after collecting data on household incomes, family sizes and ages, you may aggregate the data and create a new dataset showing total income, average income, or number of people per LC1 or LC3.

36 Analyzing Quantitative Data
Once you have identified your levels of measurement, you can begin using some of the quantitative data analysis procedures. There are several procedures you can use to determine what narrative your data is telling. Below are some of the common analyses: Data tabulation (frequency distributions & percent distributions) Descriptive statistics Data disaggregation Choosing Statistical tests

37 Analyzing Quantitative Data
Data tabulation The first thing you should do with your data is tabulate your results for the different variables in your data set. This process will give you a comprehensive picture of what your data looks like and assist you in identifying patterns. The best ways to do this are by constructing frequency and percent distributions. A frequency distribution is an organized tabulation of the number of individuals or scores located in each category. See the tables below showing frequency distributions for the regions and education level.
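As a sketch (the workshop demonstrates this in SPSS), frequency and percent distributions for a hypothetical 'region' variable in Python:

```python
import pandas as pd

# Hypothetical 'region' responses (only a handful shown for brevity).
region = pd.Series(["Eastern", "Northern", "Central", "Western", "Central", "Eastern"])

freq = region.value_counts()                        # frequency distribution
pct  = region.value_counts(normalize=True) * 100    # percent distribution
print(pd.DataFrame({"Frequency": freq, "Percent": pct.round(1)}))
```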

38 Analyzing Quantitative Data
REGION
Region     Frequency   Percent   Valid Percent   Cumulative Percent
Eastern    31          25.8      25.8            25.8
Northern   29          24.2      24.2            50.0
Central    33          27.5      27.5            77.5
Western    27          22.5      22.5            100.0
Total      120         100.0     100.0
Observations: What is the difference between Percent and Valid Percent? What is the use of Cumulative Percent? What do we report: Frequency, Percent, or Valid Percent?

39 Analyzing Quantitative Data
Observations: Is there a clear difference between Percent and Valid Percent?

40 Analyzing Quantitative Data
Descriptive statistics Descriptive statistics refers to calculations that are used to "describe" the data set. The most common descriptives used are:
Mean – the numerical average of scores for a particular variable
Minimum and maximum values – the highest and lowest values for a particular variable
Median – the numerical middle point, or the score that cuts the distribution in half, for a particular variable
Mode – the most common score or value for a particular variable
Standard deviation – a measure of the average spread or variation from the mean
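A minimal Python sketch of these descriptives on a hypothetical 'age' variable (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical variable: respondents' ages.
age = pd.Series([23, 35, 31, 42, 29, 35, 58, 35, 27])

print("Mean:", round(age.mean(), 2))
print("Minimum / Maximum:", age.min(), "/", age.max())
print("Median:", age.median())
print("Mode:", age.mode().tolist())        # mode() may return more than one value
print("Standard deviation:", round(age.std(), 2))
```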

41 Analyzing Quantitative Data

42 Analyzing Quantitative Data
Statements measuring needs assessment (Likert scale: 5 - Strongly Agree, 4 - Agree, 3 - Not Sure, 2 - Disagree, 1 - Strongly Disagree)
Statement: Farmers attend coffee nursery planning meetings
SA: 62.5%   A: 21.4%   N: 3.6%   D: 7.1%   SD: 5.4%   Mean: 4.29   S.D.: 1.17
Teaser: Suppose the mean was the same for 2 different statements but the S.D. was different, e.g. S.D.1 = 1.17 and S.D.2 = 2.62. What would be the interpretation?

43 Analyzing Quantitative Data
Descriptive statistics… We can also generate more descriptives, such as the following:
Range – the difference between the highest and lowest values.
Quartiles – the values that divide a list of numbers into quarters.
Skewness – a measure of symmetry, or more precisely, the lack of symmetry. The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values indicate data that are skewed left and positive values indicate data that are skewed right.
Kurtosis – a measure of whether the data are peaked or flat relative to a normal distribution. The kurtosis for a normal distribution is three (an excess kurtosis of zero). Kurtosis above that of the normal distribution indicates a "peaked" distribution, and kurtosis below it indicates a "flat" distribution.
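Continuing the same hypothetical 'age' series, a short sketch of these additional descriptives in Python (note that pandas reports excess kurtosis, so a normal distribution gives 0 rather than 3):

```python
import pandas as pd

age = pd.Series([23, 35, 31, 42, 29, 35, 58, 35, 27])

print("Range:", age.max() - age.min())
print("Quartiles (Q1, Q2, Q3):", age.quantile([0.25, 0.50, 0.75]).tolist())
print("Skewness:", round(age.skew(), 2))
print("Excess kurtosis:", round(age.kurt(), 2))  # 0 for a normal distribution
```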

44 Analyzing Quantitative Data

45 Analyzing Quantitative Data
Disaggregation of data After tabulating the data, you can continue to explore the data by disaggregating it across different variables. The 2-way or 3-way Crosstabs allows you to disaggregate the data across multiple variables.    Using data from our example, let’s explore the participant demographics (gender and education level) for each region. By looking at the table below, you can clearly see the demographic makeup of the respondents.  
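A sketch of the same idea in Python using a hypothetical data frame; here `pd.crosstab` plays the role of the SPSS Crosstabs procedure:

```python
import pandas as pd

# Hypothetical respondent-level data.
df = pd.DataFrame({
    "region": ["Eastern", "Northern", "Central", "Western", "Central", "Eastern"],
    "gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
})

# Two-way cross-tabulation: gender disaggregated by region, with row/column totals.
print(pd.crosstab(df["region"], df["gender"], margins=True))
```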

46

47 Analyzing Quantitative Data

48 Choosing Statistical Tests
Statistical tests mainly focus on three aspects: Associations (relationships), Predictions (forecasting), and Differences between groups.

49 Choosing Statistical Tests
Associations: Pearson's correlation The Pearson product-moment correlation is a measure of the strength and direction of the association that exists between two variables measured on at least an interval scale. Examples: Is there an association between exam performance and time spent revising? Is there an association between family size and family savings? Pearson's correlation coefficient ranges between -1 and +1: +1 shows a perfect positive association, -1 a perfect negative association, and 0 (zero) no association.
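As a hedged sketch (not the workshop's SPSS steps), Pearson's correlation on hypothetical revision-time and exam-mark data in Python:

```python
from scipy import stats

# Hypothetical paired observations: hours of revision vs. exam mark (0-100).
revision_hours = [2, 5, 1, 8, 4, 7, 3, 6]
exam_marks     = [52, 68, 45, 88, 61, 80, 55, 72]

r, p_value = stats.pearsonr(revision_hours, exam_marks)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")
```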

50 Choosing Statistical Tests
Associations: Assumptions for Pearson's correlation Assumption #1: Your two variables should be measured at the interval or ratio level (i.e., they are continuous). Examples: revision time (measured in hours), age (measured in years), exam performance (measured in marks from 0 to 100), weight (measured in kg), etc. Assumption #2: There needs to be a linear relationship between the two variables. You can plot the dependent variable against your independent variable on a scatterplot and visually inspect it for linearity.

51 Choosing Statistical Tests
Examples of Scatterplots

52 Choosing Statistical Tests
Assumption #3: There should be no significant outliers. Outliers are simply single data points within your data that do not follow the usual pattern. Pearson's r is sensitive to outliers, which can have a very large effect on the line of best fit and the Pearson correlation coefficient, leading to misleading conclusions regarding your data.

53 Choosing Statistical Tests
Assumption #4: Your variables should be approximately normally distributed. In order to assess the statistical significance of the Pearson correlation, you need to have bivariate normality, but this assumption is difficult to assess, so a simpler method is more commonly used: checking each variable with the Shapiro-Wilk test of normality, which is easily run in SPSS.

54 Choosing Statistical Tests
Associations continued… Spearman's Rank-Order Correlation Spearman's rank-order correlation is the nonparametric version of the Pearson product-moment correlation. It measures the strength of association between two ranked variables. Assumptions: The two variables must be measured on an ordinal, interval or ratio scale. Interval or ratio data are used here where the assumptions for Pearson's correlation have been violated.

55 Choosing Statistical Tests
Associations continued… Chi-Square Test for Association The chi-square test for independence, also called Pearson's chi-square test or the chi-square test of association, is used to discover if there is a relationship between two categorical variables. Examples: Is there an association between Region and Political affiliation? Is there an association between Gender and Type of learning (on-line, books or face-to-face)? Assumption #1: Your two variables should be measured at an ordinal or nominal level (i.e., categorical data). Assumption #2: Your two variables should each consist of two or more categorical, independent groups. Example: Gender (2 groups: Males and Females).
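A minimal Python sketch of the chi-square test of association on an invented contingency table (the counts are hypothetical):

```python
from scipy import stats

# Hypothetical contingency table: rows = gender, columns = type of learning
# (On-line, Books, Face-to-face).
observed = [
    [30, 15, 25],   # Male
    [20, 25, 35],   # Female
]

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
```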

56 Choosing Statistical Tests
Predicting scores Linear Regression Analysis Linear regression is the next step up after correlation. It is used when we want to predict the value of a variable based on the value of another variable. The variable we want to predict is called the dependent variable (or sometimes, the outcome variable). The variable we are using to predict the other variable's value is called the independent variable (predictor or explanatory variable). Examples: exam performance can be predicted based on revision time; a family's savings can be predicted based on the family size. Simple linear regression is when there is only one independent variable (IV) and one dependent variable (DV).
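A short sketch of simple linear regression in Python on hypothetical family-size and savings data (the numbers are invented; the workshop itself runs this in SPSS):

```python
from scipy import stats

# Hypothetical data: family size (independent variable) and monthly savings (dependent).
family_size = [2, 3, 4, 5, 6, 7, 8]
savings     = [500, 460, 400, 380, 300, 260, 200]

result = stats.linregress(family_size, savings)
print(f"savings = {result.slope:.1f} * family_size + {result.intercept:.1f}")
print(f"R-squared = {result.rvalue ** 2:.2f}, p = {result.pvalue:.4f}")
```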

57 Choosing Statistical Tests
Assumptions Assumption #1: Your two variables should be measured at the interval or ratio level (i.e., they are continuous). Assumption #2: There needs to be a linear relationship between the two variables. Assumption #3: There should be no significant outliers. Outliers are simply single data points within your data that do not follow the usual pattern. Assumption #4: You should have independence of observations, which you can easily check using the Durbin-Watson statistic, which is a simple test to run using SPSS. Assumption #5: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.

58 Choosing Statistical Tests
Homoscedasticity versus Heteroscedasticity

59 Choosing Statistical Tests
Predicting scores continues… Multiple Regression Analysis Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. Example; Use multiple regression to understand whether exam performance can be predicted based on revision time, exam anxiety, lecture attendance, and gender. Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, exam anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.
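A hedged multiple-regression sketch in Python with statsmodels, using invented exam data to show the overall fit and each predictor's contribution:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: predict exam performance from revision time and exam anxiety.
df = pd.DataFrame({
    "exam_mark":     [52, 68, 45, 88, 61, 80, 55, 72],
    "revision_time": [2, 5, 1, 8, 4, 7, 3, 6],
    "exam_anxiety":  [70, 40, 85, 20, 55, 30, 65, 45],
})

X = sm.add_constant(df[["revision_time", "exam_anxiety"]])   # add the intercept term
model = sm.OLS(df["exam_mark"], X).fit()
print(model.summary())   # overall R-squared plus a coefficient for each predictor
```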

60 Choosing Statistical Tests
Assumptions Assumption #1: Your dependent variable should be measured on a continuous scale . Assumption #2: You have two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e. ordinal or nominal variable). Assumption #3: You should have independence of observations. Assumption #4: There needs to be a linear relationship between (a) the dependent variable and each of your independent variables, and (b) the dependent variable and the independent variables collectively. Assumption #5: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.

61 Choosing Statistical Tests
Assumption #6: Your data must not show multi-collinearity, which occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model.

62 Choosing Statistical Tests
Ordinal logistic regression Ordinal logistic regression (often just called 'ordinal regression') is used to predict an ordinal dependent variable given one or more independent variables. Examples: Ordinal regression can be used to predict the belief that "tax is too high" (your ordinal dependent variable, measured on a 4-point Likert item from "Strongly Disagree" to "Strongly Agree"), based on two independent variables: "age" and "income". Ordinal regression can also be used to determine whether a number of independent variables, such as "age", "gender" and "level of physical activity" (amongst others), predict the ordinal dependent variable "obesity", where obesity is measured using three ordered categories: "normal", "overweight" and "obese".

63 Choosing Statistical Tests
Assumptions for ordinal regression Assumption #1: Your dependent variable should be measured at the ordinal level. Assumption #2: One or more independent variables is/are continuous, ordinal or categorical (including dichotomous variables). Assumption #3: There is no multi-collinearity

64 Choosing Statistical Tests
Differences between groups Independent-samples t-test The independent-samples t-test compares the means between two unrelated groups on the same continuous, dependent variable. Examples: Use an independent t-test to understand whether fresh graduate salaries differ based on gender (i.e., your dependent variable would be "fresh graduate salaries" and your independent variable would be "gender", which has two groups: "Male" and "Female"). Alternatively, use an independent t-test to understand whether there is a difference in food production based on type of fertilizer (i.e., your dependent variable would be "food production" and your independent variable would be "type of fertilizer", which has two groups: "organic" and "inorganic").
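A minimal Python sketch of the independent-samples t-test on hypothetical salary data for the two gender groups:

```python
from scipy import stats

# Hypothetical fresh-graduate salaries for two independent groups.
male_salaries   = [850, 920, 780, 990, 870, 910]
female_salaries = [800, 760, 890, 820, 780, 840]

# Pass equal_var=False for Welch's t-test if the variances are unequal.
t_stat, p_value = stats.ttest_ind(male_salaries, female_salaries)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```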

65 Choosing Statistical Tests
Assumptions Assumption #1: Your dependent variable should be measured at the interval or ratio level (i.e., they are continuous). Assumption #2: Your independent variable should consist of two categorical, independent groups. Assumption #3: You should have independence of observations. Assumption #4: There should be no significant outliers. Assumption #5: Your dependent variable should be approximately normally distributed for each category of the independent variable. Assumption #6: There needs to be homogeneity of variances.

66 Choosing Statistical Tests
One-way ANOVA The one-way analysis of variance (ANOVA) is used to determine whether there are any significant differences between the means of three or more independent (unrelated) groups. Example: Use a one-way ANOVA to understand whether exam performance differed based on exam anxiety levels amongst students, dividing students into three independent anxiety groups (e.g., low, medium and high). It is important to realize that the one-way ANOVA cannot tell you which specific groups were significantly different from each other; it only tells you that at least two groups were different. Identifying which groups differ is done by using the ANOVA with a post-hoc test.
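A sketch of the one-way ANOVA in Python on hypothetical exam marks for three anxiety groups (a post-hoc test, e.g. Tukey's HSD, would still be needed to say which groups differ):

```python
from scipy import stats

# Hypothetical exam marks for three independent anxiety groups.
low_anxiety    = [78, 85, 80, 90, 82]
medium_anxiety = [70, 72, 68, 75, 71]
high_anxiety   = [60, 58, 65, 55, 62]

f_stat, p_value = stats.f_oneway(low_anxiety, medium_anxiety, high_anxiety)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```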

67 Choosing Statistical Tests
Assumptions Assumption #1: Your dependent variable should be measured at the interval or ratio level. Assumption #2: Your independent variable should consist of more than two categorical, independent groups. Assumption #3: You should have independence of observations Assumption #4: There should be no significant outliers . Assumption #5: Your dependent variable should be approximately normally distributed for each category of the independent variable. Assumption #6: There needs to be homogeneity of variances.

68 Choosing Statistical Tests
Two-way ANOVA The two-way ANOVA compares the mean differences between groups that have been split on two independent variables (called factors). The primary purpose of a two-way ANOVA is to understand if there is an interaction between the two independent variables on the dependent variable. Example: Use a two-way ANOVA to understand whether there is an interaction between gender and education level on exam anxiety amongst university students, where gender (males/females) and education level (undergraduate/postgraduate) are your independent variables, and exam anxiety is your dependent variable.

69 Choosing Statistical Tests
The two-way ANOVA cannot tell you which specific groups were significantly different from each other (e.g., it cannot tell you whether postgraduate males had greater exam anxiety levels than postgraduate females); it only tells you that at least two groups were different. Since you may have three, four, five or more groups in your study design, as well as two independent variables, determining which of these groups differ from each other is important. You can do this using a post-hoc test. Therefore, where statistically significant interactions are found, you need to determine whether there are any "simple main effects", and if there are, what these effects are.

70 Choosing Statistical Tests
Assumption #1: Your dependent variable should be measured at the interval or ratio level (i.e., they are continuous). Assumption #2: Your two independent variables should each consist of two or more categorical, independent groups. Assumption #3: You should have independence of observations, which means that there is no relationship between the observations in each group or between the groups themselves. Assumption #4: There should be no significant outliers. Assumption #5: Your dependent variable should be approximately normally distributed for each combination of the categories of the two independent variables. Assumption #6: There needs to be homogeneity of variances for each combination of the categories of the two independent variables.

71 Choosing Statistical Tests
One-way MANOVA The one-way multivariate analysis of variance (one-way MANOVA) is used to determine whether there are any differences between independent groups on more than one continuous dependent variable. Example: Use a one-way MANOVA to understand whether there were differences in students' short-term and long-term recall of facts based on four different lengths of lecture (i.e., the two dependent variables are "short-term memory recall" and "long-term memory recall", while the independent variable is "lecture duration", which has four independent groups: "30 minutes", "60 minutes", "90 minutes" and "120 minutes").

72 Choosing Statistical Tests
Mann-Whitney U test The Mann-Whitney U test is used to compare differences between two independent groups when the dependent variable is either ordinal or continuous, but not normally distributed. Example; Use the Mann-Whitney U test to understand whether attitudes towards pay discrimination, where attitudes are measured on an ordinal scale, differ based on gender (i.e., your dependent variable would be "attitudes towards pay discrimination" and your independent variable would be "gender", which has two groups: "male" and "female").
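A minimal Mann-Whitney U sketch in Python on hypothetical ordinal attitude scores for the two groups:

```python
from scipy import stats

# Hypothetical ordinal attitude scores (1 = strongly disagree ... 5 = strongly agree).
male_attitudes   = [2, 3, 1, 2, 4, 2, 3]
female_attitudes = [4, 5, 3, 4, 5, 4, 3]

u_stat, p_value = stats.mannwhitneyu(male_attitudes, female_attitudes,
                                     alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```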

73 Choosing Statistical Tests
Kruskal-Wallis H test The Kruskal-Wallis test is the nonparametric equivalent of the one-way ANOVA, and an extension of the Mann-Whitney U test to allow the comparison of more than two independent groups. It is used when we wish to compare three or more sets of scores that come from different groups. Example: Use a Kruskal-Wallis test to understand whether exam performance differed based on exam anxiety levels amongst students, dividing students into three independent groups (e.g., low, medium and high).

74 Choosing Statistical Tests
Dependent T-Test (Paired-samples t-test) The dependent t-test (also called the Paired-Samples T Test) compares the means between two related groups on the same continuous, dependent variable. Example: Use a dependent t-test to understand whether there was a difference in smokers' daily cigarette consumption before and after a 6-week anti-smoking programme (i.e., your dependent variable would be "daily cigarette consumption", and your two related groups would be the cigarette consumption values "before" and "after" the anti-smoking programme).
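A short paired-samples t-test sketch in Python with invented before/after cigarette counts (the same smokers measured twice, in the same order):

```python
from scipy import stats

# Hypothetical daily cigarette counts before and after a 6-week programme.
before = [20, 15, 30, 25, 18, 22]
after  = [12, 10, 25, 20, 15, 14]

t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```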

75 Choosing Statistical Tests
Assumptions Assumption #1: Your dependent variable should be measured at the interval or ratio level. Assumption #2: Your independent variable should consist of two categorical, "related groups" or "matched pairs". Assumption #3: There should be no significant outliers in the differences between the two related groups. Assumption #4: The distribution of the differences in the dependent variable between the two related groups should be approximately normally distributed.

76 Choosing Statistical Tests
ANCOVA The ANCOVA (analysis of covariance) can be thought of as an extension of the one-way ANOVA to incorporate a "covariate". Like the one-way ANOVA, the ANCOVA is used to determine whether there are any significant differences between the means of two or more independent (unrelated) groups. However, the ANCOVA has the additional benefit of allowing you to "statistically control" for a third variable (sometimes known as a "confounding variable"), which may be negatively affecting your results. This third variable that could be confounding your results is the "covariate" that you include in an ANCOVA.

77 Statistical Data Analysis
Moses Mugolo Kasolo / Global Information Systems Ltd: Software Development, Data Analysis, Specialized ICT Training

