2 Shapes of Distribution
A third important property of data, after location and dispersion, is its shape.
Distributions of quantitative variables can be described in terms of a number of features, many of which are related to the distributions’ physical appearance or shape when presented graphically:
- Modality
- Symmetry and skewness
- Degree of skewness
- Kurtosis
3 Modality
The modality of a distribution concerns how many peaks or high points there are.
A distribution with a single peak, that is, one value with a high frequency, is a unimodal distribution.
4 Modality
A distribution with two or more peaks is called a multimodal distribution.
5 Symmetry and Skewness
A distribution is symmetric if it could be split down the middle to form two halves that are mirror images of one another.
In asymmetric distributions, the peaks are off center, with a bulk of scores clustering at one end and a tail trailing off at the other end. Such distributions are often described as skewed.
- When the longer tail trails off to the right, the distribution is positively skewed. E.g. annual income.
- When the longer tail trails off to the left, the distribution is negatively skewed. E.g. age at death.
6 Symmetry and Skewness
Shape can be described by degree of asymmetry (i.e., skewness).
- mean > median: positive or right-skewness
- mean = median: symmetric or zero-skewness
- mean < median: negative or left-skewness
Positive skewness can arise when the mean is increased by some unusually high values.
Negative skewness can arise when the mean is decreased by some unusually low values.
8 Shapes of the Distribution
Three common shapes of frequency distributions:
A. Symmetrical and bell shaped
B. Positively skewed or skewed to the right
C. Negatively skewed or skewed to the left
25 March 2017
9 Shapes of the Distribution
Three less common shapes of frequency distributions:
A. Bimodal
B. Reverse J-shaped
C. Uniform
10 Example: number of hours to complete a task
[Figure: distribution of hours to complete a task, with one extreme value annotated "This guy took a VERY long time!"]
11 Degree of Skewness
A skewness index can readily be calculated by most statistical computer programs in conjunction with frequency distributions.
The index has a value of 0 for a perfectly symmetric distribution, a positive value if there is a positive skew, and a negative value if there is a negative skew.
A skewness index that is more than twice the value of its standard error can be interpreted as a departure from symmetry.
12 Measures of Skewness or Symmetry
Pearson’s skewness coefficient
- It is nonalgebraic and easily calculated. It is also useful for quick estimates of symmetry.
- It is defined as: skewness = (mean − median) / SD
Fisher’s measure of skewness
- It is based on deviations from the mean to the third power.
13 Pearson’s skewness coefficient
For a perfectly symmetrical distribution, the mean will equal the median, and the skewness coefficient will be zero. If the distribution is positively skewed, the mean will be greater than the median and the coefficient will be positive. If the coefficient is negative, the distribution is negatively skewed and the mean is less than the median.
Skewness values will fall between -1 and +1 SD units. Values falling outside this range indicate a substantially skewed distribution.
Hildebrand (1986) states that skewness values above 0.2 or below -0.2 indicate severe skewness.
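A minimal Python sketch of Pearson’s skewness coefficient as defined above, (mean − median) / SD. The data values and the function name pearson_skewness are hypothetical, and the use of the sample SD (n − 1 denominator) is an assumption, since the slides do not say which SD is meant.

```python
import statistics


def pearson_skewness(values):
    """Pearson's skewness coefficient: (mean - median) / SD."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1 denominator)
    return (mean - median) / sd


# Hypothetical hours needed to complete a task; one person was very slow.
hours = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]
coefficient = pearson_skewness(hours)

# The mean (6.6) exceeds the median (4), so the coefficient is positive.
print(round(coefficient, 3))

# A perfectly symmetric set of values gives a coefficient of zero.
symmetric_coefficient = pearson_skewness([1, 2, 3, 4, 5])
print(symmetric_coefficient)
```

The single large value pulls the mean above the median, which is the mechanism for positive skewness described on the earlier slide.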
14 Assumption of Normality
Many of the statistical methods that we will apply require the assumption that a variable or variables are normally distributed.
With multivariate statistics, the assumption is that the combination of variables follows a multivariate normal distribution.
Since there is not a direct test for multivariate normality, we generally test each variable individually and assume that they are multivariate normal if they are individually normal, though this is not necessarily the case.
15 Evaluating normality
There are both graphical and statistical methods for evaluating normality.
- Graphical methods include the histogram and normality plot.
- Statistical methods include diagnostic hypothesis tests for normality, and a rule of thumb that says a variable is reasonably close to normal if its skewness and kurtosis have values between –1.0 and +1.0.
None of the methods is absolutely definitive.
16 Transformations
When a variable is not normally distributed, we can create a transformed variable and test it for normality. If the transformed variable is normally distributed, we can substitute it in our analysis.
Three common transformations are: the logarithmic transformation, the square root transformation, and the inverse transformation.
All of these change the measuring scale on the horizontal axis of a histogram to produce a transformed variable that is mathematically equivalent to the original variable.
17 Types of Data Transformations
- For moderate skewness, use a square root transformation.
- For substantial skewness, use a log transformation.
- For severe skewness, use an inverse transformation.
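The three transformations can be sketched in Python as follows. This is an illustration only: the function name transform is hypothetical, the base-10 logarithm is an assumption (the slides do not specify a base), and the sketch assumes strictly positive values, since the logarithm and the inverse are undefined at zero.

```python
import math


def transform(values, kind):
    """Apply a square root, log (base 10), or inverse transformation."""
    if kind == "sqrt":      # suggested for moderate skewness
        return [math.sqrt(v) for v in values]
    if kind == "log":       # suggested for substantial skewness
        return [math.log10(v) for v in values]
    if kind == "inverse":   # suggested for severe skewness
        return [1 / v for v in values]
    raise ValueError(f"unknown transformation: {kind}")


# Hypothetical positively skewed values.
raw = [1, 4, 9, 100]
print(transform(raw, "sqrt"))            # the large value is pulled in
print(transform([1, 10, 100], "log"))
print(transform([1, 2, 4], "inverse"))   # note: the inverse reverses the order
```

Each transformed variable would then be re-tested for normality before it is substituted into the analysis.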
18 Computing “Explore” descriptive statistics
To compute the statistics needed for evaluating the normality of a variable, select the Explore… command from the Descriptive Statistics menu.
19 Adding the variable to be evaluated
First, click on the variable to be included in the analysis to highlight it.
Second, click on the right arrow button to move the highlighted variable to the Dependent List.
20 Selecting statistics to be computed
To select the statistics for the output, click on the Statistics… command button.
21 Including descriptive statistics
First, click on the Descriptives checkbox to select it. Clear the other checkboxes.
Second, click on the Continue button to complete the request for statistics.
22 Selecting charts for the output
To select the diagnostic charts for the output, click on the Plots… command button.
23 Including diagnostic plots and statistics
First, click on the None option button on the Boxplots panel, since boxplots are not as helpful as other charts in assessing normality.
Second, click on the Normality plots with tests checkbox to include normality plots and the hypothesis tests for normality.
Third, click on the Histogram checkbox to include a histogram in the output. You may want to examine the stem-and-leaf plot as well, though I find it less useful.
Finally, click on the Continue button to complete the request.
24 Completing the specifications for the analysis
Click on the OK button to complete the specifications for the analysis and request SPSS to produce the output.
25 The histogram
An initial impression of the normality of the distribution can be gained by examining the histogram.
In this example, the histogram shows a substantial violation of normality caused by an extremely large value in the distribution.
26 The normality plot
The problem with the normality of this variable’s distribution is reinforced by the normality plot.
If the variable were normally distributed, the red dots would fit the green line very closely. In this case, the red points in the upper right of the chart indicate the severe skewing caused by the extremely large data values.
27 The test of normality
Problem 1 asks about the results of the test of normality. Since the sample size is larger than 50, we use the Kolmogorov-Smirnov test. If the sample size were 50 or less, we would use the Shapiro-Wilk statistic instead.
The null hypothesis for the test of normality states that the actual distribution of the variable is equal to the expected distribution, i.e., the variable is normally distributed. Since the probability associated with the test of normality (<0.001) is less than or equal to the level of significance (0.01), we reject the null hypothesis and conclude that total hours spent on the Internet is not normally distributed. (Note: we report the probability as <0.001 instead of .000 to be clear that the probability is not really zero.)
The answer to problem 1 is false.
28 The assumption of normality script
An SPSS script to produce all of the output that we have produced manually is available on the course web site.
After downloading the script, run it to test the assumption of normality.
Select Run Script… from the Utilities menu.
29 Selecting the assumption of normality script
First, navigate to the folder containing your scripts and highlight the NormalityAssumptionAndTransformations.SBS script.
Second, click on the Run button to activate the script.
30 Specifications for normality script
First, move variables from the list of variables in the data set to the Variables to Test list box.
Second, choose the transformations. The default output is to do all of the transformations of the variable. To exclude some transformations from the calculations, clear the checkboxes.
Third, click on the OK button to run the script.
31 The test of normality
The script produces the same output that we computed manually; in this example, the tests of normality.
32 When transformations do not work
When none of the transformations induces normality in a variable, including that variable in the analysis will reduce our effectiveness at identifying statistical relationships, i.e. we lose power.
We do have the option of changing the way the information in the variable is represented, e.g. substituting several dichotomous variables for a single metric variable.
33 Fisher’s Measure of Skewness
The formula for Fisher’s skewness statistic is based on deviations from the mean to the third power.
The measure of skewness can be interpreted in terms of the normal curve:
- A symmetrical curve will result in a value of 0.
- If the skewness value is positive, then the curve is skewed to the right, and vice versa for a distribution skewed to the left.
- A z-score is calculated by dividing the measure of skewness by the standard error for skewness. Values above +1.96 or below −1.96 are significant at the 0.05 level because 95% of the scores in a normal distribution fall between −1.96 and +1.96 standard deviations from the mean.
E.g. if Fisher’s skewness = 0.195 and st. err. = 0.197, the z-score = 0.195/0.197 = 0.99.
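The z-score calculation can be sketched in Python. The slides do not give the exact formulas, so two assumptions are made here: fisher_skewness uses the adjusted third-moment statistic that statistical packages commonly report, and skewness_standard_error uses the usual sample-size-only formula for its standard error. All function names are hypothetical.

```python
import math
import statistics


def fisher_skewness(values):
    """Skewness from cubed standardized deviations (assumed adjusted form)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    cubed = sum(((v - mean) / sd) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed


def skewness_standard_error(n):
    """Assumed standard error of skewness; depends only on sample size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def z_score(statistic, standard_error):
    """Divide the statistic by its standard error."""
    return statistic / standard_error


# The numbers from the slide: skewness 0.195, standard error 0.197.
z = z_score(0.195, 0.197)
print(round(z, 2))       # 0.99
print(abs(z) > 1.96)     # False: not significant at the 0.05 level
```

The same z_score helper applies to kurtosis, using the kurtosis statistic and its own standard error.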
34 Kurtosis
The distribution’s kurtosis concerns how pointed or flat its peak is.
Two types:
- Leptokurtic distribution (meaning thin, i.e. pointed).
- Platykurtic distribution (meaning flat).
35 Kurtosis
There is a statistical index of kurtosis that can be computed when computer programs are instructed to produce a frequency distribution.
For the kurtosis index, a value of zero indicates a shape that is neither flat nor pointed.
Positive values of the kurtosis statistic indicate greater peakedness, and negative values indicate greater flatness.
36 Fisher’s measure of Kurtosis
Fisher’s measure is based on deviations from the mean to the fourth power.
A z-score is calculated by dividing the measure of kurtosis by the standard error for kurtosis.
37 Table of descriptive statistics
To answer problem 2, we look at the values for skewness and kurtosis in the Descriptives table.
The skewness and kurtosis for the variable both exceed the rule of thumb criteria of ±1.0. The variable is not normally distributed.
The answer to problem 2 is false.
38 Other problems on assumption of normality
A problem may ask about the assumption of normality for a nominal level variable. The answer will be “An inappropriate application of a statistic” since there is no expectation that a nominal variable be normal.
A problem may ask about the assumption of normality for an ordinal level variable. If the variable or transformed variable is normal, the correct answer to the question is “True with caution” since we may be required to defend treating an ordinal variable as metric.
Questions will specify a level of significance to use and the statistical evidence upon which you should base your answer.
39 Normal Distribution
Also called the bell-shaped curve, normal curve, or Gaussian distribution.
A normal distribution is one that is unimodal, symmetric, and not too peaked or flat.
Given its name by the Belgian mathematician Quetelet who, in the early 19th century, noted that many human attributes, e.g. height, weight, intelligence, appeared to be distributed normally.
40 Normal Distribution
The normal curve is unimodal and symmetric about its mean (μ).
In this distribution the mean, median and mode are all identical.
The standard deviation (σ) specifies the amount of dispersion around the mean.
The two parameters μ and σ completely define a normal curve.
41 Normal Distribution
The normal curve is a probability density function. The probability is interpreted as "area under the curve."
- The random variable takes on an infinite number of values within a given interval.
- The probability that X equals any particular value is 0. Consequently, we talk about intervals. The probability is equal to the area under the curve over the interval.
- The area under the whole curve = 1.
43 Normal Distribution
f(X) = [1 / (σ √(2π))] · e^(−(X − μ)² / (2σ²))
- X is the random variable.
- μ is the mean value.
- σ is the standard deviation (std) value.
- e ≈ 2.71828 is a constant.
- π ≈ 3.14159 is a constant.
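As a sketch, the normal density with mean μ and standard deviation σ, f(X) = [1 / (σ √(2π))] · e^(−(X − μ)² / (2σ²)), can be written directly in Python. The function name normal_pdf is hypothetical; NormalDist from the standard library is used only as a cross-check.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and SD sigma."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Height of the standard normal curve (mu = 0, sigma = 1) at its mean.
peak = normal_pdf(0, 0, 1)
print(round(peak, 4))  # 0.3989

# Cross-check against the standard library for an arbitrary curve.
difference = abs(normal_pdf(1.3, 0.5, 2) - NormalDist(0.5, 2).pdf(1.3))
print(difference < 1e-12)  # True
```

Note that the function returns a density (curve height), not a probability; probabilities come from areas under the curve.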
44 Importance of Normal Distribution to Statistics
Although most distributions are not exactly normal, most variables tend to have an approximately normal distribution.
Many inferential statistics assume that the populations are distributed normally.
The normal curve is a probability distribution and is used to answer questions about the likelihood of getting various particular outcomes when sampling from a population.
45 Normal Distribution
Probabilities are obtained by getting the area under the curve inside of a particular interval. The area under the curve = the proportion of times under identical (repeated) conditions that a particular range of values will occur.
Characteristics of the Normal distribution:
- It is symmetric about the mean μ.
- Mean = median = mode. [“bell-shaped” curve]
- f(X) decreases as X gets farther and farther away from the mean. It approaches the horizontal axis asymptotically: −∞ < X < +∞. This means that there is always some probability (area) for extreme values.
46 Why Do We Like The Normal Distribution So Much?
There is nothing “special” about standard normal scores:
- These can be computed for observations from any sample/population of continuous data values.
- The score measures how far an observation is from its mean in standard units of statistical distance.
But, if the distribution is not normal, we may not be able to use the Z-score approach.
47 Probability Distributions
Any characteristic that can be measured or categorized is called a variable.
If the variable can assume a number of different values such that any particular outcome is determined by chance, it is called a random variable.
Every random variable has a corresponding probability distribution.
The probability distribution applies the theory of probability to describe the behavior of the random variable.
48 Discrete Probability Distributions
Binomial distribution: each trial can only assume 1 of 2 possible outcomes. There are a fixed number of trials and the results of the trials are independent.
- E.g. flipping a coin and counting the number of heads in 10 trials.
Poisson distribution: the random variable can assume any integer value between 0 and infinity.
- Counts usually follow a Poisson distribution (e.g. the number of ambulances needed in a city in a given night).
49 Discrete Random Variable
A discrete random variable X has a finite number of possible values. The probability distribution of X lists the values and their probabilities.

Value of X     x1   x2   x3   …   xk
Probability    p1   p2   p3   …   pk

- Every probability pi is a number between 0 and 1.
- The sum of the probabilities must be 1.
- Find the probability of any event by adding the probabilities of the particular values that make up the event.
50 Example
The instructor in a large class gives 15% each of A’s and D’s, 30% each of B’s and C’s, and 10% F’s. The student’s grade on a 4-point scale is a random variable X (A=4).

Grade          F=0    D=1    C=2    B=3    A=4
Probability    0.10   0.15   0.30   0.30   0.15

What is the probability that a student selected at random will have a B or better?
ANSWER: P(grade of 3 or 4) = P(X=3) + P(X=4) = 0.30 + 0.15 = 0.45
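The grade example can be checked with a short Python sketch. The dictionary mirrors the grade distribution stated in the example (10% F, 15% D, 30% C, 30% B, 15% A); the helper name probability_of is hypothetical.

```python
# Probability distribution of the grade X on a 4-point scale (F=0 ... A=4).
grade_probabilities = {0: 0.10, 1: 0.15, 2: 0.30, 3: 0.30, 4: 0.15}

# Every probability lies between 0 and 1, and together they sum to 1.
all_valid = all(0 <= p <= 1 for p in grade_probabilities.values())
total = sum(grade_probabilities.values())


def probability_of(event, distribution):
    """Add the probabilities of the values that make up the event."""
    return sum(p for value, p in distribution.items() if event(value))


# P(B or better) = P(X = 3) + P(X = 4)
b_or_better = probability_of(lambda grade: grade >= 3, grade_probabilities)
print(round(b_or_better, 2))  # 0.45
```

Passing a different condition, such as `lambda grade: grade <= 1`, gives the probability of a D or an F in the same way.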
51 Continuous Probability Distributions
When it follows a Binomial or a Poisson distribution, the variable is restricted to taking on integer values only.
Between two values of a continuous random variable we can always find a third.
A histogram is used to represent a discrete probability distribution, and a smooth curve called the probability density is used to represent a continuous probability distribution.
52 Normal Distribution
Is every variable normally distributed? Absolutely not.
Then why do we spend so much time studying the normal distribution?
Some variables are normally distributed; a bigger reason is the “Central Limit Theorem”!
53 Central Limit Theorem
The Central Limit Theorem describes the characteristics of the "population of the means", which has been created from the means of an infinite number of random samples of size N, all of them drawn from a given "parent population".
It predicts that, regardless of the distribution of the parent population:
- The mean of the population of means is always equal to the mean of the parent population from which the samples were drawn.
- The standard deviation of the population of means is always equal to the standard deviation of the parent population divided by the square root of the sample size (N).
- The distribution of means will increasingly approximate a normal distribution as the size N of samples increases.
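A small simulation can illustrate the first two predictions. This is a sketch under stated assumptions: the parent population is exponential with mean 1 and standard deviation 1 (chosen only because it is strongly skewed), and a finite number of samples (5,000) stands in for the infinite "population of means". The names SAMPLE_SIZE and NUM_SAMPLES are illustrative.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30     # N, the size of each sample
NUM_SAMPLES = 5000   # finite stand-in for "an infinite number" of samples

# Parent population: exponential with mean 1 and SD 1 (positively skewed).
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = 1.0 / math.sqrt(SAMPLE_SIZE)  # parent SD / sqrt(N)

print(f"mean of sample means: {mean_of_means:.3f} (parent mean is 1.0)")
print(f"SD of sample means:   {sd_of_means:.3f} (predicted {predicted_sd:.3f})")
```

A histogram of sample_means would look far more symmetric than the parent distribution, and increasing SAMPLE_SIZE brings it closer to a bell shape.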
54 Central Limit Theorem
A consequence of the Central Limit Theorem is that if we average measurements of a particular quantity, the distribution of our average tends toward a normal one.
In addition, if a measured variable is actually a combination of several other uncorrelated variables, all of them "contaminated" with a random error of any distribution, our measurements tend to be contaminated with a random error that is normally distributed as the number of these variables increases.
Thus, the Central Limit Theorem explains the ubiquity of the famous bell-shaped "Normal distribution" (or "Gaussian distribution") in the measurements domain.
55 Normal Distribution
Note that the normal distribution is defined by two parameters, μ and σ. You can draw a normal distribution for any μ and σ combination. There is one normal distribution, Z, that is special. It has a μ = 0 and a σ = 1. This is the Z distribution, also called the standard normal distribution. It is one of trillions of normal distributions we could have selected.
56 Standard Normal Variable
It is customary to call a standard normal random variable Z.
The outcomes of the random variable Z are denoted by z.
The table in the coming slide gives the area under the curve (probabilities) between the mean and z.
The probabilities in the table refer to the likelihood that a randomly selected value Z is equal to or less than a given value of z and greater than 0 (the mean of the standard normal).
57 Normal Distribution
[Table: areas under the standard normal curve between the mean and z.]
Source: Levine et al., Business Statistics, Pearson.
58 The 68-95-99.7 Rule for the Normal Distribution
- 68% of the observations fall within one standard deviation of the mean.
- 95% of the observations fall within two standard deviations of the mean.
- 99.7% of the observations fall within three standard deviations of the mean.
When applied to ‘real data’, these estimates are considered approximate!
59 Normal Distribution
Remember these probabilities (percentages):

# standard deviations from the mean    Approx. area under the normal curve
±1                                     .68
±1.645                                 .90
±1.96                                  .95
±2                                     .955
±2.575                                 .99
±3                                     .997

Practice: Find these values yourself using the Z table.
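As an alternative to the Z table, the areas within ±z standard deviations of the mean can be computed with Python’s standard library. This is a sketch; the helper name area_within is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(z):
    """Area under the standard normal curve between -z and +z."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


# The z values listed on the slide.
for z in (1, 1.645, 1.96, 2, 2.575, 3):
    print(f"+/-{z}: {area_within(z):.4f}")
```

The printed areas can be compared with the approximate values to remember (.68, .90, .95, .955, .99, .997).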
61 Standard Normal Distribution
[Figure: standard normal curve; 50% of the probability (0.5) lies on each side of the mean.]
62 Standard Normal Distribution
[Figure: standard normal distribution with the central 95% area marked; 2.5% of the probability lies in each tail.]
63 Calculating Probabilities
Probability calculations are always concerned with finding the probability that the variable assumes any value in an interval between two specific points a and b.
The probability that a continuous variable assumes a value between a and b is the area under the graph of the density between a and b.
64 Example: Weight
If the weight of males is normally distributed (N.D.) with μ=150 and σ=10, what is the probability that a randomly selected male will weigh between 140 lbs and 155 lbs?
[Important Note: Always remember that the probability that X is equal to any one particular value is zero, P(X=value) = 0, since the normal distribution is continuous.]
65 Example: Weight
Solution:
Z = (140 – 150)/10 = -1.00 s.d. from mean
Area under the curve = .3413 (from Z table)
Z = (155 – 150)/10 = +.50 s.d. from mean
Area under the curve = .1915 (from Z table)
Answer: .3413 + .1915 = .5328
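The weight example (μ = 150, σ = 10, interval 140 to 155 lbs) can be reproduced without a Z table. This sketch uses NormalDist from Python’s standard library; the variable names are illustrative.

```python
from statistics import NormalDist

weight = NormalDist(mu=150, sigma=10)

# Z-scores of the two interval endpoints.
z_low = (140 - weight.mean) / weight.stdev    # -1.0
z_high = (155 - weight.mean) / weight.stdev   # +0.5

# P(140 <= X <= 155) is the area under the curve between the endpoints.
probability = weight.cdf(155) - weight.cdf(140)
print(z_low, z_high)
print(round(probability, 4))  # 0.5328
```

Subtracting the two cumulative areas is equivalent to adding the two table areas measured from the mean, because the endpoints lie on opposite sides of the mean.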
66 Example
What’s the probability of getting a math SAT score of 575 or less, given μ=500 and σ=50?
Z = (575 – 500)/50 = 1.5, i.e., a score of 575 is 1.5 standard deviations above the mean.
Finding this area directly from the normal density: yikes! But to look up Z = 1.5 in the standard normal chart (or enter it into SAS): no problem! P(Z ≤ 1.5) = .9332
67 Example: IQ
If IQ is normally distributed (ND) with a mean of 100 and a S.D. of 10, what percentage of the population will have
(a) IQs ranging from 90 to 110?
(b) IQs ranging from 80 to 120?
Solution to part (a):
Z = (90 – 100)/10 = -1.00
Z = (110 – 100)/10 = +1.00
Area between 0 and 1.00 in the Z-table is .3413; area between 0 and -1.00 is also .3413 (Z-distribution is symmetric).
Answer to part (a) is .3413 + .3413 = .6826
68 Example: IQ
(b) IQs ranging from 80 to 120?
Solution:
Z = (80 – 100)/10 = -2.00
Z = (120 – 100)/10 = +2.00
Area between 0 and 2.00 in the Z-table is .4772; area between 0 and -2.00 is also .4772 (Z-distribution is symmetric).
Answer is .4772 + .4772 = .9544
69 Example: Salary
Suppose that the salary of college graduates is normally distributed (N.D.) with μ=$40,000 and σ=$10,000.
(a) What proportion of college graduates will earn $24,800 or less?
(b) What proportion of college graduates will earn $53,500 or more?
(c) What proportion of college graduates will earn between $45,000 and $57,000?
(d) Calculate the 80th percentile.
(e) Calculate the 27th percentile.
70 Example: Salary
(a) What proportion of college graduates will earn $24,800 or less?
Solution: Convert the $24,800 to a Z-score: Z = ($24,800 − $40,000)/$10,000 = −1.52
Always DRAW a picture of the distribution to help you solve these problems.
71 Example: Salary
[Figure: normal curve with X = $24,800 (Z = −1.52) and the mean $40,000 marked; the area between them is .4357.]
First, find the area between 0 and −1.52 in the Z-table. From the Z table, that area is .4357.
Then, the area from −1.52 to −∞ is .5000 − .4357 = .0643.
Answer: 6.43% of college graduates will earn less than $24,800.
72 Example: Salary
(b) What proportion of college graduates will earn $53,500 or more?
[Figure: normal curve with the mean $40,000 and X = $53,500 (Z = +1.35) marked; area .4115 between them and .0885 in the upper tail.]
Solution: Convert the $53,500 to a Z-score. Z = ($53,500 − $40,000)/$10,000 = +1.35
Find the area between 0 and +1.35 in the Z-table: .4115 is the table value.
When you DRAW A PICTURE (above) you see that you need the area in the tail: .5000 − .4115 = .0885.
Answer: Thus, 8.85% of college graduates will earn $53,500 or more.
73 Example: Salary
(c) What proportion of college graduates will earn between $45,000 and $57,000?
[Figure: normal curve with the mean $40k, $45k (Z = .5) and $57k (Z = 1.7) marked; area .1915 between 0 and .5, area .4554 between 0 and 1.7.]
Z = ($45,000 – $40,000)/$10,000 = .50
Z = ($57,000 – $40,000)/$10,000 = 1.70
From the table, we can get the area under the curve between the mean (0) and .5; we can get the area between 0 and 1.7. From the picture we see that neither one is what we need. What do we do here? Subtract the small piece from the big piece to get exactly what we need.
Answer: .4554 − .1915 = .2639
74 Z-scores and percentiles
Parts (d) and (e) of this example ask you to compute percentiles. Every Z-score is associated with a percentile. A Z-score of 0 is the 50th percentile. This means that if you take any test that is normally distributed (e.g., the SAT exam), and your Z-score on the test is 0, you scored at the 50th percentile. In fact, your score is the mean, median, and mode.
75 Example: Salary
(d) Calculate the 80th percentile.
[Figure: normal curve with the mean $40,000 and Z = .84 marked; area .5000 below the mean and .3000 between the mean and Z = .84.]
Solution: First, what Z-score is associated with the 80th percentile?
A Z-score of approximately +.84 will give you about .3000 of the area under the curve. Also, the area under the curve between −∞ and 0 is .5000. Therefore, a Z-score of +.84 is associated with the 80th percentile.
Now to find the salary (X) at the 80th percentile, just solve for X: +.84 = (X − $40,000)/$10,000
ANSWER: X = $40,000 + $8,400 = $48,400.
76 Example: Salary
(e) Calculate the 27th percentile.
[Figure: normal curve with the mean $40,000 and Z = −.61 marked; area .2700 in the lower tail, .2300 between Z = −.61 and the mean, and .5000 above the mean.]
Solution: First, what Z-score is associated with the 27th percentile?
A Z-score of approximately −.61 will give you about .2300 of the area under the curve, with .2700 in the tail. (The area under the curve between 0 and −.61 is .2291, which we are rounding to .2300.) Also, the area under the curve between 0 and ∞ is .5000. Therefore, a Z-score of −.61 is associated with the 27th percentile.
Now to find the salary (X) at the 27th percentile, just solve for X: −.61 = (X − $40,000)/$10,000
ANSWER: X = $40,000 − $6,100 = $33,900
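All five parts of the salary example (μ = $40,000, σ = $10,000) can be checked with Python’s standard library. This is a sketch; the variable names are illustrative. Because the software does not round Z to two decimals, its answers can differ slightly from the table-based ones, most visibly for the percentiles.

```python
from statistics import NormalDist

salary = NormalDist(mu=40_000, sigma=10_000)

# (a) proportion earning $24,800 or less: area to the left of Z = -1.52
at_most_24800 = salary.cdf(24_800)

# (b) proportion earning $53,500 or more: area in the upper tail
at_least_53500 = 1 - salary.cdf(53_500)

# (c) proportion between $45,000 and $57,000: big piece minus small piece
between_45k_57k = salary.cdf(57_000) - salary.cdf(45_000)

# (d), (e) percentiles: the inverse CDF turns an area into a salary
percentile_80 = salary.inv_cdf(0.80)
percentile_27 = salary.inv_cdf(0.27)

print(f"(a) {at_most_24800:.4f}")
print(f"(b) {at_least_53500:.4f}")
print(f"(c) {between_45k_57k:.4f}")
print(f"(d) ${percentile_80:,.0f}")
print(f"(e) ${percentile_27:,.0f}")
```

The percentile results land within a few tens of dollars of the slide answers ($48,400 and $33,900), which were obtained with Z rounded to +.84 and −.61.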
77 T-Distribution
- Similar to the standard normal in that it is unimodal, bell-shaped and symmetric.
- The tails of the distribution are “thicker” than those of the standard normal.
- The distribution is indexed by “degrees of freedom” (df).
- The degrees of freedom measure the amount of information available in the data set that can be used for estimating the population variance (df=n-1).
- Area under the curve still equals 1.
- Probabilities for the t-distribution with infinite df equal those of the standard normal.
78 Graphical Methods
- Frequency Distribution
- Histogram
- Frequency Polygon
- Cumulative Frequency Graph
- Pie Chart
79 Presenting Data
Table
- Condenses data into a form that can make them easier to understand;
- Shows many details in summary fashion;
BUT
- Since a table shows only numbers, it may not be readily understood without comparing it to other values.
80 Principles of Table Construction
- Don’t try to do too much in a table.
- Use white space effectively to make the table layout pleasing to the eye.
- Make sure tables & text refer to each other.
- Use some aspect of the table to order & group rows & columns.
81 Principles of Table Construction
- If appropriate, frame the table with summary statistics in rows & columns to provide a standard of comparison.
- Round numbers in the table to one or two decimal places to make them easily understood.
- When creating tables for publication in a manuscript, double-space them unless contraindicated by the journal.
82 Frequency Distributions
A useful way to present data when you have a large data set is the formation of a frequency table or frequency distribution.
Frequency: the number of observations that fall within a certain range of the data.
83 Frequency Table

Age      Number of Deaths
<1       564
1-4      86
5-14     127
15-24    490
25-34    66
35-44    806
45-54    1,425
55-64    3,511
65-74    6,932
75-84    10,101
85+      9,825
Total    34,524
84 Frequency Table

Data Intervals   Frequency   Cumulative Frequency   Relative Frequency (%)   Cumulative Relative Frequency (%)
10-19            5
20-29            18          23
30-39            10          33
40-49            13          46
50-59            4           50
60-69                        54
70-79            2           56
Total
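The blank columns of such a table can be filled in programmatically. The sketch below uses the frequencies as read from the table; the frequency of 4 for the 60-69 interval is an assumption deduced from the cumulative frequencies (50, then 54), and the values may not match the original data set exactly.

```python
from itertools import accumulate

intervals = ["10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]
frequencies = [5, 18, 10, 13, 4, 4, 2]  # 60-69 value is deduced, see above

total = sum(frequencies)
cumulative = list(accumulate(frequencies))               # running totals
relative = [100 * f / total for f in frequencies]        # percent of total
cumulative_relative = [100 * c / total for c in cumulative]

print(f"{'Interval':<10}{'Freq':>6}{'Cum':>6}{'Rel %':>8}{'Cum rel %':>11}")
for row in zip(intervals, frequencies, cumulative, relative, cumulative_relative):
    print(f"{row[0]:<10}{row[1]:>6}{row[2]:>6}{row[3]:>8.1f}{row[4]:>11.1f}")
print(f"{'Total':<10}{total:>6}")
```

The last cumulative relative frequency is always 100%, which is a quick check that the columns were built correctly.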
85 Cumulative Relative Frequency
Cumulative Relative Frequency: the percentage of persons having a measurement less than or equal to the upper boundary of the class interval.
E.g. cumulative relative frequency for the 3rd interval of our data example: = 59.6%
- We say that 59.6% of the children have weights below 39.5 pounds.
86 Number of Intervals
There is no clear-cut rule on the number of intervals or classes that should be used.
- Too many intervals: the data may not be summarized enough for a clear visualization of how they are distributed.
- Too few intervals: the data may be over-summarized and some of the details of the distribution may be lost.
87 Presenting Data
Chart
- Visual representation of a frequency distribution that helps to gain insight about what the data mean.
- Built with lines, area & text.
Ex: bar chart, pie chart
88 Bar Chart
- Simplest form of chart
- Used to display nominal or ordinal data
91 Pie Chart
- Alternative to bar chart
- Circle partitioned into percentage distributions of qualitative variables with total area of 100%
92 Histogram
- Appropriate for interval, ratio and sometimes ordinal data
- Similar to bar charts, but bars are placed side by side
- Often used to represent both frequencies and percentages
- Most histograms have from 5 to 20 bars
94 Pictures of Data: Histograms
Blood pressure data on a sample of 113 men
[Figure: Histogram of the Systolic Blood Pressure for 113 men. Each bar spans a width of 5 mmHg on the horizontal axis. The height of each bar represents the number of individuals with SBP in that range.]
95 Frequency Polygon
- First place a dot at the midpoint of the upper base of each rectangular bar.
- The points are connected with straight lines.
- At the ends, the points are connected to the midpoints of the previous and succeeding intervals (these intervals have zero frequency).
96 Hallmarks of a Good Chart
- Simple & easy to read
- Placed correctly within text
- Use color only when it has a purpose, not solely for decoration
- Make sure others can understand the chart; try it out on somebody first
Remember: A poor chart is worse than no chart at all.
97 Cumulative Frequency Plot
- Place a point at the upper class boundary on the horizontal axis and at the corresponding cumulative frequency on the vertical axis.
- Each point represents the cumulative relative frequency, and the points are connected with straight lines.
- The left end is connected to the lower boundary of the first interval that has data.
98 The Uses of Frequency Distributions
- Becoming familiar with the dataset.
- Cleaning the data.
  - Outliers: values that lie outside the normal range of values for other cases.
  - Inspecting the data for missing values.
- Testing assumptions for statistical tests.
  - An assumption is a condition that is presumed to be true and that, when ignored or violated, can lead to misleading or invalid results.
  - When the DV is not normally distributed, researchers have to choose between three options:
    1. Select a statistical test that does not assume a normal distribution.
    2. Ignore the violation of the assumption.
    3. Transform the variable to better approximate a distribution that is normal. Please consult the various data transformations.
99 The Uses of Frequency Distributions
- Obtaining information about sample characteristics.
- Directly answering research questions.
100 Outliers
Outliers are values that are extreme relative to the bulk of scores in the distribution. They appear to be inconsistent with the rest of the data.
Advantages:
- They may indicate characteristics of the population that would not be known in the normal course of analysis.
Disadvantages:
- They do not represent the population.
- They run counter to the objectives of the analysis.
- They can distort statistical tests.
101 Sources of Outliers
- An error in the recording of the data.
- A failure of data collection, such as not following sample criteria (e.g. inadvertently admitting a disoriented patient into a study), a subject not following instructions on a questionnaire, or equipment failure.
- An actual extreme value from an unusual subject.
102 Methods to Identify Outliers
- The traditional way of labeling outliers: any value more than 3 SD from the mean.
- Values that are more than 3 IQRs from the upper or lower edge of the box plot are extreme outliers.
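Both rules can be sketched in Python. The data are hypothetical, the function names are illustrative, and the quartiles come from statistics.quantiles with its default method, which is an assumption: other software may compute the box edges slightly differently.

```python
import statistics


def sd_outliers(values, k=3):
    """Values more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


def extreme_iqr_outliers(values, k=3):
    """Values more than k IQRs beyond the box edges (Q1 and Q3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical hours needed to complete a task; one person was very slow.
hours = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]

by_sd = sd_outliers(hours)
by_iqr = extreme_iqr_outliers(hours)
print(by_sd)    # []
print(by_iqr)   # [30]
```

In this small sample the extreme value inflates the SD enough that the 3 SD rule does not flag it, while the IQR rule does, because quartiles are resistant to extreme values.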
103 Handling Outliers
Analyze the data two ways:
- With the outliers in the distribution.
- With the outliers removed.
If the results are similar, as they are likely to be if the sample size is large, then the outliers may be ignored.
If the results are not similar, then a statistical analysis that is resistant to outliers can be used (e.g. median and IQR).
If you want to use a mean with outliers, then the trimmed mean is an option. It is calculated with a certain percentage of the extreme values removed from both ends of the distribution (e.g. if n=100, then the 5% trimmed mean is the mean of the middle 90% of the observations).
104 Handling Outliers
Another alternative is a Winsorized mean. The highest and lowest extremes are replaced by the next-to-highest value and by the next-to-lowest value.
For univariate outliers, Tabachnick and Fidell (2001) suggest changing the scores on the variables for the outlying cases so that they are less deviant. E.g. if the two largest scores in the distribution are 125 and 122 and the next largest score is 87, recode 122 as 88 and 125 as 89.
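The trimmed mean and the Winsorized mean can be sketched in Python as follows. The function names and data are hypothetical; trimmed_mean removes the given proportion from each end, and winsorized_mean replaces the `count` most extreme values at each end (one by default).

```python
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from EACH end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # how many to drop per end
    return statistics.mean(ordered[k:len(ordered) - k])


def winsorized_mean(values, count=1):
    """Mean after replacing the `count` lowest and highest values with
    the next-to-lowest and next-to-highest values."""
    ordered = sorted(values)
    low, high = ordered[count], ordered[-count - 1]
    return statistics.mean(min(max(v, low), high) for v in ordered)


# n = 100: the 5% trimmed mean is the mean of the middle 90% (6..95).
scores = list(range(1, 101))
print(trimmed_mean(scores, 0.05))    # 50.5

# Hypothetical task times: 30 is replaced by 6, and 2 by 3.
hours = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]
print(statistics.mean(hours))        # 6.6
print(winsorized_mean(hours))        # 4.3
```

Both functions assume the sample is large enough that some values remain after trimming or replacing.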
105 Missing Data: Any systematic event external to the respondent (such as data entry errors or data collection problems) or action on the part of the respondent (such as refusal to answer) that leads to missing data. Missing data mean that analyses are based on fewer study participants than were in the full study sample. This, in turn, means less statistical power, which can undermine statistical conclusion validity (the degree to which the statistical results are accurate). Missing data can also affect internal validity (the degree to which inferences about the causal effect of the independent variable on the dependent variable are warranted) and external validity (generalizability).
106 Strategies to Avoid Missing Data: Persistent follow-up. Flexibility in scheduling appointments. Paying incentives. Using well-proven methods to track people who have moved. Performing a thorough review of completed data forms prior to excusing participants.
107 Techniques for Handling Missing Data: Deletion techniques involve excluding subjects with missing data from statistical calculation. Imputation techniques involve calculating an estimate of each missing value and replacing, or imputing, each value by its respective estimate. Note: techniques for handling missing data often vary in the degree to which they affect the amount of dispersion around true scores, and the degree of bias in the final results. Therefore, the selection of a data handling technique should be carefully considered.
108 Deletion Techniques: Deletion methods involve removal of cases or variables with missing data. Listwise deletion: also called complete case analysis. It is simply the analysis of those cases for which there are no missing data. It eliminates an entire case when any of its items/variables has a missing data point, whether or not that data point is part of the analysis. It is the default in SPSS. Pairwise deletion: also called available case analysis (unwise deletion). It involves omitting cases from the analysis on a variable-by-variable basis. It eliminates a case only when that case has missing data for the variables or items under analysis. Note: deletion techniques are widely criticized because they assume that the data are missing completely at random (MCAR), which is very difficult to ascertain, pose a risk of bias, and lead to a reduction of sample size and power.
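The difference between the two deletion rules is easiest to see on a small example. The sketch below is illustrative only: the records, variable names, and helper functions are hypothetical, and missing values are represented by `None`.

```python
# Hypothetical study records; None marks a missing value.
rows = [
    {"age": 34,   "weight": 70,   "pain": 3},
    {"age": 41,   "weight": None, "pain": 5},
    {"age": None, "weight": 82,   "pain": 4},
    {"age": 29,   "weight": 64,   "pain": None},
    {"age": 55,   "weight": 90,   "pain": 6},
]

def listwise(rows):
    """Keep only cases with no missing value on ANY variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, variables):
    """Keep cases that are complete on the variables under analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

print(len(listwise(rows)))                 # cases usable for every analysis
print(len(pairwise(rows, ["age", "pain"])))  # cases usable for age vs. pain
print(len(pairwise(rows, ["weight"])))       # cases usable for weight alone
```

Listwise deletion keeps the same, smaller set of cases for every analysis, whereas pairwise deletion retains more cases but bases each analysis on a different subsample.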
109 Imputation Techniques: Imputation is the process of estimating missing data based on valid values of other variables or cases in the sample. The goal of imputation is to use known relationships that can be identified in the valid values of the sample to help estimate the missing data.
110 Types of Imputation Techniques: Using prior knowledge. Inserting mean values. Using regression.
111 Prior Knowledge: Involves replacing a missing value with a value based on an educated guess. It is a reasonable method if the researcher has a good working knowledge of the research domain, the sample is large, and the number of missing values is small.
112 Mean Replacement: Involves calculating the mean from the available data on a variable and using it to replace missing values on that variable before analysis. For skewed distributions the median is used instead (median replacement). It is a conservative procedure because the mean of the distribution as a whole does not change and the researcher does not have to guess at missing values.
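A minimal sketch of mean (or median) replacement follows. The function name `impute_center` and the `raw` data are hypothetical, and `None` marks a missing value.

```python
import statistics

def impute_center(values, use_median=False):
    """Replace missing values (None) with the mean, or the median
    for skewed data, of the observed values on that variable."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed) if use_median else statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical variable with two missing values.
raw = [4, None, 6, 8, None, 10]
observed = [v for v in raw if v is not None]
filled = impute_center(raw)

print(filled)
print(statistics.mean(observed), statistics.mean(filled))    # mean is unchanged
print(statistics.stdev(observed), statistics.stdev(filled))  # spread shrinks
```

The comparison illustrates the trade-off: the mean is preserved, but the standard deviation becomes smaller because every imputed case sits exactly at the center of the distribution.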
113 Using Regression: Involves using other variables in the dataset as independent variables to develop a regression equation, with the variable that has missing data serving as the dependent variable. Cases with complete data are used to generate the regression equation. The equation is then used to predict missing values for incomplete cases. More regressions are computed, using the predicted values from the previous regression to develop the next equation, until the predicted values from one step to the next are comparable. Predictions from the last regression are the ones used to replace missing values.
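The steps above can be sketched with a single predictor and ordinary least squares. This is a simplified illustration under stated assumptions, not the procedure of any particular package: the names `fit_line` and `regression_impute`, the tolerance, and the data are all hypothetical, and only the dependent variable has missing values (marked `None`).

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def regression_impute(xs, ys, tol=1e-6, max_iter=50):
    """Fill missing ys (None) by repeated regression on xs."""
    missing = [i for i, y in enumerate(ys) if y is None]
    complete = [i for i, y in enumerate(ys) if y is not None]

    # Step 1: fit the equation on complete cases only.
    slope, intercept = fit_line([xs[i] for i in complete],
                                [ys[i] for i in complete])
    # Step 2: predict the missing values.
    filled = list(ys)
    for i in missing:
        filled[i] = slope * xs[i] + intercept

    # Step 3: refit using the predicted values until predictions settle.
    for _ in range(max_iter):
        slope, intercept = fit_line(xs, filled)
        new = {i: slope * xs[i] + intercept for i in missing}
        change = max((abs(new[i] - filled[i]) for i in missing), default=0.0)
        for i in missing:
            filled[i] = new[i]
        if change < tol:
            break
    return filled

# Hypothetical data: x fully observed, y missing for two cases.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, None, 8.2, None, 12.0]
imputed = regression_impute(x, y)
print(imputed)
```

In this one-predictor case the loop settles after the first refit, because the imputed points already lie on the fitted line. The repeated refitting matters more when several variables have missing values and each is predicted from the others.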