# Session 6.1 Univariate Data Analysis

LIS 570 Session 6.1 Univariate Data Analysis

Objectives: Have answers to the following questions
- Why is the normal distribution important for the statistical analyses presented to make sense?
- What is the logic behind inferential statistics? (On what theories is it based?)
- What is a confidence interval?
- In what ways can we summarize quantitative data?
- What are some visualization techniques to help us summarize and make sense of data?

Agenda
- Exercise: understand “the problem”
- Vocabulary
- Functions of statistics
- When to use what type
- Descriptive statistics
- Inferential statistics

Why and What
- Why know statistics?
  - Informed consumer…
  - Informed user…
  - Informed professional…
- What is a statistic? A descriptive summary (index) of a sample.

Sample and Population
- Population (universe): the totality of things we are interested in (e.g., the population of all students at the UW).
- Sample: a set of observations, instances, or individuals drawn from a population, usually intended to represent the population in a study.
- New vocabulary: a statistic is a characteristic of a sample, while the same characteristic, if descriptive of a population, is called a population parameter.
- [Figure: a population and a sample drawn from it, with averages of 4.5 and 4.55; the sample average is a statistic and the population average is a parameter.]

2 major functions of statistics
- Descriptive statistics: help us describe characteristics of a sample. Procedures to summarize, organize, and simplify data.
- Inferential statistics: help us describe characteristics of a population. Techniques for studying samples and then making generalizations about the population from which the samples were selected.*

\* Source: Gravetter, F. J. and Wallnau, L. B. (2002). Essentials of Statistics for the Behavioral Sciences. 4th edition. Pacific Grove, CA: Wadsworth, p. 5.

Vocabulary
- Variable: a characteristic that has more than one value, e.g., sex (male, female); hours of work per week (anything from 0 to 168).
- Independent variable (X): manipulated by the researcher or believed to be the cause of…
- Dependent variable (Y): the variable observed to assess the effect of the manipulation, or one that changes depending on the independent variable.
- Data: observations (measurements) taken on the units of analysis.

Choosing the Statistical Technique*
1. Start from the specific research question or hypothesis.
2. Determine the number of variables in question: univariate, bivariate, or multivariate analysis.
3. Determine the level of measurement of the variables.
4. Choose the univariate method of analysis.
5. Choose relevant descriptive statistics.
6. Choose relevant inferential statistics.

\* Source: De Vaus, D.A. (1991). Surveys in Social Research. Third edition. North Sydney, Australia: Allen & Unwin Pty Ltd., p. 133.

What To Do with a Bunch of Numbers
- Organize the observations.
- Interested primarily in normality and deviations from normality.
- Examine:
  - Central tendency
  - Dispersion
  - Shape of distribution
- Visualization aids:
  - Frequency distribution (percentile) tables and charts
  - Histograms
  - Bar and pie charts (nominal data)
  - Frequency polygon
  - Cumulative percentage curve
  - Stem-and-leaf diagrams
  - Box plots

Frequency Distributions
- Ungrouped frequency distribution
  - A list of each of the values of the variable
  - The number of times and/or the percent of times each value occurs
- Grouped frequency distribution
  - A table or graph
  - Shows frequencies or percentages for ranges of values

Frequency distributions
Include in frequency distribution tables:
- Table number and title
- Labels for the categories of the variables
- Column headings
- Total number of cases (N)
- The number of missing cases
- Source of the data
- Footnotes to explain anomalies and notes

\* Source: De Vaus, D.A. (1991). Surveys in Social Research. Third edition. North Sydney, Australia: Allen & Unwin Pty Ltd., p. 133.

Grouped frequency distribution
Table 1: Example of grouped frequency distribution

| Score range (your value label) | Real limits* | Frequencies (ƒ) | Cumulative frequencies (Cf) | Percent (%) | Cumulative percent |
|---|---|---|---|---|---|
| 9-10 | | 3 | 20 | 15 | 100 |
| 7-8 | | 4 | 17 | 20 | 85 |
| 5-6 | | 7 | 13 | 35 | 65 |
| 3-4 | | 4 | 6 | 20 | 30 |
| 1-2 | | 2 | 2 | 10 | 10 |
| Total (N) | | 20 | | | |

Valid cases: 20. Missing cases: 0.

Note 1: The “real limits” of a score extend from one-half of the smallest unit of measurement below the value of the score to one-half unit above.

Note 2: Percent (%) = (ƒ / N) × 100; Cumulative % = (Cf / N) × 100.
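The formulas in Note 2 can be checked with a short Python sketch. The frequencies below are assumed values chosen to be consistent with the cumulative column of Table 1 (N = 20), and the variable names are illustrative only.

```python
# Frequencies per score range, listed from the lowest range to the highest.
# Assumed values, chosen to agree with the cumulative column of Table 1.
freqs = {"1-2": 2, "3-4": 4, "5-6": 7, "7-8": 4, "9-10": 3}

n = sum(freqs.values())  # total number of cases (N)

rows = []
cumulative = 0
for score_range, f in freqs.items():
    cumulative += f                     # Cf: running total of frequencies
    percent = f * 100 / n               # Percent = (f / N) * 100
    cum_percent = cumulative * 100 / n  # Cumulative % = (Cf / N) * 100
    rows.append((score_range, f, cumulative, percent, cum_percent))

# Print the highest range first, matching the layout of the table.
for score_range, f, cf, pct, cum_pct in reversed(rows):
    print(f"{score_range:>5}  f={f:<2}  Cf={cf:<2}  %={pct:<5.1f}  cum%={cum_pct:.1f}")
```

Accumulating from the lowest range upward is what makes the top row reach a cumulative percent of 100.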

| Score intervals | ƒ |
|---|---|
| 45-49 | 1 |
| 50-54 | 2 |
| 55-59 | 4 |
| 60-64 | |
| 65-69 | 7 |
| 70-74 | 9 |
| 75-79 | 16 |
| 80-84 | 10 |
| 85-89 | |
| 90-94 | 6 |
| 95-99 | |

- The height of the bar corresponds to the frequency (ƒ).
- The width of the bar extends to the real limits of the score.
- Used only on interval and ratio scales.
- No space between bars (with spaces between the bars, it is a bar chart).

What do graphs (histograms) show?
- Normality (normal distributions) [Why are normal distributions important?]
- Deviations from normality
  - Positive skewness
  - Negative skewness
  - Bimodality
  - And more…
- Keep the question on normality in your head, as it will be answered in coming slides.

Shapes of distribution
- Symmetrical
  - Normal distribution: symmetrical, bell-shaped curve
- Asymmetrical
  - Positively skewed: tail on the right, cluster towards the low end of the variable
  - Negatively skewed: tail on the left, cluster towards the high end of the variable
- Bimodality: a double peak

Central Tendency
- Central tendency is a single summary figure that, ideally, is the most representative value of all values in the distribution. It is used to describe the “typical” or representative value.
- Mean (arithmetic mean), m: sum all the observations and divide by N. Use for interval variables when appropriate.
- Median: the value that divides the distribution so that an equal number of values lie above and below it.
- Mode: the value with the greatest frequency (a distribution may be uni-modal, bi-modal, etc.).

Why do we care about anything besides central tendency?
- Variability refers to spread or dispersion: the extent to which a set of scores scatter about or cluster together.
- Distributions can have equal means but unequal variability.
- Measures of variability:
  - Range
  - Interquartile range
  - Sum of squares
  - Variance
  - Standard deviation
  - Kurtosis

Kurtosis
- Two distributions can have the same mean and variance and still differ in the weight of their tails.
- Karl Pearson suggested the names:
  - Longer tailed: leptokurtic
  - Shorter tailed: platykurtic

Mode (Mo): most common value
- Best for nominal-level data.
- Cautions:
  - The most common value may not measure typicality.
  - Not sensitive to outliers (both good and bad).
  - There may be more than one mode.
  - Unstable from sample to sample.
- Dispersion: the variation ratio (v) is the percentage of people not in the modal category.
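A minimal sketch of the mode and the variation ratio for nominal data. The survey responses are hypothetical, invented only to illustrate the calculation.

```python
from collections import Counter

# Hypothetical nominal data: the library service each of 10 respondents uses most.
responses = ["reference", "lending", "lending", "internet", "lending",
             "reference", "lending", "internet", "lending", "reference"]

counts = Counter(responses)

# The mode is the category with the greatest frequency.
modal_category, modal_count = counts.most_common(1)[0]

# Variation ratio (v): the share of cases NOT in the modal category.
variation_ratio = 1 - modal_count / len(responses)

print(f"mode = {modal_category}, v = {variation_ratio:.0%}")
```

Note that `most_common(1)` reports only one category, so a tie between two modes would need to be checked separately.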

Median (Mdn): Even split of sample
- Appropriate for ordinal data; for interval or ratio data, good for skewed distributions (where the mean would not be a good measure of central tendency).
- Minimal calculation (need to know frequencies).
- Reasonably insensitive to outliers (as long as there are only a few).
- Reasonably stable from sample to sample.
- Example with an ordinal variable: people are ranked from low to high (e.g., by height); the median is the middle case, and the median category is the one to which the middle person belongs.

Median– simple examples
- First example: Mdn = 4.
- Second example: Mdn = 5.5, by interpolation between the middle values 5 and 6: (5 + 6)/2 = 11/2 = 5.5.
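The two cases can be sketched with Python's standard `statistics` module. The data lists are hypothetical, chosen only so that the medians come out as 4 and 5.5.

```python
import statistics

# Hypothetical data with an odd number of cases: the median is the middle value.
odd_cases = [1, 3, 4, 6, 9]

# Hypothetical data with an even number of cases: the median is interpolated
# between the two middle values, here (5 + 6) / 2.
even_cases = [2, 4, 5, 6, 8, 9]

median_odd = statistics.median(odd_cases)
median_even = statistics.median(even_cases)

print(median_odd, median_even)
```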

Dispersion
- The nth percentile of a set of numbers is a value such that n percent of the numbers fall below it and the rest fall above.
- The median is the 50th percentile.
- The lower quartile is the 25th percentile.
- The upper quartile is the 75th percentile.
- Five-number summary of a sample: the median, the lower and upper quartiles, and the two extremes (minimum and maximum).
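As a sketch, the quartiles and the five-number summary in its usual form (extremes, quartiles, median) can be computed with `statistics.quantiles`. The data are hypothetical, and different quartile conventions give slightly different values; the function's default "exclusive" method is assumed here.

```python
import statistics

# Hypothetical sample of nine scores.
scores = [4, 6, 7, 8, 10, 12, 13, 14, 16]

# quantiles(..., n=4) returns the three cut points: the 25th, 50th, and
# 75th percentiles (lower quartile, median, upper quartile).
q1, med, q3 = statistics.quantiles(scores, n=4)

five_number_summary = (min(scores), q1, med, q3, max(scores))
iqr = q3 - q1  # interquartile range: spread of the middle 50% of cases

print("five-number summary:", five_number_summary)
print("IQR:", iqr)
```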

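Dispersion: interquartile range
[Figure: a distribution marked with the lower quartile, the median, and the upper quartile; the interquartile range lies between the bottom 25% and the top 25%.]

Boxplot
[Figure: box plots of Variable 1, Variable 2, and Variable 3 on a scale from 4 to 16, with the interquartile range (IQR) marked.]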

Mean
- Uses the actual numerical values of the observations.
- Most stable from sample to sample.
- Most common measure of center.
- Makes sense only for interval or ratio data, although it is frequently computed for ordinal variables as well.
- Not a good representation of central tendency for skewed samples.
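A small sketch of why the mean is a poor summary for skewed samples: a single extreme value pulls the mean but leaves the median unchanged. Both data sets are hypothetical.

```python
import statistics

# Hypothetical symmetrical sample and the same sample with one extreme value.
symmetrical = [3, 4, 5, 6, 7]
with_outlier = [3, 4, 5, 6, 70]

mean_sym = statistics.mean(symmetrical)
median_sym = statistics.median(symmetrical)
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print("symmetrical:  mean", mean_sym, "median", median_sym)
print("with outlier: mean", mean_out, "median", median_out)
```

A mean that sits well away from the median is a quick signal that the distribution is skewed.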

Mean: Dispersion
- The standard deviation and variance measure spread about the mean as centre.
- Deviation: distance and direction from the mean. Raw deviations do not work as a measure of variability because they always add up to zero (see next slide).
- Variance: the mean of the squared deviation scores (the squared deviations of observations from the mean).
- Standard deviation:
  - Conceptually: the typical distance of scores from the mean.
  - Technically: the square root of the variance.

Example
Data: (6, 7, 5, 3, 4)

$$\bar{x} = \frac{6 + 7 + 5 + 3 + 4}{5} = \frac{25}{5} = 5$$

Variance ($s^2$)
1. Calculate the mean for the variable.
2. Take each observation and subtract the mean from it.
3. Square each result.
4. Add (sum) all the squared results.
5. Divide by n.

Variance ($s^2$)

$$s^2 = \frac{\text{sum of the squared deviations}}{\text{number of observations}} = \frac{10}{5} = 2$$

Standard deviation (s)
- The square root of the variance: $s = \sqrt{2} \approx 1.4$.
- An average deviation of the observations from their mean.
- Influenced by outliers.
- Best used with symmetrical distributions.
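The worked example with the data (6, 7, 5, 3, 4) can be reproduced step by step in Python. This sketch divides by n, as the slides do. The standard library's `statistics.pvariance` uses the same divisor, whereas `statistics.variance` divides by n - 1 and would give a different result.

```python
import math
import statistics

data = [6, 7, 5, 3, 4]
n = len(data)

mean = sum(data) / n                    # 25 / 5 = 5
deviations = [x - mean for x in data]   # distance and direction from the mean
squared = [d ** 2 for d in deviations]  # squaring removes the signs
sum_of_squares = sum(squared)
variance = sum_of_squares / n           # divide by n, as on the slide
std_dev = math.sqrt(variance)

print("deviations sum to", sum(deviations))  # why raw deviations cannot measure spread
print("variance =", variance, " standard deviation =", round(std_dev, 1))
```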

Summary
- Descriptive statistics, univariate analysis (central tendency, frequency distribution, dispersion).
- Determine whether the variable is nominal, ordinal, or interval.
- Nominal: frequency tables, mode.
- Ordinal:
  - Frequency tables (grouped frequency tables), histogram
  - Median and five-number summary
  - Mode

Summary
- Interval:
  - Determine whether the distribution is skewed or symmetrical by comparing the median and the mean.
  - Use the mean and the standard deviation if the distribution is not markedly skewed; otherwise use the five-number summary (median, extremes, quartiles).
  - Use the mode in addition if it adds anything.

Abstract and Elevator Speech
- Elevator speech: a 20-30 second synopsis; intent: to elicit interest
  - Who you are and what you are doing
  - With whom
  - Where/how
  - Why: what you hope to find, why the results may be important
- Abstract: […] words; intent: to elicit interest and summarize
  - What type of study
  - How approached
  - When, where
  - Why: what you hope to find, why the results may be important

Selecting analysis and statistical techniques*
1. Start from the specific research question or hypothesis.
2. Determine the number of variables in question: univariate, bivariate, or multivariate analysis.
3. Determine the level of measurement of the variables.
4. Choose the univariate method of analysis.
5. Choose relevant descriptive statistics.
6. Choose relevant inferential statistics.

\* Source: De Vaus, D.A. (1991). Surveys in Social Research. Third edition. North Sydney, Australia: Allen & Unwin Pty Ltd., p. 133.

Exercise—sampling distribution
- Coins, coins! The probability of heads or of tails is 50% each.
- Each of you is a “sample” for this activity.
- Flip the coin 7 times and count the number of times you get a “head”.
- Live demo:
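The classroom exercise can be simulated as a rough sketch. The number of simulated people (1000) and the random seed are arbitrary assumptions. Each simulated person flips a fair coin 7 times, and the observed share of each head count is compared with the binomial probability C(7, k) / 2^7.

```python
import math
import random
from collections import Counter

FLIPS = 7      # flips per person, as in the exercise
PEOPLE = 1000  # hypothetical number of "samples", far larger than a real class

rng = random.Random(570)  # arbitrary fixed seed so the run is repeatable


def count_heads(flips):
    """Flip a fair coin `flips` times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(flips))


samples = [count_heads(FLIPS) for _ in range(PEOPLE)]
observed = Counter(samples)

# Theoretical probability of k heads in 7 fair flips.
expected = {k: math.comb(FLIPS, k) / 2 ** FLIPS for k in range(FLIPS + 1)}

for k in range(FLIPS + 1):
    bar = "#" * (observed[k] // 10)  # crude text histogram
    print(f"{k} heads: observed {observed[k] / PEOPLE:.3f}, "
          f"expected {expected[k]:.3f} {bar}")
```

With enough samples the text histogram piles up around 3 and 4 heads and thins out symmetrically toward 0 and 7.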

Why is normality important?
- [Figure: normal curve with the 68%, 95%, and 100% areas marked.]
- Use proportions of the normal distribution to determine probabilities associated with any specific sample.
- Sampling error
- Standard error (SE): a way of defining and measuring sampling error (exactly how much error, on average, should exist between a sample mean and the unknown population mean, simply due to chance).

Standard Error of the mean
Standard error of the mean ($S_m$):

$$S_m = \frac{S}{\sqrt{N}}$$

where $S$ is the standard deviation and $N$ is the total number in the sample.
- Standard error is inversely related to the square root of the sample size: to reduce standard error, increase sample size.
- Standard error is directly related to the standard deviation.
- When N = 1, the standard error is equal to the standard deviation.
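A minimal sketch of the standard error formula, using a hypothetical standard deviation of 10 to show how the standard error shrinks as the sample grows. The function name is illustrative.

```python
import math


def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation / square root of N."""
    return std_dev / math.sqrt(n)


S = 10  # hypothetical standard deviation

for n in (1, 4, 25, 100):
    print(f"N = {n:>3}: standard error = {standard_error(S, n)}")
```

Because of the square root, quadrupling the sample size only halves the standard error.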

Inferential statistics - univariate analysis
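- Interval estimates and interval variables.
- Estimation of sample mean accuracy, based on random sampling and probability theory.
- Standardize the sample mean to estimate the population mean:

$$t = \frac{\text{sample mean} - \text{population mean}}{\text{estimated SE}}$$

$$\text{population mean} = \text{sample mean} \pm t \times (\text{estimated SE})$$

Here $\pm t$ are the lower and upper t-values that bound the chosen area of the distribution.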

Confidence Interval
- Utilizes probability theory; assumes a normal distribution.
- About 95% of sample means will fall within about 2 standard deviations of the sampling distribution (that is, 2 standard errors) of the population mean.
- By the same token, for 95% of samples, the population mean will be within plus or minus 2 standard error units of the sample mean.
- E.g., for an 80% C.I., first find the lower and upper t-values that bound 80% of the area of the distribution. We can then state, with 80% confidence, that the population mean is within: sample mean ± t × (SE).
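A rough sketch of an interval estimate for the mean. Two assumptions are swapped in here and should be read as such: the critical value comes from the standard normal distribution (`statistics.NormalDist`) rather than from a t table, which is a large-sample approximation, and the standard deviation is estimated with the n - 1 divisor (`statistics.stdev`). The sample is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(data, confidence=0.95):
    """Return (lower, upper) bounds: sample mean +/- z * estimated SE.

    Uses a normal critical value as a large-sample stand-in for t.
    """
    n = len(data)
    m = mean(data)
    se = stdev(data) / math.sqrt(n)  # estimated standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m - z * se, m + z * se


# Hypothetical sample of 36 scores with a mean of 5.
sample = [3, 4, 5, 5, 6, 7] * 6

low95, high95 = confidence_interval(sample, 0.95)
low80, high80 = confidence_interval(sample, 0.80)

print(f"95% CI: {low95:.2f} to {high95:.2f}")
print(f"80% CI: {low80:.2f} to {high80:.2f}")
```

The 80% interval is narrower than the 95% interval: less confidence is traded for a tighter range. For small samples, the t-value for the appropriate degrees of freedom should replace `z`.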

Standard Error (for nominal & ordinal data)
- The variable must have only two categories (categories may have to be combined to achieve this).
- Standard error for the binomial distribution:

$$S_B = \sqrt{\frac{PQ}{N}}$$

where $P$ is the percentage in one category of the variable, $Q$ is the percentage in the other category, and $N$ is the total number in the sample.
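A minimal sketch of the binomial standard error, with P and Q expressed as percentages so that the result is in percentage points. The function name and the example figures are hypothetical.

```python
import math


def binomial_standard_error(p_percent, n):
    """Standard error for a two-category variable: sqrt(P * Q / N).

    p_percent is the percentage in one category; Q is the remainder.
    """
    q_percent = 100 - p_percent
    return math.sqrt(p_percent * q_percent / n)


# Hypothetical survey: 50% answered "yes" in a sample of 100.
se_even_split = binomial_standard_error(50, 100)

# Hypothetical survey: 20% answered "yes" in a sample of 400.
se_uneven_split = binomial_standard_error(20, 400)

print(se_even_split, se_uneven_split)
```

For a fixed sample size the standard error is largest at a 50/50 split, because P × Q peaks there.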