# Overview of Statistical Concepts

Lincoln Jiang Statistical Consultant Western Michigan University The Graduate College Graduate Center for Research and Retention

Definition of Statistics
Statistics is the art of making numerical conjectures about puzzling questions. --- Statistics Fourth Edition by Freedman

Definition of Statistics
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. ---From Wikipedia

Basic Terms
Variables: Characteristics that can take on any number of different values
Values: Possible numbers or categories that a variable can have
Scores: A particular person’s value on a variable

Types of Data
Qualitative data: nonnumeric, e.g., types of material {straw, sticks, bricks}
Quantitative data: numeric
Discrete data: numeric data that take separate, countable values, e.g., counting numbers {1, 2, 3, 4, 5}
Continuous data: numeric data that can take any of the infinitely many values in an interval, e.g., real numbers

Types of Scale
Nominal: has no order and only gives names or labels to various categories. Variables assessed on a nominal scale are called categorical variables.
Ordinal: has order, but the interval between measurements is not meaningful.
Interval: has meaningful intervals between measurements, but there is no true starting point (zero), e.g., temperature on the Celsius scale. Ratios between numbers on the scale are not meaningful, so operations such as multiplication and division cannot be carried out directly. Ratios of differences can still be expressed; for example, one difference can be twice another.
Ratio: the highest level of measurement. Ratios between measurements as well as intervals are meaningful because there is a true starting point (zero), e.g., length, time, plane angle, energy.

EX

Collecting Data
“Twenty-five percent of Americans doubt that the Holocaust ever occurred.” --- a news report in 1993
Census
Sample survey

Why Study Samples?
Often it is not practical to study an entire population, because of budget, or it is simply impossible
Instead, researchers attempt to make samples representative of populations
Random selection: Each member of the population has an equal chance of being sampled. Good but difficult.
Haphazard selection: Take steps to ensure samples do not differ from the population in systematic ways. Not as good but much more practical.

Sample vs. Population
Sample: Relatively small number of instances that are studied in order to make inferences about a larger group from which they were drawn
Population: The larger group from which a sample is drawn

Sample vs. Population Examples
Population: a. pot of beans b. larger circle c. histogram
Sample: a. spoonful b. smaller circle c. shaded scores

Sampling Methods
Simple random sampling
Systematic sampling
Stratified sampling
Cluster sampling
Other methods: quota sampling, mechanical sampling, and so on
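
The first two methods in the list can be sketched in a few lines of Python. This is only an illustration: the function names, the `seed` parameter, and the population of 100 numbered members are assumptions made here, not part of the talk.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member has the same chance of being chosen (no repeats)."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Pick a random starting point, then take every k-th member."""
    k = len(population) // n                 # sampling interval
    start = random.Random(seed).randrange(k)
    return [population[start + i * k] for i in range(n)]

# Hypothetical population: members numbered 0..99.
population = list(range(100))
print(simple_random_sample(population, 10, seed=1))
print(systematic_sample(population, 10, seed=1))
```

Both functions expect a list (or other sequence) at least as long as the requested sample.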

After Collecting... Before Analyzing...
Preprocessing or pretreatment.

Frequency Tables
Frequency table: Shows how many times each value was used for a particular variable, along with the percentage of scores at each value
Grouped frequency table: Groups the scores into several equally sized intervals and shows how many scores fall in each interval

Steps for Making a Frequency Table
1. Make a list of each possible value, from highest to lowest
2. Go one by one through the data, making a mark for each score next to its value on the list
3. Make a table showing how many times each value on your list was used
4. Figure the percentage of scores for each value
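
The four steps can be sketched in Python as follows. The ratings are made up for illustration, and the sketch assumes whole-number scores so that "each possible value" can be taken as every integer between the lowest and highest observed score.

```python
from collections import Counter

def frequency_table(scores):
    """Return rows of (value, frequency, percent), highest value first."""
    counts = Counter(scores)          # steps 2-3: tally how often each value occurs
    n = len(scores)
    rows = []
    # Step 1: list every possible value from highest to lowest,
    # including values that nobody used (frequency 0).
    for value in range(max(scores), min(scores) - 1, -1):
        freq = counts.get(value, 0)
        rows.append((value, freq, round(100 * freq / n, 1)))  # step 4: percentage
    return rows

# Hypothetical ratings, made up for illustration only.
ratings = [8, 7, 4, 10, 8, 6, 8, 9, 9, 7]
for value, freq, pct in frequency_table(ratings):
    print(f"{value:>5} {freq:>9} {pct:>7.1f}")
```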

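A Frequency Table

| Stress rating | Frequency | Percent (%) |
|---|---|---|
| 10 | 14 | 9.3 |
| 9 | 15 | 9.9 |
| 8 | 26 | 17.2 |
| 7 | 31 | 20.5 |
| 6 | 13 | 8.6 |
| 5 | 18 | 11.9 |
| 4 | 16 | 10.6 |
| 3 | 12 | 7.9 |
| 2 | 3 | 2.0 |
| 1 | 1 | 0.7 |
| 0 | 2 | 1.3 |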

A Grouped Frequency Table

| Stress rating interval | Frequency | Percent |
|---|---|---|
| 10-11 | 14 | 9 |
| 8-9 | 41 | 27 |
| 6-7 | 44 | 29 |
| 4-5 | 34 | 23 |
| 2-3 | 15 | 10 |
| 0-1 | 3 | 2 |
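
As a quick check, the Percent column of the grouped table can be recomputed from its Frequency column. The frequencies below are copied from the table; the function name is just a label chosen for this sketch.

```python
# Interval frequencies copied from the grouped frequency table above.
grouped = {"10-11": 14, "8-9": 41, "6-7": 44, "4-5": 34, "2-3": 15, "0-1": 3}

def percent_column(freqs):
    """Convert each frequency to a whole-number percentage of the total."""
    total = sum(freqs.values())
    return {interval: round(100 * f / total) for interval, f in freqs.items()}

total_scores = sum(grouped.values())   # number of people rated
percents = percent_column(grouped)
print(total_scores, percents)
```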

Frequency Graphs
Histogram: Depicts information from a frequency table or a grouped frequency table as a bar graph
EX2

Shapes of Frequency Distributions
Unimodal: Having one peak
Bimodal: Having two peaks
Multimodal: Having two or more peaks
Rectangular: Having no peaks

Symmetrical vs. Skewed Frequency Distributions
Symmetrical distribution: Approximately equal numbers of observations above and below the middle
Skewed distribution: One side is more spread out than the other, like a tail
Direction of the skew: Right or left (i.e., positive or negative); it is the side with the fewer scores, the side that looks like a tail

Skewed Frequency Distributions
Skewed right (b): Fewer scores right of the peak; positively skewed; can be caused by a floor effect
Skewed left (c): Fewer scores left of the peak; negatively skewed; can be caused by a ceiling effect

Ceiling and Floor Effects
Ceiling effects: Occur when scores can go no higher than an upper limit and “pile up” at the top, e.g., scores on an easy exam, as shown on the right. Cause negative skew.
Floor effects: Occur when scores can go no lower than a lower limit and pile up at the bottom, e.g., household income. Cause positive skew.

Kurtosis
Degree to which the tails of the distribution are “heavy” or “light”
Heavy tails = higher kurtosis (b)
Light tails = lower kurtosis (c)
Normal distribution = zero kurtosis (a), when kurtosis is measured as excess kurtosis relative to the normal curve

Measures of Central Tendency
Central tendency = representative or typical value in a distribution. The mean, the median, and the mode can each measure central tendency.
Mean: Computed by summing all the scores (sigma, $\Sigma$) and dividing by the number of scores ($N$): $M = \frac{\Sigma X}{N}$

Measures of Central Tendency
Mean: Often the best measure of central tendency. Most frequently reported in research articles. Think of the mean as the “balancing point” of the distribution.

Measures of Central Tendency
Mode: Most common single number in a distribution. If the distribution is symmetrical and unimodal, the mode = the mean. Typical way of describing the central tendency of a nominal variable.

Measures of Central Tendency
Median: Middle value in a group of scores; the point at which half the scores are above and half the scores are below. Unlike the mean, it is unaffected by the extremity of individual scores, so it is preferable as a measure of central tendency when a distribution has some extreme scores.

Measures of Central Tendency
Examples of means as balancing points of various distributions
There does not have to be a score exactly at the mean
Note that a score’s distance from the balancing point matters in addition to the number of scores above or below it

Measures of Central Tendency
Examples of means and modes

Measures of Central Tendency
Steps to computing the median
1. Line up scores from highest to lowest
2. Figure out how many scores there are to the middle: add 1 to the number of scores and divide by 2
3. Count up to the middle score. If there is 1 middle score, that is the median. If there are 2 middle scores, the median is their average.
Ex3
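
The median steps translate almost directly into code. The sketch below also computes the mean so the effect of one extreme score can be seen; the scores are hypothetical, and sorting in ascending order is a choice made here (the middle position is the same in either direction).

```python
def mean(scores):
    """Sum all the scores and divide by the number of scores."""
    return sum(scores) / len(scores)

def median(scores):
    ordered = sorted(scores)               # step 1: line up the scores
    n = len(ordered)
    middle = (n + 1) / 2                   # step 2: 1-based position of the middle
    if middle.is_integer():                # step 3: one middle score
        return ordered[int(middle) - 1]
    lower = ordered[int(middle) - 1]       # two middle scores: average them
    upper = ordered[int(middle)]
    return (lower + upper) / 2

# Hypothetical scores with one extreme value.
scores = [1, 2, 3, 4, 100]
print(mean(scores), median(scores))
```

The single extreme score pulls the mean far above most of the data, while the median stays at the middle score.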

Measures of Variation
Variation = how spread out the data are
Variance: Measure of variation; the average of the scores’ squared deviations (differences) from the mean

Measures of Variation
Steps to computing the variance
1. Subtract the mean from each score
2. Square each deviation
3. Add up the squared deviations
4. Divide the sum by the number of scores
Ex4

Measures of Variation
Standard deviation: Another measure of variation, roughly the average amount that scores differ from the mean. Used more widely than variance. Abbreviated as “SD”.
To compute the standard deviation: compute the variance, then take the square root. SD is the square root of the variance; the variance is SD squared.
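
A minimal sketch of the variance steps and the square-root relationship, using made-up scores. It divides by the number of scores, as the slides describe; many software packages divide by N - 1 by default when estimating from a sample, so their results can differ slightly.

```python
import math

def variance(scores):
    mean = sum(scores) / len(scores)
    deviations = [x - mean for x in scores]      # step 1: subtract the mean
    squared = [d ** 2 for d in deviations]       # step 2: square each deviation
    return sum(squared) / len(scores)            # steps 3-4: sum, divide by N

def standard_deviation(scores):
    return math.sqrt(variance(scores))           # SD is the square root of variance

# Hypothetical scores chosen so the arithmetic is easy to follow (mean = 5).
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(scores), standard_deviation(scores))
```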

Two Branches of Statistical Methods
Descriptive statistics: Summarize and describe a group of numbers, such as the results of a research study
Inferential statistics: Allow researchers to draw conclusions and inferences that are based on the numbers from a research study but go beyond these numbers

The Normal Curve
Often seen in social and behavioral science research and in nature generally
Particular characteristics: bell-shaped, unimodal, symmetrical, average tails
Bean Machine

Z Scores
A Z score indicates how many standard deviations an observation is above or below the mean
If Z > 0, the observation is above the mean
If Z < 0, the observation is below the mean
Z score of 1.0 is one SD above the mean
Z score of -2.5 is two-and-a-half SDs below the mean
Z score of 0 is at the mean

Z Scores
When values in a distribution are converted to Z scores, the distribution will have a mean of 0 and a standard deviation of 1
Useful: allows variables to be compared to one another and provides a generalized standard of comparison

Z Scores
To compute a Z score, subtract the mean from a raw score and divide by the SD: $Z = \frac{X - M}{SD}$
To convert a Z score back to a raw score, multiply the Z score by the SD and then add the mean: $X = Z \cdot SD + M$
Ex5
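
Both conversions are one-liners. The sketch below uses hypothetical scores with mean 5 and SD 2, and also checks that a standardized distribution ends up with mean 0 and SD 1. The function names are chosen here for illustration.

```python
import math

def to_z(raw, mean, sd):
    """Raw score -> Z score: subtract the mean, divide by the SD."""
    return (raw - mean) / sd

def to_raw(z, mean, sd):
    """Z score -> raw score: multiply by the SD, then add the mean."""
    return z * sd + mean

def standardize(scores):
    """Convert every score in a distribution to a Z score."""
    mean = sum(scores) / len(scores)
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / len(scores))
    return [to_z(x, mean, sd) for x in scores]

# Hypothetical scores with mean 5 and SD 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(to_z(9, mean=5, sd=2))       # two SDs above the mean
print(to_raw(-1.5, mean=5, sd=2))  # one and a half SDs below the mean
print(standardize(scores))
```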

Confidence Interval
A confidence interval (CI) is a particular kind of interval estimate of a population parameter. How likely the interval is to contain the parameter is determined by the confidence level, e.g., a “95% confidence interval”.
Ex6
Animation
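
The slide does not say how an interval is computed, so the following is only one common recipe, offered as an assumption: a normal-approximation interval for a population mean, built as the sample mean plus or minus a cutoff times the standard error. The sample values are hypothetical. For small samples a cutoff from the t distribution is usually preferred and gives a somewhat wider interval.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation CI for a population mean (an illustrative sketch)."""
    n = len(sample)
    m = mean(sample)
    # stdev() divides by n - 1, the usual choice when estimating from a sample.
    standard_error = stdev(sample) / math.sqrt(n)
    # Cutoff that leaves (1 - confidence) / 2 in each tail of the normal curve.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error

# Hypothetical sample: hours of TV watched per day by 10 students.
sample = [2.5, 3.0, 1.5, 4.0, 3.5, 2.0, 3.0, 4.5, 2.5, 3.5]
print(mean_confidence_interval(sample, 0.95))
print(mean_confidence_interval(sample, 0.99))   # higher confidence, wider interval
```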

Correlation
A statistic for describing the relationship between two variables
Examples: price of a bottle of wine and its quality; hours of studying and grades on a statistics exam; income and happiness; caffeine intake and alertness

Graphing Correlations on a Scatter Diagram
Graph that shows the degree and pattern of the relationship between two variables
Horizontal axis: Usually the variable that does the predicting, e.g., price, studying, income, caffeine intake
Vertical axis: Usually the variable that is predicted, e.g., quality, grades, happiness, alertness

Graphing Correlations on a Scatter Diagram
Steps for making a scatter diagram
1. Draw axes and assign variables to them
2. Determine the range of values for each variable and mark the axes
3. Mark a dot for each person’s pair of scores

Correlation
Linear correlation: Pattern on a scatter diagram is a straight line (example above)
Curvilinear correlation: More complex relationship between variables; pattern in a scatter diagram is not a straight line (example below)

Correlation
Positive linear correlation: High scores on one variable matched by high scores on another; line slants up to the right
Negative linear correlation: High scores on one variable matched by low scores on another; line slants down to the right

Correlation
Zero correlation: No line, straight or otherwise, can be fit to the relationship between the two variables. The two variables are said to be “uncorrelated”.

Correlation Review a. Negative linear correlation b. Curvilinear correlation c. Positive linear correlation d. No correlation

Correlation Coefficient
Correlation coefficient, r, indicates the precise degree of linear correlation between two variables
Computed by taking “cross-products” of Z scores: multiply each Z score on one variable by the matching Z score on the other variable, then compute the average of the resulting products, $r = \frac{\Sigma Z_X Z_Y}{N}$
Can vary from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation)
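
The cross-product recipe can be sketched as below. The study-hours and grade values are invented for illustration, and the Z scores use the SD computed by dividing by N so that the average of the cross-products equals r.

```python
import math

def z_scores(values):
    """Convert a list of raw scores to Z scores (SD computed with N)."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]

def correlation(xs, ys):
    zx, zy = z_scores(xs), z_scores(ys)
    cross_products = [a * b for a, b in zip(zx, zy)]   # multiply paired Z scores
    return sum(cross_products) / len(cross_products)   # average the products

# Hypothetical data: hours of studying and exam grades for five students.
hours = [1, 2, 3, 4, 5]
grades = [55, 65, 70, 80, 95]
print(round(correlation(hours, grades), 3))
```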

Linear Correlation Examples

Correlation and Causality
When two variables are correlated, there are three possible directions of causality:
X -> Y
X <- Y
X <- Z -> Y
Inherent ambiguity in correlations: knowing that two variables are correlated tells you nothing about their causal relationship

Prediction
Correlations can be used to make predictions about scores
Predictor: The X variable; the variable being predicted from
Criterion: The Y variable; the variable being predicted
Sometimes called “regression”

Multiple Correlation and Multiple Regression
Multiple correlation: Association between a criterion variable and two or more predictor variables
Multiple regression: Making predictions about a criterion variable based on two or more predictor variables
Unlike prediction from one variable, the standardized regression coefficient is not the same as the ordinary correlation coefficient

Proportion of Variance Accounted For
Correlation coefficients indicate the strength of a linear relationship
They cannot be compared directly, e.g., an r of .40 is more than twice as strong as an r of .20
To compare correlation coefficients, square them: an r of .40 yields an r² of .16; an r of .20 yields an r² of .04, so the first relationship is four times as strong
The squared correlation (R-square) indicates the proportion of variance in the criterion variable accounted for by the predictor variable

Most Commonly Used Statistical Techniques
Linear Regression (Predicts the value of one numerical variable given another variable) - How much does the maximum legibility distance of Highway signs decrease when age is increased?

Data on winning bid price for 12 Saturn cars on eBay in July 2002
Simple linear regression is a data analysis technique that tries to find a linear pattern in the data. In linear regression, we use all of the data to calculate a straight line, which may be used to predict Price based on Miles. Since Miles is used to predict Price, Miles is called an “explanatory (independent) variable”, while Price is called a “response (dependent) variable”.

The slope of the line is -
The slope of the line is , which means that predicted Price tends to drop 5 cents for every additional mile driven, or about \$ for every 10,000 miles. The intercept (or Y-intercept) of the line is \$8136; this should not be interpreted as the predicted price of a car with 0 mileage, because the data provide information only for Saturn cars between 9,300 miles and 153,260 miles. We can now use the line to predict the selling price of a car with 60,000 miles. What is the height or Y value of the line at X=60000? The answer is
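
Using the fitted line for prediction amounts to evaluating intercept + slope × miles. In the sketch below the intercept is the \$8136 stated in the text, but the slope is an assumption: it is rounded to -0.05 dollars per mile from the "5 cents per mile" wording, because the exact fitted slope is missing from the transcript. The result is therefore only an approximation of the value the slide arrived at.

```python
INTERCEPT = 8136.0   # dollars, as stated in the text
SLOPE = -0.05        # dollars per mile; ASSUMED, rounded from "5 cents per mile"

# The data only cover cars between these mileages, so the line
# should not be used for prediction outside this range.
MIN_MILES, MAX_MILES = 9_300, 153_260

def predict_price(miles):
    """Height of the regression line at X = miles."""
    if not MIN_MILES <= miles <= MAX_MILES:
        raise ValueError("mileage is outside the range covered by the data")
    return INTERCEPT + SLOPE * miles

print(predict_price(60_000))   # approximate, because the slope is rounded
```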

Most Commonly Used Statistical Techniques
T-test (for the means) - What is the mean time that college students watch TV per day? What is the mean pulse rate of women?

Hypothesis Testing
Procedure for deciding whether the outcome of a study supports a particular theory or practical innovation

Core Logic of Hypothesis Testing
Approach can seem curious or even backwards
Researcher considers the probability that the experimental procedure had no effect and that the observed result could have occurred by chance alone
If that probability is sufficiently low, the researcher will reject the notion that the experimental procedure had no effect and affirm the hypothesis that the procedure did have an effect

The Null Hypothesis and the Research Hypothesis
Null hypothesis (H0): Opposite of the desired result; usually that the manipulation had no effect
Research hypothesis (H1): Also called the “alternative hypothesis”; opposite of the null hypothesis; what the experimenter desired or expected all along, namely that the manipulation did have an effect

One-tailed vs. Two-tailed Hypothesis Tests
Directional prediction: Researcher expects the experimental procedure to have an effect in a particular direction. One-tailed significance tests may be used.
Nondirectional prediction: Researcher expects the experimental procedure to have an effect but does not predict a particular direction. A two-tailed significance test is appropriate; it takes into account that the sample could be extreme at either tail of the comparison distribution.

One-tailed vs. Two-tailed Hypothesis Tests
Two-tailed tests are more conservative than one-tailed tests
Some believe that two-tailed tests should always be used, even when an experimenter makes a directional prediction

Significance Level Cutoffs for One- and Two-Tailed Tests
The .05 significance level The .01 significance level

Decision Errors
When the right procedure leads to the wrong conclusion
Type I error: Reject the null hypothesis when it is true; conclude that a manipulation had an effect when in fact it did not
Type II error: Fail to reject the null hypothesis when it is false; conclude that a manipulation did not have an effect when in fact it did

P-value
The p-value is the probability of obtaining a result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
Frequent misunderstandings: for more details, please refer to Wikipedia.

Decision Errors
Setting a strict significance level (e.g., p < .001) decreases the possibility of committing a Type I error and increases the possibility of committing a Type II error
Setting a lenient significance level (e.g., p < .10) increases the possibility of committing a Type I error and decreases the possibility of committing a Type II error

Test Statistic
Value computed from sample information
Basis for rejecting or not rejecting the null hypothesis
Used to compute the p-value
Example:

T-test
A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic follows a Student's t distribution.

t-test
One-sample t test
Two-sample t test
Independent two-sample t test: equal sample size, equal variance; unequal sample size, equal variance

The Hypothesis Testing Process
1. Restate the research question as a research hypothesis and a null hypothesis about the populations.
2. Set the level of significance, $\alpha$.
3. Collect the sample and compute the test statistic.
4. Assuming H0 is true, compute the p-value.
5. If p-value < $\alpha$, reject H0.
6. State your conclusion.
Summary of Hypothesis Tests
Ex7,8
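
As an illustration of the process, here is a sketch of a one-sample, two-tailed t test on hypothetical data (daily TV hours for 10 students, with a null hypothesis that the population mean is 2 hours). Python's standard library has no t distribution, so instead of computing a p-value the sketch compares the test statistic with the tabled cutoff for 9 degrees of freedom at the .05 level (2.262); exceeding the cutoff is equivalent to p < .05. A statistics package would normally report the p-value directly.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return the t statistic and its degrees of freedom."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)   # SD estimated from the data
    t = (mean(sample) - mu0) / standard_error
    return t, n - 1

# Hypothetical sample; H0: population mean = 2.0 hours, H1: mean != 2.0 hours.
tv_hours = [2.5, 3.0, 1.5, 4.0, 3.5, 2.0, 3.0, 4.5, 2.5, 3.5]
t, df = one_sample_t(tv_hours, mu0=2.0)

T_CUTOFF_DF9_05 = 2.262          # two-tailed .05 cutoff from a t table, df = 9
reject_null = abs(t) > T_CUTOFF_DF9_05
print(round(t, 3), df, reject_null)
```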

Most Commonly Used Statistical Techniques
Analysis of Variance (testing differences of means for 2 or more groups) Is GPA related to where a student likes to sit (front, middle, back)? Which internet search engine is the fastest?

Analysis of Variance
Abbreviated as “ANOVA”
Used to compare the means of more than two groups
Null hypothesis is that all populations being studied have the same mean
Reject the null if at least one population has a mean that differs from the others
Actually works by analyzing variances

Two Different Ways of Estimating Population Variance
Estimate population variance from variation within each group: is not affected by whether or not the null hypothesis is true
Estimate population variance from variation between the groups: is affected by whether or not the null hypothesis is true

Two Important Questions
How do we estimate population variance from variation between groups? How is that estimate affected by whether or not the null is true?

Estimate population variance from variation between means of groups
First, variation among means of samples is related directly to the amount of variation within each population from which samples are taken
The more variation within each population, the more variation in means of samples taken from those populations
Note that populations on the right produce means that are more scattered

Estimate population variance from variation between means of groups
And second, when the null is false there is an additional source of variation
When the null hypothesis is true (left), variation among means of samples is caused by variation within the populations
When the null hypothesis is false (right), variation among means of samples is caused by variation within the populations and also by variation among the population means

Basic Logic of ANOVA
ANOVA entails a comparison between two estimates of population variance
The ratio of the between-groups estimate to the within-groups estimate is called an F ratio
Compare the obtained F value to an F distribution
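
A minimal sketch of the F ratio for a one-way ANOVA, using the standard between-groups and within-groups mean squares. The GPA values for the three seating groups are invented for illustration, and the final comparison against an F distribution is left to a table or a statistics package.

```python
def f_ratio(groups):
    """Between-groups variance estimate divided by within-groups estimate."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-groups estimate: based on how far group means are from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ms_between = ss_between / (k - 1)

    # Within-groups estimate: based on how far scores are from their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_within = ss_within / (n_total - k)

    return ms_between / ms_within

# Hypothetical GPAs for students who sit in the front, middle, and back.
front, middle, back = [3.5, 3.7, 3.9], [3.0, 3.2, 3.4], [2.5, 2.7, 2.9]
print(round(f_ratio([front, middle, back]), 2))
```

When all group means are equal, the between-groups estimate is zero and so is F; the more the group means spread apart relative to the variation inside the groups, the larger F becomes.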

Assumptions of an ANOVA
Populations follow a normal curve
Populations have equal variances
As for t tests, ANOVAs often work fairly well even when those assumptions are violated

Rejecting the Null Hypothesis
A significant F tells you that at least one of the means differs from the others
Does not indicate how many differ
Does not indicate which one(s) differ
For more specific conclusions, a researcher must conduct follow-up t tests
Problem: Running many t tests increases the chance of finding a significant result just by chance (i.e., increases the chance beyond p = .05)
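
The size of this problem can be illustrated with a simplified calculation. The sketch below assumes the follow-up tests are independent, which pairwise t tests on shared groups are not exactly, so the numbers are a rough guide rather than exact error rates.

```python
def pairwise_comparisons(num_groups):
    """Number of distinct pairs of groups that could be compared."""
    return num_groups * (num_groups - 1) // 2

def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

for groups in (3, 4, 5):
    tests = pairwise_comparisons(groups)
    print(groups, tests, round(familywise_error_rate(tests), 3))
```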

ANOVA (continued)
Procedure that allows one to examine two or more variables in the same study
Efficient
Allows for examination of interaction effects
An ANOVA with only one variable is a one-way ANOVA, an ANOVA with two variables is a two-way ANOVA, and so on

Main Effects vs. Interactions
A main effect refers to the effect of one variable, averaging across the other(s)
An interaction effect refers to a case in which the effect of one variable depends on the level of another variable

Main Effects vs. Interactions

Most Commonly Used Statistical Techniques
Chi-square test of independence (Relationship of 2 categorical variables)
 - With whom is it easier to make friends?
 - Does the opinion on legalization of marijuana depend on one’s religion?

Chi-Square Tests
Hypothesis testing procedure for nominal variables
Focus on the number of people or items in each category (e.g., hair color, political party, gender)
Compare how well an observed distribution fits an expected distribution
Expected distribution can be based on a theory, prior results, or an assumption of equal distribution across categories

Chi-Square Test for Goodness of Fit
Single nominal variable
Degrees of freedom = number of categories minus 1

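Chi-Square Statistic
Compares the observed frequency distribution to the expected frequency distribution
Compute the difference between the observed and expected frequency in each category and square each one
Weight each squared difference by dividing it by its expected frequency
Sum them: $\chi^2 = \Sigma \frac{(O - E)^2}{E}$
Ex9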
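
The statistic can be sketched in a few lines. The observed counts below are hypothetical, and the expected counts assume an equal distribution across four categories (one of the options mentioned earlier for building an expected distribution).

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical goodness-of-fit example: 100 people choosing among 4 options.
observed = [18, 22, 30, 30]
expected = [25, 25, 25, 25]          # equal distribution across categories
chi2 = chi_square(observed, expected)
df = len(observed) - 1               # number of categories minus 1
print(round(chi2, 2), df)
```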

Chi-Square Distribution
Compare the obtained chi-square to a chi-square distribution
Does the mismatch between observed and expected frequencies exceed what would be expected by chance alone?

Chi-Square Test for Independence
Two nominal variables
Independence means no relation between the variables
To determine degrees of freedom…
Contingency table: Lists the number of observations for each combination of categories
To determine expected frequencies…
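
The slide leaves the two "To determine…" items open. The sketch below uses the standard textbook formulas, which are supplied here rather than taken from the slide: each expected frequency is (row total × column total) / N, and the degrees of freedom are (rows - 1) × (columns - 1). The contingency table is hypothetical.

```python
def expected_frequencies(table):
    """Expected count for each cell if the two variables were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    expected = expected_frequencies(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical 2 x 2 contingency table (rows: group A/B, columns: yes/no).
table = [[20, 30],
         [30, 20]]
print(expected_frequencies(table))
print(chi_square_independence(table))
```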

Most Commonly Used Statistical Techniques
Correlation (Relationship of 2 numerical variables) Is there a connection between the average verbal SAT and the percent of graduates who took the SAT in a state?

Other Statistical Techniques
Factor analysis (reducing independent variables that are highly correlated)
Cluster analysis (grouping observations with similar characteristics)
Correspondence analysis (grouping the levels of 2 or more categorical variables)
Time series analysis
And so on

Inference with highest confidence level

Presentation of Data
For categorical data: bar chart, pie chart

Presentation of Data
For numerical data: stem-and-leaf plot, histogram, boxplot

Overview of Statistical Techniques

Upcoming Workshops
10/26/2009 Overview of SPSS
12/02/ Overview of SAS

How to lie with statistics
The Sample with Built-in Bias. Well-Chosen Average. The Gee-Whiz Graph. Correlation and Causation.