Review of Basic Statistical Concepts

Slides:



Advertisements
Similar presentations
CHAPTER 1 Exploring Data
Advertisements

Estimation in Sampling
CHAPTER 24: Inference for Regression
Business Statistics for Managerial Decision
Review of Basic Statistical Concepts Farideh Dehkordi-Vakil.
The Simple Regression Model
Business Statistics for Managerial Decision
Topic 2: Statistical Concepts and Market Returns
1.2: Describing Distributions
Sampling Distributions
Discovering and Describing Relationships
Inferences About Process Quality
Inferential Statistics
CHAPTER 2: Describing Distributions with Numbers
Chapter 2 Describing distributions with numbers. Chapter Outline 1. Measuring center: the mean 2. Measuring center: the median 3. Comparing the mean and.
McGraw-Hill/IrwinCopyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Chapter 9 Hypothesis Testing.
Business Statistics for Managerial Decision Making
Chapter 5 Sampling Distributions
Describing distributions with numbers
Copyright © Cengage Learning. All rights reserved. 13 Linear Correlation and Regression Analysis.
+ DO NOW What conditions do you need to check before constructing a confidence interval for the population proportion? (hint: there are three)
Estimating a Population Mean
Chapter 3 – Descriptive Statistics
More About Significance Tests
6.1 What is Statistics? Definition: Statistics – science of collecting, analyzing, and interpreting data in such a way that the conclusions can be objectively.
BPS - 3rd Ed. Chapter 211 Inference for Regression.
F OUNDATIONS OF S TATISTICAL I NFERENCE. D EFINITIONS Statistical inference is the process of reaching conclusions about characteristics of an entire.
STA Lecture 161 STA 291 Lecture 16 Normal distributions: ( mean and SD ) use table or web page. The sampling distribution of and are both (approximately)
© Copyright McGraw-Hill CHAPTER 3 Data Description.
CHAPTER 18: Inference about a Population Mean
Significance Tests: THE BASICS Could it happen by chance alone?
+ Chapter 12: Inference for Regression Inference for Linear Regression.
Business Statistics for Managerial Decision Comparing two Population Means.
Stat 1510 Statistical Inference: Confidence Intervals & Test of Significance.
Essential Statistics Chapter 131 Introduction to Inference.
INTRODUCTION TO INFERENCE BPS - 5th Ed. Chapter 14 1.
CHAPTER 14 Introduction to Inference BPS - 5TH ED.CHAPTER 14 1.
10.2 Tests of Significance Use confidence intervals when the goal is to estimate the population parameter If the goal is to.
Confidence intervals are one of the two most common types of statistical inference. Use a confidence interval when your goal is to estimate a population.
Applied Business Forecasting and Regression Analysis
Business Statistics for Managerial Decision Farideh Dehkordi-Vakil.
Copyright © 2013, 2010 and 2007 Pearson Education, Inc. Section Inference about Two Means: Independent Samples 11.3.
+ The Practice of Statistics, 4 th edition – For AP* STARNES, YATES, MOORE Chapter 8: Estimating with Confidence Section 8.3 Estimating a Population Mean.
+ Chapter 12: More About Regression Section 12.1 Inference for Linear Regression.
1 Chapter 10: Introduction to Inference. 2 Inference Inference is the statistical process by which we use information collected from a sample to infer.
Statistics - methodology for collecting, analyzing, interpreting and drawing conclusions from collected data Anastasia Kadina GM presentation 6/15/2015.
McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 8 Hypothesis Testing.
BPS - 5th Ed. Chapter 221 Two Categorical Variables: The Chi-Square Test.
CHAPTER 15: Tests of Significance The Basics ESSENTIAL STATISTICS Second Edition David S. Moore, William I. Notz, and Michael A. Fligner Lecture Presentation.
Fall 2002Biostat Statistical Inference - Confidence Intervals General (1 -  ) Confidence Intervals: a random interval that will include a fixed.
Lecture PowerPoint Slides Basic Practice of Statistics 7 th Edition.
Agresti/Franklin Statistics, 1 of 88 Chapter 11 Analyzing Association Between Quantitative Variables: Regression Analysis Learn…. To use regression analysis.
Business Statistics for Managerial Decision Farideh Dehkordi-Vakil.
Business Statistics for Managerial Decision Farideh Dehkordi-Vakil.
+ The Practice of Statistics, 4 th edition – For AP* STARNES, YATES, MOORE Unit 5: Estimating with Confidence Section 11.1 Estimating a Population Mean.
The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers CHAPTER 12 More About Regression 12.1 Inference for.
Uncertainty and confidence Although the sample mean,, is a unique number for any particular sample, if you pick a different sample you will probably get.
CHAPTER 15: Tests of Significance The Basics ESSENTIAL STATISTICS Second Edition David S. Moore, William I. Notz, and Michael A. Fligner Lecture Presentation.
+ Chapter 8 Estimating with Confidence 8.1Confidence Intervals: The Basics 8.2Estimating a Population Proportion 8.3Estimating a Population Mean.
BPS - 5th Ed. Chapter 231 Inference for Regression.
The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers CHAPTER 12 More About Regression 12.1 Inference for.
Week 2 Normal Distributions, Scatter Plots, Regression and Random.
CHAPTER 1 Exploring Data
Chapter 9 Hypothesis Testing.
CHAPTER 12 More About Regression
CHAPTER 1 Exploring Data
CHAPTER 1 Exploring Data
CHAPTER 18: Inference about a Population Mean
CHAPTER 1 Exploring Data
CHAPTER 1 Exploring Data
Presentation transcript:

Review of Basic Statistical Concepts Farideh Dehkordi-Vakil

Review of Basic Statistical Concepts Descriptive Statistics Methods that organize and summarize data. Numerical summary Graphical Methods Inferential Statistics Generalizing from a sample to the population from which it was selected. Estimation Hypothesis testing

Review of Basic Statistical Concepts Population The entire collection of individuals or objects about which information is desired. Sample A subset of the population selected in some prescribed manner for study.

Review of Basic Statistical Concepts Numerical summaries Measure of central tendencies Mean Median Measure of variability Variance, Standard deviation Range Quartiles

Review of Basic Statistical Concepts The Mean To find the mean of a set of observations, add their values and divide by the number of observations. If the n observations are , their mean is In a more compact notation,

Example: Books Page Length A sample of n = 8 books is selected from a library’s collection, and page length of each one is determined, resulting in the following data set. X1=247, X2=312, X3=198, X4=780, X5=175, X6=286, X7=293, X8=258

Review of Basic Statistical Concepts Median M The Median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger. To find the median of a distribution: Arrange all observations in order of size, from smallest to largest. If the number of observations n is odd, the median M is the center observation in the ordered list. If the number of observations n is even, the median M is the mean of the two center observations in the ordered list.

Review of Basic Statistical Concepts Quartiles Q1 and Q3 To calculate the quartiles: Arrange the observations in increasing order and locate the median M in the ordered list of observations. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.

Example: Books Page Length Median: Order the list 175, 198, 247, 258, 286, 293, 312, 780 There are two middle numbers 258, and 286. The median is the average of these two numbers

Review of Basic Statistical Concepts The Five Number Summary and Box-Plot The five number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. In symbols, the five number summary is: Minimum Q1 M Q3 Maximum

Review of Basic Statistical Concepts A box-plot is a graph of the five number Summary. A central box spans the quartiles. A line in the box marks the median. Lines extend from the box out to the smallest and largest observations. Box-plots are most useful for side-by-side comparison of several distributions.

Review of Basic Statistical Concepts

Review of Basic Statistical Concepts The Variance s2 The Variance s2 of a set of observations is the average of the squares of the deviations of the observations from their mean. In symbols, the variance of n observations is or, more compactly,

Review of Basic Statistical Concepts The Standard Deviation s The standard deviation s is the square root of the variance s2: Computational formula for variance

Example: Book page length

Review of Basic Statistical Concepts Choosing a Summary The five number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with extreme outliers. Use , and s only for reasonably symmetric distributions that are free of outliers.

Review of Basic Statistical Concepts Introduction to Inference The purpose of inference is to draw conclusions from data. Conclusions take into account the natural variability in the data, therefore formal inference relies on probability to describe chance variation. We will go over the two most prominent types of formal statistical inference Confidence Intervals for estimating the value of a population parameter. Tests of significance which asses the evidence for a claim. Both types of inference are based on the sampling distribution of statistics.

Review of Basic Statistical Concepts Parameters and Statistics A parameter is a number that describes the population. A parameter is a fixed number, but in practice we do not know its value. A statistic is a number that describes a sample. The value of a statistic is known when we have taken a sample, but it can change from sample to sample. We often use statistic to estimate an unknown parameter.

Review of Basic Statistical Concepts Since both methods of formal inference are based on sampling distributions, they require probability model for the data. The model is most secure and inference is most reliable when the data are produced by a properly randomized design. When we use statistical inference we assume that the data come from a randomly selected sample (SRS) or a randomized experiment.

Example:Consumer attitude towards shopping A recent survey asked a nationwide random sample of 2500 adults if they agreed or disagreed with the following statement I like buying new cloths, but shopping is often frustrating and time consuming. Of the respondents, 1650 said they agreed. The proportion of the sample who agreed that cloths shopping is often frustrating is:

Example:Consumer attitude towards shopping The number = .66 is a statistic. The corresponding parameter is the proportion (call it P) of all adult U.S. residents who would have said “Yes” if asked the same question. We don’t know the value of parameter P, so we use as its estimate.

Review of Basic Statistical Concepts If the marketing firm took a second random sample of 2500 adults, the new sample would have different people in it. It is almost certain that there would not be exactly 1650 positive responses. That is, the value of will vary from sample to sample. Random samples eliminate bias from the act of choosing a sample, but they can still be wrong because of the variability that results when we choose at random.

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

Review of Basic Statistical Concepts The first advantage of choosing at random is that it eliminates bias. The second advantage is that if we take lots of random samples of the same size from the same population, the variation from sample to sample will follow a predictable pattern. All statistical inference is based on one idea: to see how trustworthy a procedure is, ask what would happen if we repeated it many times.

Review of Basic Statistical Concepts Sampling Distribution of Statistics Suppose that exactly 60% of adults find shopping for cloths frustrating and time consuming. That is, the truth about the population is that P = 0.6. (parameter) We select a SRS (Simple Random Sample) of size 100 from this population and use the sample proportion( , statistic) to estimate the unknown value of the population proportion P. What is the distribution of ?

Review of Basic Statistical Concepts To answer this question: Take a large number of samples of size 100 from this population. Calculate the sample proportion for each sample. Make a histogram of the values of . Examine the distribution displayed in the histogram for shape, center, and spread, as well as outliers or other deviations.

Review of Basic Statistical Concepts The result of many SRS have a regular pattern. Here we draw 1000 SRS of size 100 from the same population. The histogram shows the distribution of the 1000 sample proportions

Review of Basic Statistical Concepts Sampling Distribution The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.

Normal Distribution These curves, called normal curves, are Symmetric Single peaked Bell shaped Normal curves describe normal distributions.

Normal Density Curve The exact density curve for a particular normal distribution is described by giving its mean  and its standard deviation . The mean is located at the center of the symmetric curve and it is the same as the median. The standard deviation  controls the spread of a normal curve.

Normal Density Curve

Standard Normal Distribution The standard Normal distribution is the Normal distribution N(0, 1) with mean  = 0 and standard deviation  =1.

Standard Normal Distribution If a variable x has any normal distribution N(, ) with mean  and standard deviation , then the standardized variable has the standard Normal distribution.

The Standard Normal Table Table A is a table of the area under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.

The Standard Normal Applet Or you can use this applet http:/www.stat.sc.edu~west/applets/normaldemo.html

The Standard Normal Table What is the area under the standard normal curve to the right of z = - 2.15? Compact notation: P = 1 - .0158 =.9842

The Standard Normal Table What is the area under the standard normal curve between z = 0 and z = 2.3? Compact notation: P = .9893 - .5 =.4893

Example:Annual rate of return on stock indexes The annual rate of return on stock indexes (which combine many individual stocks) is approximately Normal. Since 1954, the Standard & Poor’s 500 stock index has had a mean yearly return of about 12%, with standard deviation of 16.5%. Take this Normal distribution to be the distribution of yearly returns over a long period. The market is down for the year if the return on the index is less than zero. In what proportion of years is the market down?

Example:Annual rate of return on stock indexes State the problem Call the annual rate of return for Standard & Poor’s 500-stocks Index x. The variable x has the N(12, 16.5) distribution. We want the proportion of years with X < 0. Standardize Subtract the mean, then divide by the standard deviation, to turn x into a standard Normal z:

Example:Annual rate of return on stock indexes Draw a picture to show the standard normal curve with the area of interest shaded. Use the table The proportion of observations less than - 0.73 is .2327. The market is down on an annual basis about 23.27% of the time.

Example:Annual rate of return on stock indexes What percent of years have annual return between 12% and 50%? State the problem Standardize

Example:Annual rate of return on stock indexes Draw a picture. Use table. The area between 0 and 2.30 is the area below 2.30 minus the area below 0. 0.9893- .50 = .4893

Estimation So far, we have used our sample estimates as point estimates of parameters, for example: These estimators have properties.

Estimators They are both unbiased estimators The expected value of an unbiased estimator is equal to the parameter that it is trying to estimate.

Estimators For Example: It tends to give an answer that is a little too small.

Estimators is also a minimum variance estimator of . This means that it has the smallest variability among all estimators of . What if we want to do more than just provide a point estimate?

Estimating with Confidence Suppose we are interested in the value of some parameter, and we want to construct a confidence interval for it, with some desired level of confidence

Estimating with Confidence Suppose we can estimate this parameter from sample data, and we know the distribution of this estimator, then we can use this knowledge and construct a probability statement involving both the estimator and the true value of the parameter. This statement is manipulated mathematically to produce confidence intervals.

Confidence intervals The general form of a confidence interval is: sample value of estimator  (Factor)(SE of estimator) The value of the factor will depend on the level of confidence desired, and the distribution of the estimator.

Estimating with Confidence Community banks are banks with less than a billion dollars of assets. There are approximately 7500 such banks in the United States. In many studies of the industry these banks are considered separately from banks that have more than a billion dollars of assets. The latter banks are called “large institutions.” The community bankers Council of the American bankers Association (ABA) conducts an annual survey of community banks. For the 110 banks that make up the sample in a recent survey, the mean assets are = 220 (in millions of dollars). What can we say about , the mean assets of all community banks?

Estimating with Confidence The sample mean is the natural estimator of the unknown population mean . We know that is an unbiased estimator of . The law of large numbers says that the sample mean must approach the population mean as the size of the sample grows. Therefore, the value = 220 appears to be a reasonable estimate of the mean assets  for all community banks. But, how reliable is this estimate?

Standard Error of Estimator An estimate without an indication of its variability is of limited value. Questions about variation of an estimator is answered by looking at the spread of its sampling distribution. According to Central Limit theorem: If the entire population of community bank assets has mean  and standard deviation , then in repeated samples of size 110 the sample mean approximately follows the N(, 110) distribution

Standard Error of Estimator Suppose that the true standard deviation  is equal to the sample standard deviation s = 161. This is not realistic, although it will give reasonably accurate results for samples as large as 100. Later on we will learn how to proceed when  is not known. Therefore, by Central Limit theorem. In repeated sampling the sample mean is approximately normal, centered at the unknown population mean ,with standard deviation

Confidence Interval for the Population Mean We use the sampling distribution of the sample mean to construct a level C confidence interval for the mean  of a population. We assume that data are a SRS of size n. The sampling distribution is exactly N( ) when the population has the N(, ) distribution. The Central Limit Theorem says that this same sampling distribution is approximately correct for large samples whenever the population mean and standard deviation are  and .

Confidence Interval for a Population Mean Choose a SRS of size n from a population having unknown mean  and known standard deviation . A level C confidence interval for  is Here z* is the critical value with area C between –z* and z* under the standard Normal curve. The quantity is the margin of error. The interval is exact when the population distribution is normal and is approximately correct when n is large in other cases.

Confidence Interval for a Population Mean Recall the community bank Example What is a 90% confidence interval for the mean assets of all community banks? Estimator = Standard error of the sample mean = Factor =

Example: Banks’ loan –to-deposit ration The ABA survey of community banks also asked about the loan-to-deposit ratio (LTDR), a bank’s total loans as a percent of its total deposits. The mean LTDR for the 110 banks in the sample is and the standard deviation is s = 12.3. This sample is sufficiently large for us to use s as the population  here. Find a 95% confidence interval for the mean LTDR for community banks.

Confidence Interval for a Population Mean What if the sample size is small and the population standard deviation is not known? Then the sampling distribution of will be student’s t. The Student’s t distribution has a symmetric, bell-shaped density centered at zero, and depends on a parameter called the degrees of freedom. The number of degrees of freedom depends upon the sample size.

Students’ T Distribution The density of the Student’s t distribution differs from that of the standard normal density. The distribution of the density in the tails and flanks is different from the normal distribution: The tails are higher and wider than that of a standard normal density, indicating that the standard deviation is larger than the standard normal, especially for small sample sizes.0

Students’ T Distribution T distribution with 1 degree of freedom http://www.wordiq.com/definition/Image:T_distribution_1df.png

Students’ T Distribution As the sample size becomes larger, the degrees of freedom of the Student's t distribution also become larger. As the degrees of freedom become larger, the Student's t distribution approaches the standard normal distribution.

Students’ T Distribution http://www.etfos.hr/~fridl/primjer6.htm

Students’ T Distribution A level C confidence interval for  when the sample size is small and the population standard deviation is not known is: The t-distribution has n-1 degrees of freedom. Given the confidence level. the value can be determined from published tables for t-distribution.

Students’ T Distribution Example: If the sample size is 15, the critical value for a 95% confidence interval from a t-table is: 2.14. Note the degrees of freedom is 14. If the sample size is 25, what is the critical value for a 90% confidence interval?

Tests of Significance Confidence intervals are appropriate when our goal is to estimate a population parameter. The second type of inference is directed at assessing the evidence provided by the data in favor of some claim about the population. A significance test is a formal procedure for comparing observed data with a hypothesis whose truth we want to assess. The hypothesis is a statement about the parameters in a population or model. The results of a test are expressed in terms of a probability that measures how well the data and the hypothesis agree.

Example: Bank’s net income The community bank survey described in previously also asked about net income and reported the percent change in net income between the first half of last year and the first half of this year. The mean change for the 110 banks in the sample is Because the sample size is large, we are willing to use the sample standard deviation s = 26.4% as if it were the population standard deviation . The large sample size also makes it reasonable to assume that is approximately normal.

Example: Bank’s net income Is the 8.1% mean increase in a sample good evidence that the net income for all banks has changed? The sample result might happen just by chance even if the true mean change for all banks is  = 0%. To answer this question we ask another Suppose that the truth about the population is that = 0% (this is our hypothesis) What is the probability of observing a sample mean at least as far from zero as 8.1%?

Example: Bank’s net income The answer is: Because this probability is so small, we see that the sample mean is incompatible with a population mean of  = 0. We conclude that the income of community banks has changed since last year.

Example: Bank’s net income The fact that the calculated probability is very small leads us to conclude that the average percent change in income is not in fact zero. Here is why. If the true mean is  = 0, we would see a sample mean as far away as 8.1% only six times per 10000 samples. So there are only two possibilities:  = 0 and we have observed something very unusual, or  is not zero but has some other value that makes the observed data more probable

Example: Bank’s net income We calculated a probability taking the first of these choices as true ( = 0 ). That probability guides our final choice. If the probability is very small, the data don’t fit the first possibility and we conclude that the mean is not in fact zero.

Tests of Significance: Formal details The first step in a test of significance is to state a claim that we will try to find evidence against. Null Hypothesis H0 The statement being tested in a test of significance is called the null hypothesis. The test of significance is designed to assess the strength of the evidence against the null hypothesis. Usually the null hypothesis is a statement of “no effect” or “no difference.” We abbreviate “null hypothesis” as H0.

Tests of Significance: Formal details A null hypothesis is a statement about a population, expressed in terms of some parameter or parameters. The null hypothesis in our bank survey example is H0 :  = 0 It is convenient also to give a name to the statement we hope or suspect is true instead of H0. This is called the alternative hypothesis and is abbreviated as Ha. In our bank survey example the alternative hypothesis states that the percent change in net income is not zero. We write this as Ha :   0

Tests of Significance: Formal details Since Ha expresses the effect that we hope to find evidence for we often begin with Ha and then set up H0 as the statement that the Hoped-for effect is not present. Stating Ha is not always straight forward. It is not always clear whether Ha should be one-sided or two-sided. The alternative Ha :   0 in the bank net income example is two-sided. In any give year, income may increase or decrease, so we include both possibilities in the alternative hypothesis.

Tests of Significance: Formal details Test statistics We will learn the form of significance tests in a number of common situations. Here are some principles that apply to most tests and that help in understanding the form of tests: The test is based on a statistic that estimate the parameter appearing in the hypotheses. Values of the estimate far from the parameter value specified by H0 gives evidence against H0.

Example: bank’s income The test statistic In our banking example The null hypothesis is H0:  = 0, and a sample gave the . The test statistic for this problem is the standardized version of : This statistic is the distance between the sample mean and the hypothesized population mean in the standard scale of z-scores.

Example: Bank’s net income p-values P-value is the probability that the test statistic would take a value as large or larger than one observed assuming that H0 is true. The smaller the p-value, the stronger the evidence against H0.

Example: Bank’s net income Conclusion One approach is to state in advance how much evidence against H0 we will require in order to reject it. The level that says “this evidence is strong enough “ is called significance level and is denoted by letter . We compare the p-value with the significance level. We reject H0 if the p-value is smaller than the significance level, and say that the data are statistically significant at level .

One sample t-test Suppose we have a simple random sample of size n from a Normally distributed population with mean  and standard deviation . The standardized sample mean, or one-sample z statistic has the standard Normal distribution N(0, 1). When we substitute the standard deviation of the mean (standard error) s /n for the /n, the statistic does not have a Normal distribution.

The t-distribution t-test Suppose that a SRS of size n is drawn from a N(, ) population. Then the one sample t statistic has the t-distribution with n-1 degrees of freedom. There is a different t distribution for each sample size. A particular t distribution is specified by giving the degrees of freedom.

Exploring Relationships between Two Quantitative Variables Scatter plots Represent the relationship between two different continuous variables measured on the same subjects. Each point in the plot represents the values for one subject for the two variables.

Exploring Relationships between Two Quantitative Variables Example: Data reported by the organization for Economic Development and Cooperation on its 29 member nations in 1998. Per capita gross domestic product is on x-axis Per capita health care expenditures is on y-axis.

Exploring Relationships between Two Quantitative Variables We can describe the overall pattern of scatter plot by Form or shape Direction strength

Exploring Relationships between Two Quantitative Variables Form or shape The form shown by the scatter plot is linear if the points lie in a straight-line pattern. Strength The relation ship is strong if the points lie close to a line, with little scatter.

Exploring Relationships between Two Quantitative Variables Direction Positive and negative association Two variables are positively associated when above-average values of one variable tend to occur in individuals with above average values for the other variable, and below average values of both also tend to occur together. Two variable are negatively associated when above average values for one tend to occur in subjects with below average values of the other, and vice-versa

Exploring Relationships between Two Quantitative Variables Per capita health care example “subjects” studied are countries Form of relationship is roughly linear The direction is positive The relationship is strong.

Correlation It is often useful to have a measure of degree of association between two variables. For example, you may believe that sales may be affected by expenditures on advertising, and want to measure the degree of association between sales and advertising. Correlation coefficient is a numeric measure of the direction and strength of linear relationship between two continuous variables The notation for sample correlation coefficient is r.

Correlation There are several alternative ways to write the algebraic expression for the correlation coefficient. The following is one. X and Y represent the two variables of interest. For example advertising and sales or per capita gross domestic product, and per capita health care expenditure. n is the number of subjects in the sample The notation for population correlation coefficient is .

Correlation Facts about correlation coefficient r has no unit. r > 0 indicates a positive association; r < 0 indicates a negative association r is always between –1 and +1 Values of r near 0 imply a very weak linear relationship Correlation measures only the strength of linear association.

Correlation We could perform a hypothesis test to determine whether the value of a sample correlation coefficient (r) gives us reason to believe that the population correlation () is significantly different from zero The hypothesis test would be H0:  = 0 Ha:   0

Correlation The test statistic would be Reject H0 if The test statistic has a t-distribution with n-2 degrees of freedom. Reject H0 if

Example: Do wages rise with experience? Many factors affect the wages of workers: the industry they work in, their type of job, their education and their experience, and changes in general levels of wages. We will look at a sample of 59 married women who hold customer service jobs in Indiana banks. The following table gives their weekly wages at a specific point in time also their length of service with their employer, in month. The size of the place of work is recorded simply as “large” (100 or more workers) or “small.” Because industry, job type, and the time of measurement are the same for all 59 subjects, we expect to see a clear relationship between wages and length of service.

Example: Do wages rise with experience?

Example: Do wages rise with experience?

Example: Do wages rise with experience? The correlation between wages and length of service for the 59 bank workers is r = 0.3535. We expect a positive correlation between length of service and wages in the population of all married female bank workers. Is the sample result convincing that this is true?

Example: Do wages rise with experience? To compute correlation: we need: Replacing these in the formula We want to test H0:  = 0 Ha:  > 0 The test statistic is

Example: Do wages rise with experience? Comparing t = 2.853 with critical values from the t-table with n - 2 = 57 degrees of freedom help us to make our decision. Conclusion: Since P( t > 2.853) < .005, we reject H0. There is a positive correlation between wages and length of service.

T-distribution applet Tail probability for student’s t-distribution can computed using the applet at the following site. http://www.acs.ucalgary.ca/~nosal/src/Applets/T-TailProb/T-TailProb.html