
1 Stats 845 Applied Statistics

2 This course will cover: 1. Regression (nonlinear regression, multiple regression). 2. Analysis of variance and experimental design.

3 The emphasis will be on: 1. Learning techniques through examples. 2. Use of common statistical packages: SPSS, Minitab, SAS, S-Plus.

4 What is Statistics? It is the major mathematical tool of scientific inference: the art of drawing conclusions from data that is, to some extent, corrupted by a component of random variation (random noise).

5 An analogy can be drawn between data affected by random components of variation and signals corrupted by noise.

6 Quite often sounds that are heard or received by some radio receiver can be thought of as signals with superimposed noise.

7 The objective in signal theory is to extract the signal from the received sound (i.e. remove the noise to the greatest extent possible). The same is true in data analysis.

8 Example A: Suppose we are comparing the effect of three different diets on weight loss.

9 An observation on weight loss can be thought of as being made up of two components:

10 1. A component due to the effect of the diet applied to the subject (the signal). 2. A random component due to other factors affecting weight loss that are not considered, such as the initial weight, sex, and metabolic makeup of the subject (the random noise).

11 Note that random assignment of subjects to diets ensures that this second component is a random effect.
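
The signal-plus-noise view of Example A can be sketched as a small simulation. Everything numeric here is hypothetical (the diet effects, the noise spread, and the group sizes are made up for illustration); the point is only that each observation is a diet effect plus a random component, and that subjects are assigned to diets at random.

```python
import random
import statistics

random.seed(845)  # fixed seed so the sketch is reproducible

# Hypothetical true mean weight loss (kg) for each diet: the "signal".
diet_effect = {"A": 2.0, "B": 4.0, "C": 6.0}
noise_sd = 3.0  # hypothetical spread of the random component (the "noise")

# Random assignment of 30 subjects to the three diets, 10 per diet.
subjects = list(range(30))
random.shuffle(subjects)
groups = {"A": subjects[:10], "B": subjects[10:20], "C": subjects[20:]}

# Each observation = diet effect (signal) + random component (noise).
observed = {
    diet: [diet_effect[diet] + random.gauss(0.0, noise_sd) for _ in members]
    for diet, members in groups.items()
}

for diet, losses in observed.items():
    print(diet, round(statistics.mean(losses), 2))
```

The group means printed at the end are estimates of the signal; they differ from the assumed diet effects only because of the noise.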

12 Example B: In this example we are again comparing the effect of three diets, this time on weight gain. Subjects are randomly divided into three groups, and diets are randomly assigned to the groups. Measurements of weight gain are taken one month, two months, six months, and one year after commencement of the diet.

13 In addition to the factors Time and Diet affecting weight gain, there are two random sources of variation (noise): between-subject variation and within-subject variation.

14 This can be illustrated in a schematic fashion as follows: the deterministic factors (Diet, Time) and the random noise (within subject, between subject) together determine the response (weight gain).

15 The circle of research: 1. Questions arise about a phenomenon. 2. A decision is made to collect data. 3. A decision is made as to how to collect the data. 4. The data is collected. 5. The data is summarized and analyzed. 6. Conclusions are drawn from the analysis. [Diagram label: Statistics]

16 Notice the two points on the circle where statistics plays an important role: 1.The analysis of the collected data. 2.The design of a data collection procedure

17 The analysis of the collected data. This of course is the traditional use of statistics. Note that if the data collection procedure is well thought out and well designed, the analysis step of the research project will be straightforward. Usually experimental designs are chosen with the statistical analysis already in mind. Thus the strategy for the analysis is usually decided upon when any study is designed.

18 It is a dangerous practice to select the form of analysis after the data has been collected (the choice may favour certain predetermined conclusions and therefore result in a considerable loss of objectivity). Sometimes, however, a decision to use a specific type of analysis has to be made after the data has been collected (because it was overlooked at the design stage).

19 The design of a data collection procedure: The importance of statistics is quite often ignored at this stage. It is important that the data collection procedure will eventually result in answers to the research questions,

20 and that it will result in the most accurate answers for the resources available to the research team. Note that the success of a research project should not depend on the answers that it comes up with but on the accuracy of those answers. This is usually an indicator of a valuable research project.

21 Some definitions important to Statistics

22 A population: This is the complete collection of subjects (objects) that are of interest in the study. There may be (and frequently is) more than one population, in which case a major objective is comparison.

23 A case (elementary sampling unit): This is an individual unit (subject) of the population.

24 A variable: a measurement or type of measurement that is made on each individual case in the population.

25 Types of variables: Some variables may be measured on a numerical scale while others are measured on a categorical scale. The nature of the variables has a great influence on which analysis will be used.

26 For variables measured on a numerical scale the measurements will be numbers. Examples: age, weight, systolic blood pressure. For variables measured on a categorical scale the measurements will be categories. Examples: sex, religion, heart disease.

27 Types of variables In addition some variables are labeled as dependent variables and some variables are labeled as independent variables.

28 This usually depends on the objectives of the analysis. Dependent variables are output or response variables while the independent variables are the input variables or factors.

29 Usually one is interested in determining equations that describe how the dependent variables are affected by the independent variables

30 A sample: Is a subset of the population

31 Types of samples: Different types of samples are determined by how the sample is selected.

32 Convenience samples: In a convenience sample the subjects that are most convenient to the researcher are selected for the sample. This is not a very good procedure for inferential statistical analysis but is useful for exploratory preliminary work.

33 Quota samples: In quota samples subjects are chosen conveniently until quotas are met for different subgroups of the population. This is also useful for exploratory preliminary work.

34 Random samples: Random samples of a given size are selected in such a way that all possible samples of that size have the same probability of being selected.
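
As a rough check of this definition, the sketch below draws many samples of size 2 from a tiny hypothetical population of four labelled units and counts how often each possible sample turns up. It assumes Python's `random.sample` as the selection mechanism; under simple random sampling each of the 6 possible samples should appear about one sixth of the time.

```python
import random
from collections import Counter
from itertools import combinations

random.seed(1)  # fixed seed so the sketch is reproducible

population = ["a", "b", "c", "d"]  # hypothetical population of four cases
n = 2                               # sample size
draws = 60000

# Every possible sample of size n (order ignored).
all_samples = list(combinations(population, n))

# Draw many random samples and count how often each one occurs.
counts = Counter()
for _ in range(draws):
    sample = tuple(sorted(random.sample(population, n)))
    counts[sample] += 1

for sample in all_samples:
    print(sample, round(counts[sample] / draws, 3))
```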

35 Convenience samples and quota samples are useful for preliminary studies. It is, however, difficult to assess the accuracy of estimates based on these types of sampling scheme. Sometimes, however, one has to be satisfied with a convenience sample and assume that it is equivalent to a random sample.

36 A population statistic (parameter): Any quantity computed from the values of variables for the entire population.

37 A sample statistic: Any quantity computed from the values of variables for the cases in the sample.

38 Statistical Decision Making

39 Almost all problems in statistics can be formulated as a problem of making a decision. That is, given some data observed from some phenomenon, a decision will have to be made about the phenomenon.

40 Decisions are generally broken into two types: Estimation decisions and Hypothesis Testing decisions.

41 Probability Theory plays a very important role in these decisions and the assessment of error made by these decisions

42 Definition: A random variable X is a numerical quantity that is determined by the outcome of a random experiment

43 Example : An individual is selected at random from a population and X = the weight of the individual

44 The probability distribution of a continuous random variable is described by its probability density curve f(x),

45 i.e. a curve which has the following properties: 1. f(x) is never negative. 2. The total area under the curve f(x) is one. 3. The area under the curve f(x) between a and b is the probability that X lies between a and b.
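
These properties can be checked numerically for a particular density. The sketch below assumes a normal density with mean 50 and standard deviation 15 purely as an illustration, and uses a simple trapezoidal rule (the helper `area` is hypothetical, not part of any package named in these slides) to approximate areas under the curve.

```python
import math
from statistics import NormalDist

MU, SIGMA = 50.0, 15.0  # illustrative parameter values


def normal_pdf(x, mu=MU, sigma=SIGMA):
    """Normal probability density curve f(x)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def area(f, a, b, steps=10000):
    """Approximate the area under f between a and b (trapezoidal rule)."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


# Property 2: total area is one (integrating over mean +/- 10 standard deviations).
total_area = area(normal_pdf, MU - 10 * SIGMA, MU + 10 * SIGMA)

# Property 3: area between a and b is P(a < X < b).
prob_35_65 = area(normal_pdf, 35.0, 65.0)
exact = NormalDist(MU, SIGMA).cdf(65.0) - NormalDist(MU, SIGMA).cdf(35.0)

print(total_area, prob_35_65, exact)
```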


47 Examples of some important Univariate distributions

48 1. The Normal distribution: A common probability density curve is the “Normal” density curve, which is symmetric and bell shaped. Comment: If $\mu = 0$ and $\sigma = 1$ the distribution is called the standard normal distribution. [Figure: Normal distribution with $\mu = 50$ and $\sigma = 15$; Normal distribution with $\mu = 70$ and $\sigma = 20$]


50 2. The Chi-squared distribution with $\nu$ degrees of freedom

52 Comment: If $z_1, z_2, \ldots, z_\nu$ are independent random variables each having a standard normal distribution, then $U = z_1^2 + z_2^2 + \cdots + z_\nu^2$ has a chi-squared distribution with $\nu$ degrees of freedom.
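
The construction in this comment can be checked by simulation, assuming the standard definition of U as the sum of squares of the independent standard normal variables. The sketch below relies on two known facts that are not stated on the slide: a chi-squared variable with $\nu$ degrees of freedom has mean $\nu$ and variance $2\nu$. The choice $\nu = 5$ is arbitrary.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

nu = 5        # degrees of freedom (arbitrary choice for illustration)
reps = 20000  # number of simulated values of U

# U = sum of squares of nu independent standard normal variables.
u_values = [
    sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    for _ in range(reps)
]

mean_u = statistics.mean(u_values)     # should be close to nu
var_u = statistics.variance(u_values)  # should be close to 2 * nu

print(round(mean_u, 2), round(var_u, 2))
```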

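53 3. The F distribution with $\nu_1$ degrees of freedom in the numerator and $\nu_2$ degrees of freedom in the denominator: $f(x) = K\, x^{\nu_1/2 - 1} \left(1 + \frac{\nu_1}{\nu_2} x\right)^{-(\nu_1 + \nu_2)/2}$ if $x \ge 0$, where $K = \frac{\Gamma\left(\frac{\nu_1 + \nu_2}{2}\right)}{\Gamma\left(\frac{\nu_1}{2}\right)\Gamma\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2}$

55 Comment: If $U_1$ and $U_2$ are independent random variables having chi-squared distributions with $\nu_1$ and $\nu_2$ degrees of freedom respectively, then $F = \frac{U_1/\nu_1}{U_2/\nu_2}$ has an F distribution with $\nu_1$ degrees of freedom in the numerator and $\nu_2$ degrees of freedom in the denominator.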

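56 4. The t distribution with $\nu$ degrees of freedom: $f(x) = K \left(1 + \frac{x^2}{\nu}\right)^{-(\nu + 1)/2}$, where $K = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}$

58 Comment: If z and U are independent random variables, and z has a standard normal distribution while U has a chi-squared distribution with $\nu$ degrees of freedom, then $t = \frac{z}{\sqrt{U/\nu}}$ has a t distribution with $\nu$ degrees of freedom.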

59 An applet showing critical values and tail probabilities for various distributions: 1. Standard Normal 2. t distribution 3. Chi-square distribution 4. Gamma distribution 5. F distribution

60 The Sampling distribution of a statistic

61 A random sample from a probability distribution with density function f(x) is a collection of n independent random variables, $x_1, x_2, \ldots, x_n$, each with the probability distribution described by f(x).

62 If, for example, we collect a random sample of n individuals from a population and measure some variable X for each of those individuals, the n measurements $x_1, x_2, \ldots, x_n$ will form a set of n independent random variables with a probability distribution equivalent to the distribution of X across the population.

63 A statistic T is any quantity computed from the random observations $x_1, x_2, \ldots, x_n$.

64 Any statistic will necessarily also be a random variable and will therefore have a probability distribution described by some probability density function $f_T(t)$. This distribution is called the sampling distribution of the statistic T.

65 This distribution is very important if one is using this statistic in a statistical analysis. It is used to assess the accuracy of a statistic if it is used as an estimator. It is used to determine thresholds for acceptance and rejection if it is used for Hypothesis testing.

66 Some examples of Sampling distributions of statistics

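67 Distribution of the sample mean for a sample from a Normal population: Let $x_1, x_2, \ldots, x_n$ be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Let $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

68 Then $\bar{x}$ has a normal sampling distribution with mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.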
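
A simulation can make the sampling distribution of the sample mean concrete. The sketch below assumes hypothetical values $\mu = 50$, $\sigma = 15$ and $n = 25$, draws many samples, and compares the mean and standard deviation of the simulated sample means with the standard results $\mu$ and $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

mu, sigma, n = 50.0, 15.0, 25  # hypothetical population parameters and sample size
reps = 20000                   # number of simulated samples

# Compute the sample mean for each simulated sample of size n.
sample_means = [
    statistics.mean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = sigma / math.sqrt(n)  # sigma / sqrt(n) = 3.0 here

print(round(mean_of_means, 2), round(sd_of_means, 2), theoretical_sd)
```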

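70 Distribution of the z statistic: Let $x_1, x_2, \ldots, x_n$ be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Let $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$, where $\bar{x}$ is the sample mean. Then z has a standard normal distribution.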

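71 Comment: Many statistics T have a normal distribution with mean $\mu_T$ and standard deviation $\sigma_T$. Then $z = \frac{T - \mu_T}{\sigma_T}$ will have a standard normal distribution.

72 Distribution of the $\chi^2$ statistic for sample variance: Let $x_1, x_2, \ldots, x_n$ be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Let $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$ = sample variance and $s = \sqrt{s^2}$ = sample standard deviation, where $\bar{x}$ is the sample mean.

73 Let $\chi^2 = \frac{(n - 1)s^2}{\sigma^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2}$. Then $\chi^2$ has a chi-squared distribution with $\nu = n - 1$ degrees of freedom.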

74 The chi-squared distribution

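75 Distribution of the t statistic: Let $x_1, x_2, \ldots, x_n$ be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Let $t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$, where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation. Then t has Student’s t distribution with $\nu = n - 1$ degrees of freedom.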
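
As a small worked instance, the sketch below computes a t statistic, assuming the usual form: sample mean minus the hypothesised mean, divided by the sample standard deviation over the square root of n. The function name `t_statistic` and the data are hypothetical.

```python
import math
import statistics


def t_statistic(data, mu0):
    """Return (t, degrees of freedom) for a sample and a hypothesised mean mu0."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)
    t = (x_bar - mu0) / (s / math.sqrt(n))
    return t, n - 1


# Hypothetical data: five measurements, hypothesised mean 2.0.
t_value, df = t_statistic([1.0, 2.0, 3.0, 4.0, 5.0], mu0=2.0)
print(t_value, df)
```

With these numbers the sample mean is 3, the sample variance is 2.5, and the standard error is the square root of 0.5, so t equals the square root of 2 on 4 degrees of freedom.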

76 Comment: Suppose an estimator T has a normal distribution with mean $\mu_T$ and standard deviation $\sigma_T$, and $s_T$ is an estimator of $\sigma_T$ based on $\nu$ degrees of freedom. Then $t = \frac{T - \mu_T}{s_T}$ will have Student’s t distribution with $\nu$ degrees of freedom.

77 [Figure: the t distribution compared with the standard normal distribution]

78 Point estimation: A statistic T is called an estimator of the parameter $\theta$ if its value is used as an estimate of the parameter $\theta$. The performance of an estimator T will be determined by how “close” the sampling distribution of T is to the parameter, $\theta$, being estimated.

79 An estimator T is called an unbiased estimator of $\theta$ if $\mu_T$, the mean of the sampling distribution of T, satisfies $\mu_T = \theta$. This implies that in the long run the average value of T is $\theta$.
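
The long-run-average reading of unbiasedness can be illustrated with the sample variance. The sketch below is a simulation under assumed values (normal data with variance 4, samples of size 5): the estimator with divisor n - 1 averages close to the true variance, while the divisor-n version averages close to (n - 1)/n times it and is therefore biased.

```python
import random
import statistics

random.seed(4)  # fixed seed so the sketch is reproducible

true_sd = 2.0              # hypothetical population standard deviation
true_var = true_sd ** 2    # parameter being estimated (4.0)
n = 5                      # sample size
reps = 40000               # number of simulated samples

unbiased_estimates = []  # divisor n - 1
biased_estimates = []    # divisor n

for _ in range(reps):
    sample = [random.gauss(0.0, true_sd) for _ in range(n)]
    unbiased_estimates.append(statistics.variance(sample))
    biased_estimates.append(statistics.pvariance(sample))

avg_unbiased = statistics.mean(unbiased_estimates)  # close to 4.0
avg_biased = statistics.mean(biased_estimates)      # close to 4.0 * (n - 1) / n = 3.2

print(round(avg_unbiased, 2), round(avg_biased, 2))
```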

80 An estimator T is called the minimum variance unbiased estimator of $\theta$ if T is an unbiased estimator and it has the smallest standard error $\sigma_T$ amongst all unbiased estimators of $\theta$. If the sampling distribution of T is normal, the standard error of T is extremely important. It completely describes the variability of the estimator T.

81 Interval Estimation (confidence intervals) Point estimators give only single values as an estimate. There is no indication of the accuracy of the estimate. The accuracy can sometimes be measured and shown by displaying the standard error of the estimate.

82 There is, however, a better way: using the idea of confidence interval estimates. The unknown parameter is estimated with a range of values that has a given probability of capturing the parameter being estimated.

83 Confidence intervals: The interval $T_L$ to $T_U$ is called a $(1 - \alpha) \times 100\%$ confidence interval for the parameter $\theta$ if the probability that $\theta$ lies in the range $T_L$ to $T_U$ is equal to $1 - \alpha$. Here $T_L$ and $T_U$ are statistics, that is, random numerical quantities calculated from the data.

84 Examples: Confidence interval for the mean of a Normal population (based on the z statistic). $\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ is a $(1 - \alpha) \times 100\%$ confidence interval for $\mu$, the mean of a normal population. Here $z_{\alpha/2}$ is the upper $\alpha/2 \times 100\%$ percentage point of the standard normal distribution.
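
A z-based interval of this kind can be computed with the standard library alone. The sketch below assumes the usual form, sample mean plus or minus the critical value times $\sigma/\sqrt{n}$, with $\sigma$ treated as known; the function name `z_confidence_interval`, the data, and the value of $\sigma$ are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def z_confidence_interval(data, sigma, alpha=0.05):
    """(1 - alpha) * 100% confidence interval for the mean, sigma assumed known."""
    n = len(data)
    x_bar = statistics.mean(data)
    # Upper alpha/2 percentage point of the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical weights (kg) of nine individuals; sigma = 3.0 is assumed known.
weights = [68.2, 71.5, 64.9, 70.1, 73.4, 66.8, 69.0, 72.3, 65.7]
lower, upper = z_confidence_interval(weights, sigma=3.0)
print(round(lower, 2), round(upper, 2))
```

With n = 9 and $\sigma$ = 3 the standard error is exactly 1, so the half-width of the 95% interval is the critical value itself, about 1.96.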

85 More generally, if T is an unbiased estimator of the parameter $\theta$ and has a normal sampling distribution with known standard error $\sigma_T$, then $T \pm z_{\alpha/2}\sigma_T$ is a $(1 - \alpha) \times 100\%$ confidence interval for $\theta$.

86 Confidence interval for the mean of a Normal population (based on the t statistic). $\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$ is a $(1 - \alpha) \times 100\%$ confidence interval for $\mu$, the mean of a normal population. Here $t_{\alpha/2}$ is the upper $\alpha/2 \times 100\%$ percentage point of the Student’s t distribution with $\nu = n - 1$ degrees of freedom.

87 More generally, if T is an unbiased estimator of the parameter $\theta$ and has a normal sampling distribution with estimated standard error $s_T$, based on $\nu$ degrees of freedom, then $T \pm t_{\alpha/2} s_T$ is a $(1 - \alpha) \times 100\%$ confidence interval for $\theta$.

88 Common Confidence intervals

89 Multiple confidence intervals: In many situations one is interested in estimating not only a single parameter, $\theta$, but a collection of parameters, $\theta_1, \theta_2, \theta_3, \ldots$. A collection of intervals, $T_{L1}$ to $T_{U1}$, $T_{L2}$ to $T_{U2}$, $T_{L3}$ to $T_{U3}$, ... is called a set of $(1 - \alpha) \times 100\%$ multiple confidence intervals if the probability that all the intervals capture their respective parameters is $1 - \alpha$.

90 Hypothesis Testing Another important area of statistical inference is that of Hypothesis Testing. In this situation one has a statement (Hypothesis) about the parameter(s) of the distributions being sampled and one is interested in deciding whether the statement is true or false.

91 In fact there are two hypotheses: the Null Hypothesis ($H_0$) and the Alternative Hypothesis ($H_A$). A decision will be made either to accept $H_0$ (reject $H_A$) or to reject $H_0$ (accept $H_A$).

92 The following table gives the different possibilities for the decision and the different possibilities for the correctness of the decision:
                   Accept $H_0$        Reject $H_0$
$H_0$ is true      Correct decision    Type I error
$H_0$ is false     Type II error       Correct decision

93 Type I error: The Null Hypothesis $H_0$ is rejected when it is true. The probability that a decision procedure makes a type I error is denoted by $\alpha$, and is sometimes called the significance level of the test. Common significance levels are $\alpha = 0.05$ and $\alpha = 0.01$.

94 Type II error: The Null Hypothesis $H_0$ is accepted when it is false. The probability that a decision procedure makes a type II error is denoted by $\beta$. The probability $1 - \beta$ is called the power of the test and is the probability that the decision procedure correctly rejects a false Null Hypothesis.

95 A statistical test is defined by: 1. Choosing a statistic for making the decision to accept or reject $H_0$. This statistic is called the test statistic. 2. Dividing the set of possible values of the test statistic into two regions: an Acceptance Region and a Critical Region.

96 If, upon collection of the data and evaluation of the test statistic, its value lies in the Acceptance Region, a decision is made to accept the Null Hypothesis $H_0$. If its value lies in the Critical Region, a decision is made to reject the Null Hypothesis $H_0$.

97 The probability of a type I error, $\alpha$, is usually set at a predefined level by choosing the critical thresholds (boundaries between the Acceptance and Critical Regions) appropriately.

98 The probability of a type II error, $\beta$, is decreased (and the power of the test, $1 - \beta$, is increased) by: 1. Choosing the “best” test statistic. 2. Selecting the most efficient experimental design. 3. Increasing the amount of information on which the decision is based (usually by increasing the sample sizes involved).
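
The effect of sample size on power can be sketched for one specific case: a two-tailed z-test for the mean of a normal population with known standard deviation. This is an illustrative assumption, not the only test the slides cover, and the function name `z_test_power` and the numbers used are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Power (1 - beta) of a two-tailed z-test of H0: mu = mu0 when the mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # When the true mean is mu_true, the z statistic is normal with this mean and sd 1.
    shift = (mu_true - mu0) / (sigma / math.sqrt(n))
    upper_tail = 1 - std_normal.cdf(z_crit - shift)  # P[z > z_crit]
    lower_tail = std_normal.cdf(-z_crit - shift)     # P[z < -z_crit]
    return upper_tail + lower_tail


# Hypothetical setting: H0 says mu = 50, the true mean is 55, sigma = 15.
for n in (25, 50, 100):
    print(n, round(z_test_power(50.0, 55.0, 15.0, n), 3))
```

When the true mean equals the hypothesised mean, the function returns the type I error probability $\alpha$, which is a useful sanity check on the calculation.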

99 Some common Tests

100 The p-value approach to Hypothesis Testing

101 In hypothesis testing we need: 1. A test statistic. 2. A Critical and Acceptance region for the test statistic. The Critical Region is set up under the sampling distribution of the test statistic, with area = $\alpha$ (0.05 or 0.01) above the critical region. The critical region may be one tailed or two tailed.

102 [Figure: the Critical region, showing tail area $\alpha/2$ and the regions labelled Accept $H_0$ and Reject $H_0$]

103 The test is carried out by: 1. Computing the value of the test statistic. 2. Making the decision: a. Reject $H_0$ if the value is in the Critical region, and b. Accept $H_0$ if the value is in the Acceptance region.

104 The value of the test statistic may be in the Acceptance region but close to being in the Critical region, or it may be in the Critical region but close to being in the Acceptance region. To measure this we compute the p-value.

105 Definition: Once the test statistic has been computed from the data, the p-value is defined to be: p-value = P[the test statistic is as or more extreme than the observed value of the test statistic], where “more extreme” means giving stronger evidence for rejecting $H_0$.

106 Example: Suppose we are using the z-test for the mean $\mu$ of a normal population and $\alpha = 0.05$. Then $z_{0.025} = 1.960$, so the critical region is to reject $H_0$ if $z < -1.960$ or $z > 1.960$. Suppose z = 2.3; then we reject $H_0$. p-value = P[the test statistic is as or more extreme than the observed value of the test statistic] = P[z > 2.3] + P[z < -2.3] = 0.0107 + 0.0107 = 0.0214.

107 [Graph: the p-value as the area in the two tails beyond -2.3 and 2.3]

108 If the value of z = 1.2, then we accept $H_0$. p-value = P[the test statistic is as or more extreme than the observed value of the test statistic] = P[z > 1.2] + P[z < -1.2] = 0.1151 + 0.1151 = 0.2302. There is a 23.02% chance that the test statistic is as or more extreme than 1.2. This is fairly high, hence 1.2 is not very extreme.

109 [Graph: the p-value as the area in the two tails beyond -1.2 and 1.2]
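
The two p-values in this example can be reproduced with the standard normal distribution from Python's standard library. The function name `two_sided_p_value` is hypothetical; the calculation is the two-tailed area beyond the observed z in either direction.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Two-tailed p-value for an observed z statistic: P[Z > |z|] + P[Z < -|z|]."""
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * upper_tail


alpha = 0.05
for z in (2.3, 1.2):
    p = two_sided_p_value(z)
    decision = "reject H0" if p < alpha else "accept H0"
    print(z, round(p, 4), decision)
```

The unrounded results agree with the slide values 0.0214 and 0.2302 to about four decimal places; the slides add tail areas that were each rounded first.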

110 Properties of the p-value: 1. If the p-value is small (< 0.05 or 0.01), $H_0$ should be rejected. 2. The p-value measures the plausibility of $H_0$. 3. If the test is two tailed, the p-value should be two tailed. 4. If the test is one tailed, the p-value should be one tailed. 5. It is customary to report p-values when reporting the results. This gives the reader some idea of the strength of the evidence for rejecting $H_0$.

111 Multiple testing: Quite often one is interested in performing a collection (family) of tests of hypotheses: 1. $H_{0,1}$ versus $H_{A,1}$. 2. $H_{0,2}$ versus $H_{A,2}$. 3. $H_{0,3}$ versus $H_{A,3}$. etc.

112 Let $\alpha^*$ denote the probability that at least one type I error is made in the collection of tests that are performed. The value of $\alpha^*$, the family type I error rate, can be considerably larger than $\alpha$, the type I error rate of each individual test. The value of the family error rate, $\alpha^*$, can be controlled by altering the thresholds of each individual test appropriately. A testing procedure of this nature is called a multiple testing procedure.
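
How quickly the family error rate grows can be sketched under one simplifying assumption that the slides do not make: that the k tests are independent and all null hypotheses are true. In that case the family rate is one minus the probability that no test makes a type I error. The Bonferroni adjustment shown (testing each hypothesis at the family level divided by k) is one common way of altering the individual thresholds, named here as an example rather than as the procedure the course uses.

```python
def family_error_rate(alpha, k):
    """P[at least one type I error] across k independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha_star, k):
    """Per-test level that keeps the family error rate at or below alpha_star."""
    return alpha_star / k


for k in (1, 5, 10, 20):
    unadjusted = family_error_rate(0.05, k)
    adjusted = family_error_rate(bonferroni_alpha(0.05, k), k)
    print(k, round(unadjusted, 4), round(adjusted, 4))
```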

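113 A chart illustrating statistical procedures, by type of dependent and independent variables:
Dependent variables categorical: with categorical independent variables, multiway frequency analysis (log-linear model); with continuous independent variables, discriminant analysis.
Dependent variables continuous: with categorical independent variables, ANOVA (single dependent variable) or MANOVA (multiple dependent variables); with continuous independent variables, multiple regression (single dependent variable) or multivariate multiple regression (multiple dependent variables); with continuous and categorical independent variables, ANACOVA (single dependent variable) or MANACOVA (multiple dependent variables).
Dependent variables continuous and categorical: ??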

114 Next topic: Fitting equations to data

