Multivariate Data.


Presentation on theme: "Multivariate Data."— Presentation transcript:

1 Multivariate Data

2 Descriptive techniques for Multivariate data
In most research situations data is collected on more than one variable (usually many variables)

3 Graphical Techniques
The scatter plot. The two-dimensional histogram.

4 The Scatter Plot
For two variables X and Y we will have a measurement for each variable on each case: (xi, yi), where xi = the value of X for case i and yi = the value of Y for case i.

5 To construct a scatter plot we plot the points (xi, yi)
for each case on the X-Y plane.

6 Data Set #3
The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program. Table columns: Student, Verbal IQ, Math IQ, Initial Reading Achievement, Final Reading Achievement.

7

8 (84,80)

9

10 Some Scatter Patterns

11

12

13 Circular
No relationship between X and Y. Unable to predict Y from X.

14

15

16 Ellipsoidal
Positive relationship between X and Y. Increases in X correspond to increases in Y (but not always). The major axis of the ellipse has positive slope.

17

18 Example: Verbal IQ, Math IQ

19

20 Some More Patterns

21

22

23 Ellipsoidal (thinner ellipse)
Stronger positive relationship between X and Y. Increases in X correspond to increases in Y (more frequently). The major axis of the ellipse has positive slope. The minor axis of the ellipse is much smaller.

24

25 Increased strength in the positive relationship between X and Y
Increases in X correspond to increases in Y (almost always). The minor axis of the ellipse is extremely small in relation to the major axis of the ellipse.

26

27

28 Perfect positive relationship between X and Y
Y is perfectly predictable from X. The data fall exactly along a straight line with positive slope.

29

30

31 Ellipsoidal
Negative relationship between X and Y. Increases in X correspond to decreases in Y (but not always). The major axis of the ellipse has negative slope.

32

33 The strength of the relationship can increase until changes in Y can be perfectly predicted from X

34

35

36

37

38

39 Some Non-Linear Patterns

40

41

42 In a linear pattern Y increases with respect to X at a constant rate.
In a non-linear pattern the rate at which Y increases with respect to X is variable.

43 Growth Patterns

44

45

46 Growth patterns frequently follow a sigmoid curve
Growth at the start is slow. It then speeds up. It slows down again as it reaches its limiting size.
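
The slow-fast-slow shape described on this slide can be illustrated with a logistic curve, which is one common sigmoid. The slides do not say which sigmoid is meant, so the function `logistic` and its parameters `limit`, `rate` and `midpoint` are assumptions made only for this sketch.

```python
import math

def logistic(t, limit=100.0, rate=1.0, midpoint=0.0):
    """Logistic curve: one common sigmoid (an illustrative choice, not from the slides)."""
    return limit / (1.0 + math.exp(-rate * (t - midpoint)))

# Growth over one unit of time, measured early, in the middle, and late.
early = logistic(-4) - logistic(-5)
middle = logistic(0.5) - logistic(-0.5)
late = logistic(5) - logistic(4)

print(f"early growth:  {early:.2f}")
print(f"middle growth: {middle:.2f}")
print(f"late growth:   {late:.2f}")
```

The middle increment is much larger than the early and late ones, and the curve never exceeds the limiting size `limit`.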

47 Review the scatter plot

48 Some Scatter Patterns

49

50 Non-Linear Patterns

51 Measures of strength of a relationship (Correlation)
Pearson’s correlation coefficient (r). Spearman’s rank correlation coefficient (rho, ρ).

52 Assume that we have collected data on two variables X and Y. Let
(x1, y1), (x2, y2), (x3, y3), …, (xn, yn) denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

53 From this data we can compute summary statistics for each variable.
The means $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

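54 The standard deviations $s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}$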

55 These statistics give information for each variable separately but give no information about the relationship between the two variables.

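56 Consider the statistics: $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$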

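57 The first two statistics, $S_{xx} = \sum (x_i - \bar{x})^2$ and $S_{yy} = \sum (y_i - \bar{y})^2$,
are used to measure variability in each variable. They are used to compute the sample standard deviations $s_x = \sqrt{S_{xx}/(n-1)}$ and $s_y = \sqrt{S_{yy}/(n-1)}$.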

58 The third statistic, $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$, is used to measure correlation. If two variables are positively related, the sign of $x_i - \bar{x}$ will agree with the sign of $y_i - \bar{y}$.

59 When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will be positive.
When xi is above its mean, yi will be above its mean. When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will be negative. When xi is below its mean, yi will be below its mean. The product $(x_i - \bar{x})(y_i - \bar{y})$ will be positive for most cases.

60

61 This implies that the statistic $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$
will be positive. Most of the terms in this sum will be positive.

62 On the other hand, if two variables are negatively related, the sign of $x_i - \bar{x}$ will be opposite to the sign of $y_i - \bar{y}$.

63 When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will be negative.
When xi is above its mean, yi will be below its mean. When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will be positive. When xi is below its mean, yi will be above its mean. The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative for most cases.

64 Again this implies that the statistic $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$
will be negative. Most of the terms in this sum will be negative.

65 Pearson’s correlation coefficient r
A statistic measuring the strength of the relationship between two variables - X and Y

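66 Pearson’s correlation coefficient is defined as: $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$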

67 The denominator, $\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}$, is always positive.

68 The numerator, $\sum (x_i - \bar{x})(y_i - \bar{y})$, is positive if there is a positive relationship between X and Y and negative if there is a negative relationship between X and Y. This property carries over to Pearson’s correlation coefficient r.

69 Properties of Pearson’s correlation coefficient r
The value of r is always between –1 and +1. If the relationship between X and Y is positive, then r will be positive. If the relationship between X and Y is negative, then r will be negative. If there is no relationship between X and Y, then r will be zero. The value of r will be +1 if the points, (xi, yi) lie on a straight line with positive slope. The value of r will be -1 if the points, (xi, yi) lie on a straight line with negative slope.
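
As a minimal sketch of the definition and of the ±1 properties listed above, the function below computes r from the deviations about the means. The name `pearson_r` and the three small data sets are made up for illustration and do not come from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy), computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
r_line_up = pearson_r(xs, [3, 5, 7, 9, 11])    # points on a line with positive slope
r_line_down = pearson_r(xs, [10, 8, 6, 4, 2])  # points on a line with negative slope
r_scatter = pearson_r(xs, [2, 1, 4, 3, 5])     # positive but imperfect relationship

print(r_line_up, r_line_down, r_scatter)
```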

70 r =1

71 r = 0.95

72 r = 0.7

73 r = 0.4

74 r = 0

75 r = -0.4

76 r = -0.7

77 r = -0.8

78 r = -0.95

79 r = -1

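80 Computing formulae for the statistics: $S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$, $S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$ and $S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$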

81

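82 To compute $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$, first compute $S_{xx}$, $S_{yy}$ and $S_{xy}$ from the sums $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$ and $\sum x_i y_i$. Then substitute them into the formula for r.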

83 Example: Verbal IQ, Math IQ

84 Data Set #3
The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program. Table columns: Student, Verbal IQ, Math IQ, Initial Reading Achievement, Final Reading Achievement.

85

86 Now Hence

87 Thus Pearson’s correlation coefficient is:

88 Thus r = 0.769
Verbal IQ and Math IQ are positively correlated. If Verbal IQ is above (below) the mean, then for most cases Math IQ will also be above (below) the mean.

89 Is the improvement in reading achievement (RA) related to either Verbal IQ or Math IQ?
improvement in RA = Final RA – Initial RA

90 The Data Correlation between Math IQ and RA Improvement Correlation between Verbal IQ and RA Improvement

91 Scatterplot: Math IQ vs RA Improvement

92 Scatterplot: Verbal IQ vs RA Improvement

93 Spearman’s rank correlation coefficient
ρ (rho)

94 Spearman’s rank correlation coefficient
ρ (rho). Spearman’s rank correlation coefficient is computed as follows: Arrange the observations on X in increasing order and assign them the ranks 1, 2, 3, …, n. Arrange the observations on Y in increasing order and assign them the ranks 1, 2, 3, …, n. For any case i, let (xi, yi) denote the observations on X and Y and let (ri, si) denote the ranks on X and Y.

95 If the variables X and Y are strongly positively correlated, the ranks on X should generally agree with the ranks on Y. (The largest X should go with the largest Y, the smallest X with the smallest Y.) If the variables X and Y are strongly negatively correlated, the ranks on X should be in the reverse order to the ranks on Y. (The largest X should go with the smallest Y, the smallest X with the largest Y.) If the variables X and Y are uncorrelated, the ranks on X should be randomly distributed with respect to the ranks on Y.

96 Spearman’s rank correlation coefficient
is defined as follows: For each case let di = ri – si = difference in the two ranks. Then Spearman’s rank correlation coefficient (ρ) is defined as: $\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

97 Properties of Spearman’s rank correlation coefficient ρ
The value of ρ is always between –1 and +1. If the relationship between X and Y is positive, then ρ will be positive. If the relationship between X and Y is negative, then ρ will be negative. If there is no relationship between X and Y, then ρ will be zero. The value of ρ will be +1 if the ranks of X completely agree with the ranks of Y. The value of ρ will be -1 if the ranks of X are in reverse order to the ranks of Y.
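
The ranking steps and the rank-difference formula can be sketched as follows. This is a minimal illustration that assumes there are no tied observations; the helper names `ranks` and `spearman_rho` and the small data sets are hypothetical, not taken from the slides.

```python
def ranks(values):
    """Assign ranks 1..n in increasing order of value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's coefficient from rank differences: 1 - 6*sum(d_i^2) / (n*(n^2 - 1))."""
    n = len(xs)
    r, s = ranks(xs), ranks(ys)
    d_squared = sum((ri - si) ** 2 for ri, si in zip(r, s))
    return 1 - 6 * d_squared / (n * (n * n - 1))

xs = [10, 20, 30, 40, 50]
rho_agree = spearman_rho(xs, [1, 4, 9, 16, 25])  # ranks agree completely
rho_reverse = spearman_rho(xs, [9, 7, 5, 3, 1])  # ranks in reverse order
rho_mixed = spearman_rho(xs, [2, 1, 4, 3, 5])    # ranks mostly agree

print(rho_agree, rho_reverse, rho_mixed)
```

The first data set is non-linear (Y is the square of X/10) yet still gives +1, because only the ordering of the values matters.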

98 Example
The data: xi, yi. Ranking the X’s and the Y’s we get: ri, si. Computing the differences in ranks gives us: di.

99

100 Computing Pearson’s correlation coefficient, r, for the same problem:

101

102 To compute first compute

103 Then

104 and Compare with

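105 Comments: Spearman’s rank correlation coefficient ρ and Pearson’s correlation coefficient r
The value of ρ can also be computed from: $\rho = \frac{\sum (r_i - \bar{r})(s_i - \bar{s})}{\sqrt{\sum (r_i - \bar{r})^2 \sum (s_i - \bar{s})^2}}$. Spearman’s ρ is Pearson’s r computed from the ranks.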

106 Spearman’s ρ is less sensitive to extreme observations (outliers).
The value of Pearson’s r is much more sensitive to extreme outliers. This is similar to the comparison between the median and the mean, and between the pseudo-standard deviation and the standard deviation. The mean and standard deviation are more sensitive to outliers than the median and pseudo-standard deviation.
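
A small made-up data set shows how differently the two coefficients react to one extreme case. Five points lie on a line with negative slope, and a sixth point at (100, 100) is added as an outlier. The data and helper names are hypothetical; the Spearman value is computed here as Pearson’s coefficient applied to the ranks, assuming no ties.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

def ranks(values):
    """Assign ranks 1..n in increasing order of value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's coefficient as Pearson's r computed from the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5, 100]   # last case is the outlier
ys = [5, 4, 3, 2, 1, 100]

r_without = pearson_r(xs[:5], ys[:5])   # the five ordinary cases only
r_with = pearson_r(xs, ys)              # outlier included
rho_with = spearman_rho(xs, ys)         # outlier included, ranks only

print(f"Pearson without outlier: {r_without:.3f}")
print(f"Pearson with outlier:    {r_with:.3f}")
print(f"Spearman with outlier:   {rho_with:.3f}")
```

With the outlier included, Pearson’s coefficient swings from -1 to nearly +1, while the rank-based coefficient stays slightly negative because the outlier only contributes its rank.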

107 Scatter plots

108 Some Scatter Patterns

109

110 Non-Linear Patterns

111 Measuring correlation
Pearson’s correlation coefficient r. Spearman’s rank correlation coefficient ρ.

112 Simple Linear Regression
Fitting straight lines to data

113 The Least Squares Line The Regression Line
When data is correlated it falls roughly about a straight line.

114 In this situation one wants to:
Find the equation of the straight line through the data that yields the best fit. The equation of any straight line is of the form Y = a + bX, where b = the slope of the line and a = the intercept of the line.

115 Rise = y2 - y1 and Run = x2 - x1, so the slope is $b = \frac{\text{Rise}}{\text{Run}} = \frac{y_2 - y_1}{x_2 - x_1}$; a is the intercept.

116 a is the value of Y when X is zero.
b is the rate at which Y increases per unit increase in X. For a straight line this rate is constant. For non-linear curves the rate at which Y increases per unit increase in X varies with X.

117 Linear

118 Non-linear

119 Example: In the following example both blood pressure and age were measured for each female subject. Subjects were grouped into age classes and the median blood pressure measurement was computed for each age class. The data are summarized below:
Age Class          30-40  40-50  50-60  60-70  70-80
Midpoint Age (X)   35     45     55     65     75
Median BP (Y)      114    124    143    158    166

120 Graph:

121 Interpretation of the slope and intercept
Intercept – value of Y at X = 0. Predicted Blood pressure of a newborn (65.1). This interpretation remains valid only if linearity is true down to X = 0. Slope – rate of increase in Y per unit increase in X. Blood Pressure increases 1.38 units each year.
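
The intercept 65.1 and slope 1.38 quoted on this slide can be reproduced from the five (midpoint age, median BP) pairs of the example by a least squares fit. The sketch below assumes that is how the values were obtained; the function name `least_squares` is introduced only for illustration.

```python
ages = [35, 45, 55, 65, 75]        # midpoint age (X) from the example table
bps = [114, 124, 143, 158, 166]    # median blood pressure (Y) from the example table

def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

a, b = least_squares(ages, bps)
print(f"intercept a = {a:.1f}, slope b = {b:.2f}")
```

Note that X = 0 lies far outside the observed age range of 35 to 75, which is why the slide warns that the intercept is meaningful only if linearity holds down to X = 0.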

122 The Least Squares Line
Fitting the best straight line to “linear” data

123 Reasons for fitting a straight line to data
It provides a precise description of the relationship between Y and X. The interpretation of the parameters of the line (slope and intercept) leads to an improved understanding of the phenomenon under study. The equation of the line is useful for prediction of the dependent variable (Y) from the independent variable (X).

124 Assume that we have collected data on two variables X and Y. Let
(x1, y1), (x2, y2), (x3, y3), …, (xn, yn) denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

125 Let Y = a + bX denote an arbitrary equation of a straight line. a and b are known values. This equation can be used to predict for each value of X, the value of Y. For example, if X = xi (as for the ith case) then the predicted value of Y is: $\hat{y}_i = a + b x_i$

126 For example, if Y = a + bX is the equation of the straight line and X = xi = 20 (for the ith case), then the predicted value of Y is: $\hat{y}_i = a + 20b$

127 If the actual value of Y is yi = 70.0 for case i, then the difference $y_i - \hat{y}_i$, where $\hat{y}_i = a + b x_i$ is the predicted value,
is the error in the prediction for case i. $r_i = y_i - \hat{y}_i$ is also called the residual for case i.

128 If the residual $r_i = y_i - (a + b x_i)$ is computed for each case in the sample, the residual sum of squares (RSS), $RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$, is a measure of the “goodness of fit” of the line Y = a + bX to the data.

129 [Figure: the points (x1,y1), (x2,y2), (x3,y3), (x4,y4) plotted with the line Y = a + bX; r1, r2, r3, r4 are the residuals of the four points.]

130 The optimal choice of a and b will result in the residual sum of squares $RSS = \sum_{i=1}^{n} (y_i - a - b x_i)^2$
attaining a minimum. If this is the case, then the line Y = a + bX is called the Least Squares Line.
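
To make the minimisation concrete, the sketch below evaluates the residual sum of squares for three candidate lines on the blood pressure data of the earlier example. The line with a = 65.1 and b = 1.38 is the one quoted for that example; the other two candidates are arbitrary choices made up for comparison, and the function name `rss` is hypothetical.

```python
ages = [35, 45, 55, 65, 75]        # midpoint age (X) from the blood pressure example
bps = [114, 124, 143, 158, 166]    # median blood pressure (Y) from the same example

def rss(a, b, xs, ys):
    """Residual sum of squares of the line Y = a + bX for the given data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

candidates = {
    "a=65.1, b=1.38": (65.1, 1.38),   # line quoted for the example
    "a=60.0, b=1.50": (60.0, 1.50),   # arbitrary alternative
    "a=70.0, b=1.30": (70.0, 1.30),   # arbitrary alternative
}
for label, (a, b) in candidates.items():
    print(f"{label}: RSS = {rss(a, b, ages, bps):.2f}")
```

Of the three candidates, the quoted line gives the smallest residual sum of squares, as expected if it is the least squares line for these data.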

131

132

133

134

135

136

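137 The equation for the least squares line
Let $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$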

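138 Computing Formulae: $S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$, $S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$ and $S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$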

139 Then the slope of the least squares line can be shown to be: $b = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

140 and the intercept of the least squares line can be shown to be: $a = \bar{y} - b\bar{x}$
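
The slides state the slope and intercept without proof. One standard way to obtain them, sketched here as an outline rather than as the derivation used in the presentation, is to set the partial derivatives of the residual sum of squares with respect to a and b equal to zero.

```latex
\begin{aligned}
RSS(a, b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial RSS}{\partial a} &= -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \quad\Longrightarrow\quad a = \bar{y} - b\bar{x} \\
\frac{\partial RSS}{\partial b} &= -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0 \\
\text{substituting } a = \bar{y} - b\bar{x}: \quad
  & \sum_{i=1}^{n} x_i (y_i - \bar{y}) = b \sum_{i=1}^{n} x_i (x_i - \bar{x}) \\
\text{and since } \sum (y_i - \bar{y}) = \sum (x_i - \bar{x}) = 0: \quad
  & b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{aligned}
```

The last step uses the fact that subtracting the mean of X inside either sum changes nothing, because the deviations from a mean sum to zero.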

141 The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.
TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950. Columns: Country (i), Xi, Yi. Countries: Australia, Canada, Denmark, Finland, Great Britain, Holland, Iceland, Norway, Sweden, Switzerland, USA.

142

143

144 Fitting the Least Squares Line

145 Fitting the Least Squares Line
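First compute the following three quantities: $S_{xx} = \sum (x_i - \bar{x})^2$, $S_{yy} = \sum (y_i - \bar{y})^2$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$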

146 Computing Estimate of Slope and Intercept: slope b = 0.228 and intercept a = 6.756

147 Y = 6.756 + (0.228)X

148 Interpretation of the slope and intercept
Intercept – value of Y at X = 0. Predicted death rate from lung cancer (6.756) for men in 1950 in countries with no smoking in 1930 (X = 0). Slope – rate of increase in Y per unit increase in X. Death rate from lung cancer for men in 1950 increases 0.228 units for each increase of 1 cigarette per capita consumption in 1930.

149 Correlation & Linear Regression
A review

150 The Scattergram Pearson’s Correlation Coefficient Spearman’s Rank Correlation Coefficient

151 The Least squares Line Slope and intercept

152 Comment: Regression to the mean
The Least Squares line is not the major axis of the ellipse that covers the data. [Figure labels: the major axis of the ellipse; the Least Squares line.]

153 If the slope of the major axis is positive, the slope of the Least Squares line will also be positive but not as large. This fact is sometimes referred to as regression towards the mean. Suppose X is the father’s height and Y is the son’s height. If the father is tall within his population, then the son will also likely be tall, but not as tall as the father, within his population.
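
One way to see why the least squares slope is pulled towards zero is to rewrite it in terms of Pearson’s r and the two standard deviations. The identity below follows from the standard definitions of r, the sample standard deviations and the least squares slope; it is offered as a supporting sketch, not as a step taken from the slides.

```latex
\begin{aligned}
b &= \frac{S_{xy}}{S_{xx}}
   = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \cdot \sqrt{\frac{S_{yy}}{S_{xx}}}
   = r \, \frac{s_y}{s_x} \\
\hat{y} - \bar{y} &= b \, (x - \bar{x})
  \quad\Longrightarrow\quad
  \frac{\hat{y} - \bar{y}}{s_y} = r \, \frac{x - \bar{x}}{s_x}
\end{aligned}
```

Because r lies between –1 and +1, a father who is k standard deviations above the mean is predicted to have a son who is only r·k standard deviations above the mean.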

154 Relationship between correlation and Linear Regression
Pearson’s correlation $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$ takes values between –1 and +1.

155 Least Squares Line Y = a + bX
Minimises the Residual Sum of Squares: $RSS = \sum_{i=1}^{n} (y_i - a - b x_i)^2$. This is the Sum of Squares that measures the variability in Y that is unexplained by X. It can also be denoted by $SS_{unexplained}$.

156 Some other Sums of Squares: $SS_{total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
The Sum of Squares that measures the total variability in Y (ignoring X).

157 $SS_{explained} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$, where $\hat{y}_i = a + b x_i$: the Sum of Squares that measures the variability in Y that is explained by X.

158 It can be shown: $SS_{total} = SS_{explained} + SS_{unexplained}$, that is, (Total variability in Y) = (variability in Y explained by X) + (variability in Y unexplained by X)

159 It can also be shown: $r^2 = \frac{SS_{explained}}{SS_{total}}$ = proportion of the variability in Y explained by X = the coefficient of determination

160 Further: $1 - r^2 = \frac{SS_{unexplained}}{SS_{total}}$ = proportion of the variability in Y that is unexplained by X.
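
The decomposition and the interpretation of r² can be checked numerically. The sketch below reuses the blood pressure data from the earlier example (midpoint age as X, median BP as Y); the variable names are introduced only for illustration.

```python
import math

ages = [35, 45, 55, 65, 75]        # X from the blood pressure example
bps = [114, 124, 143, 158, 166]    # Y from the blood pressure example

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(bps) / n
s_xx = sum((x - mean_x) ** 2 for x in ages)
s_yy = sum((y - mean_y) ** 2 for y in bps)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, bps))

# Least squares line and its fitted values.
b = s_xy / s_xx
a = mean_y - b * mean_x
fitted = [a + b * x for x in ages]

ss_total = sum((y - mean_y) ** 2 for y in bps)
ss_explained = sum((f - mean_y) ** 2 for f in fitted)
ss_unexplained = sum((y - f) ** 2 for y, f in zip(bps, fitted))
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"SS total       = {ss_total:.1f}")
print(f"SS explained   = {ss_explained:.1f}")
print(f"SS unexplained = {ss_unexplained:.1f}")
print(f"r^2            = {r ** 2:.4f}")
print(f"explained/total = {ss_explained / ss_total:.4f}")
```

For these data almost all of the variability in median blood pressure is explained by age class, so r² is close to 1.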

161 Example
TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950. Columns: Country (i), Xi, Yi. Countries: Australia, Canada, Denmark, Finland, Great Britain, Holland, Iceland, Norway, Sweden, Switzerland, USA.

162 Fitting the Least Squares Line
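First compute the following three quantities: $S_{xx} = \sum (x_i - \bar{x})^2$, $S_{yy} = \sum (y_i - \bar{y})^2$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$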

163 Computing Estimate of Slope and Intercept: slope b = 0.228 and intercept a = 6.756

164 Computing r and r²
54.4% of the variability in Y (death rate due to lung cancer in 1950) is explained by X (per capita cigarette smoking in 1930).

165 Y = 6.756 + (0.228)X

166 Comments
Correlation will be +1 or -1 if the data lies on a straight line (with non-zero slope). Correlation can be zero or close to zero if the data is either not related or, in some situations, non-linearly related.

167 Example The data

168 One should be careful in interpreting zero correlation.
It does not necessarily imply that Y is not related to X. It could happen that Y is non-linearly related to X. One should plot Y vs X before concluding that Y is not related to X.
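
A made-up example illustrates the warning: below, Y is determined exactly by X through Y = X², yet Pearson’s r is zero because the relationship is symmetric rather than linear. The data and the function name `pearson_r` are hypothetical and are not the example from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]     # Y is a perfect, but non-linear, function of X

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")
```

A scatter plot of these five points would show the U-shaped pattern immediately, which is why plotting Y against X should come before any conclusion drawn from r alone.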

169 Next topic: Categorical Data

