Introductory Statistics for Laboratorians dealing with High Throughput Data sets Centers for Disease Control.

Regression/Prediction
In this case we can use X to predict Y with perfect accuracy.
The equation is Y = 2X + 1.
For any X we can compute Y.
Error is not a factor.

In this case we can't fit a straight line through every data point.
If we repeat the experiment, we will get somewhat different results.
Random error is a factor.

If we know that the relationship should be a straight line, and we believe the deviations from the line are due to error, we can fit a line to the data.
There are many possible lines to choose from.
How do we select the best line?

Principles of Prediction
If we don't know anything about the person being measured, our best bet is always to predict the mean.
If we know the person is average on X, we should predict they will be average on Y.
We want to select the line that produces the least error.

X – Scale    Y – Scale
1.00         3.00   5.00   6.00   9.00
2.00         4.00   6.00   7.00   10.00
3.00         4.00   6.00   8.00   10.00
4.00         5.00   7.00   9.00   12.00
5.00         6.00   7.00   10.00  12.00
Mean = 3.0   Mean = 7.3

Development Sample
– A sample in which both X and Y are known
– Used to develop an equation that can be used to compute a predicted Y from X
– Used to compute the Standard Curve
Unknown Sample
– Use the equation developed above to predict Y for people or samples for whom we know X but not Y
The development sample here is the X – Scale / Y – Scale data from the Principles of Prediction slide (mean of X = 3.0, mean of Y = 7.3).
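
The two roles can be sketched in a few lines of Python. This is a minimal illustration, assuming ordinary least squares as the fitting rule (the slides only say "the line that produces the least error"); the function names `fit_line` and `predict` are made up for the example.

```python
# Development sample from the slides: four Y values observed at each X.
y_by_x = {
    1: [3, 5, 6, 9],
    2: [4, 6, 7, 10],
    3: [4, 6, 8, 10],
    4: [5, 7, 9, 12],
    5: [6, 7, 10, 12],
}
pairs = [(x, y) for x, ys in y_by_x.items() for y in ys]


def fit_line(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Step 1: develop the equation from the sample where X and Y are both known.
slope, intercept = fit_line(pairs)


# Step 2: apply it to an unknown for which only X is available.
def predict(x):
    return slope * x + intercept


print(f"Predicted Y = {slope:.2f} * X + {intercept:.2f}")
print(f"Predicted Y at the mean of X (3.0): {predict(3.0):.2f}")
```

A person who is average on X (3.0) gets a predicted Y equal to the mean of Y (7.3), which is the second principle of prediction stated earlier.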

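Regression
[Figure: the regression line through the development sample, highlighting one person, Freddie Bruflot (X = 5, Y = 10). Markers show the Actual Y, the Predicted Y, and the Mean of Y. The Total deviation of Actual Y from the Mean of Y is split into a Regression part and a Residual (Error) part. The figure also carries the value .75.]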
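
The split shown in the figure can be written out for this one person. The predicted value 8.8 at X = 5 and the mean of 7.3 are taken from the SSE table in this deck; the symbols $\hat{Y}$ (predicted Y) and $\bar{Y}$ (mean of Y) are notation added here, not the slides'.

```latex
\underbrace{Y - \bar{Y}}_{\text{total}}
  = \underbrace{\hat{Y} - \bar{Y}}_{\text{regression}}
  + \underbrace{Y - \hat{Y}}_{\text{residual (error)}}
\qquad\Longrightarrow\qquad
10 - 7.3 = (8.8 - 7.3) + (10 - 8.8),
\quad\text{i.e.}\quad 2.7 = 1.5 + 1.2
```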

Computation of SSE

X    Y    Y predicted   Residual (error)   Residual squared
1    3    5.8           -2.8               7.84
1    5    5.8           -0.8               0.64
1    6    5.8           0.2                0.04
1    9    5.8           3.2                10.24
2    4    6.55          -2.55              6.5
2    6    6.55          -0.55              0.3
2    7    6.55          0.45               0.2
2    10   6.55          3.45               11.9
3    4    7.3           -3.3               10.89
3    6    7.3           -1.3               1.69
3    8    7.3           0.7                0.49
3    10   7.3           2.7                7.29
4    5    8.05          -3.05              9.3
4    7    8.05          -1.05              1.1
4    9    8.05          0.95               0.9
4    12   8.05          3.95               15.6
5    6    8.8           -2.8               7.84
5    7    8.8           -1.8               3.24
5    10   8.8           1.2                1.44
5    12   8.8           3.2                10.24

Mean of X = 3; Mean of Y = 7.3; Sum of Squared Residuals = 107.68
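
The SSE column arithmetic can be reproduced with a short script. This is a sketch under one assumption: the line Y predicted = 0.75X + 5.05 is inferred from the predicted values in the table (5.8 at X = 1 up to 8.8 at X = 5) and is not stated on the slide. For comparison it also scores the "always predict the mean" rule.

```python
# (X, Y) pairs from the SSE table.
y_by_x = {
    1: [3, 5, 6, 9],
    2: [4, 6, 7, 10],
    3: [4, 6, 8, 10],
    4: [5, 7, 9, 12],
    5: [6, 7, 10, 12],
}
pairs = [(x, y) for x, ys in y_by_x.items() for y in ys]

# Assumption: slope and intercept inferred from the table's predicted values.
SLOPE, INTERCEPT = 0.75, 5.05


def sum_squared_error(points, slope, intercept):
    """Sum of squared residuals (actual Y minus predicted Y) for a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


sse_line = sum_squared_error(pairs, SLOPE, INTERCEPT)

# A flat line at the mean of Y is the prediction we would make knowing nothing about X.
mean_y = sum(y for _, y in pairs) / len(pairs)
sse_mean_only = sum_squared_error(pairs, 0.0, mean_y)

print(f"SSE for the fitted line:     {sse_line:.2f}")
print(f"SSE when predicting the mean: {sse_mean_only:.2f}")
```

Summing the unrounded squared residuals gives about 107.70; the slide's 107.68 comes from adding squared residuals that were first rounded to one decimal place. The flat line at the mean scores about 130.2, so the fitted line does produce less error.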

Correlation/Regression Example
High levels of a particular factor in blood samples (call it BF-Costly) are known to be highly predictive of cervical cancer.
Measuring this specific factor is so expensive and time-consuming that it is impractical.
The following data were obtained on the relationship between a second, easily measured blood factor (call it BF-Cheap) and BF-Costly.

BF-Cheap       BF-Costly
89             78
48             57
74             65
97             86
59             58
65             75
46             57
84             95
78             69
77             86
67             78
36             47
83             74
68             77
96             87
Mean = 71.13   Mean = 72.60

[Figure: scatterplot of BF-Costly against BF-Cheap.]
The scatterplot shows the relationship between BF-Cheap and BF-Costly. It looks like they are correlated. If the correlation is significant, this might be worth pursuing.
Null hypothesis: the correlation is zero
Alpha = .05

Test Significance of the Correlation
If the true correlation were zero, the probability of observing a correlation at least this strong would be .0000724 (7.24E-05).
We can reject the null hypothesis.
It may well be worth it to develop a prediction equation that can be used to predict BF-Costly from BF-Cheap.

Correlations
                                    BF - Cheap    BF - Costly
BF - Cheap    Pearson Correlation   1             0.845299**
              Sig. (2-tailed)       .             7.24E-05
              N                     15            15
BF - Costly   Pearson Correlation   0.845299**    1
              Sig. (2-tailed)       7.24E-05      .
              N                     15            15
** Correlation is significant at the 0.01 level (2-tailed).
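
The Pearson correlation in the table can be checked by hand. The sketch below uses only the standard library, so instead of an exact p-value it converts r to a t statistic and compares it with 2.160, the two-tailed critical value of t for alpha = .05 with n - 2 = 13 degrees of freedom. That comparison is a substitute for the software's Sig. column, not the method the slides used.

```python
import math

bf_cheap = [89, 48, 74, 97, 59, 65, 46, 84, 78, 77, 67, 36, 83, 68, 96]
bf_costly = [78, 57, 65, 86, 58, 75, 57, 95, 69, 86, 78, 47, 74, 77, 87]

n = len(bf_cheap)
mean_x = sum(bf_cheap) / n
mean_y = sum(bf_costly) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in bf_cheap)
syy = sum((y - mean_y) ** 2 for y in bf_costly)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bf_cheap, bf_costly))

# Pearson correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# Test of H0: correlation = 0, using t with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
T_CRITICAL = 2.160  # two-tailed, alpha = .05, df = 13

print(f"r = {r:.6f}, t = {t_stat:.3f}")
print("Reject H0" if abs(t_stat) > T_CRITICAL else "Fail to reject H0")
```

The t statistic lands far beyond the critical value, which agrees with the very small Sig. value in the table.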

Regression Analysis
Null hypothesis: the slope of the regression line is zero
Alpha = .05
The p-value is .0000724.
We can reject the null hypothesis.
What is the equation? Is it any good?

ANOVA
Model 1       Sum of Squares   df   Mean Square   F          Sig.
Regression    1843.201         1    1843.201      32.53887   7.24E-05
Residual      736.3994         13   56.64611
Total         2579.6           14
a Predictors: (Constant), BF - Cheap
b Dependent Variable: BF - Costly
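
With a single predictor, the ANOVA table follows directly from the correlation and the total sum of squares. The sketch below rebuilds it from two numbers quoted in this deck (r = 0.845299 and Total = 2579.6); the variable names are illustrative, and the relations used (regression SS = r² × total SS, F = regression MS / residual MS) are standard simple-regression identities rather than anything stated on the slide.

```python
# Values quoted in the deck's correlation and ANOVA tables.
R = 0.845299
SS_TOTAL = 2579.6
N = 15

# Partition the total sum of squares.
ss_regression = R ** 2 * SS_TOTAL
ss_residual = SS_TOTAL - ss_regression

# Degrees of freedom for one predictor.
df_regression = 1
df_residual = N - 2

# Mean squares and the F ratio.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

print(f"Regression SS = {ss_regression:.3f}, Residual SS = {ss_residual:.3f}")
print(f"F({df_regression}, {df_residual}) = {f_ratio:.3f}")
```

The F ratio is also the square of the t statistic for the slope (5.704285² is about 32.54), which is why the correlation test and the regression test report the same significance level.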

The equation we are looking for will be of the form:
Predicted BF-Costly = slope * BF-Cheap + y-intercept
The y-intercept is called "(Constant)" and is 27.6.
The slope (the number you multiply BF-Cheap by) is .63.
Both of these are significant.

Coefficients
                 Unstandardized Coefficients   Standardized Coefficients
Model 1          B          Std. Error         Beta                        t          Sig.
(Constant)       27.64903   8.116288                                       3.40661    0.004682
BF - Cheap       0.631926   0.110781           0.845299                    5.704285   7.24E-05
a Dependent Variable: BF - Costly

The equation is:
Predicted BF-Costly = .63(BF-Cheap) + 27.6
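
The slope and intercept in the Coefficients table can be recovered from the raw data with the least-squares formulas. This is a minimal sketch; the BF-Cheap value of 70 used for the prediction is a made-up example, not a sample from the deck.

```python
bf_cheap = [89, 48, 74, 97, 59, 65, 46, 84, 78, 77, 67, 36, 83, 68, 96]
bf_costly = [78, 57, 65, 86, 58, 75, 57, 95, 69, 86, 78, 47, 74, 77, 87]

n = len(bf_cheap)
mean_x = sum(bf_cheap) / n
mean_y = sum(bf_costly) / n

# Least-squares slope: cross-products divided by the sum of squares of X.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bf_cheap, bf_costly))
sxx = sum((x - mean_x) ** 2 for x in bf_cheap)
slope = sxy / sxx

# The fitted line always passes through (mean of X, mean of Y).
intercept = mean_y - slope * mean_x


def predict_bf_costly(cheap_value):
    """Predicted BF-Costly for a measured BF-Cheap value."""
    return slope * cheap_value + intercept


print(f"Predicted BF-Costly = {slope:.6f} * BF-Cheap + {intercept:.5f}")

# Hypothetical new sample with BF-Cheap = 70 (illustrative value only).
print(f"BF-Cheap = 70 -> predicted BF-Costly = {predict_bf_costly(70):.1f}")
```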

How good is the equation? What kind of accuracy can we expect?
R Square (.715) is the proportion of the total variance accounted for by the equation.
About 71.5% of the variance is accounted for.
That means about 28.5% is not accounted for.

Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       0.845299   0.71453    0.69257             7.526361
a Predictors: (Constant), BF - Cheap
b Dependent Variable: BF - Costly
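
Every entry in the Model Summary can be derived from the sums of squares in the ANOVA table for this regression. The sketch below takes those three numbers as given; the formulas for adjusted R Square and the standard error of the estimate are the usual textbook definitions, which the slides do not spell out.

```python
import math

# Sums of squares quoted in the deck's ANOVA table.
SS_REGRESSION = 1843.201
SS_RESIDUAL = 736.3994
SS_TOTAL = 2579.6
N = 15           # samples
PREDICTORS = 1   # BF-Cheap only

# Proportion of total variance accounted for by the equation.
r_square = SS_REGRESSION / SS_TOTAL

# Adjusted R Square penalizes for the number of predictors.
df_residual = N - PREDICTORS - 1
df_total = N - 1
adjusted_r_square = 1 - (SS_RESIDUAL / df_residual) / (SS_TOTAL / df_total)

# Standard error of the estimate: typical size of a prediction error, in BF-Costly units.
std_error_estimate = math.sqrt(SS_RESIDUAL / df_residual)

print(f"R Square = {r_square:.5f}")
print(f"Adjusted R Square = {adjusted_r_square:.5f}")
print(f"Std. Error of the Estimate = {std_error_estimate:.4f}")
```

The standard error of the estimate (about 7.5) is the more practical answer to "what kind of accuracy can we expect": predictions of BF-Costly will typically miss by roughly that many units.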

Using Linear Regression to Develop a Standard Curve for Real Time PCR
Develop a standard curve from samples with known concentration.
– Y is the concentration
– X is the CP (crossing point)
Both X and Y are known.
The relationship between concentration and output is not linear; it is an S-curve. The relationship between log(concentration) and output, however, is linear.

Standard Sample Data

CP      Concentration   Log Concentration
33.03   10              1
31.15   10              1
28.47   100             2
28.59   100             2
23.23   1000            3
22.08   1000            3
19.17   10000           4
17.78   10000           4
14.85   100000          5
13.81   100000          5
10.02   1000000         6
9.63    1000000         6

CP and log concentration are highly correlated.
It is a negative correlation: the higher the concentration, the lower the CP.
The correlation is -.996.
Now fit a regression line to these data and get the equation of the line.

Fit Regression Line
The fit is highly significant, of course.
The y-intercept is 8.11.
The slope is -.22.
The equation is:
Predicted Concentration = antilog(-.22(CP) + 8.11)

Coefficients
                 Unstandardized Coefficients   Standardized Coefficients
Model 1          B          Std. Error         Beta                        t          Sig.
(Constant)       8.105604   0.144142                                       56.23334   7.67E-14
CP               -0.21948   0.006444           -0.99572                    -34.0602   1.13E-11
a Dependent Variable: Log Concentration
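
The standard-curve fit can be reproduced from the twelve standards. The sketch below regresses log concentration on CP with the plain least-squares formulas and also reports the correlation; it assumes base-10 logs, which is what the Log Concentration column (1 through 6 for 10 through 1,000,000) implies.

```python
import math

# Standards with known concentration: (CP, log10 concentration).
standards = [
    (33.03, 1), (31.15, 1),
    (28.47, 2), (28.59, 2),
    (23.23, 3), (22.08, 3),
    (19.17, 4), (17.78, 4),
    (14.85, 5), (13.81, 5),
    (10.02, 6), (9.63, 6),
]

n = len(standards)
mean_cp = sum(cp for cp, _ in standards) / n
mean_log = sum(lc for _, lc in standards) / n

sxx = sum((cp - mean_cp) ** 2 for cp, _ in standards)
syy = sum((lc - mean_log) ** 2 for _, lc in standards)
sxy = sum((cp - mean_cp) * (lc - mean_log) for cp, lc in standards)

# Least-squares line: log10(concentration) = slope * CP + intercept.
slope = sxy / sxx
intercept = mean_log - slope * mean_cp

# Correlation between CP and log concentration.
r = sxy / math.sqrt(sxx * syy)

print(f"log10(concentration) = {slope:.5f} * CP + {intercept:.4f}")
print(f"r = {r:.5f}")
```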

Standard Curve
For an unknown with a CP of 18.48:
Log predicted concentration = 8.11 - .22(18.48) = 4.04
Antilog of 4.04 is 1.10E4
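
Reading an unknown off the standard curve is a two-step calculation: apply the line, then take the antilog. The sketch below uses the rounded coefficients quoted on the slide (8.11 and -.22) and assumes a base-10 antilog; `predict_concentration` is an illustrative name.

```python
# Rounded standard-curve coefficients as quoted on the slide.
INTERCEPT = 8.11
SLOPE = -0.22


def predict_concentration(cp):
    """Return (log10 concentration, concentration) for an unknown's CP."""
    log_conc = INTERCEPT + SLOPE * cp
    return log_conc, 10 ** log_conc


log_conc, conc = predict_concentration(18.48)
print(f"log10 concentration = {log_conc:.2f}")
print(f"concentration       = {conc:.2e}")
```

Carrying the unrounded log (about 4.044) through the antilog gives roughly 1.11E4, slightly above the slide's 1.10E4, which rounds the log to 4.04 first. Small changes in the log translate into noticeable changes in concentration, so it is safer to round only at the end.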
