Preview of Chapters 10 and 11, Tuesday 30 September 2008. Chapter 10 Correlation and Regression 1. Correlation 2. Regression 3. Variation and Prediction Intervals.

Preview of Chapters 10 and 11, Tuesday 30 September 2008

Chapter 10 Correlation and Regression 1. Correlation 2. Regression 3. Variation and Prediction Intervals 4. Rank Correlation

1. Correlation
Relationship between two measured variables in a data set at the interval or ratio level of measurement.
In this book: linear relationships only.
Mind the requirements!
Measure: Pearson product-moment correlation, r (sample) or rho (population).
No correlation: r = 0; maximum correlation: r = -1 or r = +1.
Critical values: Table A-6.

Scatterplots of Paired Data Figure 10-2

Formula 10-1
$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$
The linear correlation coefficient r measures the strength of a linear relationship between the paired values in a sample. Calculators can compute r.
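
As an illustration of Formula 10-1, the sums in the formula can be computed directly. This is a minimal sketch: the function name `pearson_r` and the sample data are hypothetical and do not come from the slides.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r, computed from the sums in Formula 10-1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero when all x values (or all y values) are identical;
    # r is undefined in that case.
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired data (assumption, for illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```

The resulting r would then be compared with the critical value from Table A-6 for the given n.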

Figure 10-3 Hypothesis Test for a Linear Correlation

2. Regression
Continuation of correlation.
Computation of the regression line in the scatterplot: the line that best fits the cloud of points.
Goal: predicting values.

Regression
The typical equation of a straight line, y = mx + b, is expressed in the form $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope.
The regression equation expresses a relationship between x (called the independent variable, predictor variable or explanatory variable) and $\hat{y}$ (called the dependent variable or response variable).

Formulas for $b_0$ and $b_1$
Formula 10-2 (slope): $b_1 = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$
Formula 10-3 (y-intercept): $b_0 = \bar{y} - b_1 \bar{x}$
Calculators or computers can compute these values.
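
Formulas 10-2 and 10-3 can be sketched in a few lines of code. The function name `regression_line` and the data below are hypothetical, chosen only to show the arithmetic.

```python
def regression_line(xs, ys):
    """Return (b0, b1) for the line y_hat = b0 + b1 * x (Formulas 10-2 and 10-3)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Formula 10-2: slope (undefined if all x values are identical).
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Formula 10-3: intercept from the sample means.
    b0 = sum_y / n - b1 * (sum_x / n)
    return b0, b1

# Hypothetical paired data (assumption, for illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = regression_line(xs, ys)
y_hat_at_6 = b0 + b1 * 6  # predicted value for x = 6
print(f"y_hat = {b0:.2f} + {b1:.2f} x; prediction at x = 6: {y_hat_at_6:.2f}")
```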

Example: Old Faithful (continued)
Given the sample data in Table 10-1, find the regression equation.

Procedure for Predicting Figure 10-7

3. Variation and Prediction Intervals
Continuation of the regression line.
(Ch. 7) Confidence interval = interval estimate of population parameters: proportion, mean, variance.
Here: an interval estimate of the predicted value of a variable.

Key Concept In this section we proceed to consider a method for constructing a prediction interval, which is an interval estimate of a predicted value of y.

Prediction Interval for an Individual y
$\hat{y} - E < y < \hat{y} + E$, where $E = t_{\alpha/2}\, s_e \sqrt{1 + \dfrac{1}{n} + \dfrac{n(x_0 - \bar{x})^2}{n(\sum x^2) - (\sum x)^2}}$
$x_0$ represents the given value of x; $t_{\alpha/2}$ has n – 2 degrees of freedom.

Standard Error of Estimate
Definition: The standard error of estimate, denoted by $s_e$, is a measure of the differences (or distances) between the observed sample y-values and the predicted values $\hat{y}$ that are obtained using the regression equation.
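
The prediction interval and the standard error of estimate can be sketched together. This is an illustration under stated assumptions: the standard error is computed with the usual formula $s_e = \sqrt{\sum (y - \hat{y})^2 / (n - 2)}$, which the slide describes but does not write out; the critical value t is passed in by hand (here 3.182, the two-tailed 0.05 value for 3 degrees of freedom from a t table); the function names and data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

def standard_error_of_estimate(xs, ys, b0, b1):
    """s_e = sqrt( sum (y - y_hat)^2 / (n - 2) )."""
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

def prediction_interval(xs, ys, x0, t_crit):
    """Return (lower, upper) for an individual y at x = x0.

    t_crit is t_{alpha/2} with n - 2 degrees of freedom, looked up by the caller.
    """
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    s_e = standard_error_of_estimate(xs, ys, b0, b1)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    x_bar = sum_x / n
    margin = t_crit * s_e * math.sqrt(
        1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    y_hat = b0 + b1 * x0
    return y_hat - margin, y_hat + margin

# Hypothetical paired data (assumption, for illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
lower, upper = prediction_interval(xs, ys, x0=6, t_crit=3.182)
print(f"95% prediction interval at x = 6: {lower:.3f} < y < {upper:.3f}")
```

Passing the critical value in as a parameter keeps the sketch free of external libraries; a statistics package could supply it instead.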

4. Rank Correlation
Nonparametric method = distribution-free test = no assumptions about the distribution in the population.
Test of association between two variables.
Spearman's: $r_s$ (sample) or, for the population, $\rho_s$.
Procedure in Figure 10.10 (p. 537).
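
One common way to obtain Spearman's rank correlation is to convert both variables to ranks (tied values share the mean of their ranks) and then compute the Pearson correlation of the ranks. The sketch below follows that approach; it is not necessarily the exact procedure in the book's figure, and the function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman's r_s as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paired data (assumption, for illustration only).
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]
rs = spearman_rs(xs, ys)
print(f"r_s = {rs:.3f}")
```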

Example

Chapter 11 Multinomial Experiments and Contingency Tables 1. Goodness-of-fit: multinomial 2. Contingency tables (cross-tabulations) 3. Analysis of variance (ANOVA)

Overview
• We focus on analysis of categorical (qualitative or attribute) data that can be separated into different categories (often called cells).
• Use the $\chi^2$ (chi-square) test statistic (Table A-4).
• The goodness-of-fit test uses a one-way frequency table (single row or column).
• The contingency table uses a two-way frequency table (two or more rows and columns).

1. Goodness-of-fit: multinomial
Does an observed probability distribution of a nominal variable agree with an expected distribution?
H0: p1 = x, p2 = y, p3 = z, p4 = etc.
H1: At least one of the observed proportions differs from the expected probability.

Goodness-of-Fit Test in Multinomial Experiments
Test Statistic: $\chi^2 = \sum \dfrac{(O - E)^2}{E}$
Critical Values
1. Found in Table A-4 using k – 1 degrees of freedom, where k = number of categories.
2. Goodness-of-fit hypothesis tests are always right-tailed.
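
The goodness-of-fit statistic can be sketched as follows for the case where all categories are expected to be equally likely. The observed counts are hypothetical (they are not the counts from Table 11-2), and the critical value 16.919 is the right-tailed 0.05 value for 9 degrees of freedom from a chi-square table, entered by hand.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical last-digit counts for digits 0..9 (assumption, for illustration).
observed = [12, 8, 10, 9, 11, 10, 13, 7, 10, 10]
k = len(observed)
total = sum(observed)
expected = [total / k] * k  # H0: every digit is equally likely

chi2 = chi_square_gof(observed, expected)
df = k - 1
critical_value = 16.919  # chi-square table, alpha = 0.05, df = 9

# The test is right-tailed: reject H0 only when chi2 exceeds the critical value.
reject_h0 = chi2 > critical_value
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {reject_h0}")
```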

Example: Last Digit Analysis Test the claim that the digits in Table 11-2 do not occur with the same frequency.

Relationships Among the $\chi^2$ Test Statistic, P-Value, and Goodness-of-Fit Figure 11-3

2. Contingency tables (cross-tabulations)
In this section we consider contingency tables (or two-way frequency tables), which include frequency counts for categorical data arranged in a table with at least two rows and at least two columns. We present a method for testing the claim that the row and column variables are independent of each other. We will use the same method for a test of homogeneity, whereby we test the claim that different populations have the same proportion of some characteristic.

Case-Control Study of Motorcycle Drivers
                             Black   White   Yellow/Orange   Row Totals
Controls (not injured)         491     377              31          899
Cases (injured or killed)      213     112               8          333
Column Totals                  704     489              39         1232
Expected frequency: E = (row total)(column total) / (grand total)
For the upper left-hand cell: E = (899)(704) / 1232 = 513.714

Case-Control Study of Motorcycle Drivers
                               Black     White   Yellow/Orange   Row Totals
Controls (not injured)           491       377              31          899
  Expected                   513.714   356.827          28.459
Cases (injured or killed)        213       112               8          333
  Expected                   190.286   132.173          10.541
Column Totals                    704       489              39         1232

Case-Control Study of Motorcycle Drivers
H0: Row and column variables are independent.
H1: Row and column variables are dependent.
The test statistic is $\chi^2 = 8.775$, with $\alpha = 0.05$.
The number of degrees of freedom is (r–1)(c–1) = (2–1)(3–1) = 2.
The critical value (from Table A-4) is $\chi^2_{0.05,\,2} = 5.991$.

We reject the null hypothesis. It appears there is an association between helmet color and motorcycle safety. Case-Control Study of Motorcycle Drivers Figure 11-4
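
The motorcycle example can be retraced in code using the observed counts from the table above. This is a sketch only: the function names are hypothetical, and the critical value 5.991 is the one quoted from Table A-4.

```python
def expected_frequencies(table):
    """E = (row total)(column total) / (grand total) for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return (chi2, df) for a test of independence on a two-way table."""
    expected = expected_frequencies(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Observed counts: columns are Black, White, Yellow/Orange.
observed = [
    [491, 377, 31],  # Controls (not injured)
    [213, 112, 8],   # Cases (injured or killed)
]
expected = expected_frequencies(observed)
chi2, df = chi_square_independence(observed)
critical_value = 5.991  # Table A-4, alpha = 0.05, df = 2
reject_h0 = chi2 > critical_value
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {reject_h0}")
```

Computed without intermediate rounding, the statistic agrees with the reported 8.775 to about two decimal places.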

3. Analysis of variance (ANOVA)
ANalysis Of VAriance
H0: several population means are equal.
F distribution (Table A-7).
Test based on the P-value.

IN CLOSING: Bayesian statistics
Texts and 2 assignments (to be handed out)
1. Intuitive approach
2. Formal approach

Example problem
Given: In Orange County, USA, 51% of the population is male; 9.5% of the males smoke cigars, compared with 1.7% of the females.
Asked: What is the probability that a randomly selected cigar smoker is male?
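
One possible formal approach to the example problem applies Bayes' theorem: P(male | cigar) = P(male) P(cigar | male) / [P(male) P(cigar | male) + P(female) P(cigar | female)]. The sketch below assumes that everyone who is not male is female, so P(female) = 0.49; the function name is hypothetical.

```python
def posterior_male_given_cigar(p_male, p_cigar_given_male, p_cigar_given_female):
    """Bayes' theorem for two complementary groups (male / female)."""
    p_female = 1 - p_male  # assumption: the two groups cover the whole population
    joint_male = p_male * p_cigar_given_male        # P(male and cigar smoker)
    joint_female = p_female * p_cigar_given_female  # P(female and cigar smoker)
    return joint_male / (joint_male + joint_female)

p = posterior_male_given_cigar(
    p_male=0.51,
    p_cigar_given_male=0.095,
    p_cigar_given_female=0.017,
)
print(f"P(male | cigar smoker) = {p:.3f}")
```

Under these assumptions the probability comes out at roughly 0.85, well above the prior of 0.51, because cigar smoking is much more common among the males.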