Presentation on theme: "Chapter 3 ~ Descriptive Analysis & Presentation of Bivariate Data"— Presentation transcript:
1 Chapter 3 ~ Descriptive Analysis & Presentation of Bivariate Data [Figure: "Regression Plot" with axes Weight and Height, annotated with the fitted line equation and r = 0.559]
2 Chapter Goals: To be able to present bivariate data in tabular and graphic form. To become familiar with the ideas of descriptive presentation. To gain an understanding of the distinction between the basic purposes of correlation analysis and regression analysis.
3 3.1 ~ Bivariate Data. Bivariate Data: Consists of the values of two different response variables that are obtained from the same population of interest. Three combinations of variable types: 1. Both variables are qualitative (attribute). 2. One variable is qualitative (attribute) and the other is quantitative (numerical). 3. Both variables are quantitative (both numerical).
4 Two Qualitative Variables: When bivariate data results from two qualitative (attribute or categorical) variables, the data is often arranged on a cross-tabulation or contingency table. Example: A survey was conducted to investigate the relationship between preferences for television, radio, or newspaper for national news, and gender. The results are given in the table below:
5 Marginal Totals: This table may be extended to display the marginal totals (or marginals). The total of the marginal totals is the grand total:
              TV    Radio   NP    Row Totals
Male          280   175     305   760
Female        115   275     170   560
Col. Totals   395   450     475   1320
Note: Contingency tables often show percentages (relative frequencies). These percentages are based on the entire sample or on the subsample (row or column) classifications.
6 Percentages Based on the Grand Total (Entire Sample) The previous contingency table may be converted to percentages of the grand total by dividing each frequency by the grand total and multiplying by 100For example, 175 becomes 13.3%TVRadioNPRow TotalsMale21.213.323.157.6Female8.720.812.942.4Col. Totals29.934.136.0100.01751320100133=æèçöø÷.
7 Percentages Based on Grand Total Illustration: These same statistics (numerical values describing sample results) can be shown in a (side-by-side) bar graph: [Figure: bar graph "Percentages Based on Grand Total"; horizontal axis Media (TV, Radio, NP), vertical axis Percent, separate bars for Male and Female]
8 Percentages Based on Row (Column) Totals: The entries in a contingency table may also be expressed as percentages of the row (column) totals by dividing each row (column) entry by that row's (column's) total and multiplying by 100. The entries in the contingency table below are expressed as percentages of the column totals: Note: These statistics may also be displayed in a side-by-side bar graph.
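As a hedged sketch of the column-total version, the snippet below divides each count from the news-preference survey by its column total. The function name `percent_of_column_totals` and the dictionary layout are assumptions made for illustration.

```python
# Counts from the survey table: rows are gender, columns are news medium.
counts = {
    "Male": {"TV": 280, "Radio": 175, "NP": 305},
    "Female": {"TV": 115, "Radio": 275, "NP": 170},
}


def percent_of_column_totals(table):
    """Express every cell as a percentage of its column total, rounded to 0.1."""
    columns = list(next(iter(table.values())))
    col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        row_name: {c: round(100 * row[c] / col_totals[c], 1) for c in columns}
        for row_name, row in table.items()
    }


col_pct = percent_of_column_totals(counts)
for row_name, row in col_pct.items():
    print(row_name, row)
```

With column percentages, each column sums to about 100, so the table answers "of those who prefer TV, what share is male?" rather than "what share of the whole sample is male and prefers TV?".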
9 One Qualitative & One Quantitative Variable: 1. When bivariate data results from one qualitative and one quantitative variable, the quantitative values are viewed as separate samples. 2. Each set is identified by levels of the qualitative variable. 3. Each sample is described using summary statistics, and the results are displayed for side-by-side comparison. 4. Statistics for comparison: measures of central tendency, measures of variation, 5-number summary. 5. Graphs for comparison: dotplot, boxplot.
10 Example: A random sample of households from three different parts of the country was obtained and their electric bill for June was recorded. The data is given in the table below: The part of the country is a qualitative variable with three levels of response. The electric bill is a quantitative variable. The electric bills may be compared with numerical and graphical techniques.
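A numerical side-by-side comparison might look like the sketch below. The bill amounts are hypothetical values invented for illustration (they are not the slide's data), and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which may differ slightly from a textbook's quartile convention.

```python
from statistics import mean, quantiles, stdev

# Hypothetical June electric bills in dollars (NOT the data from the slides).
bills = {
    "Northeast": [23, 31, 38, 45, 52, 60, 67],
    "Midwest": [35, 38, 40, 42, 44, 46, 49],
    "West": [48, 55, 58, 62, 66, 70, 75],
}


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for one sample."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# One row of summary statistics per level of the qualitative variable.
for region, values in bills.items():
    print(region, round(mean(values), 1), round(stdev(values), 1),
          five_number_summary(values))
```

Treating each region as its own sample keeps the comparison simple: the same statistics are computed for every level and then read side by side.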
11 Comparison Using Dotplots: [Figure: dotplots of the electric bills for the Northeast, Midwest, and West] The electric bills in the Northeast tend to be more spread out than those in the Midwest. The bills in the West tend to be higher than both those in the Northeast and Midwest.
12 Comparison Using Box-and-Whisker Plots: [Figure: box-and-whisker plots titled "The Monthly Electric Bill"; axis labeled Electric Bill]
13 Two Quantitative Variables: 1. Expressed as ordered pairs: (x, y). 2. x: input variable, independent variable; y: output variable, dependent variable. Scatter Diagram: A plot of all the ordered pairs of bivariate data on a coordinate axis system. The input variable x is plotted on the horizontal axis, and the output variable y is plotted on the vertical axis. Note: Use scales so that the range of the y-values is equal to or slightly less than the range of the x-values. This creates a window that is approximately square.
14 Example: In a study involving children's fear related to being hospitalized, the age and the score each child made on the Child Medical Fear Scale (CMFS) are given in the table below: Construct a scatter diagram for this data.
15 Child Medical Fear Scale Solution: age = input variable, CMFS = output variable. [Figure: scatter diagram "Child Medical Fear Scale"; horizontal axis Age, vertical axis CMFS]
16 3.2 ~ Linear Correlation: Measures the strength of a linear relationship between two variables. As x increases, no definite shift in y: no correlation. As x increases, a definite shift in y: correlation. Positive correlation: x increases, y increases. Negative correlation: x increases, y decreases. If the ordered pairs follow a straight-line path: linear correlation.
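17 Example: No Correlation. As x increases, there is no definite shift in y: [Figure: scatter diagram; horizontal axis Input, vertical axis Output]
18 Example: Positive Correlation. As x increases, y also increases: [Figure: scatter diagram; horizontal axis Input, vertical axis Output]
19 Example: Negative Correlation. As x increases, y decreases: [Figure: scatter diagram; horizontal axis Input, vertical axis Output]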
20 Please Note: Perfect positive correlation: all the points lie along a line with positive slope. Perfect negative correlation: all the points lie along a line with negative slope. If the points lie along a horizontal or vertical line: no correlation. If the points exhibit some other nonlinear pattern: no linear relationship, no correlation. Need some way to measure correlation.
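These cases can be checked numerically with a small sketch. The helper `linear_r` and the data sets are hypothetical illustrations. For a horizontal or vertical line the coefficient is mathematically undefined (division by zero), so the helper returns `None` for what the slide calls "no correlation".

```python
from math import sqrt


def linear_r(xs, ys):
    """Linear correlation coefficient; None when x or y has no variation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # horizontal or vertical line: r cannot be computed
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_up = linear_r(x, [2 * v + 1 for v in x])        # points on a rising line
r_down = linear_r(x, [10 - 3 * v for v in x])     # points on a falling line
r_flat = linear_r(x, [4, 4, 4, 4, 4])             # horizontal line
r_curve = linear_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # symmetric parabola
print(r_up, r_down, r_flat, r_curve)
```

The parabola case shows why a value of r near 0 only rules out a linear relationship: y is completely determined by x there, yet the linear correlation is zero.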
21 3.1 ~ Bivariate Data. Coefficient of Linear Correlation: r, measures the strength of the linear relationship between two variables. Pearson's Product Moment Formula: $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$. Notes: r = +1: perfect positive correlation. r = -1: perfect negative correlation.
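A short derivation sketch shows why r carries no units. It assumes the usual sample standard deviations $s_x$ and $s_y$ (with n - 1 in the denominator) and sample means $\bar{x}$ and $\bar{y}$; this notation is the common textbook one and is an assumption here.

```latex
\begin{aligned}
s_x &= \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}, \qquad
s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} \\[4pt]
r &= \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}
   = \frac{\sum (x - \bar{x})(y - \bar{y})}
          {\sqrt{\sum (x - \bar{x})^2 \,\sum (y - \bar{y})^2}}
\end{aligned}
```

The factors of n - 1 cancel, and the units of x and y appear once in the numerator and once in the denominator, so r is a pure number. By the Cauchy-Schwarz inequality it always lies between -1 and +1.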
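22 Alternate Formula for r: $r = \frac{\mathrm{SS}(xy)}{\sqrt{\mathrm{SS}(x)\,\mathrm{SS}(y)}}$, where SS(x) = "sum of squares for x" = $\sum x^2 - \frac{(\sum x)^2}{n}$, SS(y) = "sum of squares for y" = $\sum y^2 - \frac{(\sum y)^2}{n}$, and SS(xy) = "sum of squares for xy" = $\sum xy - \frac{\sum x \sum y}{n}$
23 Example: The table below presents the weight (in thousands of pounds) x and the gasoline mileage (miles per gallon) y for ten different automobiles. Find the linear correlation coefficient: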
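The sums-of-squares route to r can be sketched as follows, assuming the computational forms SS(x) = Σx² - (Σx)²/n, SS(y) = Σy² - (Σy)²/n and SS(xy) = Σxy - (Σx)(Σy)/n. The five data pairs are hypothetical and serve only to exercise the functions.

```python
from math import sqrt


def sums_of_squares(xs, ys):
    """Return SS(x), SS(y), SS(xy) from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return ss_x, ss_y, ss_xy


def correlation(xs, ys):
    """Linear correlation coefficient r = SS(xy) / sqrt(SS(x) * SS(y))."""
    ss_x, ss_y, ss_xy = sums_of_squares(xs, ys)
    return ss_xy / sqrt(ss_x * ss_y)


# Hypothetical data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(sums_of_squares(xs, ys))
print(round(correlation(xs, ys), 2))
```

Only five running totals (Σx, Σy, Σx², Σy², Σxy) are needed, which is why this form suits hand or calculator work.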
24 Completing the Calculation for r: $r = \frac{\mathrm{SS}(xy)}{\sqrt{\mathrm{SS}(x)\,\mathrm{SS}(y)}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47$
25 Please Note: r is usually rounded to the nearest hundredth. r close to 0: little or no linear correlation. As the magnitude of r increases, towards -1 or +1, there is an increasingly stronger linear correlation between the two variables. r can also be estimated from the scatter diagram (the window should be approximately square), which is useful for checking calculations.
26 3.3 ~ Linear Regression: Regression analysis finds the equation of the line that best describes the relationship between two variables. One use of this equation: to make predictions.
27 Models or Prediction Equations: Some examples of various possible relationships: Linear: $\hat{y} = b_0 + b_1 x$. Quadratic: $\hat{y} = a + bx + cx^2$. Exponential: $\hat{y} = a(b^x)$. Logarithmic: $\hat{y} = a \log_b x$. Note: What would a scatter diagram look like to suggest each relationship?
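28 Method of Least Squares: Equation of the best-fitting line: $\hat{y} = b_0 + b_1 x$. Predicted value: $\hat{y}$. Least squares criterion: Find the constants b0 and b1 such that the sum $\sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$ is as small as possible.
29 Illustration: Observed and predicted values of y: [Figure: the line $\hat{y} = b_0 + b_1 x$ with an observed point $(x, y)$, the predicted point $(x, \hat{y})$ on the line, and the vertical distance $y - \hat{y}$ between them]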
30 The Line of Best Fit Equation: The equation is determined by: b0: y-intercept; b1: slope. Values that satisfy the least squares criterion: $b_1 = \frac{\mathrm{SS}(xy)}{\mathrm{SS}(x)}$, $b_0 = \frac{\sum y - b_1 \sum x}{n}$
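A minimal sketch of the fit, assuming the standard closed-form solution b1 = SS(xy)/SS(x) and b0 = (Σy - b1·Σx)/n. The data pairs and the function names `fit_line` and `sum_squared_errors` are hypothetical. The last lines check the least squares criterion empirically by nudging the constants and confirming that the sum of squared errors only gets larger.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b1 = ss_xy / ss_x
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of (y - y_hat)^2 over all observed pairs."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
best = sum_squared_errors(xs, ys, b0, b1)

# Any other choice of constants gives a larger sum of squared errors.
nudged = [
    sum_squared_errors(xs, ys, b0 + d0, b1 + d1)
    for d0 in (-0.1, 0.1)
    for d1 in (-0.05, 0.05)
]
print(round(b0, 2), round(b1, 2), round(best, 2), min(nudged) > best)
```

The nudge test is not a proof, only a quick sanity check that the fitted constants sit at the bottom of the error surface.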
31 Example: A recent article measured the job satisfaction of subjects with a 14-question survey. The data below represents the job satisfaction scores, y, and the salaries, x, for a sample of similar individuals: 1) Draw a scatter diagram for this data. 2) Find the equation of the line of best fit.
32 Finding b1 & b0: Preliminary calculations needed to find b1 and b0:
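33 Line of Best Fit: $b_1 = \frac{\mathrm{SS}(xy)}{\mathrm{SS}(x)} = \frac{118.75}{229.5} = 0.5174$; $b_0 = \frac{\sum y - b_1 \sum x}{n} = \frac{133 - (0.5174)(234)}{8} = 1.4902$. Equation of the line of best fit: $\hat{y} = 1.49 + 0.517x$. Solution 1): [Figure: scatter diagram of the data]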
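The arithmetic on this slide can be re-checked from its summary values (SS(xy) = 118.75, SS(x) = 229.5, Σx = 234, Σy = 133, n = 8). The variable names are illustrative, and the formulas b1 = SS(xy)/SS(x) and b0 = (Σy - b1·Σx)/n are the standard least squares ones.

```python
# Summary values from the job-satisfaction example.
ss_xy, ss_x = 118.75, 229.5
sum_x, sum_y, n = 234, 133, 8

b1 = ss_xy / ss_x                 # slope, kept at full precision
b0 = (sum_y - b1 * sum_x) / n     # y-intercept from the unrounded slope

# Rounding the slope too early shifts the intercept slightly.
b0_from_rounded_slope = (sum_y - round(b1, 4) * sum_x) / n

print(round(b1, 4), round(b0, 4), round(b0_from_rounded_slope, 3))
```

Carrying the unrounded slope reproduces the intercept 1.4902, whereas substituting the already-rounded 0.5174 gives about 1.491. The difference is small, but it shows why extra decimal places are worth keeping during the calculation.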
35 Please Note: Keep at least three extra decimal places while doing the calculations to ensure an accurate answer. When rounding off the calculated values of b0 and b1, always keep at least two significant digits in the final answer. The slope b1 represents the predicted change in y per unit increase in x. The y-intercept is the value of y where the line of best fit intersects the y-axis. The line of best fit will always pass through the point $(\bar{x}, \bar{y})$.
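The last property follows directly from the intercept formula. The sketch below assumes the standard least squares intercept $b_0 = (\sum y - b_1 \sum x)/n$ and writes $\bar{x} = \sum x / n$ and $\bar{y} = \sum y / n$ for the sample means.

```latex
\begin{aligned}
b_0 &= \frac{\sum y - b_1 \sum x}{n} = \bar{y} - b_1 \bar{x} \\[4pt]
\hat{y}\,\big|_{x = \bar{x}} &= b_0 + b_1 \bar{x}
  = (\bar{y} - b_1 \bar{x}) + b_1 \bar{x} = \bar{y}
\end{aligned}
```

So substituting the mean of x into the fitted equation returns the mean of y, whatever the slope turns out to be. This gives a quick check on a hand-computed line.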
36 Making Predictions: 1. One of the main purposes for obtaining a regression equation is for making predictions. 2. For a given value of x, we can predict a value of $\hat{y}$. 3. The regression equation should be used to make predictions only about the population from which the sample was drawn. 4. The regression equation should be used only to cover the sample domain on the input variable. You can estimate values outside the domain interval, but use caution and use values close to the domain interval. 5. Use current data. A sample taken in 1987 should not be used to make predictions in 1999.
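Point 4 can be made concrete with a small guard around the prediction. The sketch uses the coefficients from the job-satisfaction example (b0 = 1.49, b1 = 0.517). The domain limits `X_MIN` and `X_MAX` and the function name `predict` are hypothetical, since the sampled salary range is not given in the text.

```python
def predict(x, b0, b1, x_min, x_max):
    """Return the predicted y and whether x lies inside the sampled x-range."""
    y_hat = b0 + b1 * x
    inside_domain = x_min <= x <= x_max
    return y_hat, inside_domain


B0, B1 = 1.49, 0.517      # line of best fit from the example
X_MIN, X_MAX = 20, 40     # hypothetical sample domain for x

for x in (30, 55):
    y_hat, ok = predict(x, B0, B1, X_MIN, X_MAX)
    note = "" if ok else " (extrapolation: use with caution)"
    print(f"x = {x}: predicted y = {y_hat:.2f}{note}")
```

Returning the flag alongside the prediction, rather than refusing to predict, mirrors the slide's advice: estimates outside the domain are possible but should be treated with caution.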