Presentation on theme: "Looking at Data-Distributions"— Presentation transcript:
1 Looking at Data-Distributions: 1.1 Displaying Distributions with Graphs
2 Basic definitions
Data: numbers with a context. E.g., if your friend's new baby weighed 10.5 pounds, we know the baby is quite large. But if the figure were 10.5 ounces or 10.5 kg, we would know it is impossible. The context makes the number informative.
Individuals: the objects described in the data (people, animals, things).
Variable: any property or characteristic of an individual (e.g., the IQ scores of persons).
Distribution: the distribution of a variable tells us what values it takes and how often it takes them (the frequency of each value).
3 Types of variables
Categorical variable: places an individual into one of several categories (male/female, smoker/nonsmoker).
Quantitative variable: takes numerical values for which arithmetic operations such as adding and averaging can be performed (shoe size, age).
4 How to represent data?
Categorical variables: can use pie charts and bar graphs. E.g., make a pie chart or bar graph for the distribution of gender.
Quantitative variables: can use a histogram.
5 Example 1: The color of your car (distribution of the most popular colors for 2005 model luxury cars made in North America)
Color           Percent
Silver          20
White, pearl    18
Black           16
Blue            13
Light brown     10
Red             7
Yellow, gold    6
What percent of vehicles are some other color?
Make a bar graph.
Can we make a pie chart for the given colors?
Would it be correct to make a pie chart if you added an "Other" category?
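The arithmetic behind the "other color" question can be sketched in a few lines of Python. This is only an illustration: the names `color_percent`, `listed_total`, `other_percent`, and `text_bar_graph` are made up here, and the text bar graph is a stand-in for a graph drawn by hand or on a calculator.

```python
# Percentages copied from the Example 1 table.
color_percent = {
    "Silver": 20,
    "White, pearl": 18,
    "Black": 16,
    "Blue": 13,
    "Light brown": 10,
    "Red": 7,
    "Yellow, gold": 6,
}

# Whatever is not covered by the listed colors belongs to "some other color".
listed_total = sum(color_percent.values())
other_percent = 100 - listed_total


def text_bar_graph(percents):
    """Return one text line per category, with one '#' per percentage point."""
    width = max(len(name) for name in percents)
    return [f"{name:<{width}} {'#' * pct} {pct}%" for name, pct in percents.items()]


# Including "Other" makes the categories add up to the whole (100%).
for line in text_bar_graph({**color_percent, "Other": other_percent}):
    print(line)
```

The listed colors add up to 90%, so 10% of vehicles are some other color. A pie chart shows parts of a whole, so it is only appropriate once the "Other" category is included and the slices sum to 100%.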
6 Example 2: The density of the earth (the variable recorded was the density of the earth as a multiple of the density of water)
7 Using a TI-84, create a histogram. Discuss the shape, center, spread, and outliers.
8 Example 3: Do women study more than men?
Variable: minutes studied on a typical weeknight by students in a first-year college class.
Here are the responses of random samples of 30 women and 30 men from the class:
Women: 180, 120, 150, 200, 120, 90, 120, 180, 120, 150, 60, 240, 180, 120, 180, 180, 120, 180, 360, 240, 180, 150, 180, 115, 240, 170, 150, 180, 180, 120
Men: 90, 90, 150, 240, 30, 0, 120, 45, 120, 60, 230, 200, 30, 30, 60, 120, 120, 120, 90, 120, 240, 60, 95, 120, 200, 75, 300, 30, 150, 180
a) Examine the data. Why are you not surprised that most responses are multiples of 10 minutes? We eliminated one student who claimed to study 30,000 minutes per night. Are there any other responses you consider suspicious?
b) Make a back-to-back stemplot to compare the two samples. That is, use one set of stems with two sets of leaves, one to the right and one to the left of the stems. (Draw a line on either side of the stems to separate stems and leaves.) Order both sets of leaves from smallest at the stem to largest away from the stem. Report the approximate midpoints of both graphs. Does it appear that women study more than men (or at least claim that they do)?
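A back-to-back stemplot for these two samples could be sketched as follows. Several choices here are assumptions rather than part of the exercise: stems are hundreds of minutes, leaves are the tens digit with the units digit truncated, women are placed on the left, and the function name `back_to_back_stemplot` is invented for this illustration.

```python
from statistics import median

women = [180, 120, 150, 200, 120, 90, 120, 180, 120, 150, 60, 240, 180, 120, 180,
         180, 120, 180, 360, 240, 180, 150, 180, 115, 240, 170, 150, 180, 180, 120]
men = [90, 90, 150, 240, 30, 0, 120, 45, 120, 60, 230, 200, 30, 30, 60,
       120, 120, 120, 90, 120, 240, 60, 95, 120, 200, 75, 300, 30, 150, 180]


def back_to_back_stemplot(left, right):
    """Return the text lines of a back-to-back stemplot.

    Stems are hundreds; leaves are the tens digit (units truncated).
    """
    rows = []
    for stem in range(max(left + right) // 100 + 1):
        left_leaves = sorted((v % 100) // 10 for v in left if v // 100 == stem)
        right_leaves = sorted((v % 100) // 10 for v in right if v // 100 == stem)
        # Left-hand leaves are written in reverse so they grow away from the stem.
        left_text = "".join(str(d) for d in reversed(left_leaves))
        right_text = "".join(str(d) for d in right_leaves)
        rows.append((left_text, stem, right_text))
    width = max(len(row[0]) for row in rows)
    return [f"{l:>{width}} | {s} | {r}" for l, s, r in rows]


for line in back_to_back_stemplot(women, men):
    print(line)

# The midpoint of each sample is its median.
print("median, women:", median(women))
print("median, men:", median(men))
```

With these data the medians come out to 175 minutes for women and 120 minutes for men. Splitting each stem in two (leaves 0-4 and 5-9) would give a finer picture; the single-stem version is kept here for brevity.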
9 Answers
a) Most people round their answers. Suspicious responses: the students who claimed 0 minutes, 300 minutes, and 360 minutes.
b) The stemplots suggest that women (claim to) study more than men. The approximate centers are 175 minutes for women and 120 minutes for men.
10 Looking at Data-Distributions: 1.2 Describing Distributions with Numbers
11 Mean & Median
Mean = sum of numbers / number of numbers
Median = middle value (when the numbers are in ascending order)
Example 1: 103, 105, 109, 140, 170. The median is 109, the number in the (n+1)/2-th position from the bottom of the list, where n is the number of values.
Example 2: 18, 19, 20, 20, 26, 28. The median is 20, the average of the numbers in positions n/2 and n/2 + 1. Mean = 21.83.
Example 3: Replace 28 in Example 2 by 100 and recompute the mean and median: 18, 19, 20, 20, 26, 100. Mean = 33.83. The median does not change.
The mean is affected by outliers.
The median is not affected by outliers.
A measure of center alone can be misleading.
Solution: we need a measure of spread (variability).
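The three examples can be checked with Python's standard `statistics` module. This is a small sketch; the variable names `example_1`, `example_2`, and `example_3` are chosen here for illustration.

```python
from statistics import mean, median

example_1 = [103, 105, 109, 140, 170]
example_2 = [18, 19, 20, 20, 26, 28]
example_3 = [18, 19, 20, 20, 26, 100]  # Example 2 with 28 replaced by 100

for name, data in [("Example 1", example_1),
                   ("Example 2", example_2),
                   ("Example 3", example_3)]:
    # mean() is sum / count; median() averages the two middle values when n is even.
    print(f"{name}: mean = {mean(data):.2f}, median = {median(data)}")
```

Replacing 28 by 100 moves the mean from 21.83 to 33.83 while the median stays at 20, which illustrates why the median is called a resistant measure of center.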
12 Measuring spread: Quartiles
Example 4: Age of 10 students: 26, 19, 20, 18, 20, 19, 19, 19, 19, 21
Sort them in ascending order: 18, 19, 19, 19, 19, 19, 20, 20, 21, 26
Median (Q2) = 19
First quartile (Q1) = median of the lower half of the data = 19
Third quartile (Q3) = median of the upper half of the data = 20
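The "median of each half" procedure can be sketched in Python as below. The function name `quartiles` is invented here, and leaving the middle observation out of both halves when n is odd is an assumption (a common textbook convention); library routines such as `statistics.quantiles` use interpolation rules that may give slightly different values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumption: when n is odd, the middle observation is excluded
    from both the lower and the upper half.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # observations below the median position
    upper = data[len(data) - half:]   # observations above the median position
    return median(lower), median(data), median(upper)


ages = [26, 19, 20, 18, 20, 19, 19, 19, 19, 21]
q1, q2, q3 = quartiles(ages)
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
```

For the ten ages this gives Q1 = 19, median = 19, and Q3 = 20.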
13 IQR (interquartile range) = Q3 - Q1
Five-number summary: Min, Q1, Q2, Q3, Max
Box plot: a picture of the five-number summary. Can be used to compare two distributions.
[Figure: box plot labeled, from top to bottom, Max, Q3, Median (Q2), Q1, Min, with the IQR marked between Q1 and Q3]
14 The 1.5 × IQR rule for suspected outliers
Example 5 (travel times to work in New York, in minutes): 10, 30, 5, 25, 40, 20, 10, 15, 30, 20, 15, 20, 85, 15, 65, 15, 60, 60, 40, 45
(single-peaked, right-skewed; no center observation, but there is a center pair)
The five-number summary: Min = 5, Q1 = 15, Median = 22.5, Q3 = 42.5, Max = 85
IQR = 42.5 - 15 = 27.5
Apply the 1.5 × IQR rule:
Step 1: Calculate 1.5 × IQR = 1.5 × 27.5 = 41.25
Step 2: Calculate Q1 - (1.5 × IQR) = 15 - 41.25 = -26.25
Step 3: Calculate Q3 + (1.5 × IQR) = 42.5 + 41.25 = 83.75
Any values outside (-26.25, 83.75) are flagged as suspected outliers.
The suspected outlier in the data is 85.
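The three steps of the rule can be sketched in Python as follows. The function names `five_number_summary` and `suspected_outliers` are illustrative, and the quartiles use the median-of-halves rule (with the middle observation excluded when n is odd, which is an assumption).

```python
from statistics import median

travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    data = sorted(values)
    half = len(data) // 2
    return (data[0], median(data[:half]), median(data),
            median(data[len(data) - half:]), data[-1])


def suspected_outliers(values):
    """Return the lower fence, upper fence, and values outside the fences."""
    _, q1, _, q3, _ = five_number_summary(values)
    step = 1.5 * (q3 - q1)            # Step 1: 1.5 x IQR
    lower_fence = q1 - step           # Step 2
    upper_fence = q3 + step           # Step 3
    flagged = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, flagged


print("five-number summary:", five_number_summary(travel_times))
print("fences and suspected outliers:", suspected_outliers(travel_times))
```

For the travel times the fences are -26.25 and 83.75, and the only value flagged is 85.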
15 Standard deviation (s)
Used as a measure of spread when the mean is used as the measure of center.
Units of s = same as the data units.
s is never negative (s ≥ 0).
Higher s -> more spread.
s = 0 -> no spread -> all observations are equal.
s is affected by outliers.
Example: 1, 1, 2, 5, 3
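The slide gives the data but not the computation, so here is a hedged sketch of how s could be worked out for the example. It assumes the usual sample definition, with n - 1 in the denominator; the helper name `sample_std_dev` is invented for this illustration.

```python
import math
from statistics import stdev


def sample_std_dev(values):
    """Sample standard deviation, assuming the n - 1 denominator."""
    n = len(values)
    x_bar = sum(values) / n                              # mean: 12 / 5 = 2.4
    squared_devs = [(x - x_bar) ** 2 for x in values]    # squared deviations from the mean
    variance = sum(squared_devs) / (n - 1)               # about 11.2 / 4 = 2.8
    return math.sqrt(variance)


data = [1, 1, 2, 5, 3]
s = sample_std_dev(data)
print(f"s = {s:.3f}")

# The standard library's stdev() uses the same n - 1 definition.
print(f"statistics.stdev = {stdev(data):.3f}")
```

For these five values the variance is 2.8, so s is about 1.673, in the same units as the data.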
17 Looking at Data-Distributions: 1.3 Density Curves and Normal Distributions
18 Definitions
Density curve: a smooth, idealized version of a histogram, scaled so that the total area under the curve is 1.
[Figure: a typical histogram beside an example of a density curve; axes: relative frequency vs. bin limits]
Characteristics of a density curve:
All y values are nonnegative (the curve lies on or above the horizontal axis).
The total area under the curve is 1.
The curve approaches zero for extreme left and right x values.
19 Definitions: Normal distribution formula
It can be shown that the probability density function for a normal random variable X with mean $\mu_X$ and standard deviation $\sigma_X$ has the following form:
$$f(x) = \frac{1}{\sigma_X \sqrt{2\pi}} \, e^{-\frac{(x - \mu_X)^2}{2\sigma_X^2}}$$
TI-84 calculator: 1) turn the STAT plots off; 2) in Y1, enter normalpdf(x, mean, standard deviation); 3) normalpdf( is found under 2nd -> DISTR.
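For readers without a TI-84, the same density can be evaluated in Python. This sketch uses the standard textbook form of the normal density; the function name `normal_pdf` is chosen here to echo the calculator's normalpdf( and is not a library call.

```python
import math


def normal_pdf(x, mean, sd):
    """Normal density: exp(-(x - mean)^2 / (2 sd^2)) / (sd * sqrt(2 pi))."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


# Tabulate the standard normal curve (mean 0, standard deviation 1).
for x in [-3, -2, -1, 0, 1, 2, 3]:
    print(f"x = {x:>2}: density = {normal_pdf(x, 0, 1):.4f}")
```

The curve is symmetric about the mean and peaks there; for mean 0 and standard deviation 1 the peak height is about 0.3989.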
20 Definitions: The 68-95-99.7 rule
Example: when the mean is 0 and the standard deviation is 1
Approximately 68% of the observations fall within one standard deviation of the mean.
Approximately 95% of the observations fall within two standard deviations of the mean.
Approximately 99.7% of the observations fall within three standard deviations of the mean.
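The three percentages can be checked numerically. The sketch below relies on the identity P(|Z| < k) = erf(k / sqrt(2)) for a standard normal Z, using `math.erf` from the Python standard library; the function name `fraction_within` is illustrative.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * fraction_within(k):.2f}%")
```

The exact values are roughly 68.27%, 95.45%, and 99.73%, which the rule rounds to 68, 95, and 99.7.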
21 Definitions
Normal tables: allow us to calculate probabilities for a normal distribution.
How to get numbers?
There are too many normal distributions (one for each possible mean and standard deviation), so infinitely many.
Need to standardize.
Standardization of normal random variables: if X is normally distributed with mean $\mu_X$ and standard deviation $\sigma_X$, its standardization is
$$Z = \frac{X - \mu_X}{\sigma_X}$$
22 Definitions
Standard normal (Z): N(0,1), with mean 0 and standard deviation 1.
Now we can calculate the fraction of a data set between any two limits.
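As a sketch of how standardizing leads to "the fraction between any two limits", the Python below converts each limit to a z-score and takes the difference of standard normal cumulative probabilities, which is what a normal table provides. The function names and the numbers in the example (mean 100, standard deviation 15, limits 85 and 115) are hypothetical and chosen only for illustration.

```python
import math


def standardize(x, mean, sd):
    """z-score: how many standard deviations x lies from the mean."""
    return (x - mean) / sd


def standard_normal_cdf(z):
    """P(Z <= z) for Z ~ N(0, 1), the quantity tabulated in a normal table."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def fraction_between(low, high, mean, sd):
    """Fraction of a N(mean, sd) distribution lying between low and high."""
    return (standard_normal_cdf(standardize(high, mean, sd))
            - standard_normal_cdf(standardize(low, mean, sd)))


# Hypothetical example: mean 100, standard deviation 15, limits 85 and 115.
print("z for 115:", standardize(115, 100, 15))
print(f"fraction between 85 and 115: {fraction_between(85, 115, 100, 15):.4f}")
```

Because the limits here sit exactly one standard deviation either side of the mean, the result is about 0.68, in line with the 68-95-99.7 rule.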