Chapter 1: Looking at Data - Distributions:


1 Chapter 1: Looking at Data - Distributions:

2 What is Statistics?

3 What is Statistics? Statistics is the science of learning from data.
Components: collection, organization, analysis, interpretation

4 Applications of Statistics
Computer Science: client-server performance, image processing
Chemistry/Physics: determining outliers in your data, linear regression, propagation of error, dealing with large populations and approximations
Engineering: is one process/technique better than another?
Business: making good decisions
Everyday life: medical information, average cell phone usage of Purdue students

5 Branches of Statistics
Collection of data Descriptive Statistics Inferential Statistics

6 1.1: Data: Goals Give examples of cases in a data set.
Identify the variables in a data set. Demonstrate how a label can be used as a variable in a data set. Identify the values of a variable. Classify variables as categorical or quantitative. Information on Histograms: Slides: 15 – 19, Book: pp. 15 – 20.

7 Basic Definitions
Cases: objects that are described by the data
Label: special variable used to separate the cases
Variable: characteristic of a case

8 Types of Variables
Number: univariate, bivariate, multivariate
Type: categorical, quantitative
Distribution of a variable: the possible values and how often the variable takes each of these values

9 To better understand a data set, ask:
Who? What cases do the data describe? How many cases?
What? How many variables? What is the exact definition of each variable? What is the unit of measurement for each variable?
Why? What is the purpose of the data? What questions are being asked? Are the variables suitable?

10 1.2: Displaying Distributions with Graphs: Goals
Analyze the distribution of a categorical variable: bar graphs, pie charts. Analyze the distribution of a quantitative variable: histograms, time plots. Identify the shape, center, and spread. Identify and describe any outliers.

11 Categorical Variables - Display
The distribution of a categorical variable lists the categories and gives the count or percent or frequency of individuals who fall into each category. Pie charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories. Bar graphs represent categories as bars whose heights show the category counts or percents.

12 Categorical Variables – Display (STAT 311)

13 Quantitative Variable: Stemplot
Stemplots separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.
Procedure
Separate each observation into a stem (first part of the number) and a leaf (the remaining part of the number).
Write the stems in a vertical column; draw a vertical line to the right of the stems.
Write each leaf in the row to the right of its stem; order leaves if desired.
Put in the units.

14 Stemplot: Example The actual percentages are: 77 83 66 91 75 76 78 57 95 90 86 99 63 71 73 52 68
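The stemplot procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's own method: the function name `stemplot` is hypothetical, it assumes non-negative integers with the ones digit as the leaf, and the data are the percentages exactly as they survive in this transcript.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: all digits but the last form the stem, the last digit is the leaf."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):            # sorting first keeps the leaves ordered
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(leaf)
    rows = []
    # Walk every stem between the smallest and largest so empty stems still appear.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | {leaves}")
    return rows

# Percentages as listed on the slide transcript.
percentages = [77, 83, 66, 91, 75, 76, 78, 57, 95, 90, 86, 99, 63, 71, 73, 52, 68]
for row in stemplot(percentages):
    print(row)
```

Each printed row keeps the original values readable: a row such as `5 | 27` stands for the observations 52 and 57.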

15 Quantitative Variable: Histograms
Histograms show the distribution of a quantitative variable by using bars. The height of a bar represents the number of individuals whose values fall within the corresponding class. Procedure - discrete Calculate the frequency and/or relative frequency of each x value. Mark the possible x values on the x-axis. Above each value, draw a rectangle whose height is the frequency (or relative frequency) of that value.
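As a rough sketch of the discrete procedure, the frequency and relative frequency of each x value can be tabulated before drawing the bars. The function name `frequency_table` and the small data set are hypothetical, chosen only for illustration.

```python
from collections import Counter

def frequency_table(observations):
    """Return (value, frequency, relative frequency) for each distinct value, in sorted order."""
    counts = Counter(observations)
    n = len(observations)
    return [(x, counts[x], counts[x] / n) for x in sorted(counts)]

# Hypothetical data: number of children for 10 couples.
kids = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]
table = frequency_table(kids)
for value, freq, rel_freq in table:
    print(value, freq, rel_freq)
```

The bar above each value would then have height equal to either the second or the third column.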

16 Histogram - Discrete 100 married couples between 30 and 40 years of age are studied to see how many children each couple has. The table below is the frequency table of this data set.
Kids    # of Couples    Rel. Freq
0       11              0.11
1       22              0.22
2       24              0.24
3       30              0.30
4
5       1               0.01
6       0               0.00
7
Total   100             1.00

17 Quantitative Variable: Histograms - continuous
Procedure - continuous
Divide the x-axis into a number of class intervals or classes such that each observation falls into exactly one interval.
Calculate the frequency or relative frequency for each interval.
Above each interval, draw a rectangle whose height is the frequency (or relative frequency) of that interval.
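One way to make "each observation falls into exactly one interval" concrete is to use half-open classes [left, right). The sketch below assumes equal-width classes; the name `bin_counts`, the starting point, and the data are hypothetical.

```python
import math

def bin_counts(data, start, width):
    """Count observations in equal-width classes [start + k*width, start + (k+1)*width)."""
    counts = {}
    for x in data:
        k = math.floor((x - start) / width)   # index of the class containing x
        counts[k] = counts.get(k, 0) + 1
    # Report every class between the lowest and highest occupied one, including empty classes.
    return [(start + k * width, start + (k + 1) * width, counts.get(k, 0))
            for k in range(min(counts), max(counts) + 1)]

# Hypothetical measurements, binned with width 1 starting at 2.
measurements = [2.1, 2.4, 3.0, 3.7, 3.9, 5.2]
classes = bin_counts(measurements, start=2, width=1)
for left, right, count in classes:
    print(f"[{left}, {right}): {count}")
```

Changing `width` changes the number of classes and can change the apparent shape of the histogram, which is why several bin widths are worth trying.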

18 Visual Display: Continuous Histogram
Power companies need information about customer usage to obtain accurate forecasts of demand. Investigators from Wisconsin Power and Light determined the energy consumption (BTUs) during a particular period for a sample of 90 gas-heated homes. An adjusted consumption value was calculated via The data is listed under furnace.txt under extra files on the computer web page.

19 Example (cont) [Figure: histograms of the furnace data at different bin widths; surviving panel labels: 63 classes, Bin = 0.5, 32 classes, Bin = 0.25, Bin = 1]

20 Examining Distributions
In any graph of data, look for the overall pattern and for striking deviations from that pattern. You can describe the overall pattern by its shape, center, and spread. An important kind of deviation is an outlier, an individual that falls outside the overall pattern.

21 Shapes of Histograms - Number
Symmetric unimodal bimodal multimodal

22 Shapes of Histograms (cont)
Symmetric Positively skewed Negatively skewed

23 Shapes of Histograms (cont)

24 Outliers

25 Time Plots A time plot shows behavior over time.
Time is always on the x-axis; the other variable is on the y-axis Look for a trend and deviations from the trend. Connecting the data points by lines may emphasize this trend. Look for patterns that repeat at known regular intervals.

26 Example: Time Plots We are interested in the temperature (°F) of effluent at a sewage treatment plant. Plot a histogram of the data. Plot a time plot of the data. 47 54 53 50 46 51 52

27 Example: Time Plots (cont)

28 1.3: Describing Distributions with Numbers: Goals
Describe the center of a distribution by: mean median Compare the mean and median Describe the measure of spread: quartiles standard deviation Describe a distribution by a boxplot (five-number summary and outliers) Be able to determine which summary statistics are appropriate for a given situation Be able to determine the effects of a linear transformation on the above summary statistics.

29 Sample Mean $\bar{x} = \frac{\text{sum of observations}}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$

30 Sample Mean: Example The following data give the time in months from hire to promotion to manager for a random sample of 20 software engineers from all software engineers employed by a large telecommunications firm. a) What is the mean time for this sample? b) Suppose that instead of x20 = 69, we had chosen another engineer who took 483 months to be promoted. What is the mean time for this new sample? 5 7 12 14 18 22 21 25 23 24 34 37 49 64 47 67 69

31 Sample Median, M or x̃ Procedure
Sort the n observations from smallest to largest.
If n is odd, x̃ is the center observation.
If n is even, x̃ is the average of the two center observations.
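The median procedure translates directly into code, and putting it next to the sample mean shows why the two are compared. This is a minimal sketch: the function names and the five-value data set are hypothetical, with 483 borrowed from the slide examples as an extreme value.

```python
def sample_mean(xs):
    """Sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_median(xs):
    """Center observation if n is odd, average of the two center observations if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

# Hypothetical promotion times in months, then the same sample with one extreme value.
times = [5, 7, 12, 14, 18]
with_outlier = [5, 7, 12, 14, 483]

print(sample_mean(times), sample_median(times))
print(sample_mean(with_outlier), sample_median(with_outlier))
```

Replacing the largest value moves the mean a long way but leaves the median where it was, which is what "resistant" means for a measure of center.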

32 Sample Median: Example
The following data give the time in months from hire to promotion to manager for a random sample of 20 software engineers from all software engineers employed by a large telecommunications firm. a) What is the median time for this sample? b) Suppose that instead of x20 = 69, we had chosen another engineer who took 483 months to be promoted. What is the median time for this new sample? 5 7 12 14 18 21 22 23 24 25 34 37 47 49 64 67 69

33 Mean and Median [Figure: positions of the mean and median on a left-skewed and a right-skewed distribution]

34 Variability of Data
Set 1: -15 -10 -5 5 10 15
Set 2: -1 1
Set 3: -3 -2 2 3

35 Quartiles Q1 Q2 Q3

36 Quartiles Procedure Sort the values from lowest to highest and locate the median. The first Quartile, Q1 is the median of the lower half. The third quartile, Q3 is the median of the upper half.
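The quartile procedure can be sketched as follows. One detail is an assumption here, since the slide does not spell it out: when n is odd, the overall median is left out of both halves. Other textbooks and software use different conventions and may give slightly different quartiles. The function names are hypothetical.

```python
def median_of_sorted(s):
    """Median of an already sorted list."""
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def quartiles(xs):
    """Return (Q1, median, Q3), taking Q1 and Q3 as medians of the lower and upper halves."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half]                 # observations below the median position
    upper = s[half + (n % 2):]       # skip the middle observation when n is odd
    return median_of_sorted(lower), median_of_sorted(s), median_of_sorted(upper)

q1, m, q3 = quartiles([1, 2, 3, 4, 5, 6, 7])
iqr = q3 - q1                        # interquartile range
print(q1, m, q3, iqr)
```

The interquartile range is simply Q3 minus Q1, the spread of the middle half of the data.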

37 Quartiles: Example The following data give the time in months from hire to promotion to manager for a random sample of 19 software engineers from all software engineers employed by a large telecommunications firm. Find the median and the quartiles. What is the Interquartile Range? Are there any outliers in this data set? 7 12 14 18 21 22 23 24 25 34 37 47 49 64 100 150

38 Boxplots Procedure Draw and label a number line that includes the range of the distribution. Draw a central box from Q1 to Q3. Draw a line for the median. Extend lines (whiskers) from the box to the minimum and maximum values that are not outliers. Put in dots (* or some symbol) for the outliers
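The numbers needed to draw a boxplot can be computed before any plotting. The slide does not state how outliers are identified, so this sketch assumes the common 1.5 × IQR rule: a point is flagged if it lies more than 1.5 IQRs below Q1 or above Q3. The function name `boxplot_summary` and the data are hypothetical.

```python
from statistics import median

def boxplot_summary(xs, k=1.5):
    """Box edges, median, whisker ends, and outliers under the (assumed) k * IQR rule."""
    s = sorted(xs)
    n = len(s)
    q1 = median(s[: n // 2])           # median of the lower half
    q3 = median(s[(n + 1) // 2:])      # median of the upper half
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in s if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median(s),
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),   # most extreme values that are not outliers
        "outliers": [x for x in s if x < low_fence or x > high_fence],
    }

# Hypothetical promotion times in months.
summary = boxplot_summary([7, 12, 14, 18, 21, 22, 23, 24, 25, 150])
print(summary)
```

The whiskers stop at the most extreme observations inside the fences, and anything beyond is drawn as a separate symbol.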

39 Boxplot: Example

40 Side-by-side Boxplot: Example

41 Sample Standard Deviation
$\text{variance} = s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
$\text{standard deviation} = s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
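A direct transcription of the sample variance and standard deviation formulas, using the n − 1 divisor, might look like the sketch below. The function names and the data are hypothetical.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_sd(xs):
    """Square root of the sample variance; same units as the observations."""
    return math.sqrt(sample_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations with mean 5
print(sample_variance(data), sample_sd(data))
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which use the same n − 1 divisor and can serve as a cross-check.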

42 Properties of Standard Deviation
s measures spread about the mean, so use it only when the mean is the measure of center.
s = 0 means that all of the observations are the same; otherwise s > 0.
s is not resistant to outliers.
s has the same units of measurement as the original observations.

43 Sample Standard Deviation: Example
The following data give the time in months from hire to promotion to manager for a random sample of 20 software engineers from all software engineers employed by a large telecommunications firm. What is the standard deviation time for this sample? 5 7 12 14 18 22 21 25 23 24 34 37 49 64 47 67 69

44 Choosing Measures of Center and Spread
Choices: mean and standard deviation, or median and IQR. ALWAYS PLOT YOUR DATA!

45 Change of Measurement Linear transformation: xnew = a + bx Effects
No change to shape.
Adding a: adds a to measures of center; doesn’t affect measures of spread.
Multiplying by b: multiplies measures of center by b and measures of spread (s, IQR) by |b|.
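These effects are easy to check numerically. The sketch below uses a Celsius to Fahrenheit conversion (a = 32, b = 1.8) on hypothetical temperatures; the helper name `transform` is also hypothetical.

```python
import math
from statistics import mean, median, stdev

def transform(xs, a, b):
    """Apply the linear transformation x_new = a + b * x to every observation."""
    return [a + b * x for x in xs]

temps_c = [10, 12, 15, 20, 23]          # hypothetical temperatures in degrees Celsius
temps_f = transform(temps_c, a=32, b=1.8)

# Measures of center follow the full transformation a + b * (center).
print(math.isclose(mean(temps_f), 32 + 1.8 * mean(temps_c)))
print(math.isclose(median(temps_f), 32 + 1.8 * median(temps_c)))
# Measures of spread ignore a and scale by |b|.
print(math.isclose(stdev(temps_f), 1.8 * stdev(temps_c)))
```

With a negative multiplier the center changes sign along with the data, while the standard deviation still scales by the absolute value of b.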

46 1.4: Density Curves and Normal Distributions: Goals
Be able to state the definition and practical importance of a density curve. State the physical meaning of the measurements of center and spread for density distributions. Normal distributions: Be able to sketch the normal distribution. Be able to state the importance of the 68–95–99.7 rule. Be able to standardize a value. Be able to use the Z-table. Be able to calculate percentages. Be able to calculate percentiles (inverse calculations). Be able to determine if a distribution is normal (normal quantile plots).

47 Exploring Quantitative Data
Always plot your data. Look for the overall pattern. Calculate a numeric summary. Sometimes, the overall pattern is regular so that we can describe it by a specific methodology.

48 Density Curve (a) (b) (c)

49 Properties of Density Curve
[Figure: density curve y = f(x)]

50 Density Curves – Median and Mean
The median of a density curve is the equal-areas point: $p = 0.5 = \int_{-\infty}^{\text{median}} f(x)\,dx$. The mean of a density curve is the balance point. If the distribution is symmetric, the median and mean are the same and are the center of the curve.
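A short worked example may help separate the two ideas. The density below is a hypothetical choice, picked only because its integrals are easy to do by hand; it is not taken from the slides.

```latex
% Hypothetical density: f(x) = 2x on [0, 1], and 0 elsewhere.
% Total area: \int_0^1 2x \, dx = 1, so f is a valid density curve.
\begin{align*}
\text{Median } m:\quad & \int_0^{m} 2x \, dx = m^2 = 0.5
    \;\Longrightarrow\; m = \tfrac{1}{\sqrt{2}} \approx 0.707 \\
\text{Mean } \mu:\quad & \mu = \int_0^{1} x \cdot 2x \, dx = \tfrac{2}{3} \approx 0.667
\end{align*}
```

This density rises toward 1 and has its tail on the left, so the mean (about 0.667) falls below the median (about 0.707): the balance point is pulled toward the tail, while the equal-areas point is not.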

51 Mean

52 Sample vs. Population Terms for samples (actual observations)
Mean: x̄, median: x̃, standard deviation: s
Terms for populations (density curves)
Mean: μ, median: μ̃, standard deviation: σ

53 Normal Distribution A visual comparison of normal and paranormal
Lower caption says 'Paranormal Distribution' - no idea why the graphical artifact is occurring.

54 Normal Distribution $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ where $-\infty < \mu < \infty$, $\sigma > 0$; $X \sim N(\mu, \sigma)$
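To see how μ and σ enter the formula, the normal density can be evaluated directly. This is a small sketch; the function name `normal_pdf` is hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the N(mu, sigma) density curve at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

# The curve peaks at x = mu and is symmetric about it.
print(normal_pdf(6, mu=6, sigma=1.5))
print(normal_pdf(5, mu=6, sigma=1.5), normal_pdf(7, mu=6, sigma=1.5))
```

A larger σ lowers and widens the curve, while changing μ slides it along the x-axis without changing its shape.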

55 Shapes of Normal Density Curve
/process_simulations_sensitivity_analysis_and_error_analysis_modeling /distributions_for_assigning_random_values.htm

56 68–95–99.7 Rule (Empirical Rule)
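The percentages in the empirical rule are areas under the normal curve within 1, 2, and 3 standard deviations of the mean. They can be checked with the error function from Python's standard library; the helper name `area_within` is hypothetical.

```python
import math

def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * area_within(k), 1))
```

The result does not depend on μ or σ, because standardizing any normal variable gives the same standard normal curve.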

57 Standard Normal or z curve
$f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}$

58 Cumulative z curve area

59 Z-table

60 Using the Z table area right of z = 1 - area left of z
area between z1 and z2 (with z1 < z2) = area left of z2 - area left of z1
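The table lookups can be mimicked in code, which is handy for checking hand calculations. In this sketch the "area left of z" is computed from the error function instead of being read from a printed table; the function names are hypothetical.

```python
import math

def area_left(z):
    """Cumulative area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def area_right(z):
    """Area to the right of z: one minus the area to the left."""
    return 1 - area_left(z)

def area_between(z1, z2):
    """Area between two z values, subtracting the smaller left area from the larger."""
    lo, hi = sorted((z1, z2))
    return area_left(hi) - area_left(lo)

print(area_left(1.96), area_right(1.96), area_between(-1, 1))
```

Sorting the two z values inside `area_between` keeps the result non-negative regardless of the order in which they are supplied.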

61 Procedure for Normal Distribution Problems
Sketch the situation and shade the area to be found. Standardize X to state the problem in terms of Z. Use Table A to find the area to the left of z. Calculate the final answer. Write your conclusion in the context of the problem.

62 Normal Distribution: Example
A particular rash has shown up in an elementary school. It has been determined that the length of time that the rash will last is normally distributed with mean 6 days and standard deviation 1.5 days. What percentage of students have the rash for longer than 8 days? For what percentage of students will the rash last between 3.7 and 8 days?
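Both questions can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a software cross-check of the table-based procedure, not a replacement for it; the variable names are hypothetical.

```python
from statistics import NormalDist

rash = NormalDist(mu=6, sigma=1.5)    # rash duration in days

# Standardizing by hand: z = (x - mu) / sigma.
z_8 = (8 - 6) / 1.5
z_3_7 = (3.7 - 6) / 1.5

p_longer_than_8 = 1 - rash.cdf(8)             # area to the right of 8 days
p_between = rash.cdf(8) - rash.cdf(3.7)       # area between 3.7 and 8 days

print(round(z_8, 2), round(z_3_7, 2))
print(round(100 * p_longer_than_8, 1), round(100 * p_between, 1))
```

Answers from a printed Z-table may differ slightly in the last digit, because the table requires rounding z to two decimal places.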

63 Percentiles

64 Normal Distribution: Example
A particular rash has shown up in an elementary school. It has been determined that the length of time that the rash will last is normally distributed with mean 6 days and standard deviation 1.5 days. How long would the student’s rash have to have lasted to be in the top 10% of the number of days that the students have the rash?
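This is an inverse calculation: the area is known and the value is wanted. Being in the top 10% means having 90% of the area to the left, so the cutoff is the 90th percentile. A sketch using `statistics.NormalDist` (Python 3.8 or later), with hypothetical variable names:

```python
from statistics import NormalDist

rash = NormalDist(mu=6, sigma=1.5)    # rash duration in days

# Step 1: z value with area 0.90 to its left on the standard normal curve.
z_90 = NormalDist().inv_cdf(0.90)
# Step 2: undo the standardization, x = mu + z * sigma.
cutoff_days = 6 + z_90 * 1.5

print(round(z_90, 2), round(cutoff_days, 2))
print(round(rash.inv_cdf(0.90), 2))   # same cutoff computed in a single call
```

With a printed table, the same steps apply: find the z whose left area is closest to 0.90, then convert back to days.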

65 Symmetrically Located Areas

66 Normal Distribution: Example
A particular rash has shown up in an elementary school. It has been determined that the length of time that the rash will last is normally distributed with mean 6 days and standard deviation 1.5 days. What interval symmetrically placed about the mean will capture 95% of the times that the students’ rashes last?
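A central 95% interval leaves 2.5% in each tail, so its endpoints are the 2.5th and 97.5th percentiles. The sketch below uses `statistics.NormalDist` (Python 3.8 or later); the variable names are hypothetical.

```python
from statistics import NormalDist

rash = NormalDist(mu=6, sigma=1.5)    # rash duration in days

central_area = 0.95
tail = (1 - central_area) / 2         # area left over in each tail

lower = rash.inv_cdf(tail)
upper = rash.inv_cdf(1 - tail)

print(round(lower, 2), round(upper, 2))
```

By symmetry, the two endpoints sit the same distance from the mean, roughly 1.96 standard deviations on each side.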

67 Procedure: Normal Quantile Plot
1. Arrange the data from smallest to largest.
2. Record the corresponding percentiles (quantiles).
3. Find the z value corresponding to the quantile calculated in step 2.
4. Plot the original data points (from step 1) vs. the z values (from step 3).
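The coordinates for a normal quantile plot can be computed as sketched below. The slide does not say which percentile is assigned to each observation, so the formula (i − 0.5)/n for the i-th smallest value is an assumption; statistical software uses several slightly different conventions. The function name `normal_quantile_pairs` and the data are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_pairs(data):
    """Return (z value, observation) pairs for a normal quantile plot."""
    ordered = sorted(data)                       # arrange smallest to largest
    n = len(ordered)
    standard_normal = NormalDist()
    pairs = []
    for i, x in enumerate(ordered, start=1):
        percentile = (i - 0.5) / n               # assumed plotting position
        z = standard_normal.inv_cdf(percentile)  # z value for that percentile
        pairs.append((z, x))
    return pairs

# Hypothetical effluent temperatures.
pairs = normal_quantile_pairs([47, 54, 53, 50, 46, 51, 52])
for z, x in pairs:
    print(round(z, 2), x)
```

If the plotted points fall close to a straight line, the data are plausibly normal; systematic curvature suggests skewness, and isolated points far from the line suggest outliers.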

