1 Statistics

2 Introduction When studying a process, we are interested in understanding sources of variation in our inputs and outputs. Descriptive statistical methods help: Understand patterns of variation in data, and Describe characteristics of a population.

3 Objectives After completing this section, participants should be able to: Define and identify the three types of data: nominal, ordinal, and continuous; Construct and interpret histograms; Calculate and interpret the sample mean, standard deviation, variance, median, and range; Characterize distributions as skewed, symmetric, bimodal, multimodal, normal, or mound-shaped.

4 Introduction When studying a process, we are interested in understanding sources of variation in our inputs and outputs. Descriptive statistical methods help: Understand patterns of variation in data, and Describe characteristics of a population. In this section, we address the following for continuous data: Graphical displays of data, and Numerical measures of location and spread.

5 Introduction The methods used to display and analyze data depend on the type of data of interest. Measurements can be classified into three types: Nominal – Measurements are unordered categories. Ordinal – Measurements are ordered categories. Continuous – The measurement of interest is in units of measure that, at least conceptually, follow a continuous scale. Note that, under column information, there is “Data Type” and “Modeling Type”. Continuous, nominal, and ordinal are technically “Modeling Types”. For simplicity, however, we think of these as categories of data.

6 Introduction Another classification of data, used in distinguishing control charts, has two main categories: Attribute data: Counts and proportions resulting from nominal data. Variables data: Again, conceptually continuous measurements with a unit of measure. Variables/attribute terminology is the old quality control terminology. People who have done QC will relate to this, and the terms are still prevalent enough that they are worth a quick mention.

7 Histograms A Six Sigma team is charged with reducing delivery time to customers. Delivery time is defined as the number of days between shipment of an order by the company and receipt of the order at the customer’s site. Delivery time is obtained for 100 randomly chosen orders. Note that we treat these data as continuous.

8 Histograms The list of data values is not very informative. The 100 data values can be grouped in order to give more information on overall behavior. This Stem-and-Leaf plot groups the data while retaining the original values. Note that 36 deliveries occurred within 4 days, 27 took 5 to 9 days, etc. What can you conclude about delivery times? We don’t need to emphasize stem and leaf plots. The point in showing this chart is simply to show how data are grouped into categories, and that this grouping shows the “distribution” of the data. If you now flip (and reflect) the stem and leaf plot, it becomes a histogram, as shown on the next slide.

9 Histograms The chart below, called a histogram, is a special kind of bar chart. The histogram respects the order that is implicit in the continuous data, and gives a cleaner picture of the data than does the stem-and-leaf plot. The height of each bar, given on the vertical axis, indicates the number of delivery times that fall within each interval, given on the horizontal axis. Tie this back to the stem and leaf plot.

10 Histograms We say that the histogram gives a picture of the distribution of delivery times. The continuous curve superimposed on the histogram gives a picture of the shape of the distribution. This distribution has a long right tail. We say that it is skewed to the right. The term “distribution” is used to refer to the pattern that the data forms when displayed on a histogram. It is a picture of how the data is distributed. Skewness is to the right or the left, where the direction is the direction of the long tail.

11 Histograms We can use histograms to assess three characteristics of the distribution: Centering, spread, and shape. Where are the delivery times centered? How much do they spread or vary? What is the shape of the distribution? At this point, we are thinking of centering, spread, and shape very intuitively. Centering is merely meant to describe an approximate balancing point for the data. Spread can be interpreted, at this point, simply as the range (smallest value to largest value). Possible shapes are normal, mound-shaped, skewed, bimodal…

12 Histograms We can generate a histogram and discuss this distribution in terms of centering, spread, and shape. Does the process appear to be meeting the specification limits (the blue vertical lines)? The obvious answer is no. But COULD the process meet spec? Yes, and all it would take is a shift of about .02 units to the right. (This is clear once you see the histogram, but was not clear simply from a listing of the data.) Note that all this takes is an increase in density, and this is an easy setting to change. BUT, even if the process were centered between the specs, the histogram would practically fill the range between the limits. This means that it would be easy for an observation to fall outside the spec limits. What we would like is for the process to be very consistent, namely, to have very little spread. This is a harder situation to engineer: it requires eliminating common causes of variation. All this is clear from the histogram.

13 Common Histogram Shapes
Left Skewed: Data trails off to the left. Symmetric: Data has approximately the same distribution on either side of the center. Right Skewed: Data trails off to the right.

14 Histograms More Histogram Shapes
Bi-modal or multi-modal: Data has more than one peak. Uniform: Data is evenly distributed over its range. Uni-modal: Data has one peak.

15 Histograms Compare the centering and spread (variability) of these three distributions. Note that all three distributions have the same horizontal scales. We prefer the one that has the least variability.

16 Histograms Histograms provide many benefits. Histograms:
Summarize the data. Allow one to assess centering, spread, and shape. Help to identify unusual patterns in data. Histograms also have some limitations: Conclusions about the shape of the underlying distribution should not be drawn without a large enough data set (at least 75 randomly chosen data values are recommended). Individual data values are not shown. Improper bin sizes, as we will see on the following slide, can mask important data features.

17 Histograms The same data displayed at three bin widths: 18 bins of width 10 (too many bins? do we see too much noise?), 9 bins of width 20, and 5 bins of width 40 (too few? do we lose too much information?). It might be nice to open the data file InkDensity.jmp and change the number of bins. To do this, use the hand tool: pick it up on the tool bar, move it to any histogram bar, left click, and drag. You can show how you obtain as many bars as data values; this just shows the noise in the data, it does not show structure. Then you can drag until you get only two bars; this conceals the structure in the data, and all information is lost. JMP chooses the number of bins based on a traditional rule: the number of intervals should be approximately the square root of the number of data values.
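The square-root rule mentioned in the notes is easy to compute. A minimal sketch (the function name is ours, not JMP’s):

```python
import math

def suggested_bins(n):
    """Traditional rule cited in the notes: use roughly sqrt(n) bins
    for a sample of n data values."""
    return round(math.sqrt(n))

print(suggested_bins(100))  # 10 bins for 100 data values
```

For the 100 delivery times this suggests about 10 bins, and for the 75-value minimum recommended earlier, about 9.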

18 Measures of Location and Spread
Graphical displays are often supplemented with numerical measures that summarize the information in the data. Measures of centering or location include: Mean Median Mode Measures of spread or variability include: Variance Standard deviation Range The mode is not very useful as a measure of centering. It is primarily noteworthy because it gives rise to the terms “bimodal”, “trimodal”, etc.

19 Measures of Location and Spread
In a study of pull-off force for bonded wires, given in foot-pounds, what can we conclude about the distribution of values? What is a typical pull-off force? How do the measurements vary about the center? Where do the values center? Where do they balance? The point at which they balance is the mean.

20 Measures of Location and Spread
The sample mean or average is the most important measure of centering. The sample mean, referred to as ‘X-bar’, is the average of all observations from a sample: X̄ = (x1 + x2 + … + xn)/n. The mean is the center of gravity or balancing point of a data set. The sample mean is an estimate of the population mean, which is the average of all observations from a population.
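The definition translates directly into code. The force values below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
def sample_mean(xs):
    # X-bar = (x1 + x2 + ... + xn) / n
    return sum(xs) / len(xs)

# Hypothetical pull-off force readings in ft-lbs (illustrative only).
forces = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]
print(sample_mean(forces))  # the balancing point of these values
```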

21 Measures of Location and Spread
The sample mean for the pull-off force data. Notice that the mean is denoted by a fulcrum to emphasize that it is the balancing point of the distribution of values.

22 Measures of Location and Spread
The sample median is the 50th percentile of the sample data: half of the data values lie below the median and half lie above it. The median is the middle value when the data are ordered. The sample median is denoted X0.50. The sample median and sample mean are approximately the same if the distribution is symmetric. In skewed data, the mean is pulled off in the direction of the long tail by extreme values. The median is insensitive to extreme values, as it is based only on the rank of values.
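The contrast between the mean and the median shows up immediately with a small made-up example: replacing one value with an extreme one drags the mean toward the long tail but leaves the median untouched.

```python
import statistics

data = [3, 4, 5, 6, 7]     # symmetric: mean and median agree
skewed = [3, 4, 5, 6, 70]  # one extreme value in the right tail

print(statistics.mean(data), statistics.median(data))      # 5, 5
print(statistics.mean(skewed), statistics.median(skewed))  # 17.6, 5
```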

23 Measures of Location and Spread
The sample mode is the most frequently occurring value in a dataset. The mode is of little interest in itself. Terms such as unimodal, bimodal and multimodal are of interest: A unimodal distribution has one peak. A bimodal distribution has two peaks. A multimodal distribution has two or more peaks. Multimodal distributions are often indications that more than one underlying population or process is represented in the data. As mentioned earlier, the mode itself is not of interest.
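Python’s standard library can report every mode, which gives a quick numerical hint of bimodality (Python 3.8+; the data below are made up for illustration):

```python
import statistics

# Two values tie for most frequent, hinting at a bimodal pattern.
data = [2, 2, 3, 5, 5, 7]
print(statistics.multimode(data))  # [2, 5]
```

As the slide notes, though, the mode itself matters less than the shape it suggests: a histogram of such data would show two peaks.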

24 Measures of Location and Spread
Example: Histogram of the asphalt content (values spanning roughly 3.0 to 9.0). Possibly a mixture of data from four different batches of material?

25 Measures of Location and Spread
Separate histograms of the asphalt content for Batch 1, Batch 2, Batch 3, and Batch 4.

26 Measures of Location and Spread
The two most important measures of the variability or spread of sample data are the sample variance and sample standard deviation. Sample variance, denoted by S²: S² is an estimate of the population variance, σ². S² = “average” squared distance between the data points and the sample mean. Sample standard deviation, denoted by S: S is an estimate of the population standard deviation, σ. S = square root of the “average” squared distance between the data points and the sample mean. The sample variance is useful from a theoretical perspective. The standard deviation can be interpreted in terms of the raw data. First of all, the standard deviation is in the units of the raw data; the variance is in the square of those units. Also, the Empirical Rule gives us a way to relate the magnitude of the standard deviation to the spread of the data. We will see this later – trust us for now!

27 Measures of Location and Spread
The sample variance is the “average” squared distance between the data points and the sample mean: S² = Σ(xi − X̄)² / (n − 1). The sample standard deviation is the square root of the sample variance: S = √S². Note that the sample standard deviation is a value whose units are the original measurement units. Intuitively, the variance is the averaged squared “distance” of points from their mean. The average is obtained by dividing by n − 1, rather than n, because there are really only n − 1 independent summands in the sum of squares. The next slide illustrates this. The averaging with respect to n − 1 is done for theoretical reasons, but the independence of the n − 1 summands is a critical factor in the theory. S², as defined, is an unbiased estimator of the population variance, σ². Dividing by n instead would underestimate σ², and the division by n − 1 corrects for this underestimation. The sample standard deviation, S, is not an unbiased estimator of σ. (This is why S-bar is divided by a factor, c4, when one estimates σ using a control chart for S.)
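The formulas above can be sketched directly, including the n − 1 divisor (the data values are made up for illustration):

```python
import math

def sample_variance(xs):
    # S^2 = sum((x - xbar)^2) / (n - 1)
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

def sample_std(xs):
    # S = sqrt(S^2), expressed in the original measurement units
    return math.sqrt(sample_variance(xs))

xs = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
print(sample_variance(xs), sample_std(xs))
```

Here X̄ = 5, the squared deviations sum to 32, and dividing by n − 1 = 7 gives S² ≈ 4.571 and S ≈ 2.138.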

28 Measures of Location and Spread
Calculation of variance and standard deviation for the pull-off force data. Note that the third column sums to 0. This means that, if n − 1 deviations are known, the remaining one is known by subtraction. This is where the n − 1 degrees of freedom come from: there are only n − 1 independent deviations. Note also that S² is in the square of the original measurement units, whereas S is given in the units of the original data.

29 Measures of Location and Spread
The sample range (R) is the difference between the largest observation and the smallest observation. The sample range is the simplest measure of spread: R = High value − Low value, often written R = Xmax − Xmin. Example: For the pull-off force data, R = Xmax − Xmin = 1.3 ft-lbs. Note that the sample range is completely determined by the extreme values, making it very sensitive to them. Although R works reasonably as an estimate of the spread in a control chart where subgroup sizes are small, it is not the best choice when subgroup sizes are large. In that case, we use an S chart for spread.
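The range calculation is one line; the readings below are hypothetical values chosen to reproduce the 1.3 ft-lbs result, not the course’s actual data:

```python
def sample_range(xs):
    # R = Xmax - Xmin: determined entirely by the two extreme values
    return max(xs) - min(xs)

# Hypothetical pull-off force readings in ft-lbs (illustrative only).
forces = [12.6, 13.1, 12.9, 13.9, 13.4]
print(round(sample_range(forces), 1))  # 1.3
```

Because only the two extremes enter the formula, a single outlier can inflate R while leaving S nearly unchanged.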

30 Parameters and Statistics
We will start by defining some basic statistical terms used throughout this course. A population is a set of all possible observations or units of interest. Some populations are finite (all parts in inventory today). Others are conceptual (all parts that can be produced by a machine at given settings). A sample is a set of observations drawn from a population. A random sample is a representative sample drawn from the population. Such a sample must be selected in a random manner so that each member of the population has an equal probability of being selected.

31 Parameters and Statistics
The population mean is the theoretical (unknown) average of all population measurements. For continuous data, the population mean is denoted by the Greek letter μ (mu). The population standard deviation is the theoretical (unknown) standard deviation of a population. For continuous data, it is denoted by the Greek letter σ (sigma). The population variance is the theoretical variance of a population. For continuous data, it is denoted by σ² (sigma squared). We virtually never know the true values of the population mean, standard deviation, or variance.

32 Parameters and Statistics
A parameter is a numerical value calculated from population data. The population mean, standard deviation, and variance are examples of parameters. Since we are virtually never able to compute parameters, they are theoretical quantities. Parameters are often represented by Greek letters: μ, σ, and σ² are examples of this convention. A statistic is a numerical value calculated from sample data. Examples are the sample mean (X̄), the sample standard deviation (S), and the sample variance (S²). These statistics are used to estimate the corresponding population parameters.

33 Parameters and Statistics
Population parameters (unknown!): μ; σ and σ²; p. Sample statistics (can calculate!): X̄; S and S²; p̂.

34 Parameters and Statistics
Other examples of parameters are: the median, the range, and any percentile of a population. These are estimated by taking a random sample from a population, and calculating its median, range, and percentiles. Of critical importance in estimating population parameters is the ability to draw a sample (often, but not always, a random sample) from the population. In order to do this, the population of interest must be well-defined.
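The parameter/statistic distinction can be illustrated by simulation. In this hypothetical sketch, a normal distribution with μ = 10 and σ = 2 stands in for a conceptual population whose parameters we pretend not to know, and the sample statistics recover them approximately:

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Draw a random sample of 500 observations from the "population".
sample = [random.gauss(10, 2) for _ in range(500)]

# Statistics calculated from the sample estimate the parameters.
print(statistics.mean(sample))   # close to mu = 10
print(statistics.stdev(sample))  # close to sigma = 2
```

A larger sample would, on average, land the statistics closer to the parameters; the parameters themselves are never observed directly.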

35 Distributions The word distribution is used to describe the pattern formed by measurements. For example, we discuss the distribution of cycle times or of errors. In the case of a population of measurements, we talk about the theoretical distribution. For example, cycle times from a sample might have the distribution given by the histogram on the next slide. Keep in mind that the theoretical distribution is, in general, unknown. It is something that we try to estimate.

36 Distributions The histogram might give cycle times for a sample of 100 orders. The continuous curve that is overlaid on the histogram might represent the distribution of cycle times in the underlying population.

37 Distributions Theoretical distributions are useful for two main reasons: Modeling data, and Providing a “yardstick” for sample statistics. With this in mind, we will introduce several useful theoretical distributions: The normal (or Gaussian) distribution, which is based on continuous data, The binomial distribution, which applies to two-category nominal data, and The Poisson distribution, which applies to counts of occurrences.

38 The Normal Distribution
The normal distribution is the basis for many of the statistical techniques that we cover throughout the course. There are many other distributions that are used for modeling continuous data. However, the normal distribution is useful in terms of sample statistics as well as for modeling data. The normal distribution has the classic bell shape. There are infinitely many normal distributions, each defined by a value for the mean, μ, and one for the variance, σ². If a quantity, call it X, has a normal distribution with mean μ and variance σ², we denote this by writing X ~ N(μ, σ²).

39 The Normal Distribution
There are infinitely many possible normal distributions defined by values of the population parameters μ and σ². Below, we see examples of normal curves with different means and variances: N(25, 1) and N(15, 9).

40 The Normal Distribution
Example: The distributions of characteristics of manufactured product are often normal. Suppose that certain bonded wires have pull-off force measurements, X, that are normally distributed with mean 10 and variance 4. The distribution of X is denoted by X ~ N(10, 4). Note that σ = 2. The shaded area shows P(X > 13), where X represents the pull-off force.
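The shaded tail area can be computed with the standard library’s NormalDist. Note that NormalDist takes the standard deviation σ = 2, not the variance 4:

```python
from statistics import NormalDist

# X ~ N(10, 4): mean 10, variance 4, so sigma = 2.
X = NormalDist(mu=10, sigma=2)

# P(X > 13) is the area under the curve to the right of 13.
p = 1 - X.cdf(13)
print(round(p, 4))  # about 0.0668
```

Equivalently, 13 is 1.5 standard deviations above the mean, so this is the right-tail area beyond z = 1.5 on a standard normal curve.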

41 The Normal Distribution
Normal curve coverage: 68.27% of values fall within one standard deviation of the mean, 95.45% within two, and 99.73% within three.

42 The Binomial Distribution
Another distribution that is extremely prevalent is called the binomial distribution. Binomial data are data that result from a series of trials, where each trial results in only one of two possible values: pass or fail, success or failure, yes or no, etc. To have a binomial distribution, three conditions must be met: The number of trials, denoted by n, is fixed in advance; The probability of obtaining a success (which is denoted by “p”) must be constant from trial to trial; The trials are independent (obtaining a success on one trial must not affect the likelihood of obtaining a success on another trial).

43 The Binomial Distribution
A binomial variable is the total number of successes in n trials where the previous conditions are satisfied. The binomial distribution provides a model for many industrial situations. The following are quantities that might well have binomial distributions: The number of parts produced with a particular type of defect; The number of orders for a given part that take in excess of 20 days to fill; The number of late deliveries of a certain type of shipment.
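The binomial probability of exactly k successes in n trials is C(n, k) · p^k · (1 − p)^(n − k). A minimal sketch, using a hypothetical 5% defect rate and batch size of 20 (numbers not from the course material):

```python
import math

def binomial_pmf(k, n, p):
    # P(exactly k successes in n independent trials,
    # each with success probability p)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# e.g., the chance of exactly 2 defective parts in a batch of 20
# when each part independently has a 5% chance of being defective.
print(binomial_pmf(2, 20, 0.05))
```

Summing the pmf over k = 0 … n gives 1, since some number of successes must occur.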

44 The Binomial Distribution
In the case of binomial data, we are usually interested in the proportion of successes in a series of trials (rather than the total number of successes). Equivalently, we are interested in the probability p of a success. For example, we may be interested in the proportion of parts with a defect, or the proportion of late deliveries. The proportion of successes in a random sample drawn from a binomial distribution is denoted by p̂ (“p-hat”).

45 The Binomial Distribution
Example: Suppose that 100 records are randomly chosen from the data warehouse, and that 12 of these have a particular type of error. Then an estimate of p, the proportion of records in the entire data warehouse that have this type of error, is given by: p̂ = 12/100 = 0.12.

46 The Poisson Distribution
Another distribution of interest is called the Poisson distribution. The Poisson distribution is used to model the number of occurrences of an event that is relatively rare in some unit of time or space. The following might be modeled by Poisson distributions: The number of stacking marks per month’s production of cups; The number of customer returns of a given type of product, reported weekly; The number of OSHA recordable injuries per 100,000 man hours; The number of defects in a large casting.

47 The Poisson Distribution
The parameter of interest for a Poisson distribution is the average number of occurrences in the unit of time or space: for example, the population mean number of errors of a specific type entering the data warehouse daily, or the population mean number of defects in large castings. This theoretical mean is denoted by “c”, for “count”. A sample can be used to estimate c. The estimate is simply the average of the counts in the sample. For example, if the numbers of errors for five randomly chosen days are 8, 5, 6, 4, and 7, then c is estimated by (8 + 5 + 6 + 4 + 7)/5 = 6.
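The estimate of c and the resulting Poisson probabilities can be sketched with the standard Poisson formula P(k) = c^k · e^(−c) / k!, using the five daily counts from the example:

```python
import math

def poisson_pmf(k, c):
    # P(k occurrences) when occurrences average c per unit of time or space
    return c**k * math.exp(-c) / math.factorial(k)

counts = [8, 5, 6, 4, 7]           # errors on five sampled days (from the example)
c_hat = sum(counts) / len(counts)  # estimate of c
print(c_hat)                       # 6.0
print(poisson_pmf(8, c_hat))       # estimated chance of 8 errors on a given day
```

With c estimated at 6, seeing 8 errors on a given day has a probability of roughly 0.10.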

48 The Empirical Rule The Empirical Rule provides an estimate of the proportion of data values falling within a certain distance of the mean. The Empirical Rule states that, if a frequency distribution is approximately symmetric and mounded in shape, then: Approximately 68% of all values will fall within one standard deviation of the mean. Approximately 95% will fall within two standard deviations of the mean. Nearly 100% will fall within 3 standard deviations of the mean. The Empirical Rule is derived from the probabilities associated with a normal random variable.
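Since the Empirical Rule is derived from the normal distribution, the three percentages can be checked directly against a standard normal curve:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1
for k in (1, 2, 3):
    # Fraction of a normal population within k standard deviations of the mean.
    coverage = Z.cdf(k) - Z.cdf(-k)
    print(k, round(100 * coverage, 2))
```

The output reproduces the exact normal figures quoted on the next slide: 68.27%, 95.45%, and 99.73%.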

49 The Empirical Rule We repeat the graph shown earlier: the normal curve with 68.27% coverage within one standard deviation of the mean, 95.45% within two, and 99.73% within three.

