Measure Phase Six Sigma Statistics



Presentation on theme: "Measure Phase Six Sigma Statistics"— Presentation transcript:

1 Measure Phase Six Sigma Statistics
Now we will continue in the Measure Phase with “Six Sigma Statistics”.

2 Six Sigma Statistics Welcome to Measure Process Discovery
Topics: Process Discovery, Six Sigma Statistics, Basic Statistics, Descriptive Statistics, Normal Distribution, Assessing Normality, Graphing Techniques, Special Cause / Common Cause, Measurement System Analysis, Process Capability, Wrap Up & Action Items. In this module you will learn how your processes speak to you in the form of data. If you are to understand the behaviors of your processes, you must learn to communicate with them in the language of data. The field of statistics provides the tools and techniques to act on data: to turn data into information and knowledge, which you will then use to make decisions and to manage your processes. The statistical tools and methods that you need to understand and optimize your processes are not difficult, and Excel spreadsheets and dedicated statistical software have made applying them a relatively easy task. In this module you will learn basic, yet powerful, analytical approaches and tools to increase your ability to solve problems and manage process behavior.

3 Purpose of Basic Statistics
The purpose of Basic Statistics is to: Provide a numerical summary of the data being analyzed. Provide the basis for making inferences about the future. Provide the foundation for assessing process capability. Provide a common language to be used throughout an organization to describe processes. Data (n): factual information organized for analysis; numerical or other information represented in a form suitable for processing by computer; values from scientific experiments. Statistics is the basic language of Six Sigma. A solid understanding of Basic Statistics is the foundation upon which many of the subsequent tools will be based, so having an understanding of Basic Statistics can be quite valuable. Statistics, like anything, can be taken to the extreme, but that is not the need or the intent of this course, nor is it the intent of Six Sigma. It can be stated that Six Sigma does not make people into statisticians; rather, it makes people into excellent problem solvers by using applicable statistical techniques. Data is like crude oil that comes out of the ground: crude oil by itself is not of much use, but if it is refined, many useful products result, such as medicines, fuel, food products and lubricants. In a similar sense, statistics can refine data into usable "products" that aid in decision making and help us see and understand what is happening. Statistics is broadly used by just about everyone today; sometimes we just don't realize it. Something as simple as using a graph to better understand data is a form of statistics, as are the many opinion and political polls used today. With easy-to-use software tools reducing the difficulty and time needed to do statistical analyses, knowledge of statistics is becoming a common capability. An understanding of Basic Statistics is also one of the differentiating features of Six Sigma, and it would not be possible without the use of computers and programs like SigmaXL®.
It has been observed that the laptop is one of the primary reasons that Six Sigma has become both popular and effective. Relax….it won’t be that bad!

4 Statistical Notation – Cheat Sheet
Σ : Summation
s : The Standard Deviation of sample data
σ : The Standard Deviation of population data
s² : The variance of sample data
σ² : The variance of population data
R : The range of data
R̄ : The average range of data
k : Multi-purpose notation, i.e. # of subgroups, # of classes
|x| : The absolute value of some term
>, < : Greater than, less than
≥, ≤ : Greater than or equal to, less than or equal to
xᵢ : An individual value, an observation
x₁ : A particular (1st) individual value
∀ : For each, all, individual values
x̄ : The Mean, average of sample data
x̿ : The grand Mean, grand average
µ : The Mean of population data
p : A proportion of sample data
P : A proportion of population data
n : Sample size
N : Population size
Use this as a cheat sheet; don't bother memorizing all of it. Note that most of the Greek notation refers to population data.

5 Parameters vs. Statistics
Population: All the items that have the "property of interest" under study. Frame: An identifiable subset of the population. Sample: A significantly smaller subset of the population used to make an inference. The purpose of sampling is to get a "sufficiently accurate" inference for considerably less time, money and other resources, and also to provide a basis for statistical inference; if sampling is done well, and sufficiently, then the inference is that what we see in the sample is representative of the population. A population parameter is a numerical value that summarizes the data for an entire population; a sample has a corresponding numerical value called a statistic. The population is a collection of all the individual data of interest. It must be defined carefully, such as all the trades completed in 2001. If for some reason there are unique subsets of trades, it may be appropriate to define those as a unique population, such as "all sub custodial market trades completed in 2001" or "emerging market trades". Sampling frames are complete lists and should be identical to a population, with every element listed only once. A frame sounds very similar to a population, and it is; the difference is how it is used. A sampling frame, such as the list of registered voters, could be used to represent the population of the adult general public. Maybe there are reasons why this wouldn't be a good sampling frame; perhaps a sampling frame of licensed drivers would be a better frame to represent the general public. The sampling frame is the source from which a sample is drawn. It is important to recognize the difference between a sample and a population, because we typically are dealing with a sample of what the potential population could be in order to make an inference. The formulas for describing samples and populations are slightly different. In most cases we will be dealing with the formulas for samples.
Population Parameters: Arithmetic descriptions of a population (µ, σ, P, σ², N). Sample Statistics: Arithmetic descriptions of a sample (x̄, s, p, s², n).

6 Attribute Data (Qualitative)
Types of Data. Attribute Data (Qualitative): is always binary; there are only two possible values (0, 1), e.g. Yes/No, Go/No go, Pass/Fail. Variable Data (Quantitative): Discrete (Count) Data can be categorized in a classification and is based on counts, e.g. number of defects, number of defective units, number of customer returns. Continuous Data can be measured on a continuum; it has decimal subdivisions that are meaningful, e.g. time, pressure, conveyor speed, material feed rate, money. The nature of data is important to understand, because the type of data determines which analyses are available to you. Data, or numbers, are usually abundant and available to virtually everyone in the organization. Using data to measure, analyze, improve and control processes forms the foundation of the Six Sigma methodology. Data turned into information, then transformed into knowledge, lowers the risks of decisions. Your goal is to make more decisions based on data versus the typical practices of "I think," "I feel," and "In my opinion". One of your first steps in refining data into information is to recognize the type of data you are using. There are two primary types of data: attribute and variable data. Attribute Data is also called qualitative data. Attribute Data is the lowest level of data; it is purely binary in nature: good or bad, yes or no. No analysis can be performed on Attribute Data; it must be converted to a form of variable data called discrete data in order to be counted or be useful. Discrete Data is information that can be categorized into a classification. Discrete Data is based on counts, typically things counted in whole numbers. Discrete Data is data that can't be broken down into a smaller unit to add additional meaning. Only a finite number of values is possible, and the values cannot be subdivided meaningfully.
For example, there is no such thing as half a defect or half a system lockup. Continuous Data is information that can be measured on a continuum or scale. Continuous Data, also called quantitative data, can have almost any numeric value and can be meaningfully subdivided into finer and finer increments, depending upon the precision of the measurement system. Decimal subdivisions are meaningful with Continuous Data. As opposed to Attribute Data (good or bad, off or on, etc.), Continuous Data can be recorded at many different points (length, size, width, time, temperature, cost, etc.). For example, a fractional number of inches is a meaningful number, whereas a fractional number of defects does not make sense. Later in the course we will study many different statistical tests, but it is first important to understand what kind of data you have.

7 Discrete Variables
Discrete Variable: Possible Values for the Variable
The number of defective needles in boxes of 100 diabetic syringes: 0, 1, 2, …, 100
The number of individuals in groups of 30 with a Type A personality: 0, 1, 2, …, 30
The number of surveys returned out of 300 mailed in a customer satisfaction study: 0, 1, 2, …, 300
The number of employees in 100 having finished high school or obtained a GED: 0, 1, 2, …, 100
The number of times you need to flip a coin before a head appears for the first time: 1, 2, 3, … (note, there is no upper limit because you might need to flip forever before the first head appears)
Shown here are additional Discrete Variables. Can you think of others within your business?

8 Continuous Variables
Continuous Variable: Possible Values for the Variable
The length of prison time served for individuals convicted of first degree murder: all the real numbers between a and b, where a is the smallest amount of time served and b is the largest.
The household income for households with incomes less than or equal to $30,000: all the real numbers between a and $30,000, where a is the smallest household income in the population.
The blood glucose reading for those individuals having glucose readings equal to or greater than 200: all real numbers between 200 and b, where b is the largest glucose reading in all such individuals.
Shown here are additional Continuous Variables. Can you think of others within your business?

9 Definitions of Scaled Data
Understanding the nature of data and how to represent it can affect the types of statistical tests possible. Nominal Scale – data consists of names, labels, or categories. Cannot be arranged in an ordering scheme. No arithmetic operations are performed for nominal data. Ordinal Scale – data is arranged in some order, but differences between data values either cannot be determined or are meaningless. Interval Scale – data can be arranged in some order and for which differences in data values are meaningful. The data can be arranged in an ordering scheme and differences can be interpreted. Ratio Scale – data that can be ranked and for which all arithmetic operations including division can be performed. (division by zero is of course excluded) Ratio level data has an absolute zero and a value of zero indicates a complete absence of the characteristic of interest. Shown here are the four types of scales. It is important to understand these scales as they will dictate the type of statistical analysis that can be performed on your data.

10 Nominal Scale
Qualitative Variable: Possible nominal level data values for the variable
Blood Types: A, B, AB, O
State of Residence: Alabama, …, Wyoming
Country of Birth: United States, China, other
Listed are some examples of Nominal Data. The only analysis is whether they are different or not. Time to weigh in!

11 Ordinal Scale
Qualitative Variable: Possible Ordinal level data values
Automobile Sizes: Subcompact, compact, intermediate, full size, luxury
Product rating: Poor, good, excellent
Baseball team classification: Class A, Class AA, Class AAA, Major League
These are examples of Ordinal Data.

12 Interval Scale
Interval Variable: Possible Scores
IQ scores of students in BlackBelt Training: 100, … (the difference between scores is measurable and has meaning, but a difference of 20 points between 100 and 120 does not indicate that one student is 1.2 times more intelligent)
These are examples of Interval Data.

13 Ratio Scale
Ratio Variable: Possible Scores
Grams of fat consumed per adult in the United States: 0, … (If person A consumes 25 grams of fat and person B consumes 50 grams, we can say that person B consumes twice as much fat as person A. If person C consumes zero grams of fat per day, we can say there is a complete absence of fat consumed on that day. Note that a ratio is interpretable and an absolute zero exists.)
Shown here is an example of Ratio Data.

14 Converting Attribute Data to Continuous Data
Continuous Data is always more desirable, and in many cases Attribute Data can be converted to Continuous. Which is more useful: 15 scratches, or a total scratch length of 9.25"? 22 foreign materials, or 2.5 fm/square inch? 200 defects, or 25 defects/hour? Continuous Data provides us more opportunity for statistical analyses. Attribute Data can often be converted to Continuous by converting it to a rate.
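The conversion is simple arithmetic. As a sketch (in Python rather than the course's Excel/SigmaXL tools), converting a count into a rate just divides by a continuous denominator; the 8-hour observation window below is an assumed value chosen so the slide's "200 defects or 25 defects/hour" example works out:

```python
# Convert an attribute-style count into a continuous rate.
# The 8-hour observation window is assumed for illustration.
defects = 200
hours_observed = 8.0

defect_rate = defects / hours_observed  # defects per hour, a continuous metric
print(defect_rate)
```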

15 Descriptive Statistics
Measures of Location (central tendency) Mean Median Mode Measures of Variation (dispersion) Range Interquartile Range Standard deviation Variance We will review the Descriptive Statistics shown here which are the most commonly used. 1) For each of the measures of location, how alike or different are they? 2) For each measure of variation, how alike or different are they? 3) What do these similarities or differences tell us?
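Before walking through each measure on the following slides, here is a minimal sketch of all of them at once, using Python's standard `statistics` module on a small made-up sample (the values are illustrative, not course data):

```python
import statistics as st

data = [4.98, 4.99, 5.00, 5.00, 5.00, 5.01, 5.01, 5.02]  # illustrative sample

# Measures of location (central tendency)
mean = st.mean(data)
median = st.median(data)
mode = st.mode(data)                  # most frequently occurring value

# Measures of variation (dispersion)
rng = max(data) - min(data)           # range
q1, q2, q3 = st.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                         # interquartile range
s = st.stdev(data)                    # sample standard deviation
var = st.variance(data)               # sample variance (s squared)
```

Note that `stdev` and `variance` use the sample (n - 1) formulas; `pstdev` and `pvariance` give the population versions.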

16 Descriptive Statistics
Open the workbook "Measure Data Sets.xls" and select the "Basic Statistics" worksheet. We are going to use this worksheet to create graphs and statistics. Change the Start Point and the Bin Width as shown and click Update Chart. Typically one would use the default values, but these changes produce a cleaner Histogram.

17 Measures of Location: Mean
The Mean is commonly referred to as the average: the arithmetic balance point of a distribution of data.
Population: µ = ΣXᵢ / N     Sample: x̄ = Σxᵢ / n
SigmaXL > Graphical Tools > Basic Histogram
[Histogram and descriptive statistics output. Legible values: Count = 200, Median = 5, Maximum = 5.02, Anderson-Darling A-Squared = 8.278; the remaining values were lost in transcription.]
The Mean is the most common measure of location. Strictly, "Mean" often implies that you are talking about, or inferring something about, the population, while "average" refers to the sample data; although the symbols differ, there is no mathematical difference between the Mean of a sample and the Mean of a population. To produce the chart: SigmaXL > Graphical Tools > Basic Histogram, and select Data as "Numeric Data Variable (Y)". Set Start Point to 4.97 and Bin Width to 0.01, then click Update Chart. Select Descriptive Statistics. After clicking OK, the X axis should be modified to show 2 decimal places. Note that Descriptive Statistics are also available in SigmaXL > Statistical Tools > Descriptive Statistics, and SigmaXL > Graphical Tools > Histograms & Descriptive Statistics.

18 Measures of Location: Median
The Median is the mid-point, or 50th percentile, of a distribution of data. Arrange the data from low to high, or high to low: the Median is the single middle value in the ordered list if there is an odd number of observations, and the average of the two middle values in the ordered list if there is an even number of observations.
[Same descriptive statistics output as on the previous slide: Count = 200, Median = 5.]
The physical center of a data set is the Median, and it is unaffected by large data values. This is why people use the Median when discussing average salary for an American worker; people like Bill Gates and Warren Buffett skew the average number.
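The odd/even rule and the Median's robustness to extreme values can both be seen in a few lines of Python (the salary figures are invented, purely to echo the slide's point):

```python
import statistics as st

odd = [3, 1, 2]      # odd count: the single middle value of the sorted list
even = [4, 1, 3, 2]  # even count: the average of the two middle values

assert st.median(odd) == 2      # sorted -> [1, 2, 3]
assert st.median(even) == 2.5   # sorted -> [1, 2, 3, 4], (2 + 3) / 2

# The Median ignores extreme values; the Mean does not.
salaries = [40, 45, 50, 55, 10_000]  # one "Bill Gates" salary, in $k
print(st.median(salaries))  # 50: the typical worker
print(st.mean(salaries))    # 2038.0: dragged far upward by one value
```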

19 Measures of Location: Trimmed Mean
The Trimmed Mean is a compromise between the Mean and Median. It is calculated by eliminating a specified percentage of the smallest and largest observations from the data set and then calculating the average of the remaining observations. It is useful for data with potential extreme values. To calculate a Trimmed Mean in Excel, enter the following formula into an empty cell adjacent to the data:
=TRIMMEAN(A2:A201, 0.1)
SigmaXL® does not include the Trimmed Mean, but Excel's native function can be used as shown above. The next slide will explain each part of this formula. The Trimmed Mean is less susceptible to the effects of extreme scores.

20 TRIMMEAN - This is the Excel Function to calculate a Trimmed Mean.
Array – This is your range of data. We will be taking the Trimmed Mean of the data in cells A2:A201. Percent – This is the percentage that you wish to Trim, i.e., the total percentage of the data which will be excluded. A percentage value of 0.1 will exclude 5% of the largest values and 5% of the smallest. The Mean is calculated on the remaining data. SigmaXL® does not include Trimmed Mean, but Excel’s native function can be used as shown above.
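The same calculation is easy to sketch in Python. The helper below is hypothetical (not a SigmaXL or course function) and assumes the excluded percentage is split evenly between the two tails, rounding down, in the spirit of Excel's TRIMMEAN:

```python
def trimmed_mean(values, percent):
    """Rough Python analogue of Excel's =TRIMMEAN(range, percent):
    exclude `percent` of the points in total, half from each end
    (rounded down), then average what remains."""
    data = sorted(values)
    k = int(len(data) * percent / 2)  # points dropped from EACH end
    kept = data[k:len(data) - k] if k else data
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one extreme value
plain = sum(data) / len(data)            # 14.5, dragged up by the outlier
trimmed = trimmed_mean(data, 0.2)        # drops 1 and 100, averages the rest
print(plain, trimmed)
```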

21 Measures of Location: Mode
The Mode is the most frequently occurring value in a distribution of data. It is possible to have multiple Modes; when this happens, the distribution is called Bi-modal. Here we only have one: Mode = 5.

22 Measures of Variation: Range and Interquartile Range
The Range is the difference between the largest observation and the smallest observation in the data set. A small Range indicates a small amount of variability, and a large Range a large amount of variability. The Interquartile Range is the difference between the 75th percentile and the 25th percentile. Use the Range or Interquartile Range when the data distribution is Skewed.
[Same descriptive statistics output as on the previous slides.]
The Range is typically used for small data sets; it is completely efficient in estimating variation for a sample of size 2. As your sample size increases, the Standard Deviation becomes the more appropriate measure of variation.

23 Standard Deviation is:
Measures of Variation. The Standard Deviation is the equivalent of the average deviation of values from the Mean for a distribution of data: a "unit of measure" for distances from the Mean. Use it when the data are symmetrical.
Population: σ = √( Σ(Xᵢ − µ)² / N )     Sample: s = √( Σ(xᵢ − x̄)² / (n − 1) )
We cannot calculate the population Standard Deviation here because this is sample data.
[Same descriptive statistics output as on the previous slides.]
The Standard Deviation for a sample and a population can be equated with short- and long-term variation. Usually a sample is taken over a short period of time, making it free from the types of variation that can accumulate over time, so be aware. We will explore this further at a later point in the methodology.

24 Measures of Variation: Variance
The Variance is the average squared deviation of each individual data point from the Mean.
Sample: s² = Σ(xᵢ − x̄)² / (n − 1)     Population: σ² = Σ(Xᵢ − µ)² / N
The Variance is the square of the Standard Deviation. It is common in statistical tests where it is necessary to add up sources of variation to estimate the total: Standard Deviations cannot be added, but Variances can.
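A quick Python check makes the sample/population distinction and the square relationship concrete (the data values are arbitrary):

```python
import math
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]  # arbitrary illustration; the mean is 5

# Sample statistics use the (n - 1) denominator...
s2 = st.variance(data)
s = st.stdev(data)

# ...population statistics use the n denominator.
sigma2 = st.pvariance(data)
sigma = st.pstdev(data)

# The Standard Deviation is the square root of the Variance.
assert math.isclose(s, math.sqrt(s2))
assert math.isclose(sigma, math.sqrt(sigma2))
print(sigma2, sigma)  # population variance 4, population std dev 2
```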

25 What are the characteristics of a Normal Distribution?
The Normal Distribution is the most recognized distribution in statistics. What are the characteristics of a Normal Distribution? Only random error is present; the process is free of assignable cause; the process is free of drifts and shifts. So what is present when the data are Non-normal? We can begin to discuss the Normal Curve and its properties once we understand the basic concepts of central tendency and dispersion. As we begin to assess our distributions, know that it is sometimes actually more difficult to determine what is affecting a process if the process is Normally Distributed. When we have a Non-normal Distribution, there are usually special, or more obvious, causes of variation that become readily apparent upon process investigation.

26 The Normal Curve The Normal Curve is a smooth, symmetrical, bell-shaped curve, generated by the density function. It is the most useful continuous probability model as many naturally occurring measurements such as heights, weights, etc. are approximately Normally Distributed. The Normal Distribution is the most commonly used and abused distribution in statistics and serves as the foundation of many statistical tools which will be taught later in the methodology.

27 “Standard” Normal Distribution Has a μ = 0, and σ = 1
Each combination of Mean and Standard Deviation generates a unique Normal Curve. The "Standard" Normal Distribution has μ = 0 and σ = 1. Data from any Normal Distribution can be made to fit the standard Normal by converting raw scores to standard scores. Z-scores measure how many Standard Deviations from the Mean a particular data value lies. The shape of the Normal Distribution is a function of two parameters: the Mean and the Standard Deviation. We will convert Normal Distributions to the standard Normal in order to compare various Normal Distributions and to estimate tail area proportions. Normalizing converts the raw scores into standard Z-scores with a Mean of 0 and a Standard Deviation of 1; this practice allows us to use the Z-table.

28 Normal Distribution
The area under the curve between any two points represents the proportion of the distribution between those points. Convert any raw score to a Z-score using the formula:
Z = (x − μ) / σ
Refer to a set of Standard Normal Tables to find the proportion between μ and x. The area between the Mean and any other point depends upon the Standard Deviation. The concept of determining the proportion between two points under the standard Normal curve is a critical component of estimating Process Capability and will be covered in detail in that module.
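Modern Python can stand in for the Z-table. The sketch below (with an assumed process mean of 5.00 and standard deviation of 0.01, echoing the histogram slides) converts a raw score to a Z-score and finds the proportion between μ and x:

```python
from statistics import NormalDist

mu, sigma = 5.00, 0.01  # assumed process mean and standard deviation
x = 5.02

z = (x - mu) / sigma    # Z-score: distance from the Mean in std devs

# Proportion of a Normal distribution lying between the Mean and x:
std = NormalDist()      # the standard Normal: mu = 0, sigma = 1
between = std.cdf(z) - std.cdf(0.0)
print(round(z, 2), round(between, 4))  # z = 2.0, proportion ~ 0.4772
```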

29 The Empirical Rule
The Empirical Rule allows us to predict, or more appropriately estimate, how our process is performing. You will gain a great deal of understanding within the Process Capability module. Notice the difference between +/- 1 SD and +/- 6 SD.
68.27 % of the data will fall within +/- 1 Standard Deviation
95.45 % of the data will fall within +/- 2 Standard Deviations
99.73 % of the data will fall within +/- 3 Standard Deviations
99.9937 % of the data will fall within +/- 4 Standard Deviations
99.99994 % of the data will fall within +/- 5 Standard Deviations
99.9999998 % of the data will fall within +/- 6 Standard Deviations
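These percentages are properties of the Normal cumulative distribution function and can be reproduced with Python's standard library rather than looked up (no course data involved):

```python
from statistics import NormalDist

std = NormalDist()  # standard Normal: mu = 0, sigma = 1

# P(-k*sigma < X < +k*sigma) for k = 1..6 standard deviations
pcts = {k: (std.cdf(k) - std.cdf(-k)) * 100 for k in range(1, 7)}
for k, pct in pcts.items():
    print(f"+/- {k} SD: {pct:.5f} %")
```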

30 The Empirical Rule (cont.)
No matter what the shape of your distribution is, as you travel 3 Standard Deviations from the Mean, the probability of occurrence beyond that point begins to converge to a very low number. Please read the slide.

31 Why Assess Normality?
While many processes in nature behave according to the Normal Distribution, many processes in business, particularly in the areas of service and transactions, do not. There are many types of distributions. There are also many statistical tools that assume Normal Distribution properties in their calculations, so understanding just how "Normal" the data are will impact how we look at the data. There is no good and bad: it is not always better to have "Normal" data; look at it with respect to the intent of your project. Again, there is much informational content in Non-normal Distributions, and for this reason it is useful to know how Normal our data are. Go back to your project: what do you want to do with your distribution, Normal or Non-normal? Many distributions simply cannot be Normal by nature. Assume you're dealing with a time metric: how do you get negative time without having a flux capacitor, as in the movie "Back to the Future"? If your metric is by nature bounded at some limit, the distribution cannot be fully Normal.

32 Tools for Assessing Normality
The shape of any Normal curve can be calculated based on the Normal probability density function. Tests for Normality basically compare the shape of the calculated curve to the actual distribution of your data points. For the purposes of this training, we will focus on two ways in SigmaXL® to assess Normality: the Anderson-Darling test and the Normal Probability Plot. Watch that curve! The Anderson-Darling test yields a statistical assessment of Normality (called a goodness-of-fit test), and the SigmaXL® version of the Normal Probability Plot produces a graph to visually demonstrate just how good that fit is.

33 The Anderson-Darling test uses an empirical cumulative distribution function.
Goodness-of-Fit. The Anderson-Darling test measures the departure of the actual data from the expected Normal Distribution: the Goodness-of-Fit test assesses the magnitude of these departures using an Observed minus Expected formula. In effect, the test assesses how closely the actual cumulative frequency at a given value corresponds to the theoretical cumulative frequency for a Normal Distribution with the same Mean and Standard Deviation.
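For intuition only, here is a sketch of the A-squared statistic itself in Python, using the common textbook formula with the Mean and Standard Deviation estimated from the sample. SigmaXL's implementation also produces a p-value (and may apply small-sample corrections), which this sketch omits; the two samples are invented:

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling_A2(sample):
    """A-squared statistic for Normality: compares the sample's empirical
    cumulative distribution with the Normal CDF fitted to the sample.
    Larger values mean a larger departure from Normality."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    total = 0.0
    for i, x in enumerate(xs, start=1):
        fi = fitted.cdf(x)          # expected cumulative proportion at x_i
        fj = fitted.cdf(xs[n - i])  # mirrored observation x_(n+1-i)
        total += (2 * i - 1) * (math.log(fi) + math.log(1.0 - fj))
    return -n - total / n

near_normal = [4.98, 4.99, 4.99, 5.00, 5.00, 5.00, 5.00, 5.01, 5.01, 5.02]
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 10]
print(anderson_darling_A2(near_normal))  # small: little departure
print(anderson_darling_A2(skewed))       # much larger: strong departure
```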

34 The Normal Probability Plot
Open the worksheet tab "Amount". The graph shows the cumulative probability of your data plotted against the expected cumulative probability of a Normal curve; notice that the y-axis (probability) does not increase linearly. Normal data will lie on a straight line (the black line) in this analysis. The graph shows you which values tend to deviate from the Normal Curve.
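The plot's mechanics can be sketched in Python: sort the data, assign each point a cumulative plotting position, and convert that position to a theoretical Normal quantile. The (i - 0.5)/n median-rank formula is one common convention (software packages differ), and the data below are invented:

```python
from statistics import NormalDist

data = [4.8, 5.3, 4.9, 5.1, 5.0, 5.2, 4.7, 5.4]  # invented sample
xs = sorted(data)
n = len(xs)

std = NormalDist()  # standard Normal
# One common plotting-position convention: p_i = (i - 0.5) / n
qs = [std.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for x, q in zip(xs, qs):
    print(f"{x:5.2f}  vs  theoretical z = {q:+.3f}")
# Plotting xs against qs: Normal data fall close to a straight line.
```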

35 Descriptive Statistics
The Anderson-Darling test is a good litmus test for Normality: if the P-value is more than 0.05, your data are Normal enough for most purposes. The reasoning behind the decision to assume Normality based on the P-value will be covered in the Analyze Phase; for now, just accept this as a general guideline. Open the worksheet tab "Descriptive Statistics". Using SigmaXL®'s Histograms and Descriptive Statistics tool, select Anderson Darling and click OK, then note the P-value in the output.

36 Anderson-Darling Caveat
Use the Anderson Darling column to generate these graphs. In this case, both the Histogram and the Normality Plot look very “normal”. However, because the sample size is so large, the Anderson-Darling test is very sensitive and any slight deviation from Normal will cause the P-value to be very low. Again, the topic of sensitivity will be covered in greater detail in the Analyze Phase. For now, just assume that if N > 100 and the data look Normal, then they probably are. Please read the slide.

37 If the Data Are Not Normal, Don’t Panic!
Normal Data are not common in the transactional world. There are lots of meaningful statistical tools you can use to analyze your data (more on that later). It just means you may have to think about your data in a slightly different way. Don’t touch that button! Once again, Non-normal data is NOT a bad thing, depending on the type of process / metrics you are working with. Sometimes it can even be exciting to have Non-normal data because in some ways it represents opportunities for improvements.

38 Exercise objective: To demonstrate how to test for Normality.
Normality Exercise. Exercise objective: To demonstrate how to test for Normality. Generate Normal Probability Plots and the graphical summary using the "Descriptive Statistics" tab. Use only the columns Dist A and Dist D. Answer the following quiz questions based on your analysis of this data set. Answers: 1) Is Distribution A Normal? Answer > No 2) Is Distribution D Normal? Answer > No

39 Isolating Special Causes from Common Causes
Special Cause: Variation caused by known factors that result in a non-random distribution of output. Also referred to as "Assignable Cause". Common Cause: Variation caused by unknown factors resulting in a steady but random distribution of output around the average of the data. It is the variation left over after Special Cause variation has been removed, and it typically (not always) follows a Normal Distribution. If we know that the basic structure of the data should follow a Normal Distribution, but plots of our data show otherwise, then we know the data contain Special Causes. Special Causes = Opportunity. Don't get too worried about killing all variation; get the biggest bang for your buck and start making improvements by following the methodology. Many companies today can realize BIG gains and reductions in variation by simply measuring, describing the performance and then making common-sense adjustments within the process…recall the "ground fruit"? Think about your data in terms of what it should look like, then compare it to what it does look like. See some deviation? Maybe some Special Causes at work?

40 Introduction to Graphing
The purpose of Graphing is to: Identify potential relationships between variables. Identify risk in meeting the critical needs of the Customer, Business and People. Provide insight into the nature of the X’s which may or may not control Y. Show the results of passive data collection. In this section we will cover… Box Plots Scatter Plots Dot Plots Time Series Plots Histograms Passive data collection means don’t mess with the process! We are gathering data and looking for patterns in a graphical tool. If the data is questionable, so is the graph we create from it. For now utilize the data available, we will learn a tool called Measurement System Analysis later in this phase.

41 Data Sources Data sources are suggested by many of the tools that have been covered so far: Process Map X-Y Matrix Fishbone Diagrams FMEA Examples are: 1. Time Shift Day of the week Week of the month Season of the year 2. Location/position Facility Region Office 3. Operator Training Experience Skill Adherence to procedures 4. Any other sources? Data demographics will come out of the basic Measure Phase tools such as Process Maps, X-Y Matrixes, FMEAs and Fishbones. Put your focus on the top X’s from X-Y Matrix to focus your activities.

42 Graphical Concepts
The characteristics of a good graph include: variety of data, selection of variables, graph range, information to interpret relationships, and exploration of quantitative relationships. These characteristics are critical to the graphing process. The validity of the data allows us to understand the extent of error in the data. The selection of variables impacts how we can control a specific output of a process. The type of graph will depend on the data demographics, while the range will be related to the needs of the customer. The visual analysis of the graph will qualify further investigation of the quantitative relationship between the variables.

43 The Histogram
A Histogram displays data that have been summarized into intervals. It can be used to assess the symmetry or Skewness of the data. To construct a Histogram, the horizontal axis is divided into equal intervals and a vertical bar is drawn at each interval to represent its frequency (the number of values that fall within the interval). A Histogram is a basic graphing tool that displays the relative frequency, or the number of times a measured item falls within a certain cell size. The values for the measurements are shown on the horizontal axis (in cells) and the frequency of each size is shown on the vertical axis as a bar graph. The graph illustrates the distribution of the data by showing which values occur most and least frequently. A Histogram illustrates the shape, centering and spread of the data you have. It is very easy to construct and an easy-to-use tool that you will find useful in many situations. This graph represents the data for the 20 days of arrival times at work from the previous lesson page. In many situations the data will form specific shaped distributions. One very common distribution you will encounter is the Normal Distribution, also called the bell-shaped curve for its appearance. You will learn more about distributions and what they mean throughout this course.
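The construction described above is mechanical enough to sketch in a few lines of Python. The arrival-time values (minutes past 8:00) and the bin settings are invented for illustration:

```python
from collections import Counter

# Invented arrival times, in whole minutes past 8:00, for 10 days
data = [32, 35, 37, 33, 34, 36, 35, 38, 35, 34]
start, width = 30, 2  # start point and equal bin width

# Assign each value to an interval and count the frequency per interval.
bins = Counter(int((x - start) / width) for x in data)

for b in sorted(bins):
    lo = start + b * width
    print(f"[{lo}, {lo + width}): {'#' * bins[b]}")  # text-mode bar
```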

44 Histogram Caveat All the Histograms below were generated using random samples from the “Histogram” column of the worksheet “Graphing Data”. Be careful not to determine Normality simply from a Histogram plot; if the sample size is low, the data may not look very Normal. As you can see in the SigmaXL® file, the columns used to generate the Histograms above only have 20 data points. It is easy to generate your own samples to create Histograms by using the SigmaXL® menu path: “Data Manipulation>Random Subset”
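The caveat above can be demonstrated outside SigmaXL as well. The sketch below (a Python stand-in for “Data Manipulation>Random Subset”, using made-up parameters) draws repeated 20-point subsets from a population that is genuinely Normal and shows that the sample skewness still wanders noticeably from zero — so a small sample's Histogram may look skewed even when the process is Normal.

```python
import numpy as np

rng = np.random.default_rng(42)
# A genuinely Normal population (mean and sd are arbitrary, for illustration)
population = rng.normal(loc=50, scale=5, size=10_000)

def skewness(x):
    """Simple sample skewness: mean of standardized values cubed (0 for symmetric data)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# Draw several random subsets of only 20 points, as on the slide
skews = [skewness(rng.choice(population, size=20, replace=False)) for _ in range(5)]
print([round(s, 2) for s in skews])  # small samples: skewness varies from draw to draw
```

The full population's skewness is essentially zero, yet the 20-point subsets disagree with each other — the same effect that makes the 20-point Histograms on the slide look non-Normal.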

45 Variation on a Histogram
Using the “Graphing Data” worksheet, create a simple Histogram for the data column called granular. The Histogram shown here looks to be very Normal.

46 Dot Plot The Dot Plot can be a useful alternative to the Histogram, especially if you want to see individual values or you want to brush the data. Using the “Graphing Data” tab, create a Dot Plot. The Histogram for the granular distribution obscures the granularity, whereas the Dot Plot reveals it. Points could have Special Causes associated with them. These occurrences should also be identified in the Logbook in order to assess the potential for a Special Cause related to them. You should look for potential Special Cause situations by examining the Dot Plot for both high frequencies and location. If in fact there are Special Causes (Uncontrollable Noise or procedural non-compliance), then they should be addressed separately and then excluded from this analysis. Take a few minutes and create other Dot Plots using the columns in this worksheet.

47 Box Plot Box Plots summarize data about the shape, dispersion and center of the data and also help spot outliers. Box Plots require that one of the variables, X or Y, be Categorical or Discrete and the other be Continuous. A minimum of 10 observations should be included in generating the Box Plot. (Figure labels: middle 50% of data; 25th, 50th (Median) and 75th percentiles; whiskers extend to the minimum and maximum values within 1.5 x the Interquartile Range; outliers; mean.) A Box Plot (sometimes called a Whisker Plot) is made up of a box representing the central mass of the variation and thin lines, called whiskers, extending out on either side representing the thinning tails of the distribution. Box Plots summarize information about the shape, dispersion and center of your data. Because of their concise nature, it is easy to compare multiple distributions side by side. These may be “before” and “after” views of a process or a variable, or they may be several alternative ways of conducting an operation. Essentially, when you want to quickly find out if two or more distributions are different (or the same), create a Box Plot. They can also help you spot outliers quickly, which show up as asterisks on the chart.

48 Box Plot Anatomy
Box Plot Anatomy (Figure labels: Upper Whisker; Lower Whisker; Upper Limit: Q3+1.5(Q3-Q1); Lower Limit: Q1-1.5(Q3-Q1); Q3: 75th percentile; Q2: Median, 50th percentile; Q1: 25th percentile; Box; * Outlier.) A Box Plot is based on quartiles and represents a distribution as shown on the left of the graphic. The lines extending from the box are called whiskers. The whiskers extend outward to indicate the lowest and highest values in the data set (excluding outliers). The lower whisker represents the first 25% of the data in the Histogram (the light grey area). The second and third quartiles form the box, which represents fifty percent of the data, and finally the whisker on the right represents the fourth quartile. The line drawn through the box represents the Median of the data. Extreme values, or outliers, are represented by asterisks. A value is considered an outlier if it is outside of the box (greater than Q3 or less than Q1) by more than 1.5 times (Q3-Q1). You can use the Box Plot to assess the symmetry of the data: If the data are fairly symmetric, the Median line will be roughly in the middle of the box and the whiskers will be similar in length. If the data are Skewed, the Median may not fall in the middle of the box and one whisker will likely be noticeably longer than the other.
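The quartile and outlier-limit arithmetic behind the Box Plot anatomy is easy to verify by hand. The sketch below uses NumPy's default percentile interpolation and a small made-up data set with one deliberately extreme value; it is an illustration of the 1.5 x IQR rule, not the course data.

```python
import numpy as np

# Illustrative data with one extreme value planted at the end
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]

# Q1, Q2 (Median) and Q3 -- the three quartiles that define the box
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # Interquartile Range: the width of the box

# Outlier fences: 1.5 x IQR beyond the box on either side
upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr

# Any point beyond a fence would be drawn as an asterisk on the Box Plot
outliers = [x for x in data if x < lower_limit or x > upper_limit]
print(q1, q2, q3, outliers)
```

With this data the fences sit at 9.375 and 24.375, so only the planted value 45 is flagged — matching what the asterisk on a Box Plot would show.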

49 Box Plot Examples
Box Plot Examples What can you tell about the data expressed in a Box Plot? Eat this – then check the Box Plot! The first Box Plot shows the differences in glucose level for nine different people. The second Box Plot shows the effects of cholesterol medication over time for a group of patients.

50 Box Plot Example Use the “Graphing Data” worksheet tab.

51 Box Plot Example The data shows the setup cycle time to complete “Lockout – Tagout” for three individuals in the department. Looking only at the Box Plots, it appears that Brian should be the benchmark for the department since he has the lowest Median setup cycle time with the smallest variation. On the other hand, Shree’s data has 3 outlier points that are well beyond what would be expected for the rest of the data, and his variation is larger. Be cautious drawing conclusions solely from a Box Plot. Shree may be the expert who is brought in for special setups because no one else can complete the job.

52 Multi-Vari Chart Enhancement
The Multi-Vari Chart shows the individual data points that are represented in the Box Plot. Open the workbook “Measure Data Sets” and select the “Graphing Data” tab.

53 Multi-Vari Chart The individual value plot shown above was created using SigmaXL®’s Multi-Vari Chart tool.

54 Attribute Y Box Plot
Attribute Y Box Plot Box Plot with a Continuous Y and an Attribute X (pass/fail) Open the “Graphing Data” tab. To create this Box Plot follow the SigmaXL® menu path “SigmaXL>Graphical Tools>Boxplots”. If the output is pass/fail, it must be plotted on the Y axis. Use the data shown to create the transposed Box Plot. The reason we do this is for consistency and accuracy.

55 Attribute Y Box Plot SigmaXL® does not permit transposed value and category scales so the above Box Plot shows pass/fail on the x-axis and Hydrogen Content on the Y-axis.

56 Individual Value Plot
Individual Value Plot The Multi-Vari Chart, when used with a Categorical X or Y, enhances the information provided in the Box Plot: Recall the inherent problem with the Box Plot when a bimodal distribution exists (the Box Plot looks perfectly symmetrical). The Multi-Vari Chart will highlight the problem. We will use a modified Multi-Vari Chart to display the individual points. Open the “Graphing Data” worksheet and select “Graphical Tools > Boxplots”. Select Data as the Numeric Data Variable (Y), and Distribution Type as the Group Category (X). The Multi-Vari Chart was created and modified using “Graphical Tools > Multi-Vari Options” with the same variables as the Box Plot. Note, if these options are saved they will be used in “Graphical Tools > Multi-Vari Charts”.

57 Multi-Vari Individuals
Under General Options select only “Individual Data Points”. Under Mean Options select “Show Means” and “Connect Means”. Click Finish. The Multi-Vari Chart selection will now appear. Ensure your data is selected and click Next. Select Data as your “Numeric Data Variable (Y)” and Distribution Type as your “Group Category (X1)”. Click OK. The following chart is produced. Graphical Tools > Multi-Vari Options Go to the Multi-Vari Individuals worksheet. Note, SigmaXL® does not support Jitter, a feature used to spread the individual data points.

58 Time Series Plot
Time Series Plot Run Charts allow you to examine data over time. Depending on the shape and frequency of patterns in the plot, several X’s can be identified as critical or eliminated. Use the “Graphing Data” worksheet. A Run Chart is created by following the SigmaXL® menu path “SigmaXL>Graphical Tools>Run Chart”. Run Charts, also known as Time Series Plots, are very useful in most projects. Every project should provide Run Chart data to look for frequency, magnitude and patterns. What X would cause these issues?
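SigmaXL draws the Run Chart for you, but one of the standard pattern checks it supports — counting runs about the median — can be sketched directly. A run is an unbroken stretch of consecutive points on the same side of the median; too few runs suggests a shift or trend rather than random variation. The series below is made up for illustration and is not the course data.

```python
import statistics

# Hypothetical process readings in time order -- illustrative only
series = [5, 6, 4, 7, 5, 9, 8, 9, 10, 9, 11, 10]

median = statistics.median(series)

# Classify each point as above (+1) or below (-1) the median;
# points exactly on the median are skipped, as in a standard runs count.
signs = [1 if x > median else -1 for x in series if x != median]

# A new run starts every time the sign flips
runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
print(median, runs)
```

Here twelve points produce only four runs — the later readings sit persistently above the median, the kind of sustained shift a Run Chart is meant to expose.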

59 Time Series Example
Time Series Example Looking at the Run Chart below, the response appears to be very dynamic. What other characteristic is present? The benefit of this approach to charting is you can see every data point as it is gathered over time. Some interesting occurrences can be revealed.

60 Time Series Example (Cont.)
Let’s look at some other Time Series Plots. What is happening within each plot? What is different between the two plots? Use the “Graphing Data” worksheet. Now let’s lay two Time Series on top of each other. This can be done by following the SigmaXL® menu path “Graphical Tools > Overlay Run Chart” (use variables Time 2 and Time 3). What is happening within each plot? What’s the difference between the two plots? Time 3 appears to have a wave pattern. Note, SigmaXL® does not include Lowess Smoothing; however, an advanced user could utilize exponential smoothing in Excel’s Data Analysis ToolPak.
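The exponential smoothing mentioned above (available in Excel’s Data Analysis ToolPak) is simple enough to sketch directly. Each smoothed value blends the newest observation with the previous smoothed value, damping noise so underlying patterns like drift or waves stand out. The series and the alpha value below are made up for illustration.

```python
def exponential_smooth(series, alpha=0.3):
    """Single exponential smoothing: each smoothed value is a weighted
    blend of the new observation (weight alpha) and the previous
    smoothed value (weight 1 - alpha)."""
    smoothed = [series[0]]  # seed with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Illustrative noisy series with an underlying upward drift
raw = [10, 12, 9, 13, 11, 15, 14, 16, 15, 18]
print([round(s, 2) for s in exponential_smooth(raw, alpha=0.5)])
```

A smaller alpha smooths more aggressively (slower to react); a larger alpha tracks the raw data more closely — the same trade-off the ToolPak exposes through its damping factor.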

61 Summary
Summary At this point, you should be able to: Explain the various statistics used to express location and spread of data Describe characteristics of a Normal Distribution Explain Special Cause variation Use data to generate various graphs and make interpretations based on their output Please read the slide.

62 The Certified Lean Six Sigma Green Belt Assessment
The Certified Lean Six Sigma Green Belt Assessment The Certified Lean Six Sigma Green Belt (CLSSGB) tests are useful for assessing Green Belts’ knowledge of Lean Six Sigma. The CLSSGB can be used in preparation for the ASQ or IASSC Certified Six Sigma Green Belt (CSSGB) exam or for any number of other certifications, including private company certifications. The Lean Six Sigma Green Belt Course Manual Open Source Six Sigma Course Manuals are professionally designed and formatted manuals used by Belts during training and as reference guides afterwards. The OSSS manuals complement the OSSS Training Materials and consist of slide content, instructional notes, data sets and templates. Get the latest products at…

