# Measure Phase Six Sigma Statistics

Now we will continue in the Measure Phase with “Six Sigma Statistics”.

Six Sigma Statistics

The Measure Phase covers:

- Descriptive Statistics
- Normal Distribution
- Assessing Normality
- Graphing Techniques
- Basic Statistics
- Special Cause / Common Cause
- Wrap Up & Action Items
- Process Capability
- Measurement System Analysis
- Six Sigma Statistics
- Process Discovery
- Welcome to Measure

In this module you will learn how your processes speak to you in the form of data. If you are to understand the behavior of your processes, you must learn to communicate with them in the language of data. The field of statistics provides the tools and techniques to act on data, turning data into the information and knowledge you will use to make decisions and to manage your processes. The statistical tools and methods you need to understand and optimize your processes are not difficult, and Excel spreadsheets or dedicated statistical software make applying them a relatively easy task. In this module you will learn basic yet powerful analytical approaches and tools that increase your ability to solve problems and manage process behavior.

Purpose of Basic Statistics
The purpose of Basic Statistics is to:

- Provide a numerical summary of the data being analyzed.
- Provide the basis for making inferences about the future.
- Provide the foundation for assessing process capability.
- Provide a common language to be used throughout an organization to describe processes.

Data (n): factual information organized for analysis; numerical or other information represented in a form suitable for processing by computer; values from scientific experiments.

Statistics is the basic language of Six Sigma. A solid understanding of basic statistics is the foundation upon which many of the subsequent tools are built, and it can be quite valuable to an individual. Statistics, like anything, can be taken to the extreme, but that is neither the need nor the intent of this course, nor of Six Sigma. Six Sigma does not make people into statisticians; rather, it makes people into excellent problem solvers who use applicable statistical techniques.

Data is like crude oil that comes out of the ground. Crude oil by itself is not of much use, but if it is refined, many useful products result, such as medicines, fuel, food products, and lubricants. In a similar sense, statistics can refine data into usable “products” that aid decision making and help you see and understand what is happening.

Statistics is broadly used by just about everyone today; sometimes we just don’t realize it. Something as simple as using a graph to better understand data is a form of statistics, as are the many opinion and political polls conducted today. With easy-to-use software reducing the difficulty and time needed for statistical analyses, knowledge of statistics is becoming a common capability. An understanding of basic statistics is also one of the differentiating features of Six Sigma, and it would not be practical without computers and programs like MINITAB™. It has been observed that the laptop is one of the primary reasons Six Sigma has become both popular and effective. Relax… it won’t be that bad!

Statistical Notation – Cheat Sheet
| Notation | Meaning |
| --- | --- |
| Σ | Summation |
| s | The Standard Deviation of sample data |
| σ | The Standard Deviation of population data |
| s² | The variance of sample data |
| σ² | The variance of population data |
| R | The range of data |
| R̄ | The average range of data |
| k | Multi-purpose notation, i.e. # of subgroups, # of classes |
| \| \| | The absolute value of some term |
| >, < | Greater than, less than |
| ≥, ≤ | Greater than or equal to, less than or equal to |
| xᵢ | An individual value, an observation |
| x₁ | A particular (1st) individual value |
| ∀ | For each, all, individual values |
| x̄ | The mean, average of sample data |
| x̿ | The grand mean, grand average |
| μ | The mean of population data |
| p | A proportion of sample data |
| P | A proportion of population data |
| n | Sample size |
| N | Population size |

Use this as a cheat sheet; don’t bother memorizing all of it. Note that most of the Greek-letter notation refers to population data.

Parameters vs. Statistics
- Population: all the items that have the “property of interest” under study.
- Frame: an identifiable subset of the population.
- Sample: a significantly smaller subset of the population used to make an inference.

The purpose of sampling is to get a “sufficiently accurate” inference for considerably less time, money, and other resources, and to provide a basis for statistical inference: if sampling is done well, and sufficiently, then what we see in the sample is representative of the population.

A population parameter is a numerical value that summarizes the data for an entire population; a sample has a corresponding numerical value called a statistic. The population is the collection of all the individual data of interest. It must be defined carefully, such as all the trades completed in a given year. If for some reason there are unique subsets of trades, it may be appropriate to define those as unique populations, such as “all sub-custodial market trades completed in 2001” or “emerging market trades”.

Sampling frames are complete lists and should be identical to a population, with every element listed only once. A frame sounds very similar to a population, and it is; the difference is how it is used. A sampling frame, such as the list of registered voters, could be used to represent the population of the adult general public. Maybe there are reasons why this wouldn’t be a good sampling frame; perhaps a frame of licensed drivers would better represent the general public. The sampling frame is the source from which a sample is drawn.

It is important to recognize the difference between a sample and a population, because we typically deal with a sample of the potential population in order to make an inference. The formulas for describing samples and populations are slightly different; in most cases we will be dealing with the formulas for samples.
Population Parameters: arithmetic descriptions of a population (μ, σ, P, σ², N). Sample Statistics: arithmetic descriptions of a sample (x̄, s, p, s², n).
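The parameter/statistic distinction can be illustrated with Python’s standard `statistics` module; the cycle-time numbers below are made up for illustration:

```python
import statistics

# Hypothetical "population": cycle times (minutes) for every item of interest.
population = [12, 15, 11, 14, 16, 13, 15, 12, 14, 18]

# A sample drawn from that population (a subset, chosen here by hand).
sample = [15, 11, 14, 16, 13]

# Population parameters (Greek letters: mu, sigma) describe the whole population.
mu = statistics.mean(population)
sigma = statistics.pstdev(population)   # population formula: divide by N

# Sample statistics (Latin letters: x-bar, s) describe only the sample.
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)            # sample formula: divide by n - 1

print(mu, sigma)     # the parameters
print(x_bar, s)      # the statistics used to infer the parameters
```

Note the two different functions: `pstdev` divides by N while `stdev` divides by n − 1, matching the slightly different formulas for populations and samples mentioned above.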

Types of Data

Attribute Data (Qualitative)
- Is always binary: there are only two possible values (0, 1)
- Examples: Yes / No, Go / No-go, Pass / Fail

Variable Data (Quantitative)
- Discrete (Count) Data: can be categorized in a classification and is based on counts.
  - Number of defects
  - Number of defective units
  - Number of customer returns
- Continuous Data: can be measured on a continuum; it has decimal subdivisions that are meaningful.
  - Time, Pressure, Conveyor Speed, Material feed rate, Money

The nature of your data is important to understand, because the type of data determines which analyses are available to you. Data, or numbers, are usually abundant and available to virtually everyone in the organization. Using data to measure, analyze, improve, and control processes forms the foundation of the Six Sigma methodology. Data turned into information, then transformed into knowledge, lowers the risk of decisions. Your goal is to make more decisions based on data rather than the typical practices of “I think,” “I feel,” and “In my opinion.”

One of your first steps in refining data into information is to recognize the type of data you are using. There are two primary types: attribute data and variable data. Attribute Data, also called qualitative data, is the lowest level of data. It is purely binary in nature: good or bad, yes or no. No analysis can be performed on Attribute Data; it must be converted to a form of variable data called discrete data in order to be counted or be useful.

Discrete Data is information that can be categorized into a classification. It is based on counts, typically things counted in whole numbers, and cannot be broken down into a smaller unit to add additional meaning. Only a finite number of values is possible, and the values cannot be subdivided meaningfully. For example, there is no such thing as half of a defect or half of a system lockup.

Continuous Data is information that can be measured on a continuum or scale. Continuous Data, also called quantitative data, can take almost any numeric value and can be meaningfully subdivided into finer and finer increments, depending upon the precision of the measurement system. Decimal subdivisions are meaningful with Continuous Data. As opposed to Attribute Data like good or bad, off or on, Continuous Data can be recorded at many different points (length, size, width, time, temperature, cost, etc.). For example, 1.5 inches is a meaningful number, whereas 1.5 defects does not make sense. Later in the course we will study many different statistical tests, but it is first important to understand what kind of data you have.

Discrete Variables

| Discrete Variable | Possible values for the variable |
| --- | --- |
| The number of defective needles in boxes of 100 diabetic syringes | 0, 1, 2, …, 100 |
| The number of individuals in groups of 30 with a Type A personality | 0, 1, 2, …, 30 |
| The number of surveys returned out of 300 mailed in a customer satisfaction study | 0, 1, 2, …, 300 |
| The number of employees in 100 having finished high school or obtained a GED | 0, 1, 2, …, 100 |
| The number of times you need to flip a coin before a head appears for the first time | 1, 2, 3, … (there is no upper limit: you might need to flip forever before the first head appears) |

Shown here are additional Discrete Variables. Can you think of others within your business?

Continuous Variables

| Continuous Variable | Possible values for the variable |
| --- | --- |
| The length of prison time served for individuals convicted of first degree murder | All real numbers between a and b, where a is the smallest amount of time served and b is the largest |
| The household income for households with incomes less than or equal to $30,000 | All real numbers between a and $30,000, where a is the smallest household income in the population |
| The blood glucose reading for those individuals having glucose readings equal to or greater than 200 | All real numbers between 200 and b, where b is the largest glucose reading in all such individuals |

Shown here are additional Continuous Variables. Can you think of others within your business?

Definitions of Scaled Data

Understanding the nature of data and how to represent it can affect the types of statistical tests possible.

- Nominal Scale: data consists of names, labels, or categories that cannot be arranged in an ordering scheme. No arithmetic operations are performed on nominal data.
- Ordinal Scale: data is arranged in some order, but differences between data values either cannot be determined or are meaningless.
- Interval Scale: data can be arranged in some order, and differences in data values are meaningful and can be interpreted.
- Ratio Scale: data that can be ranked and for which all arithmetic operations, including division, can be performed (division by zero is of course excluded). Ratio-level data has an absolute zero, and a value of zero indicates a complete absence of the characteristic of interest.

Shown here are the four types of scales. It is important to understand them, as they dictate the type of statistical analysis that can be performed on your data.

Nominal Scale

| Qualitative Variable | Possible nominal level data values |
| --- | --- |
| Blood Types | A, B, AB, O |
| State of Residence | Alabama, …, Wyoming |
| Country of Birth | United States, China, other |

Listed are some examples of Nominal Data. The only analysis possible is whether the values are different or not. Time to weigh in!

Ordinal Scale

| Qualitative Variable | Possible ordinal level data values |
| --- | --- |
| Automobile sizes | Subcompact, compact, intermediate, full size, luxury |
| Product rating | Poor, good, excellent |
| Baseball team classification | Class A, Class AA, Class AAA, Major League |

These are examples of Ordinal Data.

Interval Scale

| Interval Variable | Possible scores |
| --- | --- |
| IQ scores of students in Black Belt Training | 100, … (the difference between scores is measurable and has meaning, but a difference of 20 points between 100 and 120 does not indicate that one student is 1.2 times more intelligent) |

These are examples of Interval Data.

Ratio Scale

| Ratio Variable | Possible scores |
| --- | --- |
| Grams of fat consumed per adult in the United States | 0, … (If person A consumes 25 grams of fat and person B consumes 50 grams, person B consumes twice as much fat as person A. If person C consumes zero grams of fat per day, there is a complete absence of fat consumed that day. Note that a ratio is interpretable and an absolute zero exists.) |

Shown here is an example of Ratio Data.

Converting Attribute Data to Continuous Data

Continuous Data is always more desirable, and in many cases Attribute Data can be converted to Continuous. Which is more useful?

- 15 scratches, or a total scratch length of 9.25 inches?
- 22 foreign materials, or 2.5 fm per square inch?
- 200 defects, or 25 defects per hour?

Continuous Data provides more opportunity for statistical analyses. Attribute Data can often be converted to Continuous Data by converting it to a rate.
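A sketch of the rate conversion in Python; the counts and panel areas below are invented for illustration:

```python
# Hypothetical inspection results: foreign-material counts per inspected panel,
# plus the inspected area of each panel in square inches.
fm_counts = [22, 15, 30]
areas_sq_in = [8.0, 6.0, 10.0]

# Convert raw counts into a continuous rate: foreign materials per square inch.
rates = [count / area for count, area in zip(fm_counts, areas_sq_in)]
print(rates)
```

The first panel works out to 22 / 8.0 = 2.75 fm per square inch, the continuous form of the “22 foreign materials” count above.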

Descriptive Statistics

Measures of Location (central tendency):
- Mean
- Median
- Mode

Measures of Variation (dispersion):
- Range
- Interquartile Range
- Standard Deviation
- Variance

We will review the descriptive statistics shown here, which are the most commonly used. 1) For each of the measures of location, how alike or different are they? 2) For each measure of variation, how alike or different are they? 3) What do these similarities or differences tell us?
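All of these measures are available in Python’s standard `statistics` module; the data set below is made up for illustration:

```python
import statistics

data = [5, 3, 5, 7, 9, 5, 4, 6, 8, 5]   # hypothetical sample

# Measures of location (central tendency)
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of variation (dispersion)
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles
iqr = q3 - q1                                  # interquartile range
stdev = statistics.stdev(data)                 # sample standard deviation
variance = statistics.variance(data)           # sample variance

print(mean, median, mode)
print(data_range, iqr, stdev, variance)
```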

Descriptive Statistics
Open the MINITAB™ project “Measure Data Sets.mpj” and select the worksheet “basicstatistics.mtw”. We are going to use this worksheet to create graphs and statistics.

Measures of Location: Mean

The Mean is:
- Commonly referred to as the average.
- The arithmetic balance point of a distribution of data.

Sample: x̄ = (Σ xᵢ) / n        Population: μ = (Σ xᵢ) / N

Stat > Basic Statistics > Display Descriptive Statistics… > Graphs… > Histogram of data, with normal curve

(MINITAB session window output: Descriptive Statistics for the Data variable, listing N, N*, Mean, SE Mean, StDev, Minimum, Q1, Median, Q3, and Maximum.)

The Mean is the most common measure of location. “Mean” implies that you are talking about, or inferring something about, the population; “average” implies something about sample data. Although the symbols differ, there is no mathematical difference between the Mean of a sample and the Mean of a population.

Measures of Location: Median

The Median is:
- The mid-point, or 50th percentile, of a distribution of data.
- Found by arranging the data from low to high (or high to low): it is the single middle value if there is an odd number of observations, or the average of the two middle values if there is an even number of observations.

The Median is the physical center of a data set and is unaffected by large data values. This is why people use the Median when discussing the average salary of an American worker: people like Bill Gates and Warren Buffett skew the mean.

Measures of Location: Trimmed Mean

The Trimmed Mean is:
- A compromise between the Mean and Median.
- Calculated by eliminating a specified percentage of the smallest and largest observations from the data set and then averaging the remaining observations.
- Useful for data with potential extreme values.

Stat > Basic Statistics > Display Descriptive Statistics… > Statistics… > Trimmed Mean

(The MINITAB output adds a TrMean column to the usual descriptive statistics.)

The Trimmed Mean is less susceptible to the effects of extreme scores.
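Python’s standard library has no trimmed mean (SciPy offers `scipy.stats.trim_mean`), but a minimal sketch is easy to write; the 10% trim proportion and the data set are assumptions for illustration:

```python
import statistics

def trimmed_mean(values, proportion=0.10):
    """Drop `proportion` of the observations from each tail, then average the rest."""
    data = sorted(values)
    k = int(len(data) * proportion)          # how many to drop from each end
    kept = data[k:len(data) - k] if k else data
    return sum(kept) / len(kept)

data = [2, 4, 5, 5, 6, 6, 7, 8, 9, 40]       # one extreme value: 40

print(statistics.mean(data))    # pulled upward by the extreme value
print(trimmed_mean(data))       # closer to the bulk of the data
```

Here the ordinary mean is 9.2 while the 10% trimmed mean (dropping the 2 and the 40) is 6.25, showing how trimming blunts the effect of an extreme score.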

Measures of Location: Mode

The Mode is the most frequently occurring value in a distribution of data. It is possible to have multiple Modes; when this happens the distribution is called bi-modal. Here we have only one: Mode = 5.

Measures of Variation: Range and Interquartile Range

The Range is the difference between the largest observation and the smallest observation in the data set. A small range indicates a small amount of variability; a large range, a large amount of variability.

The Interquartile Range is the difference between the 75th percentile and the 25th percentile.

The Range is typically used for small data sets; it is completely efficient in estimating variation for a sample of 2. As your data set grows, the Standard Deviation becomes a more appropriate measure of variation. Use the Range or Interquartile Range when the data distribution is skewed.

Measures of Variation: Standard Deviation

The Standard Deviation is:
- The equivalent of the average deviation of values from the Mean for a distribution of data.
- A “unit of measure” for distances from the Mean.
- Used when data are symmetrical.

Sample: s = √( Σ(xᵢ − x̄)² / (n − 1) )        Population: σ = √( Σ(xᵢ − μ)² / N )

The Standard Deviation of a sample and of a population can be equated with short-term and long-term variation. A sample is usually taken over a short period of time, making it free from the types of variation that can accumulate over time, so be aware. We will explore this further at a later point in the methodology. We cannot calculate the population Standard Deviation here because this is sample data.

Measures of Variation: Variance

The Variance is the average squared deviation of each individual data point from the Mean.

Sample: s² = Σ(xᵢ − x̄)² / (n − 1)        Population: σ² = Σ(xᵢ − μ)² / N

The Variance is the square of the Standard Deviation. It is common in statistical tests where it is necessary to add up sources of variation to estimate the total: Standard Deviations cannot be added, but Variances can.
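The additivity of variances can be checked with a quick simulation; the two sigma values are arbitrary, and the two sources are assumed independent:

```python
import random
import statistics

random.seed(1)

# Two independent hypothetical sources of variation, e.g. part-to-part
# variation (sigma = 3) and measurement error (sigma = 4).
a = [random.gauss(0, 3) for _ in range(100_000)]
b = [random.gauss(0, 4) for _ in range(100_000)]
total = [x + y for x, y in zip(a, b)]

# Variances add for independent sources: 3**2 + 4**2 = 25, so the combined
# standard deviation comes out close to sqrt(25) = 5, not 3 + 4 = 7.
print(statistics.pstdev(total))
```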

The Normal Distribution

The Normal Distribution is the most recognized distribution in statistics. What are the characteristics of a Normal Distribution?
- Only random error is present
- The process is free of assignable cause
- The process is free of drifts and shifts

So what is present when the data are Non-normal? Once we understand the basic concepts of central tendency and dispersion, we can begin to discuss the Normal Curve and its properties. As we assess distributions, know that it is sometimes actually more difficult to determine what is affecting a process when it is Normally Distributed. When we have a Non-normal Distribution, there are usually special or more obvious causes of variation that become readily apparent upon process investigation.

The Normal Curve

The normal curve is a smooth, symmetrical, bell-shaped curve generated by the density function. It is the most useful continuous probability model, as many naturally occurring measurements such as heights and weights are approximately Normally Distributed. The Normal Distribution is the most commonly used (and abused) distribution in statistics and serves as the foundation of many statistical tools that will be taught later in the methodology.

The “Standard” Normal Distribution

Each combination of Mean and Standard Deviation generates a unique normal curve. The “standard” Normal Distribution has μ = 0 and σ = 1. Data from any Normal Distribution can be made to fit the standard Normal by converting raw scores to standard scores. Z-scores measure how many Standard Deviations from the Mean a particular data value lies.

The shape of the Normal Distribution is a function of two parameters: the Mean and the Standard Deviation. We convert a Normal Distribution to the standard Normal in order to compare various Normal Distributions and to estimate tail-area proportions. Normalizing converts raw scores into standard Z-scores with a Mean of 0 and a Standard Deviation of 1, which allows us to use the Z-table.
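The conversion can be sketched with Python’s `statistics.NormalDist`; the mean and standard deviation below are assumed values for illustration:

```python
from statistics import NormalDist

mu, sigma = 100, 15     # hypothetical normal process
x = 130                 # a raw score from that process

# Z-score: how many standard deviations x lies from the mean.
z = (x - mu) / sigma
print(z)

# The raw score under N(mu, sigma) and the Z-score under the standard
# normal N(0, 1) cut off the same cumulative area.
raw_area = NormalDist(mu, sigma).cdf(x)
std_area = NormalDist().cdf(z)
print(raw_area, std_area)
```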

Normal Distribution

The area under the curve between any two points represents the proportion of the distribution between those points. Convert any raw score to a Z-score using the formula:

Z = (x − μ) / σ

Refer to a set of Standard Normal Tables to find the proportion between μ and x. The area between the Mean and any other point depends upon the Standard Deviation. The concept of determining the proportion between two points under the standard Normal curve is a critical component of estimating process capability and will be covered in detail in that module.
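The table lookup can be reproduced directly with `NormalDist`, which evaluates the cumulative area; the mean and standard deviation are assumptions for illustration:

```python
from statistics import NormalDist

dist = NormalDist(mu=50, sigma=5)   # hypothetical normal process

# Proportion of the distribution between two raw scores, computed as the
# difference of the cumulative areas (what a Z-table lookup gives you).
proportion = dist.cdf(55) - dist.cdf(45)   # within +/- 1 standard deviation
print(round(proportion, 4))
```

Since 45 and 55 sit one standard deviation either side of the mean, the result matches the familiar 68.27% figure.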

The Empirical Rule

The Empirical Rule allows us to predict, or more appropriately estimate, how our process is performing. You will gain a great deal of understanding within the Process Capability module. Notice the difference between +/- 1 SD and +/- 6 SD:

- 68.27% of the data will fall within +/- 1 standard deviation
- 95.45% of the data will fall within +/- 2 standard deviations
- 99.73% of the data will fall within +/- 3 standard deviations
- 99.9937% of the data will fall within +/- 4 standard deviations
- 99.99994% of the data will fall within +/- 5 standard deviations
- 99.9999998% of the data will fall within +/- 6 standard deviations
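The Empirical Rule percentages can be reproduced from the standard normal cumulative distribution function:

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

for k in range(1, 7):
    # Proportion of the distribution within +/- k standard deviations.
    within = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"+/- {k} SD: {within:.7%}")
```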