Presentation on theme: "Measure Phase Six Sigma Statistics"— Presentation transcript:
1 Measure Phase Six Sigma Statistics
Now we will continue in the Measure Phase with “Six Sigma Statistics”.
2 Six Sigma Statistics
Measure Phase roadmap: Welcome to Measure, Process Discovery, Six Sigma Statistics, Measurement System Analysis, Process Capability, Wrap Up & Action Items. Within Six Sigma Statistics: Basic Statistics, Descriptive Statistics, Normal Distribution, Assessing Normality, Graphing Techniques, Special Cause / Common Cause.
In this module you will learn how your processes speak to you in the form of data. If you are to understand the behaviors of your processes, you must learn to communicate with them in the language of data. The field of statistics provides the tools and techniques to act on data, turning data into information and knowledge which you will then use to make decisions and to manage your processes. The statistical tools and methods you will need to understand and optimize your processes are not difficult, and Excel spreadsheets or dedicated statistical software make this a relatively easy task. In this module you will learn basic yet powerful analytical approaches and tools to increase your ability to solve problems and manage process behavior.
3 Purpose of Basic Statistics
The purpose of Basic Statistics is to:
- Provide a numerical summary of the data being analyzed. Data (n): factual information organized for analysis; numerical or other information represented in a form suitable for processing by computer; values from scientific experiments.
- Provide the basis for making inferences about the future.
- Provide the foundation for assessing process capability.
- Provide a common language to be used throughout an organization to describe processes.
Statistics is the basic language of Six Sigma. A solid understanding of basic statistics is the foundation upon which many of the subsequent tools are based, and it can be quite valuable to an individual. Statistics, like anything, can be taken to the extreme, but that is not the need or the intent of this course, nor is it the intent of Six Sigma. Six Sigma does not make people into statisticians; rather, it makes people into excellent problem solvers by using applicable statistical techniques.
Data is like crude oil that comes out of the ground: crude oil by itself is not of much use, but when refined it yields many useful products such as medicines, fuel, food products, and lubricants. In a similar sense, statistics can refine data into usable “products” that aid decision making and help us see and understand what is happening.
Statistics is broadly used by just about everyone today; sometimes we just don't realize it. Something as simple as using a graph to better understand data is a form of statistics, as are the many opinion and political polls used today.
With easy-to-use software tools reducing the difficulty and time required for statistical analyses, knowledge of statistics is becoming a common capability. An understanding of Basic Statistics is also one of the differentiating features of Six Sigma, and it would not be possible without computers and programs like MINITAB™. It has been observed that the laptop is one of the primary reasons Six Sigma has become both popular and effective.
Relax… it won't be that bad!
4 Statistical Notation: Cheat Sheet
- Σ: summation
- s: the Standard Deviation of sample data
- σ: the Standard Deviation of population data
- s²: the variance of sample data
- σ²: the variance of population data
- R: the range of data
- R̄: the average range of data
- k: multi-purpose notation, e.g. # of subgroups, # of classes
- | |: the absolute value of some term
- >, <: greater than, less than
- ≥, ≤: greater than or equal to, less than or equal to
- xᵢ: an individual value, an observation
- x₁: a particular (1st) individual value
- ∀: for each, all, individual values
- x̄: the mean, average of sample data
- x̿: the grand mean, grand average
- μ: the mean of population data
- p: a proportion of sample data
- P: a proportion of population data
- n: sample size
- N: population size
Use this as a cheat sheet; don't bother memorizing all of it. Most of the Greek notation refers to population data.
5 Parameters vs. Statistics
- Population: all the items that have the “property of interest” under study.
- Frame: an identifiable subset of the population.
- Sample: a significantly smaller subset of the population used to make an inference.
The purpose of sampling is to get a “sufficiently accurate” inference for considerably less time, money, and other resources, and to provide a basis for statistical inference: if sampling is done well, and sufficiently, the inference is that what we see in the sample is representative of the population.
A population parameter is a numerical value that summarizes the data for an entire population; a sample has a corresponding numerical value called a statistic.
The population is a collection of all the individual data of interest. It must be defined carefully, such as all the trades completed in a given year. If for some reason there are unique subsets of trades, it may be appropriate to define those as their own population, such as “all sub-custodial market trades completed in 2001” or “emerging market trades”.
Sampling frames are complete lists and should be identical to a population, with every element listed only once. That sounds very similar to a population, and it is; the difference is how it is used. A sampling frame such as the list of registered voters could be used to represent the population of the adult general public. There may be reasons why this wouldn't be a good sampling frame; perhaps a frame of licensed drivers would better represent the general public. The sampling frame is the source from which a sample is drawn.
It is important to recognize the difference between a sample and a population because we typically deal with a sample of what the population could be in order to make an inference. The formulas for describing samples and populations are slightly different.
In most cases we will be dealing with the formulas for samples.
Population Parameters (arithmetic descriptions of a population): μ, σ, P, σ², N
Sample Statistics (arithmetic descriptions of a sample): x̄, s, p, s², n
6 Types of Data
Attribute Data (Qualitative): always binary; there are only two possible values (0, 1). Examples: Yes/No, Go/No go, Pass/Fail.
Variable Data (Quantitative):
- Discrete (Count) Data: can be categorized in a classification and is based on counts. Examples: number of defects, number of defective units, number of customer returns.
- Continuous Data: can be measured on a continuum; it has decimal subdivisions that are meaningful. Examples: time, pressure, conveyor speed, material feed rate, money.
The nature of data is important to understand. Based on the type of data, you will have the option to utilize different analyses. Data, or numbers, are usually abundant and available to virtually everyone in the organization. Using data to measure, analyze, improve, and control processes forms the foundation of the Six Sigma methodology. Data turned into information, then transformed into knowledge, lowers the risk of decisions. Your goal is to make more decisions based on data rather than the typical practices of “I think,” “I feel,” and “In my opinion”.
One of your first steps in refining data into information is to recognize what type of data you are using. There are two primary types of data: attribute and variable data.
Attribute Data is also called qualitative data. It is the lowest level of data and purely binary in nature: good or bad, yes or no. No analysis can be performed on Attribute Data; it must be converted to a form of variable data called discrete data in order to be counted or be useful.
Discrete Data is information that can be categorized into a classification. It is based on counts, typically things counted in whole numbers, and cannot be broken down into a smaller unit to add additional meaning. Only a finite number of values is possible, and the values cannot be subdivided meaningfully.
For example, there is no such thing as half of a defect or half of a system lockup.
Continuous Data is information that can be measured on a continuum or scale. Continuous Data, also called quantitative data, can take almost any numeric value and can be meaningfully subdivided into finer and finer increments, depending upon the precision of the measurement system. Decimal subdivisions are meaningful with Continuous Data. As opposed to Attribute Data (good or bad, off or on, etc.), Continuous Data can be recorded at many different points (length, size, width, time, temperature, cost, etc.). For example, a fractional number of inches is a meaningful number, whereas a fractional number of defects does not make sense.
Later in the course we will study many different statistical tests, but it is first important to understand what kind of data you have.
7 Discrete Variables
Discrete variable: possible values for the variable
- The number of defective needles in boxes of 100 diabetic syringes: 0, 1, 2, …, 100
- The number of individuals in groups of 30 with a Type A personality: 0, 1, 2, …, 30
- The number of surveys returned out of 300 mailed in a customer satisfaction study: 0, 1, 2, …, 300
- The number of employees in 100 having finished high school or obtained a GED: 0, 1, 2, …, 100
- The number of times you need to flip a coin before a head appears for the first time: 1, 2, 3, … (note: there is no upper limit, because you might need to flip forever before the first head appears)
Shown here are additional Discrete Variables. Can you think of others within your business?
8 Continuous Variables
Continuous variable: possible values for the variable
- The length of prison time served for individuals convicted of first-degree murder: all the real numbers between a and b, where a is the smallest amount of time served and b is the largest.
- The household income for households with incomes less than or equal to $30,000: all the real numbers between a and $30,000, where a is the smallest household income in the population.
- The blood glucose reading for those individuals having glucose readings equal to or greater than 200: all real numbers between 200 and b, where b is the largest glucose reading among all such individuals.
Shown here are additional Continuous Variables. Can you think of others within your business?
9 Definitions of Scaled Data
Understanding the nature of data and how to represent it can affect the types of statistical tests possible.
- Nominal Scale: data consists of names, labels, or categories. It cannot be arranged in an ordering scheme, and no arithmetic operations are performed on nominal data.
- Ordinal Scale: data is arranged in some order, but differences between data values either cannot be determined or are meaningless.
- Interval Scale: data can be arranged in some order and differences in data values are meaningful; the data can be arranged in an ordering scheme and differences can be interpreted.
- Ratio Scale: data that can be ranked and for which all arithmetic operations, including division, can be performed (division by zero is of course excluded). Ratio-level data has an absolute zero, and a value of zero indicates a complete absence of the characteristic of interest.
Shown here are the four types of scales. It is important to understand these scales, as they will dictate the type of statistical analysis that can be performed on your data.
10 Nominal Scale
Qualitative variable: possible nominal-level data values
- Blood types: A, B, AB, O
- State of residence: Alabama, …, Wyoming
- Country of birth: United States, China, other
Listed are some examples of Nominal Data. The only analysis possible is whether the values are different or not.
Time to weigh in!
11 Ordinal Scale
Qualitative variable: possible ordinal-level data values
- Automobile sizes: subcompact, compact, intermediate, full size, luxury
- Product rating: poor, good, excellent
- Baseball team classification: Class A, Class AA, Class AAA, Major League
These are examples of Ordinal Data.
12 Interval Scale
Interval variable: possible scores
- IQ scores of students in Black Belt Training: 100, … (the difference between scores is measurable and has meaning, but a difference of 20 points between 100 and 120 does not indicate that one student is 1.2 times more intelligent)
These are examples of Interval Data.
13 Ratio Scale
Ratio variable: possible scores
- Grams of fat consumed per adult in the United States: 0, … (if person A consumes 25 grams of fat and person B consumes 50 grams, we can say that person B consumes twice as much fat as person A; if person C consumes zero grams of fat per day, we can say there is a complete absence of fat consumed that day. Note that a ratio is interpretable and an absolute zero exists.)
Shown here is an example of Ratio Data.
14 Converting Attribute Data to Continuous Data
Continuous Data is always more desirable, and in many cases Attribute Data can be converted to Continuous. Which is more useful?
- 15 scratches, or a total scratch length of 9.25”?
- 22 foreign materials, or 2.5 fm/square inch?
- 200 defects, or 25 defects/hour?
Continuous Data provides more opportunity for statistical analyses. Attribute Data can often be converted to Continuous by converting it to a rate.
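The conversion is just dividing the count by the opportunity (time, area, length). A minimal Python sketch of the 200-defects example, assuming an 8-hour observation window (a hypothetical figure inferred from 200 defects becoming 25 defects/hour):

```python
# Converting an attribute count into a continuous rate.
# The 8-hour window is an assumption for illustration only.
defect_count = 200                # attribute (count) data
hours_observed = 8.0              # hypothetical observation window
defects_per_hour = defect_count / hours_observed  # continuous rate: 25.0
```

The same pattern applies to the other examples: scratches per inch, foreign materials per square inch.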
15 Descriptive Statistics
Measures of location (central tendency): Mean, Median, Mode.
Measures of variation (dispersion): Range, Interquartile Range, Standard Deviation, Variance.
We will review the descriptive statistics shown here, which are the most commonly used. 1) For each of the measures of location, how alike or different are they? 2) For each measure of variation, how alike or different are they? 3) What do these similarities or differences tell us?
16 Descriptive Statistics
Open the MINITAB™ Project “Measure Data Sets.mpj” and select the worksheet “basicstatistics.mtw”. We are going to use this MINITAB™ worksheet to create graphs and statistics.
17 Measures of Location: Mean
The Mean is commonly referred to as the average: the arithmetic balance point of a distribution of data. Sample: x̄; population: μ.
Stat > Basic Statistics > Display Descriptive Statistics… > Graphs… > Histogram of data, with normal curve
(MINITAB™ output: Descriptive Statistics for Data, listing N, N*, Mean, SE Mean, StDev, Minimum, Q1, Median, Q3, Maximum)
The Mean is the most common measure of location. Strictly speaking, “Mean” implies that you are talking about the population, or inferring something about the population, while “average” implies something about sample data. Although the symbols differ, there is no mathematical difference between the Mean of a sample and the Mean of a population.
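MINITAB™ computes this for you; as a language-neutral illustration, the sample Mean can be sketched with Python's standard library (the data values are made up):

```python
import statistics

data = [3, 5, 2, 7, 5, 9, 5]       # hypothetical sample data
x_bar = statistics.mean(data)      # sample mean: sum(data) / n
```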
18 Measures of Location: Median
The Median is the mid-point, or 50th percentile, of a distribution of data. Arrange the data from low to high, or high to low: the Median is the single middle value in the ordered list if there is an odd number of observations, or the average of the two middle values if there is an even number of observations.
(MINITAB™ output: Descriptive Statistics for Data, listing N, N*, Mean, SE Mean, StDev, Minimum, Q1, Median, Q3, Maximum)
The Median is the physical center of a data set and is unaffected by large data values. This is why people use the Median when discussing the average salary of an American worker; people like Bill Gates and Warren Buffett skew the Mean.
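The odd/even rule can be checked directly; a quick sketch with Python's standard library (hypothetical values):

```python
import statistics

odd_list = [1, 3, 9]        # odd count: the single middle value
even_list = [1, 3, 5, 9]    # even count: average of the two middle values
m_odd = statistics.median(odd_list)    # 3
m_even = statistics.median(even_list)  # (3 + 5) / 2 = 4.0
```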
19 Measures of Location: Trimmed Mean
The Trimmed Mean is a compromise between the Mean and Median. It is calculated by eliminating a specified percentage of the smallest and largest observations from the data set and then calculating the average of the remaining observations. It is useful for data with potential extreme values.
Stat > Basic Statistics > Display Descriptive Statistics… > Statistics… > Trimmed Mean
(MINITAB™ output: Descriptive Statistics for Data, listing N, N*, Mean, SE Mean, TrMean, StDev, Minimum, Q1, Median, Q3, Maximum)
The Trimmed Mean (TrMean in the output) is less susceptible to the effects of extreme scores.
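The calculation is simple enough to sketch by hand in Python. The 5% default below is an assumption about MINITAB™'s TrMean behavior, worth verifying against your version's documentation; the sample data is hypothetical:

```python
def trimmed_mean(data, trim_fraction=0.05):
    """Drop trim_fraction of the observations from each tail, then average the rest.
    The 5% default mirrors (assumed) MINITAB TrMean behavior."""
    xs = sorted(data)
    k = int(len(xs) * trim_fraction)       # observations to drop per tail
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return sum(kept) / len(kept)

# One extreme value pulls the plain mean up; trimming 20% per tail removes it.
sample = [1, 2, 3, 4, 100]
plain = sum(sample) / len(sample)          # 22.0
trimmed = trimmed_mean(sample, 0.2)        # mean of [2, 3, 4] = 3.0
```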
20 Measures of Location: Mode
The Mode is the most frequently occurring value in a distribution of data. In the example shown, Mode = 5.
It is possible to have multiple Modes; a distribution with two Modes is called bi-modal. Here we only have one: Mode = 5.
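In Python's standard library, `statistics.mode` returns the single most common value, and `statistics.multimode` (Python 3.8+) handles the bi-modal case (values below are hypothetical):

```python
import statistics

values = [5, 2, 5, 7, 5, 3]
single_mode = statistics.mode(values)            # 5, the most frequent value
bimodal = statistics.multimode([1, 1, 2, 2, 3])  # [1, 2]: two modes tie
```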
21 Measures of Variation: Range and Interquartile Range
The Range is the difference between the largest observation and the smallest observation in the data set. A small range indicates a small amount of variability; a large range, a large amount of variability.
The Interquartile Range is the difference between the 75th percentile and the 25th percentile.
(MINITAB™ output: Descriptive Statistics for Data, listing N, N*, Mean, SE Mean, StDev, Minimum, Q1, Median, Q3, Maximum)
The Range is typically used for small data sets; it is a completely efficient estimator of variation for a sample of 2. As your sample size increases, the Standard Deviation becomes a more appropriate measure of variation. Use the Range or Interquartile Range when the data distribution is skewed.
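Both measures fall straight out of sorted data; a sketch with Python's standard library (`statistics.quantiles` uses the "exclusive" quartile method by default, which may differ slightly from MINITAB™'s; the data is hypothetical):

```python
import statistics

data = [2, 4, 4, 5, 5, 5, 7, 9, 12, 15]       # hypothetical sample
data_range = max(data) - min(data)            # 15 - 2 = 13
q1, q2, q3 = statistics.quantiles(data, n=4)  # 25th, 50th, 75th percentiles
iqr = q3 - q1                                 # spread of the middle 50%
```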
22 Measures of Variation: Standard Deviation
The Standard Deviation is the equivalent of the average deviation of values from the Mean for a distribution of data: a “unit of measure” for distances from the Mean. Use it when the data are symmetrical. Sample: s; population: σ.
(MINITAB™ output: Descriptive Statistics for Data, listing N, N*, Mean, SE Mean, StDev, Minimum, Q1, Median, Q3, Maximum)
The Standard Deviation of a sample and of a population can be equated with short-term and long-term variation. A sample is usually taken over a short period of time, making it free from the types of variation that can accumulate over time, so be aware. We will explore this further at a later point in the methodology. We cannot calculate the population Standard Deviation here because this is sample data.
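The sample formula divides by n - 1, the population formula by n; Python's standard library exposes both (hypothetical data):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
s = statistics.stdev(data)        # sample std dev, divides by n - 1
sigma = statistics.pstdev(data)   # population std dev, divides by n
# The sample estimate is always a bit larger than the population figure
# computed from the same numbers, since n - 1 < n.
```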
23 Measures of Variation: Variance
The Variance is the average squared deviation of each individual data point from the Mean. Sample: s²; population: σ².
The Variance is the square of the Standard Deviation. It is common in statistical tests where it is necessary to add up sources of variation to estimate the total: Standard Deviations cannot be added, but variances can.
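That last point is worth a worked check: two independent sources of variation with standard deviations 3 and 4 (hypothetical numbers) combine to a total standard deviation of 5, not 7, because it is the variances that add:

```python
import math

s1, s2 = 3.0, 4.0                     # std devs of two independent sources
total_variance = s1**2 + s2**2        # variances add: 9 + 16 = 25
total_sd = math.sqrt(total_variance)  # 5.0, not s1 + s2 = 7.0
```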
24 What are the characteristics of a Normal Distribution?
The Normal Distribution is the most recognized distribution in statistics. What are its characteristics?
- Only random error is present
- Process free of assignable cause
- Process free of drifts and shifts
So what is present when the data is non-normal?
We can begin to discuss the Normal Curve and its properties once we understand the basic concepts of central tendency and dispersion. As we begin to assess our distributions, know that it is sometimes actually more difficult to determine what is affecting a process if it is Normally Distributed. When we have a Non-normal Distribution, there are usually special or more obvious causes of variation that become readily apparent upon process investigation.
25 The Normal Curve
The normal curve is a smooth, symmetrical, bell-shaped curve generated by the density function. It is the most useful continuous probability model, as many naturally occurring measurements such as heights and weights are approximately Normally Distributed. The Normal Distribution is the most commonly used (and abused) distribution in statistics and serves as the foundation of many statistical tools which will be taught later in the methodology.
26 “Standard” Normal Distribution
Each combination of Mean and Standard Deviation generates a unique normal curve. The “Standard” Normal Distribution has μ = 0 and σ = 1. Data from any Normal Distribution can be made to fit the standard Normal by converting raw scores to standard scores. Z-scores measure how many Standard Deviations from the Mean a particular data value lies.
The shape of the Normal Distribution is a function of two parameters: the Mean and the Standard Deviation. We will convert the Normal Distribution to the standard Normal in order to compare various Normal Distributions and to estimate tail area proportions. Normalizing the distribution converts the raw scores into standard Z-scores with a Mean of 0 and a Standard Deviation of 1; this practice allows us to use the Z-table.
27 Normal Distribution
The area under the curve between any two points represents the proportion of the distribution between those points. Convert any raw score to a Z-score using the formula z = (x − μ) / σ, then refer to a set of Standard Normal Tables to find the proportion between μ and x. The area between the Mean and any other point depends upon the Standard Deviation.
The concept of determining the proportion between two points under the standard Normal curve is a critical component of estimating process capability and will be covered in detail in that module.
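The Z-table lookup can be reproduced with Python's `statistics.NormalDist` (Python 3.8+); the μ, σ, and x values below are hypothetical:

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0     # hypothetical normal distribution
x = 130.0
z = (x - mu) / sigma        # raw score to standard score: 2.0
# Proportion of the distribution below x, as a Z-table would give:
p_below = NormalDist().cdf(z)   # about 0.9772
```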
28 The Empirical Rule
The Empirical Rule allows us to predict, or more appropriately estimate, how our process is performing. You will gain a great deal of understanding within the Process Capability module. Notice the difference between ±1 SD and ±6 SD:
- 68.27% of the data will fall within ±1 standard deviation
- 95.45% of the data will fall within ±2 standard deviations
- 99.73% of the data will fall within ±3 standard deviations
- 99.9937% of the data will fall within ±4 standard deviations
- 99.99994% of the data will fall within ±5 standard deviations
- 99.9999998% of the data will fall within ±6 standard deviations
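These percentages can be reproduced from the standard normal CDF; a quick check with Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

def proportion_within(k):
    """Proportion of a normal distribution within +/- k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

# proportion_within(1) is about 0.6827, proportion_within(2) about 0.9545,
# proportion_within(3) about 0.9973
```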