# Master Black Belt Statistics.

## Presentation on theme: "Master Black Belt Statistics."— Presentation transcript:

Statistical Notation – Cheat Sheet
| Notation | Meaning |
|---|---|
| $\sum$ | Summation |
| $s$ | The standard deviation of sample data |
| $\sigma$ | The standard deviation of population data |
| $s^2$ | The variance of sample data |
| $\sigma^2$ | The variance of population data |
| $R$ | The range of data |
| $\bar{R}$ | The average range of data |
| $k$ | Multi-purpose notation, e.g. # of subgroups, # of classes |
| $\lvert X \rvert$ | The absolute value of some term |
| $>$, $<$ | Greater than, less than |
| $\geq$, $\leq$ | Greater than or equal to, less than or equal to |
| $X$ | An individual value, an observation |
| $X_1$ | A particular (1st) individual value |
| $X_i$ | For each, all, individual values |
| $\bar{X}$ | The mean, average of sample data |
| $\bar{\bar{X}}$ | The grand mean, grand average |
| $\mu$ | The mean of population data |
| $p$ | A proportion of sample data |
| $P$ | A proportion of population data |
| $n$ | Sample size |
| $N$ | Population size |

Use this as a cheat sheet; don’t bother memorizing all of it. Most of the Greek notation is for population data.

Measure Phase

Parameters vs. Statistics
- Population: All the items that have the “property of interest” under study.
- Frame: An identifiable subset of the population.
- Sample: A significantly smaller subset of the population used to make an inference.

The purpose of sampling is to get a “sufficiently accurate” inference for considerably less time, money, and other resources, and to provide a basis for statistical inference. If sampling is done well, and sufficiently, then the inference is that “what we see in the sample is representative of the population”.

A population parameter is a numerical value that summarizes the data for an entire population; a sample has a corresponding numerical value called a statistic.

The population is a collection of all the individuals of interest. It must be defined carefully, such as all the trades completed in a given year. If for some reason there are unique subsets of trades, it may be appropriate to define those as a unique population, such as “all sub custodial market trades completed in 2001” or “emerging market trades”.

Sampling frames are complete lists and should be identical to a population, with every element listed only once. That sounds very similar to a population, and it is; the difference is how it is used. A sampling frame such as the list of registered voters could be used to represent the population of the adult general public. There may be reasons why this wouldn’t be a good sampling frame; perhaps a sampling frame of licensed drivers would better represent the general public. The sampling frame is the source from which a sample is drawn.

It is important to recognize the difference between a sample and a population, because we are typically dealing with a sample of what the potential population could be in order to make an inference. The formulas for describing samples and populations are slightly different. In most cases we will be dealing with the formulas for samples.

- Sample statistics: arithmetic descriptions of a sample: $\bar{X}$, $s$, $p$, $s^2$, $n$
- Population parameters: arithmetic descriptions of a population: $\mu$, $\sigma$, $P$, $\sigma^2$, $N$
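To make the parameter/statistic distinction concrete, here is a minimal Python sketch. The "population" of trade cycle times is simulated and entirely hypothetical (the mean of 30 minutes, the spread, and the sample size of 50 are assumptions chosen for illustration); the point is only that the sample statistics approximate, but do not equal, the population parameters.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: cycle times (minutes) for N = 10,000 trades.
population = [random.gauss(30.0, 4.0) for _ in range(10_000)]
mu = statistics.fmean(population)       # population parameter
sigma = statistics.pstdev(population)   # population parameter (divides by N)

# A much smaller random sample drawn from the population.
sample = random.sample(population, 50)
x_bar = statistics.fmean(sample)        # sample statistic
s = statistics.stdev(sample)            # sample statistic (divides by n - 1)

print(f"N = {len(population)}, mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"n = {len(sample)}, x_bar = {x_bar:.2f}, s = {s:.2f}")
```

Re-running with a different seed changes `x_bar` and `s` from sample to sample, while `mu` and `sigma` are fixed properties of the population.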

Types of Data

Attribute Data (Qualitative): Is always binary; there are only two possible values (0, 1).
- Yes / No
- Go / No go
- Pass / Fail

Variable Data (Quantitative)
- Discrete (Count) Data: Can be categorized in a classification and is based on counts.
  - Number of defects
  - Number of defective units
  - Number of customer returns
- Continuous Data: Can be measured on a continuum; it has decimal subdivisions that are meaningful.
  - Time
  - Money
  - Pressure
  - Conveyor speed
  - Material feed rate

The nature of data is important to understand. The type of data determines which analyses are available to you. In the next phases we will study many different statistical tests, but it is first important to understand what kind of data you have: continuous data is the most powerful and attribute data is the least powerful.

Discrete Variables

| Discrete Variable | Possible values for the variable |
|---|---|
| The number of defective needles in boxes of 100 diabetic syringes | 0, 1, 2, …, 100 |
| The number of individuals in groups of 30 with a Type A personality | 0, 1, 2, …, 30 |
| The number of surveys returned out of 300 mailed in a customer satisfaction study | 0, 1, 2, …, 300 |
| The number of employees in 100 having finished high school or obtained a GED | 0, 1, 2, …, 100 |
| The number of times you need to flip a coin before a head appears for the first time | 1, 2, 3, … (note: there is no upper limit, because you might need to flip forever before the first head appears) |

Shown here are additional discrete variables. Can you think of others within your business?

Continuous Variables

| Continuous Variable | Possible Values for the Variable |
|---|---|
| The length of prison time served for individuals convicted of first degree murder | All the real numbers between a and b, where a is the smallest amount of time served and b is the largest |
| The household income for households with incomes less than or equal to \$30,000 | All the real numbers between a and \$30,000, where a is the smallest household income in the population |
| The blood glucose reading for those individuals having glucose readings equal to or greater than 200 | All real numbers between 200 and b, where b is the largest glucose reading in all such individuals |

Shown here are additional continuous variables. Can you think of others within your business?

Definitions of Scaled Data
Understanding the nature of data and how to represent it can affect the types of statistical tests possible.

- Nominal Scale: data consists of names, labels, or categories. It cannot be arranged in an ordering scheme, and no arithmetic operations are performed on nominal data.
- Ordinal Scale: data is arranged in some order, but differences between data values either cannot be determined or are meaningless.
- Interval Scale: data can be arranged in some order, and differences between data values are meaningful and can be interpreted.
- Ratio Scale: data can be ranked, and all arithmetic operations including division can be performed (division by zero is of course excluded). Ratio level data has an absolute zero, and a value of zero indicates a complete absence of the characteristic of interest.

Shown here are the four types of scales. It is important to understand these scales as they will dictate the type of statistical analysis that can be performed on your data.

Nominal Scale

| Qualitative Variable | Possible nominal level data values for the variable |
|---|---|
| Blood Types | A, B, AB, O |
| State of Residence | Alabama, …, Wyoming |
| Country of Birth | United States, China, other |

Listed are some examples of nominal data. The only analysis is whether they are different or not. Time to weigh in!

Ordinal Scale

| Qualitative Variable | Possible ordinal level data values |
|---|---|
| Automobile Sizes | Subcompact, compact, intermediate, full size, luxury |
| Product rating | Poor, good, excellent |
| Baseball team classification | Class A, Class AA, Class AAA, Major League |

These are examples of ordinal data.

Interval Scale

| Interval Variable | Possible Scores |
|---|---|
| IQ scores of students in Black Belt Training | 100 … |

The difference between scores is measurable and has meaning, but a difference of 20 points between 100 and 120 does not indicate that one student is 1.2 times more intelligent.

This is an example of interval data.

Ratio Scale

| Ratio Variable | Possible Scores |
|---|---|
| Grams of fat consumed per adult in the United States | 0 … |

If person A consumes 25 grams of fat and person B consumes 50 grams, we can say that person B consumes twice as much fat as person A. If person C consumes zero grams of fat per day, we can say there is a complete absence of fat consumed on that day. Note that a ratio is interpretable and an absolute zero exists.

Shown here is an example of ratio data.

Converting Attribute Data to Continuous Data
- Continuous data is always more desirable.
- In many cases attribute data can be converted to continuous.

Which is more useful?
- 15 scratches, or a total scratch length of 9.25”?
- 22 foreign materials, or 2.5 fm/square inch?
- 200 defects, or 25 defects/hour?

Continuous data provides more opportunity for statistical analyses. Attribute data can often be converted to continuous by converting it to a rate.
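Converting a count to a rate is simple division by the exposure (area, time, units inspected). The sketch below reuses the counts from the slide; the inspected area and production hours are assumed values, chosen only so that the results match the rates quoted above.

```python
# Counts come from the slide; the exposure figures are assumptions for illustration.
foreign_material_count = 22
inspected_area_sq_in = 8.8    # assumed inspected area
defect_count = 200
production_hours = 8.0        # assumed observation window

fm_per_sq_in = foreign_material_count / inspected_area_sq_in
defects_per_hour = defect_count / production_hours

print(f"{fm_per_sq_in:.2f} fm/square inch")
print(f"{defects_per_hour:.1f} defects/hour")
```

The rate carries the exposure with it, so two lines or shifts with different volumes can be compared on the same scale.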

Descriptive Statistics
Measures of Location (central tendency)
- Mean
- Median
- Mode

Measures of Variation (dispersion)
- Range
- Interquartile Range
- Standard deviation
- Variance

We will review the descriptive statistics shown here, which are the most commonly used.

1) For each of the measures of location, how alike or different are they?
2) For each measure of variation, how alike or different are they?
3) What do these similarities or differences tell us?

Descriptive Statistics
Open the MINITAB™ Project “Measure Data Sets.mpj” and select the worksheet “basicstatistics.mtw”. We are going to use this worksheet to create graphs and statistics.

Measures of Location: Mean

Mean is:
- Commonly referred to as the average.
- The arithmetic balance point of a distribution of data.

Sample: $\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$

Population: $\mu = \dfrac{\sum_{i=1}^{N} X_i}{N}$

Stat>Basic Statistics>Display Descriptive Statistics…>Graphs…>Histogram of data, with normal curve

MINITAB™ session output: Descriptive Statistics for “Data” (columns: N, N*, Mean, SE Mean, StDev, Minimum, Q1, Median, Q3, Maximum)

The mean is the most common measure of location. “Mean” implies that you are talking about the population, or inferring something about the population. Conversely, “average” implies something about sample data. Although the symbol is different, there is no mathematical difference between the mean of a sample and the mean of a population.

Measures of Location: Median

Median is:
- The mid-point, or 50th percentile, of a distribution of data.
- Arrange the data from low to high, or high to low.
  - It is the single middle value in the ordered list if there is an odd number of observations.
  - It is the average of the two middle values in the ordered list if there is an even number of observations.

MINITAB™ session output: Descriptive Statistics for “Data” (columns: N, N*, Mean, SE Mean, StDev, Minimum, Q1, Median, Q3, Maximum)

The median is the physical center of a data set and is unaffected by large data values. This is why people use the median when discussing the average salary of an American worker: people like Bill Gates and Warren Buffett skew the mean.
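The effect of an extreme value on the mean versus the median can be checked in a few lines of Python. The salary figures below are invented for illustration (in thousands of dollars) and do not come from the MINITAB™ worksheet.

```python
import statistics

# Hypothetical annual salaries, in thousands of dollars.
salaries = [38, 42, 45, 47, 51, 55, 60]
with_outlier = salaries + [10_000]   # add one extreme earner

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)      # odd n: single middle value
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)   # even n: average of two middle values

print(f"without outlier: mean = {mean_before:.2f}, median = {median_before}")
print(f"with outlier:    mean = {mean_after:.2f}, median = {median_after}")
```

One extreme observation moves the mean from about 48 to over 1,000, while the median moves only from 47 to 49.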

Measures of Location: Trimmed Mean

Trimmed Mean is a:
- Compromise between the mean and median.
- The trimmed mean is calculated by eliminating a specified percentage of the smallest and largest observations from the data set and then calculating the average of the remaining observations.
- Useful for data with potential extreme values.

Stat>Basic Statistics>Display Descriptive Statistics…>Statistics…>Trimmed Mean

MINITAB™ session output: Descriptive Statistics for “Data” (columns: N, N*, Mean, SE Mean, TrMean, StDev, Minimum, Q1, Median, Q3, Maximum)

The trimmed mean (TrMean in the output) is less susceptible to the effects of extreme scores.

Measures of Location: Mode

Mode is:
- The most frequently occurring value in a distribution of data.

It is possible to have multiple modes; a distribution with two modes is called bimodal. Here we have only one mode: Mode = 5.
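As a quick illustration, Python's standard library can report one or several modes. Both data sets below are made up for the example; the first simply mirrors the single mode of 5 mentioned above.

```python
import statistics

single_mode_data = [3, 4, 5, 5, 5, 6, 7]
bimodal_data = [1, 2, 2, 2, 3, 7, 8, 8, 8, 9]

# multimode returns every value tied for the highest frequency.
single_modes = statistics.multimode(single_mode_data)
double_modes = statistics.multimode(bimodal_data)

print(single_modes)
print(double_modes)
```

A result with two entries signals a bimodal data set, which often means two different processes or conditions are mixed in the same data.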

Measures of Variation: Range and Interquartile Range

Range is the:
- Difference between the largest observation and the smallest observation in the data set.
- A small range would indicate a small amount of variability and a large range a large amount of variability.

Interquartile Range is the:
- Difference between the 75th percentile and the 25th percentile.

MINITAB™ session output: Descriptive Statistics for “Data” (columns: N, N*, Mean, SE Mean, StDev, Minimum, Q1, Median, Q3, Maximum)

The range is typically used for small data sets; it is completely efficient in estimating variation for a sample of 2. As the amount of data increases, the standard deviation becomes a more appropriate measure of variation. Use the Range or Interquartile Range when the data distribution is skewed.
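The two definitions can be sketched in Python as follows. The data are hypothetical, and the quartiles use the default "exclusive" method of `statistics.quantiles`; other packages may use a slightly different quartile convention, so small differences in Q1 and Q3 are possible.

```python
import statistics

# Hypothetical right-skewed data.
data = [12, 15, 17, 19, 22, 24, 27, 30, 33, 41, 58]

data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"range = {data_range}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

The range depends entirely on the two most extreme observations, while the interquartile range ignores the top and bottom quarters of the data, which is why it is preferred for skewed distributions.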

Measures of Variation: Standard Deviation

Standard Deviation is:
- Equivalent to the average deviation of values from the mean for a distribution of data.
- A “unit of measure” for distances from the mean.
- Use when data are symmetrical.

Sample: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$

Population: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$

MINITAB™ session output: Descriptive Statistics for “Data” (columns: N, N*, Mean, SE Mean, StDev, Minimum, Q1, Median, Q3, Maximum)

We cannot calculate the population standard deviation here because this is sample data.

The standard deviation for a sample and for a population can be equated with short-term and long-term variation. Usually a sample is taken over a short period of time, making it free from the types of variation that can accumulate over time, so be aware of this. We will explore this further at a later point in the methodology.
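The only difference between the sample and population calculations is the divisor. The sketch below works both out by hand on a small hypothetical data set and compares them with the standard library functions; treating the same five values as a "population" is done purely to show the arithmetic.

```python
import math
import statistics

data = [4, 7, 8, 10, 11]   # hypothetical observations
n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)

s = math.sqrt(sum_sq_dev / (n - 1))   # sample formula: divide by n - 1
sigma = math.sqrt(sum_sq_dev / n)     # population formula: divide by N

print(f"s     = {s:.4f}  (statistics.stdev  = {statistics.stdev(data):.4f})")
print(f"sigma = {sigma:.4f}  (statistics.pstdev = {statistics.pstdev(data):.4f})")
```

Dividing by n - 1 makes the sample value slightly larger, which compensates for the fact that deviations are measured from the sample mean rather than the unknown population mean.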

Measures of Variation: Variance

Variance is the:
- Average squared deviation of each individual data point from the mean.

Sample: $s^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Population: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

The variance is the square of the standard deviation. It is common in statistical tests where it is necessary to add up sources of variation to estimate the total. Standard deviations cannot be added; variances (of independent sources) can.
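A short numeric check shows why variances are added rather than standard deviations. The two sources of variation and their values are hypothetical, and the calculation assumes the sources are independent of each other.

```python
import math

# Hypothetical standard deviations of two independent sources of variation.
sd_machine = 3.0
sd_material = 4.0

# Variances add; the total standard deviation is the square root of the sum.
total_variance = sd_machine ** 2 + sd_material ** 2
total_sd = math.sqrt(total_variance)

naive_sum_of_sds = sd_machine + sd_material   # incorrect way to combine

print(f"total variance = {total_variance}")
print(f"total SD       = {total_sd}  (not {naive_sum_of_sds})")
```

Adding the standard deviations directly would overstate the combined variation (7 instead of 5 in this example).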

What are the characteristics of a Normal distribution?
The normal distribution is the most recognized distribution in statistics.

What are the characteristics of a Normal distribution?
- Only random error is present
- Process free of assignable cause
- Process free of drifts and shifts

So what is present when the data is Non-Normal?

We can begin to discuss the Normal Curve and its properties once we understand the basic concepts of central tendency and dispersion. As we begin to assess our distributions, know that it is sometimes more difficult to determine what is affecting a process if it is normally distributed. When we have a non-normal distribution, there are usually special or more obvious causes of variation that become readily apparent upon process investigation.

The Normal Curve

The normal curve is a smooth, symmetrical, bell-shaped curve, generated by the normal density function. It is the most useful continuous probability model, as many naturally occurring measurements such as heights, weights, etc. are approximately normally distributed.

The normal distribution is the most commonly used and abused distribution in statistics and serves as the foundation of many statistical tools which will be taught later in the methodology.
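The density function itself is not reproduced in this transcript; the standard form is $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$. The sketch below evaluates it directly and cross-checks against Python's `statistics.NormalDist`. The parameter values (mean 100, standard deviation 15) are arbitrary choices for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

mu, sigma = 100.0, 15.0   # hypothetical parameters

peak = normal_pdf(mu, mu, sigma)        # the curve is highest at the mean
left = normal_pdf(mu - 10, mu, sigma)   # equal distances either side of the mean
right = normal_pdf(mu + 10, mu, sigma)  # give equal heights (symmetry)
library_value = NormalDist(mu, sigma).pdf(mu + 10)

print(f"peak = {peak:.5f}, left = {left:.5f}, right = {right:.5f}")
print(f"library check at mu + 10: {library_value:.5f}")
```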

Normal Distribution

- Each combination of mean and standard deviation generates a unique normal curve.
- The “Standard” Normal Distribution has μ = 0 and σ = 1.
- Data from any normal distribution can be made to fit the standard normal by converting raw scores to standard scores.
- Z-scores measure how many standard deviations from the mean a particular data value lies.

The shape of the normal distribution is a function of 2 parameters: the mean and the standard deviation. We convert a normal distribution to the standard normal in order to compare various normal distributions and to estimate tail area proportions. Standardizing converts the raw scores into Z-scores with a mean of 0 and a standard deviation of 1, which allows us to use the Z-table.
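As a small illustration of standardizing, the sketch below converts a set of raw scores to Z-scores and confirms that the result has mean 0 and standard deviation 1. The raw values are hypothetical, and the sample mean and sample standard deviation stand in for μ and σ, which is an assumption of this example.

```python
import statistics

raw_scores = [52.0, 47.5, 50.0, 55.5, 45.0]   # hypothetical measurements

mean = statistics.fmean(raw_scores)
sd = statistics.stdev(raw_scores)

# Each Z-score is the distance from the mean in units of standard deviations.
z_scores = [(x - mean) / sd for x in raw_scores]

print([round(z, 3) for z in z_scores])
print(f"mean of Z = {statistics.fmean(z_scores):.6f}")
print(f"SD of Z   = {statistics.stdev(z_scores):.6f}")
```

Because every data set ends up on the same scale after this conversion, values from processes with different units or spreads can be compared directly.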

Normal Distribution

- The area under the curve between any 2 points represents the proportion of the distribution between those points.
- Convert any raw score to a Z-score using the formula: $Z = \dfrac{X - \mu}{\sigma}$
- Refer to a set of Standard Normal Tables to find the proportion between μ and x.
- The area between the mean and any other point depends upon the standard deviation.

Determining the proportion between 2 points under the standard normal curve is a critical component of estimating process capability and will be covered in detail in that module.
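The table lookup can be reproduced with Python's `statistics.NormalDist`, shown here as a minimal sketch. The process mean, standard deviation, and raw score are hypothetical values chosen so that the Z-score comes out to a round number.

```python
from statistics import NormalDist

mu, sigma = 50.0, 5.0   # hypothetical process mean and standard deviation
x = 57.5                # raw score of interest

z = (x - mu) / sigma    # number of standard deviations x lies from the mean

standard_normal = NormalDist()                        # mean 0, SD 1
area_mu_to_x = standard_normal.cdf(z) - 0.5           # proportion between mu and x
area_beyond_x = 1.0 - standard_normal.cdf(z)          # upper tail proportion

print(f"Z = {z}")
print(f"proportion between mu and x = {area_mu_to_x:.4f}")
print(f"proportion beyond x         = {area_beyond_x:.4f}")
```

The first proportion is the value a Standard Normal Table would list for this Z-score; the tail proportion is the quantity used later when estimating process capability.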

The Empirical Rule

For normally distributed data:
- 68.27% of the data will fall within +/- 1 standard deviation
- 95.45% of the data will fall within +/- 2 standard deviations
- 99.73% of the data will fall within +/- 3 standard deviations
- 99.9937% of the data will fall within +/- 4 standard deviations
- 99.999943% of the data will fall within +/- 5 standard deviations
- 99.9999998% of the data will fall within +/- 6 standard deviations

The Empirical Rule allows us to predict, or more appropriately to estimate, how our process is performing. You will gain a great deal of understanding within the Process Capability module. Notice the difference between +/- 1 SD and +/- 6 SD.
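These percentages can be reproduced from the standard normal distribution. The sketch below uses the identity P(|Z| < k) = erf(k/√2); the helper name `within_k_sigma` is introduced here for illustration.

```python
import math

def within_k_sigma(k):
    """Proportion of a normal distribution within +/- k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in range(1, 7):
    print(f"+/- {k} SD: {100 * within_k_sigma(k):.7f} %")
```

The jump from +/- 3 to +/- 6 standard deviations looks small in percentage terms, but it is the difference between roughly 2,700 and roughly 0.002 observations per million falling outside the limits.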

The Empirical Rule

No matter what the shape of your distribution is, as you travel 3 standard deviations from the mean, the probability of occurrence beyond that point begins to converge to a very low number.
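One way to quantify this statement, not named in the slides, is Chebyshev's inequality: for any distribution with a finite variance, at most 1/k² of the probability lies k or more standard deviations from the mean. The sketch below compares that worst-case bound at k = 3 with the actual tails of a normal distribution and of a strongly skewed exponential distribution (rate 1, so mean = 1 and SD = 1); both comparison distributions are choices made for this example.

```python
import math

k = 3

# Worst case for ANY distribution with finite variance (Chebyshev's inequality).
chebyshev_bound = 1 / k ** 2

# Normal distribution: probability outside +/- 3 SD (both tails).
normal_tail = 1 - math.erf(k / math.sqrt(2.0))

# Exponential(rate 1): mean = 1, SD = 1. The lower limit (mean - 3 SD) is negative
# and cannot be exceeded, so only the upper tail P(X > 4) contributes.
exponential_tail = math.exp(-(1 + k))

print(f"Chebyshev bound  : {chebyshev_bound:.4f}")
print(f"normal tail      : {normal_tail:.4f}")
print(f"exponential tail : {exponential_tail:.4f}")
```

Even the skewed distribution has under 2% of its probability beyond 3 standard deviations, well below the worst-case bound of about 11%.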

Why Assess Normality?

While many processes in nature behave according to the normal distribution, many processes in business, particularly in the areas of service and transactions, do not.

There are many types of distributions.

There are many statistical tools that assume normal distribution properties in their calculations, so understanding just how “normal” the data are will impact how we look at the data. There is no good or bad here; it’s not always better to have “normal” data. Look at it with respect to the intent of your project. Again, there is much informational content in non-normal distributions, and for this reason it is useful to know how normal our data are. Go back to your project: what do you want to do with your distribution, normal or non-normal?

Many distributions simply by nature can NOT be normal. Assume that you’re dealing with a time metric: how do you get negative time without a flux capacitor, as in the movie “Back to the Future”? The same applies to any metric that is, by nature, bound to some setting.
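The negative-time point can be made numerically. If a cycle time were modeled as normal with a mean close to zero relative to its spread, the model would assign real probability to impossible negative values. The mean of 2.0 minutes and standard deviation of 1.5 minutes below are hypothetical.

```python
from statistics import NormalDist

# Hypothetical cycle time model, in minutes.
cycle_time_model = NormalDist(mu=2.0, sigma=1.5)

# Probability the normal model assigns to a negative (impossible) time.
p_negative = cycle_time_model.cdf(0.0)

print(f"P(time < 0) under the normal model = {p_negative:.4f}")
```

A model that puts roughly 9% of its probability on values that cannot occur is a sign that a bounded or skewed distribution would describe the metric better.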