Last Updated on 13 September 2023
There are a lot of statistics in Six Sigma, and they can be hard to get your head around. This page gives you a basic grounding in statistics for Six Sigma, so that when you do encounter these concepts in the various tools and methodologies, they’re already familiar.
Some statistical concepts are specific to one tool or process, so I’ll introduce those on the pages they’re relevant to. I’ve attempted here to introduce general statistics that are applicable to many tools, or are tested independently in Six Sigma exams.
Qualitative and quantitative data
There are two main categories of data, qualitative and quantitative:
Quantitative data is measured or counted, and will usually be expressed as a number. Examples are timings, weight, length etc. You will be able to do calculations and statistics on quantitative data.
Depending on the type of numbers that it is measured with, the quantitative data will either be discrete or continuous (see below).
Qualitative data is descriptive, and will give more information about an object, such as materials, color and other observations (including pass / fail). You won’t be able to measure qualitative data with a number.
Because it is descriptive, a lot of the information you gather from voice of the customer will usually be qualitative data.
Discrete and continuous data
When you’re collecting your quantitative data, you will find there are two main types that you will run into: discrete and continuous.
Discrete data is where there are only a set number of options. You will see this most frequently when you are counting items that only exist as whole numbers, e.g. number of screws. Pass/fail results, such as a quality check, and counts of whole objects, such as staff members, are two common examples of discrete data.
Continuous data is where there are infinite options for the values. A good example is the length of your product, as you can always add one more decimal place to your measurements. E.g. 50.1, 50.09 and 50.0093 can just be different measurements of the same product. To make sense of it, you usually have to split it into categories, e.g. by rounding to the nearest mm. Other common examples are weight and time.
Measures of Central tendency (Averages)
Six Sigma is about reducing variation, which is usually how much the process results vary from a target, or from the system average. It can be helpful to see how far your average is from target. But how do you calculate the average? There are three main ways of calculating averages: ‘mean’, ‘median’ and ‘mode’, each of which comes with its own advantages and disadvantages.
The mean is the most common average, and you will likely already be familiar with it. You add all of the measurements together, and divide by the number of measurements.
The main advantage is that the mean uses all the data (i.e. if any of the data points change, the average moves), which often makes it the most useful average. The main disadvantage comes from the same source: because all data points are used, outliers (those few points very far from the others) can distort the average. For example, if you are working out the average personal income of a group of people, and you have Bill Gates in that group, the ‘average of the group incomes’ will tell you very little about the ‘income of the average member of the group’.
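As a minimal sketch of how one outlier pulls the mean, the example below uses made-up income figures (the numbers and variable names are assumptions for illustration only):

```python
def mean(values):
    """Arithmetic mean: add the measurements, divide by how many there are."""
    return sum(values) / len(values)

# Hypothetical annual incomes for a small group (illustrative values only).
incomes = [30_000, 35_000, 40_000, 45_000, 50_000]

# The same group with one extremely high earner added.
incomes_with_outlier = incomes + [1_000_000_000]

typical_mean = mean(incomes)
skewed_mean = mean(incomes_with_outlier)

print(typical_mean)  # 40000.0
print(skewed_mean)   # 166700000.0, far above what anyone but the outlier earns
```

With the outlier included, the mean is larger than the income of every other member of the group, so it no longer describes a typical member.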
Where the mean is the average of the measurements, the median is the measurement of the middle product: half of the measurements fall below it and half above.
The median solves the outlier issue with the mean, making it probably the second most commonly used average. Very high and very low measurements count only by their position in the ordering, not by their size, so any unusual measurements (including errors) should have minimal effect on the result.
Calculating the median is simple. You put all your measurements in ascending order (i.e. lowest to highest value), and the median is the value in the middle. If there are two numbers in the middle (because there is an even number of measurements), you’ll need to take the mean of those two numbers to get your median.
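The calculation can be sketched as follows. The sample values are invented for illustration, and Python's standard library also offers `statistics.median`, which does the same job:

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if even."""
    ordered = sorted(values)          # ascending order, lowest to highest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[middle]
    # even count: take the mean of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical lengths in mm; 40 is an unusual measurement (possibly an error).
odd_sample = [12, 9, 10, 11, 40]
even_sample = [12, 9, 10, 11]

odd_median = median(odd_sample)
even_median = median(even_sample)

print(odd_median)   # 11
print(even_median)  # 10.5
```

The outlier of 40 in the first sample does not move the median away from the bulk of the data.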
Often the simplest to work out, the mode is the measurement that comes up most frequently. This is easy for discrete data, but it is slightly trickier for continuous data, where you’ll have to split the range into sections first (e.g. you may choose rounding down to the whole mm), after which the ‘mode’ becomes the most common category.
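A short sketch of both cases, using invented data: the discrete counts are used as they are, while the continuous lengths are first rounded down to whole millimetres (one possible choice of category) before counting.

```python
import math
from collections import Counter

def mode(values):
    """Most frequent value. On a tie, the value seen first is returned."""
    counts = Counter(values)
    value, _count = counts.most_common(1)[0]
    return value

# Discrete data: hypothetical number of screws found in each box.
screws_per_box = [20, 19, 20, 21, 20, 19]
discrete_mode = mode(screws_per_box)

# Continuous data: hypothetical lengths in mm, binned by rounding down.
lengths_mm = [50.1, 50.09, 50.93, 51.2, 50.4, 49.7]
binned = [math.floor(x) for x in lengths_mm]   # [50, 50, 50, 51, 50, 49]
binned_mode = mode(binned)

print(discrete_mode)  # 20
print(binned_mode)    # 50
```

Changing the category width (for example, rounding to 0.1 mm instead) can change which category is the mode, so the choice of sections matters.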
Measures of Dispersion
There are three ‘measures of dispersion’ that need to be understood for Yellow Belt. Dispersion is how far the data is spread out from the average, so these measures show how varied your data is.
Range is the difference between the largest and smallest values in your sample:
Range = R = largest value – smallest value
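As a quick illustration with made-up cycle times (the data and the function name `value_range` are assumptions for this sketch):

```python
def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical cycle times in minutes.
cycle_times = [12, 15, 11, 19, 14]

r = value_range(cycle_times)
print(r)  # 8
```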
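The standard deviation is the square root of the variance. This has the advantage of being in the same units as what is being measured. The formula is:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
where $x_i$ is each measurement, $\mu$ is the mean and $N$ is the number of measurements.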
To calculate it:
1. For each measurement, work out the difference between the measurement and the mean
2. Square that figure (multiply it by itself)
3. Add up all the figures from step 2
4. Divide by the number of measurements
5. Take the square root
It’s no coincidence that the symbol for standard deviation is sigma (σ); Six Sigma takes its name from a number of standard deviations.
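The calculation steps for standard deviation can be sketched in code as below. The data set is invented for illustration. Because the steps divide by the number of measurements, this is the population standard deviation, which matches `statistics.pstdev` in Python's standard library; the sample version (`statistics.stdev`) divides by one less than the number of measurements instead.

```python
import math
import statistics

def standard_deviation(values):
    """Population standard deviation, following the steps one at a time."""
    mean = sum(values) / len(values)
    differences = [x - mean for x in values]     # difference from the mean
    squared = [d * d for d in differences]       # square each difference
    total = sum(squared)                         # add up the squares
    variance = total / len(values)               # divide by the count
    return math.sqrt(variance)                   # take the square root

# Hypothetical measurements; their mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sigma = standard_deviation(data)
print(sigma)                    # 2.0
print(statistics.pstdev(data))  # library result for comparison
```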
Variance is the average of the squares of the differences between the data points and the mean.
The calculation is the same as for standard deviation, but without the final step of taking the square root.
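A minimal sketch of the variance calculation, again dividing by the number of measurements (the population form, equivalent to `statistics.pvariance`); the data is made up for illustration:

```python
import math
import statistics

def variance(values):
    """Population variance: mean of the squared differences from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

# Hypothetical measurements; their mean is 3.
data = [1, 2, 3, 4, 5]

var = variance(data)
std_dev = math.sqrt(var)   # the extra step that turns variance into standard deviation

print(var)      # 2.0
print(std_dev)
```

Note that the variance is in squared units (e.g. mm² for lengths in mm), which is why the standard deviation is usually easier to interpret.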
Probability
A lot of issues in Six Sigma are things that may occur, or occur only occasionally, rather than always occur, so a strong grasp of probability will be very helpful. The concepts that you need to understand and be able to apply are:
- independent events
- mutually exclusive events
- addition and multiplication rules
- conditional probability
- complementary probability
- joint occurrence of events
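To make these concepts concrete, here is a small sketch using a fair six-sided die, where every probability can be checked by counting outcomes. The die, the events chosen and the helper `prob` are assumptions for illustration, not part of any Six Sigma tool.

```python
from fractions import Fraction

outcomes = set(range(1, 7))          # the six faces of a fair die

def prob(event):
    """Probability of an event = favourable outcomes / all outcomes."""
    return Fraction(len(event & outcomes), len(outcomes))

even = {2, 4, 6}
odd = {1, 3, 5}
greater_than_3 = {4, 5, 6}

# Complementary probability: P(not A) = 1 - P(A)
p_not_even = 1 - prob(even)

# Mutually exclusive events (no shared outcomes): P(A or B) = P(A) + P(B)
p_even_or_odd = prob(even) + prob(odd)

# Joint occurrence: P(A and B), both events happening on the same roll
p_joint = prob(even & greater_than_3)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_even_or_gt3 = prob(even) + prob(greater_than_3) - p_joint

# Conditional probability: P(A given B) = P(A and B) / P(B)
p_even_given_gt3 = p_joint / prob(greater_than_3)

# Multiplication rule for independent events (two separate rolls):
# P(six, then six) = P(six) * P(six)
p_two_sixes = prob({6}) * prob({6})

print(p_not_even, p_even_or_odd, p_joint)             # 1/2 1 1/3
print(p_even_or_gt3, p_even_given_gt3, p_two_sixes)   # 2/3 2/3 1/36
```

Using `Fraction` keeps the results exact, so they can be compared directly with hand calculations.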
Central Limit Theorem
You will need to understand the central limit theorem and how it relates to confidence intervals, hypothesis testing, and control charts.