Six Sigma Online

The Omnipotence of Random Sampling Distributions

Every answer to statistical problems lies within RSD

As I was teaching class the other day, I told the students I was going to reveal to them the one secret they needed to learn to understand every statistical test they would ever use. The secret was the one thing that would make statistics more of a reasonable science than a bunch of equations to memorize, the one thing they needed to pass my class. (OK, there is a lot more needed to pass the class, but without this one thing doing so is a lot harder.)

When I was an undergrad taking my probability and statistics course, I was definitely in “memorize equations” mode. I didn’t really understand why sometimes we divided by the square root of n and sometimes we didn’t for the z- or t-score. I got an A in that class, but I really didn’t “get” it.

It wasn’t until years later, as I took a series of classes in experimental design, that I finally understood that the answer to life, the universe, and everything statistical was the random sampling distribution. (What did you think it was going to be?)

For instance, if I have a population about which I want to know something, I usually can’t test it in its entirety—the population might extend into the future, so how am I going to test that? Instead, I work with what I have, a subset of the population called the “research population,” from which I can sample. I sure hope the research population does a good job representing the population, otherwise I lack external validity for any conclusions I make.

Well, that research population is still pretty big, and I really don’t want to sample all production for this month, so I take a sample from the research population. I could grab the top 10 in a box, in which case I have no guarantee that the top 10 weren’t the lightest or whatever; or I could take a random sample and hope that the sample represents the population. There are a number of clever ways to get a random sample, but that might be a topic for another article.
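To make the top-10-in-a-box worry concrete, here is a minimal sketch. The box of 1,000 part weights and its slow upward drift are made up for illustration; they are not data from any real process.

```python
import random

random.seed(1)

# Hypothetical box: 1,000 part weights packed in production order, with a
# slow upward drift so the items on top happen to be the lightest ones.
box = [100 + 0.01 * i + random.gauss(0, 1) for i in range(1000)]

top_ten = box[:10]                    # convenience sample: first 10 in the box
random_ten = random.sample(box, 10)   # simple random sample, no replacement

box_mean = sum(box) / len(box)
top_mean = sum(top_ten) / len(top_ten)
random_mean = sum(random_ten) / len(random_ten)

print(f"whole box mean: {box_mean:.2f}")
print(f"top-10 mean:    {top_mean:.2f}")
print(f"random-10 mean: {random_mean:.2f}")
```

The top-10 average sits well below the box average because of the drift I built in, while the random sample has at least a fighting chance of landing near it.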

So in CSI terms, I have kind of a chain of evidence, population to research population to sample, as illustrated in figure 1:

heretic_rsd

Figure 1: A sampling chain of evidence

At this point, if I have taken a good sample, my random sample tells me something about the research population, which tells me something about the population as a whole. It’s kind of roundabout, I know, but the alternative is measuring the whole population, which isn’t really an economically viable alternative.

Now I get to calculate some numbers that are related to three of the four aspects of any data set that we need to know: shape, spread, and location. (The last one, stability over time, is shown via a control chart.) Because these characteristics are calculated from the sample, we call them “statistics,” and they could include skewness and kurtosis (i.e., shape); range, standard deviation, and interquartile range (spread); and mean, median, or mode (location).

But I don’t really care about the sample; that bus has come and gone. What I care about is the population. So I need to know how the statistics are related to the population parameters, which are what I really care about. Once I do know, I can make some inferences as to what those parameters are.

If I did everything correctly and I am a little lucky, my inferences about the population parameters are close to the real population parameters. But once I move away from my sample, things get fuzzier due to differences between the sample statistics and the population from which they were drawn. This is called “sampling error.” I don’t expect to get exactly the average of the population with my sample, but it turns out that sampling error is quantifiable with a certain probability, so I can bound my estimate with a certain level of known sampling error. (This is where confidence intervals come from.)

The whole chain of evidence pointing back to the perpetrator would look something like figure 2:

heretic_rsd2

Figure 2: Inferences about population based on random samples

If I messed up at some point in the chain, then everything from that point forward is pretty suspect.

OK, so the distribution of my samples should look something like the distribution of my population, right? Let’s simulate a normal population with an average of 100 and a standard deviation of 10, and take some samples from it. The result would look like figure 3:

heretic_rsd3

Figure 3: Samples from a simulated normal population (mean 100, standard deviation 10)
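If you want to play along at home, a simulation like this can be sketched with nothing but Python's standard library. The seed, the number of samples (five), and the helper name `draw_sample` are my own choices for illustration, not the settings behind the figure.

```python
import random
import statistics

random.seed(42)
MU, SIGMA, N = 100, 10, 10   # population mean, population SD, sample size

def draw_sample(n=N):
    """Draw one random sample of size n from the simulated normal population."""
    return [random.gauss(MU, SIGMA) for _ in range(n)]

samples = [draw_sample() for _ in range(5)]
for i, s in enumerate(samples, start=1):
    print(f"sample {i}: mean = {statistics.mean(s):6.2f}, "
          f"sd = {statistics.stdev(s):5.2f}")
```

Each sample bounces around the population values of 100 and 10 without hitting them exactly, which is the sampling error at work.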

However, I don’t want to spend the rest of my life taking a bunch of samples; I only want to take one sample. If we can find some relationship between the samples and the population, then maybe I can get away with using fewer, maybe even using only one.

So imagine we take all possible samples of size 10. All those individuals will look exactly like the big honkin’ distribution (BHD) in figure 3, right? But what if I make a distribution of the averages of all possible samples? Well, it makes sense that such a distribution of averages is still going to have the same average as the individuals, but since I am taking the average of 10, I’ll bet the averages are closer to the real average, on average. (If that sentence makes you cross your eyes, my work here is done.)

Figure 4 illustrates this distribution of the sample averages, cleverly called the “random sampling distribution (RSD) of the means.”

heretic_rsd3

Figure 4: Random sampling distribution of the means

I simulated only 15,000 means, but it should be pretty close to the random sampling distribution, which is all possible means of the given sample size of 10.

Now I see a much narrower distribution than I saw on the BHD or any individual sample (figure 3). It makes sense because I’m making a histogram of the means of each sample, right?

So why do I care? Because any time I take a sample of size 10 from my population, its average is going to fall somewhere on that little RSD. The RSD for the means tends to be normally distributed, regardless of the shape of the population, if the sample size is large enough. And because the normal curve is defined by the mean (which we know) and the standard deviation, that leaves only one more step before we can fully describe the RSD of the means.
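The "regardless of the shape of the population" part can be checked with a quick sketch. Here I assume a badly skewed population (exponential with a mean of 100, my choice, not one used in this article) and watch the skewness of the sample means shrink as the sample size grows.

```python
import random

random.seed(7)

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def simulated_means(sample_size, n_samples=5000):
    """Means of samples from an assumed exponential population (mean 100)."""
    return [
        sum(random.expovariate(1 / 100) for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

skew_by_n = {n: skewness(simulated_means(n)) for n in (2, 10, 50)}
for n, sk in skew_by_n.items():
    print(f"n = {n:2d}: skewness of the sample means = {sk:.2f}")
```

The skewness heads toward zero as n goes up, which is the RSD of the means creeping toward the normal shape.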

If we look at the distribution of the individuals, we know that the mean is 100 and the standard deviation is 10. Figure 5 shows the statistics from 15,000 samples from that population:

heretic_rsd5

Figure 5: Statistics from 15,000 samples

The mean is pretty darn close to 100, as we expected, but the standard deviation of the means is a lot smaller than the 10 we know it to be for the population. That’s because, as we take larger and larger samples, the mean of the sample is going to get closer and closer to the real mean. Theory tells us the standard deviation of the RSD of the means is:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{10}} \approx 3.16$$

We see that our standard deviation is again pretty close to what we would have expected.
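Here is a rough sketch of that comparison: simulate 15,000 sample means and set their standard deviation next to σ divided by the square root of n. The seed is arbitrary, so the numbers will not match figure 5 digit for digit.

```python
import math
import random
import statistics

random.seed(2024)
MU, SIGMA, N, N_SAMPLES = 100, 10, 10, 15_000

# One mean per simulated sample of size N.
means = [
    statistics.fmean([random.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(N_SAMPLES)
]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
theoretical_sd = SIGMA / math.sqrt(N)   # sigma / sqrt(n)

print(f"mean of the means:    {mean_of_means:.3f}")
print(f"SD of the means:      {sd_of_means:.3f}")
print(f"sigma / sqrt(n):      {theoretical_sd:.3f}")
```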

Now this is only valid for the RSD of the means. The RSDs of other statistics look different, as shown in figure 6:

heretic_rsd7

Figure 6: Random sampling distribution of the range

The RSD of the range is positively skewed and leptokurtic. There is no population parameter for range for a normal distribution, but you can use the average range to infer something about the standard deviation, like we do in statistical process control (SPC). Remember:

$$\hat{\sigma} = \frac{\bar{R}}{d_2}$$

For a sample size of 10, d2 is 3.078. If we take our average range and divide by that, we get 10.04, again pretty close to what we know σ to be.
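The range-based estimate can be sketched the same way. The d2 value is the one quoted above; the seed and variable names are mine, so expect a result near 10 rather than exactly 10.04.

```python
import random
import statistics

random.seed(3)
MU, SIGMA, N, N_SAMPLES = 100, 10, 10, 15_000
D2 = 3.078   # SPC constant for subgroups of size 10, as quoted in the text

ranges = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    ranges.append(max(sample) - min(sample))

average_range = statistics.fmean(ranges)
sigma_hat = average_range / D2   # estimate of sigma from the average range

print(f"average range:         {average_range:.2f}")
print(f"estimated sigma (R/d2): {sigma_hat:.2f}")
```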

Now let's look at the RSD for standard deviations.

heretic_rsd9

Figure 7: Random sampling distribution of the standard deviation

Notice how the mean of the RSD of the standard deviations sits below the real value? That is why the average standard deviation is a “biased” estimate of σ. The RSD for the variance, though, is unbiased. Weird, huh? It is positively skewed. (Get it?) Remember this from SPC, though?

$$\hat{\sigma} = \frac{\bar{s}}{c_4}$$

For a sample size of 10, c4 = 0.9727, so using our average we would estimate σ to be 10.02.
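A sketch of the bias and its correction, assuming the usual gamma-function formula for c4 (the formula is standard SPC theory; the helper name `c4` and the seed are mine):

```python
import math
import random
import statistics

random.seed(5)
MU, SIGMA, N, N_SAMPLES = 100, 10, 10, 15_000

def c4(n):
    """SPC bias-correction constant for the sample standard deviation."""
    return math.sqrt(2 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

sds, variances = [], []
for _ in range(N_SAMPLES):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    sds.append(statistics.stdev(sample))            # n - 1 in the denominator
    variances.append(statistics.variance(sample))   # n - 1 in the denominator

mean_sd = statistics.fmean(sds)
mean_var = statistics.fmean(variances)

print(f"c4 for n = 10:          {c4(N):.4f}")
print(f"average SD (biased):    {mean_sd:.2f}")
print(f"average SD / c4:        {mean_sd / c4(N):.2f}")
print(f"average variance:       {mean_var:.1f}  (sigma^2 = {SIGMA ** 2})")
```

The average standard deviation comes in low, dividing by c4 pulls it back toward 10, and the average variance lands near 100 with no correction at all.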

Moving on to the RSD for skewness:

heretic_rsd11

Figure 8: Random sampling distribution of the skewness

The RSD of the skewness looks to the eye to be normal-shaped, which is why we never rely on our eyes to do statistical tests. It is symmetrical, but it is leptokurtic, not normal. This is an unbiased estimator, so the average skewness is pretty close to the zero that we know it to be.

How about kurtosis?

heretic_rsd12

Figure 9: RSD of the kurtosis

The RSD of the kurtosis is positively skewed and is very leptokurtic. It, too, is unbiased, so we are pretty close to zero.
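One caution if you try this yourself: whether skewness and kurtosis come out unbiased depends on which formulas your software uses. The sketch below assumes the small-sample-adjusted versions (often written G1 and G2) and compares them with the plain moment-based excess kurtosis. Whether the figures above were built with exactly these formulas is my assumption.

```python
import math
import random

random.seed(9)
MU, SIGMA, N, N_SAMPLES = 100, 10, 10, 15_000

def shape_stats(x):
    """Return (adjusted skewness G1, moment kurtosis g2, adjusted kurtosis G2)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    g1 = m3 / m2 ** 1.5        # plain moment skewness
    g2 = m4 / m2 ** 2 - 3      # plain moment excess kurtosis
    big_g1 = math.sqrt(n * (n - 1)) / (n - 2) * g1
    big_g2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return big_g1, g2, big_g2

results = [
    shape_stats([random.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(N_SAMPLES)
]
avg_skew = sum(r[0] for r in results) / N_SAMPLES
avg_kurt_plain = sum(r[1] for r in results) / N_SAMPLES
avg_kurt_adjusted = sum(r[2] for r in results) / N_SAMPLES

print(f"average adjusted skewness: {avg_skew:+.3f}")
print(f"average plain kurtosis:    {avg_kurt_plain:+.3f}")
print(f"average adjusted kurtosis: {avg_kurt_adjusted:+.3f}")
```

With samples of 10 from a normal population, the adjusted versions average out near zero, while the plain kurtosis averages noticeably below zero.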

Again, why do I care? Remember, the RSDs are theoretical distributions of a statistic over every possible sample of size n from a given BHD, and I can build one without even taking a sample. Then, when I do take a sample, its statistic falls somewhere in that distribution, provided the BHD really is what I assumed it to be. This is the basis for hypothesis tests. Looking at all the goofy shapes of the RSDs above, you can understand that there are going to be different tests for the different parameters. Also, once we get a sample, we can bound the error of the parameter estimate if we understand the RSD, which is where confidence intervals come from.
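As a small taste of both ideas, here is a sketch that locates one sample mean on the RSD of the means and then puts a 95-percent confidence interval around it. To keep it short I assume σ is known to be 10, which makes this a z-based calculation. The "true" mean of 112 that generates the sample is invented for the example.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(11)
N = 10
HYPOTHESIZED_MU, KNOWN_SIGMA = 100, 10   # the BHD assumed under the hypothesis

# One observed sample; it secretly comes from a population centered at 112.
sample = [random.gauss(112, KNOWN_SIGMA) for _ in range(N)]
x_bar = statistics.fmean(sample)

# RSD of the means under the hypothesized BHD: normal, SD = sigma / sqrt(n).
rsd = NormalDist(mu=HYPOTHESIZED_MU, sigma=KNOWN_SIGMA / math.sqrt(N))

# Hypothesis test: how far out on the RSD did our sample mean land?
z = (x_bar - rsd.mean) / rsd.stdev
p_two_sided = 2 * (1 - rsd.cdf(rsd.mean + abs(x_bar - rsd.mean)))

# Confidence interval: bound the estimate using the spread of the same RSD.
z_crit = NormalDist().inv_cdf(0.975)
half_width = z_crit * rsd.stdev
ci = (x_bar - half_width, x_bar + half_width)

print(f"sample mean = {x_bar:.2f}, z = {z:.2f}, p = {p_two_sided:.4f}")
print(f"95% CI for the population mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

A small p-value says the sample mean landed far out in the tail of the RSD I would expect if the population mean really were 100. With σ estimated from the sample instead of known, the t distribution would take the place of the normal here.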

So the idea of the RSD underlies just about everything in inferential statistics. And yet, in my experience, people who use statistics daily often have no clue what an RSD is. That means they are probably making very costly mistakes without ever knowing it.

Do you want to see how the RSDs are used? Tune in next month.

This article was originally published in Quality Digest. Subscribe to Quality Digest if you would like to receive these articles when they are published, or you can get an RSS feed that is updated two weeks after they are first published.
