## To err is human, but to really mess things up you need a statistician.

Have you ever met people who “do” statistical process control (SPC) only to get some screwy-looking control chart, and then text: OMG I H8 SPC! (If you don’t understand that, ask your nine-year-old child or grandchild.)

Last month we saw that this is usually not a failure of SPC, but rather an EBKAC (error between keyboard and chair). As I wrote in that article, “Why Doesn't SPC Work?”, perhaps they skipped the measurement system analysis, or perhaps they ignored autocorrelation in a continuous process. But you batch process folks are not off the hook, and that is what this month’s article is about.

A batch process makes a bunch of stuff all at once, like baking a batch of cookies. The batter was mixed at one time, and the cookies went onto the pan and into the oven together, so we expect all the cookies in that batch to be pretty much the same. (If you have ever baked cookies or run a batch process, I’ll bet you can identify at least three sources of variability within this batch. Don’t tell anyone yet, though. We already know you’re smart since you are reading my article, and you would be stepping on my punchline.)

So let’s say you are running a batch process baking cookies (or heat treating, or forming, or whatever). You want to put some SPC in place, so you naturally start with the process’s critical quality characteristic: you take a sample of five from each batch (yum!) and test its “color.” The textbook tells you to do an X-bar and R chart, and you end up with a chart like the one in figure 1.

Figure 1: Mean and Range Chart: Color

Now you have a decision to make. You could say, “I am not going to show this to anyone! There is no way our people can react to this chart. They would go crazy!” (Lest you think I am making this up, this exact situation came up at a business where I consulted. The practitioner had been keeping this chart for four years and had never shown it to anyone.)

Or you could say, “This is my process screaming at me—maybe I should listen.”

So what is going on here? It all hinges on understanding how control charts work. They are not magic; they work for sound scientific reasons. The limits on the top chart show the range of variation expected in the sample averages (each the mean of five pieces), and the limits on the bottom chart show the variation expected in the sample ranges. However, the limits for the averages are calculated from the average range. I know, it seems weird, but the reasoning makes sense. Here’s why:

The purpose of a control chart is to tell whether a process is affected only by common cause variability, which is inherent to the current process, or by both common and special cause variability, meaning something unusual is affecting the output. We use a control chart to help identify what these special causes might be so that we can eliminate them, leaving us with the true underlying process variation. Deming called this “finding the process.” Once a process is in control, it’s a lot easier to take the next step and improve the process by reducing that inherent variation or moving the average onto target.

Calculating the expected common cause variation of the process is a bit trickier than you might think. Suppose I had a process with some special causes in it, and I used all the data, *including the special causes*, to calculate my expected variation. The special causes would inflate the limits and make it harder to detect that they are there at all. Makes sense, right? If I calculate the standard deviation across all the data points, including ones that differ because of special causes, I’ll end up with a larger standard deviation than the underlying process really has. That leads to wider limits, which in turn lets some special cause events fall within the control limits and pass for common cause variation.
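A quick simulation makes the point. The Python sketch below uses invented data: 20 samples of five drawn from a single normal distribution (mean 50, standard deviation 1), then the same data with a hypothetical special cause that shifts three of the samples up by 4 units. The function names are illustrative only, not from any SPC package.

```python
import random
import statistics

def make_samples(num_samples=20, size=5, mean=50.0, sigma=1.0, seed=1):
    """Hypothetical common-cause-only data: every value comes from one normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(mean, sigma) for _ in range(size)] for _ in range(num_samples)]

def add_special_cause(samples, shifted, shift):
    """Copy the data and add a constant shift to the listed samples (a simulated special cause)."""
    return [[x + shift for x in s] if i in shifted else list(s)
            for i, s in enumerate(samples)]

def overall_sd(samples):
    """Sample standard deviation of all values pooled together, ignoring which sample they came from."""
    return statistics.stdev([x for s in samples for x in s])

clean = make_samples()
dirty = add_special_cause(clean, shifted={4, 11, 17}, shift=4.0)
print(f"overall SD, common cause only:   {overall_sd(clean):.2f}")
print(f"overall SD, with special causes: {overall_sd(dirty):.2f}")
```

The second number comes out noticeably larger even though the underlying noise is identical; limits built from it would be wide enough to hide the very shifts that inflated them.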

So how can I use my data to generate the expected limits of variation for the process when the data themselves might have special cause variation in them? It sounds like a chicken-and-egg situation, but by using our brains, we can see a possible way out.

In our batch process, we are taking a sample of five from each batch. If a process is totally unaffected by special cause variation, then the variation within each sample and the variation between samples come from the same source: the within variation. Another way of saying this is that the only reason the average of sample one differs from the average of sample two is sampling error within each sample, not any real difference between the two samples. Add to that the fact that the pieces in each sample are probably as similar as we know how to make them, and we can hope that the variation within each sample is the minimum variation, uninfluenced by special causes. Even if there are a few within-sample oddballs, we will be using the average range, so their effect will tend to be damped out.
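The same argument can be written as a short derivation. Assuming only common cause variation is present and the n pieces in each sample are roughly normal and independent with standard deviation σ, the average range estimates σ through the tabulated constant d2, and the expected spread of the sample averages follows from it:

$$
\hat{\sigma} = \frac{\bar{R}}{d_2}, \qquad
\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{\bar{R}}{d_2\sqrt{n}}, \qquad
\bar{\bar{X}} \pm 3\hat{\sigma}_{\bar{X}} = \bar{\bar{X}} \pm \frac{3}{d_2\sqrt{n}}\,\bar{R} = \bar{\bar{X}} \pm A_2\bar{R}
$$

For n = 5, d2 = 2.326 and A2 = 3/(2.326 × √5) ≈ 0.577. The spread between the sample averages never enters this calculation, which is why extra batch-to-batch variation shows up as averages outside the limits rather than as wider limits.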

So it turns out that the variation within each sample (the average range in this case) is actually a better estimate of the underlying process variation than the spread of the raw data, because the raw data themselves would contain the special causes.
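Here is a minimal Python sketch of that calculation for samples of five, using the usual tabulated constants for n = 5 (A2 = 0.577, D3 = 0, D4 = 2.114, d2 = 2.326). The data and the function name are hypothetical. Note that adding a constant to every piece in one sample, which is what a batch-level special cause does, moves that sample’s average but leaves its range, and therefore R-bar and the width of the limits, unchanged.

```python
import statistics

# Tabulated control chart constants for samples of n = 5.
A2, D3, D4, D2 = 0.577, 0.0, 2.114, 2.326

def xbar_r_limits(samples):
    """X-bar and R chart (LCL, center, UCL) computed from the average range, R-bar."""
    means = [statistics.fmean(s) for s in samples]
    ranges = [max(s) - min(s) for s in samples]
    grand_mean = statistics.fmean(means)  # X-double-bar: center line of the averages chart
    r_bar = statistics.fmean(ranges)      # R-bar: center line of the range chart
    return {
        "xbar": (grand_mean - A2 * r_bar, grand_mean, grand_mean + A2 * r_bar),
        "range": (D3 * r_bar, r_bar, D4 * r_bar),
        "sigma_within": r_bar / D2,       # estimated common-cause standard deviation
    }

# Hypothetical color readings, one row per batch.
samples = [
    [10.2, 9.8, 10.1, 10.0, 9.9],
    [10.0, 10.3, 9.7, 10.1, 10.0],
    [9.9, 10.0, 10.2, 9.8, 10.1],
    [10.1, 9.9, 10.0, 10.2, 9.8],
]
# Same data, but the second batch is shifted up by 1.5 units as a whole.
shifted = [[x + 1.5 for x in s] if i == 1 else s for i, s in enumerate(samples)]

print(xbar_r_limits(samples))
print(xbar_r_limits(shifted))  # same limit widths; only the center line (and limits with it) moves
```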

Which still leaves us with the messed-up chart in figure 1.

Now that I have reminded you where those limits come from, can you guess what is going on? The tight limits indicate that the “within variability” is much smaller than the “between variability.” How could that happen?

OK, now those of you who have made cookies and run batch processes can shout out the answer. Go ahead—your co-habitants won’t mind. They already know you are prone to random verbal outbursts when you are on your computer.

What if there was an additional source of variability between batches? Maybe I mix up each batch but am sloppy on the measurements, so each batch is essentially a somewhat different recipe. How about this: each time I open the oven to take out some cookies the temperature changes, and the thermal control system either under- or over-corrects, leading to a different thermal profile for each batch. If so, this process is batch or setup dependent.
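To see how a batch-to-batch effect produces this symptom, here is a small Python simulation. The numbers are invented for illustration: each hypothetical batch gets its own random offset (standard deviation 3.0, standing in for a slightly different recipe or oven profile), while the five pieces within a batch vary only a little around that offset (standard deviation 0.5). The limits are computed from R-bar, exactly as an X-bar and R chart would do.

```python
import random
import statistics

A2, D4 = 0.577, 2.114  # tabulated constants for samples of n = 5 (D3 = 0)

def simulate_batches(num_batches=25, size=5, sd_between=3.0, sd_within=0.5, seed=7):
    """Hypothetical batch process: each batch gets its own offset, then its
    pieces vary around that offset with only the small within-batch spread."""
    rng = random.Random(seed)
    batches = []
    for _ in range(num_batches):
        batch_mean = 50.0 + rng.gauss(0.0, sd_between)
        batches.append([batch_mean + rng.gauss(0.0, sd_within) for _ in range(size)])
    return batches

def count_signals(batches):
    """Count averages and ranges falling outside limits computed from R-bar."""
    means = [statistics.fmean(b) for b in batches]
    ranges = [max(b) - min(b) for b in batches]
    center, r_bar = statistics.fmean(means), statistics.fmean(ranges)
    xbar_out = sum(abs(m - center) > A2 * r_bar for m in means)
    r_out = sum(r > D4 * r_bar for r in ranges)
    return xbar_out, r_out

xbar_out, r_out = count_signals(simulate_batches())
print(f"averages outside X-bar limits: {xbar_out} of 25")
print(f"ranges outside R limits:       {r_out} of 25")
```

With the batch offset turned on, most of the averages land outside the X-bar limits while the range chart stays quiet; setting `sd_between=0.0` makes the X-bar signals essentially disappear.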

This control chart is screaming at you about this batch-to-batch variability and, like a filthy car with “Clean Me!” written on the window, it is telling you what to do. To get this process to produce to its underlying variability, you will have to investigate to determine the sources of the batch-to-batch changes and eliminate them. The chart has given you a huge hint to help out in the search; whatever it is, it changes from batch to batch, so talking to your operators and looking at your process logs (You do have process logs when you bake cookies, right?) would be the first step in figuring this out.

By the way, you can test for this situation in your control chart data by doing a random-effects, one-way analysis of variance (ANOVA), using each batch as a level. If the between variability is larger than the within variability predicts, the ANOVA will come out significant. This is built right into MVPstats, which is what I used to generate these charts. The ANOVA will also detect much more subtle differences between the within variability and the between variability than a control chart will, so I use it as a diagnostic on all X-bar charts.
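For the curious, here is a minimal Python sketch of the textbook arithmetic behind a balanced one-way random-effects ANOVA. It is not how MVPstats implements the test; it assumes every batch contributes the same number of pieces, and the function name is hypothetical. To turn F into a p-value, compare it with an F table or, if SciPy is available, use `scipy.stats.f.sf(F, df_between, df_within)`.

```python
import math
import statistics

def one_way_random_effects(groups):
    """Balanced one-way ANOVA treating each batch as a level of a random factor.
    Returns the F statistic, its degrees of freedom, and the estimated
    within-batch and between-batch variance components."""
    k, n = len(groups), len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes the same number of pieces per batch")
    group_means = [statistics.fmean(g) for g in groups]
    grand_mean = statistics.fmean(group_means)  # equals the overall mean when balanced
    ss_within = math.fsum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ss_between = n * math.fsum((m - grand_mean) ** 2 for m in group_means)
    df_between, df_within = k - 1, k * (n - 1)
    ms_between, ms_within = ss_between / df_between, ss_within / df_within
    return {
        "F": ms_between / ms_within,
        "df": (df_between, df_within),
        "var_within": ms_within,
        # Expected MS_between = var_within + n * var_between, so solve for var_between.
        "var_between": max(0.0, (ms_between - ms_within) / n),
    }

result = one_way_random_effects([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result)  # F = 27.0 with df = (2, 6)
```

A large F, with a between-batch variance component well above zero, is the numerical version of the chart in figure 1 screaming at you.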

Using similar reasoning, can you figure out how the following situation, as charted in figure 2, might occur? (The limits are calculated from the data you see, so it’s not that we have improved the process over where it used to be.)

Figure 2: Mean and Range Chart: Thickness

The first one who posts the correct answer on the discussion page gets bragging rights! (This effect happened on a control chart at that same business, too. Go figure.)

Next month, I’ll reveal the answer and finish up with some other errors that I have seen when people make control charts without using the most important tool in the SPC tool box… their brains.