Chapter 7 Highlighting Freaks, Shifts, and Trends
In the Chapter 6 we calculated control limits for commonly used control charts. Control limits show the boundaries of the natural common cause process variation. Thus, data points outside the control limits (freaks) are signals of special cause variation in data – that is, unexpected change caused by something outside the usual system.
Control limits are designed to signal rather large (> 2SD), possibly transient, changes in the system. However, as discussed in detail in Chapter 3, minor to moderate changes in the form of shifts or trends in data may go unnoticed by the control limits for long periods of time. For this purpose, many supplementary tests (rules) have been suggested. To balance the need to quickly detect special causes while keeping the false alarm rate as low as possible we recommend using the 3-sigma rule to signal freaks and the two runs rules for unusually long runs and unusually few crossings to signal shifts and trends.
In this chapter we will improve the spc()
function to automatically visualise special cause variation in the form of freaks, shifts, and trends.
7.1 Introducing the cdiff data set
The cdiff data set contains 24 observations of monthly numbers of hospital acquired Clostridioides difficile infections from an acute care hospital.
# read data from file
cdiff <- read.csv('data/cdiff.csv',
comment.char = '#',
colClasses = c('Date',
'integer',
'integer'))
# calculate centre line and control limits
cdiff <- within(cdiff, {
cl <- mean(infections)
lcl <- pmax(0, cl - 3 * sqrt(cl)) # censor lcl at zero
ucl <- cl + 3 * sqrt(cl)
})
# print the first six rows of data
head(cdiff)
## month infections risk_days ucl lcl cl
## 1 2020-01-01 12 19801 11.77776 0 5.041667
## 2 2020-02-01 7 18674 11.77776 0 5.041667
## 3 2020-03-01 1 15077 11.77776 0 5.041667
## 4 2020-04-01 4 12062 11.77776 0 5.041667
## 5 2020-05-01 4 14005 11.77776 0 5.041667
## 6 2020-06-01 5 14840 11.77776 0 5.041667
Figure 7.1 shows data plotted with spc()
function. We see that there is one data point (#1) above the upper control limit. If we look carefully we also find an unusually long run of 11 data points below the centre line (#14-#24) and that the curve crosses the centre line only 7 times. Thus, in addition to the freak, there is also a shift in data, which is not large enough to break the limits but sustained enough to trigger the runs rules.
Figure 7.1: C control chart of CDiff infections
Notice that we censored the lower control limit at zero even if the exact value is negative (\(\bar{c}-3\sqrt{\bar{c}}\)). This is purely cosmetic, but since the data values themselves cannot be negative it makes little sense to show negative control limits. For the same reason we usually censor P chart control limits at 0 and 1 (100%) respectively.
To help us signal special cause variation we will improve the spc()
function.
7.2 Improved spc()
function
The new, improved spc()
function has a few changes:
Line 10: Import a function,
runs.analysis()
, from a separate R script to test for unusually long runs and unusually few crossings. See this function in the R function for runs analysis section at the end of this chapter.Lines 19-20: If no cl argument is given, use the median for centre line.
Lines 33-34: Create a logical vector identifying data points that lie outside the control limits.
Line 37: Test for unusually long runs or unusually few crossings.
Line 40: Start with an empty plot.
Line 47-49: Format the centre line according to the result of runs analysis.
Lines 56-49: Add the data line and points and colour data points outside control limits.
spc <- function(
x, # x axis values
y = NULL, # data values
cl = NULL, # centre line
lcl = NA, # lower control limit
ucl = NA, # upper control limit
... # other parameters passed to the plot() function
) {
# load runs analysis function from R script
source('R/runs.analysis.R')
# if y is missing, set y to x and make a sequence for x
if (is.null(y)) {
y <- x
x <- seq_along(y)
}
# if cl is missing use median of y
if (is.null(cl))
cl <- median(y, na.rm = TRUE)
# repeat line values to match the length of y
if (length(cl) == 1)
cl <- rep(cl, length(y))
if (length(lcl) == 1)
lcl <- rep(lcl, length(y))
if (length(ucl) == 1)
ucl <- rep(ucl, length(y))
# find data points outside control limits (freaks)
sigma.signal <- y < lcl | y > ucl
sigma.signal[is.na(sigma.signal)] <- FALSE
# check for sustained shifts and trends using runs analysis
runs.signal <- runs.analysis(y, cl)
# make empty plot
plot(x, y,
type = 'n',
ylim = range(y, cl, lcl, ucl, na.rm = TRUE),
...)
# add centre line, coloured and dashed if shifts or trends were identified by
# the runs analysis
lines(x, cl,
col = runs.signal + 1,
lty = runs.signal + 1)
# add control limits
lines(x, lcl)
lines(x, ucl)
# add data line and points, colour freak data points (outside control limits)
lines(x, y)
points(x, y,
pch = 19,
col = sigma.signal + 1)
}
Figure 7.2: Improved control chart with visual clues to highlight special cause variation
Now it is a lot easier to immediately see if a chart signals special cause variation or not (Figure 7.2). Freak data points are red, and the centre line turns red and dashed if there are any unusually long runs or if the curve crossed the centre line unusually few times.
Remember, the chart itself does not tell us what caused the signals. This interpretation of a chart – common or special cause variation – still relies on humans with a deep understanding the process and the data.
7.3 Highlighting special cause variation in short
In this chapter we have improved the spc()
function to automatically highlight signs of special cause variation using visual clues that signal special cause variation.
In the next chapter we will improve the spc()
function further to automatically aggregate data and calculate control limits.
R function for runs analysis
runs.analysis <- function(y, cl) {
# trichotomise data according to position relative to CL
# -1 = below, 0 = on, 1 = above
runs <- sign(y - cl)
# remove NAs and data points on the CL
runs <- runs[runs != 0 & !is.na(runs)]
# find run lengths
run.lengths <- rle(runs)$lengths
# find number of useful observations (data points not on CL)
n.useful <- sum(run.lengths)
# find longest run above or below CL
longest.run <- max(run.lengths)
# find number of times adjacent data points are on opposite sides of CL
n.crossings <- length(run.lengths) - 1
# find upper limit for longest run
longest.run.max <- round(log2(n.useful)) + 3
# find lower limit for number of crossing
n.crossings.min <- qbinom(0.05, n.useful - 1, 0.5)
# return result, TRUE if either of the two tests is true, otherwise FALSE
longest.run > longest.run.max | n.crossings < n.crossings.min
}