B Basic Statistical Concepts

Understanding data is the cornerstone of effective decision-making in any quality improvement context. While this chapter is not intended as a comprehensive statistical tutorial, it introduces key concepts related to data types, basic statistical summaries, and visualisation techniques that support meaningful application of appropriate SPC charts and help ensure that valid data-driven conclusions can be drawn.

B.1 Data types

Data can be classified into several types, with the most common being categorical and numerical data.

B.1.1 Categorical data

Categorical data, also known as qualitative data, represent characteristics that cannot be meaningfully measured on a numerical scale, for example [red, green, blue], [medicine, surgery, psychiatry] or [low, medium, high]. Categorical data cannot be directly analysed using arithmetic operations – what is the sum of red, green and blue, for example?

Categorical data can be divided into:

  • Nominal data: Categories without a natural order, e.g. [red, green, blue] or [medicine, surgery, psychiatry].

  • Ordinal data: Categories with a meaningful order, but without consistent intervals between them, e.g. [low < medium < high].

If categorical data consist of only two possible values – such as yes/no, true/false or dead/alive – they are referred to as binary or binomial data.

Categorical data can be transformed into numerical form by counting the number of observations in each category. In some (rare) cases, it may be appropriate to assign numerical values to ordinal data, such as school grades, thereby enabling the use of arithmetic operations. However, we should be careful not to impose numerical values on any ordinal data unless the intervals are consistent and well-defined. It may be tempting to convert, for instance, cancer stages [I, II, III] into numbers [1, 2, 3]. But this only makes sense if the increase in severity from stage I to stage II is the same as the increase from stage II to stage III.
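In R, ordinal data are naturally represented as ordered factors, which preserve the ordering of the categories without imposing numerical intervals. A minimal sketch with made-up severity ratings:

# represent ordinal data as an ordered factor
rating <- factor(c('low', 'high', 'medium', 'low'),
                 levels  = c('low', 'medium', 'high'),
                 ordered = TRUE)
table(rating)  # counts are reported in the natural order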

Note that some categorical data – such as postal codes, insurance numbers and phone numbers – may appear numerical but are actually categorical. To identify such cases, ask whether it makes sense to calculate a sum or an average. If not, the values should be treated as categories rather than numbers.

B.1.2 Numerical data

Numerical data, also known as quantitative data, consist of values that represent measurable quantities. These data can be meaningfully analysed using arithmetic operations such as addition or subtraction.

Numerical data fall into two main subtypes:

  • Discrete data: Distinct, separate values that result from counting and are always expressed as whole numbers (including zero). Examples include number of pregnancies, pressure ulcers and hospital admissions.

  • Continuous data: Values obtained through measurement that can take on any value within a given range, including decimals, fractions and negative numbers. Examples include height, weight, waiting time and blood pressure.

Just as categorical data can be converted into numbers through counting, numerical data can be transformed into categories by grouping values into defined ranges – for example, converting exact blood pressure readings into hypertension categories such as normal, elevated, stage 1 hypertension, and stage 2 hypertension.
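In R, this kind of grouping can be sketched with the cut() function. The blood pressure values and cut-offs below are made up for illustration:

# group made-up systolic blood pressures (mmHg) into categories
sbp <- c(118, 124, 135, 152, 161)
cut(sbp,
    breaks = c(0, 120, 130, 140, Inf),
    labels = c('normal', 'elevated', 'stage 1', 'stage 2'),
    right  = FALSE)  # intervals closed on the left, e.g. [120, 130)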

B.2 Summarising categorical data

In this section we use the Adverse events dataset assigned to the variable ae.

As mentioned, categorical data are most commonly summarised by counting the number of observations in each category, typically presented as frequencies or proportions (or percentages). In R, this can be easily done using the table() function.

# count number of adverse events in each category
(ae.tbl <- table(ae$category))
## 
##             Fall Gastrointestinal        Infection       Medication 
##                1               40               34               18 
##            Other   Pressure ulcer        Procedure 
##                4                5               29
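Proportions are obtained by dividing each count by the total number of observations, which the prop.table() function does for us:

# convert counts to proportions
prop.table(ae.tbl)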

A useful visualisation of categorical data is the bar chart displaying either the frequencies or proportions in each category:

# plot frequencies
barplot(ae.tbl)

Figure B.1: Bar chart of adverse event frequencies.

# plot proportions
barplot(prop.table(ae.tbl))

Figure B.2: Bar chart of adverse event proportions.

The only difference between figures B.1 and B.2 is the y-axis scale.

For binary data, a summary may simply involve calculating the proportion or percentage of one category relative to the total, for example the proportion of adverse events with fatal outcome:

# proportion of fatal adverse events (severity = 'I').
mean(ae$severity == 'I')
## [1] 0.007633588

B.3 Summarising numerical data

For this section we use the Robson group 1 births and the Renography doses datasets assigned to the births and reno variables respectively.

Numerical data are commonly summarised using three key characteristics: central tendency, distribution shape and spread (variation). In R, the summary() function provides a quick overview of all three:

# summarise the length of newborn babies
summary(births$length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    35.0    50.0    52.0    51.7    53.0    60.0       3

summary() is a generic function that behaves differently depending on the type of input. With numerical input, it returns a table displaying the minimum and maximum values, the three quartiles (1st, Median, 3rd), the mean and the number of NA values (if any).

Quartiles are the three values that divide a dataset into four equally sized parts. 25% of the data values fall below the first quartile, 50% fall below the second quartile (median), and 75% fall below the third quartile.
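The quartiles may also be computed directly with the quantile() function; the na.rm argument removes missing values before the calculation:

# quartiles of birth lengths (cf. the summary above)
quantile(births$length, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)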

A most useful visualisation of the distribution of continuous data is the histogram:

hist(births$length)

Figure B.3: Histogram of birth lengths.

A histogram groups data into intervals, called bins, and shows how many values fall within each interval. It is similar to a bar chart, but the bars are touching to reflect the fact that the data are continuous.

Note that there is much more to histograms than meets the eye. For a deeper understanding and additional options, read the documentation (?hist).

B.3.1 Centre

The central tendency of a dataset refers to the typical or central value around which the data points cluster. It represents the centre of the distribution. Common measures of central tendency include the mean and the median.

The mean is perhaps the better-known of the two. It is calculated by summing all the values and dividing by the total number of values. For example, the mean of the values [1, 2, 3] is (1 + 2 + 3) / 3 = 2.

The median – often overlooked but equally important – is the middle value when the data are sorted in order. For example, given the values [2, 1, 3], the median is 2 (remember to sort first). If there is an even number of values, the median is the mean of the two central values; for instance, for [1, 2, 3, 4], the median is 2.5.

In the previous examples, the mean and the median are identical. This occurs when the data are perfectly symmetrical. However, when the data are skewed – that is, spread unevenly on either side of the centre – the mean and median can differ considerably. For example, consider the values [1, 2, 999]. The mean is 334, heavily influenced by the extreme value, whereas the median remains 2, still splitting the dataset into two halves.
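We can quickly confirm this in R:

# skewed toy data: the mean chases the extreme value, the median does not
x <- c(1, 2, 999)
mean(x)    # 334
median(x)  # 2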

Imagine a playground seesaw with a child and an adult seated at opposite ends. The mean corresponds to the balance point of the seesaw, shifting closer to the heavier adult, reflecting the weighted average of the two positions. In contrast, the median represents the midpoint of the plank itself, simply dividing the seesaw into halves regardless of weight.

The choice between using the mean or the median depends largely on the purpose of the analysis as well as the characteristics of the data distribution.

The mean is useful when we want to consider every value in the dataset and obtain a measure that reflects the overall average, especially when the data are symmetrically distributed without extreme outliers.

On the other hand, the median is often preferred when the data are skewed or contain outliers, as it provides a measure of central tendency that is resistant to extreme values and better represents the “typical” observation in such cases. In SPC, the median is particularly useful for runs analysis with run charts. Because the median reflects the midpoint of the data, there is an equal probability – fifty-fifty – that any individual data point will fall above or below it. This balance is crucial for runs analysis, as it ensures that the number and length of runs (consecutive points above or below the median) can be assessed against expected probabilities under random variation. Consequently, using the median helps accurately detect non-random trends or shifts in the process regardless of the shape of the data.

In our birth lengths data, the mean and the median are nearly identical: 51.7 vs 52.0.

Understanding the shape and spread of the data helps inform which measure will provide the most meaningful summary.

B.3.2 Shape

The shape of a dataset refers to the overall pattern or distribution of values. Two key features that describe the shape of data are symmetry and modality.

Symmetry describes how evenly the values are distributed around the centre. In a perfectly symmetrical distribution the left and right sides are mirror images and the mean and the median are identical.

When one side of the distribution extends further than the other, the data are said to be skewed. If the tail extends towards higher values, the distribution is right-skewed (positively skewed), and the mean is greater than the median. If it extends towards lower values, it is left-skewed (negatively skewed), and the mean is less than the median. Examples of data that are often right-skewed include waiting times and income, where most observations cluster around the lower end of the scale, while a few extreme values create a long tail extending towards higher values.

Symmetry is also characterised by the mean and median being approximately centred between the minimum and maximum values.

Because the mean and median of birth length are very close, we may assume that the data are approximately symmetrical. However, neither the mean nor the median is well centred between the minimum and maximum values, suggesting that the distribution may still be slightly skewed or influenced by outliers at one end.

The histogram in Figure B.3 provides a much clearer and more detailed picture of the shape of the data. The majority of values are symmetrically centred, but a few data points have unusually low values (birth length = 35 cm), suggesting a left skew. However, these outliers appear unlikely for full-term babies and should be investigated further before proceeding with analysis.

Modality refers to the number of peaks or modes in a distribution. A unimodal distribution has one clear peak, bimodal has two peaks, and multimodal has more than two. Bi- or multimodal distributions indicate the presence of important subgroups in the data that need to be taken into account during analysis.

hist(reno$dose)

Figure B.4: Histogram of radiation doses used for renography.

Figure B.4 shows an example of bimodal data, suggesting the presence of two distinct groups of renography procedures – each likely driven by a different underlying cause that is unknown to us. Before plotting these data – as we did in Chapter 6 – we should first seek an explanation for the observed pattern and consider stratifying the data according to the underlying procedural strategy.

B.3.3 Spread

The concept of spread in data is fundamental to SPC, as we have discussed numerous times throughout this book – although the term “spread” itself has not been explicitly introduced until now. Spread refers to the variability or diversity within the data, indicating how much the values differ from each other.

When constructing control limits, we rely on the standard deviation (SD or sigma) of common cause variation as a primary measure of spread. However, standard deviation is not the only useful spread indicator. As demonstrated above, the summary() function provides additional measures that reflect the spread of data.

The range represents the distance between the highest and lowest values in a dataset, while the interquartile range (IQR) captures the spread of the middle 50% of the data – calculated as the difference between the third and first quartiles.

The range is easy to understand but has limited usefulness on its own, as it tends to increase with sample size. As the number of observations grows, so does the likelihood of including rare – though not necessarily unusual – values.

The interquartile range is a more robust measure of spread and can also offer some insight into the shape of the distribution through the position of the centre relative to the quartiles. However, it becomes unreliable with very small sample sizes – with fewer than five values, calculating quartiles is effectively meaningless.
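Both measures are easily obtained in R. For the birth lengths (cf. the summary() output above):

# spread of birth lengths
range(births$length, na.rm = TRUE)  # minimum and maximum: 35 and 60
IQR(births$length, na.rm = TRUE)    # 3rd minus 1st quartile: 53 - 50 = 3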

A simple, useful and compact alternative to histograms is the box plot, which provides a visual summary of the central tendency and spread of a dataset.

boxplot(births$length)

Figure B.5: Boxplot of birth lengths.

The box in a box plot represents the interquartile range (IQR), while the line inside the box indicates the median. The whiskers extend from either end of the box to the smallest and largest values that lie within 1.5 times the IQR from the quartiles. Any dots beyond the whiskers represent more extreme values – often referred to as outliers, though there is often nothing truly outlandish about them.

As long as data have a unimodal shape, the box plot is a very effective visualisation of the centre, shape and spread of a distribution.

Box plots are especially useful with grouped data as in Figure B.6, which compares the birth length of boys and girls, suggesting that, on average, boys are slightly longer than girls.

boxplot(length ~ sex, births)

Figure B.6: Stratified boxplot.

The standard deviation is perhaps the most commonly used measure of spread in a dataset. It tells us, on average, how far from the mean the data values lie.

To calculate SD we use

\[ SD = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}} \]

That is, the square root of the sum of squared deviations from the mean divided by one less than the number of values.
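In R, the built-in sd() function does the work for us. As a sanity check, we can sketch the calculation by hand for the birth lengths:

# standard deviation by hand and with the built-in function
x <- births$length[!is.na(births$length)]  # drop missing values
sqrt(sum((x - mean(x))^2) / (length(x) - 1))
sd(x)  # same result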

SD can always be calculated, but its usefulness depends on the underlying distribution of the data. This is why we first introduced the range and interquartile range, which are reliable measures of spread regardless of the distribution.

When data follow a Gaussian (or normal) distribution – which we will talk more about in the next section – the standard deviation has some particularly useful properties:

Approximately

  • 68% of data values fall within ±1 SD of the mean,
  • 95% fall within ±2 SD, and
  • 99.7% fall within ±3 SD.
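These figures are easy to verify with the Gaussian distribution function, pnorm(), which we return to in Section B.4.3:

# areas under the standard normal curve within 1, 2 and 3 SDs of the mean
pnorm(1:3) - pnorm(-(1:3))
## [1] 0.6826895 0.9544997 0.9973002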

The first two rules depend heavily on the data’s underlying distribution. However, the third rule – that “most” data points fall within ±3 SD – holds true for many common distributions. Without delving into technical details, it can be shown that for any unimodal distribution, at least 95% of the data lie within three standard deviations of the mean. For many common unimodal distributions, especially those that are roughly symmetric, this proportion can rise to more than 98%.

This fact supports the use of three-sigma limits in SPC charts regardless of the distribution of data and cautions against adopting tighter limits solely to increase sensitivity.

B.4 Theoretical distributions

So far we have mainly concerned ourselves with empirical distributions – examining the centre, shape and spread of real-world datasets. In this section we will take a look at some important theoretical distributions.

A theoretical distribution is a mathematical model that describes the probabilities of different outcomes in a given situation. For instance, consider tossing a fair coin. While we cannot predict the outcome of any single toss, we know that the set of possible outcomes is [heads, tails], each with a probability of 1/2. Similarly, when rolling a fair six-sided die, the possible outcomes are [1, 2, 3, 4, 5, 6], each occurring with a probability of 1/6.

Given this type of information, we are able – through the use of probability theory – to predict the probability of any future outcome or combination of outcomes.

Numerous theoretical distributions have been described, but for our purposes in SPC, we are primarily concerned with three: the Poisson, the binomial, and the Gaussian (or normal) distribution. The Poisson and binomial distributions are used for count data, while the Gaussian distribution is applied to measurement data.

B.4.1 Poisson distribution – predicting the number of events

The Poisson distribution – named after the French mathematician Siméon Poisson – describes the probability of a given number of events occurring within a fixed time interval, assuming that the events happen independently of one another and at a constant average rate, commonly denoted by lambda (λ). It is used to estimate the control limits for C- and U-charts.

Based on the birth data, we estimate the expected birth rate to be approximately 2,200 births per year, averaging to about 6 births per day.

In R, we can use the dpois() function to calculate the probability (or density) of any given number of births in a day. For example, the probability of having exactly 9 births in one day is dpois(9, lambda = 6) = 0.0688385. The standard deviation of a Poisson distribution conveniently equals the square root of lambda, \(SD = \sqrt{\lambda}\), in this case approximately 2.45. When working with case rates, \(SD = \sqrt{\lambda / n_i}\), where \(n_i\) is the size of the ith time interval (the area of opportunity).
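The companion function ppois() returns cumulative probabilities, allowing us to answer questions such as: what is the probability of more than 8 births in a single day?

# probability of more than 8 births in a day
1 - ppois(8, lambda = 6)
## [1] 0.1527625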

To visualise discrete probability distributions, we use bar (or stick) charts as demonstrated in Figure B.7.

# plot Poisson probability
x <- 0:16 # x-axis values
plot(x, dpois(x, lambda = 6), 
     type = 'h',
     lwd  = 2,
     ylab = 'Probability',
     xlab = 'Number of births')

Figure B.7: Poisson probability plot of the number of births in a day (mean = 6).

Note that the Poisson distribution is bounded at zero – it cannot produce negative counts – but in principle, it extends towards infinity. As with all probability distributions, the sum of all probabilities is equal to 1.

For low values of lambda, the Poisson distribution is right-skewed; however, as lambda increases, the distribution becomes increasingly symmetric and begins to resemble the normal distribution.

B.4.2 Binomial distribution – predicting the number of cases of “success” or “failure”

The binomial distribution describes the probability of obtaining a given number of cases – often referred to as successes or failures – in a fixed number of independent trials (or opportunities), each with two possible outcomes (such as yes/no or pass/fail), and a constant probability of success.

The standard deviation of a binomial distribution is \(SD = \sqrt{np(1-p)}\), where n is the sample size and p the success proportion. Often, we prefer to work with the proportion of cases rather than the count (as in P-charts). In this case, the standard deviation of the proportion is calculated as \(SD = \sqrt{p(1-p)/n}\).

For example, if we consider each birth as a trial (opportunity) and define a case as a caesarean section (C-section), the binomial distribution can be used to model the probability of observing a specific number of C-sections within a set number of births.

In R, the dbinom() function is used to calculate binomial probabilities. In addition to the number of cases of interest, the function requires the total number of opportunities (sample size) and the average probability of a case. Based on the birth data, we estimate there are approximately 42 births per week, with about 9% resulting in a C-section. To calculate the probability of observing exactly 6 C-sections in a given week, we use the following command: dbinom(6, size = 42, prob = 0.09) = 0.093488.
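Plugging these figures into the formulas above gives the corresponding standard deviations:

# standard deviations for weekly C-section counts and proportions
n <- 42    # average number of births per week
p <- 0.09  # probability that a birth is a C-section
sqrt(n * p * (1 - p))  # SD of the count, approximately 1.85
sqrt(p * (1 - p) / n)  # SD of the proportion, approximately 0.044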

# plot binomial probability
x <- 0:12
plot(x, dbinom(x, size = 42, prob = 0.09), 
     type = 'h', 
     lwd  = 2,
     ylab = 'Probability',
     xlab = 'Number of C-sections')

Figure B.8: Binomial probability plot of the number of C-sections in a week (size = 42, prob = 0.09).

As with the Poisson distribution, the binomial distribution is bounded at zero. However, unlike the Poisson, its upper limit is fixed and equal to the sample size, since it models a finite number of opportunities. The shape of the binomial distribution is perfectly symmetric when the success probability is close to 50% and grows increasingly skewed as the probability approaches 0% (right skew) or 100% (left skew). The degree of skewness also depends on the sample size – the bigger the sample, the less skew and the more symmetry.

These considerations are important when designing a sampling plan. To quickly gather enough data points for a control chart, smaller samples are preferable; however, to maintain symmetry – which is important for the reliability of the P-chart – larger samples are needed.

B.4.3 Gaussian distribution – predicting the probability of continuous outcomes

The Gaussian distribution, named after the German mathematician Carl Gauss and often referred to as the normal distribution – though there is nothing inherently “normal” about it – is a continuous probability distribution widely used to model many real-world phenomena.

Figure B.9 shows a histogram of birth weights, where each bin represents the number of data points that fall within a specific range.

hist(births$weight, breaks = 20)

Figure B.9: Histogram of birth weights.

Now, imagine gradually making the bins narrower and more numerous. As they become smaller, the tops of the bars would begin to blend together, eventually forming a smooth, continuous curve.

In Figure B.10, we have overlaid a curve representing the theoretical Gaussian distribution, using the empirical mean and standard deviation of the birth weights.

avg_w <- mean(births$weight)
std_w <- sd(births$weight)
hist(births$weight, breaks = 20, freq = FALSE)
curve(dnorm(x, 
            mean = avg_w,
            sd   = std_w),
      add = TRUE)

Figure B.10: Histogram of birth weights with overlaid Gaussian density curve.

Note that the y-axis now represents density rather than frequency. In a histogram, density reflects the likelihood of values falling within a bin, with the total area summing to 1. The density curve, however, doesn’t assign probabilities to individual points – it represents the overall shape of the distribution, and probabilities are found by calculating the area under the curve over an interval. The total area under the curve – which, in theory, extends from minus infinity to plus infinity – also equals 1.

The Gaussian distribution has an associated density function in R, dnorm(), which returns the height of the density curve at a given value. For example, dnorm(3500, mean = avg_w, sd = std_w) returns 0.000912. As mentioned, this value is not directly meaningful on its own – it represents the curve’s height at a point, not a probability.

To find the probability of a value being less than or equal to a given threshold, we use pnorm(), which returns the area under the curve up to that value. For instance, pnorm(3500, mean = avg_w, sd = std_w) = 0.5157967 indicates that approximately 51.6% of birth weights are expected to be below 3500 grams.

Using pnorm(), we can calculate the probability of outcomes within any interval. To find the probability of data falling within ±3 SD from the mean, we calculate the area under the curve like this:

pnorm(avg_w + 3 * std_w,    # area below mean + 3 SD
      mean = avg_w,
      sd   = std_w) - 
  pnorm(avg_w - 3 * std_w,  # area below mean - 3 SD
        mean = avg_w, 
        sd   = std_w)
## [1] 0.9973002

B.5 Basic statistical concepts in summary

In this chapter, we have explored several fundamental concepts essential for understanding data in general, and for applying SPC methods in particular.

We began by examining the two primary types of data: categorical and numerical. Understanding these distinctions is crucial, as they determine the appropriate methods for summarising and analysing information. We then discussed how to generate meaningful summaries of data using both numerical measures (such as the mean, median, interquartile range, and standard deviation) and graphical techniques (including bar charts, histograms and boxplots).

Before undertaking any detailed data analysis, it is good practice to create initial plots that illustrate key characteristics of the dataset – specifically its centre, shape and spread. These visualisations can reveal important features such as skewness, the presence of outliers, or potential data quality issues, all of which may influence subsequent analysis and interpretation.