D Notes on R
This is not an R tutorial. Rather, this appendix aims to explain and clarify some of the tools and techniques we use in this book to import and prepare data for plotting using only base R functions. We are well aware of the modern approaches to data manipulation offered by the tidyverse and data.table packages, and we use these tools ourselves every day. But we find it valuable to be able to handle data in base R. Especially when sharing code or building R packages, it is generally a good strategy to avoid unnecessary dependencies on external packages.
First, we will look at some data structures and types that are essential to this book. Next, we will discuss some principles for importing data from text files into data frames and how to manipulate these to make data ready for plotting. Finally, we give some useful tips and tricks.
D.1 Data structures and classes
The simplest data structure in R is the vector: zero or more data elements of the same type. Data come in several basic types (classes). We are mainly concerned with logical, numeric and character values. Date and datetime values, which are especially important in SPC, are basically just numeric values representing the number of days (dates) or seconds (datetimes) since the beginning of the year 1970.
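As a small illustration of the numeric nature of dates and datetimes, the following sketch converts a date and a datetime to their underlying numbers. The object names (day, moment, span) and the choice of the UTC time zone are ours, chosen for the example only.

```r
# A date is stored as the number of days since 1970-01-01.
day <- as.Date("1970-01-10")
class(day)          # "Date"
as.numeric(day)     # 9

# A datetime (POSIXct) is stored as the number of seconds since
# 1970-01-01 00:00:00. We fix the time zone to UTC (our assumption)
# so the result does not depend on local settings.
moment <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(moment)  # 60

# Because dates are numbers, arithmetic works as expected.
span <- as.Date("2025-03-09") - as.Date("2025-02-26")
as.numeric(span)    # 11 (days)
```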
The data frame is (for our purpose) probably the most important data structure in R. Data frames are collections of vectors that may be of different types but are all of the same length. Think matrix or table with rows and columns where each row represents an observation, each column represents a variable, and each cell represents a data value. All rows and all columns have the same length, and all cells have values (or NA for missing values).
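A minimal sketch of a data frame built from three vectors of different types may look like this. The variable names and values are made up for illustration.

```r
# Three vectors of equal length but different types (illustrative values).
dta <- data.frame(
  x  = as.Date(c("2025-02-26", "2025-02-27", "2025-02-28")),  # Date
  y  = c(13L, 14L, NA),                                       # integer with a missing value
  ok = c(TRUE, FALSE, TRUE)                                   # logical
)

nrow(dta)           # number of observations (rows)
ncol(dta)           # number of variables (columns)
sapply(dta, class)  # class of each column
dta$y               # extract one column as a vector
dta[2, ]            # extract one row as a one-row data frame
```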
D.2 Plot-ready data frames
When plotting time series data, including SPC data, R usually expects two vectors: one for the x axis representing the subgroups, which in its simplest form may be a sequence of numbers or dates, and one for the y axis representing the indicator values to be plotted. To correctly calculate the control limits in SPC charts, we often need a third variable representing the denominator of the count data in the y variable. Thus, a generic plot-ready data set for making an SPC chart may look like this:
# A tibble: 12 × 3
   x              y     n
   <date>     <int> <int>
 1 2025-02-26    13    32
 2 2025-02-27    14    29
 3 2025-02-28    16    33
 4 2025-03-01    19    30
 5 2025-03-02    13    32
 6 2025-03-03    18    35
 7 2025-03-04    19    29
 8 2025-03-05    16    33
 9 2025-03-06    16    34
10 2025-03-07    11    27
11 2025-03-08    13    32
12 2025-03-09    12    26
To plot a P chart from these data with the qic() function from the qicharts2 package, we may do this, where dta is the name of the data frame containing the three variables x, y, and n:
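A sketch of such a call, assuming the qicharts2 package is installed, could look as follows. The data frame is rebuilt here from the values printed above so that the example is self-contained; the guard around the plotting call is our addition and simply skips the plot when the package is not available.

```r
# Rebuild the example data from the values shown above.
dta <- data.frame(
  x = seq(as.Date("2025-02-26"), by = "day", length.out = 12),
  y = c(13L, 14L, 16L, 19L, 13L, 18L, 19L, 16L, 16L, 11L, 13L, 12L),
  n = c(32L, 29L, 33L, 30L, 32L, 35L, 29L, 33L, 34L, 27L, 32L, 26L)
)

# Plot a P chart of y / n by x (only if qicharts2 is installed).
if (requireNamespace("qicharts2", quietly = TRUE)) {
  p <- qicharts2::qic(x, y, n, data = dta, chart = "p")
  print(p)
}
```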
To correctly plot dates or datetimes on the x axis, it is important that the x variable is of the correct class (Date or POSIXct).
D.3 Importing data from text files
When importing data into R from text files using one of the base R read.* functions, data are returned as data frames.
For this book, we provide all data sets as comma separated values (csv) in text files that can be read using the read.csv() function. Each data file begins with a number of commented lines that explain the content and the variables in the data. For example, the first 18 lines of the “bacteremia.csv” file look like this:
# Bacteremia
#
# Hospital acquired and all cause bacteremias and 30 days mortality
#
# Variables:
# month (date): month of infection
# ha_infections (numeric): number of hospital acquired infections
# risk_days (numeric): number of patient days without infection
# deaths (numeric): 30-day mortality after all-cause infection
# patients (numeric): number of patients with all-cause infection
month,ha_infections,risk_days,deaths,patients
2017-01-01,24,32421,23,100
2017-02-01,29,29349,22,105
2017-03-01,26,32981,13,99
2017-04-01,16,29588,14,85
2017-05-01,28,30856,17,98
2017-06-01,16,30544,15,85
...
Lines beginning with a hash symbol (#) represent comments. The first non-blank line after the comments holds the variable names, and the rest of the file contains the data values separated by commas (,).
Note how the dates in the first column are formatted using the only unambiguous way of writing dates: year-month-day (yyyy-mm-dd). We highly recommend always storing dates in this format, which is also the international standard for writing dates (ISO 8601). ISO dates, when used in file names, sort correctly and have the added advantage of being easily recognised as dates by R (and other statistical software).
If dates are stored in any other format (e.g. dd-mm-yyyy), we may need to import them as character values and later convert them to dates using the as.Date() function.
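For example, dates written as dd-mm-yyyy can be converted by telling as.Date() how to read them through its format argument. The character vector below is made up for illustration.

```r
# Dates imported as character values in dd-mm-yyyy format (illustrative).
raw <- c("01-02-2017", "15-03-2017", "31-12-2017")

# %d = day, %m = month, %Y = four-digit year.
dates <- as.Date(raw, format = "%d-%m-%Y")

class(dates)   # "Date"
format(dates)  # "2017-02-01" "2017-03-15" "2017-12-31"
```

Without the format argument, as.Date() assumes the ISO format and may misread or reject dates written in other formats, so it is safest to always state the format explicitly.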
When reading data in R programmes that are used in production systems, we recommend that you specify the data type (class) of each column using the colClasses argument.
# read data from file and assign to variable named d
d <- read.csv('data/bacteremia.csv',
              comment.char = '#',
              colClasses   = c(month         = 'Date',
                               ha_infections = 'integer',
                               risk_days     = 'integer',
                               deaths        = 'integer',
                               patients      = 'integer'))
# print the first six lines of data
head(d)
       month ha_infections risk_days deaths patients
1 2017-01-01            24     32421     23      100
2 2017-02-01            29     29349     22      105
3 2017-03-01            26     32981     13       99
4 2017-04-01            16     29588     14       85
5 2017-05-01            28     30856     17       98
6 2017-06-01            16     30544     15       85
# print the structure of data
str(d)
'data.frame': 24 obs. of 5 variables:
 $ month        : Date, format: "2017-01-01" "2017-02-01" ...
 $ ha_infections: int 24 29 26 16 28 16 14 18 27 30 ...
 $ risk_days    : int 32421 29349 32981 29588 30856 30544 26482 27637 30495 30600 ...
 $ deaths       : int 23 22 13 14 17 15 15 25 21 24 ...
 $ patients     : int 100 105 99 85 98 85 89 99 103 86 ...
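To try the import step without access to the data file, the text argument of read.csv() can be used to read from a character string instead of a file. The sketch below uses an abbreviated copy of the first lines of the bacteremia file; the object names txt and bact are ours.

```r
# An abbreviated copy of the file content, including comment lines.
txt <- "# Bacteremia
#
# Variables:
# month (date): month of infection
month,ha_infections,risk_days,deaths,patients
2017-01-01,24,32421,23,100
2017-02-01,29,29349,22,105
2017-03-01,26,32981,13,99"

# Read from the string, skip comment lines, and fix the column classes.
bact <- read.csv(text = txt,
                 comment.char = "#",
                 colClasses   = c(month         = "Date",
                                  ha_infections = "integer",
                                  risk_days     = "integer",
                                  deaths        = "integer",
                                  patients      = "integer"))

str(bact)
```

Specifying colClasses makes the import fail loudly if a column contains values that do not fit the declared class, which is usually preferable to a silent change of type in production code.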
D.4 Manipulating data frames
We will import the c. section data set to demonstrate adding variables and aggregating data.
d <- read.csv('data/csection_delay.csv',
              comment.char = '#',
              colClasses   = c(datetime = 'POSIXct',
                               month    = 'Date',
                               delay    = 'integer'))
head(d)
             datetime      month delay
1 2016-01-06 03:55:40 2016-01-01    22
2 2016-01-06 20:52:34 2016-01-01    22
3 2016-01-07 02:50:43 2016-01-01    29
4 2016-01-07 22:32:27 2016-01-01    28
5 2016-01-09 14:56:09 2016-01-01    22
6 2016-01-09 21:21:24 2016-01-01    20
The c. section data contains 208 rows representing individual c. sections. Our aim for this exercise is to reduce data to a plot-ready data frame with one row per month holding the number of c. sections that were on target and the total number of c. sections.
The delay variable is the number of minutes from the decision to perform a c. section to delivery of the baby. The standard target value for grade 2 c. sections is less than 30 minutes. If we want to plot the proportion of c. sections that are on time (i.e. less than 30 min.), we first need to dichotomise the delay variable into a logical variable, ontime, that is TRUE when delay is less than 30. We add this new variable to the data frame using the $-notation.
D.4.1 Adding variables to data frames
             datetime      month delay ontime
1 2016-01-06 03:55:40 2016-01-01    22   TRUE
2 2016-01-06 20:52:34 2016-01-01    22   TRUE
3 2016-01-07 02:50:43 2016-01-01    29   TRUE
4 2016-01-07 22:32:27 2016-01-01    28   TRUE
5 2016-01-09 14:56:09 2016-01-01    22   TRUE
6 2016-01-09 21:21:24 2016-01-01    20   TRUE
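The dichotomisation described above can be sketched on the six rows printed here. The data frame cs is rebuilt by hand for the example, and the UTC time zone is our assumption; with the full data set, the same assignment would be made on d.

```r
# The first six c. sections, typed in from the printed output.
cs <- data.frame(
  datetime = as.POSIXct(c("2016-01-06 03:55:40", "2016-01-06 20:52:34",
                          "2016-01-07 02:50:43", "2016-01-07 22:32:27",
                          "2016-01-09 14:56:09", "2016-01-09 21:21:24"),
                        tz = "UTC"),
  month    = as.Date(rep("2016-01-01", 6)),
  delay    = c(22L, 22L, 29L, 28L, 22L, 20L)
)

# Add a logical variable that is TRUE when delay is less than 30 minutes.
cs$ontime <- cs$delay < 30

head(cs)
```

Because logical values count as 0 (FALSE) and 1 (TRUE) in arithmetic, sum(cs$ontime) gives the number of c. sections on time and mean(cs$ontime) gives the proportion.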
D.4.2 Aggregating data frames
Then we aggregate (or summarise) data to one row per month. For this, we use the split-apply-combine strategy by first splitting the data frame into a list of data frames, one per month. Next, we apply the same summary function to all elements (months) of this list. Finally, we combine the groups back into a data frame again.
d2 <- split(d, d$month)                  # Split data frame by month.
d2 <- lapply(d2, function(i) {           # Apply summaries to each group.
  data.frame(month    = i$month[1],
             n        = nrow(i),
             n_ontime = sum(i$ontime))
})
d2 <- do.call(rbind, d2)                 # Combine groups into data frame.
rownames(d2) <- NULL                     # Remove row names (cosmetic only).
str(d2)
'data.frame': 24 obs. of 3 variables:
 $ month   : Date, format: "2016-01-01" "2016-02-01" ...
 $ n       : int 7 11 11 9 8 5 7 12 7 12 ...
 $ n_ontime: int 7 9 10 9 8 4 5 12 6 11 ...
head(d2)
       month  n n_ontime
1 2016-01-01  7        7
2 2016-02-01 11        9
3 2016-03-01 11       10
4 2016-04-01  9        9
5 2016-05-01  8        8
6 2016-06-01  5        4
The three functions split(), lapply(), and do.call() may be unfamiliar to many R users, but it pays to get to know them – RTD (read the documentation).
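To see what each of the three functions contributes, it may help to run them one at a time on a tiny data set and inspect the intermediate results. The data frame toy and its values are made up for this sketch.

```r
# A tiny, made-up data set: five c. sections in two months.
toy <- data.frame(
  month  = as.Date(c("2016-01-01", "2016-01-01",
                     "2016-02-01", "2016-02-01", "2016-02-01")),
  ontime = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)

# split() returns a named list with one data frame per month.
groups <- split(toy, toy$month)
names(groups)

# lapply() calls the function once per list element and
# returns a list of the results (here: one-row data frames).
sums <- lapply(groups, function(g) {
  data.frame(month    = g$month[1],
             n        = nrow(g),
             n_ontime = sum(g$ontime))
})

# do.call() calls rbind() with the list elements as arguments,
# i.e. the same as rbind(sums[[1]], sums[[2]]).
res <- do.call(rbind, sums)
rownames(res) <- NULL
res
```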
The split-apply-combine strategy may be performed a lot more easily with functions from the tidyverse or data.table packages. But, as mentioned, we find it useful to know the base R ways of doing things, not least to maintain independence from external packages.