How to make an R script robust

count: false
class: left, bottom

# How to make an R script robust
 
### Matthew Suderman
#### Lecturer in Epigenetic Epidemiology

<img src="IEU-logo-colour.png" width="50%">
---
layout: true

.logo[.mrcieu[
MRC Integrative Epidemiology Unit    
]]

---

## Robust?

.center[
.iii[
"tailored for large puppies with a robust physique"

<img src="dog.jpg" width="75%">
]
.iii[
"human male Asian robust skull displays extreme male traits"

<img src="skull.jpg" width="100%">
]
.iii[
"strong, deep maple flavoured syrup is a firm favourite"

<img src="syrup.jpg" width="50%">
]
]

---

## Robust scripts

“robustness is the ability of a computer system to cope with errors during execution and cope with erroneous input.” https://en.wikipedia.org/wiki/Robustness_(computer_science)

---

## Challenges

* Versions of R
.right[
<img src="r-versions-by-date.png" width="150%">
]

* Versions of R packages

* Operating systems

* Changing requirements

* Code Reuse

---

## Solution: comments

.ii[
Commenting in R is very simple. 
```{r}
x <- 3 # all text after the 
       # hash symbol is not code
```

Deciding *what* to comment is more tricky.

[Roxygen2](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html) provides a useful framework for documenting functions
that could be applied to scripts as well.
]
--
.ii[
```{r}
#' My very cool scatterplot
#'
#' @param x x-coordinates of points 
#'          (numeric vector)
#' @param y y-coordinates of points 
#'          (numeric vector)
#' @param line whether to plot the regression 
#'             line (logical) (Default: T)
#' @return Linear model fit for y~x.
#' @examples
#' x <- c(1,2,3,3)
#' y <- c(2,2,4,5)
#' fit <- myscatterplot(x,y,line=F)
#'
myscatterplot <- function(x,y,line=T) {
  plot(x,y,pch=19)
  fit <- lm(y~x)
  if (line)
    abline(fit, col="red", lty="dashed")
  return(fit)
}
```
]

---

## Solution: keeping it short

* Break up long lines

```{r}
    dat.avg <- sapply(by(dat, as.factor(probes$symbol), colMeans), identity)
    ```
--

vs
    ```{r}
    symbols <- as.factor(probes$symbol)
    dat.list <- by(dat, symbols, colMeans)
    dat.avg <- sapply(dat.list, identity)
    ```
--

* Make functions/scripts short

.box[
  A useful **rule**: the entire script or function
  should fit on one screen.

Break up long bits of code into functions.
]
---

## Solution: assertions

.ii[
It is useful to test assumptions about user input
and the values of variables throughout the script.

```{r}
## ... lots of code

if (!is.numeric(x))
  stop("'x' is not numeric")
if (x < 50)
  stop("'x' is too small")

## ... lots of code
```

R provides a shorthand for this
using the `stopifnot()` function.
```{r}
stopifnot(is.numeric(x) && x >= 50)
```
]

--
.gap[ ]
.ii[
Assertions have 3 benefits:

1. They **catch** errors before they generate mysterious outputs.

2. They force the script writer to **think** more concretely
    about the values the variables could take.

3. They **document** the script.
]

---

## Solution: modular development

Split up code as much as possible into 
functions and possibly even packages.

.ii[
The following code simulates 
some data and generates two plots.

```{r}
## simulation 1
x1 <- rnorm(100)
y1 <- x1 + rnorm(length(x1))
plot(x1, y1, pch=19)
abline(lm(y1~x1), 
       col="red", 
       lty="dashed")
##
## simulation 2
x2 <- rnorm(100)
y2 <- x1 + x2 + rnorm(length(x1))
plot(x2, y2, pch=19)
abline(lm(y2~x2), 
       col="red", 
       lty="dashed")
```
]

.ii[
The following version is easier to **read**, **modify** and **reuse**.

```{r}
## data simulation
x1 <- rnorm(100)
y1 <- x1 + rnorm(length(x1))
x2 <- rnorm(100)
y2 <- x1 + x2 + rnorm(length(x1))

## scatterplots 
myscatter <- function(x,y) {
  plot(x1, y1, pch=19)
  fit <- lm(y~x)
  abline(fit, 
         col="red", lty="dashed")
  return(fit)
}
myscatter(x1,y1)
myscatter(x2,y2)
```

]

---

## Solution: exceptions

.ii[
The following script will stop when attempting to calculate `r`
just before printing the message at the end.
```{r}
x <- c(1,2,3)
y <- NA

## ... lots of code

r <- cor(x,y)
cat("The correlation is", r, "\n")
```
]

.gap[ ]
.ii[
The following script uses the `tryCatch` function 
to "catch" the error and set `r` to `NaN` ("Not A Number").
As a result, this script will run all the way to the end.
```{r}
x <- c(1,2,3)
y <- NA

## ... lots of code

r <- tryCatch(cor(x,y), 
              error=function(e) NaN) 
cat("The correlation is", r, "\n")
```
]

---

## Solution: debugging

*Debugging* is identifying errors in the code and fixing them.

If your script or R command has been stopped by an error, 
the simplest thing to do is to type `trackeback()`. 
```{r}
traceback()
```
--

.box[
Traceback tells you the specific line of code **where** the error was generated.

This may be helpful if the error occured 
when your script was running a function
from an R package.
]

---

## Solution: debugging, cont

If this doesn't help, 
you can insert print statements just before the error
to check variable values.

For example, 
```{r}
## ... lots of code 
print("The value of x:")
print(x)
print("the value of y:")
print(y)
## the script stops here 
## ... lots of code
```

---

## Solution: debugging, cont

A more advanced alternative is to use the `browser` function.
```{r}
## ... lots of code 
browser()
## the script stops here 
## ... lots of code
```

.box[
When a script reaches a call to `browser()`, it:

1. pauses the script

2. allows you to run R commands to look at the values of variables

3. allows you to run the rest of the script, pausing at each line
]