How to enter the tidyverse of madness

count: false
class: left, bottom

# How to enter the tidyverse of madness
 
### Matthew Suderman
#### Senior Lecturer in Epigenetic Epidemiology

<img src="IEU-logo-colour.png" width="50%">
---
layout: true

.logo[.mrcieu[
MRC Integrative Epidemiology Unit    
]]

---

## What's wrong with R?

R is old (circa 1993).

It was originally designed for doing **statistics** with relatively small, simple datasets.

Since then, it has become one of the most popular tools for 
the new discipline called **[data science](https://en.wikipedia.org/wiki/Data_science)**.

.quote["Data science is an interdisciplinary field that uses 
scientific methods, processes, algorithms and systems 
to extract knowledge and insights from 
noisy, structured and unstructured data, 
and apply knowledge and actionable insights
from data across a broad range of application domains."
https://en.wikipedia.org/wiki/Data_science]

---

## Shortcomings of R for data science

1. Lacks **consistency**

2. Often **suprising**

3. Tends to be **slow** and **memory-intensive**

4. Built around **vectors and matrices**

5. Functional language can be **awkward** to use

---

## 1. Lacks **consistency**

.ii[
**function naming**

```
names, colnames
row.names, rownames
rowSums, rowsum
rowMeans, (no parallel rowmean exists)
browseURL, contrib.url, fixup.package.URLs
package.contents, packageStatus
getMethod, getS3method
read.csv and write.csv, load and save, readRDS and saveRDS
Sys.time, system.time
```
[https://r4stats.com/articles/why-r-is-hard-to-learn/]

**multiple ways to do things**

e.g. two different object-oriented systems: S3 and S4 systems
]

.ii[
**missing values**

```r
> x <- c(1,2,3,4,NA)
> y <- c(1.1,2.2,3.3,4.4,5.5)

## missing values not okay
> quantile(x)
Error in quantile.default(x) :
  missing values not allowed if 'na.rm' is FALSE

## missing values okay
> fit <- lm(y ~ x)
> length(x)
[1] 5
> length(residuals(fit))
[1] 4
```
]

---

## 2. Often **surprising**

R often converts between data types behind the scenes, 
e.g. characters to factors, matrices to vectors, numeric to characters

```r
> x <- matrix(1:12,ncol=3)
> x
 [,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
```
--
```r
> class(x[,1:2])
[1] "matrix" "array"
```
--
```r
> class(x[,3]) ## what is this?
```
--
```r
[1] "integer"
```

---

```r
> x <- matrix(1:12,ncol=3)
> x
 [,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
```
--
```r
> x[1,1]
1
```
--
```r
> x[1,3] <- "a"
```
--
```r
> x[1,1] + 1 ## what is this equal to?
```
--
```r
Error in x[1, 1] + 1 : non-numeric argument to binary operator
```
--
```r
> class(x[1,1])
[1] "character"
```
--
```r
> x[1,1]
[1] "1"
```

---

## 3. Tends to be **slow** and **memory-intensive**

R was designed for small datasets, so it tends to do things

- the *simple* way

rather than

- the *efficient* way.

.box[
e.g. it is designed to **slurp** data

Slurping means loading entire datasets into memory
before analysing them.

This is not efficient when the analysis really only needs to see 
a table row-by-row.

*But*, this it simpler to just load (slurp) the entire dataset in one simple step.
]

---

## 4. Built around **vectors and matrices**

... whereas data science analyses **data frames** like this:

| participant|      age|sex |arm  |
|-----------:|--------:|:---|:----|
|           1| 23.58224|F   |arm3 |
|           2| 24.69102|M   |arm4 |
|           3| 25.42567|F   |arm2 |
|           4| 24.96019|F   |arm2 |
|           5| 23.35406|M   |arm4 |

Each column has its own data type.

.box[
Although R implements data frames, 
most functions are designed for vectors and matrices.
]

---

## 5. Functional language that can be **awkward** to use

e.g. even if you know what all the functions do, this expression is 
difficult to understand!

```r
summarize(group_by(left_join(filter(visits, wave=="wave2"),
 participants, by="participant"),arm,sex),n=n(), 
 age=mean(age), bmi=mean(bmi), blink=mean(blink))
```

---

## tidyverse

.ii[
[Tidyverse](https://www.tidyverse.org/) is a collection of R packages
that aim to make the analysis of large datasets more convenient.

<img src="tidyverse.png" width="75%">
]
.gap[ ]
--
.ii[
Here is intended workflow in the tidyverse:

- Data is **import**ed and made **tidy** so it can be analysed
- Analysis involves **transform**ing the data 
- Relationships in the data are **visualised** 
- **Models** are applied to test relationships
- Finally, conclusions are **communicate**d
]
---

## tidyverse book

https://r4ds.had.co.nz/ 
<img src="tidyverse-book.png" width="100%">

---

## Using tidyverse packages

Installing all tidyverse packages.
```r
install.packages("tidyverse")
```

Loading all tidyverse packages.
```r
library(tidyverse)
```

---

## Our dataset

.ii[
In this session, we will explore the tidyverse by performing a basic data analysis
of a simulated dataset from a randomized control trial.

<img src="blinking-guy.jpg" width="90%">
]
.gap[ ]
--
.ii[
**Background** (Pretend) studies have shown that blink rate is positively associated with disengaged attention.
A drug has been developed that is known to reduce blink rate.

**Aim** Determine if the drug will reduce attention deficits by reducing blink rate.

**Methods**

The were 4 arms:

controls (arm 1), low dose (arm 2), medium (arm 3), high (arm 4)

Each arm had 100 randomly selected individuals.

Data was collected in 3 waves:

baseline (wave 0), 6 months (wave 1), 12 months (wave 2)
]
---

## Dataset schema

---

## Import the dataset

The dataset is available as CSV files.

These can be imported into R using tidyverse function `readr::read_csv()`.

```r
data.dir <- "https://raw.githubusercontent.com/perishky/perishky.github.io/master/r/enter-the-tidyverse"

arms <- read_csv(file.path(data.dir, "arms.csv"))
participants <- read_csv(file.path(data.dir, "participants.csv"))
visits <- read_csv(file.path(data.dir, "visits.csv"))
```

```r
> arms <- read_csv("arms.csv")
Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): arm
dbl (1): dose

> arms
# A tibble: 4 × 2
 arm dose
 <chr> <dbl>
1 arm3 2
2 arm4 4
3 arm2 1
4 arm1 0
```

---

## Why use `read_csv` rather than `read.csv`?

**Speed** `read_csv` is typically about 10x faster.  If this isn't fast enough, `data.table::fread` is even faster!

**Better communication** `read_csv` provides more information about what it has imported into R and clearer error messages if the import failed via the `problems()` function.

**Tidy format** `read_csv` creates tibbles rather than data frames.
(we'll discuss tibbles vs data frames on the next slide)

```r
> arms
# A tibble: 4 × 2
 arm dose
 <chr> <dbl>
1 arm3 2
2 arm4 4
3 arm2 1
4 arm1 0
```

---

## Tidy

Tidy data is stored in tables called 'tibbles'.

1. "Each variable must have its own column."
2. "Each observation must have its own row."
3. "Each value must have its own cell."

> *Why use a tibble rather than a data frame?* (after all, tibbles are  data frames)
 
--

> **Predictability** `tibble` never changes the types of inputs
> e.g. character strings are never converted to factors. It never changes column names. It never adds row names.

> **Convenience** Whereas `tibble` can create data frames column-by-column, there is a `tribble` function to create data frames row-by-row.

> **Pretty printing** Tibbles are nicer to look at!
> Typing the name of a data frame in R and pressing enter will print the entire data frame to the screen, regardless of its length or width! Tibbles will only show the first few rows and columns along with the dimensions of the data frame and data type of each column.

---

## Creating tibbles

Tibbles can be created by column (just like a data frame)
```{r,eval=F}
tibble(
  x=c("a","b"),
  y=c(3,1),
  z=c(pi,2*pi)
)
```
--
or by row like they'd be written in a data file
```r
tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 3, pi,
  "b", 1,2*pi
)
```

---

## Answering questions

How many participants are in each study arm?

.ii[
We'll use the `participants` tibble.

```r
> participants
# A tibble: 400 × 4
 participant age sex arm
 <dbl> <dbl> <chr> <chr>
 1 1 23.6 F arm3
 2 2 24.7 M arm4
 3 3 25.4 F arm2
 4 4 25.0 F arm2
 5 5 23.4 M arm4
 6 6 25.9 M arm2
 7 7 26.5 M arm4
 8 8 26.6 F arm1
 9 9 25.3 F arm2
10 10 25.3 F arm3
# … with 390 more rows
```
]

.gap[ ]

.ii[
We'll use the `count()` function to count participants in each arm.

```r
> count(participants, arm)
# A tibble: 4 × 2
 arm n
 <chr> <int>
1 arm1 100
2 arm2 100
3 arm3 100
4 arm4 100
```

`count()` is very similar to the `table()` function in base R.

]

---

## Challenge!

See if you can count the number of female participants in each arm.

```r
> count(participants, arm, sex)
# A tibble: 8 × 3
 arm sex n
 <chr> <chr> <int>
1 arm1 F 48
2 arm1 M 52
3 arm2 F 43
4 arm2 M 57
5 arm3 F 50
6 arm3 M 50
7 arm4 F 46
8 arm4 M 54
```

---

## Filtering

That last output gives more information than we wanted. 
To focus on just females, 
we'll need to filter out information about males.

To do this, we use the `filter()` function.

```r
> counts <- count(participants, arm, sex)
> filter(counts, sex == "F")
# A tibble: 4 × 3
 arm sex n
 <chr> <chr> <int>
1 arm1 F 48
2 arm2 F 43
3 arm3 F 50
4 arm4 F 46
```

---

## Challenge!

See if you can obtain the same result by

1. first applying the `filter` function to `participants` to remove males

2. and then applying the `count` function to the result.

```r
> females <- filter(participants, sex=="F")
> count(females, arm, sex)
# A tibble: 4 × 3
 arm sex n
 <chr> <chr> <int>
1 arm1 F 48
2 arm2 F 43
3 arm3 F 50
4 arm4 F 46
```

---

## Intermediate variables

For longer queries, it can get a little tedious creating variable names for each step.

In the queries above, we had to create `counts` and `females` even though 
we aren't directly interested in those variables, e.g.

```r
counts <- count(participants, arm, sex)
filter(counts, sex == "F")
```

One solution is called **function composition**. 
```r
filter(count(participants, arm, sex), sex=="F")
```

## Challenge!

See if you can obtain the same result with `count` as the outer function, 
i.e. `count(filter(...), ...)`

```r
count(filter(participants, sex=="F"), arm, sex)
```

---

## Function composition gets extreme

Although function composition avoids having to name intermediate variables, 
they can make the code difficult to read.

For example, the following command summarizes information about each study arm
(we'll discuss the functions used here later).

```r
summarize(group_by(left_join(filter(visits, wave=="wave2"), 
  participants, by="participant"),arm,sex),n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink))
```

--
To be fair, we can split it across lines to make it a little easier to read.
```r
summarize(
  group_by(
   left_join(
     filter(visits, wave=="wave2"), 
     participants, 
     by="participant"), 
    arm, 
    sex), 
  n=n(), 
  age=mean(age), 
  bmi=mean(bmi), 
  blink=mean(blink))
```

---

## Pipes

The "pipe" operator (`%>%`) was invented to solve this problem.

This command ...

```r
filter(count(participants, arm, sex), sex=="F")
```

... can be written like this

```r
participants %>% filter(sex=="F") %>% count(arm, sex)
```

... or even better, split across multiple lines

```r
participants %>%
  filter(sex=="F") %>% 
  count(arm, sex)
```

i.e. 
- we 'pipe' `participants` to `filter`, 
- and then the output of `filter` to `count`

---

## Challenge!

See if you can rewrite the following using pipes:

```r
filter(count(participants,arm,sex), sex=="F")
```

```r
participants %>% 
  count(arm,sex) %>% 
  filter(sex=="F")
```

---

## Challenge!

Now see if you can rewrite this command using pipes:
```r
summarize(
  group_by(
   left_join(
     filter(visits, wave=="wave2"), 
     participants, 
     by="participant"), 
    arm, 
    sex), 
  n=n(), 
  age=mean(age), 
  bmi=mean(bmi), 
  blink=mean(blink))
```

```r
visits %>%
  filter(wave=="wave2") %>%
  left_join(participants, by="participant") %>%
  group_by(arm,sex) %>%
  summarize(n=n(),age=mean(age), bmi=mean(bmi), blink=mean(blink))
```

---

## Calculating data summaries

The `summarise(<tibble>,<summary1>,<summary2>,...)` summarizes the columns of a tibble.

For example, we can use it to calculate
1. the average participant age, and 
2. how many of them are females.

```r
> summarise(participants, age=median(age), nfemales=sum(sex=="F"))
# A tibble: 1 × 2
 age nfemales
 <dbl> <int>
1 25.0 187
```

Here is what it looks like using pipes.
```r
participants %>% 
  summarise(age=median(age), nfemales=sum(sex=="F"))
```

---

## Calculating summaries of subsets

The previous example above provided summaries across *all* rows.

It is also possible to summarize subsets of rows using `group_by(<tibble>,<groupby1>,<groupby2>,...)`.

For example, we can calculate, *for each arm of the study*
1. the average participant age, and 
2. the number of males and females.

```r
> participants %>% 
  group_by(arm,sex) %>%
  summarise(n=n(), age=mean(age))

# A tibble: 8 × 4
# Groups: arm [4]
 arm sex n age
 <chr> <chr> <int> <dbl>
1 arm1 F 48 25.1
2 arm1 M 52 25.1
3 arm2 F 43 25.1
4 arm2 M 57 24.8
5 arm3 F 50 25.1
6 arm3 M 50 24.8
7 arm4 F 46 24.9
8 arm4 M 54 25.1
```

.popup[.box[
Did you notice some **magic**?

1. `summarise()` magically knows when a tibble has come from `group_by()` 
  and that it should summarize by the resulting groups.

2. `n()` function magically knows about the tibble subsets being processed by `summarise`.
]]

---

## Challenge!

Can you think of an alternative to the mysterious `n()` function?  (*this is a tricky question*)

```r
participants %>% 
  group_by(arm,sex) %>%
  summarise(n=n(), age=mean(age))
```

*Hint:* Use the `length` function ...

```r
participants %>% 
  group_by(arm,sex) %>%
  summarise(n=length(age), age=mean(age))
```

---

## Merging datasets

We'd like to augment the summary with measurements from the first visit.

This information is in the `visits` dataset.

To do this, we'll first need to:
1. filter `visits` to get just the first visit (there were 3 visits)
2. merge the result with `participants`
3. group the result by arm and by sex
4. calculate the mean of age, BMI and blink rate

We've seen how to do each of these steps except step 2, merging.

```r
visits %>%
  filter(wave=="wave0") %>%
  ????????? merge with participants ????????? %>%
  group_by(arm,sex) %>%
  summarise(n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink))
```

In base R, we'd use the function `merge()`.

For tibbles, we have four different functions:
`left_join()`, `right_join()`, `inner_join()` and `full_join()`.

---

## `left_join()`

.ii[
To illustrate, we'll use a small version of `visits`.
```r
> visits.s <- visits %>% filter(participant %in% 1:3)
> visits.s
# A tibble: 9 × 8
participant wave month blink bmi hdl trig ldl
 dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 wave0 0 17.6 15.4 1.38 1.52 2.11
2 1 wave1 6.18 16.0 16.5 1.35 1.33 1.88
3 1 wave2 10.4 15.6 14.9 1.39 1.39 2.01
4 2 wave0 0 16.3 19.4 1.64 1.70 4.06
5 2 wave1 5.50 14.5 20.6 1.61 1.31 3.90
6 2 wave2 13.0 11.2 21.8 1.57 1.62 4.04
7 3 wave0 0 18.6 22.0 1.61 1.12 2.11
8 3 wave1 6.58 17.8 20.8 1.75 1.14 1.82
9 3 wave2 13.5 17.3 23.1 1.53 1.11 2.21
```
]

.gap[ ]

.ii[
Recall `participants`.
```r
> participants
# A tibble: 400 × 4
 participant age sex arm
 <dbl> <dbl> <chr> <chr>
 1 1 23.6 F arm3
 2 2 24.7 M arm4
 3 3 25.4 F arm2
 4 4 25.0 F arm2
 5 5 23.4 M arm4
 6 6 25.9 M arm2
 7 7 26.5 M arm4
 8 8 26.6 F arm1
 9 9 25.3 F arm2
10 10 25.3 F arm3
# … with 390 more rows
```
]

---

## `left_join()`

`left_join()` adds data from `participants`
that match the three participants in `visits.s`.

```r
> left_join(visits.s, participants, by="participant")
Joining, by = "participant"
# A tibble: 9 × 11
 participant wave month blink bmi hdl trig ldl age sex arm
 <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 wave0 0 17.6 15.4 1.38 1.52 2.11 23.6 F arm3
2 1 wave1 6.18 16.0 16.5 1.35 1.33 1.88 23.6 F arm3
3 1 wave2 10.4 15.6 14.9 1.39 1.39 2.01 23.6 F arm3
4 2 wave0 0 16.3 19.4 1.64 1.70 4.06 24.7 M arm4
5 2 wave1 5.50 14.5 20.6 1.61 1.31 3.90 24.7 M arm4
6 2 wave2 13.0 11.2 21.8 1.57 1.62 4.04 24.7 M arm4
7 3 wave0 0 18.6 22.0 1.61 1.12 2.11 25.4 F arm2
8 3 wave1 6.58 17.8 20.8 1.75 1.14 1.82 25.4 F arm2
9 3 wave2 13.5 17.3 23.1 1.53 1.11 2.21 25.4 F arm2
```

Notice how the result is just `visits.s` but with some additional columns
from `participants`.

Rows in `participants` that do not match are omitted.

---

## Challenge!

What do you think happens if some participants in `visits` do not match a participant in `participants`?

```r
> parts.s <- participants %>% filter(participant %in% 1:2)
> parts.s
# A tibble: 2 × 4
 participant age sex arm
 <dbl> <dbl> <chr> <chr>
1 1 23.6 F arm3
2 2 24.7 M arm4
```
--

```r
> left_join(visits.s, parts.s, by="participant")
# A tibble: 9 × 11
 participant wave month blink bmi hdl trig ldl age sex arm
 <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 wave0 0 17.6 15.4 1.38 1.52 2.11 23.6 F arm3
2 1 wave1 6.18 16.0 16.5 1.35 1.33 1.88 23.6 F arm3
3 1 wave2 10.4 15.6 14.9 1.39 1.39 2.01 23.6 F arm3
4 2 wave0 0 16.3 19.4 1.64 1.70 4.06 24.7 M arm4
5 2 wave1 5.50 14.5 20.6 1.61 1.31 3.90 24.7 M arm4
6 2 wave2 13.0 11.2 21.8 1.57 1.62 4.04 24.7 M arm4
7 3 wave0 0 18.6 22.0 1.61 1.12 2.11 NA NA NA
8 3 wave1 6.58 17.8 20.8 1.75 1.14 1.82 NA NA NA
9 3 wave2 13.5 17.3 23.1 1.53 1.11 2.21 NA NA NA
```

---

## Putting it altogether

Recall that we want to summarize information from all participants, 
grouped by age and sex.

```r
visits %>%
 filter(wave=="wave0") %>%
 left_join(participants,by="participant") %>% ## merging step
 group_by(arm,sex) %>%
 summarise(n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink))
```
--
```r
# A tibble: 8 × 6
# Groups: arm [4]
 arm sex n age bmi blink
 <chr> <chr> <int> <dbl> <dbl> <dbl>
1 arm1 F 48 25.1 21.4 17.7
2 arm1 M 52 25.1 23.9 17.5
3 arm2 F 43 25.1 21.3 17.6
4 arm2 M 57 24.8 23.4 17.5
5 arm3 F 50 25.1 21.0 17.7
6 arm3 M 50 24.8 23.8 17.6
7 arm4 F 46 24.9 21.4 17.2
8 arm4 M 54 25.1 22.9 17.6
```

---

## Challenge!

The previous output wasn't interesting because we're looking at the data
before any treatment.

Modify the command to see the same data from the final visit.

```r
visits %>%
 filter(wave=="wave2") %>%
 left_join(participants,by="participant") %>%
 group_by(arm,sex) %>%
 summarise(n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink))
```
--
```r
# A tibble: 8 × 6
# Groups: arm [4]
 arm sex n age bmi blink
 <chr> <chr> <int> <dbl> <dbl> <dbl>
1 arm1 F 48 25.1 21.6 17.7
2 arm1 M 52 25.1 24.0 17.5
3 arm2 F 43 25.1 21.6 16.5
4 arm2 M 57 24.8 24.0 16.3
5 arm3 F 50 25.1 22.1 15.3
6 arm3 M 50 24.8 25.1 15.2
7 arm4 F 46 24.9 23.7 12.6
8 arm4 M 54 25.1 25.2 13.0
```

Can you see differences between the study arms now?

---

## Merging multiple tables

To remember the treatment within each study arm, 
we'll add 'dose' to the output found  in the `arms` tibble.

```r
query <- visits %>%
 filter(wave=="wave2") %>%
 left_join(participants,by="participant") %>%
 left_join(arms,by="arm") %>% 
 group_by(arm,sex) %>%
 summarise(dose=dose[1],n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink))
```

```r
> query

# A tibble: 8 × 7
# Groups: arm [4]
 arm sex dose n age bmi blink
 <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
1 arm1 F 0 48 25.1 21.6 17.7
2 arm1 M 0 52 25.1 24.0 17.5
3 arm2 F 1 43 25.1 21.6 16.5
4 arm2 M 1 57 24.8 24.0 16.3
5 arm3 F 2 50 25.1 22.1 15.3
6 arm3 M 2 50 24.8 25.1 15.2
7 arm4 F 4 46 24.9 23.7 12.6
8 arm4 M 4 54 25.1 25.2 13.0
```

---

## Selecting columns

Having both 'arm' and 'dose' columns is a bit redundant, we just want to show 'dose'.

For this, we can use the `select()` function.

```r
> query %>% 
  ungroup() %>%
  select(dose,sex,n,age,bmi,blink)

# A tibble: 8 × 6
 dose sex n age bmi blink
 <dbl> <chr> <int> <dbl> <dbl> <dbl>
1 0 F 48 25.1 21.6 17.7
2 0 M 52 25.1 24.0 17.5
3 1 F 43 25.1 21.6 16.5
4 1 M 57 24.8 24.0 16.3
5 2 F 50 25.1 22.1 15.3
6 2 M 50 24.8 25.1 15.2
7 4 F 46 24.9 23.7 12.6
8 4 M 54 25.1 25.2 13.0
```

The `ungroup` command is necessary to turn 'off' grouping.

---

## Omitting columns

It's a bit tedious to type all the column names when we just want to remove one.
There is a way to do this.

```r
query <- query %>%
 ungroup() %>%
 select(-arm)
```

```r
> query
# A tibble: 8 × 6
 sex dose n age bmi blink
 <chr> <dbl> <int> <dbl> <dbl> <dbl>
1 F 0 48 25.1 21.6 17.7
2 M 0 52 25.1 24.0 17.5
3 F 1 43 25.1 21.6 16.5
4 M 1 57 24.8 24.0 16.3
5 F 2 50 25.1 22.1 15.3
6 M 2 50 24.8 25.1 15.2
7 F 4 46 24.9 23.7 12.6
8 M 4 54 25.1 25.2 13.0
```

---

## Reordering rows

With male rows next to female rows, it is difficult to see if there were any 
treatment effects on BMI.

We'll reorder/arrange the rows by sex and then by dose.

.ii[
Before reordering ...

```r
> query

# A tibble: 8 × 6
 sex dose n age bmi blink
 <chr> <dbl> <int> <dbl> <dbl> <dbl>
1 F 0 48 25.1 21.6 17.7
2 M 0 52 25.1 24.0 17.5
3 F 1 43 25.1 21.6 16.5
4 M 1 57 24.8 24.0 16.3
5 F 2 50 25.1 22.1 15.3
6 M 2 50 24.8 25.1 15.2
7 F 4 46 24.9 23.7 12.6
8 M 4 54 25.1 25.2 13.0
```
]

.gap[ ]

.ii[
After reordering ...

```r
> query %>%
  arrange(sex,dose)

# A tibble: 8 × 6
 sex dose n age bmi blink
 <chr> <dbl> <int> <dbl> <dbl> <dbl>
1 F 0 48 25.1 21.6 17.7
2 F 1 43 25.1 21.6 16.5
3 F 2 50 25.1 22.1 15.3
4 F 4 46 24.9 23.7 12.6
5 M 0 52 25.1 24.0 17.5
6 M 1 57 24.8 24.0 16.3
7 M 2 50 24.8 25.1 15.2
8 M 4 54 25.1 25.2 13.0
```

Now we can more easily see that BMI appears to be increasing with dose.
]
---

## Running statistical tests

We'd like to determine statistically if treatment actually increases BMI.

We can do this by fitting the following linear model:

.box[BMIwave 2 ~ BMIwave 0 + age + sex + dose]

For this, we'll need to prepare a tibble like this:

|bmi.wave0|bmi.wave2|age|sex|dose|
|---:|-----:|:----|---:|----:|
| 5.4|  14.9| 23.5|F   |    2|
|19.3|  21.8| 24.7|M   |    4|
|22.0|  23.1| 25.4|F   |    1|
|20.6|  19.7| 24.9|F   |    1|
|26.1|  28.6| 23.4|M   |    4|
| ...|      |     |    |     |

---

## Running statistical tests

Here are the steps:

1. Create a wave 0 subset of 'visits'

2. Create a wave 2 subset of 'visits'

3. Merge these two so we have a data frame with bmi at wave 0 and wave 2

4. Merge this with 'participants' and 'arms' to add participant age, sex and dose

5. Fit the linear model and extract the coefficients and p-values

---

```r
## 1. Create a wave 0 subset of 'visits'
```
--
```r
dat.wave0 <- visits %>% filter(wave=="wave0") %>% select(participant, bmi)
```
--
```r
## 2. Create a wave 2 subset of 'visits'
```
--
```r
dat.wave2 <- visits %>% filter(wave=="wave2") %>% select(participant, bmi)
```
--
```r
## 3. Merge these two so we have a data frame with bmi at wave 0 and wave 2
```
--
```r
dat.wave0 %>% 
 left_join(dat.wave2, by="participant", suffix=c(".wave0", ".wave2")) %>%
```
--
```r
 ## 4. Merge this with 'participants' and 'arms' to add participant age, sex and dose
```
--
```r
 left_join(participants, by="participant") %>%
 left_join(arms, by="arm") %>%
```
--
```r
 ## 5. Fit the linear model and extract the coefficients and p-values
```
--
```r
 lm(bmi.wave2 ~ bmi.wave0 + age + sex + dose, data=.) %>%
 summary() %>% coef()
```

---

## Wait, what does that mysterious '.' mean?

```r
  ... %>%
  lm(bmi.wave2 ~ bmi.wave0 + age + sex + dose, data=.) %>%
  ...
```
--
.box[
The '.' refers to the input coming from the the pipe ('%>%') 
to the `lm` function.  We want the incoming data frame/tibble
to be passed to `lm` via the `data` argument.

By default, pipe input is passed as the first argument to the function.
Most tidyverse functions are written with this default in mind.
]
---

## Running statistical tests

```r
dat.wave0 <- visits %>% filter(wave=="wave0") %>% select(participant, bmi)
dat.wave2 <- visits %>% filter(wave=="wave2") %>% select(participant, bmi)
dat.wave0 %>% 
 left_join(dat.wave2, by="participant", suffix=c(".wave0", ".wave2")) %>%
 left_join(participants, by="participant") %>%
 left_join(arms, by="arm") %>%
 lm(bmi.wave2 ~ bmi.wave0 + age + sex + dose, data=.) %>%
 summary() %>% coef()
```
--
```r
 Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.47273674 1.77929008 -1.3897322 1.653930e-01
bmi.wave0 1.01448861 0.01752561 57.8860586 4.790746e-195
age 0.08544156 0.06725427 1.2704259 2.046808e-01
sexM 0.08955528 0.14346982 0.6242099 5.328501e-01
dose 0.57515357 0.04672540 12.3092276 1.072305e-29
```
--

Treatment dose is strongly associated with increasing BMI (p=1.1e-29).

As expected, BMI is strongly associated across waves (p = 4.8e-195).

Beyond that, age and sex have almost no effect (p > 0.2).

---

## Data pivots

The data transformation we just applied is called a 'pivot from long to wide'.

1. Initially, the data is **'long'** with one row per BMI measurement per participant.

```r
> visits %>% subset(c("participant","wave","bmi"))

participant wave    bmi
 1           1 wave0  15.4
 2           1 wave1  16.5
 3           1 wave2  14.9
 ...
```

2. After the pivot, the data is **'shorter'** with one row per participant, but **'wider'** due to having two columns providing BMI measurements.

```r
> dat.wave0 %>% 
  left_join(dat.wave2, by="participant", suffix=c(".wave0", ".wave2"))

participant bmi.wave0 bmi.wave2
 1           1      15.4      14.9
 2           2      19.4      21.8
 3           3      22.0      23.1
```

---

Because pivots are so common, tidyr has created `pivot_wider()` and its inverse `pivot_longer()`.

.ii[
As before ...
```r
## 1. Create a wave 0 subset of 'visits'
> dat.wave0 <- visits %>% 
 filter(wave=="wave0") %>% 
 select(participant, bmi)

## 2. Create a wave 2 subset of 'visits'
> dat.wave2 <- visits %>% 
 filter(wave=="wave2") %>% 
 select(participant, bmi)

## 3. Merge these two 
> dat.wave0 %>% 
  left_join(dat.wave2, 
          by="participant", 
          suffix=c(".wave0", ".wave2")) %>%

## 4. Merge this with 'participants' and 'arms' 
  left_join(participants, by="participant") %>%
  left_join(arms, by="arm") %>%

## 5. Fit the linear model 
  lm(bmi.wave2~bmi.wave0+age+sex+dose,data=.) %>%
  summary() %>% 
  coef()
```

]

.gap[ ]

.ii[
With `pivot_wider` ...
```r
## 1-2. Create a wave 0 and 2 subset of 'visits'
> visits %>% 
  filter(wave %in% c("wave0","wave2")) %>%

## 3. pivot dataset (and rename columns)
  pivot_wider(
    id_cols=participant,     ## individuals
    names_from=wave,         ## new column names
    values_from=bmi) %>%     ## new column values
  rename(bmi.wave0=wave0,bmi.wave2=wave2) %>%

## 4. Merge this with 'participants' and 'arms'
  left_join(participants, by="participant") %>%
  left_join(arms, by="arm") %>%

## 5. Fit the linear model 
  lm(bmi.wave2~bmi.wave0+age+sex+dose,data=.) %>%
  summary() %>%
  coef()
```

]

---

## Visualizing data with `ggplot2`

We've looked at a lot of tables of statistics.

We can visualise these using using `ggplot()`.

We can visualise blink rate, treatment dose and time as follows:

.rightmid[
<img src="boxplot.png" width="50%">
]

1. pipe the data to `ggplot()`

2. tell `ggplot()` to show blink rates (y-axis) by dose (x-axis) and by wave

3. tell `ggplot` that blink rate distributions will be displayed as box plots

4. label the x and y axes appropriately

---

.ii[

```r
## 1. pipe the data to `ggplot()`
visits %>%
  left_join(participants,by="participant") %>%
  left_join(arms,by="arm") %>%

## 2. tell `ggplot()` to show blink rates (y-axis) 
##    by dose (x-axis) and by wave
  ggplot(aes(x=factor(dose),y=blink,fill=wave)) +

## 3. tell `ggplot` that blink rate distributions 
##    will be displayed as box plots  
    geom_boxplot() +

## 4. label the x and y axes appropriately
    xlab("treatment dose") + 
    ylab("blink rate") 
```

(without the comments)
```r
visits %>%
  left_join(participants,by="participant") %>%
  left_join(arms,by="arm") %>%
  ggplot(aes(x=factor(dose),y=blink,fill=wave)) + 
    geom_boxplot() + 
    xlab("treatment dose") + 
    ylab("blink rate") 
```
]

.gap[ ]

.ii[
<img src="boxplot.png" width="100%">
]

---

## Talking to `ggplot`

.ii[
1. `ggplot` likes "long" data format. 
   If your data is in "wide" format, use `pivot_longer()`.

2. Notice how the '+' is like '%>%'.

- We pipe *data* to functions with '%>%'.

- We pipe *instructions* to ggplot with '+'.

And just like pipes, we can refer to the plot with a variable.

```r
p <- visits %>%
 left_join(participants,by="participant") %>%
 left_join(arms,by="arm") %>%
 ggplot(aes(x=factor(dose),y=blink,fill=wave)) + 
 geom_boxplot() + 
 xlab("treatment dose") + 
 ylab("blink rate") 
```
]
.gap[ ]
--
.ii[
Later, we can add additional instructions
about the plot, 
e.g. to change the boxplot colors
```{r}
fill.cols <- c("#999999", "#E69F00", "#56B4E9")
p <- p + scale_fill_manual(values=fill.cols)
p
```

<img src="boxplot-colors.png" width="100%">
]

---

## Challenge!

Change the plot to show HDL cholesterol instead of blink rate.

```r
visits %>%
  left_join(participants,by="participant") %>%
  left_join(arms,by="arm") %>%
  ggplot(aes(x=factor(dose), y=hdl, fill=wave)) + 
   geom_boxplot() + 
   xlab("treatment dose") + 
   ylab("HDL cholesterol") 
```

Sorry! Not very exciting.

.right[
<img src="hdl.png" width="100%">
]

---

## Challenge!

Generate a plot showing HDL cholesterol vs BMI (baseline).

*Hint:* use `geom_point()`

```r
p <- visits %>% 
 filter(wave=="wave0") %>% 
 ggplot(aes(x=bmi, y=hdl)) +
 geom_point()
p
```
(saving the plot so we make a modification)

.right[
<img src="scatterplot.png" width="100%">
]

Now add a regression line to the scatterplot.

*Hint:* add `geom_smooth`

```r
p <- p + geom_smooth(method=lm)
p
```

.right[
<img src="scatterplot-line.png" width="100%">
]

---

## For more information about the tidyverse

https://r4ds.had.co.nz/

---

## Converting between base R and dplyr

If you are familiar with base R, you might find 
it useful to see how dplyr commands
map to base R commands.

https://dplyr.tidyverse.org/articles/base.html

e.g.

|dplyr|base|
|:----|:---|
|arrange(df, x) | df[order(x), , drop = FALSE]|
|distinct(df, x) |df[!duplicated(x), , drop = FALSE], unique()|
|filter(df, x)| df[which(x), , drop = FALSE], subset()|
|mutate(df, z = x + y)| df$z <- df$x + df$y, transform()|
|pull(df, 1)| df[[1]]|
|pull(df, x) |df$x|
|rename(df, y = x)| names(df)[names(df) == "x"] <- "y"|
|...||

---

## If you choose to enter the tidyverse ...

We've seen how tidyverse 
attempts to fix some problems in R.

The tidyverse solution isn't without challenges, e.g.

1. tidyverse needs more memorization (eg. dplyr has > 200 functions!)

2. tidyverse is new and changing (https://dplyr.tidyverse.org/articles/compatibility.html)

3. tidyverse functions can be difficult to debug

4. dplyr functions can be much slower than alternatives, e.g. data.table

5. the 'tidyverse way' can be confusing to new R users
 (e.g. tidyverse does a lot 'behind the scenes magic' that can seem mysterious)

https://github.com/matloff/TidyverseSkeptic