count: false class: left, bottom # How to enter the tidyverse of madness ### Matthew Suderman #### Senior Lecturer in Epigenetic Epidemiology <img src="IEU-logo-colour.png" width="50%"> --- layout: true .logo[.mrcieu[ MRC Integrative Epidemiology Unit ]] --- ## What's wrong with R? R is old (circa 1993). It was originally designed for doing **statistics** with relatively small, simple datasets. Since then, it has become one of the most popular tools for the new discipline called **[data science](https://en.wikipedia.org/wiki/Data_science)**. .quote["Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains." https://en.wikipedia.org/wiki/Data_science] --- ## Shortcomings of R for data science 1. Lacks **consistency** -- 2. Often **suprising** -- 3. Tends to be **slow** and **memory-intensive** -- 4. Built around **vectors and matrices** -- 5. Functional language can be **awkward** to use --- ## 1. Lacks **consistency** .ii[ **function naming** ``` names, colnames row.names, rownames rowSums, rowsum rowMeans, (no parallel rowmean exists) browseURL, contrib.url, fixup.package.URLs package.contents, packageStatus getMethod, getS3method read.csv and write.csv, load and save, readRDS and saveRDS Sys.time, system.time ``` [https://r4stats.com/articles/why-r-is-hard-to-learn/] **multiple ways to do things** e.g. two different object-oriented systems: S3 and S4 systems ] .ii[ **missing values** ```r > x <- c(1,2,3,4,NA) > y <- c(1.1,2.2,3.3,4.4,5.5) ## missing values not okay > quantile(x) Error in quantile.default(x) : missing values not allowed if 'na.rm' is FALSE ## missing values okay > fit <- lm(y ~ x) > length(x) [1] 5 > length(residuals(fit)) [1] 4 ``` ] --- ## 2. Often **surprising** R often converts between data types behind the scenes, e.g. characters to factors, matrices to vectors, numeric to characters ```r > x <- matrix(1:12,ncol=3) > x [,1] [,2] [,3] [1,] 1 5 9 [2,] 2 6 10 [3,] 3 7 11 [4,] 4 8 12 ``` -- ```r > class(x[,1:2]) [1] "matrix" "array" ``` -- ```r > class(x[,3]) ## what is this? ``` -- ```r [1] "integer" ``` --- ```r > x <- matrix(1:12,ncol=3) > x [,1] [,2] [,3] [1,] 1 5 9 [2,] 2 6 10 [3,] 3 7 11 [4,] 4 8 12 ``` -- ```r > x[1,1] 1 ``` -- ```r > x[1,3] <- "a" ``` -- ```r > x[1,1] + 1 ## what is this equal to? ``` -- ```r Error in x[1, 1] + 1 : non-numeric argument to binary operator ``` -- ```r > class(x[1,1]) [1] "character" ``` -- ```r > x[1,1] [1] "1" ``` --- ## 3. Tends to be **slow** and **memory-intensive** R was designed for small datasets, so it tends to do things - the *simple* way rather than - the *efficient* way. -- .box[ e.g. it is designed to **slurp** data Slurping means loading entire datasets into memory before analysing them. This is not efficient when the analysis really only needs to see a table row-by-row. *But*, this it simpler to just load (slurp) the entire dataset in one simple step. ] --- ## 4. Built around **vectors and matrices** ... whereas data science analyses **data frames** like this: | participant| age|sex |arm | |-----------:|--------:|:---|:----| | 1| 23.58224|F |arm3 | | 2| 24.69102|M |arm4 | | 3| 25.42567|F |arm2 | | 4| 24.96019|F |arm2 | | 5| 23.35406|M |arm4 | Each column has its own data type. -- .box[ Although R implements data frames, most functions are designed for vectors and matrices. ] --- ## 5. Functional language that can be **awkward** to use e.g. even if you know what all the functions do, this expression is difficult to understand! ```r summarize(group_by(left_join(filter(visits, wave=="wave2"), participants, by="participant"),arm,sex),n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink)) ``` --- ## tidyverse .ii[ [Tidyverse](https://www.tidyverse.org/) is a collection of R packages that aim to make the analysis of large datasets more convenient. <img src="tidyverse.png" width="75%"> ] .gap[ ] -- .ii[ Here is intended workflow in the tidyverse: <img src="data-science.png" width="100%"> - Data is **import**ed and made **tidy** so it can be analysed - Analysis involves **transform**ing the data - Relationships in the data are **visualised** - **Models** are applied to test relationships - Finally, conclusions are **communicate**d ] --- ## tidyverse book https://r4ds.had.co.nz/
<img src="tidyverse-book.png" width="100%"> --- ## Using tidyverse packages Installing all tidyverse packages. ```r install.packages("tidyverse") ``` Loading all tidyverse packages. ```r library(tidyverse) ``` --- ## Our dataset .ii[ In this session, we will explore the tidyverse by performing a basic data analysis of a simulated dataset from a randomized control trial. <img src="blinking-guy.jpg" width="90%"> ] .gap[ ] -- .ii[ **Background** (Pretend) studies have shown that blink rate is positively associated with disengaged attention. A drug has been developed that is known to reduce blink rate. **Aim** Determine if the drug will reduce attention deficits by reducing blink rate. **Methods** The were 4 arms: controls (arm 1), low dose (arm 2), medium (arm 3), high (arm 4) Each arm had 100 randomly selected individuals. Data was collected in 3 waves: baseline (wave 0), 6 months (wave 1), 12 months (wave 2) ] --- ## Dataset schema <img src="datamodel.png" width="90%"> --- ## Import the dataset The dataset is available as CSV files. These can be imported into R using tidyverse function `readr::read_csv()`. ```r data.dir <- "https://raw.githubusercontent.com/perishky/perishky.github.io/master/r/enter-the-tidyverse" arms <- read_csv(file.path(data.dir, "arms.csv")) participants <- read_csv(file.path(data.dir, "participants.csv")) visits <- read_csv(file.path(data.dir, "visits.csv")) ``` -- ```r > arms <- read_csv("arms.csv") Rows: 4 Columns: 2 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (1): arm dbl (1): dose > arms # A tibble: 4 × 2 arm dose
1 arm3 2 2 arm4 4 3 arm2 1 4 arm1 0 ``` --- ## Why use `read_csv` rather than `read.csv`? -- **Speed** `read_csv` is typically about 10x faster. If this isn't fast enough, `data.table::fread` is even faster! -- **Better communication** `read_csv` provides more information about what it has imported into R and clearer error messages if the import failed via the `problems()` function. ```r > arms <- read_csv("arms.csv") Rows: 4 Columns: 2 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (1): arm dbl (1): dose ``` -- **Tidy format** `read_csv` creates tibbles rather than data frames. (we'll discuss tibbles vs data frames on the next slide) ```r > arms # A tibble: 4 × 2 arm dose
1 arm3 2 2 arm4 4 3 arm2 1 4 arm1 0 ``` --- ## Tidy Tidy data is stored in tables called 'tibbles'. 1. "Each variable must have its own column." 2. "Each observation must have its own row." 3. "Each value must have its own cell." -- > *Why use a tibble rather than a data frame?* (after all, tibbles are data frames) -- > **Predictability** `tibble` never changes the types of inputs >
e.g. character strings are never converted to factors. It never changes column names. It never adds row names. -- > **Convenience** Whereas `tibble` can create data frames column-by-column, there is a `tribble` function to create data frames row-by-row. -- > **Pretty printing** Tibbles are nicer to look at! >
Typing the name of a data frame in R and pressing enter will print the entire data frame to the screen, regardless of its length or width! Tibbles will only show the first few rows and columns along with the dimensions of the data frame and data type of each column. --- ## Creating tibbles Tibbles can be created by column (just like a data frame) ```{r,eval=F} tibble( x=c("a","b"), y=c(3,1), z=c(pi,2*pi) ) ``` -- or by row like they'd be written in a data file ```r tribble( ~x, ~y, ~z, #--|--|---- "a", 3, pi, "b", 1,2*pi ) ``` --- ## Answering questions How many participants are in each study arm? -- .ii[ We'll use the `participants` tibble. ```r > participants # A tibble: 400 × 4 participant age sex arm
1 1 23.6 F arm3 2 2 24.7 M arm4 3 3 25.4 F arm2 4 4 25.0 F arm2 5 5 23.4 M arm4 6 6 25.9 M arm2 7 7 26.5 M arm4 8 8 26.6 F arm1 9 9 25.3 F arm2 10 10 25.3 F arm3 # … with 390 more rows ``` ] .gap[ ] -- .ii[ We'll use the `count()` function to count participants in each arm. ```r > count(participants, arm) # A tibble: 4 × 2 arm n
1 arm1 100 2 arm2 100 3 arm3 100 4 arm4 100 ``` `count()` is very similar to the `table()` function in base R. ] --- ## Challenge! See if you can count the number of female participants in each arm. -- ```r > count(participants, arm, sex) # A tibble: 8 × 3 arm sex n
1 arm1 F 48 2 arm1 M 52 3 arm2 F 43 4 arm2 M 57 5 arm3 F 50 6 arm3 M 50 7 arm4 F 46 8 arm4 M 54 ``` --- ## Filtering That last output gives more information than we wanted. To focus on just females, we'll need to filter out information about males. To do this, we use the `filter()` function. ```r > counts <- count(participants, arm, sex) > filter(counts, sex == "F") # A tibble: 4 × 3 arm sex n
1 arm1 F 48 2 arm2 F 43 3 arm3 F 50 4 arm4 F 46 ``` --- ## Challenge! See if you can obtain the same result by 1. first applying the `filter` function to `participants` to remove males 2. and then applying the `count` function to the result. -- ```r > females <- filter(participants, sex=="F") > count(females, arm, sex) # A tibble: 4 × 3 arm sex n
1 arm1 F 48 2 arm2 F 43 3 arm3 F 50 4 arm4 F 46 ``` --- ## Intermediate variables For longer queries, it can get a little tedious creating variable names for each step. In the queries above, we had to create `counts` and `females` even though we aren't directly interested in those variables, e.g. ```r counts <- count(participants, arm, sex) filter(counts, sex == "F") ``` -- One solution is called **function composition**. ```r filter(count(participants, arm, sex), sex=="F") ``` -- ## Challenge! See if you can obtain the same result with `count` as the outer function, i.e. `count(filter(...), ...)` -- ```r count(filter(participants, sex=="F"), arm, sex) ``` --- ## Function composition gets extreme Although function composition avoids having to name intermediate variables, they can make the code difficult to read. For example, the following command summarizes information about each study arm (we'll discuss the functions used here later). ```r summarize(group_by(left_join(filter(visits, wave=="wave2"), participants, by="participant"),arm,sex),n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink)) ``` -- To be fair, we can split it across lines to make it a little easier to read. ```r summarize( group_by( left_join( filter(visits, wave=="wave2"), participants, by="participant"), arm, sex), n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink)) ``` --- ## Pipes The "pipe" operator (`%>%`) was invented to solve this problem. This command ... ```r filter(count(participants, arm, sex), sex=="F") ``` -- ... can be written like this ```r participants %>% filter(sex=="F") %>% count(arm, sex) ``` -- ... or even better, split across multiple lines ```r participants %>% filter(sex=="F") %>% count(arm, sex) ``` i.e. - we 'pipe' `participants` to `filter`, - and then the output of `filter` to `count` --- ## Challenge! See if you can rewrite the following using pipes: ```r filter(count(participants,arm,sex), sex=="F") ``` -- ```r participants %>% count(arm,sex) %>% filter(sex=="F") ``` --- ## Challenge! Now see if you can rewrite this command using pipes: ```r summarize( group_by( left_join( filter(visits, wave=="wave2"), participants, by="participant"), arm, sex), n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink)) ``` -- ```r visits %>% filter(wave=="wave2") %>% left_join(participants, by="participant") %>% group_by(arm,sex) %>% summarize(n=n(),age=mean(age), bmi=mean(bmi), blink=mean(blink)) ``` --- ## Calculating data summaries The `summarise(
,
,
,...)` summarizes the columns of a tibble. For example, we can use it to calculate 1. the average participant age, and 2. how many of them are females. -- ```r > summarise(participants, age=median(age), nfemales=sum(sex=="F")) # A tibble: 1 × 2 age nfemales
1 25.0 187 ``` -- Here is what it looks like using pipes. ```r participants %>% summarise(age=median(age), nfemales=sum(sex=="F")) ``` --- ## Calculating summaries of subsets The previous example above provided summaries across *all* rows. It is also possible to summarize subsets of rows using `group_by(
,
,
,...)`. For example, we can calculate, *for each arm of the study* 1. the average participant age, and 2. the number of males and females. -- ```r > participants %>% group_by(arm,sex) %>% summarise(n=n(), age=mean(age)) # A tibble: 8 × 4 # Groups: arm [4] arm sex n age
1 arm1 F 48 25.1 2 arm1 M 52 25.1 3 arm2 F 43 25.1 4 arm2 M 57 24.8 5 arm3 F 50 25.1 6 arm3 M 50 24.8 7 arm4 F 46 24.9 8 arm4 M 54 25.1 ``` -- .popup[.box[ Did you notice some **magic**? 1. `summarise()` magically knows when a tibble has come from `group_by()` and that it should summarize by the resulting groups. 2. `n()` function magically knows about the tibble subsets being processed by `summarise`. ]] --- ## Challenge! Can you think of an alternative to the mysterious `n()` function? (*this is a tricky question*) ```r participants %>% group_by(arm,sex) %>% summarise(n=n(), age=mean(age)) ``` -- *Hint:* Use the `length` function ... -- ```r participants %>% group_by(arm,sex) %>% summarise(n=length(age), age=mean(age)) ``` --- ## Merging datasets We'd like to augment the summary with measurements from the first visit. This information is in the `visits` dataset. To do this, we'll first need to: 1. filter `visits` to get just the first visit (there were 3 visits) 2. merge the result with `participants` 3. group the result by arm and by sex 4. calculate the mean of age, BMI and blink rate -- We've seen how to do each of these steps except step 2, merging. ```r visits %>% filter(wave=="wave0") %>% ????????? merge with participants ????????? %>% group_by(arm,sex) %>% summarise(n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink)) ``` -- In base R, we'd use the function `merge()`. For tibbles, we have four different functions: `left_join()`, `right_join()`, `inner_join()` and `full_join()`. --- ## `left_join()` .ii[ To illustrate, we'll use a small version of `visits`. ```r > visits.s <- visits %>% filter(participant %in% 1:3) > visits.s # A tibble: 9 × 8 participant wave month blink bmi hdl trig ldl dbl>
1 1 wave0 0 17.6 15.4 1.38 1.52 2.11 2 1 wave1 6.18 16.0 16.5 1.35 1.33 1.88 3 1 wave2 10.4 15.6 14.9 1.39 1.39 2.01 4 2 wave0 0 16.3 19.4 1.64 1.70 4.06 5 2 wave1 5.50 14.5 20.6 1.61 1.31 3.90 6 2 wave2 13.0 11.2 21.8 1.57 1.62 4.04 7 3 wave0 0 18.6 22.0 1.61 1.12 2.11 8 3 wave1 6.58 17.8 20.8 1.75 1.14 1.82 9 3 wave2 13.5 17.3 23.1 1.53 1.11 2.21 ``` ] .gap[ ] -- .ii[ Recall `participants`. ```r > participants # A tibble: 400 × 4 participant age sex arm
1 1 23.6 F arm3 2 2 24.7 M arm4 3 3 25.4 F arm2 4 4 25.0 F arm2 5 5 23.4 M arm4 6 6 25.9 M arm2 7 7 26.5 M arm4 8 8 26.6 F arm1 9 9 25.3 F arm2 10 10 25.3 F arm3 # … with 390 more rows ``` ] --- ## `left_join()` `left_join()` adds data from `participants` that match the three participants in `visits.s`. ```r > left_join(visits.s, participants, by="participant") Joining, by = "participant" # A tibble: 9 × 11 participant wave month blink bmi hdl trig ldl age sex arm
1 1 wave0 0 17.6 15.4 1.38 1.52 2.11 23.6 F arm3 2 1 wave1 6.18 16.0 16.5 1.35 1.33 1.88 23.6 F arm3 3 1 wave2 10.4 15.6 14.9 1.39 1.39 2.01 23.6 F arm3 4 2 wave0 0 16.3 19.4 1.64 1.70 4.06 24.7 M arm4 5 2 wave1 5.50 14.5 20.6 1.61 1.31 3.90 24.7 M arm4 6 2 wave2 13.0 11.2 21.8 1.57 1.62 4.04 24.7 M arm4 7 3 wave0 0 18.6 22.0 1.61 1.12 2.11 25.4 F arm2 8 3 wave1 6.58 17.8 20.8 1.75 1.14 1.82 25.4 F arm2 9 3 wave2 13.5 17.3 23.1 1.53 1.11 2.21 25.4 F arm2 ``` -- Notice how the result is just `visits.s` but with some additional columns from `participants`. Rows in `participants` that do not match are omitted. --- ## Challenge! What do you think happens if some participants in `visits` do not match a participant in `participants`? ```r > parts.s <- participants %>% filter(participant %in% 1:2) > parts.s # A tibble: 2 × 4 participant age sex arm
1 1 23.6 F arm3 2 2 24.7 M arm4 ``` -- ```r > left_join(visits.s, parts.s, by="participant") # A tibble: 9 × 11 participant wave month blink bmi hdl trig ldl age sex arm
1 1 wave0 0 17.6 15.4 1.38 1.52 2.11 23.6 F arm3 2 1 wave1 6.18 16.0 16.5 1.35 1.33 1.88 23.6 F arm3 3 1 wave2 10.4 15.6 14.9 1.39 1.39 2.01 23.6 F arm3 4 2 wave0 0 16.3 19.4 1.64 1.70 4.06 24.7 M arm4 5 2 wave1 5.50 14.5 20.6 1.61 1.31 3.90 24.7 M arm4 6 2 wave2 13.0 11.2 21.8 1.57 1.62 4.04 24.7 M arm4 7 3 wave0 0 18.6 22.0 1.61 1.12 2.11 NA NA NA 8 3 wave1 6.58 17.8 20.8 1.75 1.14 1.82 NA NA NA 9 3 wave2 13.5 17.3 23.1 1.53 1.11 2.21 NA NA NA ``` --- ## Putting it altogether Recall that we want to summarize information from all participants, grouped by age and sex. ```r visits %>% filter(wave=="wave0") %>% left_join(participants,by="participant") %>% ## merging step group_by(arm,sex) %>% summarise(n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink)) ``` -- ```r # A tibble: 8 × 6 # Groups: arm [4] arm sex n age bmi blink
1 arm1 F 48 25.1 21.4 17.7 2 arm1 M 52 25.1 23.9 17.5 3 arm2 F 43 25.1 21.3 17.6 4 arm2 M 57 24.8 23.4 17.5 5 arm3 F 50 25.1 21.0 17.7 6 arm3 M 50 24.8 23.8 17.6 7 arm4 F 46 24.9 21.4 17.2 8 arm4 M 54 25.1 22.9 17.6 ``` --- ## Challenge! The previous output wasn't interesting because we're looking at the data before any treatment. Modify the command to see the same data from the final visit. -- ```r visits %>% filter(wave=="wave2") %>% left_join(participants,by="participant") %>% group_by(arm,sex) %>% summarise(n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink)) ``` -- ```r # A tibble: 8 × 6 # Groups: arm [4] arm sex n age bmi blink
1 arm1 F 48 25.1 21.6 17.7 2 arm1 M 52 25.1 24.0 17.5 3 arm2 F 43 25.1 21.6 16.5 4 arm2 M 57 24.8 24.0 16.3 5 arm3 F 50 25.1 22.1 15.3 6 arm3 M 50 24.8 25.1 15.2 7 arm4 F 46 24.9 23.7 12.6 8 arm4 M 54 25.1 25.2 13.0 ``` Can you see differences between the study arms now? --- ## Merging multiple tables To remember the treatment within each study arm, we'll add 'dose' to the output found in the `arms` tibble. -- ```r query <- visits %>% filter(wave=="wave2") %>% left_join(participants,by="participant") %>% left_join(arms,by="arm") %>% group_by(arm,sex) %>% summarise(dose=dose[1],n=n(), age=mean(age), bmi=mean(bmi), blink=mean(blink)) ``` -- ```r > query # A tibble: 8 × 7 # Groups: arm [4] arm sex dose n age bmi blink
1 arm1 F 0 48 25.1 21.6 17.7 2 arm1 M 0 52 25.1 24.0 17.5 3 arm2 F 1 43 25.1 21.6 16.5 4 arm2 M 1 57 24.8 24.0 16.3 5 arm3 F 2 50 25.1 22.1 15.3 6 arm3 M 2 50 24.8 25.1 15.2 7 arm4 F 4 46 24.9 23.7 12.6 8 arm4 M 4 54 25.1 25.2 13.0 ``` --- ## Selecting columns Having both 'arm' and 'dose' columns is a bit redundant, we just want to show 'dose'. For this, we can use the `select()` function. -- ```r > query %>% ungroup() %>% select(dose,sex,n,age,bmi,blink) # A tibble: 8 × 6 dose sex n age bmi blink
1 0 F 48 25.1 21.6 17.7 2 0 M 52 25.1 24.0 17.5 3 1 F 43 25.1 21.6 16.5 4 1 M 57 24.8 24.0 16.3 5 2 F 50 25.1 22.1 15.3 6 2 M 50 24.8 25.1 15.2 7 4 F 46 24.9 23.7 12.6 8 4 M 54 25.1 25.2 13.0 ``` The `ungroup` command is necessary to turn 'off' grouping. --- ## Omitting columns It's a bit tedious to type all the column names when we just want to remove one. There is a way to do this. -- ```r query <- query %>% ungroup() %>% select(-arm) ``` -- ```r > query # A tibble: 8 × 6 sex dose n age bmi blink
1 F 0 48 25.1 21.6 17.7 2 M 0 52 25.1 24.0 17.5 3 F 1 43 25.1 21.6 16.5 4 M 1 57 24.8 24.0 16.3 5 F 2 50 25.1 22.1 15.3 6 M 2 50 24.8 25.1 15.2 7 F 4 46 24.9 23.7 12.6 8 M 4 54 25.1 25.2 13.0 ``` --- ## Reordering rows With male rows next to female rows, it is difficult to see if there were any treatment effects on BMI. -- We'll reorder/arrange the rows by sex and then by dose. .ii[ Before reordering ... ```r > query # A tibble: 8 × 6 sex dose n age bmi blink
1 F 0 48 25.1 21.6 17.7 2 M 0 52 25.1 24.0 17.5 3 F 1 43 25.1 21.6 16.5 4 M 1 57 24.8 24.0 16.3 5 F 2 50 25.1 22.1 15.3 6 M 2 50 24.8 25.1 15.2 7 F 4 46 24.9 23.7 12.6 8 M 4 54 25.1 25.2 13.0 ``` ] .gap[ ] -- .ii[ After reordering ... ```r > query %>% arrange(sex,dose) # A tibble: 8 × 6 sex dose n age bmi blink
1 F 0 48 25.1 21.6 17.7 2 F 1 43 25.1 21.6 16.5 3 F 2 50 25.1 22.1 15.3 4 F 4 46 24.9 23.7 12.6 5 M 0 52 25.1 24.0 17.5 6 M 1 57 24.8 24.0 16.3 7 M 2 50 24.8 25.1 15.2 8 M 4 54 25.1 25.2 13.0 ``` Now we can more easily see that BMI appears to be increasing with dose. ] --- ## Running statistical tests We'd like to determine statistically if treatment actually increases BMI. We can do this by fitting the following linear model: .box[BMI
wave 2
~ BMI
wave 0
+ age + sex + dose] -- For this, we'll need to prepare a tibble like this: |bmi.wave0|bmi.wave2|age|sex|dose| |---:|-----:|:----|---:|----:| | 5.4| 14.9| 23.5|F | 2| |19.3| 21.8| 24.7|M | 4| |22.0| 23.1| 25.4|F | 1| |20.6| 19.7| 24.9|F | 1| |26.1| 28.6| 23.4|M | 4| | ...| | | | | --- ## Running statistical tests Here are the steps: -- 1. Create a wave 0 subset of 'visits' -- 2. Create a wave 2 subset of 'visits' -- 3. Merge these two so we have a data frame with bmi at wave 0 and wave 2 -- 4. Merge this with 'participants' and 'arms' to add participant age, sex and dose -- 5. Fit the linear model and extract the coefficients and p-values --- ```r ## 1. Create a wave 0 subset of 'visits' ``` -- ```r dat.wave0 <- visits %>% filter(wave=="wave0") %>% select(participant, bmi) ``` -- ```r ## 2. Create a wave 2 subset of 'visits' ``` -- ```r dat.wave2 <- visits %>% filter(wave=="wave2") %>% select(participant, bmi) ``` -- ```r ## 3. Merge these two so we have a data frame with bmi at wave 0 and wave 2 ``` -- ```r dat.wave0 %>% left_join(dat.wave2, by="participant", suffix=c(".wave0", ".wave2")) %>% ``` -- ```r ## 4. Merge this with 'participants' and 'arms' to add participant age, sex and dose ``` -- ```r left_join(participants, by="participant") %>% left_join(arms, by="arm") %>% ``` -- ```r ## 5. Fit the linear model and extract the coefficients and p-values ``` -- ```r lm(bmi.wave2 ~ bmi.wave0 + age + sex + dose, data=.) %>% summary() %>% coef() ``` --- ## Wait, what does that mysterious '.' mean? ```r ... %>% lm(bmi.wave2 ~ bmi.wave0 + age + sex + dose, data=.) %>% ... ``` -- .box[ The '.' refers to the input coming from the the pipe ('%>%') to the `lm` function. We want the incoming data frame/tibble to be passed to `lm` via the `data` argument. By default, pipe input is passed as the first argument to the function. Most tidyverse functions are written with this default in mind. ] --- ## Running statistical tests ```r dat.wave0 <- visits %>% filter(wave=="wave0") %>% select(participant, bmi) dat.wave2 <- visits %>% filter(wave=="wave2") %>% select(participant, bmi) dat.wave0 %>% left_join(dat.wave2, by="participant", suffix=c(".wave0", ".wave2")) %>% left_join(participants, by="participant") %>% left_join(arms, by="arm") %>% lm(bmi.wave2 ~ bmi.wave0 + age + sex + dose, data=.) %>% summary() %>% coef() ``` -- ```r Estimate Std. Error t value Pr(>|t|) (Intercept) -2.47273674 1.77929008 -1.3897322 1.653930e-01 bmi.wave0 1.01448861 0.01752561 57.8860586 4.790746e-195 age 0.08544156 0.06725427 1.2704259 2.046808e-01 sexM 0.08955528 0.14346982 0.6242099 5.328501e-01 dose 0.57515357 0.04672540 12.3092276 1.072305e-29 ``` -- Treatment dose is strongly associated with increasing BMI (p=1.1e-29). As expected, BMI is strongly associated across waves (p = 4.8e-195). Beyond that, age and sex have almost no effect (p > 0.2). --- ## Data pivots The data transformation we just applied is called a 'pivot from long to wide'. -- 1. Initially, the data is **'long'** with one row per BMI measurement per participant. ```r > visits %>% subset(c("participant","wave","bmi")) participant wave bmi 1 1 wave0 15.4 2 1 wave1 16.5 3 1 wave2 14.9 ... ``` -- 2. After the pivot, the data is **'shorter'** with one row per participant, but **'wider'** due to having two columns providing BMI measurements. ```r > dat.wave0 %>% left_join(dat.wave2, by="participant", suffix=c(".wave0", ".wave2")) participant bmi.wave0 bmi.wave2 1 1 15.4 14.9 2 2 19.4 21.8 3 3 22.0 23.1 ``` --- Because pivots are so common, tidyr has created `pivot_wider()` and its inverse `pivot_longer()`. -- .ii[ As before ... ```r ## 1. Create a wave 0 subset of 'visits' > dat.wave0 <- visits %>% filter(wave=="wave0") %>% select(participant, bmi) ## 2. Create a wave 2 subset of 'visits' > dat.wave2 <- visits %>% filter(wave=="wave2") %>% select(participant, bmi) ## 3. Merge these two > dat.wave0 %>% left_join(dat.wave2, by="participant", suffix=c(".wave0", ".wave2")) %>% ## 4. Merge this with 'participants' and 'arms' left_join(participants, by="participant") %>% left_join(arms, by="arm") %>% ## 5. Fit the linear model lm(bmi.wave2~bmi.wave0+age+sex+dose,data=.) %>% summary() %>% coef() ``` ] .gap[ ] .ii[ With `pivot_wider` ... ```r ## 1-2. Create a wave 0 and 2 subset of 'visits' > visits %>% filter(wave %in% c("wave0","wave2")) %>% ## 3. pivot dataset (and rename columns) pivot_wider( id_cols=participant, ## individuals names_from=wave, ## new column names values_from=bmi) %>% ## new column values rename(bmi.wave0=wave0,bmi.wave2=wave2) %>% ## 4. Merge this with 'participants' and 'arms' left_join(participants, by="participant") %>% left_join(arms, by="arm") %>% ## 5. Fit the linear model lm(bmi.wave2~bmi.wave0+age+sex+dose,data=.) %>% summary() %>% coef() ``` ] --- ## Visualizing data with `ggplot2` We've looked at a lot of tables of statistics. We can visualise these using using `ggplot()`. -- We can visualise blink rate, treatment dose
and time as follows: .rightmid[ <img src="boxplot.png" width="50%"> ] -- 1. pipe the data to `ggplot()` -- 2. tell `ggplot()` to show blink rates (y-axis)
by dose (x-axis) and by wave -- 3. tell `ggplot` that blink rate distributions will be
displayed as box plots -- 4. label the x and y axes appropriately --- .ii[ ```r ## 1. pipe the data to `ggplot()` visits %>% left_join(participants,by="participant") %>% left_join(arms,by="arm") %>% ## 2. tell `ggplot()` to show blink rates (y-axis) ## by dose (x-axis) and by wave ggplot(aes(x=factor(dose),y=blink,fill=wave)) + ## 3. tell `ggplot` that blink rate distributions ## will be displayed as box plots geom_boxplot() + ## 4. label the x and y axes appropriately xlab("treatment dose") + ylab("blink rate") ``` (without the comments) ```r visits %>% left_join(participants,by="participant") %>% left_join(arms,by="arm") %>% ggplot(aes(x=factor(dose),y=blink,fill=wave)) + geom_boxplot() + xlab("treatment dose") + ylab("blink rate") ``` ] .gap[ ] -- .ii[ <img src="boxplot.png" width="100%"> ] --- ## Talking to `ggplot` .ii[ 1. `ggplot` likes "long" data format. If your data is in "wide" format, use `pivot_longer()`. 2. Notice how the '+' is like '%>%'. - We pipe *data* to functions with '%>%'. - We pipe *instructions* to ggplot with '+'. And just like pipes, we can refer to the plot with a variable. ```r p <- visits %>% left_join(participants,by="participant") %>% left_join(arms,by="arm") %>% ggplot(aes(x=factor(dose),y=blink,fill=wave)) + geom_boxplot() + xlab("treatment dose") + ylab("blink rate") ``` ] .gap[ ] -- .ii[ Later, we can add additional instructions about the plot, e.g. to change the boxplot colors ```{r} fill.cols <- c("#999999", "#E69F00", "#56B4E9") p <- p + scale_fill_manual(values=fill.cols) p ``` <img src="boxplot-colors.png" width="100%"> ] --- ## Challenge! Change the plot to show HDL cholesterol instead of
blink rate. -- ```r visits %>% left_join(participants,by="participant") %>% left_join(arms,by="arm") %>% ggplot(aes(x=factor(dose), y=hdl, fill=wave)) + geom_boxplot() + xlab("treatment dose") + ylab("HDL cholesterol") ``` Sorry! Not very exciting. .right[ <img src="hdl.png" width="100%"> ] --- ## Challenge! Generate a plot showing HDL cholesterol vs BMI (baseline). *Hint:* use `geom_point()` -- ```r p <- visits %>% filter(wave=="wave0") %>% ggplot(aes(x=bmi, y=hdl)) + geom_point() p ``` (saving the plot so we make a modification) .right[ <img src="scatterplot.png" width="100%"> ] -- Now add a regression line to the scatterplot. *Hint:* add `geom_smooth` -- ```r p <- p + geom_smooth(method=lm) p ``` .right[ <img src="scatterplot-line.png" width="100%"> ] --- ## For more information about the tidyverse https://r4ds.had.co.nz/
<img src="tidyverse-book.png" width="100%"> --- ## Converting between base R and dplyr If you are familiar with base R, you might find it useful to see how dplyr commands map to base R commands. https://dplyr.tidyverse.org/articles/base.html e.g. |dplyr|base| |:----|:---| |arrange(df, x) | df[order(x), , drop = FALSE]| |distinct(df, x) |df[!duplicated(x), , drop = FALSE], unique()| |filter(df, x)| df[which(x), , drop = FALSE], subset()| |mutate(df, z = x + y)| df$z <- df$x + df$y, transform()| |pull(df, 1)| df[[1]]| |pull(df, x) |df$x| |rename(df, y = x)| names(df)[names(df) == "x"] <- "y"| |...|| --- ## If you choose to enter the tidyverse ... We've seen how tidyverse attempts to fix some problems in R. -- The tidyverse solution isn't without challenges, e.g. -- 1. tidyverse needs more memorization (eg. dplyr has > 200 functions!) -- 2. tidyverse is new and changing (https://dplyr.tidyverse.org/articles/compatibility.html) -- 3. tidyverse functions can be difficult to debug -- 4. dplyr functions can be much slower than alternatives, e.g. data.table -- 5. the 'tidyverse way' can be confusing to new R users
(e.g. tidyverse does a lot 'behind the scenes magic' that can seem mysterious) https://github.com/matloff/TidyverseSkeptic