The implementation of logic to facilitate specified computing operations and functionality
What we will cover
Conditional Execution
Defining Function Arguments
Explicit Constraints
Dot-dot-dot (…)
Pipes
Iterations with purrr
While loops
Other loops – purrr functions
The map family
Shortcuts
Multiple arguments
walk
Package load and conflicts
We’ll load our packages early, as is good practice!
As an aside, you may notice a verbose message when loading the tidyverse mentioning conflicts
If we load the conflicted package it will force an error if you use a function that has multiple sources (filter is commonly guilty of this!)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(conflicted)
Package load and conflicts
# conflicted will make this error
mtcars |> filter(gear == 4)
Error:
! [conflicted] filter found in 2 packages.
Either pick the one you want with `::`:
• dplyr::filter
• stats::filter
Or declare a preference with `conflicts_prefer()`:
• `conflicts_prefer(dplyr::filter)`
• `conflicts_prefer(stats::filter)`
# We can explicitly state which filter we prefer just once
conflicts_prefer(dplyr::filter)
mtcars[1:3, ] |> filter(gear == 4)
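As the error message above suggests, the other option is to qualify the call with :: at the point of use. A minimal sketch, assuming dplyr is installed:

```r
# Namespace-qualify the call so there is no ambiguity about which filter() runs
four_gears <- dplyr::filter(mtcars, gear == 4)
nrow(four_gears)
```

This works whether or not conflicted is loaded, at the cost of a little extra typing on every call.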
There are two types of arguments: Mandatory and Optional
The mandatory arguments are always at the beginning of the list of arguments, followed by optional arguments and their default values.
Example:
pow <- function(x, y = 2) {
  return(x ** y)
}
What do you think will happen if you try pow(3) and pow(3,3) in the console? Try it!
Defining Function Arguments
There are two ways of passing the values to a function: by order and by name.
Check the description of mean() by typing ?mean
Passing unnamed arguments by order is bad programming because it means you have to remember the command and each of its arguments (try doing that for 1000 functions!), and it makes your code much less clear!
Passing them by name means you can change the order
Best practice is to pass the first argument (the input data to be processed) by position, without a name, and to name the remaining arguments:
mean(1:101, , TRUE) # bad!
mean(na.rm = TRUE, x = 1:101) # better
mean(1:101, na.rm = TRUE) # best :)
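To see that named arguments can be given in any order, here is a small sketch using a hypothetical function describe_power() (invented for illustration, not part of the original slides):

```r
# Hypothetical helper for illustration: one mandatory and one optional argument
describe_power <- function(base, exponent = 2) {
  base^exponent
}

by_order <- describe_power(2, 3)                    # matched by position
by_name  <- describe_power(exponent = 3, base = 2)  # matched by name, order swapped
default  <- describe_power(2)                       # exponent falls back to its default

c(by_order = by_order, by_name = by_name, default = default)
```

The first two calls are equivalent; only the named version stays readable if you forget the argument order.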
Defining Function Arguments
Adding explicit constraints on a function's arguments makes it more robust: invalid input is caught early, with a clear error message
For example:
midValue <- function(x) {
  if (length(x) %% 2 == 0) {
    stop("'x' has an even number of elements", call. = FALSE)
  }
  midIndex <- (length(x) + 1) / 2
  return(x[midIndex])
}
The stop() function is called when the modulus (the remainder from dividing length(x) by 2) is zero, i.e. when x has an even number of elements. This is a good error-checking mechanism, and it even gives a message!
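A short usage sketch of the function above, showing both the successful path and the error path. The tryCatch() wrapper is an addition for illustration so the script keeps running after the error:

```r
midValue <- function(x) {
  if (length(x) %% 2 == 0) {
    stop("'x' has an even number of elements", call. = FALSE)
  }
  midIndex <- (length(x) + 1) / 2
  x[midIndex]
}

# Odd length: returns the middle element
odd_result <- midValue(c(10, 20, 30))

# Even length: capture the error message instead of halting the script
even_result <- tryCatch(
  midValue(1:4),
  error = function(e) conditionMessage(e)
)
```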
Defining Function Arguments
The previous code can be simplified by using: stopifnot()
midValue <- function(x) {
  stopifnot("'x' has an even number of elements" = length(x) %% 2 == 1)
  midIndex <- (length(x) + 1) / 2
  return(x[midIndex])
}
And we can add multiple expressions in one go
calMean = function(x) {
  stopifnot(exprs = {
    "'x' has to have a mean of 4 for some reason..." = mean(x) == 4
    "'x' has to have length 4, don't ask why..." = length(x) == 4
  })
  mean(x)
}
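Another way to check several conditions at once is to pass each one to stopifnot() as its own named argument, where the name is the error message. This is a sketch with a hypothetical check_vec() function, assuming R 4.0.0 or later (where named arguments to stopifnot() are used as messages):

```r
# Hypothetical validator: each named argument is "message" = condition
check_vec <- function(x) {
  stopifnot(
    "'x' must be numeric" = is.numeric(x),
    "'x' must have length 4" = length(x) == 4
  )
  mean(x)
}

# Passes both checks, so the mean is returned
ok <- check_vec(c(2, 4, 4, 6))

# Fails the length check; capture the message rather than stopping
msg <- tryCatch(check_vec(1:3), error = function(e) conditionMessage(e))
```

The conditions are checked in order, so the first failing one determines the message.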
Dot-Dot-Dot (…)
An ellipsis in a function's argument list means that the function can take any number of additional named or unnamed arguments (run ?print for an example)
For example: we can use … to pass those additional arguments on to another function. Essentially, it is a placeholder for other arguments.
i01 <- function(y, z) {
  list(y = y, z = z)
}
i02 <- function(a, ...) {
  # Add 'a' to all elements in the list
  lapply(i01(...), function(x) x + a)
}
str(i02(a = 1, y = 2, z = 3))
List of 2
$ y: num 3
$ z: num 4
Dot-Dot-Dot (…)
By adding a number after two dots (..1, ..2, and so on), it is possible to refer to elements of … by position (the position in which each argument was supplied).
i03 <- function(...) {
  list(first = ..1, third = ..3)
}
str(i03(1, 2, 3))
List of 2
$ first: num 1
$ third: num 3
More useful is list(...), which evaluates the arguments and stores them in a list. Very useful when working with data!
i04 <- function(...) {
  list(...)
}
str(i04(a = 1, b = 2))
List of 2
$ a: num 1
$ b: num 2
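A common practical use of the dots is a thin wrapper that forwards optional arguments to a base function. The sketch below uses a hypothetical centre() function (not from the slides) that passes anything in the dots on to mean():

```r
# Hypothetical wrapper: anything supplied in ... is handed straight to mean()
centre <- function(x, ...) {
  x - mean(x, ...)
}

vals <- c(1, 2, 3, NA)

# No extra arguments: mean() returns NA, so every result is NA
without_dots <- centre(vals)

# na.rm = TRUE reaches mean() through the dots
with_dots <- centre(vals, na.rm = TRUE)
```

The wrapper never needs to know which options mean() accepts; it simply relays them.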
Pipes |>
There are two types of pipeable functions: transformations and side-effects.
Transformations are where an object is passed to the function’s first argument and a modified object is returned.
With side-effects, the passed object is not transformed. Instead, the function performs an action on that object, such as drawing a plot or saving a file.
print_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  invisible(df)
}
Pipes |>
If we use our newly created print_missings() function, the invisible() command means that the input data frame will not get printed out but we can still use it in a pipe.
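A sketch of the function in the middle of a pipe. The small data frame df_na is made up for illustration:

```r
print_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  invisible(df)  # return the input unchanged so the pipe can continue
}

# Made-up example data with three missing values
df_na <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

# The message is printed as a side effect, then the data frame flows on
complete_rows <- df_na |>
  print_missings() |>
  na.omit() |>
  nrow()
```

Because print_missings() returns its input invisibly, the later steps receive exactly the same data frame that went in.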
We want to keep code efficient and less repetitive: performing the same thing on multiple inputs, repeating the operation on multiple columns, or on different datasets.
To help achieve this, iterations are used. For example:
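# rescale() is not defined in these slides; it is assumed to be available
# (for example scales::rescale(), which maps values to the range [0, 1])
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))
df$a <- rescale(df$a)
df$b <- rescale(df$b)
df$c <- rescale(df$c)
df$d <- rescale(df$d) # Wow this is tedious...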
Iterations with purrr
This can be simplified with a for loop
for (i in seq_along(df)) {
  df[[i]] <- rescale(df[[i]])
} # This is much shorter (and less error prone)!
A brief note: it’s generally better to vectorise your R code, as it’s faster and often more succinct; see here and here for more details
If you’re writing a pipeline that’ll be rerun many times by lots of people, then optimisation is more important
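For comparison, here is one loop-free way to write the same operation in base R. This is a sketch only: rescale01() is an assumed definition that maps a vector to the range [0, 1] and may differ from the rescale() used above, and df_demo is a fresh example data frame.

```r
# Assumed definition: rescale a numeric vector to the range [0, 1]
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

set.seed(1)  # make the random example reproducible
df_demo <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# Assigning into df_demo[] keeps the data frame structure
# while replacing every column with its rescaled version
df_demo[] <- lapply(df_demo, rescale01)

sapply(df_demo, range)
```

Inside rescale01() the arithmetic is vectorised over the whole column, so no explicit loop over rows is needed.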
Other loops - purrr functions
purrr is a package that helps to enhance R’s functional programming toolkit
purrr functions help to break common challenges in list manipulation into independent pieces.
Base R has a family of functions known as the “apply family” (apply(), lapply(), tapply() and so on) that eliminates the need for many common for loops
purrr has a family of functions called the “map family”.
Each function takes a vector as input, applies a function to each piece, and then returns a new vector that has the same length as the input.
The map family
Essentially, map() is the tidyverse equivalent of the base R apply family of functions.
The basic syntax is map(.x, .f, ...) where:
.x is a list, vector or dataframe
.f is a function
map() will then apply .f to each element of .x in turn.
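A minimal sketch of that syntax, assuming purrr is installed and R 4.1 or later (for the \(x) shorthand):

```r
library(purrr)

# .x is a vector, .f is an anonymous function applied to each element in turn
squares <- map(1:3, \(x) x^2)

# Extra arguments after .f are passed on to it for every element
rounded <- map(c(1.234, 5.678), round, digits = 1)

str(squares)
str(rounded)
```

Both results are lists with one element per input element.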
The map family
We can use the map functions to compute the mean and standard deviation of each column of the previous dataset.
# I'm using round here so we don't get so many decimal places printed
map_dbl(df, mean) |> round(digits = 2)
a b c d
0.54 0.51 0.57 0.46
map_dbl(df, sd) |>round(digits =2)
a b c d
0.31 0.39 0.37 0.31
# And here's the native R sapply equivalent
sapply(df, mean) |> round(digits = 2)
a b c d
0.54 0.51 0.57 0.46
And if you see the help page for map_dbl() you’ll notice the ... which allows us to pass function arguments:
map_dbl(df, mean, na.rm =TRUE) |>round(digits =2)
a b c d
0.54 0.51 0.57 0.46
The map family
We can even use a string or a position (integer) to extract components from the input data - very useful when working with big datasets!
x <- list(
  x = list(a = 1, b = 2, c = 3),
  y = list(a = 4, b = 5, c = 6),
  z = list(a = 7, b = 8, c = 9)
)
x |> map_dbl("a")
x y z
1 4 7
x |>map_dbl(2)
x y z
2 5 8
map_ functions
One property of the map() function is that it will always return a list.
To change the output data type, we can use multiple versions of map_*():
map_lgl() returns a logical vector.
map_int() returns an integer vector.
map_dbl() returns a double vector.
map_chr() returns a character vector.
map_df() returns a data frame.
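A small sketch contrasting the return types on the same made-up input, assuming purrr is installed:

```r
library(purrr)

# Made-up input: a named list of two integer vectors
nums <- list(a = 1:3, b = 4:6)

as_list <- map(nums, sum)                                 # list of length 2
as_int  <- map_int(nums, sum)                             # named integer vector
as_dbl  <- map_dbl(nums, mean)                            # named double vector
as_chr  <- map_chr(nums, \(v) paste(v, collapse = "-"))   # named character vector
as_lgl  <- map_lgl(nums, \(v) all(v > 3))                 # named logical vector
```

Each typed variant raises an error if the function returns something that is not a single value of the expected type, which makes them a useful check as well as a convenience.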
Shortcuts
Fit a linear model to each group in a dataset. This example splits the mtcars dataset into three pieces (one per number of cylinders) and fits the linear model to each piece.
models <- mtcars %>% # note this is an example where the native R pipe won't work!
  split(.$cyl) |>
  map(function(df) {
    lm(mpg ~ wt, data = df)
  })

# Using tidyverse anonymous function syntax
models2 <- mtcars %>%
  split(.$cyl) |>
  map(~ lm(mpg ~ wt, data = .))

# And here is the native R equivalent
models3 <- mtcars %>%
  split(.$cyl) |>
  map(\(df) lm(mpg ~ wt, data = df))
Shortcuts
In split(.$cyl), the . is the magrittr placeholder for the dataset we’ve piped in (mtcars), so we can access parts of it (via $). In the second example, the . inside the ~ formula is different: it stands for the current list element that map() is processing.
Note that the placeholder for the native R pipe (|>) is a _, but it doesn’t have the same functionality as the magrittr pipe’s placeholder! See here for more info on differences
Example: we want to get the \(R^2\) from our models
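One way to do this is to summarise each model and then extract the r.squared component by name. This is a sketch assuming purrr is installed; the models are refitted here with base split() so the block runs on its own.

```r
library(purrr)

# Refit the per-cylinder models so this block is self-contained
models <- split(mtcars, mtcars$cyl) |>
  map(\(df) lm(mpg ~ wt, data = df))

# summary() on an lm object returns a list;
# the string "r.squared" extracts that element from each summary
r_squared <- models |>
  map(summary) |>
  map_dbl("r.squared")

r_squared
```

The result is a named double vector with one value per cylinder group.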
Multiple Arguments
List of 3
$ : num [1:5] 2.38 4.39 3.78 5.9 3.91
$ : num [1:5] 11.99 11.67 7.05 12.97 6.73
$ : num [1:5] 17.649 -2.993 8.66 -10.82 0.831
[Figure: diagram illustrating how the code is evaluated]
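The code that produced the output above is not shown. As a sketch of iterating over two inputs in parallel, here is a map2() call, assuming purrr is installed. The values of mu and sigma are hypothetical and chosen only for illustration.

```r
library(purrr)

# Hypothetical inputs: three means and three standard deviations
mu    <- list(5, 10, -3)
sigma <- list(1, 5, 10)

set.seed(42)  # make the random draws reproducible

# For each i, this calls rnorm(mu[[i]], sigma[[i]], n = 5);
# because n is named, the two unnamed values fill rnorm()'s mean and sd
draws <- map2(mu, sigma, rnorm, n = 5)

str(draws)
```

map2() walks along both lists together, so the first mean is paired with the first standard deviation, and so on.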
Multiple Arguments
There is no map3 or map4, so what if we want more than 2 arguments? purrr has the function pmap() for an arbitrary number of args.
# mu and sigma (lists of means and standard deviations) must already be defined
n <- list(1, 3, 5)
arguments <- list(n, mu, sigma)
arguments |>
  pmap(rnorm) |>
  str()
List of 3
$ : num 6.36
$ : num [1:3] 13.47 8.43 19.6
$ : num [1:5] 1.38 -16.04 -11.14 -14.45 1.97
We can go even further and vary the function itself: exec() calls a function (given here by name) with a list of arguments spliced in using !!!.
funcs <- c("runif", "rnorm", "rpois")
params <- list(
  list(min = -1, max = 1),
  list(sd = 5),
  list(lambda = 10)
)
map2(funcs, params, \(fn, args) exec(fn, !!!args, n = 5)) |> str()
List of 3
$ : num [1:5] 0.0428 0.5661 0.66 -0.2062 0.8389
$ : num [1:5] 2.405 -0.353 -6.163 -3.41 -1.423
$ : int [1:5] 10 9 12 6 9
walk
walk() is an alternative to map() that we use when we call a function for its side effects, disregarding its return value.
x <- list(1, "a", 3)
x |> walk(print)
[1] 1
[1] "a"
[1] 3
Really useful when outputting datasets in lists! (such as microarray data)
Similar to map(), purrr also has walk2() and pwalk()
walk2
df0 <- tibble(x = 1:3, y = rnorm(3))
df1 <- tibble(x = 1:3, y = rnorm(3))
df2 <- tibble(x = 1:3, y = rnorm(3))
animalFrames <- tibble(
  animals = c('sheep', 'cow', 'horse'),
  frames = list(df0, df1, df2)
)
# Save a list of dataframes
animalFrames %>%
  walk2(
    .x = .$animals,
    .y = .$frames,
    .f = ~ write_csv(.y, str_c("test_", .x, ".csv"))
  )
pwalk
pmap() and pwalk() allow you to provide any number of arguments in a list.
# Let's set up a dataframe and function to use with pwalk
ds_mt <- mtcars |>
  rownames_to_column("model") |>
  mutate(am = factor(am, labels = c("auto", "manual"))) |>
  select(model, mpg, wt, cyl, am) |>
  sample_n(3)
foo <- function(model, am, mpg) {
  print(paste("The", model, "has a", am, "transmission and gets", mpg, "mpgs."))
}
# Now we can use pwalk
ds_mt |>
  select(model, am, mpg) |>
  pwalk(.l = _, .f = foo)
[1] "The Cadillac Fleetwood has a auto transmission and gets 10.4 mpgs."
[1] "The Camaro Z28 has a auto transmission and gets 13.3 mpgs."
[1] "The Maserati Bora has a manual transmission and gets 15 mpgs."
While loops
You’re unlikely to ever use them in a data analysis context (I never have!), but they keep evaluating their body for as long as a condition remains true
Example: see how many times we need to flip a coin to get three heads in a row:
flip_coin <- function() {
  sample(c("T", "H"), 1)
}
numFlips <- 0
numHeads <- 0
while (numHeads < 3) {
  if (flip_coin() == "H") {
    numHeads <- numHeads + 1
  } else {
    numHeads <- 0
  }
  numFlips <- numFlips + 1
}
cat("Number of flips to get 3 heads in a row: ", numFlips)
Number of flips to get 3 heads in a row: 4
Beware that a while loop will run forever (and can hang your session) if the condition being evaluated never becomes false!
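One common safeguard is to add an upper bound on the number of iterations to the loop condition. Here is a sketch of the coin-flip example with a hypothetical cap max_flips (not part of the original example):

```r
set.seed(123)  # make the random flips reproducible

flip_coin <- function() sample(c("T", "H"), 1)

max_flips <- 10000  # hypothetical safety cap, not in the original example
num_flips <- 0
num_heads <- 0

# Stop when we have three heads in a row OR when the cap is reached
while (num_heads < 3 && num_flips < max_flips) {
  num_heads <- if (flip_coin() == "H") num_heads + 1 else 0
  num_flips <- num_flips + 1
}

cat("Flips used:", num_flips, "\n")
```

With the cap in place the loop is guaranteed to end, even if the main condition is never satisfied.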
Workshop time!
These slides and the workshop can be found on the website here:
Programming in R Gabriel Mateus Bernardo Harrington bernardo-harringtong@cardiff.ac.uk Cardiff University UK Dementia Research Institute