Explore R

Slides inherited from Dr. Georgina Menzies, Dr. Nicos Angelopoulos & Matthew Bracher-Smith

Gabriel Mateus Bernardo Harrington

bernardo-harringtong@cardiff.ac.uk

Cardiff University

UK Dementia Research Institute

Today’s Aims

Basics of Quarto
Using Quarto to create PDFs
The R package tibble
The R package readr

Learning Objectives

Get familiar and comfortable with Quarto and creating PDFs
Understand the differences between Rmarkdown and Quarto
Understand how to use tibble to make tables of data
Understand how to use readr to read in different formats of data

Homework

For the first homework you may have just used R scripts
For future homework you’ll learn how to make PDF or HTML files from R code, which look like the PDF file for the homework

Homework answers

Question 1

sqrt(6 * 2)

[1] 3.464102

4 + 3 - 2

[1] 5

1046 * 934

[1] 976964

Question 2

library(MASS)
nrow(women)

[1] 15

colnames(women)

[1] "height" "weight"

Homework answers - continued

women$ages <- sample(18:90, nrow(women), replace = TRUE)
head(women)

  height weight ages
1     58    115   22
2     59    117   60
3     60    120   47
4     61    123   25
5     62    126   90
6     63    129   32

sum(women$height)

[1] 975

sum(women$weight)

[1] 2051

sum(women$ages)

[1] 745

# A more elegant way to apply a function to all columns!
sapply(women, mean)

   height    weight      ages 
 65.00000 136.73333  49.66667
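
As a small aside to the sapply() call above, base R has two close relatives that are worth knowing. This is only a sketch on the built-in women dataset (so without the random ages column added earlier): vapply() works like sapply() but makes you state the type of result you expect, and colMeans() is a shortcut for this particular calculation.

```r
# The built-in `women` dataset has 15 rows and two numeric columns.
data(women)

# sapply() guesses the shape of the result ...
means_sapply <- sapply(women, mean)

# ... whereas vapply() checks that every result is a single number.
means_vapply <- vapply(women, mean, numeric(1))

# colMeans() is a dedicated shortcut for column means.
means_col <- colMeans(women)

means_vapply
```

All three return the same named numeric vector here; vapply() is simply stricter about what it will accept.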

Homework answers - continued

Question 3

cohort <- read.table(
  "http://tbb.bio.uu.nl/BDA/fig4.tsv",
  sep = "\t",
  header = TRUE,
  stringsAsFactors = FALSE
)

females <- cohort[cohort$Gender == "F", ]
head(females)

       Name First_Name  Last_Name Age Weight Gender Married
1 Patient01    Adriana     Mattos  35   64.5      F    TRUE
5 Patient05      Janet Thomlinson  31   87.5      F   FALSE
6 Patient06 Frederique        Vos  73   69.4      F    TRUE

write.csv(x = females, file = "fig4-females.csv")

Homework answers - continued

Question 4

a <- c(1, 2, 5.3, 6, -2, 4) # numeric vector
b <- c("one", "two", "three") # character vector
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE) # logical vector
a

[1]  1.0  2.0  5.3  6.0 -2.0  4.0

b

[1] "one"   "two"   "three"

c

[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

c[c(1, 3)]

[1] TRUE TRUE

Homework answers - continued

# generates a 4 x 4 numeric matrix (16 values, so nothing is discarded)
y <- matrix(1:16, nrow = 4, ncol = 4)
rnames <- c("R1", "R2", "R3", "R4")
cnames <- c("C1", "C2", "C3", "C4")
mymatrix <- matrix(
  y,
  nrow = 4,
  ncol = 4,
  byrow = TRUE,
  dimnames = list(rnames, cnames)
)
mymatrix

   C1 C2 C3 C4
R1  1  2  3  4
R2  5  6  7  8
R3  9 10 11 12
R4 13 14 15 16

“A dimnames attribute for the matrix: NULL or a list of length 2 giving the row and column names respectively. An empty list is treated as NULL, and a list of length one as row names. The list can be named, and the list names will be used as names for the dimensions.”
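
To make the last sentence of that help text concrete, here is a minimal sketch of a matrix whose dimnames list is itself named. The labels sample and gene, and the values S1, G1 and so on, are made up for illustration.

```r
# A 2 x 3 matrix whose dimnames list is itself named.
m <- matrix(
  1:6,
  nrow = 2,
  dimnames = list(sample = c("S1", "S2"), gene = c("G1", "G2", "G3"))
)
m

# Look a value up by row and column name rather than by position.
m["S2", "G3"]

# The names of the dimnames list label the two dimensions.
names(dimnames(m))
```

Printing m shows the dimension labels in the top-left corner, which can make larger matrices easier to read.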

Homework answers - continued

d <- c(1:4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(d, e, f)
names(mydata) <- c("id", "color", "passed")
mydata

  id color passed
1  1   red   TRUE
2  2 white   TRUE
3  3   red   TRUE
4  4  <NA>  FALSE

Quarto

Today we’re going to talk about Quarto, which we can use to make PDF and HTML files (and much more!) that integrate plain language, code and output (images etc.). This is excellent because the document is dynamic (and so more reproducible) and hugely flexible: various languages, engines and output formats are supported.

But before Quarto, there was Rmarkdown…

Some background on Rmarkdown

Today we will be working on a .qmd (Quarto) file, but first, we need to talk about .rmd files.

An .rmd file is an Rmarkdown file. You should remember a little about markdown from your Unix lectures. You will be completing your homework in this file, so let’s have a look at it now.

Rmarkdown is a way to make documents which include R code. You can use it to write HTML documents, PDFs and PowerPoint presentations that show answers to coding problems, or the code itself.

Can you think of any reasons for using this?

Some background on Quarto

Announced in 2022 and built on pandoc; it only became more widely adopted last year
Actually a separate piece of software that we run within RStudio
The successor to Rmarkdown in many ways (made by the same devs), using .qmd files, made to support more languages (Python, Julia, etc.) and be more consistent in formatting
Can be used to make documents in RStudio, Jupyter notebooks or elsewhere (i.e. it’s both multi-language and multi-engine)

Some background on Quarto

A note on Quarto

Fundamental usage for reports is the same as creating in Rmarkdown
It’s also what’s used to make many of the R books you may read online (including R for Data Science 2nd edition!)
Recommended that you use Quarto by default, because:
- You will easily be able to swap in other languages besides R
- You will more easily be able to use other Quarto features like those for making blogs and journal articles if you’re already familiar
- It’s highly compatible: you can render most Rmarkdown or Jupyter notebooks in Quarto easily, so you can still make normal Rmarkdown documents with your knowledge if needed.
- You’re up to date with the latest in reproducible research
- Newest features will likely be added to Quarto over Rmarkdown

Making your first .qmd file

Use the menu bar to create a new Quarto document. We will focus on HTML today
Notice that the file contains three types of content:
- A YAML header surrounded by ---
- R code chunks surrounded by ```
- Text mixed with simple markdown formatting

What does each bit of your new file do?

Let’s work through these now and create a knitted document!
Create your own with code/data you have, or download an example file from Quarto here

Elements of Quarto | The YAML header

Quarto documents start with a YAML header that sets metadata and configurations for the document.
YAML is a data serialisation language which aims to be useful for computers (often config files) and humans (it’s relatively easy to read)
Typical YAML options:
- title: Title of the document
- author: Author’s name
- date: Document date
- format: Specifies the output format (html, pdf, etc.)
- theme: Defines a theme for visual style

There are a lot of options we won’t cover today and they differ by document type!
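
As a rough illustration of the options listed above, the sketch below uses R to write a minimal .qmd file with a YAML header to a temporary location and print it back. The title, author and theme values are placeholders, and nesting theme under format: html is one common layout rather than the only valid one.

```r
# Lines of a minimal Quarto document: a YAML header followed by some text.
# The title, author and theme values are placeholders.
qmd_lines <- c(
  "---",
  "title: \"My first Quarto document\"",
  "author: \"A. Student\"",
  "date: today",
  "format:",
  "  html:",
  "    theme: cosmo",
  "---",
  "",
  "Some *markdown* text goes here."
)

# Write the document to a temporary file and print it to check the layout.
qmd_file <- tempfile(fileext = ".qmd")
writeLines(qmd_lines, qmd_file)
cat(readLines(qmd_file), sep = "\n")
```

In practice you would type the header straight into the editor; writing it from R just makes the structure easy to inspect. Note that indentation matters in YAML.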

Elements of Quarto | Code chunks

When you open the file in the RStudio IDE, it becomes a notebook interface for R. You can run each code chunk by clicking the green arrow icon. RStudio executes the code and displays the results inline with your file.

There are three ways to insert R code into the file:

The keyboard shortcut Ctrl + Alt + I (OS X: Cmd + Option + I)
The + option in the menu bar
Or by typing the chunk delimiters: ```{r} to open the chunk and ``` to close it

Elements of Quarto | Code chunks

In Quarto, chunk output can be customized with options placed at the top of the chunk, each on its own line prefixed by #|. Here are some common options:

#| include: false
- Excludes code and results from the rendered document but runs the code.
#| echo: false
- Excludes code but displays results. Useful for embedding figures.

#| message: false
- Suppresses messages generated by code.
#| warning: false
- Suppresses warnings generated by code.
#| fig-cap: "This is a caption"
- Adds a caption to figures.

In HTML documents, the code-tools and code-fold options are usually preferred over include/echo

Elements of Quarto | Inline code

Code results can be inserted directly into the text of a .qmd file by enclosing the code in single backticks (`), like so:

The mean of the data was `{r} mean(mtcars$gear)`
- Which renders as:
- The mean of the data was 3.6875
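
To see chunk options and inline code side by side, the sketch below builds the text of a small chunk in R and prints it as it would appear in a .qmd file. The caption and the choice of barplot() are only illustrative, and the chunk delimiter is assembled with strrep() purely so that it can be displayed here.

```r
# Build the three-backtick chunk delimiter programmatically so it can be
# shown inside this example.
fence <- strrep("`", 3)

# A chunk that hides its code but shows a captioned plot, followed by a
# sentence with inline R code. The caption text is a placeholder.
chunk_lines <- c(
  paste0(fence, "{r}"),
  "#| echo: false",
  "#| warning: false",
  "#| fig-cap: \"Number of cars by gear count\"",
  "barplot(table(mtcars$gear))",
  fence,
  "",
  "The mean number of gears was `{r} mean(mtcars$gear)`."
)

# Print the lines as they would appear in the .qmd file.
cat(chunk_lines, sep = "\n")

# The value that the inline code would insert into the sentence.
mean(mtcars$gear)
```

Each #| option sits on its own line directly under the opening delimiter, before any R code.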

Elements of Quarto | Multiple languages

Quarto allows the use of multiple languages in code chunks. Specify the language in {} after the opening chunk delimiter, e.g. {python}, {julia} or {r}.

Elements of Quarto | Text Options

Do you remember your markdown formatting?
The main body is just normal markdown, so all the same stuff works
Examples on the website here

Creating documents | Previewing and Rendering

In RStudio, you can preview and render Quarto documents easily
- Render by clicking the ‘Render’ button
- Preview the formatting as you edit by toggling the ‘Visual’ button

Quarto provides a command-line interface as well, which allows rendering with: quarto render myfile.qmd
You can specify output formats in the YAML header or when rendering

Creating documents | Quarto output formats

Quarto supports various output formats:
- html: Web format with dynamic features
- pdf: Portable document format for printing
- docx: Microsoft Word format
- revealjs: HTML presentations (for slides, this is what these slides are made with!)
- beamer: PDF presentations (for slides)
Specify these in the YAML or use them with the quarto render command, more details here

Creating documents | Practice

Here is a version of the document with other elements added
See if you can make an ordered list with sub items that contain at least one example of bold, italic, superscript and strikethrough text, as well as a 2x2 table with headers

Now make an .rmd file

Compare it to your .qmd document
What do you notice that’s different?

Quarto (.qmd) and Rmarkdown (.rmd): Differences you need to know

It’s mainly the syntax of the YAML options (html VS html_document), the code chunk option formats (although Quarto is compatible with the rmd format), and the language support
Rmarkdown is wedded to R: even if you use other languages like Python, the code is still actually being run through R (via the reticulate package in the case of Python)
Quarto is language and engine agnostic, and thus more versatile

Final Quarto links

The Quarto site is very good: https://quarto.org/
Cheatsheet here

For PDFs:

You’ll need LaTeX installed to create PDFs
- See options here (TinyTeX is probably the best bet)
- The pdf YAML reference is here
You can output to multiple formats and use the quarto package directly
- See tutorial on authoring here

Break

Tibbles!

“Tibbles” are a modern take on the data frame. They keep many important features of the original data frame and remove many of the outdated ones. They are another amazing addition to R from Hadley Wickham. We will use them in the tidyverse to replace the older data frame that we just learned about.

Tibbles! - continued

Compared to Data Frames:

A tibble never changes the input type.
A tibble can have columns that are lists.
A tibble can have non-standard variable names.
- Names can start with a number or contain spaces.
- To refer to such a name, wrap it in backticks.
It only recycles vectors of length 1.
It never creates row names.
Enhanced print() behaviour
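
A few of these differences can be seen in a short sketch. The column name 2024 score and the values are made up; the point is only to compare how tibble() and data.frame() treat the same input.

```r
library(tibble)

# Non-syntactic column names are allowed; refer to them with backticks.
# A length-1 value (here "a") is recycled to every row.
tb <- tibble(`2024 score` = c(10, 20, 30), group = "a")
tb$`2024 score`

# The same call with data.frame() rewrites the name to make it syntactic.
df <- data.frame(`2024 score` = c(10, 20, 30), group = "a")
names(df)

# Vectors longer than 1 are not recycled: data.frame() would accept this,
# but tibble() raises an error.
result <- try(tibble(x = 1:4, y = 1:2), silent = TRUE)
inherits(result, "try-error")
```

The stricter recycling rule tends to surface mistakes early instead of silently repeating values.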

Tibbles! - continued

The syntax to make a tibble is nearly identical to data frames

library(tibble)
test <- tibble(x = 1:3, y = list(1:5, 1:10, 1:20))
test

# A tibble: 3 × 2
      x y         
  <int> <list>    
1     1 <int [5]> 
2     2 <int [10]>
3     3 <int [20]>

Whereas if we try this as a dataframe

test <- as.data.frame(c(x = 1:3, y = list(1:5, 1:10, 1:20)))
head(test)

  x1 x2 x3 y1 y2 y3
1  1  2  3  1  1  1
2  1  2  3  2  2  2
3  1  2  3  3  3  3
4  1  2  3  4  4  4
5  1  2  3  5  5  5
6  1  2  3  1  6  6
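
As a hedged aside, base R data frames can hold a list column too, but only if the list is wrapped in I() so that data.frame() leaves it alone. The sketch below shows a direct data.frame() call failing and the I() version working.

```r
# Without I(), data.frame() tries to spread the list across several columns
# and fails because the row counts do not line up.
attempt <- try(
  data.frame(x = 1:3, y = list(1:5, 1:10, 1:20)),
  silent = TRUE
)
inherits(attempt, "try-error")

# Wrapping the list in I() keeps it as a single list column.
df <- data.frame(x = 1:3, y = I(list(1:5, 1:10, 1:20)))
dim(df)
lengths(df$y)
```

With tibble() no wrapper is needed, which is one reason list columns are more common in tidyverse code.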

Tibbles! - continued

We can easily coerce dataframes to tibbles with as_tibble()

Try the following. What differences do you notice?

data(iris)
as_tibble(iris)

Tibbles only print the first 10 rows and all the columns that fit on the screen
You will not accidentally print too much!
Each column displays its data type

Tribble

Sometimes you might need to make a small table in R
tribble() allows you to make a tibble and fill it row-wise
The ~ is used to define column headers

tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)

# A tibble: 2 × 3
  x         y     z
  <chr> <dbl> <dbl>
1 a         2   3.6
2 b         1   8.5

Tibble exercises

How can you tell if an object is a tibble?
Compare and contrast the following operations on a data.frame and equivalent tibble. What is different?

df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]

If you have the name of a column stored in an object, e.g. var <- "mpg", how can you extract the column from a tibble?

Readr

There are many ways to import data into R, from inputting the data yourself to reading it in using the traditional R tools we used in lesson one.
The tidyverse way is to use a package called readr, which has several functions for reading in different types of data.

Readr functions

read_csv() reads comma delimited files
read_csv2() reads semicolon separated files (common in countries where , is used as the decimal mark)
read_tsv() reads tab delimited files
read_delim() reads in files with any delimiter.
read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions().
read_table() reads a common variation of fixed width files where columns are separated by white space.
read_log() reads Apache style log files.
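
To get a feel for these functions without needing a file on disk, the sketch below reads small made-up strings as if they were files. It assumes readr 2.0.0 or later, where wrapping a string in I() marks it as literal data; the id and weight columns are invented for illustration.

```r
library(readr)

# Wrapping a string in I() tells readr to treat it as literal data
# rather than as a file path.
csv_text <- I("id,weight\n1,64.5\n2,87.5")
from_csv <- read_csv(csv_text, show_col_types = FALSE)

# read_delim() handles other delimiters, here a pipe.
pipe_text <- I("id|weight\n1|64.5\n2|87.5")
from_delim <- read_delim(pipe_text, delim = "|", show_col_types = FALSE)

# Both calls give a tibble with the same columns and values.
from_csv
from_delim
```

With a real file you would pass the file path instead of the I() string; show_col_types = FALSE just silences the column type message.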

Readr exercises

Use the base R function read.table() to import the pheno.txt file. Then repeat this with read_tsv() from readr. What is the difference?

You may notice that read_csv() automatically assumes your first row is your column headers. You may wish to alter this behaviour if your file comes with a header of information on the top row.

Open the “pheno.txt” in a text editor (can use notepad on Windows), and add a header to the file
What happens when you open this using read_csv?

Readr exercises continued

Let’s try again, but skipping this header.

You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with “#”
You may not have column names. In that case you can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label the columns sequentially from X1 to Xn
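
These three arguments can be tried on small made-up strings rather than on pheno.txt itself. The sketch below assumes readr 2.0.0 or later for the I() wrapper, and the text and values are invented.

```r
library(readr)

# A made-up file: one line of free text, one comment line, then the data.
raw <- "Exported from the clinic database\n# units: kg\nid,weight\n1,64.5\n2,87.5"

# skip = 2 jumps over the first two lines before looking for the header.
with_skip <- read_csv(I(raw), skip = 2, show_col_types = FALSE)
names(with_skip)

# comment = "#" drops the comment line instead.
with_comment <- read_csv(
  I("# units: kg\nid,weight\n1,64.5\n2,87.5"),
  comment = "#",
  show_col_types = FALSE
)
names(with_comment)

# Data with no header row: columns are labelled X1, X2, ...
no_header <- read_csv(I("1,64.5\n2,87.5"), col_names = FALSE, show_col_types = FALSE)
names(no_header)
```

If you prefer your own labels, col_names also accepts a character vector of names.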

Readr VS base R

Why use the readr functions?

They are typically much faster (~10x) than their base equivalents. Long running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try fread() from the data.table package. It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.
They produce tibbles, and they don’t convert character vectors to factors (the base R default before R 4.0), use row names, or munge the column names. These are common sources of frustration with the base R functions.
They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so code that works on your computer might not work on someone else’s.
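
The point about column names is easy to check. In this sketch the column names patient id and 2nd visit are invented, and the same text is read once with base R and once with readr (2.0.0 or later, for the I() wrapper).

```r
library(readr)

csv_text <- "patient id,2nd visit\nP01,yes\nP02,no"

# Base R rewrites column names to make them syntactic.
base_df <- read.csv(text = csv_text)
names(base_df)

# readr keeps the names exactly as they appear in the file.
readr_tbl <- read_csv(I(csv_text), show_col_types = FALSE)
names(readr_tbl)
```

Base R can be told to keep the names with check.names = FALSE, but readr does so by default.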

Additional Resources

Extra reading from R for Data Science:
- Chapter 8: Data import
- Chapter 29: Quarto
- Chapter 30: Quarto formats
Advanced reading:
- Chapters 21-25 (importing data)
These slides and the workshop can be found on the website here: