Explore R

Slides inherited from Dr. Georgina Menzies, Dr. Nicos Angelopoulos & Matthew Bracher-Smith

Today's Aims

  • Basics of Quarto
  • Using Quarto to create pdfs
  • The R package tibble
  • The R package readr

Learning Objectives

  • Get familiar and comfortable with Quarto and creating pdfs
  • Understand the differences between Rmarkdown and Quarto
  • Understand how to use tibble to make tables of data
  • Understand how to use readr to read in different formats of data

Homework

  • For the first homework you may have just used R scripts
  • For future homework you’ll learn how to make pdf or html files from R code, which look like the pdf file for the homework

Homework answers

  • Question 1
sqrt(6 * 2)
[1] 3.464102
4 + 3 - 2
[1] 5
1046 * 934
[1] 976964
  • Question 2
library(MASS)
nrow(women)
[1] 15
colnames(women)
[1] "height" "weight"

Homework answers - continued

women$ages <- sample(18:90, nrow(women), replace = TRUE)
head(women)
  height weight ages
1     58    115   22
2     59    117   60
3     60    120   47
4     61    123   25
5     62    126   90
6     63    129   32
sum(women$height)
[1] 975
sum(women$weight)
[1] 2051
sum(women$ages)
[1] 745
# A more elegant way to apply a function to all columns!
sapply(women, mean)
   height    weight      ages 
 65.00000 136.73333  49.66667 

Homework answers - continued

  • Question 3
cohort <- read.table(
  "http://tbb.bio.uu.nl/BDA/fig4.tsv",
  sep = "\t",
  header = TRUE,
  stringsAsFactors = FALSE
)

females <- cohort[cohort$Gender == "F", ]
head(females)
       Name First_Name  Last_Name Age Weight Gender Married
1 Patient01    Adriana     Mattos  35   64.5      F    TRUE
5 Patient05      Janet Thomlinson  31   87.5      F   FALSE
6 Patient06 Frederique        Vos  73   69.4      F    TRUE
write.csv(x = females, file = "fig4-females.csv")

Homework answers - continued

  • Question 4
a <- c(1, 2, 5.3, 6, -2, 4) # numeric vector
b <- c("one", "two", "three") # character vector
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE) # logical vector
a
[1]  1.0  2.0  5.3  6.0 -2.0  4.0
b
[1] "one"   "two"   "three"
c
[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
c[c(1, 3)]
[1] TRUE TRUE

Homework answers - continued

# generates 4 x 4 numeric matrix
y <- matrix(1:16, nrow = 4, ncol = 4) # 16 values fill a 4 x 4 matrix exactly
rnames <- c("R1", "R2", "R3", "R4")
cnames <- c("C1", "C2", "C3", "C4")
mymatrix <- matrix(
  y,
  nrow = 4,
  ncol = 4,
  byrow = TRUE,
  dimnames = list(rnames, cnames)
)
mymatrix
   C1 C2 C3 C4
R1  1  2  3  4
R2  5  6  7  8
R3  9 10 11 12
R4 13 14 15 16
  • “A dimnames attribute for the matrix: NULL or a list of length 2 giving the row and column names respectively. An empty list is treated as NULL, and a list of length one as row names. The list can be named, and the list names will be used as names for the dimensions.”

Homework answers - continued

d <- c(1:4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(d, e, f)
names(mydata) <- c("id", "color", "passed")
mydata
  id color passed
1  1   red   TRUE
2  2 white   TRUE
3  3   red   TRUE
4  4  <NA>  FALSE

Quarto

Today we’re going to talk about Quarto, which we can use to make pdf and html files (and much more!) that integrate plain language, code and output (images etc.). This is excellent because the result is a dynamic document (and so more reproducible), and it is hugely flexible: various languages, engines and output formats are supported.

But before Quarto, there was Rmarkdown…

Some background on Rmarkdown

Today we will be working on a .qmd (Quarto) file, but first, we need to talk about .rmd files.

An .rmd file is an Rmarkdown file. You should remember a little about markdown from your Unix lectures. You will be completing your homework in this file, so let’s have a look at it now.

Rmarkdown is a way to make documents which include R code. You can use it to write html documents, pdfs and PowerPoint presentations that show answers to coding problems, or the code itself.

Can you think of any reasons for using this?

Some background on Quarto

  • Announced in 2022 and becoming more widely adopted only last year, based on pandoc
  • Actually a separate piece of software that we run within RStudio
  • The successor to Rmarkdown in many ways (made by the same devs), using .qmd files, made to support more languages (Python, Julia, etc.) and be more consistent in formatting
  • Can be used to make documents in Rstudio, or Jupyter notebooks or elsewhere (i.e. it’s both multi-language and multi-engine)

A note on Quarto

  • Fundamental usage for reports is the same as creating in Rmarkdown
  • It’s also what’s used to make many of the R books you may read online (including R for Data Science 2nd edition!)
  • Recommended that you use Quarto by default, because:
    • You will easily be able to swap in other languages besides R
    • You will more easily be able to use other Quarto features like those for making blogs and journal articles if you’re already familiar
    • It’s highly compatible: you can render most Rmarkdown or Jupyter notebooks in Quarto easily, so you can still make normal Rmarkdown documents with your knowledge if needed.
    • You’re up to date with the latest in reproducible research
    • Newest features will likely be added to Quarto over Rmarkdown

Making your first .qmd file

  • Use the menu bar to create a new Quarto document. We will focus on html today

  • Notice that the file contains three types of content:

    • A YAML header surrounded by ---
    • R code chunks surrounded by ```
    • Text mixed with simple markdown formatting

What do each of the bits of your new file do?

  • Let’s work through these now and create a knitted document!
  • Create your own with code/data you have, or download an example file from Quarto from here

Elements of Quarto | The YAML header

  • Quarto documents start with a YAML header that sets metadata and configurations for the document.
  • YAML is a data serialisation language which aims to be useful for computers (often config files) and humans (it’s relatively easy to read)
  • Typical YAML options:
    • title: Title of the document
    • author: Author’s name
    • date: Document date
    • format: Specifies the output format (html, pdf, etc.)
    • theme: Defines a theme for visual style

There are a lot of options we won’t cover today and they differ by document type!
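
To make the header options concrete, here is a minimal sketch that writes a small .qmd file from R. The title, author, theme name (cosmo) and file name are placeholders chosen for illustration, not values from the course material, and theme is shown nested under the html format, which is one common way to set it.

```r
# Three backticks, built with strrep() so this example can sit inside a code fence
fence <- strrep("`", 3)

# Lines of a minimal Quarto document: YAML header, markdown text, one R chunk
qmd_lines <- c(
  "---",
  'title: "My first Quarto document"',  # placeholder title
  'author: "Your Name"',                # placeholder author
  "date: today",                        # Quarto fills in the render date
  "format:",
  "  html:",
  "    theme: cosmo",                   # placeholder theme
  "---",
  "",
  "Some *markdown* text, followed by an R code chunk.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)

# Write the file to a temporary folder and show its contents
qmd_path <- file.path(tempdir(), "first-document.qmd")
writeLines(qmd_lines, qmd_path)
cat(readLines(qmd_path), sep = "\n")
```

Opening the written file in RStudio and clicking Render should produce an html document, assuming Quarto is installed.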

Elements of Quarto | Code chunks

When you open the file in the RStudio IDE, it becomes a notebook interface for R. You can run each code chunk by clicking the green arrow icon. RStudio executes the code and displays the results inline with your file.

There are three ways to insert R code into the file:

  1. The keyboard shortcut Ctrl + Alt + I (OS X: Cmd + Option + I)
  2. The + (insert chunk) option in the menu bar
  3. Or by typing the chunk delimiters ```{r} and ```

Elements of Quarto | Code chunks

In Quarto, chunk output can be customized with options written at the top of the chunk, each on its own line and prefixed by #|. Here are some common options:

  • #| include: false
    • Excludes code and results from the rendered document but runs the code.
  • #| echo: false
    • Excludes code but displays results. Useful for embedding figures.
  • #| message: false
    • Suppresses messages generated by code.
  • #| warning: false
    • Suppresses warnings generated by code.
  • #| fig-cap: "This is a caption"
    • Adds a caption to figures.

In html documents, the code-tools and code-fold options are usually preferred over include/echo, because readers can then show or hide the code themselves
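
As a rough illustration of the options above, the following shows what the inside of an R chunk might look like (the part between the chunk delimiters). The caption text and the choice of the built-in cars dataset are assumptions made for this sketch. Because the #| lines are ordinary comments to R, the same lines also run unchanged in a plain R script.

```r
#| echo: false
#| message: false
#| warning: false
#| fig-cap: "Stopping distance against speed (built-in cars dataset)"

# Everything below the #| option lines is ordinary R code.
# With echo: false, only the figure (and its caption) appears in the rendered document.
plot(
  cars$speed,
  cars$dist,
  xlab = "Speed (mph)",
  ylab = "Stopping distance (ft)"
)
```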

Elements of Quarto | Inline code

Code results can be inserted directly into the text of a .qmd file by enclosing the code in single backticks (`), starting with {r}, like so:

  • The mean of the data was `{r} mean(mtcars$gear)`
    • Which renders as:
    • The mean of the data was 3.6875
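
An inline expression is just R code whose result is pasted into the sentence, so it can be checked in the console first. The sketch below does that for the example above; rounding with round() is an optional extra that is not part of the original slide.

```r
# The value the inline expression evaluates to
gear_mean <- mean(mtcars$gear)
gear_mean

# Inline results are often rounded before they are shown in text
gear_mean_rounded <- round(gear_mean, 2)

# The sentence as it would appear with the rounded value
sentence <- paste("The mean of the data was", gear_mean_rounded)
print(sentence)
```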

Elements of Quarto | Multiple languages

Quarto allows the use of multiple languages in code chunks. Specify the language inside {} after the opening chunk delimiter, e.g. {python}, {julia} or {r}.

Elements of Quarto | Text Options

  • Do you remember your markdown formatting?
  • The main body is just normal markdown, so all the same stuff works
  • Examples on the website here

Creating documents | Previewing and Rendering

  • In RStudio, you can preview and render Quarto documents easily
    • Render by clicking the ‘Render’ button
    • Preview by toggling the visual button
  • Quarto provides a command-line interface as well, which allows rendering with: quarto render myfile.qmd
  • You can specify output formats in the YAML header or when rendering
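
The command-line call can also be issued from R. The sketch below builds the arguments for quarto render and only runs the command if the quarto program and the file are both found. The helper build_render_args() is a hypothetical name made up for this example, and myfile.qmd is the placeholder file name from the bullet above; --to is the command-line flag for choosing an output format.

```r
# Hypothetical helper: assemble the arguments for `quarto render`
build_render_args <- function(file, to = NULL) {
  args <- c("render", file)
  if (!is.null(to)) {
    # --to overrides the format given in the YAML header
    args <- c(args, "--to", to)
  }
  args
}

args_default <- build_render_args("myfile.qmd")
args_pdf <- build_render_args("myfile.qmd", to = "pdf")

# Show the command that would be run
cat("quarto", args_pdf, "\n")

# Only call Quarto if it is installed and the file exists
if (nzchar(Sys.which("quarto")) && file.exists("myfile.qmd")) {
  system2("quarto", args_pdf)
}
```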

Creating documents | Quarto output formats

  • Quarto supports various output formats:
    • html: Web format with dynamic features
    • pdf: Portable document format for printing
    • docx: Microsoft Word format
    • revealjs: HTML presentations (for slides, this is what these slides are made with!)
    • beamer: PDF presentations (for slides)
  • Specify these in the YAML or use them with the quarto render command, more details here

Creating documents | Practice

  • Here is a version of the document with other elements added
  • See if you can make an ordered list with sub items that contain at least one example of bold, italic, superscript and strikethrough text, as well as a 2x2 table with headers

Now make an .rmd file

  • Compare it to your .qmd document
  • What do you notice that’s different?

Quarto (.qmd) and Rmarkdown (.rmd): Differences you need to know

  • The main differences are the syntax of the YAML options (e.g. html vs html_document), the format of code chunk options (although Quarto also accepts the rmd format), and the language support
  • Rmarkdown is wedded to R: even if you use other languages like Python, the code is still run through R (via the reticulate package in the case of Python)
  • Quarto is language and engine agnostic, and thus more versatile

Break

Tibbles!

Tibbles are a modern take on the data frame. They keep the important features of the original data frame and drop many of the outdated ones. They are another amazing feature added to R by Hadley Wickham. We will use them in the tidyverse to replace the older data frame that we just learned about.

Tibbles! - continued

Compared to Data Frames:

  1. A tibble never changes the input type.
  2. A tibble can have columns that are lists.
  3. A tibble can have non-standard variable names.
    • Names can start with a number or contain spaces.
    • To refer to such a variable, wrap its name in backticks.
  4. It only recycles vectors of length 1.
  5. It never creates row names.
  6. Enhanced print() behaviour
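
A few of the points above can be tried directly. This is a small sketch using made-up column names and the built-in mtcars data; it assumes the tibble package is installed.

```r
library(tibble)

# Point 3: non-standard names are kept by tibble() but rewritten by data.frame()
tb_names <- tibble(`2nd try` = 1:2)
df_names <- data.frame(`2nd try` = 1:2)
names(tb_names) # "2nd try"
names(df_names) # "X2nd.try"

# Backticks are needed to refer to the non-standard name
tb_names$`2nd try`

# Point 4: only length-1 vectors are recycled
df_recycled <- data.frame(x = 1:4, y = 1:2) # base R recycles y to length 4
tb_recycled <- try(tibble(x = 1:4, y = 1:2), silent = TRUE) # tibble raises an error
inherits(tb_recycled, "try-error")

# Point 5: row names are dropped unless you ask for them as a column
cars_tb <- as_tibble(mtcars)
has_rownames(cars_tb)
cars_named <- as_tibble(mtcars, rownames = "model")
head(cars_named$model, 3)
```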

Tibbles! - continued

  • The syntax to make a tibble is nearly identical to data frames
library(tibble)
test <- tibble(x = 1:3, y = list(1:5, 1:10, 1:20))
test
# A tibble: 3 × 2
      x y         
  <int> <list>    
1     1 <int [5]> 
2     2 <int [10]>
3     3 <int [20]>
  • Whereas if we try this as a dataframe
test <- as.data.frame(c(x = 1:3, y = list(1:5, 1:10, 1:20)))
head(test)
  x1 x2 x3 y1 y2 y3
1  1  2  3  1  1  1
2  1  2  3  2  2  2
3  1  2  3  3  3  3
4  1  2  3  4  4  4
5  1  2  3  5  5  5
6  1  2  3  1  6  6

Tibbles! - continued

We can easily coerce dataframes to tibbles with as_tibble()

Try the following, what differences do you notice:

data(iris)
as_tibble(iris)
  • Tibbles only print the first 10 rows and all the columns that fit on the screen
  • You will not accidentally print too much!
  • Each column displays its data type

Tribble

  • Sometimes you might need to make a small table in R
  • tribble() allows you to make a tibble and fill it row wise
  • The ~ is used to define column headers
tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)
# A tibble: 2 × 3
  x         y     z
  <chr> <dbl> <dbl>
1 a         2   3.6
2 b         1   8.5

Tibble exercises

  1. How can you tell if an object is a tibble?
  2. Compare and contrast the following operations on a data.frame and equivalent tibble. What is different?
df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]
  3. If you have the name of a column stored in an object, e.g. var <- "mpg", how can you extract the column from a tibble?

Readr

  • There are many ways to import data into R, from inputting the data yourself to reading it in using the traditional R tools we used in lesson one.
  • The tidyverse way is to use a package called readr. It has several functions you can use to read in different types of data.

Readr functions

  • read_csv() reads comma delimited files
  • read_csv2() reads semicolon separated files (common in countries where , is used as the decimal separator)
  • read_tsv() reads tab delimited files
  • read_delim() reads in files with any delimiter.
  • read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions().
  • read_table() reads a common variation of fixed width files where columns are separated by white space.
  • read_log() reads Apache style log files.
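
To see how the delimited-file readers relate to each other, the sketch below parses the same tiny table written with four different separators. The data are made up for illustration, and passing literal text with I() assumes readr version 2.0 or later; in practice you would pass a file path instead.

```r
library(readr)

# Comma separated, "." as decimal separator
csv_data <- read_csv(I("id,height\n1,1.5\n2,1.8"), show_col_types = FALSE)

# Semicolon separated, "," as decimal separator
csv2_data <- read_csv2(I("id;height\n1;1,5\n2;1,8"), show_col_types = FALSE)

# Tab separated
tsv_data <- read_tsv(I("id\theight\n1\t1.5\n2\t1.8"), show_col_types = FALSE)

# Any other delimiter, here "|"
pipe_data <- read_delim(I("id|height\n1|1.5\n2|1.8"), delim = "|", show_col_types = FALSE)

# All four calls give the same columns and values
csv_data
all.equal(csv_data$height, csv2_data$height)
```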

Readr exercises

  1. Use the base R function read.table() to import the pheno.txt file. Then repeat this with read_tsv() from readr. What is the difference?

You may notice that read_csv() automatically assumes your first row contains your column headers. You may wish to alter this behaviour if your file comes with a header of information on the top row.

  2. Open the “pheno.txt” in a text editor (can use notepad on Windows), and add a header to the file
  3. What happens when you open this using read_csv()?

Readr exercises continued

Let’s try again, but skipping this header.

  1. You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with “#”
  2. You may not have column names, in that case you can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn
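
The options above can be tried without a file. In this sketch the text being read is invented for illustration (it is not the contents of pheno.txt), and passing literal text with I() assumes readr version 2.0 or later.

```r
library(readr)

# A small file with one line of information above the real column headers
raw_text <- "# exported from the lab system\nid,height\n1,1.5\n2,1.8"

# Skip a fixed number of lines at the top
skipped <- read_csv(I(raw_text), skip = 1, show_col_types = FALSE)

# Or drop every line that starts with "#"
commented <- read_csv(I(raw_text), comment = "#", show_col_types = FALSE)

# No header row at all: columns are labelled X1, X2, ...
no_header <- read_csv(I("1,1.5\n2,1.8"), col_names = FALSE, show_col_types = FALSE)
names(no_header)

# Alternatively, supply the column names yourself
named <- read_csv(I("1,1.5\n2,1.8"), col_names = c("id", "height"), show_col_types = FALSE)
names(named)
```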

Readr VS base R

Why use the readr functions?

  • They are typically much faster (~10x) than their base equivalents. Long running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try fread() from the data.table package. It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.
  • They produce tibbles, and they don’t convert character vectors to factors (the base R default before R 4.0), use row names, or munge the column names. These are common sources of frustration with the base R functions.
  • They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so code that works on your computer might not work on someone else’s.
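
The point about column names is easy to see with a small made-up table. This sketch compares base read.csv() with read_csv(); the column names are invented for illustration, and I() assumes readr version 2.0 or later.

```r
library(readr)

# Column names containing a space and a leading digit
raw_text <- "sample id,2nd reading,group\n1,0.5,a\n2,0.7,b"

base_df <- read.csv(text = raw_text)
readr_tb <- read_csv(I(raw_text), show_col_types = FALSE)

# Base R rewrites the names to make them syntactic
names(base_df)

# readr keeps them exactly as they appear in the file
names(readr_tb)

# readr returns a tibble, base R a plain data frame
inherits(readr_tb, "tbl_df")
inherits(base_df, "tbl_df")
```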

Additional Resources

  • Extra reading from R for Data Science:
    • Chapter 8: Data import
    • Chapter 29: Quarto
    • Chapter 30: Quarto formats
  • Advanced reading:
    • Chapters 21-25 (importing data)
  • These slides and the workshop can be found on the website here: