Data Visualisation in R

Visualisation

Making you data come to life!

Objectives

  • How data can be visualised in R
  • Learn how to adjust graphics
  • Get familiar with ggplot2 and it’s grammar of graphics to create amazing figures

Data Visualisation - An Overview

  • The overall aim of visualising data:
    • Make all your plots as self explanatory as possible!
  • For this lecture, we will focus on ggplot2, a tidyverse package
    • Inspired on the Grammer of Graphics, a book that aims to formalise visualisations into layers
  • Has multiple add-ons, such as ggrepel (for text labels) and ggpubr (for publication-ready plots) or patchwork (for combining multiple plots)

Base R plots

  • Graphs can be easily generated with the base R syntax
data <- c(2, 3, 6, 4, 9)
plot(data)

plot(data, type = "l")

Base R plots

  • Graphs can be easily generated with the base R syntax
plot(data, type = "o", col = "blue")

barplot(data)

ggplot2

  • ggplot2 is a lot more useful and user friendly than base R, making plots look a lot nicer and with more options for building and displaying graphics
  • We’ll start with an old favourite, the mtcars dataset!
# Load packages
# ggplot2 is in the tidyverse
library(tidyverse)
library(conflicted)
library(ggrepel)
library(ggstatsplot)
library(plotly)
conflicts_prefer(dplyr::filter)

mtcars |>
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

ggplot2 - layers

  • ggplot() is the main function, and this creates the initial ggplot object where we then add multiple layers
  • A layer is a collection of geometric elements (geoms) and statistical transformations
  • An easy example of a “geom” element layer is geom_point(), which adds a scatter plot
  • Aesthetic mappings (aes) are specified with aes(). This is how variables (columns) in the input data are mapped to visual, or “aesthetic”, properties.
  • You can give global “aesthetics” to a plot (will appear in every layer) by specifying this in the ggplot() function, or local aesthetics in individual layers (such as in geom_point()).

Themes

  • A “theme” controls the finer points of the plot, like the font size and background colour
  • This is essentially customising the non-data elements
  • For example, change the default grey background to white background1
mtcars |>
  ggplot(aes(x = mpg, y = hp)) +
  geom_point() +
  theme_bw()

Themes - Global

  • These themes can be set globally, so for all plots, using the theme_set() function
theme_set(theme_bw())

mtcars |>
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

  • You can see some more themes provided by ggplot2 here

Quick plots with qplot

  • A quicker version of ggplot! Good for very basic figures
qplot(mpg, hp, data = mtcars)

Aesthetics: Colour

  • We can visualise more information by colouring the data points by another variable
  • For example, in mtcars we can map the number of cylinders to the colour aesthetic (or color if you want to spell it wrong…)
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp, colour = factor(cyl))) +
  geom_point()

Aesthetics: Size and Shape

  • We can also map the number of cylinders to the size or shape aesthetic
# Using size
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp, size = cyl)) +
  geom_point()

# Using shape
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp, shape = factor(cyl))) +
  geom_point()

Aesthetics: Shape & Colour

  • We can combine aesthetics as we like too!
ggplot(
  data = mtcars,
  mapping = aes(x = mpg, y = hp, shape = factor(cyl), colour = factor(cyl))
) +
  geom_point()

Aesthetics: Conditional Colour

  • We can even map an aesthetic to a datapoint based on a condition, i.e. only change colour when a certain condition is met.
  • For example here, the colour varies depending on whether the car has 4 cylinders or not (cyl == 4 being TRUE or FALSE)
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp, colour = cyl == 4)) +
  geom_point()

Aesthetics: Fill

  • Fill is yet another aesthetic
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

Plotting practice!

  • Using the mpg data:
    • Visualise the relationship between engine size (displ) and highway mileage (hwy). What relationship do you see?
    • Add colour to show a categorical variable (e.g. class). Does adding color help you notice anything?
    • Count cars by manufacturer in a plot. Try reording the bars by frequency.

Facets

  • A “facet” is one section of something that has many sections.
  • A “facet” in ggplot allows you to break up the data into different subsets and plot individual panels based on it
  • Creates “subplots” or panel-like figures
  • Really useful when you’ve got categorical variables (such as gender)
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp, colour = factor(cyl))) +
  geom_point() +
  facet_wrap(~cyl)

Facets: layout

  • We can change the layout of the “facets” locally:
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp, colour = factor(cyl))) +
  geom_point() +
  facet_wrap(~cyl, ncol = 1)
# Note the following is equivalent
# facet_wrap(~ cyl, dir = "v")

Facet grids

  • We can combine 2 variables with facet_grid()
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp, colour = factor(cyl))) +
  geom_point() +
  facet_grid(am ~ cyl)

Additional Geoms

  • So far we’ve only used geom_point(), but there are naturally many more geoms we can use
  • geom_smooth() draws a smoothed line based on the trend of the provided data
  • They can be used individually, or layered on top of one another, which is the core of the grammer of graphics
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_smooth()

Combining geoms

  • Here we layer two geoms
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_smooth() +
  geom_point()

  • Note that the order of geoms can matter! (though in this case it doesn’t :P)

Layering geoms and additional aesthetics

  • The order in which we give aesthetics can also matter
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp, colour = factor(cyl))) +
  geom_smooth() +
  geom_point()

# A more sensible order?
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_smooth() +
  geom_point(aes(colour = factor(cyl)))

Statistical Transformations: count

  • Some plots transform your data internally and plot those new values instead of raw values
  • The stat argument of different plot types (geom functions) specifies the statistical transformation
  • For example, geom_bar() uses stat = "count" as it’s default to create counts of the mapped variable (as in what a bar chart does):
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Worth noting: geom and stats are often interchangeable
ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

Statistical Transformations: identity

  • Another commonly used stat is “identity” when plotting bars with heights based on raw values
## example tibble
demo <- tribble(
  ~cut        , ~value ,
  "Fair"      ,   1610 ,
  "Good"      ,   4906 ,
  "Very Good" ,  12082 ,
  "Premium"   ,  13791 ,
  "Ideal"     ,  21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = value), stat = "identity")

Break

Positional Adjustments

  • The position argument can control how geoms occupy space
# stacked
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

# side by side
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
# Note the default is position = "stack"

Labels: Titles and Captions

  • Another editable part of a plot are the text labels, which we can add or modify using labs()
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_point(aes(colour = factor(cyl))) +
  labs(title = "Title", subtitle = "Subtitle", caption = "Small Caption")

Labels: Axis and legends

ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_point(aes(colour = factor(cyl))) +
  labs(
    x = "Miles per gallon (mpg)",
    y = "Horsepower (hp)",
    colour = "Cylinders"
  )

Labels: Axis alternative

ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_point(aes(colour = factor(cyl))) +
  xlab("Miles per gallon") +
  ylab("Horsepower")

More practice

  • Using the same data (mpg), build on your earlier scatterplot of displ vs hwy by adding a trend line.
  • Add two different geoms to the same plot. Experiment!
  • Use faceting to compare trends across drive types (drv).
  • Start with a bar plot of drive type (drv) by cylinder (cyl):
ggplot(mpg, aes(x = drv, fill = factor(cyl))) +
  geom_bar()
  • Then try setting position to fill or dodge. What changes?
  • Add clear axis labels, a title, subtitle, and caption to a plot

Annotations

  • To add annotation to data points, we can use geom_text() and geom_label()
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  geom_text(aes(label = rownames(mtcars)))

ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  geom_label(aes(label = rownames(mtcars)), alpha = 0)

Annotations: ggrepl

  • The ggrepl package can help make more legible labels
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  ggrepel::geom_text_repel(
    aes(
      label = rownames(mtcars)
    ),
    max.overlaps = 100
  )
# Label only a subset
ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  ggrepel::geom_text_repel(
    aes(
      label = rownames(mtcars[1:3, ])
    ),
    data = mtcars[1:3, ]
  )

More elegant annotations

top10 <- mpg |> slice_max(hwy, n = 10)
mpg_top <- mpg |>
  mutate(lbl = if_else(model %in% top10$model, model, NA))

ggplot(mpg_top, aes(displ, hwy)) +
  geom_point(alpha = 0.6) +
  geom_text_repel(aes(label = lbl), na.rm = TRUE) +
  labs(title = "Top 10 Highway MPG Models")

Zooming

  • To control the plot limits, you have 3 methods:
    • Adjusting the data that’s plotted
    • Setting xlim and ylim in coord_cartesian() (do this!!)
    • Setting the limits in each scale
# Not ideal - need strong justification!
mtcars |>
  filter(mpg >= 20, hp <= 150) |>
  ggplot(mapping = aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth()

Zooming - coord_cartesian

ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth() +
  coord_cartesian(xlim = c(20, 35), ylim = c(0, 150))

This is the RIGHT WAY, using coord_cartesian()

Zooming - lims

ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth() +
  lims(x = c(20, 40), y = c(0, 150))

This is also not ideal, as it removes data outside the limits!

Scales

  • “scale” allows you control mapping things like colour, size and shape to data values
  • “scale” draws a legend or axes
  • ggplot2 automatically adds default scales behind the scenes
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl)))

Scales - defaults

  • Is the same as:
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

Scales - axis breaks

  • The naming scheme tells you the aesthetic (x_, y_, colour_, etc) and the name of the scale (continuous, discrete)
# Change the breaks on the axis
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  scale_y_continuous(
    breaks = seq(0, 350, by = 50)
  )

Scales - axis labels

# Adding text to labels
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  scale_y_continuous(
    breaks = seq(0, 350, by = 50),
    labels = paste0(
      "HP ",
      seq(0, 350, by = 50)
    )
  )

# No labels
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  scale_y_continuous(
    breaks = seq(0, 350, by = 50),
    labels = NULL
  )

Legend Layout - position

  • The legend can of course also be modified in lots of ways
p <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl)))

# Legend at the top
p + theme(legend.position = "top")
# No legend
p + theme(legend.position = "none")
# Note - the default is "right"

Legend Layout - guides

  • We can use the guides() function to control the legend display
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(colour = factor(cyl)), alpha = 0.5) +
  guides(
    colour = guide_legend(
      ncol = 2,
      override.aes = list(size = 3, alpha = 1)
    )
  )

Controlling the Colour Scale - alt palettes

  • The default ggplot2 colours we get are a bit rubbish
  • Many pre-defined colour palettes are available to change that
  • Such as from the RColorBrewer package
  • Palette explainer here and interactive browser here
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  scale_colour_brewer(palette = "Set1")

Controlling the Colour Scale - manually

ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  scale_colour_manual(values = c("black", "pink", "turquoise"))

# Explicitly setting the values to colours
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  scale_colour_manual(
    values = c(
      `4` = "black",
      `6` = "pink",
      `8` = "red"
    )
  )

Saving plot - the ggplot2 way

  • We of course want to be able to save the beautiful plots we make! We can do this using ggsave()
  • If not specified, it will save the most recent plot we create to our disk
  • The format of the plot is defined in the filename extension (.pdf or .png for example)
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl)))
# Save the last printed plot
ggsave(filename = "my_plot_1.pdf")
# Save the plot to a variable first
plot <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl)))
ggsave(filename = "my_plot_1.png", plot)

Saving plot - the base R way

  • With base R we need to set the device we want to save with first, then print the plot, and then close the device
# Save to png
png(filename = "my_plot.png", width = 500, height = 400)
# Print the plot
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl)))
# Close the device
dev.off()

Yet more practice

  • Using the mpg dataset, label only cars with hwy > 35 & displ < 2, using geom_text_repel() so labels don’t overlap.
  • On mpg, set x breaks at integers and y limits to 10–50.
  • Save a plot object as both high-res PNG and vector PDF.

Compound plots

  • What if you want to put two or more plots together to save?
plot1 <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl)))
plot2 <- ggplot(mtcars, aes(x = qsec, y = hp)) +
  geom_point(aes(color = factor(cyl)))
plot1 + plot2

Boxplots and Violin Plots

  • Boxplots and Violin Plots are very common within biolosciences(protein levels, patient data, SNP frequency etc.)
  • Be warned that boxplots can sometimes be misleading and so it’s always good to check the raw data too! See here for more info
  • Have a go at creating your own boxplots and violin plots using the mtcars/diamonds/other datasets!
mtcars$cyl <- as.factor(mtcars$cyl)
# Make base plot
p <- ggplot(mtcars, aes(cyl, mpg))

p + geom_boxplot(aes(colour = cyl))

Boxplots and Violin Plots - examples

  • geom_jitter is like geom_point, but adds noise so point aren’t on top of each other, handy for mapping the raw data onto other geoms!
p + geom_boxplot(aes(fill = cyl), alpha = 0.3) + geom_jitter(size = 0.8)
# Violin
p + geom_violin(aes(fill = cyl))

Boxplots and Violin Plots - combining geoms

  • We can of course layer them on top too!
# All three!
p +
  geom_violin(aes(fill = cyl), width = 1.4) +
  geom_boxplot(width = 0.1, colour = "grey", alpha = 0.3) +
  geom_jitter(size = 0.8, width = 0.1, colour = "grey") +
  scale_fill_viridis_d() +
  theme_minimal()

Boxplots and Violin Plots - numeric to categorical

# discretise numeric data into categorical
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_boxplot(aes(
    group = cut_width(carat, 0.2)
  ))
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_boxplot(aes(
    group = cut_number(carat, 20)
  ))

3D Plots

  • plotly can be used to make 3D plots, but this is very situational and should probably only be done in cases where the plot is intended to be interactive
mtcars$gear <- as.factor(mtcars$gear)

plot_ly(
  mtcars,
  x = ~wt,
  y = ~hp,
  z = ~qsec,
  color = ~gear
) |>
  add_markers() |>
  plotly::layout(
    scene = list(
      xaxis = list(title = 'Weight'),
      yaxis = list(title = 'Gross horsepower'),
      zaxis = list(title = '1/4 mile time')
    )
  )

3D Plots

ggstatsplot

  • A handy way to quickly look at correlations!
ggstatsplot::ggscatterstats(mtcars, x = hp, y = qsec)

That’s all folks!

  • These slides can be found on the website here: