MET581 Lecture 05

Wrangling Data 3 (factors, dates and functions)

Matthew Bracher-Smith

2024-10-21

Anonymous Feedback | Course content

Use the QR code, this link or type code #3267575 into slido.com

Overview

Setup

Please install and load the following packages for today

library(forcats)
library(ggplot2)
library(dplyr)
library(lubridate)
library(testthat)
library(gapminder)

Review

  • dplyr verbs, tidyr, joins, stringr
  • homework

Significantly More Important Review

The Plan

  • Factors and Forcats
  • Dates and Lubridate
  • Functions and conditionals

Factors

A factor:

  • is how we store categorical variables in R
  • contains a fixed and known set of possible values

We can create them using the factor() function, which takes the format: factor(vector, levels, labels)

# e.g.
factor(c(0, 1, 1, 1, 0), labels=c('Female', 'Male'))
[1] Female Male   Male   Male   Female
Levels: Female Male
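
As a minimal sketch of the factor(vector, levels, labels) format, the example below passes levels explicitly as well as labels. The object names codes and sex are made up for illustration, and the 0/1 coding is assumed to match the example above.

```r
# Hypothetical coded data: 0 = Female, 1 = Male (same coding as the example above)
codes <- c(0, 1, 1, 1, 0)

# Giving `levels` as well as `labels` makes the code-to-label mapping explicit,
# rather than relying on the sorted order of the values
sex <- factor(codes, levels = c(0, 1), labels = c("Female", "Male"))

levels(sex)  # the fixed set of possible values
table(sex)   # number of observations per level
```

Being explicit about levels is a design choice: if the data happened to contain only 1s, the labels would still attach to the right codes.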

Making Factors

We can make factors that have an inherent order

monthLevels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
data <- factor(c("Dec", "Jun", "Apr"))
data
[1] Dec Jun Apr
Levels: Apr Dec Jun

But sorting them may not give us what we expect

sort(data)
[1] Apr Dec Jun
Levels: Apr Dec Jun

Making Factors

  • Factors always have an internal order, even if you don’t give one
  • If you don’t set the levels, they will be the sorted unique values (alphabetical for strings)
  • If you want a specific order, you need to give it:
monthLevels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
data <- factor(c("Dec", "Jun", "Apr"), levels = monthLevels)
sort(data)
[1] Apr Jun Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
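
If you also want to compare values, not just sort them, one option is an ordered factor. This is a sketch using base R's ordered = TRUE argument; the name month_fct is made up for illustration.

```r
monthLevels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# ordered = TRUE tells R the levels have a meaningful ranking
month_fct <- factor(c("Dec", "Jun", "Apr"), levels = monthLevels, ordered = TRUE)

month_fct < "Jun"      # FALSE FALSE TRUE (only Apr comes before Jun)
max(month_fct)         # Dec
as.integer(month_fct)  # 12 6 4: each value's position in the levels
```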

Making Factors

Strings that aren’t in your levels are silently set as NA

factor(c("Dec", "Jum", "Apr"), levels = monthLevels)
[1] Dec  <NA> Apr 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

By contrast, readr’s parse_factor() will warn you

readr::parse_factor(c("Dec", "Jum", "Apr"), levels = monthLevels)
[1] Dec  <NA> Apr 
attr(,"problems")
# A tibble: 1 × 4
    row   col expected           actual
  <int> <int> <chr>              <chr> 
1     2    NA value in level set Jum   
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Factors Practice

  • create a factor vector called ‘marauders’ that contains the strings ‘moony’, ‘wormtail’, ‘padfoot’ and ‘prongs’ in alphabetical order
  • create a factor called ‘patronus’ with the strings ‘stag’, ‘dog’, ‘otter’, creating levels from the order they appear in the input vector
  • print only the levels of these factors

Using diamonds

  • use dplyr::count() to show how many rows there are for each factor level in the ‘cut’ column. Is it any different to using forcats::fct_count()?
  • use dplyr::arrange(desc()) to sort the cut column in descending order. What were the rows sorted by?

Why should I care?

  1. you’re better than that
  2. plots
  3. models
  4. forcats

Why use the Forcats* package?

  • It covers many of the common tasks we need to do with factors
  • It works well with ggplot2 (also written by Hadley/the tidyverse team)
  • It generally tries to warn you when something may be wrong
  • It has the word cats in it
  • It’s an anagram of factors
  • It’s for categoricals (factors)
  • Something about cats

*strictly for humans

fct_inorder()

# sets the levels to be the
# order they appear in the vector
head(fct_inorder(gss_cat$marital))
[1] Never married Divorced      Widowed       Never married Divorced     
[6] Married      
Levels: Never married Divorced Widowed Married Separated No answer

fct_relevel()

# moves one or more levels to the start
head(fct_relevel(gss_cat$marital, 'Married'))
[1] Never married Divorced      Widowed       Never married Divorced     
[6] Married      
Levels: Married No answer Never married Separated Divorced Widowed

fct_recode()

For changing the names of existing levels by hand

myFactor <- factor(c("M", "F", "O", "M", "P", "M",
                     "F", "F", "F", "M", "O", "P"))
myFactorPub <- fct_recode(myFactor, male = "M", female = "F",
                          unknown = "O", unknown = "P")
myFactorPub
 [1] male    female  unknown male    unknown male    female  female  female 
[10] male    unknown unknown
Levels: female male unknown

fct_reorder()

  • the most useful (in my opinion)
  • reorder your factor levels by another variable
  • allows you to bring structure to plots
gss_cat |>
    group_by(marital) |>
    summarise(tvhours = mean(tvhours, na.rm = TRUE)) |>
    ggplot(aes(tvhours, fct_reorder(marital, tvhours))) + # <<<---
      geom_point()

Forcats - Practice!

Using gss_cat

  • how many levels are there in the ‘relig’ column?
  • reorder the levels of ‘denom’ in order of appearance
  • take the code below which plots income by age. Try changing the order of the levels in ‘rincome’ to be sorted by ‘age’. Now try just moving n/a to the start. Which option works best?
gss_cat |>
  group_by(rincome) |>
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()) |>
  ggplot(aes(age, rincome)) +
    geom_point()

fct_reorder2()

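  • probably the second most useful
  • surprisingly helpful when reading graphs: it orders the levels by the y values at the largest x values, so the legend matches the order of the lines at the right-hand edge of the plot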
gss_cat |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(age, prop, colour = fct_reorder2(marital, age, prop))) + # <<<---
    geom_line() +
    labs(colour = "marital")

Other useful Forcats functions

  • fct_rev() reverses the order of the levels
  • fct_lump() combines least common factor levels into ‘other’
  • fct_expand() adds new levels to your factors
  • fct_relabel() automatically relabels factor levels
  • fct_infreq() orders levels from most frequent to least frequent
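
A quick sketch of these helpers on a toy factor (the factor f is made up for illustration). Recent forcats versions recommend the more specific fct_lump_n(), fct_lump_min() and fct_lump_prop() over fct_lump(), so fct_lump_n() is used here.

```r
library(forcats)

# Toy factor: "a" appears once, "b" twice, "c" three times
f <- factor(c("a", "b", "b", "c", "c", "c"))

levels(fct_rev(f))               # "c" "b" "a"
levels(fct_infreq(f))            # "c" "b" "a" (most frequent first)
levels(fct_expand(f, "d"))       # "a" "b" "c" "d"
levels(fct_relabel(f, toupper))  # "A" "B" "C"

# keep the single most common level and lump the rest together
levels(fct_lump_n(f, n = 1))
```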

Forcats - More Practice!

Using gss_cat

  • change the names of the levels of partyid from “Not str republican”, “Ind,near dem” and “Ind,near rep” to “Not strong republican”, “Independent, near democrat” and “Independent, near republican”
  • run the code chunk below and view the output. Now edit the code so that the bars are sorted from lowest to highest using fct_infreq() and fct_rev()
gss_cat |>
  ggplot(aes(marital)) +
    geom_bar()

Forcats - Still More Practice!

Using diamonds

  • change the code below so that the legend colours match the order of the lines at the right side of the plot
diamonds |>
  filter(color == 'J', depth > 55, carat <= 2.5) |>
  ggplot(aes(carat, price, col=cut)) +
    geom_line(alpha=0.6)

Gotchas: coercing factors

Errors may be obvious:

x <- factor(c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0))
as.numeric(x)
 [1] 2 2 1 1 2 1 2 2 2 1

Gotchas: coercing factors

Or they can be more subtle:

x <- factor(c(1, 1, 2, 5, 3, 3, 1, 6, 5, 1, 6, 2))
as.numeric(x)
 [1] 1 1 2 4 3 3 1 5 4 1 5 2

Gotchas: coercing factors

  • NEVER convert from factor to numeric unless you know that’s what you want
  • It returns R’s internal codes for the factors, not their values
  • Instead, convert to character first, then to numeric, i.e.:
x <- factor(c(1, 1, 2, 5, 3, 3, 1, 6, 5, 1, 6, 2))
as.numeric(as.character(x))
 [1] 1 1 2 5 3 3 1 6 5 1 6 2
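
To see why this happens, it can help to look at the levels and the internal codes side by side. The sketch below also shows an alternative idiom suggested in the base R ?factor help page, as.numeric(levels(x))[x]; the names via_levels and via_character are made up for illustration.

```r
x <- factor(c(1, 1, 2, 5, 3, 3, 1, 6, 5, 1, 6, 2))

levels(x)      # "1" "2" "3" "5" "6": the values, stored as strings
as.integer(x)  # the internal codes: each value's position in levels(x)

# Convert the (few) levels once, then index them by the codes
via_levels <- as.numeric(levels(x))[x]
via_character <- as.numeric(as.character(x))

identical(via_levels, via_character)  # TRUE
```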

Dates

How Dates Work in R

  • Dates and times are not just strings: they come in many formats, like YYYY-MM-DD, MM/DD/YYYY, or even DD-MM-YYYY
  • Handling dates involves dealing with varied formats, time zones, leap years, and calculations between dates
  • R has special packages to parse and manage dates efficiently.
class("2024-10-20") # a simple date string
[1] "character"
class(as.Date("2024-10-20")) # a date object
[1] "Date"
typeof(as.Date("2024-10-20")) # but a double under the hood
[1] "double"
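
The double under the hood is the number of days since 1970-01-01, which is why arithmetic on dates works. A small base R sketch (the name d is made up for illustration):

```r
d <- as.Date("2024-10-20")

unclass(d)                 # 20016: days since 1970-01-01
d - as.Date("1970-01-01")  # a difftime of 20016 days
d + 30                     # "2024-11-19"
format(d, "%d/%m/%Y")      # "20/10/2024": back to a string, in a chosen format
```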

Introduction to lubridate

  • lubridate is an R package designed to make working with dates and times easier
  • It helps parse different date formats, manipulate dates, and perform calculations.
  • It’s very hard to handle dates accurately without dedicated date tools
  • But with lubridate, you can:
    • parse dates from strings in common formats
    • do arithmetic with dates easily
    • account for time zones, leap years etc.
    • handle times too (though we don’t cover that here)

lubridate::functions()

  • ymd() parses strings whose components are in year-month-day order (e.g. "2024-10-21") into dates
  • it returns a date object, which is a special type of object in R
  • there are several related functions which parse slightly different formats, such as mdy(), dmy(), ymd_hms() etc.
  • today() returns the current date
  • year(), month(), day() extract the year, month, and day from a date object
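
As a short sketch, the same date written in three different orders parses to the same date object, as long as the function matches the order of the components. The names d1, d2 and d3 are made up for illustration.

```r
library(lubridate)

d1 <- ymd("2024-10-21")  # year-month-day
d2 <- dmy("21/10/2024")  # day-month-year
d3 <- mdy("10/21/2024")  # month-day-year

d1 == d2 && d2 == d3     # TRUE: all three are the same date

year(d1)                 # 2024
month(d1)                # 10
day(d1)                  # 21
month(d1, label = TRUE)  # the month as an ordered factor (labels depend on locale)

today() - d1             # days elapsed since the date
```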

lubridate::functions()

  • interval() creates an interval between two dates
  • we can pass this to time_length(), which calculates the length of an interval in a specified unit
# Create two date objects
start_date <- ymd("2015-05-15")
end_date <- ymd("2024-10-20")
date_interval <- interval(start_date, end_date)

print(time_length(date_interval, "years"))
[1] 9.43287671233

lubridate::practice()

  • John Doe was born on 4th September, 1983. Create a date object for his birth date
  • How old was John on the 6th June, 2020?
  • What about today?

using the lakers dataset which comes with lubridate

  • choose the correct function to replace some_function below:
lakers |> 
  as_tibble() |> 
  mutate(date = some_function(sprintf("%08d", date)))

lubridate::practice_more()

using the economics dataset from ggplot2

  • take the date column from the economics dataset and create two new columns which contain the year and month of each date
  • create a new column called time_since_nyse, which contains the number of years between the founding of the New York Stock Exchange on 17th May, 1792 and the date column

Functions

Functions in R:

  • allow you to automate tasks in a more powerful way than copy/paste
  • mean you only need to update code in one place
  • reduce the likelihood of errors
  • are for others, but mainly for YOU
  • should be written for readability and reusability
  • have the format
function_name <- function(arg1, arg2, arg3) {
  # function body
}
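
A concrete sketch of this format, using a made-up temperature conversion (the function name and the digits argument are hypothetical). It also shows a default argument value, which is used when the caller does not supply one.

```r
# Convert temperatures from Celsius to Fahrenheit
# `digits` has a default, so callers can leave it out
celsius_to_fahrenheit <- function(celsius, digits = 1) {
  fahrenheit <- celsius * 9 / 5 + 32
  return(round(fahrenheit, digits))
}

celsius_to_fahrenheit(100)         # 212
celsius_to_fahrenheit(c(0, 37))    # 32.0 98.6 (works on whole vectors)
celsius_to_fahrenheit(36.6, digits = 0)
```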

Functions - Practice!

call your functions after creating them to check their output

  • write an empty function with no arguments, called ‘stub_func’.
  • write a function that takes no arguments and returns the number 3 using the return() statement. Choose an appropriate name.
  • create another function with no return statement or arguments, which only contains the number 3. Run the function. Is it different to before?
  • create a function called my_divide() which takes two arguments, ‘x’ and ‘y’ and returns x divided by y. Use an explicit return statement.
  • we want to still return a value when y is zero. Change the my_divide() function so that there is another argument called ‘tol’. Set it to a very low value, and add it to y before dividing.

General Rules for Writing Functions

Try to decipher the code below. What does it do? Does it work? Are there errors?

df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

General Rules for Writing Functions

Is this better?

rescale <- function(x) {
    return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))
}

General Rules for Writing Functions

Even better?

rescale <- function(x) {
    y <- min(x, na.rm = TRUE)
    return((x - y) / (max(x, na.rm = TRUE) - y))
}

General Rules for Writing Functions

Perfect?

rescale <- function(x) {
  # rescale vector to the range from 0 to 1
  min_value <- min(x, na.rm = TRUE)
  return((x - min_value) / (max(x, na.rm = TRUE) - min_value))
}
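
Once the logic lives in a function, the four copy-pasted lines can collapse into a single call. This is one possible sketch using base R's lapply(); set.seed(42) is only there to make the random data reproducible.

```r
rescale <- function(x) {
  # rescale vector to the range from 0 to 1
  min_value <- min(x, na.rm = TRUE)
  return((x - min_value) / (max(x, na.rm = TRUE) - min_value))
}

set.seed(42)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# Apply rescale() to every column; df[] keeps the data frame structure
df[] <- lapply(df, rescale)

sapply(df, range)  # each column should now run from 0 to 1
```

Because every column goes through the same code, a copy-paste slip in one column's formula can no longer happen.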

General Rules for Writing Functions

  • never take variables created outside a function and use them from inside a function unless you pass them as an argument
  • if you have a lot of functions, it’s good practice to put them in a separate file and load them with source()
  • generally, functions should do one thing, but you can do whatever you like, including nested functions

When to write a function

  • if you find yourself writing or pasting the same thing two or three times

When NOT to write a function

  • if it’s a one-off bit of code you’ll never use again
  • you’re lazy
  • you hate future you

Anonymous functions

  • sometimes we need to package up some code in a function, but we know we’ll never need it again
  • this is common where we do something trivial, like a simple calculation
  • in these cases we often create a function on the fly in the call to another function
  • these are called anonymous functions (because we don’t name them) or lambda functions (from lambda calculus)
  • they’re reasonably common in R, especially when using dplyr

Anonymous functions

  • you’ve already seen these in calls to functions like summarise()!
  • the tilde (~) creates a lambda function in tidyverse functions, and .x refers to its input
  • you will also see . used to refer to the input - this is the same as .x
  • with the ~ shorthand you can only use these special names (., .x, .y, ..1 and so on); if you want to choose your own argument name, write function(x) or \(x) instead
# example use of an anonymous function with summarise
gapminder |>
  group_by(continent) |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) |>
  head(3)
continent year lifeExp pop gdpPercap
Africa 1979.5 48.8653301282 9916003.14263 2193.75457829
Americas 1979.5 64.6587366667 24504794.99667 7136.11035559
Asia 1979.5 60.0649032323 77038721.97222 7902.15042805

Anonymous functions

  • In this way,
~ mean(.x, na.rm = TRUE)

is equivalent to

function(.x) {
  mean(.x, na.rm = TRUE)
}
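
The ~ shorthand only works inside tidyverse functions. In base R the same idea is written with function(x), or with the \(x) shorthand available from R 4.1. A small sketch, where scores and named_mean are made up for illustration:

```r
scores <- list(a = c(1, 2, NA), b = c(10, 20, 30))

# 1. A named function, defined once and reused
named_mean <- function(x) mean(x, na.rm = TRUE)
sapply(scores, named_mean)                          # a = 1.5, b = 20

# 2. The same function, created on the fly and never named
sapply(scores, function(x) mean(x, na.rm = TRUE))

# 3. The backslash shorthand for function (R >= 4.1)
sapply(scores, \(x) mean(x, na.rm = TRUE))
```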

Flow control: conditionals

if-else statements

if statements have the format:

if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}

if-else statements

We can add multiple else if statements to these

if (x > 0) {
    print("Positive")
} else if (x < 0) {
    print("Negative")
} else {
    print("Zero")
}

if-else statements

  • ifelse() lets us use vectorised if-else statements (note, dplyr has a version called dplyr::if_else() that’s a bit stricter)
  • if you have a lot of if statements, check out the switch() function
  • dplyr::case_when() handles multiple vectorised if_else() statements
# example use of case_when to simplify multiple if_else statements
gapminder |>
  mutate(lifeExp_category = case_when(
    lifeExp < 50 ~ "Low",
    lifeExp >= 50 & lifeExp <= 70 ~ "Medium",
    lifeExp > 70 ~ "High"
  )) |>
  head(3)
country continent year lifeExp pop gdpPercap lifeExp_category
Afghanistan Asia 1952 28.801 8425333 779.4453145 Low
Afghanistan Asia 1957 30.332 9240934 820.8530296 Low
Afghanistan Asia 1962 31.997 10267083 853.1007100 Low
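
For comparison, here is a base R sketch of the other two tools mentioned above: ifelse() for vectors and switch() for choosing between several named options. The function centre() and its arguments are hypothetical.

```r
x <- c(-2, 0, 3)

# ifelse() works element by element, so it can label a whole vector
labels <- ifelse(x > 0, "Positive", ifelse(x < 0, "Negative", "Zero"))
labels  # "Negative" "Zero" "Positive"

# switch() picks one branch by name; the unnamed last argument is the default
centre <- function(values, type) {
  switch(type,
    mean = mean(values),
    median = median(values),
    stop("Unknown type: ", type)
  )
}

centre(c(1, 2, 10), "median")  # 2
centre(c(1, 2, 10), "mean")
```

Nested ifelse() calls get hard to read quickly, which is where case_when() tends to be the clearer choice.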

Using your functions

  • modularising code
  • unit testing
  • packages

Modularising code

  • separating code into functions and files makes it easier to re-use across projects
  • it also makes the code easier to maintain, as we know where to go to change it
  • testing functions is also easier than testing code spread out in a script and interwoven with results

Modularising code - practice!

  • create a new R script
  • create a function in the script called my_add() that takes two arguments, x and y and returns their sum (do not paste it into the terminal!)
  • save the R script
  • in the console, source the R script with source("path/to/script.R")
  • run the function with my_add(2, 3)

Types of testing

  • there are many ways to test your code! We have:
    • unit tests in your local development environment or CI/CD pipeline
    • integration tests to check that your code works with other code
    • quality assurance tests to check that your code meets a certain standard (usually done by a separate team)
    • end-to-end tests to check that your code works in a real-world (production) environment

Unit Testing

  • for testing the smallest unit of code (functions)
  • they are simply functions that test your code to make sure it does what it is supposed to
  • this can mean it gives the correct output with expected input, but also that it errors as expected when you give it faulty input
  • allows you to change your code and quickly check it still works
  • allows you to sleep at night
  • done in R using the testthat package
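
As a sketch of testing both sides, correct output for good input and a clear error for faulty input, here is a made-up function safe_sqrt() with two tests. expect_equal() and expect_error() are testthat functions; everything else is hypothetical.

```r
library(testthat)

# Hypothetical function: square root that refuses negative input
safe_sqrt <- function(x) {
  if (any(x < 0)) {
    stop("x must be non-negative")
  }
  return(sqrt(x))
}

test_that("safe_sqrt() returns the square root for valid input", {
  expect_equal(safe_sqrt(9), 3)
  expect_equal(safe_sqrt(c(0, 4)), c(0, 2))
})

test_that("safe_sqrt() errors on negative input", {
  # the second argument is a pattern the error message must match
  expect_error(safe_sqrt(-1), "non-negative")
})
```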

Unit testing - practice!

  • create a function called my_multiply() that takes two arguments, x and y and returns their product
  • write a test for this function that checks that my_multiply(2, 3) returns 6
  • use the example of my_add below and modify it:
my_add <- function(x, y) {
  return(x + y)
}

testthat::test_that("my_add() works as expected",
{
  testthat::expect_equal(my_add(2, 3), 5)
})

Putting it all together (optional extra homework)

  • create a GitHub account
  • install devtools and testthat
  • make a personal package following the outline here
  • add a function to the package for a common task you do (can be simple)
  • write a test for your function using testthat
  • push it to GitHub and (double optional) send me the link!

Putting it all together (optional extra homework)

  • this:
    • is a great way to learn how to write functions and test them
    • will set you up for future projects
    • is entirely optional and not required for the course

Homework