Wrangling Data 3: (factors, dates and functions)
2024-10-21
Use the QR code, this link or type code #3267575 into slido.com
Please install and load the following packages for today
A factor:
We can create them using the factor() function, which takes the format: factor(vector, levels, labels)
We can make factors that have an inherent order
monthLevels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
data <- factor(c("Dec", "Jun", "Apr"))
data
[1] Dec Jun Apr
Levels: Apr Dec Jun
But sorting them may not give us what we expect
Strings that aren’t in your levels are silently set as NA
[1] Dec <NA> Apr
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
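The behaviour described above can be reproduced with a short base R sketch. The misspelled value "Juny" is a made-up example (an assumption, not taken from the slides), chosen to show a string that is missing from the levels.

```r
# Month abbreviations in calendar order
monthLevels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Without explicit levels, factor() orders the levels alphabetically,
# so sorting gives Apr, Dec, Jun rather than calendar order
alphabetical <- factor(c("Dec", "Jun", "Apr"))
sort(alphabetical)

# With explicit levels, sorting follows the order of monthLevels
in_calendar_order <- factor(c("Dec", "Jun", "Apr"), levels = monthLevels)
sort(in_calendar_order)

# "Juny" (hypothetical typo) is not in monthLevels, so it becomes NA
# and factor() gives no warning
with_typo <- factor(c("Dec", "Juny", "Apr"), levels = monthLevels)
with_typo
```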
By contrast, readr’s parse_factor() will warn you
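A minimal sketch of that difference, assuming the readr package is installed. The input vector with the typo "Juny" is hypothetical.

```r
library(readr)

monthLevels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Hypothetical input containing a value that is not a valid level
months_raw <- c("Dec", "Juny", "Apr")

# parse_factor() still returns NA for the bad value,
# but it also raises a warning about the parsing failure
parsed <- parse_factor(months_raw, levels = monthLevels)

# problems() lists which values could not be parsed
problems(parsed)
```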
Using diamonds:
Use dplyr::count() to show how many rows there are for each factor level in the ‘cut’ column. Is it any different to using forcats::fct_count()?
Use dplyr::arrange(desc()) to sort the cut column in descending order. What were the rows sorted by?
*strictly for humans
fct_inorder()
fct_relevel()
fct_recode(): for changing the names of existing levels by hand
myFactor <- factor(c("M", "F", "O", "M", "P", "M",
"F", "F", "F", "M", "O", "P"))
myFactorPub <- fct_recode(myFactor, male = "M", female = "F",
unknown = "O", unknown = "P")
myFactorPub
[1] male female unknown male unknown male female female female
[10] male unknown unknown
Levels: female male unknown
Using gss_cat
fct_rev() reverses the order of the levels
fct_lump() combines least common factor levels into ‘other’
fct_expand() adds new levels to your factors
fct_relabel() automatically relabels factor levels
fct_infreq() orders factors from most frequent to least frequent
Using gss_cat
fct_infreq() and fct_rev()
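The forcats helpers listed above can be tried on a small made-up factor before moving to gss_cat. This is only a sketch: the pets vector is hypothetical, and fct_lump_n() is used here as the more specific sibling of fct_lump(). Note that the lumped level is labelled "Other" by default.

```r
library(forcats)

# Hypothetical data: dog x4, cat x3, fish x2, newt x1
pets <- factor(c("dog", "cat", "dog", "fish", "dog",
                 "cat", "newt", "dog", "cat", "fish"))

levels(pets)                       # default levels are alphabetical
levels(fct_rev(pets))              # same levels, reversed
levels(fct_infreq(pets))           # most frequent level first
levels(fct_expand(pets, "bird"))   # adds an unused level at the end
levels(fct_relabel(pets, toupper)) # applies a function to every label
levels(fct_lump_n(pets, n = 2))    # keeps the 2 most common, lumps the rest
```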
Using gss_cat
Errors may be obvious:
Or they can be more subtle:
Dates may be written as YYYY-MM-DD, MM/DD/YYYY, or even DD-MM-YYYY.
ymd() converts strings written in year-month-day order (e.g. YYYY-MM-DD) to dates
mdy(), dmy(), ymd_hms() etc.
today() returns the current date
year(), month(), day() extract the year, month, and day from a date object
interval() creates an interval between two dates
time_length(), which calculates the length of an interval in a specified unit
Using the lakers dataset which comes with lubridate
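As a quick illustration of the lubridate helpers listed above, here is a sketch that assumes lubridate is installed. The dates are arbitrary examples and are not taken from the lakers data.

```r
library(lubridate)

# The same calendar day written in three different orders
d_ymd <- ymd("2024-10-21")
d_mdy <- mdy("10/21/2024")
d_dmy <- dmy("21-10-2024")

# All three parse to the same Date
d_ymd == d_mdy
d_ymd == d_dmy

# Pull out the components of a date
year(d_ymd)
month(d_ymd)
day(d_ymd)

# Length of the interval from 1 January 2024 to that date, in days
since_new_year <- interval(ymd("2024-01-01"), d_ymd)
time_length(since_new_year, unit = "days")
```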
some_function below:
Using the economics dataset from ggplot2
Functions in R:
Call your functions after creating them to check their output
return() statement. Choose an appropriate name.
Write a function my_divide() which takes two arguments, ‘x’ and ‘y’, and returns x divided by y. Use an explicit return statement.
Modify the my_divide() function so that there is another argument called ‘tol’. Set it to a very low value, and add it to y before dividing.
Try to decipher the code below: what does it do? Does it work? Are there errors?
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
Is this better?
Even better?
Perfect?
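One way the repeated rescaling could be tidied up is to wrap it in a function and apply that function to every column. This is a sketch and not necessarily the version shown on the original slides; the helper name rescale01() is an assumption. Writing the calculation once also removes the chance of a copy-paste slip between columns.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)        # rng[1] is the min, rng[2] is the max
  (x - rng[1]) / (rng[2] - rng[1])
}

set.seed(1)  # fixed seed so the example is reproducible
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# Apply the same function to every column, keeping the data frame structure
df[] <- lapply(df, rescale01)

# Every column should now run from 0 to 1
sapply(df, range)
```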
summarise()!
The tilde (~) is used to create a lambda function, and .x is used to refer to the input.
. can also be used to refer to the input; this is the same as .x.
Both . and .x are common conventions
# example use of an anonymous function with summarise
gapminder |>
group_by(continent) |>
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) |>
head(3)
continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|
Africa | 1979.5 | 48.8653301282 | 9916003.14263 | 2193.75457829 |
Americas | 1979.5 | 64.6587366667 | 24504794.99667 | 7136.11035559 |
Asia | 1979.5 | 60.0649032323 | 77038721.97222 | 7902.15042805 |
is equivalent to
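A small sketch of that equivalence on a plain list, assuming purrr is installed and R is version 4.1 or later (needed for the \(x) shorthand). The values list is made up for illustration.

```r
library(purrr)

# Hypothetical input: two numeric vectors, one with a missing value
values <- list(c(1, 2, NA), c(4, 6))

# Formula shorthand: ~ creates the function and .x is its input
via_formula <- map_dbl(values, ~ mean(.x, na.rm = TRUE))

# The same thing as a full anonymous function
via_function <- map_dbl(values, function(x) mean(x, na.rm = TRUE))

# The same thing with the base R backslash shorthand
via_backslash <- map_dbl(values, \(x) mean(x, na.rm = TRUE))

# All three give the same result
identical(via_formula, via_function)
identical(via_formula, via_backslash)
```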
if statements have the format:
We can add multiple else if statements to these
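A minimal sketch of the if / else if / else pattern in base R. The function name and the cut-off values are made up for illustration.

```r
# Hypothetical helper: label a single life expectancy value
describe_life_exp <- function(life_exp) {
  if (life_exp < 50) {
    "Low"
  } else if (life_exp <= 70) {
    "Medium"
  } else {
    "High"
  }
}

describe_life_exp(28.8)
describe_life_exp(65)
describe_life_exp(81.2)
```

Note that if() checks a single TRUE or FALSE value, so this function works on one number at a time rather than on a whole column.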
ifelse() lets us use vectorised if-else statements (note, dplyr has a version called dplyr::if_else() that’s a bit stricter)
switch() function
dplyr::case_when() handles multiple vectorised if_else() statements
# example use of case_when to simplify multiple if_else statements
gapminder |>
mutate(lifeExp_category = case_when(
lifeExp < 50 ~ "Low",
lifeExp >= 50 & lifeExp <= 70 ~ "Medium",
lifeExp > 70 ~ "High"
)) |>
head(3)
country | continent | year | lifeExp | pop | gdpPercap | lifeExp_category |
---|---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453145 | Low |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530296 | Low |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007100 | Low |
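For comparison with case_when(), here is a base R sketch of the two other tools mentioned earlier, ifelse() and switch(). The scores vector and the centre() helper are hypothetical.

```r
# ifelse() is vectorised: it tests every element and keeps NA as NA
scores <- c(35, 72, NA, 50)
result <- ifelse(scores >= 50, "pass", "fail")
result

# switch() picks one branch based on a single value
# (hypothetical helper that summarises a vector in the requested way)
centre <- function(x, type) {
  switch(type,
    mean   = mean(x),
    median = median(x),
    stop("unknown type: ", type)  # unnamed last argument is the fallback
  )
}

centre(c(1, 2, 10), "mean")
centre(c(1, 2, 10), "median")
```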
Write a function my_add() that takes two arguments, x and y, and returns their sum (do not paste it into the terminal!)
source("path/to/script.R")
my_add(2, 3)
testthat package
my_multiply() that takes two arguments, x and y, and returns their product
my_multiply(2, 3) returns 6
devtools and testthat
testthat
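A sketch of what a testthat check for my_multiply() might look like, assuming the testthat package is installed. The function body and the second expectation are assumptions added for illustration.

```r
library(testthat)

# One possible implementation of the function described above
my_multiply <- function(x, y) {
  x * y
}

# Each expectation fails with an informative message if the values differ
test_that("my_multiply returns the product of x and y", {
  expect_equal(my_multiply(2, 3), 6)
  expect_equal(my_multiply(-1, 4), -4)
})
```

Keeping the function in a script and the tests in a separate file means the checks can be re-run whenever the function changes.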