Wrangling Data 3: (factors, dates and functions)
2024-10-21
Use the QR code, this link or type code #3267575 into slido.com
Please install and load the following packages for today
A factor:
We can create them using the factor() function, which takes the format: factor(vector, levels, labels)
We can make factors that have an inherent order
monthLevels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
data <- factor(c("Dec", "Jun", "Apr"))
data
[1] Dec Jun Apr
Levels: Apr Dec Jun
But sorting them may not give us what we expect
Strings that aren’t in your levels are silently set as NA
[1] Dec <NA> Apr
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
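A minimal base R sketch of both points above: how levels control sort order, and how a string outside the levels becomes NA without any message. The vector month_levels repeats monthLevels so the block runs on its own, and the misspelling "Juy" is a hypothetical input chosen for illustration.

```r
# Month abbreviations in calendar order (same values as monthLevels above)
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Without explicit levels, the levels (and so the sort order) are alphabetical
alphabetical <- factor(c("Dec", "Jun", "Apr"))
sort(alphabetical)   # Apr Dec Jun

# With explicit levels, sorting follows calendar order
calendar <- factor(c("Dec", "Jun", "Apr"), levels = month_levels)
sort(calendar)       # Apr Jun Dec

# "Juy" (hypothetical typo) is not one of the levels, so it silently becomes NA
with_typo <- factor(c("Dec", "Juy", "Apr"), levels = month_levels)
with_typo
```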
By contrast, readr’s parse_factor() will warn you
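As a hedged sketch of that behaviour, the same kind of input can be passed to parse_factor(). The misspelling "Juy" is again a hypothetical input, and the base R constant month.abb is used here in place of monthLevels to keep the block self-contained.

```r
library(readr)

# month.abb is base R's built-in vector "Jan", "Feb", ..., "Dec"
month_levels <- month.abb

# parse_factor() still returns NA for the bad value, but raises a warning
parsed <- parse_factor(c("Dec", "Juy", "Apr"), levels = month_levels)
parsed

# problems() lists the values that could not be parsed
problems(parsed)
```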
Using diamonds, use:
- dplyr::count() to show how many rows there are for each factor level in the ‘cut’ column. Is it any different to using forcats::fct_count()?
- dplyr::arrange(desc()) to sort the cut column in descending order. What were the rows sorted by?
*strictly for humans
fct_inorder()
fct_relevel()
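A short sketch of how these two functions change level order, assuming a small made-up factor called sizes (not part of the course data):

```r
library(forcats)

# Hypothetical example factor; default levels are alphabetical
sizes <- factor(c("medium", "small", "large", "small"))
levels(sizes)                       # "large" "medium" "small"

# fct_inorder(): levels in order of first appearance in the data
levels(fct_inorder(sizes))          # "medium" "small" "large"

# fct_relevel(): set the order by hand
levels(fct_relevel(sizes, "small", "medium", "large"))

# Naming only some levels moves them to the front and keeps the rest as they were
levels(fct_relevel(sizes, "small")) # "small" "large" "medium"
```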
For changing the names of existing levels by hand
myFactor <- factor(c("M", "F", "O", "M", "P", "M",
"F", "F", "F", "M", "O", "P"))
myFactorPub <- fct_recode(myFactor, male = "M", female = "F",
unknown = "O", unknown = "P")
myFactorPub
[1] male female unknown male unknown male female female female
[10] male unknown unknown
Levels: female male unknown
Using gss_cat
- fct_rev() reverses the order of the levels
- fct_lump() combines the least common factor levels into ‘other’
- fct_expand() adds new levels to your factors
- fct_relabel() automatically relabels factor levels
- fct_infreq() orders factor levels from most frequent to least frequent
Using gss_cat
fct_infreq() and fct_rev()
Using gss_cat
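The helpers listed above can be tried on a small made-up factor before moving to gss_cat. The pets vector below is a hypothetical example, chosen so that no two levels have the same count.

```r
library(forcats)

# Hypothetical data: dog appears 3 times, cat twice, fish once
pets <- factor(c("dog", "cat", "dog", "fish", "dog", "cat"))
levels(pets)                         # "cat" "dog" "fish"

# Most frequent level first
levels(fct_infreq(pets))             # "dog" "cat" "fish"

# Combine with fct_rev() to get least frequent first
levels(fct_rev(fct_infreq(pets)))    # "fish" "cat" "dog"

# Keep the single most common level and lump the rest together
levels(fct_lump(pets, n = 1))        # "dog" "Other"

# Add a level that does not occur in the data yet
levels(fct_expand(pets, "newt"))     # "cat" "dog" "fish" "newt"

# Relabel every level with a function
levels(fct_relabel(pets, toupper))   # "CAT" "DOG" "FISH"
```

Putting fct_infreq() inside fct_rev() is a common pattern for bar charts, where the reversed order places the largest bar last.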
Errors may be obvious:
Or they can be more subtle:
YYYY-MM-DD, MM/DD/YYYY, or even DD-MM-YYYY.
- ymd() converts strings to dates in the format YYYY-MM-DD (year-month-day)
- mdy(), dmy(), ymd_hms() etc.
- today() returns the current date
- year(), month(), day() extract the year, month, and day from a date object
- interval() creates an interval between two dates
- time_length(), which calculates the length of an interval in a specified unit
using the lakers dataset which comes with lubridate
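A brief sketch of the lubridate functions named above, using dates picked purely for illustration (the lakers data is not needed to run it):

```r
library(lubridate)

# The same date written three ways; each parser matches the order of its input
d1 <- ymd("2024-10-21")
d2 <- mdy("10/21/2024")
d3 <- dmy("21-10-2024")
d1 == d2 && d2 == d3       # TRUE

# Extract components from a date object
year(d1)                   # 2024
month(d1)                  # 10
day(d1)                    # 21

# Interval from the start of the year to d1, measured in different units
since_new_year <- interval(ymd("2024-01-01"), d1)
time_length(since_new_year, "days")   # 294
time_length(since_new_year, "weeks")  # 42

# today() depends on when the code is run
today()
```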
some_function below:
using the economics dataset from ggplot2
Functions in R:
call your functions after creating them to check their output
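As a minimal sketch of defining a function and then calling it to check the output, the example below uses a made-up function, my_power(), with one required argument and one argument that has a default value. The name and arguments are illustrative only.

```r
# my_power() is a hypothetical example: raise x to the power p (default 2)
my_power <- function(x, p = 2) {
  result <- x^p
  return(result)   # explicit return; the last evaluated value would also be returned
}

# Call the function straight away to check it behaves as expected
my_power(3)            # 9
my_power(2, p = 3)     # 8
my_power(c(1, 2, 3))   # 1 4 9 (works on vectors too)
```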
- return() statement. Choose an appropriate name.
- Write a function my_divide() which takes two arguments, ‘x’ and ‘y’ and returns x divided by y. Use an explicit return statement.
- Modify the my_divide() function so that there is another argument called ‘tol’. Set it to a very low value, and add it to y before dividing.
Try to decipher the code below: what does it do? Does it work? Are there errors?
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
Is this better?
Even better?
Perfect?
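One possible end point for this progression, offered as a sketch rather than as the code shown in the session: put the rescaling in a single function and apply it to every column, so there is nothing to copy, paste and mistype. The name rescale01() and the use of set.seed() are assumptions added for illustration.

```r
# Rescale a numeric vector so its minimum is 0 and its maximum is 1
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)   # rng[1] is the minimum, rng[2] the maximum
  (x - rng[1]) / (rng[2] - rng[1])
}

set.seed(1)  # only so the random numbers are reproducible
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# Apply the same function to every column; df[] keeps the data frame structure
df[] <- lapply(df, rescale01)

# Each column should now run from 0 to 1
sapply(df, range)
```

With dplyr loaded, the same step could be written as mutate(df, across(everything(), rescale01)).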
summarise()!
- The tilde (~) is used to create a lambda function, and the dot (.x) is used to refer to the input
- . can also be used to refer to the input - this is the same as .x
- . and .x are common conventions
# example use of an anonymous function with summarise
gapminder |>
group_by(continent) |>
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) |>
head(3)

| continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|
| Africa | 1979.5 | 48.8653301282 | 9916003.14263 | 2193.75457829 |
| Americas | 1979.5 | 64.6587366667 | 24504794.99667 | 7136.11035559 |
| Asia | 1979.5 | 60.0649032323 | 77038721.97222 | 7902.15042805 |
is equivalent to
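As a sketch of what such an equivalence looks like, the three calls below produce the same summary: the formula shorthand, a full anonymous function, and the backslash shorthand available from R 4.1. The built-in iris data is used here instead of gapminder only to keep the example self-contained.

```r
library(dplyr)

# Formula (lambda) shorthand: .x stands for each selected column in turn
with_formula <- iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# The same thing written as a full anonymous function
with_function <- iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), function(x) mean(x, na.rm = TRUE)))

# And with the backslash shorthand for function(x)
with_backslash <- iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

# Compare the results
all.equal(with_formula, with_function)
all.equal(with_function, with_backslash)
```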
if statements have the format:
We can add multiple else if statements to these
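A minimal sketch of the if / else if / else layout, wrapped in a made-up function, describe_number(), so it can be called with different inputs:

```r
# describe_number() is a hypothetical example; x is expected to be a single number
describe_number <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else {
    "positive"
  }
}

describe_number(-3)   # "negative"
describe_number(0)    # "zero"
describe_number(7)    # "positive"
```

The conditions are checked from top to bottom and only the first branch that is TRUE runs. if() expects a single TRUE or FALSE, so this form is not suited to whole vectors.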
- ifelse() lets us use vectorised if-else statements (note, dplyr has a version called dplyr::if_else() that’s a bit stricter)
- switch() function
- dplyr::case_when() handles multiple vectorised if_else() statements
# example use of case_when to simplify multiple if_else statements
gapminder |>
mutate(lifeExp_category = case_when(
lifeExp < 50 ~ "Low",
lifeExp >= 50 & lifeExp <= 70 ~ "Medium",
lifeExp > 70 ~ "High"
)) |>
head(3)

| country | continent | year | lifeExp | pop | gdpPercap | lifeExp_category |
|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453145 | Low |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530296 | Low |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007100 | Low |
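For comparison with case_when(), here is a short base R sketch of ifelse() and switch(). The temps vector and the centre() function are hypothetical examples.

```r
# ifelse() tests every element of a vector and returns a vector of the same length
temps <- c(-5, 12, 30)
ifelse(temps < 0, "freezing", "not freezing")
# "freezing" "not freezing" "not freezing"

# switch() picks one branch based on a single value;
# the final unnamed argument is the fallback when nothing matches
centre <- function(x, type) {
  switch(type,
    mean   = mean(x),
    median = median(x),
    stop("Unknown type: ", type)
  )
}

centre(c(1, 2, 10), "median")   # 2
centre(c(1, 2, 9), "mean")      # 4
```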
- my_add() that takes two arguments, x and y and returns their sum (do not paste it into the terminal!)
- source("path/to/script.R")
- my_add(2, 3)
- testthat package
- my_multiply() that takes two arguments, x and y and returns their product
- my_multiply(2, 3) returns 6
- devtools and testthat
- testthat
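A hedged sketch of what a testthat check for my_multiply() might look like. The function body and the wording of the test description are assumptions; in a package the function would normally sit in its own script and the test in a separate file.

```r
library(testthat)

# One possible definition of my_multiply()
my_multiply <- function(x, y) {
  return(x * y)
}

# test_that() groups related expectations under a readable description
test_that("my_multiply() returns the product of its arguments", {
  expect_equal(my_multiply(2, 3), 6)
  expect_equal(my_multiply(-1, 4), -4)
  expect_equal(my_multiply(0, 10), 0)
})
```

If an expectation fails, test_that() reports which one and shows the difference between the actual and expected values.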