met581 lecture 03

wrangling data 1: dplyr

matthew bracher-smith
gabriel mateus bernardo harrington

2024-10-17

Overview

Review

  • Introduction to R packages and data types
  • RMarkdown, Tibbles, and Readr
  • Homework

dplyr and magrittr

  • Overview of dplyr verbs
  • Using |> or %>% for data manipulation
  • Examples with Tibbles and Real-world Data

The Plan

  • Introduce tidy data
  • Get familiar with the 6 main verbs of dplyr
  • Understand what pipes do
  • Be able to manipulate and summarise data
  • Know when not to use magrittr and dplyr

The Hadleyverse

The Tidyverse

Tidy Data

  • Each observation must have its own row
  • Each variable must have its own column
  • Each value must have its own cell

Tidy Data

We should also keep to one type of observational unit per table

dplyr

The 6 main verbs in dplyr:

  1. select() → select variables by name
  2. filter() → return rows with matching conditions
  3. arrange() → arrange rows by variables
  4. mutate() → add new variables
  5. group_by() → return grouping of variables
  6. summarise() → reduces multiple values down to a single value

dplyr

ALL verbs follow the same format:

  • verb(dataset, options)

Gapminder

Load gapminder and dplyr

library(gapminder)
library(dplyr)

Also install/load:

  • stringr
  • nycflights13
  • devtools
  • skimr

Select

Select variables by name

# explicitly naming each column
select(gapminder, country, year, lifeExp)

Select

Select variables by name

# define a range to select from
select(gapminder, country:year) |> head(4) # select all columns from country to year
country continent year
Afghanistan Asia 1952
Afghanistan Asia 1957
Afghanistan Asia 1962
Afghanistan Asia 1967

Select

Select variables by name

# say which columns NOT to select
select(gapminder, -country) |> head(4) # select all columns BUT country
continent year lifeExp pop gdpPercap
Asia 1952 28.801 8425333 779.4453145
Asia 1957 30.332 9240934 820.8530296
Asia 1962 31.997 10267083 853.1007100
Asia 1967 34.020 11537966 836.1971382

Select

There are lots of helper functions that work with select, e.g.:

  • starts_with(“Hap”) -> ALL columns whose name starts with “Hap”
  • ends_with(“ppy”) -> ALL columns whose name ends with “ppy”
  • contains(“app”) -> ALL columns whose name contains “app”
  • **matches(“^(SNP|CHR)_”)** -> ALL columns who name match a regular expression
  • num_range(“x”, 1980:1983) -> ALL columns named x1980, x1981, x1982, and x1983
  • one_of(char_vector) -> ALL columns who name appears in character vector (char_vector)
# e.g.
library('nycflights13')
select(flights, starts_with("dep")) |> head(4) # match columns by starting string

Select

dep_time dep_delay
517 2
533 4
542 2
544 -1
select(flights, matches('sched_(dep|arr)_time')) |> head(4) # match columns by regex
sched_dep_time sched_arr_time
515 819
529 830
540 850
545 1022

Select - Practice!

Using gapminder

  • Select the column range from country to population, but exclude continent

Using starwars - load with data("starwars")

  • Select name, height and all columns which end in the word ‘color’

Using flights - load with library(nycflights13)

  • Select all columns in the vector (‘year’, ‘month’, ‘day’, ‘carrier’, ‘flight’, ‘dest’)

Select - Extra Practice!

Using gapminder

  • Select country, year and population columns from gapminder

Using starwars

  • Select only the columns stored as characters (tip: use select_if())

Using flights

  • Select all columns ending in ‘time’, but exclude those starting with ‘sched’

Filter

return rows with matching conditions

# using a single filter
filter(gapminder, year == 1997L)

Filter

return rows with matching conditions

##| output: false
# combining multiple filters in a single statement
country_names <- c(
  'Afghanistan',
  'Angola',
  'Belgium',
  'China',
  'Mauritania',
  'Mauritius',
  'Mongolia',
  'Korea, Rep.',
  'Sri Lanka',
  'Saudi Arabia',
  'Vietnam',
  'Yemen, Rep.'
)
filter(
  gapminder,
  year == 1997L,
  continent != "asia",
  pop >= 1000000L,
  country %in% country_names
) |>
  head()

Filter

country continent year lifeExp pop gdpPercap
Afghanistan Asia 1997 41.763 22227415 635.341351
Angola Africa 1997 40.963 9875024 2277.140884
Belgium Europe 1997 77.530 10199787 27561.196630
China Asia 1997 70.426 1230075000 2289.234136
Korea, Rep. Asia 1997 74.647 46173816 15993.527960
Mauritania Africa 1997 60.430 2444741 1483.136136

Filter

combining filters

# multiple filters with and
filter(gapminder, year == 1997L & gdpPercap > 1000) |> head(4)
country continent year lifeExp pop gdpPercap
Albania Europe 1997 72.950 3428038 3193.054604
Algeria Africa 1997 69.152 29072015 4797.295051
Angola Africa 1997 40.963 9875024 2277.140884
Argentina Americas 1997 73.275 36203463 10967.281950

Filter

combining filters

# multiple filters with or
filter(gapminder, year == 1997L | year == 1967L) |> head(4)
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1967 34.020 11537966 836.1971382
Afghanistan Asia 1997 41.763 22227415 635.3413510
Albania Europe 1967 66.220 1984060 2760.1969310
Albania Europe 1997 72.950 3428038 3193.0546040

Filter - Practice!

Using gapminder

  • Keep only rows where continent is Americas or Europe
  • Keep only rows where country is Canada or Australia and the year is before 1974

Using starwars

  • Keep only the rows where name contains ‘light’ or eye_color contains ‘blue’ using filter() and grepl()

Filter - Extra Practice!

Using gapminder

  • Keep only the rows with life expectancy less than 35

Using starwars

  • Remove rows with brown hair colour
  • Keep females with brown or blue eyes

Arrange

arrange rows by variables

Arrange

arrange rows by variables

# arranging by multiple columns
arrange(gapminder, desc(lifeExp), pop) |> head(4)
country continent year lifeExp pop gdpPercap
Japan Asia 2007 82.603 127467972 31656.06806
Hong Kong, China Asia 2007 82.208 6980412 39724.97867
Japan Asia 2002 82.000 127065841 28604.59190
Iceland Europe 2007 81.757 301931 36180.78919

note: order is important when using multiple variables!

Arrange - Practice!

Using gapminder

  • arrange by country then continent
  • arrange by country then descending year

Mutate

add new variables

# single variables
mutate(gapminder, gdp_billion = gdpPercap * pop / 10^9)

Mutate - Practice!

Using starwars

  • Add a new column called ‘BMI’, created using height * mass

Using flights

  • Dichotomise height into the strings ‘tall’ and ‘short’ and assign as factor to ‘height_dichot’ (tip: use ifelse())

Mutate - Extra Practice!

Using starwars

  • Make the values in hair_color, skin_color, eye_color and gender Title Case (tip: use str_to_title() from stringr)

Using flights

  • Assume ‘EWR’ is the code for unknown airport. Recode it to NA in columns ‘origin’ and ‘dest’ using na_if()

Combining Operations

Let’s combine lots of operations together to get the data we want!

It’s as simple as

df_no_pop <- select(gapminder, -pop)
df_no_pop_just_oceania_1980 <- filter(
  df_no_pop,
  continent == 'Oceania',
  year > 1980L
)
df_no_pop_just_oceania_1980_sorted_by_gdp <- arrange(
  df_no_pop_just_oceania_1980,
  desc(gdpPercap)
)

…right???!

The Pipe |>

WRONG!! (ish)

Instead of this:

df_no_pop <- select(gapminder, -pop)
df_just_oceania_1980 <- filter(df_no_pop, continent == 'Oceania', year > 1980L)
df_sorted_by_gdp <- arrange(df_just_oceania_1980, desc(gdpPercap))

Do this:

gapminder |>
  select(-pop) |>
  filter(continent == 'Oceania', year > 1980L) |>
  arrange(desc(gdpPercap))

The Pipe |>

Pipes:

  • pass the information forward to the next verb
  • are a really useful way of expressing a series of operations
  • allow us to quickly see what is being done
  • mean we focus on the verbs, not the nouns

The Pipe |> Native Pipes

  • used to be in the magrittr package, then dplyr
  • are now part of base R! (as of version 4.1.0)
  • we can use them with |> (no need to load tidyverse)
  • every time you see |> here, you could use %>% instead (but load dplyr first)
  • you will still see %>% in a lot of code in the wild, so it’s good to know both
  • you will need to use %>% yourself if you’re forced to use an older version of R

Pipe |> Practice()

Using starwars

  • Filter for all rows that don’t contain NAs and select columns that are characters, then create a new boolean column called ‘hair_eye_mismatch’, which is TRUE for anyone with exclusively brown hair and blue eyes

tip: use is.character and ifelse() to create the new column following the format:

starwars |>
  na.omit() |>
  select_if(is.character) |>
  mutate()
name hair_color skin_color eye_color sex gender homeworld species
Luke Skywalker blond fair blue male masculine Tatooine Human
Darth Vader none white yellow male masculine Tatooine Human
Leia Organa brown light brown female feminine Alderaan Human
Owen Lars brown, grey light blue male masculine Tatooine Human
Beru Whitesun Lars brown light blue female feminine Tatooine Human
Biggs Darklighter black light brown male masculine Tatooine Human
Obi-Wan Kenobi auburn, white fair blue-gray male masculine Stewjon Human
Anakin Skywalker blond fair blue male masculine Tatooine Human
Chewbacca brown unknown blue male masculine Kashyyyk Wookiee
Han Solo brown fair brown male masculine Corellia Human
Wedge Antilles brown fair hazel male masculine Corellia Human
Palpatine grey pale yellow male masculine Naboo Human
Boba Fett black fair brown male masculine Kamino Human
Bossk none green red male masculine Trandosha Trandoshan
Lando Calrissian black dark brown male masculine Socorro Human
Lobot none light blue male masculine Bespin Human
Ackbar none brown mottle orange male masculine Mon Cala Mon Calamari
Wicket Systri Warrick brown brown brown male masculine Endor Ewok
Padmé Amidala brown light brown female feminine Naboo Human
Jar Jar Binks none orange orange male masculine Naboo Gungan
Darth Maul none red yellow male masculine Dathomir Zabrak
Ayla Secura none blue hazel female feminine Ryloth Twi’lek
Mace Windu none dark brown male masculine Haruun Kal Human
Ki-Adi-Mundi white pale yellow male masculine Cerea Cerean
Plo Koon none orange black male masculine Dorin Kel Dor
Luminara Unduli black yellow blue female feminine Mirial Mirialan
Barriss Offee black yellow blue female feminine Mirial Mirialan
Dooku white fair brown male masculine Serenno Human
Jango Fett black tan brown male masculine Concord Dawn Human

Pipe |> Practice_Extra()

Using flights

  • Show the carrier and flight number for flights with arrival delays greater than 10 and distances over 1000, sorted by descending distance.

Group_by

Useful only in conjunction with other methods

# grouping by single variables
gapminder |>
  group_by(continent) |>
  head(4)
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453145
Afghanistan Asia 1957 30.332 9240934 820.8530296
Afghanistan Asia 1962 31.997 10267083 853.1007100
Afghanistan Asia 1967 34.020 11537966 836.1971382

Group_by

Useful only in conjunction with other methods

# grouping by multiple variables
gapminder |>
  group_by(country, continent) |>
  head(4)

Summarise

reduces multiple values down to a single value

# get mean of life expectancy
summarise(gapminder, mean_lifexp = mean(lifeExp))
mean_lifexp
59.4744393662
# get mean of life expectancy and mean of GDP
summarise(gapminder, mean_lifexp = mean(lifeExp), mean_gdp = mean(gdpPercap))
mean_lifexp mean_gdp
59.4744393662 7215.32708121

Summarise

Combining group_by() and summarise() gives us so much more power!

gapminder |>
  select(-pop) |>
  filter(continent == 'Oceania', year > 1980L) |>
  arrange(desc(gdpPercap)) |>
  group_by(country) |>
  summarise(
    n_years = n(),
    Mean_Life_Exp = mean(lifeExp),
    SD_Life_Exp = sd(lifeExp),
    Max_GDP = max(gdpPercap)
  )
country n_years Mean_Life_Exp SD_Life_Exp Max_GDP
Australia 6 78.1758333333 2.45882576989 34435.36744
New Zealand 6 76.8923333333 2.55215843291 25185.00911

Summarise

We can use lots of built-in functions when using summarise:

Summarise

Note: Don’t forget to include na.rm = TRUE if NAs are present

# e.g.
data("airquality")
airquality |>
  as_tibble() |>
  summarise(mean_ozone = mean(Ozone, na.rm = TRUE))
mean_ozone
42.1293103448

Summarise - Practice!

Using gapminder

  • Get the median and standard deviation for gdpPercap

Using airquality

  • Convert to tibble, remove Month and Day, keep Temperatures above 60, then summarise Solar.R with number, mean, median and standard deviation

Summarise - Extra Practice!

Using airquality

  • Summarise the mean values for integer columns only using summarise_if() and is.integer()

Using starwars

  • What is the mean height and weight of all Droids? Include a count of how many droids there are
  • We want to compare Humans from different planets. Show the number of people and their mean height and mass, broken down by homeworld then gender. Make sure you show all rows when printing by piping the last line into print(n = x), where x is a suitably high number of rows

Prettier summaries with skimr

library(skimr)
gapminder |>
  skim()

Other really useful verbs and tools

  • dplyr::glimpse() instead of str()
  • use everything() to re-order columns
  • use n_distinct() instead of length(unique())
  • dplyr::rename(new_name = old_name) for renaming columns
  • dplyr::select_() and other variants
  • dplyr::near() and dplyr::between()
  • coalesce(), recode() and case_when() from dplyr
  • tibble::rownames_to_column()

Things to be aware of

  • Packages can have functions with the same name - sometimes it helps to be specific e.g. dplyr::select()
  • We can now used the conflicted package to be more explicit about this!
  • Filtering using grepl() is better done using stringr’s str_detect() (introduced tomorrow)
  • stringr (tomorrow) combined with dplyr (more tomorrow) and the native pipe (or maggritr pipe if needed) will cover a lot of your everyday needs

Should we always use dplyr?

What if:

  • you have a single operation? or 20?
  • the flow of operations isn’t linear?
  • you need to inspect an intermediate step?
  • speed is more important to you than readability?

Homework

  • Quarto file
  • Regex practice before tomorrow, like this site

Suggested Reading

The Tidyverse Life