MET581 Lecture 03

Wrangling Data 1: dplyr

Matthew Bracher-Smith

Gabriel Mateus Bernardo Harrington

2024-10-17

Overview

Review

Introduction to R packages and data types
RMarkdown, Tibbles, and Readr
Homework

dplyr and magrittr

Overview of dplyr verbs
Using |> or %>% for data manipulation
Examples with Tibbles and Real-world Data

The Plan

Introduce tidy data
Get familiar with the 6 main verbs of dplyr
Understand what pipes do
Be able to manipulate and summarise data
Know when not to use magrittr and dplyr

The Hadleyverse

The Tidyverse

Tidy Data

Each observation must have its own row
Each variable must have its own column
Each value must have its own cell

Tidy Data

We should also keep to one type of observational unit per table

dplyr

The 6 main verbs in dplyr:

select() → select variables by name
filter() → return rows with matching conditions
arrange() → arrange rows by variables
mutate() → add new variables
group_by() → return grouping of variables
summarise() → reduces multiple values down to a single value

dplyr

ALL verbs follow the same format:

verb(dataset, options)

Gapminder

Load gapminder and dplyr

library(gapminder)
library(dplyr)

Also install/load:

stringr
nycflights13
devtools
skimr

Select

Select variables by name

# explicitly naming each column
select(gapminder, country, year, lifeExp)

Select

Select variables by name

# define a range to select from
select(gapminder, country:year) |> head(4) # select all columns from country to year

country	continent	year
Afghanistan	Asia	1952
Afghanistan	Asia	1957
Afghanistan	Asia	1962
Afghanistan	Asia	1967

Select

Select variables by name

# say which columns NOT to select
select(gapminder, -country) |> head(4) # select all columns BUT country

continent	year	lifeExp	pop	gdpPercap
Asia	1952	28.801	8425333	779.4453145
Asia	1957	30.332	9240934	820.8530296
Asia	1962	31.997	10267083	853.1007100
Asia	1967	34.020	11537966	836.1971382

Select

There are lots of helper functions that work with select, e.g.:

starts_with(“Hap”) -> ALL columns whose name starts with “Hap”
ends_with(“ppy”) -> ALL columns whose name ends with “ppy”
contains(“app”) -> ALL columns whose name contains “app”
**matches(“^(SNP|CHR)_”)** -> ALL columns who name match a regular expression
num_range(“x”, 1980:1983) -> ALL columns named x1980, x1981, x1982, and x1983
one_of(char_vector) -> ALL columns who name appears in character vector (char_vector)

# e.g.
library('nycflights13')
select(flights, starts_with("dep")) |> head(4) # match columns by starting string

Select

dep_time	dep_delay
517	2
533	4
542	2
544	-1

select(flights, matches('sched_(dep|arr)_time')) |> head(4) # match columns by regex

sched_dep_time	sched_arr_time
515	819
529	830
540	850
545	1022

Select - Practice!

Using gapminder

Select the column range from country to population, but exclude continent

Using starwars - load with data("starwars")

Select name, height and all columns which end in the word ‘color’

Using flights - load with library(nycflights13)

Select all columns in the vector (‘year’, ‘month’, ‘day’, ‘carrier’, ‘flight’, ‘dest’)

Select - Extra Practice!

Using gapminder

Select country, year and population columns from gapminder

Using starwars

Select only the columns stored as characters (tip: use select_if())

Using flights

Select all columns ending in ‘time’, but exclude those starting with ‘sched’

Filter

return rows with matching conditions

# using a single filter
filter(gapminder, year == 1997L)

Filter

return rows with matching conditions

##| output: false
# combining multiple filters in a single statement
country_names <- c(
  'Afghanistan',
  'Angola',
  'Belgium',
  'China',
  'Mauritania',
  'Mauritius',
  'Mongolia',
  'Korea, Rep.',
  'Sri Lanka',
  'Saudi Arabia',
  'Vietnam',
  'Yemen, Rep.'
)
filter(
  gapminder,
  year == 1997L,
  continent != "asia",
  pop >= 1000000L,
  country %in% country_names
) |>
  head()

Filter

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1997	41.763	22227415	635.341351
Angola	Africa	1997	40.963	9875024	2277.140884
Belgium	Europe	1997	77.530	10199787	27561.196630
China	Asia	1997	70.426	1230075000	2289.234136
Korea, Rep.	Asia	1997	74.647	46173816	15993.527960
Mauritania	Africa	1997	60.430	2444741	1483.136136

Filter

combining filters

# multiple filters with and
filter(gapminder, year == 1997L & gdpPercap > 1000) |> head(4)

country	continent	year	lifeExp	pop	gdpPercap
Albania	Europe	1997	72.950	3428038	3193.054604
Algeria	Africa	1997	69.152	29072015	4797.295051
Angola	Africa	1997	40.963	9875024	2277.140884
Argentina	Americas	1997	73.275	36203463	10967.281950

Filter

combining filters

# multiple filters with or
filter(gapminder, year == 1997L | year == 1967L) |> head(4)

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1967	34.020	11537966	836.1971382
Afghanistan	Asia	1997	41.763	22227415	635.3413510
Albania	Europe	1967	66.220	1984060	2760.1969310
Albania	Europe	1997	72.950	3428038	3193.0546040

Filter - Practice!

Using gapminder

Keep only rows where continent is Americas or Europe
Keep only rows where country is Canada or Australia and the year is before 1974

Using starwars

Keep only the rows where name contains ‘light’ or eye_color contains ‘blue’ using filter() and grepl()

Filter - Extra Practice!

Using gapminder

Keep only the rows with life expectancy less than 35

Using starwars

Remove rows with brown hair colour
Keep females with brown or blue eyes

Arrange

arrange rows by variables

Arrange

arrange rows by variables

# arranging by multiple columns
arrange(gapminder, desc(lifeExp), pop) |> head(4)

country	continent	year	lifeExp	pop	gdpPercap
Japan	Asia	2007	82.603	127467972	31656.06806
Hong Kong, China	Asia	2007	82.208	6980412	39724.97867
Japan	Asia	2002	82.000	127065841	28604.59190
Iceland	Europe	2007	81.757	301931	36180.78919

note: order is important when using multiple variables!

Arrange - Practice!

Using gapminder

arrange by country then continent
arrange by country then descending year

Mutate

add new variables

# single variables
mutate(gapminder, gdp_billion = gdpPercap * pop / 10^9)

Mutate - Practice!

Using starwars

Add a new column called ‘BMI’, created using height * mass

Using flights

Dichotomise height into the strings ‘tall’ and ‘short’ and assign as factor to ‘height_dichot’ (tip: use ifelse())

Mutate - Extra Practice!

Using starwars

Make the values in hair_color, skin_color, eye_color and gender Title Case (tip: use str_to_title() from stringr)

Using flights

Assume ‘EWR’ is the code for unknown airport. Recode it to NA in columns ‘origin’ and ‘dest’ using na_if()

Combining Operations

Let’s combine lots of operations together to get the data we want!

It’s as simple as

df_no_pop <- select(gapminder, -pop)
df_no_pop_just_oceania_1980 <- filter(
  df_no_pop,
  continent == 'Oceania',
  year > 1980L
)
df_no_pop_just_oceania_1980_sorted_by_gdp <- arrange(
  df_no_pop_just_oceania_1980,
  desc(gdpPercap)
)

…right???!

The Pipe |>

WRONG!! (ish)

Instead of this:

df_no_pop <- select(gapminder, -pop)
df_just_oceania_1980 <- filter(df_no_pop, continent == 'Oceania', year > 1980L)
df_sorted_by_gdp <- arrange(df_just_oceania_1980, desc(gdpPercap))

Do this:

gapminder |>
  select(-pop) |>
  filter(continent == 'Oceania', year > 1980L) |>
  arrange(desc(gdpPercap))

The Pipe |>

Pipes:

pass the information forward to the next verb
are a really useful way of expressing a series of operations
allow us to quickly see what is being done
mean we focus on the verbs, not the nouns

The Pipe |> Native Pipes

used to be in the magrittr package, then dplyr
are now part of base R! (as of version 4.1.0)
we can use them with |> (no need to load tidyverse)
every time you see |> here, you could use %>% instead (but load dplyr first)
you will still see %>% in a lot of code in the wild, so it’s good to know both
you will need to use %>% yourself if you’re forced to use an older version of R

Pipe |> Practice()

Using starwars

Filter for all rows that don’t contain NAs and select columns that are characters, then create a new boolean column called ‘hair_eye_mismatch’, which is TRUE for anyone with exclusively brown hair and blue eyes

tip: use is.character and ifelse() to create the new column following the format:

starwars |>
  na.omit() |>
  select_if(is.character) |>
  mutate()

name	hair_color	skin_color	eye_color	sex	gender	homeworld	species
Luke Skywalker	blond	fair	blue	male	masculine	Tatooine	Human
Darth Vader	none	white	yellow	male	masculine	Tatooine	Human
Leia Organa	brown	light	brown	female	feminine	Alderaan	Human
Owen Lars	brown, grey	light	blue	male	masculine	Tatooine	Human
Beru Whitesun Lars	brown	light	blue	female	feminine	Tatooine	Human
Biggs Darklighter	black	light	brown	male	masculine	Tatooine	Human
Obi-Wan Kenobi	auburn, white	fair	blue-gray	male	masculine	Stewjon	Human
Anakin Skywalker	blond	fair	blue	male	masculine	Tatooine	Human
Chewbacca	brown	unknown	blue	male	masculine	Kashyyyk	Wookiee
Han Solo	brown	fair	brown	male	masculine	Corellia	Human
Wedge Antilles	brown	fair	hazel	male	masculine	Corellia	Human
Palpatine	grey	pale	yellow	male	masculine	Naboo	Human
Boba Fett	black	fair	brown	male	masculine	Kamino	Human
Bossk	none	green	red	male	masculine	Trandosha	Trandoshan
Lando Calrissian	black	dark	brown	male	masculine	Socorro	Human
Lobot	none	light	blue	male	masculine	Bespin	Human
Ackbar	none	brown mottle	orange	male	masculine	Mon Cala	Mon Calamari
Wicket Systri Warrick	brown	brown	brown	male	masculine	Endor	Ewok
Padmé Amidala	brown	light	brown	female	feminine	Naboo	Human
Jar Jar Binks	none	orange	orange	male	masculine	Naboo	Gungan
Darth Maul	none	red	yellow	male	masculine	Dathomir	Zabrak
Ayla Secura	none	blue	hazel	female	feminine	Ryloth	Twi’lek
Mace Windu	none	dark	brown	male	masculine	Haruun Kal	Human
Ki-Adi-Mundi	white	pale	yellow	male	masculine	Cerea	Cerean
Plo Koon	none	orange	black	male	masculine	Dorin	Kel Dor
Luminara Unduli	black	yellow	blue	female	feminine	Mirial	Mirialan
Barriss Offee	black	yellow	blue	female	feminine	Mirial	Mirialan
Dooku	white	fair	brown	male	masculine	Serenno	Human
Jango Fett	black	tan	brown	male	masculine	Concord Dawn	Human

Pipe |> Practice_Extra()

Using flights

Show the carrier and flight number for flights with arrival delays greater than 10 and distances over 1000, sorted by descending distance.

Group_by

Useful only in conjunction with other methods

# grouping by single variables
gapminder |>
  group_by(continent) |>
  head(4)

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453145
Afghanistan	Asia	1957	30.332	9240934	820.8530296
Afghanistan	Asia	1962	31.997	10267083	853.1007100
Afghanistan	Asia	1967	34.020	11537966	836.1971382

Group_by

Useful only in conjunction with other methods

# grouping by multiple variables
gapminder |>
  group_by(country, continent) |>
  head(4)

Summarise

reduces multiple values down to a single value

# get mean of life expectancy
summarise(gapminder, mean_lifexp = mean(lifeExp))

mean_lifexp
59.4744393662

# get mean of life expectancy and mean of GDP
summarise(gapminder, mean_lifexp = mean(lifeExp), mean_gdp = mean(gdpPercap))

mean_lifexp	mean_gdp
59.4744393662	7215.32708121

Summarise

Combining group_by() and summarise() gives us so much more power!

gapminder |>
  select(-pop) |>
  filter(continent == 'Oceania', year > 1980L) |>
  arrange(desc(gdpPercap)) |>
  group_by(country) |>
  summarise(
    n_years = n(),
    Mean_Life_Exp = mean(lifeExp),
    SD_Life_Exp = sd(lifeExp),
    Max_GDP = max(gdpPercap)
  )

country	n_years	Mean_Life_Exp	SD_Life_Exp	Max_GDP
Australia	6	78.1758333333	2.45882576989	34435.36744
New Zealand	6	76.8923333333	2.55215843291	25185.00911

Summarise

We can use lots of built-in functions when using summarise:

Summarise

Note: Don’t forget to include na.rm = TRUE if NAs are present

# e.g.
data("airquality")
airquality |>
  as_tibble() |>
  summarise(mean_ozone = mean(Ozone, na.rm = TRUE))

mean_ozone
42.1293103448

Summarise - Practice!

Using gapminder

Get the median and standard deviation for gdpPercap

Using airquality

Convert to tibble, remove Month and Day, keep Temperatures above 60, then summarise Solar.R with number, mean, median and standard deviation

Summarise - Extra Practice!

Using airquality

Summarise the mean values for integer columns only using summarise_if() and is.integer()

Using starwars

What is the mean height and weight of all Droids? Include a count of how many droids there are
We want to compare Humans from different planets. Show the number of people and their mean height and mass, broken down by homeworld then gender. Make sure you show all rows when printing by piping the last line into print(n = x), where x is a suitably high number of rows

Prettier summaries with skimr

library(skimr)
gapminder |>
  skim()

Other really useful verbs and tools

dplyr::glimpse() instead of str()
use everything() to re-order columns
use n_distinct() instead of length(unique())
dplyr::rename(new_name = old_name) for renaming columns
dplyr::select_() and other variants
dplyr::near() and dplyr::between()
coalesce(), recode() and case_when() from dplyr
tibble::rownames_to_column()

Things to be aware of

Packages can have functions with the same name - sometimes it helps to be specific e.g. dplyr::select()
We can now used the conflicted package to be more explicit about this!
Filtering using grepl() is better done using stringr’s str_detect() (introduced tomorrow)
stringr (tomorrow) combined with dplyr (more tomorrow) and the native pipe (or maggritr pipe if needed) will cover a lot of your everyday needs

Should we always use dplyr?

What if:

you have a single operation? or 20?
the flow of operations isn’t linear?
you need to inspect an intermediate step?
speed is more important to you than readability?

Homework

Quarto file
Regex practice before tomorrow, like this site

Suggested Reading

R for Data Science 2e, chapter 4
The docs for dplyr and the vignette at browseVignettes(package = "dplyr")

The Tidyverse Life

Hadley’s Tidy Data paper
Cheat Sheets from Posit
The Tidyverse blog
@hadleywickham, @posit_pbc, @tidyverse