MET581 Lecture 04

Wrangling Data 2: tidying, strings and joins

Matthew Bracher-Smith

2024-10-18

Overview

Review

The verbs of dplyr
Homework

First, Load everything

library(stringr)
library(tidyr)
library(dplyr)
library(gapminder)
library(nycflights13)
library(ggplot2)

Homework Tips

Basic organisation

By now there should be some structure to your work
This might be R scripts or literate programming (Quarto files)
It’s worth creating a new project for each class, and a new .qmd file for each homework
Test in the console to build up parts of a query and transfer to the .qmd file

Scoped verbs vs. pick and across

We mainly talked about applying functions to individual columns
But we often want to apply functions to multiple columns
We can use pick() and across() to do this
The scoped variants of dplyr verbs like select_if() have been superseded (but you will see them everywhere!)

`pick()` in masked environments

in functions like mutate(), summarise(), and group_by() we can refer to columns directly by their names instead of needing quotation marks - why is this?
they “mask” the overall data frame, meaning they provide direct access to column names without needing to explicitly reference the full data frame.
you will see this referred to as a “data masking environment” in the docs
pick() is like select(), but can refer to columns directly in a masked environment, e.g.

gapminder |> 
  mutate(rank = dense_rank(pick(lifeExp, gdpPercap)))

`across` for applying functions

You will cover functional programming in detail with purrr in later lectures
Functional programming refers to functions which can take other functions as arguments for iteration
across can be used in this way to apply functions to multiple columns
see the R4DS book, ch. 26 for more

`across` for applying functions

for example, we can apply a function to every column that’s an integer
with the more recent across syntax, this looks like:

gapminder |>
  mutate(across(where(is.numeric), round))

this is equivalent to the now superseded “scoped variant” syntax:

gapminder |>
  mutate_if(is.numeric, round)

`across` for applying functions

one of the more frustrating things is manually writing out calls to summarise
with across, the code fo doing something manual, like:

gapminder |>
  summarise(
    mean_lifeExp = mean(lifeExp),
    mean_pop = mean(pop),
    mean_gdpPercap = mean(gdpPercap)
  )

mean_lifeExp	mean_pop	mean_gdpPercap
59.4744393662	29601212.3245	7215.32708121

can now be simplified to:

gapminder |>
  summarise(across(c(lifeExp, pop, gdpPercap), mean))

lifeExp	pop	gdpPercap
59.4744393662	29601212.3245	7215.32708121

How to plan a query

For smaller queries:

sketch out the the “bones” of the query with dplyr verbs
start with from the end and work backwards

For bigger queries:

manually or electronically sketch out the data flow
think about the order of operations (ETL)
break the query up into manageable chunks if it’s long
test queries on a small batch of data e.g. read n_max=10 rows for testing

The Plan

introduce wide and long formats
begin converting between wide and long data
join data frames together with dplyr
introduce manipulating data with strings in stringr

Tidyr

Wide and Long Data

What’s the problem?

we want data to be tidy
we often need to convert to wide or long for different applications

We usually have one of two problems

a variable is spread over several columns
an observation is spread over several rows

tidyr::pivot_longer()

Solves the ‘variables as columns’ problem

table4a |> 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

country	year	cases
Afghanistan	1999	745
Afghanistan	2000	2666
Brazil	1999	37737
Brazil	2000	80488
China	1999	212258
China	2000	213766

tidyr::pivot_longer()

select columns as per dplyr::select()

table4a |> 
  pivot_longer(-country, names_to = "year", values_to = "cases")

country	year	cases
Afghanistan	1999	745
Afghanistan	2000	2666
Brazil	1999	37737
Brazil	2000	80488
China	1999	212258
China	2000	213766

tidyr::pivot_wider()

Solves the ‘observations over rows’ problem

table2 |>
    pivot_wider(names_from = type, values_from = count)

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

tidyr::practice()

Using fish_encounters

change the variable ‘station’ to be column names, and populate with values from ‘seen’

Using starwars - load with data("starwars")

pivot all columns ending with ‘color’ to long format as new columns ‘attribute’ and ‘colour’

Using flights - load with library(nycflights13)

pivot all columns ending with ‘time’ to long format as new columns ‘time_type’ and ‘time_value’ then select only columns beginning with ‘time’

tidyr::practice_more()

replace tidyr::some_function() with the correct call to convert the height and mass columns into ‘characteristic’ and ‘measurement’ columns for plotting below

starwars |> 
  tidyr::some_function() |>
  ggplot(aes(characteristic, measurement)) +
    geom_jitter()

replace tidyr::some_function() with the correct call to convert the temp, dewp and humid columns into ‘condition’ and ‘measurement’ columns for plotting below

weather |> 
  head(1000) |>
  tidyr::some_function() |>
  ggplot(aes(measurement, condition)) +
    geom_jitter(alpha=0.6)

Joins

Relational data

when our data is spread over several tables
we need to know the relations between pairs of tables
to do this, we need to think about keys

Keys

A key is a variable (or set of variables) that uniquely identifies an observation. It is the backbone of each dataset or set of datasets.

You generally have two types of key:

Primary key: identifies observations in its own data frame (eg: planes$tailnum)
Foreign key: identifies observations in another data frame (eg: flights$tailnum)

Keys

It is generally good idea to test whether or not you do have a unique primary key for the data frames you are working with, and may help you eliminate duplications in your data:

planes |> 
  count(tailnum) |> 
  filter(n > 1)

Joins

datasets must share at least one key for joining
may be one or more columns, but must uniquely identify each row
if it doesn’t, and there are multiple matches, all combination of the matches will occur
here are a variety of different joins: inner, left, right and full

All follow the same format:

join(x, y, by)

Joins

Inner Join

inner_join(x, y, by = “key”)

Left Join

left_join(x, y, by = “key”)

Joins - Practice

Using library(nycflights13)

Use a left_join to join the flights and planes datasets together
Combine the airline information from the “airlines” dataset with the flights dataset using left_join, filter for flights from 2013 and select only the columns to do with arrivals

Using band_members and band_instruments which are loaded with dplyr

Join the band_members and band_instruments datasets keep all columns and only those people in both

Right Join

right_join(x, y, by = “key”)

Full Join

full_join(x, y, by = “key”)

Joins - More Practice

Using library(nycflights13)

Add data from the weather dataset to flights with a left_join

Using band_members and band_instruments which are loaded with dplyr

Join the band_members and band_instruments datasets; keep only the people in the right-hand dataset (band_instruments)
Join the band_members and band_instruments datasets; keep everything

Final Tips on Joins

take the time to learn the different types of joins and choose carefully each time
anti-joins can be useful to find rows that don’t match between datasets, e.g. to exclude rows with IDs that aren’t in an expected ID list
there is now a relationships argument in dplyr::join() - use this to give the expected relationships between data frames (it will stop you making many mistakes)

Strings

But what is a string?

Strings are a sequence of characters which has to be represented in memory in binary
This was first widely done using ASCII (American Standard Code for Information Interchange)
But, ASCII only allows for 128 characters, so not nearly enough for all languages
Now we represent strings using Unicode (which didn’t appear in it’s current form till the 90s!), which allows for a lot (>1m) characters
UTF-8 is the most common encoding for Unicode - uses a variable number of bytes (8-bit) units to represent characters
R uses UTF-8 encoding by default (though will use the label “unknown” unless non-ASCII characters appear), and stringr fully supports Unicode

Stringr

removes inconsistencies found in base R
built on top of the stringi package
starts all functions with “str_”
more advanced use requires regular expressions (regex)

Regex

As quoted by R for Data Science 1st edition:

“When you first look at a regexp, you’ll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.”

If you’ve read about regex before, you will also have come across the quotation:

“Some people, when confronted with a problem, think”I know, I’ll use regular expressions.” Now they have two problems.”

Whistle-stop regex review

Allow you to match patterns in strings
Most basic matches an actual chunk of text, e.g. ‘hag’ in ‘hagrid’
We can match multiple types of characters with \s, \d, \w, [abc] and [^abc]
Because we’re in R, we have to use \\ instead of \, e.g. \\w or \\d
We can match any character with .
We can expand these with ?, +, * or {n,m}
We can anchor them to the start ^ or the end $
e.g. ^\\w+_\\d{4}$ would match “hagrid_2020”, but not “hagrid_120”

Whistle-stop regex review

Allow you to match patterns in strings
Most basic matches an actual chunk of text

my_string <- c('Hagrid', 'Hermione', 'Harry.Potter', 'Ronald_Weasley', '24xHouse Elves')
stringr::str_detect(my_string, 'Hag')

[1]  TRUE FALSE FALSE FALSE FALSE

Whistle-stop regex review

We can match multiple types of characters with \s, \d, \w, [abc] and [^abc]

my_string

[1] "Hagrid"         "Hermione"       "Harry.Potter"   "Ronald_Weasley"
[5] "24xHouse Elves"

stringr::str_detect(my_string, '\\w')

[1] TRUE TRUE TRUE TRUE TRUE

stringr::str_detect(my_string, '\\d')

[1] FALSE FALSE FALSE FALSE  TRUE

stringr::str_detect(my_string, '\\s')

[1] FALSE FALSE FALSE FALSE  TRUE

Whistle-stop regex review

We can match any character with .

my_string

[1] "Hagrid"         "Hermione"       "Harry.Potter"   "Ronald_Weasley"
[5] "24xHouse Elves"

stringr::str_detect(my_string, '.')

[1] TRUE TRUE TRUE TRUE TRUE

Whistle-stop regex review

We can expand these with ?, +, * or {n,m}

my_string

[1] "Hagrid"         "Hermione"       "Harry.Potter"   "Ronald_Weasley"
[5] "24xHouse Elves"

stringr::str_detect(my_string, '\\w+')

[1] TRUE TRUE TRUE TRUE TRUE

stringr::str_detect(my_string, '\\d{2}[xyz]\\w+\\s?\\w+')

[1] FALSE FALSE FALSE FALSE  TRUE

stringr::str_detect(my_string, '\\d{2}[^xyz]\\w+\\s?\\w+')

[1] FALSE FALSE FALSE FALSE FALSE

Whistle-stop regex review

We can anchor them to the start ^ or the end $

my_string

[1] "Hagrid"         "Hermione"       "Harry.Potter"   "Ronald_Weasley"
[5] "24xHouse Elves"

stringr::str_detect(my_string, '^H\\w+$')

[1]  TRUE  TRUE FALSE FALSE FALSE

stringr::str_detect(my_string, '^Hag$')

[1] FALSE FALSE FALSE FALSE FALSE

Whistle-stop regex review

Allow you to match patterns in strings
Most basic matches an actual chunk of text, e.g. ‘hag’ in ‘hagrid’
We can match multiple types of characters with \s, \d, \w, [abc] and [^abc]
Because we’re in R, we have to use \\ instead of \, e.g. \\w or \\d
We can match any character with .
We can expand these with ?, +, * or {n,m}
We can anchor them to the start ^ or the end $
e.g. ^\\w+_\\d{4}$ would match “hagrid_2020”, but not “hagrid_120”

Basic stringr operations

check string lengths and counts

# counts the number of characters in each string
stringr::str_length(my_string)

[1]  6  8 12 14 14

# counts the number of matches in a string
stringr::str_count(my_string, 'Ha')

[1] 1 0 1 0 0

Basic stringr operations

concatenate (combine) strings

# explicitly naming each column
stringr::str_c('this', 'that', sep=', ')

[1] "this, that"

Manipulations with stringr

extract or replace strings

# extract substrings
stringr::str_sub(my_string, 1, 3)

[1] "Hag" "Her" "Har" "Ron" "24x"

# replace matches
stringr::str_replace(my_string, '[\\._x]', ' ')

[1] "Hagrid"         "Hermione"       "Harry Potter"   "Ronald Weasley"
[5] "24 House Elves"

Stringr - Practice

concatenate the strings “day” to” and “day”, separated by a hyphen

Using starwars

select the hair color column and replace ‘,’ with a ‘/’ (hint: you can pipe a column into pull() to convert it to a vector for stringr to handle)

Using flights

select columns ending with ‘delay’ and remove the underscore from all column names

Manipulations with stringr

find or view strings

# return a boolean for matches (alternative to grepl)
stringr::str_detect(my_string, 'Hagrid')

[1]  TRUE FALSE FALSE FALSE FALSE

# highlight matches
stringr::str_view(my_string, '^\\w')

[1] │ <H>agrid
[2] │ <H>ermione
[3] │ <H>arry.Potter
[4] │ <R>onald_Weasley
[5] │ <2>4xHouse Elves

Manipulations with stringr

sort and separate strings

# sorting strings
sentences |>
  head(1) |> 
  stringr::str_split(" ")

[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
[8] "planks."

# sorting strings
stringr::str_sort(words[1:10], locale = 'en')

 [1] "a"        "able"     "about"    "absolute" "accept"   "account" 
 [7] "achieve"  "across"   "act"      "active"

Stringr - Practice

split the string “Harry, did you put your name in the Goblet of Fire?” into its components
use the boundary("word") function instead of ” ” and compare results

Using the fifth line of the sentences dataset

split by word boundary, convert to lowercase and sort (hint: use unlist())

Cleaning up with stringr

stringr::str_to_upper(words[1:10])

 [1] "A"        "ABLE"     "ABOUT"    "ABSOLUTE" "ACCEPT"   "ACCOUNT" 
 [7] "ACHIEVE"  "ACROSS"   "ACT"      "ACTIVE"

stringr::str_to_lower(words[1:10])

 [1] "a"        "able"     "about"    "absolute" "accept"   "account" 
 [7] "achieve"  "across"   "act"      "active"

stringr::str_to_sentence(words[1:10])

 [1] "A"        "Able"     "About"    "Absolute" "Accept"   "Account" 
 [7] "Achieve"  "Across"   "Act"      "Active"

stringr::str_to_title(words[1:10])

 [1] "A"        "Able"     "About"    "Absolute" "Accept"   "Account" 
 [7] "Achieve"  "Across"   "Act"      "Active"

Cleaning up with stringr

sentences |>
  head(1) |>
  stringr::str_to_sentence()

[1] "The birch canoe slid on the smooth planks."

sentences |>
  head(1) |>
  stringr::str_to_title()

[1] "The Birch Canoe Slid On The Smooth Planks."

Stringr - More Practice

Using starwars

convert the hair_color column values to be Sentence Case
change all columns names to be title case

Using gapminder

remove any Camel Case from column names (i.e. all to lower case)
convert the continent column values to be all upper case

Homework

Quarto file

Suggested Reading

R for Data Science 2e, chapters 6, 15, 16, 20
The docs for dplyr and vignette at browseVignettes(package = "dplyr")
Also check out the docs for tidyr and stringr
This blog post by Joel Spolsky on Unicode and character sets

MET581 Lecture 04

Overview

Review

First, Load everything

Homework Tips

Basic organisation

Scoped verbs vs. pick and across

pick() in masked environments

across for applying functions

across for applying functions

across for applying functions

How to plan a query

The Plan

Tidyr

Wide and Long Data

What’s the problem?

tidyr::pivot_longer()

tidyr::pivot_longer()

tidyr::pivot_wider()

tidyr::practice()

tidyr::practice_more()

Joins

Relational data

Keys

Keys

Joins

Joins

Inner Join

Left Join

Joins - Practice

Joins - Practice

Right Join

Full Join

Joins - More Practice

Final Tips on Joins

Strings

But what is a string?

Stringr

Regex

Whistle-stop regex review

Whistle-stop regex review

Whistle-stop regex review

Whistle-stop regex review

Whistle-stop regex review

Whistle-stop regex review

Whistle-stop regex review

Basic stringr operations

Basic stringr operations

Manipulations with stringr

Stringr - Practice

Manipulations with stringr

Manipulations with stringr

Stringr - Practice

Cleaning up with stringr

Cleaning up with stringr

Stringr - More Practice

Homework

`pick()` in masked environments

`across` for applying functions

`across` for applying functions

`across` for applying functions