Wrangling Data 2: tidying, strings and joins
2024-10-18
- pick() and across() to do this
- select_if() and the other scoped verbs have been superseded (but you will see them everywhere!)
pick() in masked environments
- In mutate(), summarise(), and group_by() we can refer to columns directly by their names instead of needing quotation marks - why is this?
- pick() is like select(), but can refer to columns directly in a masked environment
across for applying functions
- (more on purrr in later lectures)
- across can be used in this way to apply functions to multiple columns
- across syntax replaces the code for doing something manual, like:
gapminder |>
  summarise(
    mean_lifeExp = mean(lifeExp),
    mean_pop = mean(pop),
    mean_gdpPercap = mean(gdpPercap)
  )
mean_lifeExp | mean_pop | mean_gdpPercap |
---|---|---|
59.4744393662 | 29601212.3245 | 7215.32708121 |
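As a minimal sketch, the manual summarise() call can be condensed with across(), and pick() can hand a set of bare column names to another function inside a masked verb. This assumes the gapminder package is installed and dplyr >= 1.1.0 (needed for pick()); the `.names` pattern and the object names `with_across` and `complete_rows` are choices made here for illustration.

```r
library(dplyr)
library(gapminder)

# across(): apply mean() to several columns in one call.
# .names = "mean_{.col}" is an assumption, chosen to mimic the manual column names.
with_across <- gapminder |>
  summarise(across(c(lifeExp, pop, gdpPercap), mean, .names = "mean_{.col}"))

# pick(): select columns by bare name inside a data-masking verb.
# Here it passes a three-column tibble to complete.cases() for each continent.
complete_rows <- gapminder |>
  group_by(continent) |>
  summarise(n_complete = sum(complete.cases(pick(lifeExp, pop, gdpPercap))))

with_across
complete_rows
```

Adding another column to the across() selection extends the summary without writing another line per variable.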
For smaller queries:
For bigger queries:
We usually have one of two problems
pivot_longer(): solves the ‘variables as columns’ problem
- select columns as per dplyr::select()
pivot_wider(): solves the ‘observations over rows’ problem
Using fish_encounters
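A short sketch of both pivots, using two datasets that ship with tidyr. The use of `table4a` for the longer case and the choice of `values_fill = 0` are assumptions made here, not necessarily what the slides showed.

```r
library(tidyr)

# pivot_longer(): the columns `1999` and `2000` hold values of one variable (year),
# so gather them into a 'year' column and a 'cases' column.
longer <- table4a |>
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

# pivot_wider(): each fish is spread over one row per station,
# so give every station its own column; stations with no record are filled with 0.
wider <- fish_encounters |>
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)

longer
wider
```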
Using starwars - load with data("starwars")
- Replace tidyr::some_function() with the correct call to convert the height and mass columns into ‘characteristic’ and ‘measurement’ columns for plotting below
Using flights - load with library(nycflights13)
- Replace tidyr::some_function() with the correct call to convert the temp, dewp and humid columns into ‘condition’ and ‘measurement’ columns for plotting below
A key is a variable (or set of variables) that uniquely identifies an observation. It is the backbone of each dataset or set of datasets.
You generally have two types of key:
It is generally a good idea to test whether you have a unique primary key for the data frames you are working with; doing so may also help you eliminate duplications in your data:
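One common way to run this check is to count each key value and keep any that appear more than once. The sketch below uses band_members from dplyr; `with_duplicate` is a hypothetical frame built only to show what a failed check looks like.

```r
library(dplyr)

# 'name' is a candidate primary key for band_members:
# counting each value and keeping n > 1 should return zero rows.
band_members |>
  count(name) |>
  filter(n > 1)

# A hypothetical frame with one repeated row, built only for illustration.
with_duplicate <- bind_rows(band_members, slice(band_members, 1))

duplicated_keys <- with_duplicate |>
  count(name) |>
  filter(n > 1)

duplicated_keys            # the key value(s) that break uniqueness
distinct(with_duplicate)   # one way to drop exact duplicate rows
```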
All follow the same format:
inner_join(x, y, by = "key")
left_join(x, y, by = "key")
Using library(nycflights13)
Using band_members and band_instruments which are loaded with dplyr
right_join(x, y, by = "key")
full_join(x, y, by = "key")
Using library(nycflights13)
Using band_members and band_instruments which are loaded with dplyr
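A minimal sketch of the four mutating joins on band_members and band_instruments, keyed on `name`. The last call assumes dplyr >= 1.1.0, where joins accept a `relationship` argument that raises an error if the declared match structure does not hold.

```r
library(dplyr)

inner <- inner_join(band_members, band_instruments, by = "name")  # names found in both
left  <- left_join(band_members, band_instruments, by = "name")   # every row of x
right <- right_join(band_members, band_instruments, by = "name")  # every row of y
full  <- full_join(band_members, band_instruments, by = "name")   # every row of both

# Declare that each name should match at most one row on either side.
checked <- left_join(band_members, band_instruments,
                     by = "name", relationship = "one-to-one")

nrow(inner)
nrow(left)
nrow(right)
nrow(full)
```

Comparing the row counts is a quick sanity check on which rows each join keeps.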
relationship argument in dplyr joins
- use this to give the expected relationships between data frames (it will stop you making many mistakes)
As quoted in R for Data Science, 1st edition:
“When you first look at a regexp, you’ll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.”
If you’ve read about regex before, you will also have come across the quotation:
“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”
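- Character classes: \s, \d, \w, [abc] and [^abc]
- In R strings, write \\ instead of \, e.g. \\w or \\d
- Wildcard . and quantifiers ?, +, * or {n,m}
- Anchors: the start ^ or the end $
- ^\\w+_\\d{4}$ would match “hagrid_2020”, but not “hagrid_120”
Character classes: \s, \d, \w, [abc] and [^abc]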
[1] "Hagrid" "Hermione" "Harry.Potter" "Ronald_Weasley"
[5] "24xHouse Elves"
[1] TRUE TRUE TRUE TRUE TRUE
Wildcard . and quantifiers ?, +, * or {n,m}
[1] "Hagrid" "Hermione" "Harry.Potter" "Ronald_Weasley"
[5] "24xHouse Elves"
[1] TRUE TRUE TRUE TRUE TRUE
Anchors: the start ^ or the end $
[1] "Hagrid" "Hermione" "Harry.Potter" "Ronald_Weasley"
[5] "24xHouse Elves"
[1] TRUE TRUE FALSE FALSE FALSE
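The calls that produced the outputs above are not preserved here, so the following is only a sketch: the `chars` vector is taken from the printed output, while the three patterns are assumptions chosen to illustrate a character class, a quantifier and anchors with stringr::str_detect().

```r
library(stringr)

chars <- c("Hagrid", "Hermione", "Harry.Potter", "Ronald_Weasley", "24xHouse Elves")

# Character class: does the string contain at least one word character?
has_word_char <- str_detect(chars, "\\w")

# Quantifier: one or more word characters in a row.
has_word_run <- str_detect(chars, "\\w+")

# Anchors: the whole string, start to end, must consist of letters only.
letters_only <- str_detect(chars, "^[A-Za-z]+$")

# The pattern from the summary: word characters, an underscore, exactly four digits.
year_suffix <- str_detect(c("hagrid_2020", "hagrid_120"), "^\\w+_\\d{4}$")

has_word_char
has_word_run
letters_only
year_suffix
```

Note the doubled backslashes: the R string "\\w" is the regular expression \w.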
check string lengths and counts
concatenate (combine) strings
extract or replace strings
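A brief sketch of one stringr function for each of these tasks. The `chars` vector and the patterns are illustrative assumptions, not taken from the slides.

```r
library(stringr)

chars <- c("Hagrid", "Hermione", "Harry.Potter")

# Check string lengths and counts
lengths  <- str_length(chars)      # number of characters in each string
r_counts <- str_count(chars, "r")  # number of lower-case "r" in each string

# Concatenate (combine) strings
combined <- str_c(chars, collapse = ", ")

# Extract or replace strings
first_part <- str_extract(chars, "^\\w+")    # leading run of word characters
no_dots    <- str_replace(chars, "\\.", " ") # swap the first literal dot for a space

lengths
r_counts
combined
first_part
no_dots
```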
Using starwars
(use pull() to convert it to a vector for stringr to handle)
Using flights
find or view strings
[1] TRUE FALSE FALSE FALSE FALSE
sort and separate strings
[[1]]
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
[8] "planks."
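The list output above is consistent with splitting the first entry of stringr's built-in `sentences` vector on single spaces. The exact call is an assumption, since the original code is not preserved here.

```r
library(stringr)

# str_split() returns a list with one character vector per input string.
pieces <- str_split(sentences[1], " ")

pieces        # a list of length one
pieces[[1]]   # the character vector of space-separated pieces
```

Splitting on a literal space keeps punctuation attached to the neighbouring word, which is why the last piece is "planks." with its full stop.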
boundary("word") function instead of " " and compare results
Using the fifth line of the sentences dataset
(unlist())
[1] "A" "ABLE" "ABOUT" "ABSOLUTE" "ACCEPT" "ACCOUNT"
[7] "ACHIEVE" "ACROSS" "ACT" "ACTIVE"
[1] "a" "able" "about" "absolute" "accept" "account"
[7] "achieve" "across" "act" "active"
[1] "A" "Able" "About" "Absolute" "Accept" "Account"
[7] "Achieve" "Across" "Act" "Active"
[1] "A" "Able" "About" "Absolute" "Accept" "Account"
[7] "Achieve" "Across" "Act" "Active"
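These four outputs are consistent with stringr's case helpers applied to the first ten entries of the built-in `words` vector. The sketch below assumes that is what was run.

```r
library(stringr)

first_words <- words[1:10]

upper    <- str_to_upper(first_words)     # ALL CAPS
lower    <- str_to_lower(first_words)     # all lower case
title    <- str_to_title(first_words)     # capitalise every word
sentence <- str_to_sentence(first_words)  # capitalise the first word only

upper
lower
title
sentence
```

For single-word strings, title case and sentence case give the same result, which is why the last two outputs match.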
Using starwars
Using gapminder
Suggested Reading
browseVignettes(package = "dplyr")