wrangling data 1: dplyr
2024-10-17
|>
or %>%
for data manipulationWe should also keep to one type of observational unit per table
The 6 main verbs in dplyr:
select()
→ select variables by namefilter()
→ return rows with matching conditionsarrange()
→ arrange rows by variablesmutate()
→ add new variablesgroup_by()
→ return grouping of variablessummarise()
→ reduces multiple values down to a single valueALL verbs follow the same format:
Also install/load:
Select variables by name
Select variables by name
country | continent | year |
---|---|---|
Afghanistan | Asia | 1952 |
Afghanistan | Asia | 1957 |
Afghanistan | Asia | 1962 |
Afghanistan | Asia | 1967 |
Select variables by name
continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|
Asia | 1952 | 28.801 | 8425333 | 779.4453145 |
Asia | 1957 | 30.332 | 9240934 | 820.8530296 |
Asia | 1962 | 31.997 | 10267083 | 853.1007100 |
Asia | 1967 | 34.020 | 11537966 | 836.1971382 |
There are lots of helper functions that work with select, e.g.:
Using gapminder
Using starwars - load with data("starwars")
Using flights - load with library(nycflights13)
Using gapminder
Using starwars
select_if()
)Using flights
return rows with matching conditions
return rows with matching conditions
##| output: false
# combining multiple filters in a single statement
country_names <- c(
'Afghanistan',
'Angola',
'Belgium',
'China',
'Mauritania',
'Mauritius',
'Mongolia',
'Korea, Rep.',
'Sri Lanka',
'Saudi Arabia',
'Vietnam',
'Yemen, Rep.'
)
filter(
gapminder,
year == 1997L,
continent != "asia",
pop >= 1000000L,
country %in% country_names
) |>
head()
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.341351 |
Angola | Africa | 1997 | 40.963 | 9875024 | 2277.140884 |
Belgium | Europe | 1997 | 77.530 | 10199787 | 27561.196630 |
China | Asia | 1997 | 70.426 | 1230075000 | 2289.234136 |
Korea, Rep. | Asia | 1997 | 74.647 | 46173816 | 15993.527960 |
Mauritania | Africa | 1997 | 60.430 | 2444741 | 1483.136136 |
combining filters
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Albania | Europe | 1997 | 72.950 | 3428038 | 3193.054604 |
Algeria | Africa | 1997 | 69.152 | 29072015 | 4797.295051 |
Angola | Africa | 1997 | 40.963 | 9875024 | 2277.140884 |
Argentina | Americas | 1997 | 73.275 | 36203463 | 10967.281950 |
combining filters
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971382 |
Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3413510 |
Albania | Europe | 1967 | 66.220 | 1984060 | 2760.1969310 |
Albania | Europe | 1997 | 72.950 | 3428038 | 3193.0546040 |
Using gapminder
Using starwars
filter()
and grepl()
Using gapminder
Using starwars
arrange rows by variables
arrange rows by variables
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Japan | Asia | 2007 | 82.603 | 127467972 | 31656.06806 |
Hong Kong, China | Asia | 2007 | 82.208 | 6980412 | 39724.97867 |
Japan | Asia | 2002 | 82.000 | 127065841 | 28604.59190 |
Iceland | Europe | 2007 | 81.757 | 301931 | 36180.78919 |
note: order is important when using multiple variables!
Using gapminder
add new variables
Using starwars
Using flights
ifelse()
)Using starwars
str_to_title()
from stringr
)Using flights
na_if()
Let’s combine lots of operations together to get the data we want!
It’s as simple as
…right???!
WRONG!! (ish)
Instead of this:
Do this:
Pipes:
|>
(no need to load tidyverse)|>
here, you could use %>%
instead (but load dplyr first)%>%
in a lot of code in the wild, so it’s good to know both%>%
yourself if you’re forced to use an older version of RUsing starwars
tip: use is.character
and ifelse()
to create the new column following the format:
name | hair_color | skin_color | eye_color | sex | gender | homeworld | species |
---|---|---|---|---|---|---|---|
Luke Skywalker | blond | fair | blue | male | masculine | Tatooine | Human |
Darth Vader | none | white | yellow | male | masculine | Tatooine | Human |
Leia Organa | brown | light | brown | female | feminine | Alderaan | Human |
Owen Lars | brown, grey | light | blue | male | masculine | Tatooine | Human |
Beru Whitesun Lars | brown | light | blue | female | feminine | Tatooine | Human |
Biggs Darklighter | black | light | brown | male | masculine | Tatooine | Human |
Obi-Wan Kenobi | auburn, white | fair | blue-gray | male | masculine | Stewjon | Human |
Anakin Skywalker | blond | fair | blue | male | masculine | Tatooine | Human |
Chewbacca | brown | unknown | blue | male | masculine | Kashyyyk | Wookiee |
Han Solo | brown | fair | brown | male | masculine | Corellia | Human |
Wedge Antilles | brown | fair | hazel | male | masculine | Corellia | Human |
Palpatine | grey | pale | yellow | male | masculine | Naboo | Human |
Boba Fett | black | fair | brown | male | masculine | Kamino | Human |
Bossk | none | green | red | male | masculine | Trandosha | Trandoshan |
Lando Calrissian | black | dark | brown | male | masculine | Socorro | Human |
Lobot | none | light | blue | male | masculine | Bespin | Human |
Ackbar | none | brown mottle | orange | male | masculine | Mon Cala | Mon Calamari |
Wicket Systri Warrick | brown | brown | brown | male | masculine | Endor | Ewok |
Padmé Amidala | brown | light | brown | female | feminine | Naboo | Human |
Jar Jar Binks | none | orange | orange | male | masculine | Naboo | Gungan |
Darth Maul | none | red | yellow | male | masculine | Dathomir | Zabrak |
Ayla Secura | none | blue | hazel | female | feminine | Ryloth | Twi’lek |
Mace Windu | none | dark | brown | male | masculine | Haruun Kal | Human |
Ki-Adi-Mundi | white | pale | yellow | male | masculine | Cerea | Cerean |
Plo Koon | none | orange | black | male | masculine | Dorin | Kel Dor |
Luminara Unduli | black | yellow | blue | female | feminine | Mirial | Mirialan |
Barriss Offee | black | yellow | blue | female | feminine | Mirial | Mirialan |
Dooku | white | fair | brown | male | masculine | Serenno | Human |
Jango Fett | black | tan | brown | male | masculine | Concord Dawn | Human |
Using flights
Useful only in conjunction with other methods
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453145 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530296 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007100 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971382 |
Useful only in conjunction with other methods
reduces multiple values down to a single value
mean_lifexp |
---|
59.4744393662 |
Combining group_by() and summarise() gives us so much more power!
country | n_years | Mean_Life_Exp | SD_Life_Exp | Max_GDP |
---|---|---|---|---|
Australia | 6 | 78.1758333333 | 2.45882576989 | 34435.36744 |
New Zealand | 6 | 76.8923333333 | 2.55215843291 | 25185.00911 |
We can use lots of built-in functions when using summarise:
Note: Don’t forget to include na.rm = TRUE
if NAs are present
Using gapminder
Using airquality
Using airquality
summarise_if()
and is.integer()
Using starwars
print(n = x)
, where x
is a suitably high number of rowsdplyr::glimpse()
instead of str()
everything()
to re-order columnsn_distinct()
instead of length(unique())
dplyr::rename(new_name = old_name)
for renaming columnsdplyr::select_()
and other variantsdplyr::near()
and dplyr::between()
coalesce()
, recode()
and case_when()
from dplyr
tibble::rownames_to_column()
dplyr::select()
grepl()
is better done using stringr’s str_detect()
(introduced tomorrow)stringr
(tomorrow) combined with dplyr
(more tomorrow) and the native pipe (or maggritr
pipe if needed) will cover a lot of your everyday needsWhat if:
Suggested Reading
browseVignettes(package = "dplyr")
met581 - programming in r - slides and code avilable here