Wrangling Data 1: dplyr
2024-10-17
|> or %>% for data manipulationWe should also keep to one type of observational unit per table
The 6 main verbs in dplyr:
select() → select variables by namefilter() → return rows with matching conditionsarrange() → arrange rows by variablesmutate() → add new variablesgroup_by() → return grouping of variablessummarise() → reduces multiple values down to a single valueALL verbs follow the same format:
Also install/load:
Select variables by name
Select variables by name
| country | continent | year |
|---|---|---|
| Afghanistan | Asia | 1952 |
| Afghanistan | Asia | 1957 |
| Afghanistan | Asia | 1962 |
| Afghanistan | Asia | 1967 |
Select variables by name
| continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|
| Asia | 1952 | 28.801 | 8425333 | 779.4453145 |
| Asia | 1957 | 30.332 | 9240934 | 820.8530296 |
| Asia | 1962 | 31.997 | 10267083 | 853.1007100 |
| Asia | 1967 | 34.020 | 11537966 | 836.1971382 |
There are lots of helper functions that work with select, e.g.:
Using gapminder
Using starwars - load with data("starwars")
Using flights - load with library(nycflights13)
Using gapminder
Using starwars
select_if())Using flights
return rows with matching conditions
return rows with matching conditions
##| output: false
# combining multiple filters in a single statement
country_names <- c(
'Afghanistan',
'Angola',
'Belgium',
'China',
'Mauritania',
'Mauritius',
'Mongolia',
'Korea, Rep.',
'Sri Lanka',
'Saudi Arabia',
'Vietnam',
'Yemen, Rep.'
)
filter(
gapminder,
year == 1997L,
continent != "asia",
pop >= 1000000L,
country %in% country_names
) |>
head()| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.341351 |
| Angola | Africa | 1997 | 40.963 | 9875024 | 2277.140884 |
| Belgium | Europe | 1997 | 77.530 | 10199787 | 27561.196630 |
| China | Asia | 1997 | 70.426 | 1230075000 | 2289.234136 |
| Korea, Rep. | Asia | 1997 | 74.647 | 46173816 | 15993.527960 |
| Mauritania | Africa | 1997 | 60.430 | 2444741 | 1483.136136 |
combining filters
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Albania | Europe | 1997 | 72.950 | 3428038 | 3193.054604 |
| Algeria | Africa | 1997 | 69.152 | 29072015 | 4797.295051 |
| Angola | Africa | 1997 | 40.963 | 9875024 | 2277.140884 |
| Argentina | Americas | 1997 | 73.275 | 36203463 | 10967.281950 |
combining filters
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971382 |
| Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3413510 |
| Albania | Europe | 1967 | 66.220 | 1984060 | 2760.1969310 |
| Albania | Europe | 1997 | 72.950 | 3428038 | 3193.0546040 |
Using gapminder
Using starwars
filter() and grepl()Using gapminder
Using starwars
arrange rows by variables
arrange rows by variables
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Japan | Asia | 2007 | 82.603 | 127467972 | 31656.06806 |
| Hong Kong, China | Asia | 2007 | 82.208 | 6980412 | 39724.97867 |
| Japan | Asia | 2002 | 82.000 | 127065841 | 28604.59190 |
| Iceland | Europe | 2007 | 81.757 | 301931 | 36180.78919 |
note: order is important when using multiple variables!
Using gapminder
add new variables
Using starwars
Using flights
ifelse())Using starwars
str_to_title() from stringr)Using flights
na_if()Let’s combine lots of operations together to get the data we want!
It’s as simple as
…right???!
WRONG!! (ish)
Instead of this:
Do this:
Pipes:
|> (no need to load tidyverse)|> here, you could use %>% instead (but load dplyr first)%>% in a lot of code in the wild, so it’s good to know both%>% yourself if you’re forced to use an older version of RUsing starwars
tip: use is.character and ifelse() to create the new column following the format:
| name | hair_color | skin_color | eye_color | sex | gender | homeworld | species |
|---|---|---|---|---|---|---|---|
| Luke Skywalker | blond | fair | blue | male | masculine | Tatooine | Human |
| Darth Vader | none | white | yellow | male | masculine | Tatooine | Human |
| Leia Organa | brown | light | brown | female | feminine | Alderaan | Human |
| Owen Lars | brown, grey | light | blue | male | masculine | Tatooine | Human |
| Beru Whitesun Lars | brown | light | blue | female | feminine | Tatooine | Human |
| Biggs Darklighter | black | light | brown | male | masculine | Tatooine | Human |
| Obi-Wan Kenobi | auburn, white | fair | blue-gray | male | masculine | Stewjon | Human |
| Anakin Skywalker | blond | fair | blue | male | masculine | Tatooine | Human |
| Chewbacca | brown | unknown | blue | male | masculine | Kashyyyk | Wookiee |
| Han Solo | brown | fair | brown | male | masculine | Corellia | Human |
| Wedge Antilles | brown | fair | hazel | male | masculine | Corellia | Human |
| Palpatine | grey | pale | yellow | male | masculine | Naboo | Human |
| Boba Fett | black | fair | brown | male | masculine | Kamino | Human |
| Bossk | none | green | red | male | masculine | Trandosha | Trandoshan |
| Lando Calrissian | black | dark | brown | male | masculine | Socorro | Human |
| Lobot | none | light | blue | male | masculine | Bespin | Human |
| Ackbar | none | brown mottle | orange | male | masculine | Mon Cala | Mon Calamari |
| Wicket Systri Warrick | brown | brown | brown | male | masculine | Endor | Ewok |
| Padmé Amidala | brown | light | brown | female | feminine | Naboo | Human |
| Jar Jar Binks | none | orange | orange | male | masculine | Naboo | Gungan |
| Darth Maul | none | red | yellow | male | masculine | Dathomir | Zabrak |
| Ayla Secura | none | blue | hazel | female | feminine | Ryloth | Twi’lek |
| Mace Windu | none | dark | brown | male | masculine | Haruun Kal | Human |
| Ki-Adi-Mundi | white | pale | yellow | male | masculine | Cerea | Cerean |
| Plo Koon | none | orange | black | male | masculine | Dorin | Kel Dor |
| Luminara Unduli | black | yellow | blue | female | feminine | Mirial | Mirialan |
| Barriss Offee | black | yellow | blue | female | feminine | Mirial | Mirialan |
| Dooku | white | fair | brown | male | masculine | Serenno | Human |
| Jango Fett | black | tan | brown | male | masculine | Concord Dawn | Human |
Using flights
Useful only in conjunction with other methods
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453145 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530296 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007100 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971382 |
Useful only in conjunction with other methods
reduces multiple values down to a single value
| mean_lifexp |
|---|
| 59.4744393662 |
Combining group_by() and summarise() gives us so much more power!
| country | n_years | Mean_Life_Exp | SD_Life_Exp | Max_GDP |
|---|---|---|---|---|
| Australia | 6 | 78.1758333333 | 2.45882576989 | 34435.36744 |
| New Zealand | 6 | 76.8923333333 | 2.55215843291 | 25185.00911 |
We can use lots of built-in functions when using summarise:
Note: Don’t forget to include na.rm = TRUE if NAs are present
Using gapminder
Using airquality
Using airquality
summarise_if() and is.integer()Using starwars
print(n = x), where x is a suitably high number of rowsdplyr::glimpse() instead of str()everything() to re-order columnsn_distinct() instead of length(unique())dplyr::rename(new_name = old_name) for renaming columnsdplyr::select_() and other variantsdplyr::near() and dplyr::between()coalesce(), recode() and case_when() from dplyrtibble::rownames_to_column()dplyr::select()grepl() is better done using stringr’s str_detect() (introduced tomorrow)stringr (tomorrow) combined with dplyr (more tomorrow) and the native pipe (or maggritr pipe if needed) will cover a lot of your everyday needsWhat if:
Suggested Reading
browseVignettes(package = "dplyr")met581 - programming in r - slides and code avilable here