Rectangling is the art and craft of taking a deeply nested list (often sourced from wild-caught JSON or XML) and taming it into a tidy data set of rows and columns. There are four functions from tidyr that are particularly useful for rectangling:

* `unnest_longer()` takes each element of a list-column and makes a new row.
* `unnest_wider()` takes each element of a list-column and makes a new column.
* `unnest_auto()` guesses whether you want `unnest_longer()` or `unnest_wider()`.
* `hoist()` is similar to `unnest_wider()` but only plucks out selected components, and can reach down multiple levels.

A very large number of data rectangling problems can be solved by combining these functions with a splash of dplyr (largely eliminating prior approaches that combined `mutate()` with multiple `purrr::map()` calls).
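To make the two directions concrete, here is a minimal sketch on an invented toy tibble `df` (not part of the repurrrsive data) whose `x` column is a list of named lists:

```r
library(tidyr)

# A toy list-column where every element is a named list:
df <- tibble::tibble(
  id = 1:2,
  x = list(
    list(a = 1, b = 2),
    list(a = 3, b = 4)
  )
)

# unnest_wider(): each named component becomes its own column (id, a, b).
unnest_wider(df, x)

# unnest_longer(): each component becomes its own row; the inner names
# are recorded in an x_id column, giving 4 rows.
unnest_longer(df, x)
```

The same list-column can be unnested in either direction; which one is right depends on whether the inner elements are variables (wider) or observations (longer).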
To illustrate these techniques, we'll use the repurrrsive package, which provides a number of deeply nested lists originally captured from web APIs.
We'll start with `gh_users`, a list which contains information about six GitHub users. To begin, we put the `gh_users` list into a data frame:

```r
users <- tibble(user = gh_users)
```
This seems a bit counter-intuitive: why is the first step in making a list simpler to make it more complicated? But a data frame has a big advantage: it bundles together multiple vectors so that everything is tracked together in a single object.
Each `user` is a named list, where each element represents a column.
```r
names(users$user[[1]])
#>  [1] "login"               "id"                  "avatar_url"
#>  [4] "gravatar_id"         "url"                 "html_url"
#>  [7] "followers_url"       "following_url"       "gists_url"
#> [10] "starred_url"         "subscriptions_url"   "organizations_url"
#> [13] "repos_url"           "events_url"          "received_events_url"
#> [16] "type"                "site_admin"          "name"
#> [19] "company"             "blog"                "location"
#> [22] "email"               "hireable"            "bio"
#> [25] "public_repos"        "public_gists"        "followers"
#> [28] "following"           "created_at"          "updated_at"
```
There are two ways to turn the list components into columns. `unnest_wider()` takes every component and makes a new column:
```r
users %>% unnest_wider(user)
#> # A tibble: 6 x 30
#>   login     id avatar_url gravatar_id url   html_url followers_url following_url
#>   <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>
#> 1 gabo… 6.60e5 https://a… ""          http… https:/… https://api.… https://api.…
#> 2 jenn… 5.99e5 https://a… ""          http… https:/… https://api.… https://api.…
#> 3 jtle… 1.57e6 https://a… ""          http… https:/… https://api.… https://api.…
#> 4 juli… 1.25e7 https://a… ""          http… https:/… https://api.… https://api.…
#> 5 leep… 3.51e6 https://a… ""          http… https:/… https://api.… https://api.…
#> 6 masa… 8.36e6 https://a… ""          http… https:/… https://api.… https://api.…
#> # … with 22 more variables: gists_url <chr>, starred_url <chr>,
#> #   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#> #   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#> #   name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
#> #   public_repos <int>, public_gists <int>, followers <int>, following <int>,
#> #   created_at <chr>, updated_at <chr>, bio <chr>, hireable <lgl>
```
But in this case there are many components and we don't need most of them, so we can instead use `hoist()`. `hoist()` allows us to pull out selected components using the same syntax as `purrr::pluck()`:
```r
users %>% hoist(user,
  followers = "followers",
  login = "login",
  url = "html_url"
)
#> # A tibble: 6 x 4
#>   followers login       url                            user
#>       <int> <chr>       <chr>                          <list>
#> 1       303 gaborcsardi https://github.com/gaborcsardi <named list [27]>
#> 2       780 jennybc     https://github.com/jennybc     <named list [27]>
#> 3      3958 jtleek      https://github.com/jtleek      <named list [27]>
#> 4       115 juliasilge  https://github.com/juliasilge  <named list [27]>
#> 5       213 leeper      https://github.com/leeper      <named list [27]>
#> 6        34 masalmon    https://github.com/masalmon    <named list [27]>
```
`hoist()` removes the named components from the `user` list-column, so you can think of it as moving components out of the inner list into the top-level data frame.
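Since `hoist()` borrows its path syntax from `purrr::pluck()`, a quick sketch of the equivalence: the `followers` component hoisted above can be plucked from a single user directly (the value 303 matches the first row of the hoisted table):

```r
library(purrr)
library(repurrrsive)

# pluck() walks into a nested list by name (or position), using the
# same path syntax that hoist() accepts:
pluck(gh_users[[1]], "followers")
#> [1] 303
```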
We start off `gh_repos` similarly, by putting it in a tibble:
```r
repos <- tibble(repo = gh_repos)
repos
#> # A tibble: 6 x 1
#>   repo
#>   <list>
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> 6 <list [30]>
```
This time each element of `repo` is a list of repositories that belong to one user. Repositories are observations, so they should become new rows, and we use `unnest_longer()` rather than `unnest_wider()`:
```r
repos <- repos %>% unnest_longer(repo)
repos
#> # A tibble: 176 x 1
#>    repo
#>    <list>
#>  1 <named list [68]>
#>  2 <named list [68]>
#>  3 <named list [68]>
#>  4 <named list [68]>
#>  5 <named list [68]>
#>  6 <named list [68]>
#>  7 <named list [68]>
#>  8 <named list [68]>
#>  9 <named list [68]>
#> 10 <named list [68]>
#> # … with 166 more rows
```
Then we can use `unnest_wider()` or `hoist()`:
```r
repos %>% hoist(repo,
  login = c("owner", "login"),
  name = "name",
  homepage = "homepage",
  watchers = "watchers_count"
)
#> # A tibble: 176 x 5
#>    login       name        homepage watchers repo
#>    <chr>       <chr>       <chr>       <int> <list>
#>  1 gaborcsardi after       <NA>            5 <named list [65]>
#>  2 gaborcsardi argufy      <NA>           19 <named list [65]>
#>  3 gaborcsardi ask         <NA>            5 <named list [65]>
#>  4 gaborcsardi baseimports <NA>            0 <named list [65]>
#>  5 gaborcsardi citest      <NA>            0 <named list [65]>
#>  6 gaborcsardi clisymbols  ""             18 <named list [65]>
#>  7 gaborcsardi cmaker      <NA>            0 <named list [65]>
#>  8 gaborcsardi cmark       <NA>            0 <named list [65]>
#>  9 gaborcsardi conditions  <NA>            0 <named list [65]>
#> 10 gaborcsardi crayon      <NA>           52 <named list [65]>
#> # … with 166 more rows
```
Note the use of `c("owner", "login")`: this allows us to reach two levels deep inside a list. An alternative approach would be to pull out just `owner` and then put each of its elements in a column:
```r
repos %>%
  hoist(repo, owner = "owner") %>%
  unnest_wider(owner)
#> # A tibble: 176 x 18
#>    login     id avatar_url gravatar_id url   html_url followers_url
#>    <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>
#>  1 gabo… 660288 https://a… ""          http… https:/… https://api.…
#>  2 gabo… 660288 https://a… ""          http… https:/… https://api.…
#>  3 gabo… 660288 https://a… ""          http… https:/… https://api.…
#>  4 gabo… 660288 https://a… ""          http… https:/… https://api.…
#>  5 gabo… 660288 https://a… ""          http… https:/… https://api.…
#>  6 gabo… 660288 https://a… ""          http… https:/… https://api.…
#>  7 gabo… 660288 https://a… ""          http… https:/… https://api.…
#>  8 gabo… 660288 https://a… ""          http… https:/… https://api.…
#>  9 gabo… 660288 https://a… ""          http… https:/… https://api.…
#> 10 gabo… 660288 https://a… ""          http… https:/… https://api.…
#> # … with 166 more rows, and 11 more variables: following_url <chr>,
#> #   gists_url <chr>, starred_url <chr>, subscriptions_url <chr>,
#> #   organizations_url <chr>, repos_url <chr>, events_url <chr>,
#> #   received_events_url <chr>, type <chr>, site_admin <lgl>, repo <list>
```
Instead of looking at the list and carefully thinking about whether it needs to become rows or columns, you can use `unnest_auto()`. It uses a handful of heuristics to figure out whether `unnest_longer()` or `unnest_wider()` is appropriate, and tells you about its reasoning.
```r
tibble(repo = gh_repos) %>%
  unnest_auto(repo) %>%
  unnest_auto(repo)
#> Using `unnest_longer(repo)`; no element has names
#> Using `unnest_wider(repo)`; elements have 68 names in common
#> # A tibble: 176 x 67
#>        id name  full_name owner private html_url description fork  url
#>     <int> <chr> <chr>     <lis> <lgl>   <chr>    <chr>       <lgl> <chr>
#>  1 6.12e7 after gaborcsa… <nam… FALSE   https:/… Run Code i… FALSE http…
#>  2 4.05e7 argu… gaborcsa… <nam… FALSE   https:/… Declarativ… FALSE http…
#>  3 3.64e7 ask   gaborcsa… <nam… FALSE   https:/… Friendly C… FALSE http…
#>  4 3.49e7 base… gaborcsa… <nam… FALSE   https:/… Do we get … FALSE http…
#>  5 6.16e7 cite… gaborcsa… <nam… FALSE   https:/… Test R pac… TRUE  http…
#>  6 3.39e7 clis… gaborcsa… <nam… FALSE   https:/… Unicode sy… FALSE http…
#>  7 3.72e7 cmak… gaborcsa… <nam… FALSE   https:/… port of cm… TRUE  http…
#>  8 6.80e7 cmark gaborcsa… <nam… FALSE   https:/… CommonMark… TRUE  http…
#>  9 6.32e7 cond… gaborcsa… <nam… FALSE   https:/… <NA>        TRUE  http…
#> 10 2.43e7 cray… gaborcsa… <nam… FALSE   https:/… R package … FALSE http…
#> # … with 166 more rows, and 58 more variables: forks_url <chr>, keys_url <chr>,
#> #   collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
#> #   issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
#> #   branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
#> #   git_refs_url <chr>, trees_url <chr>, statuses_url <chr>,
#> #   languages_url <chr>, stargazers_url <chr>, contributors_url <chr>,
#> #   subscribers_url <chr>, subscription_url <chr>, commits_url <chr>,
#> #   git_commits_url <chr>, comments_url <chr>, issue_comment_url <chr>,
#> #   contents_url <chr>, compare_url <chr>, merges_url <chr>, archive_url <chr>,
#> #   downloads_url <chr>, issues_url <chr>, pulls_url <chr>,
#> #   milestones_url <chr>, notifications_url <chr>, labels_url <chr>,
#> #   releases_url <chr>, deployments_url <chr>, created_at <chr>,
#> #   updated_at <chr>, pushed_at <chr>, git_url <chr>, ssh_url <chr>,
#> #   clone_url <chr>, svn_url <chr>, size <int>, stargazers_count <int>,
#> #   watchers_count <int>, language <chr>, has_issues <lgl>,
#> #   has_downloads <lgl>, has_wiki <lgl>, has_pages <lgl>, forks_count <int>,
#> #   open_issues_count <int>, forks <int>, open_issues <int>, watchers <int>,
#> #   default_branch <chr>, homepage <chr>
```
`got_chars` has a similar structure to `gh_users`: it's a list of named lists, where each element of the inner list describes some attribute of a GoT character. We start in the same way: first create a data frame, then unnest each component into a column:
```r
chars <- tibble(char = got_chars)
chars
#> # A tibble: 30 x 1
#>    char
#>    <list>
#>  1 <named list [18]>
#>  2 <named list [18]>
#>  3 <named list [18]>
#>  4 <named list [18]>
#>  5 <named list [18]>
#>  6 <named list [18]>
#>  7 <named list [18]>
#>  8 <named list [18]>
#>  9 <named list [18]>
#> 10 <named list [18]>
#> # … with 20 more rows

chars2 <- chars %>% unnest_wider(char)
chars2
#> # A tibble: 30 x 18
#>    url      id name  gender culture born  died  alive titles aliases father
#>    <chr> <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr>
#>  1 http…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr … <chr [… ""
#>  2 http…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr … <chr [… ""
#>  3 http…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr … <chr [… ""
#>  4 http…  1109 Will  Male   ""      ""    "In … FALSE <chr … <chr [… ""
#>  5 http…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr … <chr [… ""
#>  6 http…  1267 Chett Male   ""      "At … "In … FALSE <chr … <chr [… ""
#>  7 http…  1295 Cres… Male   ""      "In … "In … FALSE <chr … <chr [… ""
#>  8 http…   130 Aria… Female "Dorni… "In … ""    TRUE  <chr … <chr [… ""
#>  9 http…  1303 Daen… Female "Valyr… "In … ""    TRUE  <chr … <chr [… ""
#> 10 http…  1319 Davo… Male   "Weste… "In … ""    TRUE  <chr … <chr [… ""
#> # … with 20 more rows, and 7 more variables: mother <chr>, spouse <chr>,
#> #   allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
#> #   playedBy <list>
```
This is more complex than `gh_users` because some components of `char` are themselves lists, giving us a collection of list-columns:
```r
chars2 %>% select_if(is.list)
#> # A tibble: 30 x 7
#>    titles    aliases    allegiances books     povBooks  tvSeries  playedBy
#>    <list>    <list>     <list>      <list>    <list>    <list>    <list>
#>  1 <chr [3]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr [6]> <chr [1]>
#>  2 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr [6]> <chr [1]>
#>  3 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [1]> <chr [1]>
#>  4 <chr [1]> <chr [1]>  <NULL>      <chr [1]> <chr [1]> <chr [1]> <chr [1]>
#>  5 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [2]> <chr [1]>
#>  6 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr [1]> <chr [1]>
#>  7 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr [1]> <chr [1]>
#>  8 <chr [1]> <chr [1]>  <chr [1]>   <chr [4]> <chr [1]> <chr [1]> <chr [1]>
#>  9 <chr [5]> <chr [11]> <chr [1]>   <chr [1]> <chr [4]> <chr [6]> <chr [1]>
#> 10 <chr [4]> <chr [5]>  <chr [2]>   <chr [1]> <chr [3]> <chr [5]> <chr [1]>
#> # … with 20 more rows
```
What you do next will depend on the purposes of the analysis. Maybe you want a row for every book and TV series that the character appears in:
```r
chars2 %>%
  select(name, books, tvSeries) %>%
  pivot_longer(c(books, tvSeries), names_to = "media", values_to = "value") %>%
  unnest_longer(value)
#> # A tibble: 180 x 3
#>    name             media    value
#>    <chr>            <chr>    <chr>
#>  1 Theon Greyjoy    books    A Game of Thrones
#>  2 Theon Greyjoy    books    A Storm of Swords
#>  3 Theon Greyjoy    books    A Feast for Crows
#>  4 Theon Greyjoy    tvSeries Season 1
#>  5 Theon Greyjoy    tvSeries Season 2
#>  6 Theon Greyjoy    tvSeries Season 3
#>  7 Theon Greyjoy    tvSeries Season 4
#>  8 Theon Greyjoy    tvSeries Season 5
#>  9 Theon Greyjoy    tvSeries Season 6
#> 10 Tyrion Lannister books    A Feast for Crows
#> # … with 170 more rows
```
Or maybe you want to build a table that lets you match title to name:
```r
chars2 %>%
  select(name, title = titles) %>%
  unnest_longer(title)
#> # A tibble: 60 x 2
#>    name              title
#>    <chr>             <chr>
#>  1 Theon Greyjoy     "Prince of Winterfell"
#>  2 Theon Greyjoy     "Captain of Sea Bitch"
#>  3 Theon Greyjoy     "Lord of the Iron Islands (by law of the green lands)"
#>  4 Tyrion Lannister  "Acting Hand of the King (former)"
#>  5 Tyrion Lannister  "Master of Coin (former)"
#>  6 Victarion Greyjoy "Lord Captain of the Iron Fleet"
#>  7 Victarion Greyjoy "Master of the Iron Victory"
#>  8 Will              ""
#>  9 Areo Hotah        "Captain of the Guard at Sunspear"
#> 10 Chett             ""
#> # … with 50 more rows
```
(Note that the empty titles (`""`) are due to an infelicity in the input `got_chars`: ideally people without titles would have a title vector of length 0, not a title vector of length 1 containing an empty string.)
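If the empty strings get in the way, one simple workaround (a sketch layered on the pipeline above, not part of the original analysis) is to filter them out after unnesting:

```r
library(tidyr)
library(dplyr)
library(repurrrsive)

chars2 <- tibble(char = got_chars) %>% unnest_wider(char)

titles_tbl <- chars2 %>%
  select(name, title = titles) %>%
  unnest_longer(title) %>%
  # drop the length-1 "" placeholders for title-less characters
  filter(title != "")
titles_tbl
```

This loses the title-less characters entirely, so it only suits analyses where those rows aren't needed.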
Again, we could rewrite this using `unnest_auto()`. This is convenient for exploration, but I wouldn't rely on it in the long term: `unnest_auto()` has the undesirable property that it will always succeed. That means if your data structure changes, `unnest_auto()` will continue to work, but might give very different output that causes cryptic failures from downstream functions.
```r
tibble(char = got_chars) %>%
  unnest_auto(char) %>%
  select(name, title = titles) %>%
  unnest_auto(title)
#> Using `unnest_wider(char)`; elements have 18 names in common
#> Using `unnest_longer(title)`; no element has names
#> # A tibble: 60 x 2
#>    name              title
#>    <chr>             <chr>
#>  1 Theon Greyjoy     "Prince of Winterfell"
#>  2 Theon Greyjoy     "Captain of Sea Bitch"
#>  3 Theon Greyjoy     "Lord of the Iron Islands (by law of the green lands)"
#>  4 Tyrion Lannister  "Acting Hand of the King (former)"
#>  5 Tyrion Lannister  "Master of Coin (former)"
#>  6 Victarion Greyjoy "Lord Captain of the Iron Fleet"
#>  7 Victarion Greyjoy "Master of the Iron Victory"
#>  8 Will              ""
#>  9 Areo Hotah        "Captain of the Guard at Sunspear"
#> 10 Chett             ""
#> # … with 50 more rows
```
Next we'll tackle a more complex form of data that comes from Google's geocoding service. It's against the terms of service to cache this data, so I first write a very simple wrapper around the API. This relies on having a Google Maps API key stored in an environment variable; if that's not available, these code chunks won't be run.
```r
has_key <- !identical(Sys.getenv("GOOGLE_MAPS_API_KEY"), "")
if (!has_key) {
  message("No Google Maps API key found; code chunks will not be run")
}
#> No Google Maps API key found; code chunks will not be run

# https://developers.google.com/maps/documentation/geocoding
geocode <- function(address, api_key = Sys.getenv("GOOGLE_MAPS_API_KEY")) {
  url <- "https://maps.googleapis.com/maps/api/geocode/json"
  url <- paste0(url, "?address=", URLencode(address), "&key=", api_key)
  jsonlite::read_json(url)
}
```
The list that this function returns is quite complex:

```r
houston <- geocode("Houston TX")
str(houston)
```
Fortunately, we can attack the problem step by step with tidyr functions. To make the problem a bit harder (!) and more realistic, I'll start by geocoding a few cities:

```r
city <- c("Houston", "LA", "New York", "Chicago", "Springfield")
city_geo <- purrr::map(city, geocode)
```
I'll put these results in a tibble, next to the original city name:

```r
loc <- tibble(city = city, json = city_geo)
loc
```
The first level contains components `status` and `results`, which we can reveal with `unnest_wider()`:
```r
loc %>% unnest_wider(json)
```
Notice that `results` is a list of lists. Most of the cities have one element (representing a unique match from the geocoding API), but Springfield has two. We can pull these out into separate rows with `unnest_longer()`:
```r
loc %>%
  unnest_wider(json) %>%
  unnest_longer(results)
```
Now these all have the same components, as revealed by `unnest_wider()`:
```r
loc %>%
  unnest_wider(json) %>%
  unnest_longer(results) %>%
  unnest_wider(results)
```
We can find the `lat` and `lng` coordinates by unnesting `geometry`:
```r
loc %>%
  unnest_wider(json) %>%
  unnest_longer(results) %>%
  unnest_wider(results) %>%
  unnest_wider(geometry)
```
And then `location`:

```r
loc %>%
  unnest_wider(json) %>%
  unnest_longer(results) %>%
  unnest_wider(results) %>%
  unnest_wider(geometry) %>%
  unnest_wider(location)
```
Again, `unnest_auto()` makes this simpler, with the small risk of failing in unexpected ways if the input structure changes:
```r
loc %>%
  unnest_auto(json) %>%
  unnest_auto(results) %>%
  unnest_auto(results) %>%
  unnest_auto(geometry) %>%
  unnest_auto(location)
```
We could also just look at the first address for each city:

```r
loc %>%
  unnest_wider(json) %>%
  hoist(results, first_result = 1) %>%
  unnest_wider(first_result) %>%
  unnest_wider(geometry) %>%
  unnest_wider(location)
```
Or use `hoist()` to dive deeply to get directly to `lat` and `lng`:
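The code for that step is missing here; what follows is a sketch of what it would look like, using `hoist()`'s `list()` path syntax (the column names `lat` and `lng` are my choice). Since the real `loc` above needs an API key, the example mocks one row with the same nested shape as a geocoding response:

```r
library(tidyr)

# A mocked geocoding result, shaped like the real API response,
# just to illustrate the deep hoist() paths:
loc <- tibble::tibble(
  city = "Houston",
  json = list(
    list(
      status = "OK",
      results = list(
        list(geometry = list(location = list(lat = 29.76, lng = -95.37)))
      )
    )
  )
)

# Pluck lat and lng from the first result in one step:
loc %>%
  hoist(json,
    lat = list("results", 1, "geometry", "location", "lat"),
    lng = list("results", 1, "geometry", "location", "lng")
  )
```

Each path mixes names and positions, exactly as in `purrr::pluck()`, so `1` selects the first match just as `first_result = 1` did above.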
I'd normally use `readr::parse_datetime()` or `lubridate::ymd_hms()`, but I can't here because it's a vignette and I don't want to add a dependency to tidyr just to simplify one example.↩︎