When using R to read JSON files or datasets containing columns of JSON objects or arrays, we often get very long strings or deeply nested lists. However, ideally, we would want to work with data structures that are more easily to manipulate in R, such as vectors or data frames. How to do transformations and solve the problems then?
In this post, we discuss four types of scenarios, and questions arising from each scenario, with three sample datasets:
We will be using tools from several packages to handle each situation.
Reading JSON objects, arrays or files:
jsonlite::fromJSON()
reads JSON objects and arrays.jsonlite::stream_in()
implements line-by-line processing of JSON data and works over a connection (e.g. url, file).Flattening list of data frames, list of vectors, or nested data frames:
tidyr::unnest()
flattens a list-column of data frames to regular columns.tidyr::unnest_longer()
and tidyr::unnest_wider()
rectangle and collapse deeply nested lists into regular columns. unnest_longer()
turns each element of a list-column into a row; it preserves the columns, but changes the rows. unnest_wider()
turns each element of a list-column into a column; it preserves the rows, but changes the columns.jsonlite::flatten()
flattens nested data frames.Working with lists:
purrr::map()
applies a function to each element of a vector. It can also extract elements from a deeply nested list. purrr::map()
returns a list.plyr::ldply
splits a list, applies a function, and returns results in a data frame.The first sample dataset we will be using is ted_main.csv
, which is part of the Ted Talks data that can be downloaded from Kaggle.
## # A tibble: 6 x 17
## comments description duration event film_date languages main_speaker name
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 4553 "Sir Ken R… 1164 TED2… 1.14e9 60 Ken Robinson Ken …
## 2 265 "With the … 977 TED2… 1.14e9 43 Al Gore Al G…
## 3 124 "New York … 1286 TED2… 1.14e9 26 David Pogue Davi…
## 4 200 "In an emo… 1116 TED2… 1.14e9 35 Majora Cart… Majo…
## 5 593 "You've ne… 1190 TED2… 1.14e9 48 Hans Rosling Hans…
## 6 672 "Tony Robb… 1305 TED2… 1.14e9 36 Tony Robbins Tony…
## # … with 9 more variables: num_speaker <dbl>, published_date <dbl>,
## # ratings <chr>, related_talks <chr>, speaker_occupation <chr>, tags <chr>,
## # title <chr>, url <chr>, views <dbl>
In this dataset, ratings
takes the form of JSON objects, which is a stringified dictionary of the various ratings given to the talk.
## [1] "[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"
Using jsonlite::fromJSON
to read ratings
, we get a list of data frames.
library(jsonlite)
library(purrr)
### fromJSON takes JSON strings
ted$ratings2 <- gsub("'", '"', ted$ratings)
ted$ratings2 <- map(ted$ratings2, fromJSON)
head(ted$ratings2, 2)
## [[1]]
## id name count
## 1 7 Funny 19645
## 2 1 Beautiful 4573
## 3 9 Ingenious 6073
## 4 3 Courageous 3253
## 5 11 Longwinded 387
## 6 2 Confusing 242
## 7 8 Informative 7346
## 8 22 Fascinating 10581
## 9 21 Unconvincing 300
## 10 24 Persuasive 10704
## 11 23 Jaw-dropping 4439
## 12 25 OK 1174
## 13 26 Obnoxious 209
## 14 10 Inspiring 24924
##
## [[2]]
## id name count
## 1 7 Funny 544
## 2 3 Courageous 139
## 3 2 Confusing 62
## 4 1 Beautiful 58
## 5 21 Unconvincing 258
## 6 11 Longwinded 113
## 7 8 Informative 443
## 8 10 Inspiring 413
## 9 22 Fascinating 132
## 10 9 Ingenious 56
## 11 24 Persuasive 268
## 12 23 Jaw-dropping 116
## 13 26 Obnoxious 131
## 14 25 OK 203
The question then becomes how to flatten the list-column of data frames to regular columns.
One solution is to use tidyr::unnest()
. For each title
, we take out the frame of ratings from the list ratings2
; these data frames consist of three columns id
, name
and count
. The flattened data frame is in the long format.
Below we show the output of the first two talks.
library(tidyr)
library(dplyr)
rating <- ted %>% select(title, ratings2) %>% unnest(ratings2)
head(rating, 20)
## # A tibble: 20 x 4
## title id name count
## <chr> <int> <chr> <int>
## 1 Do schools kill creativity? 7 Funny 19645
## 2 Do schools kill creativity? 1 Beautiful 4573
## 3 Do schools kill creativity? 9 Ingenious 6073
## 4 Do schools kill creativity? 3 Courageous 3253
## 5 Do schools kill creativity? 11 Longwinded 387
## 6 Do schools kill creativity? 2 Confusing 242
## 7 Do schools kill creativity? 8 Informative 7346
## 8 Do schools kill creativity? 22 Fascinating 10581
## 9 Do schools kill creativity? 21 Unconvincing 300
## 10 Do schools kill creativity? 24 Persuasive 10704
## 11 Do schools kill creativity? 23 Jaw-dropping 4439
## 12 Do schools kill creativity? 25 OK 1174
## 13 Do schools kill creativity? 26 Obnoxious 209
## 14 Do schools kill creativity? 10 Inspiring 24924
## 15 Averting the climate crisis 7 Funny 544
## 16 Averting the climate crisis 3 Courageous 139
## 17 Averting the climate crisis 2 Confusing 62
## 18 Averting the climate crisis 1 Beautiful 58
## 19 Averting the climate crisis 21 Unconvincing 258
## 20 Averting the climate crisis 11 Longwinded 113
The next demo dataset yelp_academic_dataset_business.json
is from the Yelp Dataset, which can also be found on Kaggle.
## # A tibble: 6 x 10
## business_id name full_address hours open categories city state review_count
## <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <dbl>
## 1 5UmKMjUEUN… Mr H… "4734 Leban… {'Fr… TRUE ['Fast Fo… Drav… PA 4
## 2 UsFtqoBl7n… Clan… "202 McClur… {} TRUE ['Nightli… Drav… PA 4
## 3 3eu6MEFlq2… Joe … "1 Ravine S… {} TRUE ['Auto Re… Drav… PA 3
## 4 cE27W9VPgO… Cool… "1530 Hamil… {} FALSE ['Active … Beth… PA 5
## 5 HZdLhv6COC… Veri… "301 South … {'Th… TRUE ['Shoppin… Pitt… PA 5
## 6 mVHrayjG3u… Emil… "414 Hawkin… {'Tu… TRUE ['Bars', … rank… PA 20
## # … with 1 more variable: stars <dbl>
hours
takes the form of nested JSON objects, slightly more complex than the previous case.
head(business$hours)
## [1] "{'Friday': {'close': '21:00', 'open': '11:00'}, 'Wednesday': {'close': '21:00', 'open': '11:00'}, 'Thursday': {'close': '21:00', 'open': '11:00'}, 'Monday': {'close': '21:00', 'open': '11:00'}, 'Tuesday': {'close': '21:00', 'open': '11:00'}}"
## [2] "{}"
## [3] "{}"
## [4] "{}"
## [5] "{'Thursday': {'close': '17:00', 'open': '10:00'}, 'Saturday': {'close': '21:00', 'open': '10:00'}, 'Monday': {'close': '21:00', 'open': '10:00'}, 'Sunday': {'close': '18:00', 'open': '11:00'}, 'Friday': {'close': '17:00', 'open': '10:00'}, 'Tuesday': {'close': '21:00', 'open': '10:00'}, 'Wednesday': {'close': '21:00', 'open': '10:00'}}"
## [6] "{'Tuesday': {'close': '19:00', 'open': '10:00'}, 'Wednesday': {'close': '19:00', 'open': '10:00'}, 'Thursday': {'close': '19:00', 'open': '10:00'}, 'Saturday': {'close': '16:00', 'open': '10:00'}, 'Friday': {'close': '20:00', 'open': '10:00'}}"
Converting hours
using jsonlite
, we get nested lists. For each business, there is a nested list of its opening and closing hours by day of a week.
business$hours2 <- gsub("'", '"', business$hours)
business$hours2 <- map(business$hours2, fromJSON)
head(business$hours2, 1)
## [[1]]
## [[1]]$Friday
## [[1]]$Friday$close
## [1] "21:00"
##
## [[1]]$Friday$open
## [1] "11:00"
##
##
## [[1]]$Wednesday
## [[1]]$Wednesday$close
## [1] "21:00"
##
## [[1]]$Wednesday$open
## [1] "11:00"
##
##
## [[1]]$Thursday
## [[1]]$Thursday$close
## [1] "21:00"
##
## [[1]]$Thursday$open
## [1] "11:00"
##
##
## [[1]]$Monday
## [[1]]$Monday$close
## [1] "21:00"
##
## [[1]]$Monday$open
## [1] "11:00"
##
##
## [[1]]$Tuesday
## [[1]]$Tuesday$close
## [1] "21:00"
##
## [[1]]$Tuesday$open
## [1] "11:00"
It would make sense if we can extract the elements from each nested list and place everything properly into a data frame like the one below.
## # A tibble: 34 x 4
## name close open day
## <chr> <chr> <chr> <chr>
## 1 Mr Hoagie 21:00 11:00 Friday
## 2 Mr Hoagie 21:00 11:00 Wednesday
## 3 Mr Hoagie 21:00 11:00 Thursday
## 4 Mr Hoagie 21:00 11:00 Monday
## 5 Mr Hoagie 21:00 11:00 Tuesday
## 6 Clancy's Pub <NA> <NA> <NA>
## 7 Joe Cislo's Auto <NA> <NA> <NA>
## 8 Cool Springs Golf Center <NA> <NA> <NA>
## 9 Verizon 17:00 10:00 Thursday
## 10 Verizon 21:00 10:00 Saturday
## # … with 24 more rows
We use unnest_longer()
and unnest_wider()
to collapse deeply nested lists into regular columns. unnest_longer()
turns each element of a list-column into a row; unnest_wider()
turns each element of a list-column into a column.
Below we start with the first level of the list, days of a week. We use unnest_longer()
to put the days into a new column. The data frame grows longer with the days grouped under each name
.
hours <- business %>% select(name, hours2)
hours %>% unnest_longer(hours2)
## # A tibble: 34 x 3
## name hours2 hours2_id
## <chr> <named list> <chr>
## 1 Mr Hoagie <named list [2]> Friday
## 2 Mr Hoagie <named list [2]> Wednesday
## 3 Mr Hoagie <named list [2]> Thursday
## 4 Mr Hoagie <named list [2]> Monday
## 5 Mr Hoagie <named list [2]> Tuesday
## 6 Clancy's Pub <NULL> <NA>
## 7 Joe Cislo's Auto <NULL> <NA>
## 8 Cool Springs Golf Center <NULL> <NA>
## 9 Verizon <named list [2]> Thursday
## 10 Verizon <named list [2]> Saturday
## # … with 24 more rows
We then work on the elements on the second level of the list, opening and closing hours of the businesses. We use unnest_wider()
to put the opening and closing hours into the same row with the day they belong to. The data frame grows wider.
hours %>% unnest_longer(hours2) %>% unnest_wider(hours2)
## # A tibble: 34 x 4
## name close open hours2_id
## <chr> <chr> <chr> <chr>
## 1 Mr Hoagie 21:00 11:00 Friday
## 2 Mr Hoagie 21:00 11:00 Wednesday
## 3 Mr Hoagie 21:00 11:00 Thursday
## 4 Mr Hoagie 21:00 11:00 Monday
## 5 Mr Hoagie 21:00 11:00 Tuesday
## 6 Clancy's Pub <NA> <NA> <NA>
## 7 Joe Cislo's Auto <NA> <NA> <NA>
## 8 Cool Springs Golf Center <NA> <NA> <NA>
## 9 Verizon 17:00 10:00 Thursday
## 10 Verizon 21:00 10:00 Saturday
## # … with 24 more rows
Finally, after some postprocessing we get a data frame that is easier to manipulate and displays the information buried in the lists.
hours %>%
unnest_longer(hours2) %>%
unnest_wider(hours2) %>%
select(name, close, open, day = hours2_id)
## # A tibble: 34 x 4
## name close open day
## <chr> <chr> <chr> <chr>
## 1 Mr Hoagie 21:00 11:00 Friday
## 2 Mr Hoagie 21:00 11:00 Wednesday
## 3 Mr Hoagie 21:00 11:00 Thursday
## 4 Mr Hoagie 21:00 11:00 Monday
## 5 Mr Hoagie 21:00 11:00 Tuesday
## 6 Clancy's Pub <NA> <NA> <NA>
## 7 Joe Cislo's Auto <NA> <NA> <NA>
## 8 Cool Springs Golf Center <NA> <NA> <NA>
## 9 Verizon 17:00 10:00 Thursday
## 10 Verizon 21:00 10:00 Saturday
## # … with 24 more rows
In the sample dataset yelp_academic_dataset_business.json
, we have a variable categories
that takes the form of JSON arrays.
head(business$categories)
## [1] "['Fast Food', 'Restaurants']"
## [2] "['Nightlife']"
## [3] "['Auto Repair', 'Automotive']"
## [4] "['Active Life', 'Mini Golf', 'Golf']"
## [5] "['Shopping', 'Home Services', 'Internet Service Providers', 'Mobile Phones', 'Professional Services', 'Electronics']"
## [6] "['Bars', 'American (New)', 'Nightlife', 'Lounges', 'Restaurants']"
Again, we use jsonlite::fromJSON
and purrr::map
to read the JSON arrays. We get a list-column of vectors categories2
.
business$categories2 <- gsub("'", '"', business$categories)
business$categories2 <- map(business$categories2, fromJSON)
head(business$categories2)
## [[1]]
## [1] "Fast Food" "Restaurants"
##
## [[2]]
## [1] "Nightlife"
##
## [[3]]
## [1] "Auto Repair" "Automotive"
##
## [[4]]
## [1] "Active Life" "Mini Golf" "Golf"
##
## [[5]]
## [1] "Shopping" "Home Services"
## [3] "Internet Service Providers" "Mobile Phones"
## [5] "Professional Services" "Electronics"
##
## [[6]]
## [1] "Bars" "American (New)" "Nightlife" "Lounges"
## [5] "Restaurants"
Using unnest_longer()
, we convert the list-column of vectors to a regular column and turn the data frame into the longer format.
business %>% select(name, categories2) %>% unnest_longer(categories2)
## # A tibble: 31 x 2
## name categories2
## <chr> <chr>
## 1 Mr Hoagie Fast Food
## 2 Mr Hoagie Restaurants
## 3 Clancy's Pub Nightlife
## 4 Joe Cislo's Auto Auto Repair
## 5 Joe Cislo's Auto Automotive
## 6 Cool Springs Golf Center Active Life
## 7 Cool Springs Golf Center Mini Golf
## 8 Cool Springs Golf Center Golf
## 9 Verizon Shopping
## 10 Verizon Home Services
## # … with 21 more rows
Below we use the third sample dataset, log
, to show how to handle JSON files, which we also discussed in this post with more examples.
The log
file is part of a large dataset of server logs recording user behaviors of an online course on a learning platform.
The JSON data is in the ndjson format that jsonlite::fromJSON
does not directly support. The stream_in()
function in the package implements line-by-line processing of JSON data in the ndjson format over a connection (e.g. url, file).
Therefore, we read the file line-by-line first. The output is a large character vector.
library(readr)
con <- read_lines("sample/tracking.log")
head(con, 1)
## [1] "{\"username\": \"\", \"event_type\": \"/notifier_api/v1/users/\", \"ip\": \"127.0.0.1\", \"agent\": \"python-requests/2.8.1\", \"host\": \"localhost\", \"referer\": \"\", \"accept_language\": \"\", \"event\": \"{\\\"POST\\\": {}, \\\"GET\\\": {\\\"page\\\": [\\\"1\\\"], \\\"page_size\\\": [\\\"10\\\"]}}\", \"event_source\": \"server\", \"context\": {\"user_id\": \"\", \"org_id\": \"\", \"course_id\": \"\", \"path\": \"/notifier_api/v1/users/\"}, \"time\": \"2016-01-05T17:50:03.793624+00:00\", \"page\": null}"
We then use jsonlite::stream_in()
to implement line-by-line processing. Below, textConnection()
inputs and outputs text connections. The output is a data frame of vectors, a data frame, and a list.
df <- stream_in(textConnection(con))
##
Found 500 records...
Found 1000 records...
Found 1500 records...
Found 2000 records...
Found 2180 records...
Imported 2180 records. Simplifying...
head(df)
## username event_type ip agent
## 1 /notifier_api/v1/users/ 127.0.0.1 python-requests/2.8.1
## 2 / 198.20.70.114
## 3 / 198.20.70.114
## 4 /robots.txt 198.20.70.114
## 5 /sitemap.xml 198.20.70.114
## 6 /notifier_api/v1/users/ 127.0.0.1 python-requests/2.8.1
## host referer accept_language
## 1 localhost
## 2 52.192.239.60
## 3 52.192.239.60
## 4 52.192.239.60
## 5 52.192.239.60
## 6 localhost
## event event_source
## 1 {"POST": {}, "GET": {"page": ["1"], "page_size": ["10"]}} server
## 2 {"POST": {}, "GET": {}} server
## 3 {"POST": {}, "GET": {}} server
## 4 {"POST": {}, "GET": {}} server
## 5 {"POST": {}, "GET": {}} server
## 6 {"POST": {}, "GET": {"page": ["1"], "page_size": ["10"]}} server
## context.user_id context.org_id context.course_id context.path
## 1 /notifier_api/v1/users/
## 2 /
## 3 /
## 4 /robots.txt
## 5 /sitemap.xml
## 6 /notifier_api/v1/users/
## time page name session
## 1 2016-01-05T17:50:03.793624+00:00 NA <NA> <NA>
## 2 2016-01-05T17:51:20.142531+00:00 NA <NA> <NA>
## 3 2016-01-05T17:51:23.850085+00:00 NA <NA> <NA>
## 4 2016-01-05T17:51:24.584924+00:00 NA <NA> <NA>
## 5 2016-01-05T17:51:25.757229+00:00 NA <NA> <NA>
## 6 2016-01-05T17:55:03.792682+00:00 NA <NA> <NA>
We use jsonlite::flatten()
to flatten the nested data frame context
.
flatten(df$context)
If we use rjson
to read the file line-by-line, the initial outputs can be more complex and it takes more steps to get what we want.
library(rjson)
library(plyr)
lines <- read_lines("sample/tracking.log")
df2 <- ldply(lapply(lines, function(x) t(unlist(fromJSON(json_str = x)))))
In the code above, rjson::fromJSON()
reads the first JSON object from the file and outputs a list.
a <- fromJSON(json_str = lines)
head(a)
## $username
## [1] ""
##
## $event_type
## [1] "/notifier_api/v1/users/"
##
## $ip
## [1] "127.0.0.1"
##
## $agent
## [1] "python-requests/2.8.1"
##
## $host
## [1] "localhost"
##
## $referer
## [1] ""
unlist()
unlists the list and transforms it to a vector. t()
transposes the vector to a matrix.
b <- t(unlist(fromJSON(json_str = lines)))
head(b)
## username event_type ip agent
## [1,] "" "/notifier_api/v1/users/" "127.0.0.1" "python-requests/2.8.1"
## host referer accept_language
## [1,] "localhost" "" ""
## event
## [1,] "{\"POST\": {}, \"GET\": {\"page\": [\"1\"], \"page_size\": [\"10\"]}}"
## event_source context.user_id context.org_id context.course_id
## [1,] "server" "" "" ""
## context.path time
## [1,] "/notifier_api/v1/users/" "2016-01-05T17:50:03.793624+00:00"
lapply()
applies a function over a list or vector. In this case, it repeats what we have done above to every JSON object in lines
. The output is a list.
l <- lapply(lines, function(x) t(unlist(fromJSON(json_str = x))))
head(l, 1)
As the last step, in order to get a data frame, we use the function plyr::ldply()
that splits the list, applies a function, and returns results in a data frame.
df2 <- ldply(lapply(lines, function(x) t(unlist(fromJSON(json_str = x)))))
head(df2, 1)
## username event_type ip agent host
## 1 /notifier_api/v1/users/ 127.0.0.1 python-requests/2.8.1 localhost
## referer accept_language
## 1
## event event_source
## 1 {"POST": {}, "GET": {"page": ["1"], "page_size": ["10"]}} server
## context.user_id context.org_id context.course_id context.path
## 1 /notifier_api/v1/users/
## time name session event.user_id event.old
## 1 2016-01-05T17:50:03.793624+00:00 <NA> <NA> <NA> <NA>
## event.setting event.table event.new
## 1 <NA> <NA> <NA>