Working with JSON Data

When using R to read JSON files or datasets containing columns of JSON objects or arrays, we often get very long strings or deeply nested lists. However, ideally, we would want to work with data structures that are more easily to manipulate in R, such as vectors or data frames. How to do transformations and solve the problems then?

In this post, we discuss four types of scenarios, and questions arising from each scenario, with three sample datasets:

How to handle JSON objects and flatten list-column of data frames
How to handle nested JSON objects and rectangle deeply nested lists
How to handle JSON arrays and flatten list-column of vectors
How to handle JSON files

We will be using tools from several packages to handle each situation.

Reading JSON objects, arrays or files:

jsonlite::fromJSON() reads JSON objects and arrays.
jsonlite::stream_in() implements line-by-line processing of JSON data and works over a connection (e.g. url, file).

Flattening list of data frames, list of vectors, or nested data frames:

tidyr::unnest() flattens a list-column of data frames to regular columns.
tidyr::unnest_longer() and tidyr::unnest_wider() rectangle and collapse deeply nested lists into regular columns. unnest_longer() turns each element of a list-column into a row; it preserves the columns, but changes the rows. unnest_wider() turns each element of a list-column into a column; it preserves the rows, but changes the columns.
jsonlite::flatten() flattens nested data frames.

Working with lists:

purrr::map() applies a function to each element of a vector. It can also extract elements from a deeply nested list. purrr::map() returns a list.
plyr::ldply splits a list, applies a function, and returns results in a data frame.

Handling JSON objects

The first sample dataset we will be using is ted_main.csv, which is part of the Ted Talks data that can be downloaded from Kaggle.

## # A tibble: 6 x 17
##   comments description duration event film_date languages main_speaker name 
##      <dbl> <chr>          <dbl> <chr>     <dbl>     <dbl> <chr>        <chr>
## 1     4553 "Sir Ken R…     1164 TED2…    1.14e9        60 Ken Robinson Ken …
## 2      265 "With the …      977 TED2…    1.14e9        43 Al Gore      Al G…
## 3      124 "New York …     1286 TED2…    1.14e9        26 David Pogue  Davi…
## 4      200 "In an emo…     1116 TED2…    1.14e9        35 Majora Cart… Majo…
## 5      593 "You've ne…     1190 TED2…    1.14e9        48 Hans Rosling Hans…
## 6      672 "Tony Robb…     1305 TED2…    1.14e9        36 Tony Robbins Tony…
## # … with 9 more variables: num_speaker <dbl>, published_date <dbl>,
## #   ratings <chr>, related_talks <chr>, speaker_occupation <chr>, tags <chr>,
## #   title <chr>, url <chr>, views <dbl>

In this dataset, ratings takes the form of JSON objects, which is a stringified dictionary of the various ratings given to the talk.

## [1] "[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"

reading JSON objects

Using jsonlite::fromJSON to read ratings, we get a list of data frames.

library(jsonlite)
library(purrr)

### fromJSON takes JSON strings
ted$ratings2 <- gsub("'", '"', ted$ratings)
ted$ratings2 <- map(ted$ratings2, fromJSON)
head(ted$ratings2, 2)

## [[1]]
##    id         name count
## 1   7        Funny 19645
## 2   1    Beautiful  4573
## 3   9    Ingenious  6073
## 4   3   Courageous  3253
## 5  11   Longwinded   387
## 6   2    Confusing   242
## 7   8  Informative  7346
## 8  22  Fascinating 10581
## 9  21 Unconvincing   300
## 10 24   Persuasive 10704
## 11 23 Jaw-dropping  4439
## 12 25           OK  1174
## 13 26    Obnoxious   209
## 14 10    Inspiring 24924
## 
## [[2]]
##    id         name count
## 1   7        Funny   544
## 2   3   Courageous   139
## 3   2    Confusing    62
## 4   1    Beautiful    58
## 5  21 Unconvincing   258
## 6  11   Longwinded   113
## 7   8  Informative   443
## 8  10    Inspiring   413
## 9  22  Fascinating   132
## 10  9    Ingenious    56
## 11 24   Persuasive   268
## 12 23 Jaw-dropping   116
## 13 26    Obnoxious   131
## 14 25           OK   203

flattening list-column of data frames

The question then becomes how to flatten the list-column of data frames to regular columns.

One solution is to use tidyr::unnest(). For each title, we take out the frame of ratings from the list ratings2; these data frames consist of three columns id, name and count. The flattened data frame is in the long format.

Below we show the output of the first two talks.

library(tidyr)
library(dplyr)

rating <- ted %>% select(title, ratings2) %>% unnest(ratings2) 
head(rating, 20)

## # A tibble: 20 x 4
##    title                          id name         count
##    <chr>                       <int> <chr>        <int>
##  1 Do schools kill creativity?     7 Funny        19645
##  2 Do schools kill creativity?     1 Beautiful     4573
##  3 Do schools kill creativity?     9 Ingenious     6073
##  4 Do schools kill creativity?     3 Courageous    3253
##  5 Do schools kill creativity?    11 Longwinded     387
##  6 Do schools kill creativity?     2 Confusing      242
##  7 Do schools kill creativity?     8 Informative   7346
##  8 Do schools kill creativity?    22 Fascinating  10581
##  9 Do schools kill creativity?    21 Unconvincing   300
## 10 Do schools kill creativity?    24 Persuasive   10704
## 11 Do schools kill creativity?    23 Jaw-dropping  4439
## 12 Do schools kill creativity?    25 OK            1174
## 13 Do schools kill creativity?    26 Obnoxious      209
## 14 Do schools kill creativity?    10 Inspiring    24924
## 15 Averting the climate crisis     7 Funny          544
## 16 Averting the climate crisis     3 Courageous     139
## 17 Averting the climate crisis     2 Confusing       62
## 18 Averting the climate crisis     1 Beautiful       58
## 19 Averting the climate crisis    21 Unconvincing   258
## 20 Averting the climate crisis    11 Longwinded     113

Handling nested JSON objects

The next demo dataset yelp_academic_dataset_business.json is from the Yelp Dataset, which can also be found on Kaggle.

## # A tibble: 6 x 10
##   business_id name  full_address hours open  categories city  state review_count
##   <chr>       <chr> <chr>        <chr> <lgl> <chr>      <chr> <chr>        <dbl>
## 1 5UmKMjUEUN… Mr H… "4734 Leban… {'Fr… TRUE  ['Fast Fo… Drav… PA               4
## 2 UsFtqoBl7n… Clan… "202 McClur… {}    TRUE  ['Nightli… Drav… PA               4
## 3 3eu6MEFlq2… Joe … "1 Ravine S… {}    TRUE  ['Auto Re… Drav… PA               3
## 4 cE27W9VPgO… Cool… "1530 Hamil… {}    FALSE ['Active … Beth… PA               5
## 5 HZdLhv6COC… Veri… "301 South … {'Th… TRUE  ['Shoppin… Pitt… PA               5
## 6 mVHrayjG3u… Emil… "414 Hawkin… {'Tu… TRUE  ['Bars', … rank… PA              20
## # … with 1 more variable: stars <dbl>

hours takes the form of nested JSON objects, slightly more complex than the previous case.

head(business$hours)

## [1] "{'Friday': {'close': '21:00', 'open': '11:00'}, 'Wednesday': {'close': '21:00', 'open': '11:00'}, 'Thursday': {'close': '21:00', 'open': '11:00'}, 'Monday': {'close': '21:00', 'open': '11:00'}, 'Tuesday': {'close': '21:00', 'open': '11:00'}}"                                                                                                
## [2] "{}"                                                                                                                                                                                                                                                                                                                                               
## [3] "{}"                                                                                                                                                                                                                                                                                                                                               
## [4] "{}"                                                                                                                                                                                                                                                                                                                                               
## [5] "{'Thursday': {'close': '17:00', 'open': '10:00'}, 'Saturday': {'close': '21:00', 'open': '10:00'}, 'Monday': {'close': '21:00', 'open': '10:00'}, 'Sunday': {'close': '18:00', 'open': '11:00'}, 'Friday': {'close': '17:00', 'open': '10:00'}, 'Tuesday': {'close': '21:00', 'open': '10:00'}, 'Wednesday': {'close': '21:00', 'open': '10:00'}}"
## [6] "{'Tuesday': {'close': '19:00', 'open': '10:00'}, 'Wednesday': {'close': '19:00', 'open': '10:00'}, 'Thursday': {'close': '19:00', 'open': '10:00'}, 'Saturday': {'close': '16:00', 'open': '10:00'}, 'Friday': {'close': '20:00', 'open': '10:00'}}"

Converting hours using jsonlite, we get nested lists. For each business, there is a nested list of its opening and closing hours by day of a week.

business$hours2 <- gsub("'", '"', business$hours)
business$hours2 <- map(business$hours2, fromJSON)
head(business$hours2, 1)

## [[1]]
## [[1]]$Friday
## [[1]]$Friday$close
## [1] "21:00"
## 
## [[1]]$Friday$open
## [1] "11:00"
## 
## 
## [[1]]$Wednesday
## [[1]]$Wednesday$close
## [1] "21:00"
## 
## [[1]]$Wednesday$open
## [1] "11:00"
## 
## 
## [[1]]$Thursday
## [[1]]$Thursday$close
## [1] "21:00"
## 
## [[1]]$Thursday$open
## [1] "11:00"
## 
## 
## [[1]]$Monday
## [[1]]$Monday$close
## [1] "21:00"
## 
## [[1]]$Monday$open
## [1] "11:00"
## 
## 
## [[1]]$Tuesday
## [[1]]$Tuesday$close
## [1] "21:00"
## 
## [[1]]$Tuesday$open
## [1] "11:00"

It would make sense if we can extract the elements from each nested list and place everything properly into a data frame like the one below.

## # A tibble: 34 x 4
##    name                     close open  day      
##    <chr>                    <chr> <chr> <chr>    
##  1 Mr Hoagie                21:00 11:00 Friday   
##  2 Mr Hoagie                21:00 11:00 Wednesday
##  3 Mr Hoagie                21:00 11:00 Thursday 
##  4 Mr Hoagie                21:00 11:00 Monday   
##  5 Mr Hoagie                21:00 11:00 Tuesday  
##  6 Clancy's Pub             <NA>  <NA>  <NA>     
##  7 Joe Cislo's Auto         <NA>  <NA>  <NA>     
##  8 Cool Springs Golf Center <NA>  <NA>  <NA>     
##  9 Verizon                  17:00 10:00 Thursday 
## 10 Verizon                  21:00 10:00 Saturday 
## # … with 24 more rows

rectangling deeply nested lists

We use unnest_longer() and unnest_wider() to collapse deeply nested lists into regular columns. unnest_longer() turns each element of a list-column into a row; unnest_wider() turns each element of a list-column into a column.

Below we start with the first level of the list, days of a week. We use unnest_longer() to put the days into a new column. The data frame grows longer with the days grouped under each name.

hours <- business %>% select(name, hours2)
hours %>% unnest_longer(hours2)

## # A tibble: 34 x 3
##    name                     hours2           hours2_id
##    <chr>                    <named list>     <chr>    
##  1 Mr Hoagie                <named list [2]> Friday   
##  2 Mr Hoagie                <named list [2]> Wednesday
##  3 Mr Hoagie                <named list [2]> Thursday 
##  4 Mr Hoagie                <named list [2]> Monday   
##  5 Mr Hoagie                <named list [2]> Tuesday  
##  6 Clancy's Pub             <NULL>           <NA>     
##  7 Joe Cislo's Auto         <NULL>           <NA>     
##  8 Cool Springs Golf Center <NULL>           <NA>     
##  9 Verizon                  <named list [2]> Thursday 
## 10 Verizon                  <named list [2]> Saturday 
## # … with 24 more rows

We then work on the elements on the second level of the list, opening and closing hours of the businesses. We use unnest_wider() to put the opening and closing hours into the same row with the day they belong to. The data frame grows wider.

hours %>% unnest_longer(hours2) %>% unnest_wider(hours2)

## # A tibble: 34 x 4
##    name                     close open  hours2_id
##    <chr>                    <chr> <chr> <chr>    
##  1 Mr Hoagie                21:00 11:00 Friday   
##  2 Mr Hoagie                21:00 11:00 Wednesday
##  3 Mr Hoagie                21:00 11:00 Thursday 
##  4 Mr Hoagie                21:00 11:00 Monday   
##  5 Mr Hoagie                21:00 11:00 Tuesday  
##  6 Clancy's Pub             <NA>  <NA>  <NA>     
##  7 Joe Cislo's Auto         <NA>  <NA>  <NA>     
##  8 Cool Springs Golf Center <NA>  <NA>  <NA>     
##  9 Verizon                  17:00 10:00 Thursday 
## 10 Verizon                  21:00 10:00 Saturday 
## # … with 24 more rows

Finally, after some postprocessing we get a data frame that is easier to manipulate and displays the information buried in the lists.

hours %>% 
  unnest_longer(hours2) %>% 
  unnest_wider(hours2) %>%
  select(name, close, open, day = hours2_id)

## # A tibble: 34 x 4
##    name                     close open  day      
##    <chr>                    <chr> <chr> <chr>    
##  1 Mr Hoagie                21:00 11:00 Friday   
##  2 Mr Hoagie                21:00 11:00 Wednesday
##  3 Mr Hoagie                21:00 11:00 Thursday 
##  4 Mr Hoagie                21:00 11:00 Monday   
##  5 Mr Hoagie                21:00 11:00 Tuesday  
##  6 Clancy's Pub             <NA>  <NA>  <NA>     
##  7 Joe Cislo's Auto         <NA>  <NA>  <NA>     
##  8 Cool Springs Golf Center <NA>  <NA>  <NA>     
##  9 Verizon                  17:00 10:00 Thursday 
## 10 Verizon                  21:00 10:00 Saturday 
## # … with 24 more rows

Handling JSON arrays

In the sample dataset yelp_academic_dataset_business.json, we have a variable categories that takes the form of JSON arrays.

head(business$categories)

## [1] "['Fast Food', 'Restaurants']"                                                                                        
## [2] "['Nightlife']"                                                                                                       
## [3] "['Auto Repair', 'Automotive']"                                                                                       
## [4] "['Active Life', 'Mini Golf', 'Golf']"                                                                                
## [5] "['Shopping', 'Home Services', 'Internet Service Providers', 'Mobile Phones', 'Professional Services', 'Electronics']"
## [6] "['Bars', 'American (New)', 'Nightlife', 'Lounges', 'Restaurants']"

Again, we use jsonlite::fromJSON and purrr::map to read the JSON arrays. We get a list-column of vectors categories2.

business$categories2 <- gsub("'", '"', business$categories)
business$categories2 <- map(business$categories2, fromJSON)
head(business$categories2)

## [[1]]
## [1] "Fast Food"   "Restaurants"
## 
## [[2]]
## [1] "Nightlife"
## 
## [[3]]
## [1] "Auto Repair" "Automotive" 
## 
## [[4]]
## [1] "Active Life" "Mini Golf"   "Golf"       
## 
## [[5]]
## [1] "Shopping"                   "Home Services"             
## [3] "Internet Service Providers" "Mobile Phones"             
## [5] "Professional Services"      "Electronics"               
## 
## [[6]]
## [1] "Bars"           "American (New)" "Nightlife"      "Lounges"       
## [5] "Restaurants"

flattening list-column of vectors

Using unnest_longer(), we convert the list-column of vectors to a regular column and turn the data frame into the longer format.

business %>% select(name, categories2) %>% unnest_longer(categories2)

## # A tibble: 31 x 2
##    name                     categories2  
##    <chr>                    <chr>        
##  1 Mr Hoagie                Fast Food    
##  2 Mr Hoagie                Restaurants  
##  3 Clancy's Pub             Nightlife    
##  4 Joe Cislo's Auto         Auto Repair  
##  5 Joe Cislo's Auto         Automotive   
##  6 Cool Springs Golf Center Active Life  
##  7 Cool Springs Golf Center Mini Golf    
##  8 Cool Springs Golf Center Golf         
##  9 Verizon                  Shopping     
## 10 Verizon                  Home Services
## # … with 21 more rows

Handling JSON files

Below we use the third sample dataset, log, to show how to handle JSON files, which we also discussed in this post with more examples.

The log file is part of a large dataset of server logs recording user behaviors of an online course on a learning platform.

The JSON data is in the ndjson format that jsonlite::fromJSON does not directly support. The stream_in() function in the package implements line-by-line processing of JSON data in the ndjson format over a connection (e.g. url, file).

Therefore, we read the file line-by-line first. The output is a large character vector.

library(readr)

con <- read_lines("sample/tracking.log")
head(con, 1)

## [1] "{\"username\": \"\", \"event_type\": \"/notifier_api/v1/users/\", \"ip\": \"127.0.0.1\", \"agent\": \"python-requests/2.8.1\", \"host\": \"localhost\", \"referer\": \"\", \"accept_language\": \"\", \"event\": \"{\\\"POST\\\": {}, \\\"GET\\\": {\\\"page\\\": [\\\"1\\\"], \\\"page_size\\\": [\\\"10\\\"]}}\", \"event_source\": \"server\", \"context\": {\"user_id\": \"\", \"org_id\": \"\", \"course_id\": \"\", \"path\": \"/notifier_api/v1/users/\"}, \"time\": \"2016-01-05T17:50:03.793624+00:00\", \"page\": null}"

We then use jsonlite::stream_in() to implement line-by-line processing. Below, textConnection() inputs and outputs text connections. The output is a data frame of vectors, a data frame, and a list.

df <- stream_in(textConnection(con))

## 
 Found 500 records...
 Found 1000 records...
 Found 1500 records...
 Found 2000 records...
 Found 2180 records...
 Imported 2180 records. Simplifying...

head(df)

##   username              event_type            ip                 agent
## 1          /notifier_api/v1/users/     127.0.0.1 python-requests/2.8.1
## 2                                / 198.20.70.114                      
## 3                                / 198.20.70.114                      
## 4                      /robots.txt 198.20.70.114                      
## 5                     /sitemap.xml 198.20.70.114                      
## 6          /notifier_api/v1/users/     127.0.0.1 python-requests/2.8.1
##            host referer accept_language
## 1     localhost                        
## 2 52.192.239.60                        
## 3 52.192.239.60                        
## 4 52.192.239.60                        
## 5 52.192.239.60                        
## 6     localhost                        
##                                                       event event_source
## 1 {"POST": {}, "GET": {"page": ["1"], "page_size": ["10"]}}       server
## 2                                   {"POST": {}, "GET": {}}       server
## 3                                   {"POST": {}, "GET": {}}       server
## 4                                   {"POST": {}, "GET": {}}       server
## 5                                   {"POST": {}, "GET": {}}       server
## 6 {"POST": {}, "GET": {"page": ["1"], "page_size": ["10"]}}       server
##   context.user_id context.org_id context.course_id            context.path
## 1                                                  /notifier_api/v1/users/
## 2                                                                        /
## 3                                                                        /
## 4                                                              /robots.txt
## 5                                                             /sitemap.xml
## 6                                                  /notifier_api/v1/users/
##                               time page name session
## 1 2016-01-05T17:50:03.793624+00:00   NA <NA>    <NA>
## 2 2016-01-05T17:51:20.142531+00:00   NA <NA>    <NA>
## 3 2016-01-05T17:51:23.850085+00:00   NA <NA>    <NA>
## 4 2016-01-05T17:51:24.584924+00:00   NA <NA>    <NA>
## 5 2016-01-05T17:51:25.757229+00:00   NA <NA>    <NA>
## 6 2016-01-05T17:55:03.792682+00:00   NA <NA>    <NA>

We use jsonlite::flatten() to flatten the nested data frame context.

flatten(df$context)

If we use rjson to read the file line-by-line, the initial outputs can be more complex and it takes more steps to get what we want.

library(rjson) 
library(plyr)

lines <- read_lines("sample/tracking.log")
df2 <- ldply(lapply(lines, function(x) t(unlist(fromJSON(json_str = x)))))

In the code above, rjson::fromJSON() reads the first JSON object from the file and outputs a list.

a <- fromJSON(json_str = lines)
head(a)

## $username
## [1] ""
## 
## $event_type
## [1] "/notifier_api/v1/users/"
## 
## $ip
## [1] "127.0.0.1"
## 
## $agent
## [1] "python-requests/2.8.1"
## 
## $host
## [1] "localhost"
## 
## $referer
## [1] ""

unlist() unlists the list and transforms it to a vector. t() transposes the vector to a matrix.

b <- t(unlist(fromJSON(json_str = lines)))
head(b)

##      username event_type                ip          agent                  
## [1,] ""       "/notifier_api/v1/users/" "127.0.0.1" "python-requests/2.8.1"
##      host        referer accept_language
## [1,] "localhost" ""      ""             
##      event                                                                  
## [1,] "{\"POST\": {}, \"GET\": {\"page\": [\"1\"], \"page_size\": [\"10\"]}}"
##      event_source context.user_id context.org_id context.course_id
## [1,] "server"     ""              ""             ""               
##      context.path              time                              
## [1,] "/notifier_api/v1/users/" "2016-01-05T17:50:03.793624+00:00"

lapply() applies a function over a list or vector. In this case, it repeats what we have done above to every JSON object in lines. The output is a list.

l <- lapply(lines, function(x) t(unlist(fromJSON(json_str = x))))
head(l, 1)

As the last step, in order to get a data frame, we use the function plyr::ldply() that splits the list, applies a function, and returns results in a data frame.

df2 <- ldply(lapply(lines, function(x) t(unlist(fromJSON(json_str = x)))))
head(df2, 1)

##   username              event_type        ip                 agent      host
## 1          /notifier_api/v1/users/ 127.0.0.1 python-requests/2.8.1 localhost
##   referer accept_language
## 1                        
##                                                       event event_source
## 1 {"POST": {}, "GET": {"page": ["1"], "page_size": ["10"]}}       server
##   context.user_id context.org_id context.course_id            context.path
## 1                                                  /notifier_api/v1/users/
##                               time name session event.user_id event.old
## 1 2016-01-05T17:50:03.793624+00:00 <NA>    <NA>          <NA>      <NA>
##   event.setting event.table event.new
## 1          <NA>        <NA>      <NA>