First of all, make sure you have read R Data Import/Export. The document is a comprehensive introduction to reading and saving files of various formats in R.
In addition to what has been documented there, below we have a few short notes.
RStudio offers click-and-select dropdown menus for importing data. On the Environment pane, we can find Import Dataset, with which we can easily import data of various formats, including delimited text files (e.g. csv, tsv), Excel files, and files from other statistical software (SPSS, SAS, Stata). The Data Import feature provides many import options (e.g. skipping lines, setting the delimiter, treating the first row as variable names or not, trimming spaces or not, setting what should be considered NA). It also provides a Code Preview window where we can review the generated R code and make edits.
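For example, the code generated for a delimited text file might look like the following sketch (the file name and option values here are hypothetical):
library(readr)
## skip the first 2 lines, treat the first row as column names,
## trim surrounding whitespace, and read "N/A" as missing
survey <- read_csv("survey.csv", skip = 2, col_names = TRUE,
                   trim_ws = TRUE, na = "N/A")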
RStudio has a full guide on importing data with RStudio.
There are quite a few pre-installed sample R datasets available for us to play with.
## lists all datasets in the package datasets
data()
## loads the singer dataset in the package lattice
data("singer", package = "lattice")
## lists all datasets in all available packages
data(package = .packages(all.available = TRUE))
Read the help files for descriptions of the pre-installed datasets.
library(datasets)
?sleep
Package readr provides faster ways to read rectangular text data than the base R functions. For instance, read_delim() reads a delimited file; read_csv() and read_tsv() are special cases of the general read_delim() function for reading comma-separated and tab-separated files.
Below we have a .txt file downloaded from a database, delimited by spaces, where the first three rows are empty and need to be skipped. The delim argument sets the separator, and the col_names argument tells R to treat the first row as containing the column names.
library(readr)
data <- read_delim(file = "file.txt", delim = " ", col_names = TRUE, skip = 3)
If we have a file where all missing cases are denoted by the character value "NULL", we can handle that while reading the file by setting the NAs to be "NULL".
data <- read_csv(file = "file.csv", na = "NULL")
Package readxl and package xlsx allow us to deal with Excel files of different versions conveniently. xlsx requires Java. Check the package documentation for more details.
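For instance, reading one sheet from a workbook with readxl might look like this (a minimal sketch; the file and sheet names are hypothetical):
library(readxl)
## reads the sheet named "2016" from an .xlsx workbook
records <- read_excel("records.xlsx", sheet = "2016")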
Package haven provides methods to read data from the major statistical software packages Stata, SPSS, and SAS.
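A minimal sketch of the corresponding readers, with hypothetical file names:
library(haven)
stata_data <- read_dta("file.dta")      ## Stata
spss_data <- read_sav("file.sav")       ## SPSS
sas_data <- read_sas("file.sas7bdat")   ## SAS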
Package foreign reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, and dBase. Note, however, that foreign does not support Stata formats after version 12.
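For example, reading an SPSS file into a data frame with foreign might look like this (the file name is hypothetical):
library(foreign)
## to.data.frame = TRUE returns a data frame instead of a list
data <- read.spss("file.sav", to.data.frame = TRUE)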
JSON (JavaScript Object Notation) is a lightweight data-interchange format. Introducing JSON explains what JSON is and how various programming languages support it.
In R, rjson and jsonlite, among other packages, read and write JSON data. jsonlite has additional cleaning functionality (e.g. flattening nested data frames) and can read ndjson data.
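As a quick illustration of the round trip with jsonlite (made-up data; auto_unbox = TRUE writes length-one vectors as JSON scalars rather than arrays):
library(jsonlite)
x <- list(name = "ada", scores = c(90, 95))
json <- toJSON(x, auto_unbox = TRUE)  ## R object -> JSON string
fromJSON(json)                        ## JSON string -> R object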
Below we discuss a case.
jsonlite
The sample file is a server log. Our JSON data is in the ndjson format, which jsonlite::fromJSON() does not support, so we read the file line by line first. The output is a large character vector.
library(readr)
library(jsonlite)
con <- read_lines("sample/tracking.log")
head(con, 1)
## [1] "{\"username\": \"\", \"event_type\": \"/notifier_api/v1/users/\", \"ip\": \"127.0.0.1\", \"agent\": \"python-requests/2.8.1\", \"host\": \"localhost\", \"referer\": \"\", \"accept_language\": \"\", \"event\": \"{\\\"POST\\\": {}, \\\"GET\\\": {\\\"page\\\": [\\\"1\\\"], \\\"page_size\\\": [\\\"10\\\"]}}\", \"event_source\": \"server\", \"context\": {\"user_id\": \"\", \"org_id\": \"\", \"course_id\": \"\", \"path\": \"/notifier_api/v1/users/\"}, \"time\": \"2016-01-05T17:50:03.793624+00:00\", \"page\": null}"
We then use jsonlite::stream_in() to process the JSON data line by line; it works over a connection (e.g. a url or a file). Here textConnection() turns the character vector into a text connection that stream_in() can read from. The output is a nested data frame.
df <- stream_in(textConnection(con))
## Found 500 records...
## Found 1000 records...
## Found 1500 records...
## Found 2000 records...
## Found 2180 records...
## Imported 2180 records. Simplifying...
We use jsonlite::flatten() to flatten the nested data frame.
df <- flatten(df)
head(df, 1)
## username event_type ip agent
## 1 /notifier_api/v1/users/ 127.0.0.1 python-requests/2.8.1
## host referer accept_language
## 1 localhost
## event event_source
## 1 {"POST": {}, "GET": {"page": ["1"], "page_size": ["10"]}} server
## time page name session context.user_id
## 1 2016-01-05T17:50:03.793624+00:00 NA <NA> <NA>
## context.org_id context.course_id context.path
## 1 /notifier_api/v1/users/
rjson
Using rjson, again we need to read the file line by line first. The output lines is a large character vector.
library(readr)
library(rjson)
library(plyr)
lines <- read_lines("sample/tracking.log")
df <- ldply(lapply(lines, function(x) t(unlist(fromJSON(json_str = x)))))
In the code above, rjson::fromJSON() reads a single JSON object. Applied to the first line of the file, the output is a list.
a <- fromJSON(json_str = lines[1])
head(a)
## $username
## [1] ""
##
## $event_type
## [1] "/notifier_api/v1/users/"
##
## $ip
## [1] "127.0.0.1"
##
## $agent
## [1] "python-requests/2.8.1"
##
## $host
## [1] "localhost"
##
## $referer
## [1] ""
unlist() flattens the list into a named vector, and t() transposes the vector into a one-row matrix.
b <- t(unlist(fromJSON(json_str = lines[1])))
head(b)
## username event_type ip
## [1,] "" "/notifier_api/v1/users/" "127.0.0.1"
## agent host referer accept_language
## [1,] "python-requests/2.8.1" "localhost" "" ""
## event
## [1,] "{\"POST\": {}, \"GET\": {\"page\": [\"1\"], \"page_size\": [\"10\"]}}"
## event_source context.user_id context.org_id context.course_id
## [1,] "server" "" "" ""
## context.path time
## [1,] "/notifier_api/v1/users/" "2016-01-05T17:50:03.793624+00:00"
lapply() applies a function over a list or vector. In this case, it repeats what we did above for every JSON object in lines. The output is a list.
l <- lapply(lines, function(x) t(unlist(fromJSON(json_str = x))))
head(l, 1)
As the last step, to get a data frame we use plyr::ldply(), which splits the list, applies a function, and returns the results in a data frame.
df <- ldply(lapply(lines, function(x) t(unlist(fromJSON(json_str = x)))))
head(df, 1)
## username event_type ip agent
## 1 /notifier_api/v1/users/ 127.0.0.1 python-requests/2.8.1
## host referer accept_language
## 1 localhost
## event event_source
## 1 {"POST": {}, "GET": {"page": ["1"], "page_size": ["10"]}} server
## context.user_id context.org_id context.course_id context.path
## 1 /notifier_api/v1/users/
## time name session event.user_id event.old
## 1 2016-01-05T17:50:03.793624+00:00 <NA> <NA> <NA> <NA>
## event.setting event.table event.new
## 1 <NA> <NA> <NA>
Below we extend the case above to read hundreds of compressed JSON files.
## creates an empty vector to store JSON objects from the files
vec <- c()
## list.files() returns a character vector of all compressed .gz files in the folder
logs <- list.files(pattern = "\\.gz$")
## the loop reads each file and concatenates its JSON objects onto vec
for (i in seq_along(logs)) {
  temp <- read_lines(logs[i])
  vec <- c(vec, temp)
}
## use the same method as in the section above
df <- ldply(lapply(vec, function(x) t(unlist(fromJSON(json_str = x)))))
RStudio has a website, Databases using R, that covers how to connect to databases, how to run queries, and which packages work with different databases.
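For a flavor of the workflow, here is a minimal sketch using DBI with an in-memory SQLite database (assuming the DBI and RSQLite packages are installed; the guide covers many other backends):
library(DBI)
## connect to a transient in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
## copy a sample data frame in, then run a SQL query over it
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
dbDisconnect(con)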