First of all, make sure you have read R Data Import/Export. The document is a comprehensive introduction to reading and saving files of various formats in R.
In addition to what has been documented there, below we have a few short notes.
RStudio offers click-and-select dropdown menus for importing data. On the Environment pane, we can find Import Dataset, with which we can easily import data of various formats, including delimited text files (e.g. csv, tsv), Excel files, and files from other statistical software (SPSS, SAS, Stata). The Data Import feature provides many import options (e.g. skipping lines, setting the delimiter, treating the first row as variable names or not, trimming spaces or not, setting what should be considered NA). It also provides a Code Preview window where we can review the generated R code and make edits.
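For example, the code generated for a delimited text file might look like the following sketch (the file name and option values here are hypothetical):
library(readr)
## skip the first 2 lines, treat the first row as column names,
## trim surrounding whitespace, and read "N/A" as missing
survey <- read_csv("survey.csv", skip = 2, col_names = TRUE,
                   trim_ws = TRUE, na = "N/A")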
RStudio has a full guide on importing data with RStudio.
There are quite a few pre-installed sample R datasets available for us to play with.
## lists all datasets in the package datasets
data()
## loads the singer dataset in the package lattice
data("singer", package = "lattice")
## lists all datasets in all available packages
data(package = .packages(all.available = TRUE))
Read the help files for descriptions of the pre-installed datasets.
library(datasets)
?sleep
Package readr provides faster ways to read rectangular text data than the base R functions. For instance, read_delim() reads a delimited file; read_csv() and read_tsv() are special cases of the general read_delim() function for reading comma-separated and tab-separated files.
Below we have a .txt file downloaded from a database, delimited by spaces, where the first three rows are empty and need to be skipped. The delim argument sets the separator, and the col_names argument tells R to treat the first row as containing the column names.
library(readr)
data <- read_delim(file = "file.txt", delim = " ", col_names = TRUE, skip = 3)
If we have a file where all missing cases are denoted by the character value "NULL", we can handle that while reading the file by setting the NAs to be "NULL".
data <- read_csv(file = "file.csv", na = "NULL")
Package readxl and package xlsx allow us to deal with Excel files of different versions conveniently. xlsx requires Java. Check the package documentation for more details.
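For instance, reading one sheet from a workbook with readxl might look like this (a minimal sketch; the file and sheet names are hypothetical):
library(readxl)
## reads the sheet named "2016" from an .xlsx workbook
records <- read_excel("records.xlsx", sheet = "2016")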
Package haven provides methods to read data from the major statistical software packages Stata, SPSS, and SAS.
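A minimal sketch of the corresponding readers, with hypothetical file names:
library(haven)
stata_data <- read_dta("file.dta")      ## Stata
spss_data <- read_sav("file.sav")       ## SPSS
sas_data <- read_sas("file.sas7bdat")   ## SAS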
Package foreign reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, and dBase. Note, however, that foreign does not support Stata formats after version 12.
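For example, reading an SPSS file into a data frame with foreign might look like this (the file name is hypothetical):
library(foreign)
## to.data.frame = TRUE returns a data frame instead of a list
data <- read.spss("file.sav", to.data.frame = TRUE)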
JSON (JavaScript Object Notation) is a lightweight data-interchange format. Introducing JSON explains what JSON is and how various programming languages support it.
In R, rjson and jsonlite, among other packages, read and write JSON data. jsonlite has additional cleaning functionality (e.g. flattening nested data frames) and can read ndjson data.
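As a quick illustration of the round trip with jsonlite (made-up data; auto_unbox = TRUE writes length-one vectors as JSON scalars rather than arrays):
library(jsonlite)
x <- list(name = "ada", scores = c(90, 95))
json <- toJSON(x, auto_unbox = TRUE)  ## R object -> JSON string
fromJSON(json)                        ## JSON string -> R object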
Below we discuss a case.
jsonlite
The sample file is a server log. Our JSON data is in the ndjson format, which jsonlite::fromJSON() does not support, so we read the file line by line first. The output is a large character vector.
library(readr)
library(jsonlite)
con <- read_lines("sample/tracking.log")
head(con, 1)
## [1] "{\"username\": \"\", \"event_type\": \"/notifier_api/v1/users/\", \"ip\": \"127.0.0.1\", \"agent\": \"python-requests/2.8.1\", \"host\": \"localhost\", \"referer\": \"\", \"accept_language\": \"\", \"event\": \"{\\\"POST\\\": {}, \\\"GET\\\": {\\\"page\\\": [\\\"1\\\"], \\\"page_size\\\": [\\\"10\\\"]}}\", \"event_source\": \"server\", \"context\": {\"user_id\": \"\", \"org_id\": \"\", \"course_id\": \"\", \"path\": \"/notifier_api/v1/users/\"}, \"time\": \"2016-01-05T17:50:03.793624+00:00\", \"page\": null}"
We then use jsonlite::stream_in() to process the JSON data line by line; it works over a connection (e.g. a url or a file). Here textConnection() turns the character vector into a text connection that stream_in() can read from. The output is a nested data frame.
df <- stream_in(textConnection(con))
## Found 500 records...
## Found 1000 records...
## Found 1500 records...
## Found 2000 records...
## Found 2180 records...
## Imported 2180 records. Simplifying...
We use jsonlite::flatten() to flatten the nested data frame.
df <- flatten(df)
head(df, 1)
## username event_type ip agent
## 1 /notifier_api/v1/users/ 127.0.0.1 python-requests/2.8.1
## host referer accept_language
## 1 localhost
## event event_source
## 1 {"POST": {}, "GET": {"page": ["1"], "page_size": ["10"]}} server
## time page name session context.user_id
## 1 2016-01-05T17:50:03.793624+00:00 NA <NA> <NA>
## context.org_id context.course_id context.path
## 1 /notifier_api/v1/users/
rjson
Using rjson, again we need to read the file line by line first. The output lines is a large character vector.
library(readr)
library(rjson)
library(plyr)
lines <- read_lines("sample/tracking.log")
df <- ldply(lapply(lines, function(x) t(unlist(fromJSON(json_str = x)))))
In the code above, rjson::fromJSON() reads a single JSON object. Applied to the first line of the file, the output is a list.
a <- fromJSON(json_str = lines[1])
head(a)
## $username
## [1] ""
##
## $event_type
## [1] "/notifier_api/v1/users/"
##
## $ip
## [1] "127.0.0.1"
##
## $agent
## [1] "python-requests/2.8.1"
##
## $host
## [1] "localhost"
##
## $referer
## [1] ""
unlist() flattens the list into a named vector, and t() transposes the vector into a one-row matrix.
b <- t(unlist(fromJSON(json_str = lines[1])))
head(b)
## username event_type ip
## [1,] "" "/notifier_api/v1/users/" "127.0.0.1"
## agent host referer accept_language
## [1,] "python-requests/2.8.1" "localhost" "" ""
## event
## [1,] "{\"POST\": {}, \"GET\": {\"page\": [\"1\"], \"page_size\": [\"10\"]}}"
## event_source context.user_id context.org_id context.course_id
## [1,] "server" "" "" ""
## context.path time
## [1,] "/notifier_api/v1/users/" "2016-01-05T17:50:03.793624+00:00"
lapply() applies a function over a list or vector. In this case, it repeats what we did above for every JSON object in lines. The output is a list.
l <- lapply(lines, function(x) t(unlist(fromJSON(json_str = x))))
head(l, 1)
As the last step, to get a data frame we use plyr::ldply(), which splits the list, applies a function, and returns the results in a data frame.
df <- ldply(lapply(lines, function(x) t(unlist(fromJSON(json_str = x)))))
head(df, 1)
## username event_type ip agent
## 1 /notifier_api/v1/users/ 127.0.0.1 python-requests/2.8.1
## host referer accept_language
## 1 localhost
## event event_source
## 1 {"POST": {}, "GET": {"page": ["1"], "page_size": ["10"]}} server
## context.user_id context.org_id context.course_id context.path
## 1 /notifier_api/v1/users/
## time name session event.user_id event.old
## 1 2016-01-05T17:50:03.793624+00:00 <NA> <NA> <NA> <NA>
## event.setting event.table event.new
## 1 <NA> <NA> <NA>
Below we extend the case above to read hundreds of compressed JSON files.
## creates an empty vector to store JSON objects from the files
vec <- c()
## list.files() returns a character vector of all compressed .gz files in the folder
logs <- list.files(pattern = "\\.gz$")
## the loop reads each file and concatenates its JSON objects onto vec
for (i in seq_along(logs)) {
  temp <- read_lines(logs[i])
  vec <- c(vec, temp)
}
## use the same method as in the section above
df <- ldply(lapply(vec, function(x) t(unlist(fromJSON(json_str = x)))))
RStudio has a website, Databases using R, that covers how to connect to databases, how to run queries, and which packages work with different databases.
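For a flavor of the workflow, here is a minimal sketch using DBI with an in-memory SQLite database (assuming the DBI and RSQLite packages are installed; the guide covers many other backends):
library(DBI)
## connect to a transient in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
## copy a sample data frame in, then run a SQL query over it
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
dbDisconnect(con)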