When we work on a project, we usually want to keep track of what we do. Below we discuss best practices for writing reproducible R code, together with some general tips and tricks.
The first thing we want to do is set up a working directory, which is the default directory for reading and writing files.
getwd() returns the location of the current working directory.
setwd() sets the current working directory. Once we set the working directory, we can access files via relative paths, which are much easier to specify than absolute paths.
setwd("~/Desktop")
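As a small illustration of relative paths, the sketch below switches to a temporary directory, writes a file there, and reads it back without spelling out an absolute path. The temporary directory is only a stand-in for a real project folder, and the file name example.csv is made up for this example.

```r
# Remember the current working directory so we can restore it afterwards.
old_wd <- getwd()

# tempdir() stands in for a real project folder such as "~/Desktop".
setwd(tempdir())

# "example.csv" is a hypothetical file name; the path is relative to the
# working directory we have just set.
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
dat <- read.csv("example.csv")

# Clean up and return to where we started.
file.remove("example.csv")
setwd(old_wd)
```

Restoring the previous working directory at the end keeps the example from affecting the rest of the session.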
For each project we will want to set up a separate working directory, and keep the data, code, graphics, tables, documentation, reports and other files of that project in one folder.
The recommended way to organize the code is shown below:
######################################
# project description
######################################
# Author:
# Date:
# ...
######################################
# load libraries
######################################
library(readr)
library(dplyr)
library(stringr)
library(depmixS4)
library(ggplot2)
######################################
# run scripts
######################################
source("cleaning.R")
source("models.R")
source("viz.R")
######################################
# indicates a comment. Everything following # on the same line will be ignored.
We use comments not only to “comment”, but also to divide a script into sections, as in the template above.
If we need to comment out a block of code, we need to put # at the start of each line.
# this is
# a very long
# comment
Defining intermediate objects as we move along gives readers (and yourself, when you later review your code) a clear picture of what you have done. It also makes debugging and checking intermediate outputs easier.
As an oversimplified example, below we first define all the rules (logical expressions indicating which elements or rows to keep) before passing them to subset().
a <- rule1
b <- rule2
c <- rule3
subset(data, a & b & c)
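A runnable version of the same idea might look like the sketch below. The data frame, its columns, and the three rules are all invented for illustration; the point is only that each rule is stored under a descriptive name and can be inspected before the rules are combined.

```r
# A small made-up data frame for illustration.
data <- data.frame(
  id    = 1:6,
  age   = c(25, 41, 33, 58, 19, 47),
  score = c(80, 55, 91, 67, 73, 88)
)

# Each rule is a logical vector with one element per row.
age_ok   <- data$age >= 30
score_ok <- data$score >= 60
id_even  <- data$id %% 2 == 0

# Intermediate objects can be checked before they are combined,
# for example by counting how many rows pass a single rule.
sum(age_ok)

# Keep only the rows that satisfy all three rules.
kept <- subset(data, age_ok & score_ok & id_even)
kept
```

If the result looks wrong, each rule can be examined on its own, which is harder when all conditions are written inline in a single call.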
If you find yourself repeating code, automate it with a loop or a function. Doing so also reduces the chance of making mistakes.
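As a minimal sketch of this advice, the example below replaces a copy-and-pasted rescaling step with a small helper function and a loop. The helper name rescale01 and the data frame are hypothetical and serve only as an illustration.

```r
# Repetitive version: the same expression is typed out for every column,
# and a single typo (e.g. df$a where df$b was meant) is easy to miss.
# df$a <- (df$a - min(df$a)) / (max(df$a) - min(df$a))
# df$b <- (df$b - min(df$b)) / (max(df$b) - min(df$b))

# rescale01() is a hypothetical helper that maps a numeric vector to [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up data for illustration.
df <- data.frame(a = c(1, 5, 9), b = c(10, 20, 40))

# Apply the helper to every column with a loop.
for (col in names(df)) {
  df[[col]] <- rescale01(df[[col]])
}
df
```

With the logic in one place, a later change to the rescaling only has to be made once.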
Put spaces around operators to make the code easier to read. The exception is around : and ::.
x <- (-b + sqrt(b ^ 2 - 4 * a * c)) / (2 * a)
Use <- for assignment instead of =.
Whichever naming convention (underscore_separated, period.separated, CamelCase …) we choose, it is best to apply it consistently.
Sometimes there are conflicts in function names. For instance, the packages dplyr and plyr have several functions that share the same names: mutate(), rename(), summarise(), etc. When calling those functions, we need to specify which package's version we mean (e.g. dplyr::summarise() or plyr::summarise()).
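The same mechanism can be tried out without installing either package. In the sketch below, which uses only base R, we deliberately define our own sd() so that it masks the one in the stats package, and then use :: to say which one we mean. Our sd() is a hypothetical stand-in for a conflicting function from another package.

```r
# Defining our own sd() masks stats::sd() for unqualified calls.
# This hypothetical version divides by n instead of n - 1.
sd <- function(x) sqrt(mean((x - mean(x)) ^ 2))

x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# An unqualified call now finds our definition first.
pop_sd <- sd(x)

# The :: operator names the package explicitly, so there is no ambiguity.
sample_sd <- stats::sd(x)

# Remove our definition so that sd() refers to stats::sd() again.
rm(sd)
```

The two calls return different values because they run different functions, which is the kind of silent discrepancy that an explicit package prefix helps to avoid.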