Combining data frames, either by rows or columns, is a common need in data frame manipulation. The datasets can certainly be interesting on their own, but may generate more insights when combined together.
We show how to handle those tasks with base R functions and the dplyr
methods.
When combining data frames, first, we should consider how these data of interest to us should be combined.
rbind()
dplyr::bind_rows()
cbind()
merge()
dplyr::inner_join()
dplyr::left_join()
dplyr::right_join()
dplyr::full_join()
Next, we find the common column(s) in the data frames. It can be just one column that serves as the single identifier; or we may need a combination of multiple columns to form a group of identifiers to link the data frames.
Moreover, quite importantly, we should consider the relationship among the data frames, or how we would like to match the cases from different tables (our data frames).
Finally, we should decide how we would like to keep or drop the cases when linking the data frames.
This often happens when we work with longitudinal survey data, or when we work with big data that logs user behavior in a particular time span. Most columns/variables in these datasets should be the same.
Let’s use two sample datasets to show how to append one data frames to another by rows. The task is to append the file2
to the file1
by rows.
file1
## id year family role rating
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
file2
## id year family role rating
## 1 1 2005 1 child A
## 2 2 2005 1 child A
## 3 3 2005 1 mother B
## 4 1 2005 2 father B
## 5 2 2005 2 mother C
## 6 3 2005 2 child B
The same can be achieved with dplyr::bind_rows()
.
library(dplyr)
bind_rows(file1, file2)
## id year family role rating
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
## 7 1 2005 1 child A
## 8 2 2005 1 child A
## 9 3 2005 1 mother B
## 10 1 2005 2 father B
## 11 2 2005 2 mother C
## 12 3 2005 2 child B
bind_rows()
also allows us to mark the source file by specifying the .id
argument.
bind_rows(file1, file2, .id = "source")
## source id year family role rating
## 1 1 1 1995 1 child A
## 2 1 2 1995 1 child B
## 3 1 3 1995 1 mother B
## 4 1 1 1995 2 father C
## 5 1 2 1995 2 mother B
## 6 1 3 1995 2 child A
## 7 2 1 2005 1 child A
## 8 2 2 2005 1 child A
## 9 2 3 2005 1 mother B
## 10 2 1 2005 2 father B
## 11 2 2 2005 2 mother C
## 12 2 3 2005 2 child B
merge()
merges two data frames by common columns or row names.
Here we have file1
and file5
to merge, which share the common column family
.
file1
## id year family role rating
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
file5
## family area SES
## 1 1 N 1
## 2 2 S 2
## 3 3 N 3
by
argument sets the identifier column.
merge(file1, file5, by = "family")
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
all
argument evaluates if unmatched cases from either data frame should be kept or dropped. With all = TRUE
we keep all observations from both files.
Compare the case below with the case above. Without all = TRUE
in the above case, unmatched cases are dropped.
merge(file1, file5, by = "family", all = TRUE)
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
## 7 3 NA NA <NA> <NA> N 3
Rows in file5
with no match in file1
have NA
values in the columns from file1
.
If we set all.y
to be TRUE
, all non-missing information from file5
(y
) is kept, while information from file1
(x
) is missing.
merge(file1, file5, by = "family", all.y = TRUE)
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
## 7 3 NA NA <NA> <NA> N 3
Compare all.y = TRUE
with all.x = TRUE
, where only complete cases from file1
will be kept.
merge(file1, file5, by = "family", all.x = TRUE)
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
Sometimes we need to merge data frames by multiple common columns, when a single column cannot uniquely identify cases.
file3
## id family role
## 1 1 1 child
## 2 2 1 child
## 3 3 1 mother
## 4 1 2 father
## 5 2 2 mother
## 6 3 2 child
file4
## id year family rating
## 1 1 1995 1 A
## 2 2 1995 1 B
## 3 3 1995 1 B
## 4 1 1995 2 C
## 5 2 1995 2 B
## 6 3 1995 2 A
To merge these two files, we use the combination of id
and family
as a group identifier.
merge(file3, file4, by = c("id", "family"))
## id family role year rating
## 1 1 1 child 1995 A
## 2 1 2 father 1995 C
## 3 2 1 child 1995 B
## 4 2 2 mother 1995 B
## 5 3 1 mother 1995 B
## 6 3 2 child 1995 A
inner_join()
, left_join()
, right_join()
, and full_join()
from dplyr
join data frames together in different ways. These functions return all columns from the data frames to be joined.
The sample data frames we have are file6
and file7
. Note that these two data frames cannot be merged one-to-one or one-to-many, but need to be joined many-to-many.
file6
## industry company rank
## 1 A a 1
## 2 A b 2
## 3 A c 3
## 4 B l 4
## 5 B m 5
## 6 B n 6
## 7 C o 7
file7
## industry year
## 1 A 1995
## 2 A 2005
## 3 B 1995
## 4 B 2005
## 5 D 1995
## 6 D 2005
In a many-to-many relationship, when there are multiple matches between x
and y
, all combinations of the matches are returned.
inner_join()
returns all rows from x
where there are matching values in y
.
file6 %>% inner_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
left_join()
returns all rows from x
. Rows in x
with no match in y
will have NA
values in the new columns.
file6 %>% left_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
## 13 C o 7 NA
right_join()
returns all rows from y
. Rows in y
with no match in x
will have NA
values in the new columns.
file6 %>% right_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
## 13 D <NA> NA 1995
## 14 D <NA> NA 2005
full_join()
returns all rows from both x
and y
. Where there are unmatched values, it returns NA
for the missing ones.
file6 %>% full_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
## 13 C o 7 NA
## 14 D <NA> NA 1995
## 15 D <NA> NA 2005