Combining Data Frames

Combining data frames, either by rows or columns, is a common need in data frame manipulation. The datasets can certainly be interesting on their own, but may generate more insights when combined together.

We show how to handle those tasks with base R functions and the dplyr methods.

Introduction

When combining data frames, first, we should consider how these data of interest to us should be combined.

vertically by rows
- rbind()
- dplyr::bind_rows()
horizontally by columns
- cbind()
- merge()
- dplyr::inner_join()
- dplyr::left_join()
- dplyr::right_join()
- dplyr::full_join()

Next, we find the common column(s) in the data frames. It can be just one column that serves as the single identifier; or we may need a combination of multiple columns to form a group of identifiers to link the data frames.

Moreover, quite importantly, we should consider the relationship among the data frames, or how we would like to match the cases from different tables (our data frames).

one-to-one: a row (observation/case/record) in one table is related to one row in another table
one-to-many: a row in one table is related to many rows in another table
many-to-many: multiple rows in one table are related to multiple rows in another table

Finally, we should decide how we would like to keep or drop the cases when linking the data frames.

keep all cases from both tables, matched or unmatched
keep only the matched cases in both tables, dropping all unmatched cases
keep all cases in one table, and drop unmatched cases from the other table

Binding by rows

This often happens when we work with longitudinal survey data, or when we work with big data that logs user behavior in a particular time span. Most columns/variables in these datasets should be the same.

Let’s use two sample datasets to show how to append one data frames to another by rows. The task is to append the file2 to the file1 by rows.

file1

##   id year family   role rating
## 1  1 1995      1  child      A
## 2  2 1995      1  child      B
## 3  3 1995      1 mother      B
## 4  1 1995      2 father      C
## 5  2 1995      2 mother      B
## 6  3 1995      2  child      A

file2

##   id year family   role rating
## 1  1 2005      1  child      A
## 2  2 2005      1  child      A
## 3  3 2005      1 mother      B
## 4  1 2005      2 father      B
## 5  2 2005      2 mother      C
## 6  3 2005      2  child      B

The same can be achieved with dplyr::bind_rows().

library(dplyr)
bind_rows(file1, file2)

##    id year family   role rating
## 1   1 1995      1  child      A
## 2   2 1995      1  child      B
## 3   3 1995      1 mother      B
## 4   1 1995      2 father      C
## 5   2 1995      2 mother      B
## 6   3 1995      2  child      A
## 7   1 2005      1  child      A
## 8   2 2005      1  child      A
## 9   3 2005      1 mother      B
## 10  1 2005      2 father      B
## 11  2 2005      2 mother      C
## 12  3 2005      2  child      B

bind_rows() also allows us to mark the source file by specifying the .id argument.

bind_rows(file1, file2, .id = "source")

##    source id year family   role rating
## 1       1  1 1995      1  child      A
## 2       1  2 1995      1  child      B
## 3       1  3 1995      1 mother      B
## 4       1  1 1995      2 father      C
## 5       1  2 1995      2 mother      B
## 6       1  3 1995      2  child      A
## 7       2  1 2005      1  child      A
## 8       2  2 2005      1  child      A
## 9       2  3 2005      1 mother      B
## 10      2  1 2005      2 father      B
## 11      2  2 2005      2 mother      C
## 12      2  3 2005      2  child      B

Merging data frames

merge() merges two data frames by common columns or row names.

merging by single identifier

Here we have file1 and file5 to merge, which share the common column family.

file1

##   id year family   role rating
## 1  1 1995      1  child      A
## 2  2 1995      1  child      B
## 3  3 1995      1 mother      B
## 4  1 1995      2 father      C
## 5  2 1995      2 mother      B
## 6  3 1995      2  child      A

file5

##   family area SES
## 1      1    N   1
## 2      2    S   2
## 3      3    N   3

by argument sets the identifier column.

merge(file1, file5, by = "family")

##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2

all argument evaluates if unmatched cases from either data frame should be kept or dropped. With all = TRUE we keep all observations from both files.

Compare the case below with the case above. Without all = TRUE in the above case, unmatched cases are dropped.

merge(file1, file5, by = "family", all = TRUE)

##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2
## 7      3 NA   NA   <NA>   <NA>    N   3

Rows in file5 with no match in file1 have NA values in the columns from file1.

If we set all.y to be TRUE, all non-missing information from file5 (y) is kept, while information from file1 (x) is missing.

merge(file1, file5, by = "family", all.y = TRUE)

##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2
## 7      3 NA   NA   <NA>   <NA>    N   3

Compare all.y = TRUE with all.x = TRUE, where only complete cases from file1 will be kept.

merge(file1, file5, by = "family", all.x = TRUE)

##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2

merging by multiple identifiers

Sometimes we need to merge data frames by multiple common columns, when a single column cannot uniquely identify cases.

file3

##   id family   role
## 1  1      1  child
## 2  2      1  child
## 3  3      1 mother
## 4  1      2 father
## 5  2      2 mother
## 6  3      2  child

file4

##   id year family rating
## 1  1 1995      1      A
## 2  2 1995      1      B
## 3  3 1995      1      B
## 4  1 1995      2      C
## 5  2 1995      2      B
## 6  3 1995      2      A

To merge these two files, we use the combination of id and family as a group identifier.

merge(file3, file4, by = c("id", "family"))

##   id family   role year rating
## 1  1      1  child 1995      A
## 2  1      2 father 1995      C
## 3  2      1  child 1995      B
## 4  2      2 mother 1995      B
## 5  3      1 mother 1995      B
## 6  3      2  child 1995      A

Join operations

inner_join(), left_join(), right_join(), and full_join() from dplyr join data frames together in different ways. These functions return all columns from the data frames to be joined.

The sample data frames we have are file6 and file7. Note that these two data frames cannot be merged one-to-one or one-to-many, but need to be joined many-to-many.

file6

##   industry company rank
## 1        A       a    1
## 2        A       b    2
## 3        A       c    3
## 4        B       l    4
## 5        B       m    5
## 6        B       n    6
## 7        C       o    7

file7

##   industry year
## 1        A 1995
## 2        A 2005
## 3        B 1995
## 4        B 2005
## 5        D 1995
## 6        D 2005

In a many-to-many relationship, when there are multiple matches between x and y, all combinations of the matches are returned.

inner_join() returns all rows from x where there are matching values in y.

file6 %>% inner_join(file7, by = "industry")

##    industry company rank year
## 1         A       a    1 1995
## 2         A       a    1 2005
## 3         A       b    2 1995
## 4         A       b    2 2005
## 5         A       c    3 1995
## 6         A       c    3 2005
## 7         B       l    4 1995
## 8         B       l    4 2005
## 9         B       m    5 1995
## 10        B       m    5 2005
## 11        B       n    6 1995
## 12        B       n    6 2005

left_join() returns all rows from x. Rows in x with no match in y will have NA values in the new columns.

file6 %>% left_join(file7, by = "industry")

##    industry company rank year
## 1         A       a    1 1995
## 2         A       a    1 2005
## 3         A       b    2 1995
## 4         A       b    2 2005
## 5         A       c    3 1995
## 6         A       c    3 2005
## 7         B       l    4 1995
## 8         B       l    4 2005
## 9         B       m    5 1995
## 10        B       m    5 2005
## 11        B       n    6 1995
## 12        B       n    6 2005
## 13        C       o    7   NA

right_join() returns all rows from y. Rows in y with no match in x will have NA values in the new columns.

file6 %>% right_join(file7, by = "industry")

##    industry company rank year
## 1         A       a    1 1995
## 2         A       a    1 2005
## 3         A       b    2 1995
## 4         A       b    2 2005
## 5         A       c    3 1995
## 6         A       c    3 2005
## 7         B       l    4 1995
## 8         B       l    4 2005
## 9         B       m    5 1995
## 10        B       m    5 2005
## 11        B       n    6 1995
## 12        B       n    6 2005
## 13        D    <NA>   NA 1995
## 14        D    <NA>   NA 2005

full_join() returns all rows from both x and y. Where there are unmatched values, it returns NA for the missing ones.

file6 %>% full_join(file7, by = "industry")

##    industry company rank year
## 1         A       a    1 1995
## 2         A       a    1 2005
## 3         A       b    2 1995
## 4         A       b    2 2005
## 5         A       c    3 1995
## 6         A       c    3 2005
## 7         B       l    4 1995
## 8         B       l    4 2005
## 9         B       m    5 1995
## 10        B       m    5 2005
## 11        B       n    6 1995
## 12        B       n    6 2005
## 13        C       o    7   NA
## 14        D    <NA>   NA 1995
## 15        D    <NA>   NA 2005