Apply Family Functions

Introduction

In this post, we introduce the apply family of functions. These functions let us repeat an operation over the elements of a data structure without writing explicit loops. They allow us to apply a function, built-in or user-defined, to a vector, array, or list.

The apply family functions include

apply: applying a function to arrays and matrices
tapply: applying a function over subsets of a vector
lapply: list apply
sapply: simplifying list apply
vapply: list apply that returns a vector
mapply: multiple argument list apply
rapply: recursively apply a function to a list
eapply: applying a function to each entry in an environment

As you may have seen, these apply functions operate on different data structures, and return an output object that may or may not be of the same data structure as the input object.

Which apply function to use, then? This depends on:

the structure of the data to operate on (matrix, list, data frame etc.)
parts of the data to pass the function to (rows, columns, groups, rows and columns etc.)
the desired format of the output
the function to be specified (e.g. What kind of arguments does the function take? Scalars or vectors?)

Because they cover so many combinations of input structure, margin, and output format, the apply functions are versatile and show up often in data manipulation tasks. Hopefully you will agree with me after reading this post.

Below, we discuss (1) apply, (2) tapply and its cousin by, and (3) lapply and its variants sapply and mapply. We also take the chance to review several core constructs in R: loops, data structures and functions.

We provide some use cases of these functions using several built-in datasets in R: USPersonalExpenditure, UCBAdmissions, Titanic, ToothGrowth, and state.x77. These datasets ship with the base R package datasets, so they are directly available to us.

`apply()`

apply(X, MARGIN, FUN, ...) takes an array (including a matrix) as an input and applies a function to margins of the array. apply() returns a vector, array, or list of values.

apply() takes several arguments. We will start from arguments X and MARGIN.

`X` and `MARGIN`

X is the input object. It can be an array or a matrix. It can also be a data frame, which will be converted to a matrix.

The argument MARGIN is a vector giving the subscripts which the function will be applied over. This is where we tell R which parts of X the function FUN will be applied to. For instance, for a matrix, 1 indicates rows, 2 indicates columns, and c(1, 2) indicates rows and columns. When MARGIN = 1, apply() calls FUN once for each row. If X has named dimnames, that is, if the dimensions themselves carry names, MARGIN can also be a character vector selecting dimensions by name.
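As a minimal sketch of how MARGIN decides what FUN sees, consider a small made-up matrix. The object m and its row and column names are illustrative only and are not part of the datasets used in this post.

```r
# A 2 x 3 toy matrix; matrix() fills its values column by column
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

row_sums <- apply(m, 1, sum)                        # FUN is called once per row
col_sums <- apply(m, 2, sum)                        # FUN is called once per column
cells    <- apply(m, c(1, 2), function(v) v * 10)   # FUN is called once per cell

row_sums   # r1 = 9, r2 = 12
col_sums   # a = 3, b = 7, c = 11
cells      # same shape as m, every value multiplied by 10
```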

`X` is a matrix

In the simplest case, the input object X is a matrix. To show how apply() works, here we use a pre-installed dataset USPersonalExpenditure in the datasets package. USPersonalExpenditure is a matrix. We can use ?USPersonalExpenditure to find its description.

class(USPersonalExpenditure)

## [1] "matrix" "array"

USPersonalExpenditure consists of United States personal expenditures (in billions of dollars) for the categories food and tobacco, household operation, medical and health, personal care, and private education for the years 1940, 1945, 1950, 1955 and 1960.

USPersonalExpenditure

##                       1940   1945  1950 1955  1960
## Food and Tobacco    22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health   3.530  5.760  9.71 14.0 21.10
## Personal Care        1.040  1.980  2.45  3.4  5.40
## Private Education    0.341  0.974  1.80  2.6  3.64

The line of code below calculates the sum of personal expenditure across the years for each category. It applies the sum() function to each row and returns a vector.

apply(USPersonalExpenditure, 1, sum)

##    Food and Tobacco Household Operation  Medical and Health       Personal Care 
##             286.300             137.700              54.100              14.270 
##   Private Education 
##               9.355

Then this line of code below finds the maximum personal expenditure across the categories for each year. It applies the max() function to each column and returns a vector.

apply(USPersonalExpenditure, 2, max)

## 1940 1945 1950 1955 1960 
## 22.2 44.5 59.6 73.2 86.8

If we apply range() to the columns of USPersonalExpenditure, we get a matrix in return. range() returns a vector of two elements, the minimum and the maximum.

apply(USPersonalExpenditure, 2, range)

##        1940   1945 1950 1955  1960
## [1,]  0.341  0.974  1.8  2.6  3.64
## [2,] 22.200 44.500 59.6 73.2 86.80
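One detail may be worth a quick check: when FUN returns more than one value, apply() stores the result of each call in a column of the output, whichever margin was used. A small sketch, applying range() to the rows instead of the columns (the object name by_row is illustrative):

```r
# range() returns two values per row, so the result has 2 rows and 5 columns:
# one column per expenditure category, not one row per category
by_row <- apply(USPersonalExpenditure, 1, range)
dim(by_row)

# Transposing puts the categories back in the rows
t(by_row)
```

If the categories should stay in the rows, transposing the result with t() is a simple way to get there.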

Note that apply() uses the row names or column names of our matrix to label the elements of the resulting vector or matrix. That’s why we are seeing the expenditure categories or years in the outputs.

rownames(USPersonalExpenditure)

## [1] "Food and Tobacco"    "Household Operation" "Medical and Health" 
## [4] "Personal Care"       "Private Education"

colnames(USPersonalExpenditure)

## [1] "1940" "1945" "1950" "1955" "1960"

`X` is an array

When the input object X is an array, the dataset has more than two dimensions, and we have more to play with in our data with apply().

Let’s use the famous 1973 UC Berkeley admissions data, UCBAdmissions, another built-in dataset in R, to explore some interesting questions. This is a subset of the complete data examined in a study published in Science in 1975.

In the fall of 1973, the University of California, Berkeley’s graduate division admitted about 44% of male applicants and 35% of female applicants. The school officials were worried about the difference, or bias, in the admission rates between male and female applicants. They asked a statistician, who became one of the authors of the Science paper, to analyze the data.

This dataset provides aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and gender. Once again, the description of the dataset can be found in its help file.

UCBAdmissions

## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
## 
## , , Dept = C
## 
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
## 
## , , Dept = D
## 
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
## 
## , , Dept = E
## 
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
## 
## , , Dept = F
## 
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317

UCBAdmissions is a 3-dimensional array resulting from cross-tabulating 4526 observations on 3 variables.

Dimension	Name	Levels
1	Admit	Admitted, Rejected
2	Gender	Male, Female
3	Dept	A, B, C, D, E, F

Now let’s go back to what worried Berkeley’s officials: the overall acceptance rates for female and male applicants. For each gender, the rate is simply the number of admitted applicants divided by the total number of applicants of that gender. But how do we get the numbers from the array?

First, we calculate the total number for both male and female applicants. We pass the sum() function to FUN and sum up the values in each level of dimension Gender.

applicants <- apply(UCBAdmissions, "Gender", sum)
applicants

##   Male Female 
##   2691   1835

Note: Remember that when the input X has named dimnames, MARGIN can be a character vector selecting dimension names.

The same result can be achieved by replacing the dimension name with its number.

apply(UCBAdmissions, 2, sum)

##   Male Female 
##   2691   1835

Next, we calculate the number of female and male applicants in the admitted group. Our approach here is to get the numbers for both admitted and rejected applicants and extract the admitted applicants from the result.

We need to find each combination by Admit and Gender to apply the function sum(). Therefore, MARGIN is c("Admit","Gender"). Then we sum up their values across the departments.

apply(UCBAdmissions, c("Admit","Gender"), sum)

##           Gender
## Admit      Male Female
##   Admitted 1198    557
##   Rejected 1493   1278

The output is a matrix.

class(apply(UCBAdmissions, c("Admit","Gender"), sum))

## [1] "matrix" "array"

Then we extract the “Admitted” applicants from the output matrix.

apply(UCBAdmissions, c("Admit","Gender"), sum)["Admitted",]

##   Male Female 
##   1198    557

Summarizing the two steps above, we have the number of admitted applicants in both genders.

admitted <- apply(UCBAdmissions, c("Admit","Gender"), sum)["Admitted",]
# apply(UCBAdmissions, c(1,2), sum)[1,]
admitted

##   Male Female 
##   1198    557

Now we are ready to calculate the acceptance rates.

round(admitted/applicants*100, 2)

##   Male Female 
##  44.52  30.35
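As a cross-check, the same rates can be sketched with prop.table(), which divides each cell of a table by its margin total. This is an alternative route rather than the one used in this post, and the object names counts and shares are illustrative.

```r
# Admit x Gender counts, summed over departments
counts <- apply(UCBAdmissions, c("Admit", "Gender"), sum)

# margin = 2 divides every cell by its column (Gender) total
shares <- prop.table(counts, margin = 2)

# Share of admitted applicants within each gender, in percent
overall_rates <- round(shares["Admitted", ] * 100, 2)
overall_rates   # Male 44.52, Female 30.35
```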

It seems that the acceptance rate in the six departments for female applicants is much lower than the acceptance rate for the male applicants.

However, when the statisticians examined the data, they discovered that within specific departments, this bias against women went away. The acceptance rate for female applicants was higher than the acceptance rate for male applicants in several cases.

Let’s get the number of applicants for both genders in each department.

applicants2 <- apply(UCBAdmissions, c("Gender","Dept"), sum)
applicants2

##         Dept
## Gender     A   B   C   D   E   F
##   Male   825 560 325 417 191 373
##   Female 108  25 593 375 393 341

And the number of admitted applicants.

admitted2 <- UCBAdmissions["Admitted",,]
admitted2

##         Dept
## Gender     A   B   C   D   E   F
##   Male   512 353 120 138  53  22
##   Female  89  17 202 131  94  24

The acceptance rates are:

round(admitted2/applicants2*100, 2)

##         Dept
## Gender       A     B     C     D     E     F
##   Male   62.06 63.04 36.92 33.09 27.75  5.90
##   Female 82.41 68.00 34.06 34.93 23.92  7.04
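The two-step calculation above can also be sketched as a single apply() call. The helper name rate is made up for illustration; it relies on apply() handing each Gender-by-Dept slice to FUN as a vector named by the levels of Admit.

```r
# For one Gender x Dept combination, v holds the counts c(Admitted, Rejected)
rate <- function(v) v[["Admitted"]] / sum(v) * 100

# One call per combination of Gender and Dept
rates <- apply(UCBAdmissions, c("Gender", "Dept"), rate)
round(rates, 2)
```

The result should match the table of acceptance rates shown above, with Gender in the rows and Dept in the columns.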

But why?

Because more women had applied to departments that admitted a small percentage of applicants, like English, than to departments that admitted a large percentage of applicants, like mechanical engineering. As the authors summarized in the paper:

“The graduate departments that are easier to enter tend to be those that require more mathematics in the undergraduate preparatory curriculum. The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system. Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.” (Bickel, Hammel, O’Connell, 1975, p. 402)

This phenomenon is called Simpson’s paradox. It arises in probability and statistics when a trend appears in several groups of data but disappears or reverses when the groups are combined.

Titanic

Another example of Simpson’s paradox is the survival rates of the third class passengers and crew members on the Titanic. The data is also available in R as Titanic. It is a 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables.

Dimension	Name	Levels
1	Class	1st, 2nd, 3rd, Crew
2	Sex	Male, Female
3	Age	Child, Adult
4	Survived	No, Yes

If we compare the survival rates for adults, we find that the rates for third class passengers and crew members are close.

round(apply(Titanic, c(1,3,4), sum)[,"Adult","Yes"] /
        apply(Titanic, c(1,3), sum)[,"Adult"] * 100, 2)

##   1st   2nd   3rd  Crew 
## 61.76 36.02 24.08 23.95

The survival rate is 24.08% for the third class passengers and 23.95% for the crew members.

However, if we further break the data down by gender, the survival rates are higher for crew members compared to the third class passengers for both men and women.

round(apply(Titanic, c(1,2,3,4), sum)[,,"Adult","Yes"] / 
        apply(Titanic, c(1,2,3), sum)[,,"Adult"] * 100, 2)

##       Sex
## Class   Male Female
##   1st  32.57  97.22
##   2nd   8.33  86.02
##   3rd  16.23  46.06
##   Crew 22.27  86.96

Why do you think that happened?

`X` is a subset of a matrix or an array

In the UCBAdmissions example, when we calculated the total number of admitted applicants in both genders, we extracted the Admitted subset after we got the total number of all applicants.

apply(UCBAdmissions, c("Admit","Gender"), sum)["Admitted",]

##   Male Female 
##   1198    557

The same can be achieved by subsetting the Admitted group first, before summing up their values.

apply(UCBAdmissions["Admitted",,], "Gender", sum)

##   Male Female 
##   1198    557

`X` is a data frame

If the input X is a data frame, R will convert it into a matrix with as.matrix(). apply() works well if all columns of the data frame are of the same type (e.g. all numeric). When the data frame has columns of different types, the conversion coerces them to one common type, often character, which is rarely what we want. In these cases, we should use lapply(), which we discuss below.
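A short sketch of this coercion, using the first rows of the built-in ToothGrowth data frame, which mixes numeric and factor columns (the object names are illustrative):

```r
small <- head(ToothGrowth, 3)

# apply() first converts the data frame to a matrix; because supp is a factor,
# every column ends up as character
via_apply <- apply(small, 2, class)
via_apply

# Working column by column keeps the original classes
via_sapply <- sapply(small, class)
via_sapply

# With only numeric columns, apply() over rows behaves as expected
apply(small[c("len", "dose")], 1, sum)
```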

`FUN`

The function argument FUN can be a named function, or an anonymous function. We get an anonymous function if we choose not to give the function a name. This is useful when it’s not worth the effort to figure out a name.

For example, the one-liner function below is an anonymous function. There is no need to name it if we are going to use it only once inside the apply() function.

apply(USPersonalExpenditure, 2, function(x) sum(x)*10)

##    1940    1945    1950    1955    1960 
##  376.11  687.14 1025.60 1297.00 1631.40

`...`

... can be optional arguments to FUN.

As shown below, na.rm is an optional argument of mean(), although in this case we don’t have NAs to worry about.

apply(USPersonalExpenditure, 1, mean, na.rm = TRUE)

Every time that apply() calls mean(), the first argument will be a row of USPersonalExpenditure, and na.rm = TRUE will be passed along as a named argument. With those arguments, the function call will be mean(row, na.rm = TRUE).
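The same mechanism works for functions we write ourselves. In the sketch below, scaled_sum and its argument k are hypothetical names introduced only for illustration; the value k = 10 is forwarded through ... on every call.

```r
# A named function with an extra argument k (hypothetical helper)
scaled_sum <- function(x, k = 1) sum(x) * k

# k = 10 is not an argument of apply(); it is passed on to scaled_sum()
with_dots <- apply(USPersonalExpenditure, 2, scaled_sum, k = 10)

# Equivalent anonymous-function version
with_anon <- apply(USPersonalExpenditure, 2, function(x) sum(x) * 10)

all.equal(with_dots, with_anon)
```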

`tapply()`, `by()`

The second member of the apply family functions that we are going to meet is tapply(), together with its cousin by.

tapply() and by() apply a function to groups of values.

`tapply()`

tapply(X, INDEX, FUN, ...) applies a function to each (non-empty) group of values given by a unique combination of the levels of certain factors.

The function has three main arguments: the vector of data X, the factor INDEX that defines the groups, and a function FUN. The factor level in INDEX identifies the group of each vector element in X. The vector and the factor are of the same length.

Vector	Factor
9	A
25	B
32	C
14	B
2	C
100	A
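The toy table above can be turned into a small runnable sketch. The object names x and f are illustrative; sum() is used as FUN so the result is easy to verify by hand.

```r
x <- c(9, 25, 32, 14, 2, 100)                    # the data vector
f <- factor(c("A", "B", "C", "B", "C", "A"))     # group of each element

# Sum the values within each level of f
group_sums <- tapply(x, f, sum)
group_sums   # A = 109, B = 39, C = 34
```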

Now we use an R dataset ToothGrowth to show how tapply() works. ToothGrowth records the effect of vitamin C on tooth growth in guinea pigs.

head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

It has three variables. dose is a numeric vector giving the dose in milligrams/day. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods.

ToothGrowth$dose

##  [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [20] 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [39] 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
## [58] 2.0 2.0 2.0

len is a numeric vector giving tooth length. It is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs.

ToothGrowth$len

##  [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2 11.2  5.2  7.0 16.5 16.5 15.2 17.3 22.5
## [16] 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
## [31] 15.2 21.5 17.6  9.7 14.5 10.0  8.2  9.4 16.5  9.7 19.7 23.3 23.6 26.4 20.0
## [46] 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0

supp is supplement type and a factor. The two supplement types are orange juice (coded as OJ) or ascorbic acid (a form of vitamin C and coded as VC). The factor level identifies the group of each vector element in len and dose.

ToothGrowth$supp

##  [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
## [26] VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## [51] OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## Levels: OJ VC

To get the mean length for each supplement type, we use tapply() to apply the mean() function to each group of supp. The result is a one-dimensional array that prints like a named vector.

tapply(ToothGrowth$len, ToothGrowth$supp, mean)
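One way to picture what tapply() does here is as a split step followed by an apply step. The sketch below spells this out with split() and sapply(); it is an illustration of the idea under that reading, not a description of how tapply() is implemented.

```r
# Split len into one vector per supplement type, then average each piece
len_by_supp <- split(ToothGrowth$len, ToothGrowth$supp)
group_means <- sapply(len_by_supp, mean)

round(group_means, 2)   # OJ 20.66, VC 16.96
```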

tapply() can also manage multiple categories. This is handled by the argument INDEX, a list of one or more factors, each of same length as X. The elements are coerced to factors by as.factor.

In our case, we can split the length of tooth by both supplement type and dose. dose is not a factor, but is converted to a factor in this operation.

tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean)

##      0.5     1     2
## OJ 13.23 22.70 26.06
## VC  7.98 16.77 26.14

The result is a matrix.

`by()`

by(data, INDICES, FUN, ...) applies a function to a data frame split by factors. by() is a wrapper for tapply() applied to data frames. The function returns an object of class "by", which behaves like a list or array with one entry per group.

The argument data normally is a data frame, but can possibly be a matrix. INDICES is a factor or a list of factors, each of length nrow(data).

by() calls the function FUN for each group with a data frame. It is useful if the function FUN handles data frames in a special way.

For instance, the example below summarizes the data by supp, and organizes the summary statistics in a reader friendly manner.

by(ToothGrowth, ToothGrowth$supp, summary)

## ToothGrowth$supp: OJ
##       len        supp         dose      
##  Min.   : 8.20   OJ:30   Min.   :0.500  
##  1st Qu.:15.53   VC: 0   1st Qu.:0.500  
##  Median :22.70           Median :1.000  
##  Mean   :20.66           Mean   :1.167  
##  3rd Qu.:25.73           3rd Qu.:2.000  
##  Max.   :30.90           Max.   :2.000  
## ------------------------------------------------------------ 
## ToothGrowth$supp: VC
##       len        supp         dose      
##  Min.   : 4.20   OJ: 0   Min.   :0.500  
##  1st Qu.:11.20   VC:30   1st Qu.:0.500  
##  Median :16.50           Median :1.000  
##  Mean   :16.96           Mean   :1.167  
##  3rd Qu.:23.10           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
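To illustrate FUN receiving a whole data frame for each group, here is a small sketch that computes the correlation between len and dose within each supplement type. The anonymous function and the object name len_dose_cor are illustrative choices.

```r
# d is the sub data frame for one supplement type, so FUN can use
# several of its columns at once
len_dose_cor <- by(ToothGrowth, ToothGrowth$supp,
                   function(d) cor(d$len, d$dose))
len_dose_cor

class(len_dose_cor)   # "by"
```

This kind of calculation is awkward with tapply(), which only hands a single vector to FUN.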

Like tapply(), by() can handle multiple grouping factors. The example below summarizes len by all combinations of supp and dose.

by(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), summary)

## : OJ
## : 0.5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.20    9.70   12.25   13.23   16.18   21.50 
## ------------------------------------------------------------ 
## : VC
## : 0.5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.20    5.95    7.15    7.98   10.90   11.50 
## ------------------------------------------------------------ 
## : OJ
## : 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.50   20.30   23.45   22.70   25.65   27.30 
## ------------------------------------------------------------ 
## : VC
## : 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.60   15.28   16.50   16.77   17.30   22.50 
## ------------------------------------------------------------ 
## : OJ
## : 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.40   24.57   25.95   26.06   27.07   30.90 
## ------------------------------------------------------------ 
## : VC
## : 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.50   23.38   25.95   26.14   28.80   33.90

`lapply()`, `sapply()`, `mapply()`

lapply(), sapply() and mapply() apply a function over a list or vector.

lapply() returns a list. sapply() is a user-friendly wrapper of lapply() that returns a vector or a matrix when the results can be simplified. mapply() is a multivariate version of sapply().

`lapply()`

lapply(X, FUN, ...) applies a function over a list or vector. lapply() returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

`X` is a list

Most often, we see lapply() applied to each element of a list.

For instance, below lapply(mylist, mean) iterates over the list mylist to get the mean of three vectors. The result is a list.

mylist <- list(1:10, 10:1, -5:5)
lapply(mylist, mean)

## [[1]]
## [1] 5.5
## 
## [[2]]
## [1] 5.5
## 
## [[3]]
## [1] 0
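For comparison, here is a sketch of the explicit loop that lapply(mylist, mean) saves us from writing. The object name out is illustrative.

```r
mylist <- list(1:10, 10:1, -5:5)

# Pre-allocate an empty list, then fill it one element at a time
out <- vector("list", length(mylist))
for (i in seq_along(mylist)) {
  out[[i]] <- mean(mylist[[i]])
}

# The loop and lapply() produce the same list
identical(out, lapply(mylist, mean))   # TRUE
```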

Now let’s see a more meaningful example using the pre-installed dataset state.x77. state.x77 is a matrix with 50 rows and 8 columns giving the statistics of the 50 states of the United States, including population, income, life expectancy, murder rate, percent of high-school graduates and a few others.

head(state.x77, 3)

##         Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama       3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska         365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona       2212   4530        1.8    70.55    7.8    58.1    15 113417

For the purpose of illustration, we convert the matrix into a data frame, state.df, and split it by row (an approach borrowed from another post), so that each row becomes an element of a new list of one-row data frames.

state.df <- data.frame(state.x77)
state.list <- setNames(split(state.df, seq(nrow(state.df))), rownames(state.df))
head(state.list, 3)

## $Alabama
##         Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Alabama       3615   3624        2.1    69.05   15.1    41.3    20 50708
## 
## $Alaska
##        Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432
## 
## $Arizona
##         Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Arizona       2212   4530        1.8    70.55    7.8    58.1    15 113417

Let’s use the list to answer a few questions. First, what is the average population across the states?

pop <- lapply(state.list, "[[", "Population") 
mean(unlist(pop))

## [1] 4246.42

Note that in the case of functions like +, %*%, [[, etc., the function name must be backquoted or quoted.

pop <- lapply(state.list, "[[", "Population") 
pop <- lapply(state.list, `[[`, "Population")

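Next, what is the population density of each state? Population in state.x77 is recorded in thousands and Area in square miles, so the result is in thousands of people per square mile.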

pop_den <- function(x) x[["Population"]]/x[["Area"]]
lapply(state.list, pop_den)

## $Alabama
## [1] 0.07129053
## 
## $Alaska
## [1] 0.0006443845
## 
## $Arizona
## [1] 0.01950325
## 
## $Arkansas
## [1] 0.04061989
## 
## $California
## [1] 0.1355709
## 
## $Colorado
## [1] 0.02448779
## 
## $Connecticut
## [1] 0.6375977
## 
## $Delaware
## [1] 0.2921292
## 
## $Florida
## [1] 0.1530227
## 
## $Georgia
## [1] 0.08491037
## 
## $Hawaii
## [1] 0.1350973
## 
## $Idaho
## [1] 0.009833448
## 
## $Illinois
## [1] 0.2008503
## 
## $Indiana
## [1] 0.1471867
## 
## $Iowa
## [1] 0.05114317
## 
## $Kansas
## [1] 0.02787729
## 
## $Kentucky
## [1] 0.08542245
## 
## $Louisiana
## [1] 0.08470955
## 
## $Maine
## [1] 0.03421734
## 
## $Maryland
## [1] 0.4167425
## 
## $Massachusetts
## [1] 0.7429083
## 
## $Michigan
## [1] 0.1603569
## 
## $Minnesota
## [1] 0.049452
## 
## $Mississippi
## [1] 0.04949679
## 
## $Missouri
## [1] 0.06909196
## 
## $Montana
## [1] 0.005124084
## 
## $Nebraska
## [1] 0.02018749
## 
## $Nevada
## [1] 0.005369054
## 
## $`New Hampshire`
## [1] 0.08995237
## 
## $`New Jersey`
## [1] 0.9750033
## 
## $`New Mexico`
## [1] 0.009422462
## 
## $`New York`
## [1] 0.3779139
## 
## $`North Carolina`
## [1] 0.1115005
## 
## $`North Dakota`
## [1] 0.009195502
## 
## $Ohio
## [1] 0.261989
## 
## $Oklahoma
## [1] 0.03947254
## 
## $Oregon
## [1] 0.02374615
## 
## $Pennsylvania
## [1] 0.2637548
## 
## $`Rhode Island`
## [1] 0.8875119
## 
## $`South Carolina`
## [1] 0.09316791
## 
## $`South Dakota`
## [1] 0.008965835
## 
## $Tennessee
## [1] 0.1009727
## 
## $Texas
## [1] 0.04668223
## 
## $Utah
## [1] 0.01465358
## 
## $Vermont
## [1] 0.05093342
## 
## $Virginia
## [1] 0.1252137
## 
## $Washington
## [1] 0.05346252
## 
## $`West Virginia`
## [1] 0.07474034
## 
## $Wisconsin
## [1] 0.08425749
## 
## $Wyoming
## [1] 0.003868193

`X` is a vector

The input object of lapply() can also be a vector.

In the example below, rep(5, 3) creates the vector c(5, 5, 5). lapply() calls fun once for each element, so fun runs 3 times, each time with n = 5.

fun <- function(n){
  x <- rnorm(n) 
  y <- sign(mean(x))*rexp(n, rate = abs(1/mean(x)))
  list(X = x, Y = y)
}

lapply(rep(5, 3), fun)

## [[1]]
## [[1]]$X
## [1] -1.4642910  0.8151961 -1.0451296  0.7288556 -1.1996225
## 
## [[1]]$Y
## [1] -0.15149544 -0.07630003 -0.30677627 -0.21860199 -0.31394367
## 
## 
## [[2]]
## [[2]]$X
## [1]  1.2423225  0.4067033 -0.8115283 -0.3284437  0.4067043
## 
## [[2]]$Y
## [1] 0.33400227 0.03399143 0.03873823 0.15599160 0.13703682
## 
## 
## [[3]]
## [[3]]$X
## [1] -0.7723077 -1.7605176  0.1962825  0.7955502 -0.1973143
## 
## [[3]]$Y
## [1] -0.6729501 -0.2679057 -0.7545229 -0.1158864 -0.7901898

This usage of lapply() is common when the same random generation has to be repeated many times, as in simulations.
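Because fun draws random numbers, each run produces different values, so the outputs printed in this post cannot be reproduced exactly. A minimal sketch of making such a run repeatable with set.seed() follows; the seed value 42 is arbitrary.

```r
fun <- function(n) {
  x <- rnorm(n)
  y <- sign(mean(x)) * rexp(n, rate = abs(1 / mean(x)))
  list(X = x, Y = y)
}

# Resetting the seed before each run gives identical draws
set.seed(42)
first <- lapply(rep(5, 3), fun)

set.seed(42)
second <- lapply(rep(5, 3), fun)

identical(first, second)   # TRUE
```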

`X` is a data frame

Recall that apply() works on a data frame only if its columns are of the same type. If they are not, lapply() comes to the rescue.

lapply() can be applied to a data frame, since a data frame is a kind of list. The function passed to the argument FUN will be applied to each column of the data frame.

One application of this is to use lapply() to check the types of columns in a data frame. The output is a list.

lapply(ToothGrowth, class)

## $len
## [1] "numeric"
## 
## $supp
## [1] "factor"
## 
## $dose
## [1] "numeric"

If we want to turn the output into a vector, we can unlist() the output list.

unlist(lapply(ToothGrowth, class))

##       len      supp      dose 
## "numeric"  "factor" "numeric"

`sapply()`

sapply(X, FUN, ...) will try to simplify the result of lapply() if possible. The output can be a vector or a matrix; if the result cannot be simplified, sapply() returns a list, just like lapply().

If the result of lapply() is a list where every element is length 1, then using sapply() to run the same code will return a vector. For instance, when we use lapply() to evaluate if each column of ToothGrowth is numeric, the output is a list where each element is a vector of length 1.

lapply(ToothGrowth, is.numeric)

## $len
## [1] TRUE
## 
## $supp
## [1] FALSE
## 
## $dose
## [1] TRUE

If we use sapply() instead, we will get the same output, but in a vector rather than a list.

sapply(ToothGrowth, is.numeric)

##   len  supp  dose 
##  TRUE FALSE  TRUE
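vapply(), listed in the introduction, is a stricter relative of sapply(): we state the expected type and length of each result up front, and R stops with an error if FUN returns anything else. A brief sketch (the object names are illustrative):

```r
# Each call to is.numeric() must return exactly one logical value
checked <- vapply(ToothGrowth, is.numeric, logical(1))
checked

# Same answer as sapply() when the results match the template
identical(checked, sapply(ToothGrowth, is.numeric))

# A mismatching template is an error: class() returns character, not numeric
bad <- try(vapply(ToothGrowth, class, numeric(1)), silent = TRUE)
inherits(bad, "try-error")
```

This makes vapply() a reasonable choice inside functions, where a silent change in output shape would be harder to notice.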

If the result is a list where every element is a vector of the same length larger than 1, then using sapply() to run the same code will return a matrix.

This is an example we used earlier to explain lapply(). Each element of the output is itself a list of two numeric vectors, X and Y, each of length 5.

fun <- function(n){
  x <- rnorm(n) 
  y <- sign(mean(x))*rexp(n, rate = abs(1/mean(x)))
  list(X = x, Y = y)
}
lapply(rep(5, 3), fun)

## [[1]]
## [[1]]$X
## [1] -0.8361558  1.6935620  1.8570289  0.4078630 -0.8715377
## 
## [[1]]$Y
## [1] 0.09201121 0.55955178 0.45002184 0.04867432 0.12360807
## 
## 
## [[2]]
## [[2]]$X
## [1] -0.92694494 -0.16089881  1.32358758 -0.02410915 -0.77307756
## 
## [[2]]$Y
## [1] -0.001387589 -0.093046486 -0.003721287 -0.014385157 -0.131666390
## 
## 
## [[3]]
## [[3]]$X
## [1] -1.0473123 -0.6158816 -0.5444173  1.3921447  2.8202498
## 
## [[3]]$Y
## [1] 0.08119834 0.88877720 0.05362492 0.32822391 0.03706953

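In such cases, using sapply() returns a matrix. Because each call to fun returns a list with two components, the result is a 2-by-3 matrix of mode list, in which every cell holds a numeric vector of length 5.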

sapply(rep(5, 3), fun)

##   [,1]      [,2]      [,3]     
## X Numeric,5 Numeric,5 Numeric,5
## Y Numeric,5 Numeric,5 Numeric,5
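The cells above print as Numeric,5 because fun returns a list. When FUN instead returns a plain atomic vector of fixed length, sapply() produces an ordinary numeric matrix. A small sketch using range() on tooth length by supplement type (the object name ranges is illustrative):

```r
# range() returns c(min, max), so each group contributes one column of length 2
ranges <- sapply(split(ToothGrowth$len, ToothGrowth$supp), range)
ranges

is.numeric(ranges)   # TRUE
dim(ranges)          # 2 2
```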

`mapply()`

mapply(FUN, ...) is a multivariate version of sapply(). mapply() applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on.

Note that the first argument of mapply() is a function, unlike lapply() and sapply().

mapply() applies the function element-wise to vectors or lists.

Let’s return to the state.list and revisit the question on population density. With mapply(), our solution can be like below. The output is a vector rather than a list.

pop <- lapply(state.list, "[[", "Population")
area <- lapply(state.list, "[[", "Area")
mapply("/", pop, area)

##        Alabama         Alaska        Arizona       Arkansas     California 
##   0.0712905261   0.0006443845   0.0195032491   0.0406198864   0.1355708904 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##   0.0244877898   0.6375976964   0.2921291625   0.1530227399   0.0849103714 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##   0.1350972763   0.0098334482   0.2008502547   0.1471867468   0.0511431687 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##   0.0278772910   0.0854224464   0.0847095482   0.0342173351   0.4167424932 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##   0.7429082545   0.1603569354   0.0494520047   0.0494967862   0.0690919632 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##   0.0051240839   0.0201874926   0.0053690542   0.0899523651   0.9750033240 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##   0.0094224624   0.3779139052   0.1115004713   0.0091955019   0.2619890177 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##   0.0394725364   0.0237461532   0.2637548370   0.8875119161   0.0931679074 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##   0.0089658350   0.1009727062   0.0466822312   0.0146535763   0.0509334197 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##   0.1252136752   0.0534625207   0.0747403407   0.0842574912   0.0038681934
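To make the element-wise pairing easier to see, here is a sketch with small made-up inputs, followed by a vectorized alternative for the density calculation. The object names sums, reps, and density are illustrative.

```r
# Pairs up 1 with 4, 2 with 5, 3 with 6
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums   # 5 7 9

# rep(1, 3), rep(2, 2), rep(3, 1): results of different lengths stay a list
reps <- mapply(rep, 1:3, 3:1)
reps

# Division is already vectorized in R, so for this particular question
# the matrix columns can be divided directly
density <- state.x77[, "Population"] / state.x77[, "Area"]
head(density, 3)
```

When the operation is already vectorized, as division is, the direct form is usually simpler; mapply() is more useful when FUN only works on one set of arguments at a time.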

Yun Dai

05/2022
