In this post, we introduce the apply family of functions. These functions minimize our need to write loops explicitly. They allow us to apply a function, built-in or created by us, to a vector, array, or list.
The apply family functions include:
- apply: applying a function to arrays and matrices
- tapply: applying a function over subsets of a vector
- lapply: list apply
- sapply: simplifying list apply
- vapply: list apply that returns a vector
- mapply: multiple argument list apply
- rapply: recursively apply a function to a list
- eapply: applying a function to each entry in an environment
As you may have seen, these apply functions operate on different data structures, and return an output object that may or may not be of the same data structure as the input object.
Which apply function to use, then? This depends on:
Considering all these factors, the apply functions are quite versatile, and they are used often in data manipulation tasks. Hopefully you will agree with me after reading this post.
Below, we discuss (1) apply, (2) tapply and its cousin by, and (3) lapply and its variants sapply and mapply. We also take the chance to review several core constructs in R: loops, data structures and functions.
We provide some use cases of these functions using several built-in datasets in R: USPersonalExpenditure, UCBAdmissions, ToothGrowth, and state.x77. These datasets ship with the base R package datasets and are directly available to us.
apply()
apply(X, MARGIN, FUN, ...) takes an array (including a matrix) as an input and applies a function to margins of the array. apply() returns a vector, array, or list of values.
apply() takes several arguments. We will start from arguments X and MARGIN.
X and MARGIN
X is the input object. It can be an array or a matrix. It can also be a data frame, which will be coerced to a matrix.
The argument MARGIN is a vector giving the subscripts which the function will be applied over. This is where we tell R which dimensions of X the specified function FUN will be applied over. For instance, for a matrix, 1 indicates rows, 2 indicates columns, and c(1, 2) indicates rows and columns, that is, every cell. When MARGIN = 1, apply() calls FUN once for each row. If the dimnames of X are themselves named (for example Admit, Gender and Dept in UCBAdmissions), MARGIN can also be a character vector that selects dimensions by name.
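A minimal sketch of how MARGIN decides what FUN receives. The matrix m and its dimension names row and col are made up for illustration and are not one of the datasets used in this post.

```r
# Hypothetical 2 x 3 matrix whose dimnames are named "row" and "col".
m <- matrix(1:6, nrow = 2,
            dimnames = list(row = c("a", "b"), col = c("x", "y", "z")))

by_row  <- apply(m, 1, sum)                       # FUN is called once per row
by_col  <- apply(m, 2, sum)                       # FUN is called once per column
by_cell <- apply(m, c(1, 2), function(v) v * 10)  # FUN is called once per cell
by_name <- apply(m, "row", sum)                   # same as MARGIN = 1, selected by name

by_row
by_col
by_cell
```

Selecting the margin by name tends to be easier to read than a bare number once an array has more than two dimensions.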
X is a matrix
In the simplest case, the input object X is a matrix. To show how apply() works, here we use a pre-installed dataset USPersonalExpenditure in the datasets package. USPersonalExpenditure is a matrix. We can use ?USPersonalExpenditure to find its description.
class(USPersonalExpenditure)
## [1] "matrix" "array"
USPersonalExpenditure consists of United States personal expenditures (in billions of dollars) for the categories food and tobacco, household operation, medical and health, personal care, and private education for the years 1940, 1945, 1950, 1955 and 1960.
USPersonalExpenditure
## 1940 1945 1950 1955 1960
## Food and Tobacco 22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health 3.530 5.760 9.71 14.0 21.10
## Personal Care 1.040 1.980 2.45 3.4 5.40
## Private Education 0.341 0.974 1.80 2.6 3.64
The line of code below calculates the sum of personal expenditure across the years for each category. It applies the sum() function to each row and returns a vector.
apply(USPersonalExpenditure, 1, sum)
## Food and Tobacco Household Operation Medical and Health Personal Care
## 286.300 137.700 54.100 14.270
## Private Education
## 9.355
The next line of code finds the maximum personal expenditure across the categories for each year. It applies the max() function to each column and returns a vector.
apply(USPersonalExpenditure, 2, max)
## 1940 1945 1950 1955 1960
## 22.2 44.5 59.6 73.2 86.8
range() returns a vector of two elements, the minimum and the maximum. If we apply range() to the columns of USPersonalExpenditure, we therefore get a matrix in return, with one column per year.
apply(USPersonalExpenditure, 2, range)
## 1940 1945 1950 1955 1960
## [1,] 0.341 0.974 1.8 2.6 3.64
## [2,] 22.200 44.500 59.6 73.2 86.80
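A short sketch of the shape rule at work here: when FUN returns a vector of length n, apply() places each result in a column, so applying range() to the rows also yields a matrix with 2 rows, not 2 columns. The names row_ranges and row_ranges_t are introduced only for this example.

```r
# Row-wise range: each call returns c(min, max), so the result has
# 2 rows and one column per expenditure category.
row_ranges <- apply(USPersonalExpenditure, 1, range)
dim(row_ranges)

# Transpose to get one row per category, then label the two columns.
row_ranges_t <- t(row_ranges)
colnames(row_ranges_t) <- c("min", "max")
row_ranges_t
```

Transposing with t() is a common follow-up when the row-wise results should line up with the rows of the input.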
Note that apply() uses the dimnames of our matrix (its row names or column names) to label the elements of the resulting vector or matrix. That’s why we are seeing the expenditure categories or the years in the outputs.
rownames(USPersonalExpenditure)
## [1] "Food and Tobacco" "Household Operation" "Medical and Health"
## [4] "Personal Care" "Private Education"
colnames(USPersonalExpenditure)
## [1] "1940" "1945" "1950" "1955" "1960"
X is an array
When the input object X is an array, the dataset can have more than two dimensions, and apply() gives us more ways to slice the data.
Let’s use the famous 1973 UC Berkeley admissions data, UCBAdmissions, another built-in dataset in R, to explore some interesting questions. This is a subset of the complete data examined in a study published in Science in 1975.
In the fall of 1973, the University of California, Berkeley’s graduate division admitted about 44% of male applicants and 35% of female applicants. The school officials were worried about the difference, or bias, in the admission rates between male and female applicants. They asked a statistician, who was one of the authors of the Science paper, to analyze the data.
This dataset provides aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and gender. Once again, the description of the dataset can be found in its help file.
UCBAdmissions
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
UCBAdmissions is a 3-dimensional array resulting from cross-tabulating 4526 observations on 3 variables.
Dimension | Name | Levels |
---|---|---|
1 | Admit | Admitted, Rejected |
2 | Gender | Male, Female |
3 | Dept | A, B, C, D, E, F |
Now let’s go back to what worried Berkeley’s officials: the overall acceptance rates for female and male applicants. The formula is simple: for each gender, the number of admitted applicants divided by the total number of applicants. But how do we get the numbers from the array?
First, we calculate the total number for both male and female applicants. We pass the sum() function to FUN and sum up the values in each level of dimension Gender.
applicants <- apply(UCBAdmissions, "Gender", sum)
applicants
## Male Female
## 2691 1835
Note: Remember that when the input X has named dimnames, MARGIN can be a character vector selecting dimension names.
The same result can be achieved by replacing the dimension name with its number.
apply(UCBAdmissions, 2, sum)
## Male Female
## 2691 1835
Next, we calculate the number of female and male applicants in the admitted group. Our approach here is to get the numbers for both admitted and rejected applicants and extract the admitted applicants from the result.
We need to apply the function sum() to each combination of Admit and Gender. Therefore, MARGIN is c("Admit","Gender"), and the values are summed across the departments.
apply(UCBAdmissions, c("Admit","Gender"), sum)
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
The output is a matrix.
class(apply(UCBAdmissions, c("Admit","Gender"), sum))
## [1] "matrix" "array"
Then we extract the “Admitted” applicants from the output matrix.
apply(UCBAdmissions, c("Admit","Gender"), sum)["Admitted",]
## Male Female
## 1198 557
Summarizing the two steps above, we have the number of admitted applicants in both genders.
admitted <- apply(UCBAdmissions, c("Admit","Gender"), sum)["Admitted",]
# apply(UCBAdmissions, c(1,2), sum)[1,]
admitted
## Male Female
## 1198 557
Now we are ready to calculate the acceptance rates.
round(admitted/applicants*100, 2)
## Male Female
## 44.52 30.35
It seems that the acceptance rate in the six departments for female applicants is much lower than the acceptance rate for the male applicants.
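As a cross-check, the same overall rates can be sketched with prop.table(), which divides each cell of a table by a margin total. This is an alternative route, not the approach taken in the walkthrough above; the names by_gender, shares and overall_rate are introduced here for illustration.

```r
# Cross-tabulate Admit x Gender, summing over the departments.
by_gender <- apply(UCBAdmissions, c("Admit", "Gender"), sum)

# margin = 2 divides every column (one per gender) by its column total.
shares <- prop.table(by_gender, margin = 2)

# Keep the "Admitted" row and express it as a percentage.
overall_rate <- round(shares["Admitted", ] * 100, 2)
overall_rate
```

Both routes divide 1198 by 2691 for male applicants and 557 by 1835 for female applicants, so the figures should agree.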
However, when the statisticians examined the data, they discovered that within specific departments, this bias against women went away. The acceptance rate for female applicants was higher than the acceptance rate for male applicants in several cases.
Let’s get the number of applicants for both genders in each department.
applicants2 <- apply(UCBAdmissions, c("Gender","Dept"), sum)
applicants2
## Dept
## Gender A B C D E F
## Male 825 560 325 417 191 373
## Female 108 25 593 375 393 341
And the number of admitted applicants.
admitted2 <- UCBAdmissions["Admitted",,]
admitted2
## Dept
## Gender A B C D E F
## Male 512 353 120 138 53 22
## Female 89 17 202 131 94 24
The acceptance rates are:
round(admitted2/applicants2*100, 2)
## Dept
## Gender A B C D E F
## Male 62.06 63.04 36.92 33.09 27.75 5.90
## Female 82.41 68.00 34.06 34.93 23.92 7.04
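The per-department rates can also be sketched as a single apply() call. For every Gender and Dept combination, FUN receives the two counts along the remaining Admit dimension, and those counts keep their names. The name dept_rate is introduced here for illustration.

```r
# For each Gender x Dept combination, `counts` is a named vector
# c(Admitted = ..., Rejected = ...) taken along the Admit dimension.
dept_rate <- apply(UCBAdmissions, c("Gender", "Dept"),
                   function(counts) counts[["Admitted"]] / sum(counts) * 100)

round(dept_rate, 2)
```

This avoids building the numerator and denominator as separate objects, at the cost of a slightly denser anonymous function.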
But why?
Because more women had applied to departments that admitted a small percentage of applicants, like English, than to departments that admitted a large percentage of applicants, like mechanical engineering. As the authors summarized in the paper:
“The graduate departments that are easier to enter tend to be those that require more mathematics in the undergraduate preparatory curriculum. The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system. Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.” (Bickel, Hammel, O’Connell, 1975, p. 402)
This phenomenon is called Simpson’s paradox: a trend that appears in several groups of data disappears or reverses when the groups are combined.
Another example of Simpson’s paradox is the survival rates of the third class passengers and crew members on the Titanic. The data is also available in R, called Titanic. It is a 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables.
Dimension | Name | Levels |
---|---|---|
1 | Class | 1st, 2nd, 3rd, Crew |
2 | Sex | Male, Female |
3 | Age | Child, Adult |
4 | Survived | No, Yes |
If we compare the survival rates for adults, we find that the rates for third class passengers and crew members are close.
round(apply(Titanic, c(1,3,4), sum)[,"Adult","Yes"] /
apply(Titanic, c(1,3), sum)[,"Adult"] * 100, 2)
## 1st 2nd 3rd Crew
## 61.76 36.02 24.08 23.95
The survival rate is 24.08% for the third class passengers and 23.95% for the crew members.
However, if we further break the data down by gender, the survival rates are higher for crew members compared to the third class passengers for both men and women.
round(apply(Titanic, c(1,2,3,4), sum)[,,"Adult","Yes"] /
apply(Titanic, c(1,2,3), sum)[,,"Adult"] * 100, 2)
## Sex
## Class Male Female
## 1st 32.57 97.22
## 2nd 8.33 86.02
## 3rd 16.23 46.06
## Crew 22.27 86.96
Why do you think that happened?
X is a subset of a matrix or an array
In the UCBAdmissions example, when we calculated the total number of admitted applicants in both genders, we extracted the Admitted subset after we got the total number of all applicants.
apply(UCBAdmissions, c("Admit","Gender"), sum)["Admitted",]
## Male Female
## 1198 557
The same can be achieved by subsetting the Admitted group first, before summing up their values.
apply(UCBAdmissions["Admitted",,], "Gender", sum)
## Male Female
## 1198 557
X is a data frame
If the input X is a data frame, R will coerce it into a matrix. apply() works as expected if all columns of the data frame are of the same type (e.g. all numeric). When the data frame has columns of different types, the coercion converts them all to one common type, often character. In these cases, we should use lapply(), which we discuss below.
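A small sketch of that coercion, using ToothGrowth, which mixes numeric columns with a factor. The names via_apply and via_lapply are introduced only for this example.

```r
# apply() first coerces the data frame with as.matrix(). Because supp is a
# factor, every column ends up as character.
via_apply <- apply(ToothGrowth, 2, class)
via_apply

# lapply() works column by column, so each column keeps its own type.
via_lapply <- lapply(ToothGrowth, class)
unlist(via_lapply)
```

The numeric columns are not lost, but any function that expects numbers would fail or misbehave on the character matrix that apply() sees.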
FUN
The function argument FUN can be a named function or an anonymous function. We get an anonymous function if we choose not to give the function a name. This is useful when it’s not worth the effort to figure out a name.
For example, the one-liner function below is an anonymous function. There is no need to name it if we are going to use it only once inside the apply() function.
apply(USPersonalExpenditure, 2, function(x) sum(x)*10)
## 1940 1945 1950 1955 1960
## 376.11 687.14 1025.60 1297.00 1631.40
...
... can be optional arguments to FUN.
As shown below, na.rm is an optional argument to mean(), although in this case we don’t have NAs to worry about.
apply(USPersonalExpenditure, 1, mean, na.rm = TRUE)
Every time that apply() calls mean(), the first argument will be a row of USPersonalExpenditure, and na.rm = TRUE is passed along as an additional named argument. With those arguments, the function call will be mean(row, na.rm = TRUE).
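To see the extra argument make a difference, here is a sketch on a copy of the matrix with one value deliberately set to NA. The missing value is an assumption made purely for illustration; the original data has none.

```r
# Copy the matrix and knock out one value (illustration only).
expend <- USPersonalExpenditure
expend["Personal Care", "1950"] <- NA

default_mean <- apply(expend, 1, mean)                # calls mean(row)
narm_mean    <- apply(expend, 1, mean, na.rm = TRUE)  # calls mean(row, na.rm = TRUE)

default_mean["Personal Care"]   # NA, because one value is missing
narm_mean["Personal Care"]      # mean of the four remaining values
```

Only the row containing the NA changes; the other four rows give the same means either way.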
tapply(), by()
The second member of the apply family that we are going to meet is tapply(), together with its cousin by().
tapply() and by() apply a function to groups of values.
tapply()
tapply(X, INDEX, FUN, ...) applies a function to each (non-empty) group of values given by a unique combination of the levels of certain factors.
The function has three main arguments: the vector of data X, the factor INDEX that defines the groups, and a function FUN. The factor level in INDEX identifies the group of each vector element in X. The vector and the factor are of the same length.
Vector | Factor |
---|---|
9 | A |
25 | B |
32 | C |
14 | B |
2 | C |
100 | A |
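The toy vector and factor in the table above can be fed to tapply() directly. A minimal sketch, with the variable names values and groups chosen here for illustration:

```r
# The vector and factor from the table, element by element.
values <- c(9, 25, 32, 14, 2, 100)
groups <- factor(c("A", "B", "C", "B", "C", "A"))

group_sums  <- tapply(values, groups, sum)   # A: 9 + 100, B: 25 + 14, C: 32 + 2
group_means <- tapply(values, groups, mean)

group_sums
group_means
```

Each element of values is routed to the group named by the factor level in the same position, and FUN is then called once per group.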
Now we use an R dataset ToothGrowth to show how tapply() works. ToothGrowth records the effect of vitamin C on tooth growth in guinea pigs.
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
It has three variables. dose is the dose in milligrams/day, stored as a numeric vector. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods.
ToothGrowth$dose
## [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [20] 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [39] 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
## [58] 2.0 2.0 2.0
len is the tooth length, stored as a numeric vector. It is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs.
ToothGrowth$len
## [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0 16.5 16.5 15.2 17.3 22.5
## [16] 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
## [31] 15.2 21.5 17.6 9.7 14.5 10.0 8.2 9.4 16.5 9.7 19.7 23.3 23.6 26.4 20.0
## [46] 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0
supp is the supplement type, stored as a factor. The two supplement types are orange juice (coded as OJ) and ascorbic acid (a form of vitamin C, coded as VC). The factor level identifies the group of each element in len and dose.
ToothGrowth$supp
## [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
## [26] VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## [51] OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## Levels: OJ VC
To get the mean length for each supplement type, we use tapply() to apply the mean() function to each group of supp. The result is a named one-dimensional array, which prints like a named vector.
tapply(ToothGrowth$len, ToothGrowth$supp, mean)
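One way to picture what tapply() does here is "split, then apply". The sketch below reproduces the group means by hand with split() and sapply(); it is meant as a mental model, not a description of how tapply() is implemented internally.

```r
by_supp <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)

# Step 1: split the lengths into one vector per supplement type.
pieces <- split(ToothGrowth$len, ToothGrowth$supp)
# Step 2: apply mean() to each piece.
by_hand <- sapply(pieces, mean)

round(by_supp, 2)
round(by_hand, 2)
```

The two results hold the same numbers; they differ only in that tapply() returns a one-dimensional array while sapply() returns a plain named vector.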
tapply() can also handle multiple grouping variables. This is done through the argument INDEX, which can be a list of one or more factors, each of the same length as X. The elements are coerced to factors by as.factor.
In our case, we can split the tooth lengths by both supplement type and dose. dose is not a factor, but is converted to a factor in this operation.
tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean)
## 0.5 1 2
## OJ 13.23 22.70 26.06
## VC 7.98 16.77 26.14
The result is a matrix.
by()
by(data, INDICES, FUN, ...) applies a function to a data frame split by factors. by() is a wrapper for tapply() applied to data frames. The function returns an object of class by, which is usually a list with one element per group.
The argument data is normally a data frame, but can also be a matrix. INDICES is a factor or a list of factors, each of length nrow(data).
by() calls the function FUN once for each group, passing it the rows of the data frame that belong to that group. It is useful if the function FUN handles data frames in a special way.
For instance, the example below summarizes the data by supp, and organizes the summary statistics in a reader-friendly manner.
by(ToothGrowth, ToothGrowth$supp, summary)
## ToothGrowth$supp: OJ
## len supp dose
## Min. : 8.20 OJ:30 Min. :0.500
## 1st Qu.:15.53 VC: 0 1st Qu.:0.500
## Median :22.70 Median :1.000
## Mean :20.66 Mean :1.167
## 3rd Qu.:25.73 3rd Qu.:2.000
## Max. :30.90 Max. :2.000
## ------------------------------------------------------------
## ToothGrowth$supp: VC
## len supp dose
## Min. : 4.20 OJ: 0 Min. :0.500
## 1st Qu.:11.20 VC:30 1st Qu.:0.500
## Median :16.50 Median :1.000
## Mean :16.96 Mean :1.167
## 3rd Qu.:23.10 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
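Because FUN receives a whole data frame for each group, it can use several columns at once. As a sketch, the call below fits a simple linear regression of len on dose within each supplement type; the choice of model is only an illustration and not part of the original walkthrough.

```r
# FUN gets the sub-data-frame `d` for one supplement type at a time.
fits <- by(ToothGrowth, ToothGrowth$supp,
           function(d) coef(lm(len ~ dose, data = d)))
fits

# Pull the dose slope out of each group's coefficient vector.
slopes <- sapply(fits, function(cf) cf[["dose"]])
slopes
```

This kind of per-group model fitting is hard to express with tapply(), which only passes a single vector to FUN.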
Like tapply(), by() can handle multiple grouping factors. The example below summarizes len by all combinations of supp and dose.
by(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), summary)
## : OJ
## : 0.5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.20 9.70 12.25 13.23 16.18 21.50
## ------------------------------------------------------------
## : VC
## : 0.5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.20 5.95 7.15 7.98 10.90 11.50
## ------------------------------------------------------------
## : OJ
## : 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.50 20.30 23.45 22.70 25.65 27.30
## ------------------------------------------------------------
## : VC
## : 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.60 15.28 16.50 16.77 17.30 22.50
## ------------------------------------------------------------
## : OJ
## : 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.40 24.57 25.95 26.06 27.07 30.90
## ------------------------------------------------------------
## : VC
## : 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.50 23.38 25.95 26.14 28.80 33.90
lapply(), sapply(), mapply()
lapply(), sapply() and mapply() apply a function over a list or vector.
lapply() returns a list. sapply() is a user-friendly wrapper of lapply() that simplifies the result to a vector or a matrix when it can. mapply() is a multivariate version of sapply().
lapply()
lapply(X, FUN, ...) applies a function over a list or vector. lapply() returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
X is a list
Most often, we see lapply() applied to each element of a list.
For instance, below lapply(mylist, mean) iterates over the list mylist to get the mean of each of its three vectors. The result is a list.
mylist <- list(1:10, 10:1, -5:5)
lapply(mylist, mean)
## [[1]]
## [1] 5.5
##
## [[2]]
## [1] 5.5
##
## [[3]]
## [1] 0
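For comparison, here is a sketch of the explicit loop that lapply(mylist, mean) replaces. The name out is introduced only for this example.

```r
mylist <- list(1:10, 10:1, -5:5)

# Explicit loop: pre-allocate the result list, then fill it element by element.
out <- vector("list", length(mylist))
for (i in seq_along(mylist)) {
  out[[i]] <- mean(mylist[[i]])
}

# The one-line lapply() call produces the same list.
identical(out, lapply(mylist, mean))
```

The loop needs three pieces of bookkeeping (allocation, index, assignment) that lapply() takes care of for us.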
Now let’s see a more meaningful example using the pre-installed dataset state.x77. state.x77 is a matrix with 50 rows and 8 columns giving statistics for the 50 states of the United States, including population, income, life expectancy, murder rate, percent of high-school graduates and a few others.
head(state.x77, 3)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
For the purpose of illustration, the dataset is transformed and reorganized into a list of data frames. Following this post, the rows in the data frame state.df become elements of a new list.
state.df <- data.frame(state.x77)
state.list <- setNames(split(state.df, seq(nrow(state.df))), rownames(state.df))
head(state.list, 3)
## $Alabama
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
##
## $Alaska
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
##
## $Arizona
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Let’s use the list to answer a few questions. First, what is the average population across the states?
pop <- lapply(state.list, "[[", "Population")
mean(unlist(pop))
## [1] 4246.42
Note that for functions like +, %*%, [[, etc., the function name must be backquoted or quoted.
pop <- lapply(state.list, "[[", "Population")
pop <- lapply(state.list, `[[`, "Population")
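A short sketch of the same idea with other operators. The list vecs is made up for illustration.

```r
# Hypothetical input: two numeric vectors in a list.
vecs <- list(c(10, 20, 30), c(40, 50, 60))

second_quoted <- lapply(vecs, "[", 2)   # calls `[`(x, 2) on each element
second_backq  <- lapply(vecs, `[`, 2)   # same call, backquoted form
plus_ten      <- lapply(1:3, "+", 10)   # calls `+`(x, 10) on each element

unlist(second_quoted)
unlist(plus_ten)
```

Extra arguments after the function name (here 2 and 10) are passed on to the operator, just like the "Population" argument in the example above.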
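Next, what is the population density of each state, that is, the number of people per square mile? Note that Population in state.x77 is recorded in thousands, so the values below are in thousands of people per square mile.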
pop_den <- function(x) x[["Population"]]/x[["Area"]]
lapply(state.list, pop_den)
## $Alabama
## [1] 0.07129053
##
## $Alaska
## [1] 0.0006443845
##
## $Arizona
## [1] 0.01950325
##
## $Arkansas
## [1] 0.04061989
##
## $California
## [1] 0.1355709
##
## $Colorado
## [1] 0.02448779
##
## $Connecticut
## [1] 0.6375977
##
## $Delaware
## [1] 0.2921292
##
## $Florida
## [1] 0.1530227
##
## $Georgia
## [1] 0.08491037
##
## $Hawaii
## [1] 0.1350973
##
## $Idaho
## [1] 0.009833448
##
## $Illinois
## [1] 0.2008503
##
## $Indiana
## [1] 0.1471867
##
## $Iowa
## [1] 0.05114317
##
## $Kansas
## [1] 0.02787729
##
## $Kentucky
## [1] 0.08542245
##
## $Louisiana
## [1] 0.08470955
##
## $Maine
## [1] 0.03421734
##
## $Maryland
## [1] 0.4167425
##
## $Massachusetts
## [1] 0.7429083
##
## $Michigan
## [1] 0.1603569
##
## $Minnesota
## [1] 0.049452
##
## $Mississippi
## [1] 0.04949679
##
## $Missouri
## [1] 0.06909196
##
## $Montana
## [1] 0.005124084
##
## $Nebraska
## [1] 0.02018749
##
## $Nevada
## [1] 0.005369054
##
## $`New Hampshire`
## [1] 0.08995237
##
## $`New Jersey`
## [1] 0.9750033
##
## $`New Mexico`
## [1] 0.009422462
##
## $`New York`
## [1] 0.3779139
##
## $`North Carolina`
## [1] 0.1115005
##
## $`North Dakota`
## [1] 0.009195502
##
## $Ohio
## [1] 0.261989
##
## $Oklahoma
## [1] 0.03947254
##
## $Oregon
## [1] 0.02374615
##
## $Pennsylvania
## [1] 0.2637548
##
## $`Rhode Island`
## [1] 0.8875119
##
## $`South Carolina`
## [1] 0.09316791
##
## $`South Dakota`
## [1] 0.008965835
##
## $Tennessee
## [1] 0.1009727
##
## $Texas
## [1] 0.04668223
##
## $Utah
## [1] 0.01465358
##
## $Vermont
## [1] 0.05093342
##
## $Virginia
## [1] 0.1252137
##
## $Washington
## [1] 0.05346252
##
## $`West Virginia`
## [1] 0.07474034
##
## $Wisconsin
## [1] 0.08425749
##
## $Wyoming
## [1] 0.003868193
X is a vector
The input object of lapply() can also be a vector.
In the example below, rep(5, 3) is the vector c(5, 5, 5), so the function fun is called 3 times, each time with n equal to 5.
fun <- function(n){
x <- rnorm(n)
y <- sign(mean(x))*rexp(n, rate = abs(1/mean(x)))
list(X = x, Y = y)
}
lapply(rep(5, 3), fun)
## [[1]]
## [[1]]$X
## [1] -1.4642910 0.8151961 -1.0451296 0.7288556 -1.1996225
##
## [[1]]$Y
## [1] -0.15149544 -0.07630003 -0.30677627 -0.21860199 -0.31394367
##
##
## [[2]]
## [[2]]$X
## [1] 1.2423225 0.4067033 -0.8115283 -0.3284437 0.4067043
##
## [[2]]$Y
## [1] 0.33400227 0.03399143 0.03873823 0.15599160 0.13703682
##
##
## [[3]]
## [[3]]$X
## [1] -0.7723077 -1.7605176 0.1962825 0.7955502 -0.1973143
##
## [[3]]$Y
## [1] -0.6729501 -0.2679057 -0.7545229 -0.1158864 -0.7901898
This usage of lapply() is common when we need to generate a large number of random sequences.
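Since functions like rnorm() draw new random numbers on every call, rerunning such code gives different output each time. A minimal sketch of making a run repeatable with set.seed(); the seed value 123 is arbitrary, and rnorm() is used directly here instead of the fun() defined above to keep the example short.

```r
set.seed(123)                           # arbitrary seed, chosen for illustration
first_run <- lapply(rep(5, 3), rnorm)   # three draws of 5 standard normal values

set.seed(123)                           # reset to the same seed
second_run <- lapply(rep(5, 3), rnorm)

identical(first_run, second_run)
```

Without the calls to set.seed(), the two runs would almost surely differ.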
X is a data frame
Recall that we mentioned earlier that apply() works on a data frame if its columns are of the same type. If they are not, this is when lapply() comes to the rescue.
lapply() can be applied to a data frame, since a data frame is a kind of list. The function passed to the argument FUN will be applied to each column of the data frame.
One application of this is to use lapply() to check the types of columns in a data frame. The output is a list.
lapply(ToothGrowth, class)
## $len
## [1] "numeric"
##
## $supp
## [1] "factor"
##
## $dose
## [1] "numeric"
If we want to turn the output into a vector, we can unlist() the output list.
unlist(lapply(ToothGrowth, class))
## len supp dose
## "numeric" "factor" "numeric"
sapply()
sapply(X, FUN, ...) will try to simplify the result of lapply() if possible. Its output can be a vector or a matrix; if no simplification is possible, it returns a list just like lapply().
If the result of lapply() is a list where every element has length 1, then using sapply() to run the same code will return a vector. For instance, when we use lapply() to evaluate if each column of ToothGrowth is numeric, the output is a list where each element is a vector of length 1.
lapply(ToothGrowth, is.numeric)
## $len
## [1] TRUE
##
## $supp
## [1] FALSE
##
## $dose
## [1] TRUE
If we use sapply() instead, we will get the same output, but in a vector rather than a list.
sapply(ToothGrowth, is.numeric)
## len supp dose
## TRUE FALSE TRUE
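The list at the top of this post also mentions vapply(), which is not covered in detail here. As a brief sketch, vapply() behaves like sapply() but asks us to declare the type and length of each result through FUN.VALUE, and stops with an error if a result does not match.

```r
is_num_s <- sapply(ToothGrowth, is.numeric)
is_num_v <- vapply(ToothGrowth, is.numeric, FUN.VALUE = logical(1))
identical(is_num_s, is_num_v)

# Declaring the wrong result type makes vapply() fail instead of guessing.
wrong_type <- tryCatch(
  vapply(ToothGrowth, is.numeric, FUN.VALUE = character(1)),
  error = function(e) "error"
)
wrong_type
```

That strictness is handy inside functions and scripts, where a silently changing output shape from sapply() can be hard to debug.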
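If the result is a list where every element has the same length larger than 1, then using sapply() to run the same code will return a matrix.
This is the example we used earlier to explain lapply(). Each call to fun returns a list of length 2 (X and Y), so the output of lapply() is a list whose three elements all have length 2. The numbers differ from the earlier run because rnorm() and rexp() draw new random values each time.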
fun <- function(n){
x <- rnorm(n)
y <- sign(mean(x))*rexp(n, rate = abs(1/mean(x)))
list(X = x, Y = y)
}
lapply(rep(5, 3), fun)
## [[1]]
## [[1]]$X
## [1] -0.8361558 1.6935620 1.8570289 0.4078630 -0.8715377
##
## [[1]]$Y
## [1] 0.09201121 0.55955178 0.45002184 0.04867432 0.12360807
##
##
## [[2]]
## [[2]]$X
## [1] -0.92694494 -0.16089881 1.32358758 -0.02410915 -0.77307756
##
## [[2]]$Y
## [1] -0.001387589 -0.093046486 -0.003721287 -0.014385157 -0.131666390
##
##
## [[3]]
## [[3]]$X
## [1] -1.0473123 -0.6158816 -0.5444173 1.3921447 2.8202498
##
## [[3]]$Y
## [1] 0.08119834 0.88877720 0.05362492 0.32822391 0.03706953
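In such cases, using sapply() returns a matrix. Here it is a 2 x 3 matrix of mode list: each cell holds a numeric vector of length 5, which R prints as Numeric,5.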
sapply(rep(5, 3), fun)
## [,1] [,2] [,3]
## X Numeric,5 Numeric,5 Numeric,5
## Y Numeric,5 Numeric,5 Numeric,5
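A sketch of how to work with such a list-matrix, next to the more common case where FUN returns a plain numeric vector. The function sim is a simplified stand-in for fun, introduced only for this example.

```r
# Simplified stand-in for fun(): returns a list with two numeric vectors.
sim <- function(n) list(X = rnorm(n), Y = runif(n))

res <- sapply(rep(5, 3), sim)
typeof(res)       # the matrix is stored as a list
dim(res)          # 2 rows (X, Y) by 3 columns (one per call)
res[["X", 1]]     # the 5 values of X from the first call

# When FUN returns a numeric vector, the result is an ordinary numeric matrix.
num <- sapply(1:3, function(i) c(min = i, max = i^2))
num
```

Double brackets with a row and a column index extract the vector stored in a single cell of the list-matrix.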
mapply()
mapply(FUN, ...) is a multivariate version of sapply(). mapply() applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on.
Note that the first argument of mapply() is a function, unlike lapply() and sapply().
mapply() applies the function element-wise to vectors or lists.
Let’s return to the state.list and revisit the question on population density. With mapply(), our solution can look like the one below. The output is a vector rather than a list.
pop <- lapply(state.list, "[[", "Population")
area <- lapply(state.list, "[[", "Area")
mapply("/", pop, area)
## Alabama Alaska Arizona Arkansas California
## 0.0712905261 0.0006443845 0.0195032491 0.0406198864 0.1355708904
## Colorado Connecticut Delaware Florida Georgia
## 0.0244877898 0.6375976964 0.2921291625 0.1530227399 0.0849103714
## Hawaii Idaho Illinois Indiana Iowa
## 0.1350972763 0.0098334482 0.2008502547 0.1471867468 0.0511431687
## Kansas Kentucky Louisiana Maine Maryland
## 0.0278772910 0.0854224464 0.0847095482 0.0342173351 0.4167424932
## Massachusetts Michigan Minnesota Mississippi Missouri
## 0.7429082545 0.1603569354 0.0494520047 0.0494967862 0.0690919632
## Montana Nebraska Nevada New Hampshire New Jersey
## 0.0051240839 0.0201874926 0.0053690542 0.0899523651 0.9750033240
## New Mexico New York North Carolina North Dakota Ohio
## 0.0094224624 0.3779139052 0.1115004713 0.0091955019 0.2619890177
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 0.0394725364 0.0237461532 0.2637548370 0.8875119161 0.0931679074
## South Dakota Tennessee Texas Utah Vermont
## 0.0089658350 0.1009727062 0.0466822312 0.0146535763 0.0509334197
## Virginia Washington West Virginia Wisconsin Wyoming
## 0.1252136752 0.0534625207 0.0747403407 0.0842574912 0.0038681934