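The apply family: born to replace loops
The apply() function is very much in the spirit of R and is a good replacement for verbose for loops. As one blog post explains, R's for and while loops run in R itself, whereas vectorized operations are implemented in underlying C functions, so vectorized computation with the apply() family is cost-effective. The apply() family works on data frames, lists, vectors, and more, and any function can be passed to it.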
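Author: 面面的徐爷
Link: https://www.jianshu.com/p/8e04245bfe6d
I. The apply() family tree
The apply family exists to replace loops, and it branches into eight members according to the types of input and output. The first three are the best known: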
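apply: operates on the rows or columns of a matrix
lapply: takes a list, operates on each element, and returns a list
sapply: takes a list, operates on each element, and returns a vector or matrix when the results can be simplified
vapply
mapply
tapply
rapply
eapply
1. The apply() function
apply(X, MARGIN, FUN, ...)
X :an array, including a matrix.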
MARGIN :a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.
FUN :the function to be applied: see ‘Details’. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted.
... :optional arguments to FUN.
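To make the MARGIN argument concrete, here is a minimal sketch on a small made-up matrix `m` (not from the original post). It shows what FUN receives for each MARGIN value, next to the explicit for loop that apply() stands in for.

```r
# Hypothetical 2 x 3 matrix, for illustration only
m <- matrix(1:6, nrow = 2)

# MARGIN = 1: FUN receives each row
row_sums <- apply(m, 1, sum)

# MARGIN = 2: FUN receives each column
col_sums <- apply(m, 2, sum)

# MARGIN = c(1, 2): FUN receives each individual cell
squared <- apply(m, c(1, 2), function(v) v^2)

# The explicit loop that apply(m, 1, sum) replaces
loop_sums <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  loop_sums[i] <- sum(m[i, ])
}

# Operators must be quoted or backquoted when passed as FUN
negated <- apply(m, 2, `-`)
```

The loop and the apply() call compute the same row sums; the apply() form simply drops the bookkeeping for the index and the result vector.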
Worked examples:
1. Math operations: sum, mean, quantile, etc.
2. User-defined functions
1. Math operations: sum, mean, quantile, etc.
#### Sum each row and each column of a matrix
> x <- cbind(x1 = 3, x2 = c(4:1, 2:5))
> dimnames(x)[[1]] <- letters[1:8]
# Take a quick look at the data
> x
  x1 x2
a  3  4
b  3  3
c  3  2
d  3  1
e  3  2
f  3  3
g  3  4
h  3  5
# Trimmed column means, then the row and column sums of x
> apply(x, 2, mean, trim = .2)
> col.sums <- apply(x, 2, sum)
> row.sums <- apply(x, 1, sum)
> rbind(cbind(x, Rtot = row.sums), Ctot = c(col.sums, sum(col.sums)))
     x1 x2 Rtot
a     3  4    7
b     3  3    6
c     3  2    5
d     3  1    4
e     3  2    5
f     3  3    6
g     3  4    7
h     3  5    8
Ctot 24 24   48
## Sort the columns of a matrix
> apply(x, 2, sort)
     x1 x2
[1,]  3  1
[2,]  3  2
[3,]  3  2
[4,]  3  3
[5,]  3  3
[6,]  3  4
[7,]  3  4
[8,]  3  5
2. User-defined functions
> ##- function with extra args:
> cave <- function(x, c1, c2) c(mean(x[c1]), mean(x[c2]))
> apply(x, 1, cave, c1 = "x1", c2 = c("x1","x2"))
       a b   c d   e f   g h
[1,] 3.0 3 3.0 3 3.0 3 3.0 3
[2,] 3.5 3 2.5 2 2.5 3 3.5 4
> ma <- matrix(c(1:4, 1, 6:8), nrow = 2)
> ma
     [,1] [,2] [,3] [,4]
[1,]    1    3    1    7
[2,]    2    4    6    8
> apply(ma, 1, table)  #--> a list of length 2
[[1]]

1 3 7
2 1 1

[[2]]

2 4 6 8
1 1 1 1

> apply(ma, 1, stats::quantile) # 5 x n matrix with rownames
     [,1] [,2]
0%      1  2.0
25%     1  3.5
50%     2  5.0
75%     4  6.5
100%    7  8.0
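2. The lapply() function
lapply is one of the most basic looping functions. It loops over a list or data.frame and returns a list of the same length as X; the leading 'l' in lapply signals that the result is a list. lapply itself has no simplify argument. To get the result simplified to a vector or matrix, use sapply, where simplify = TRUE is the default.
lapply(X, FUN, ...)
X :a vector (atomic or list) or an expression object. Other objects (including classed objects) will be coerced by base::as.list.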
FUN :the function to be applied to each element of X: see ‘Details’. In the case of functions like +, %*%, the function name must be backquoted or quoted.
... :optional arguments to FUN.
simplify :logical or character string; should the result be simplified to a vector, matrix or higher dimensional array if possible? For sapply it must be named and not abbreviated. The default value, TRUE, returns a vector or matrix if appropriate, whereas if simplify = "array" the result may be an array of “rank” (=length(dim(.))) one higher than the result of FUN(X[[i]]).
> require(stats); require(graphics)
> x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
> # compute the list mean for each list element
> lapply(x, mean)
$a
[1] 5.5

$beta
[1] 4.535125

$logic
[1] 0.5

> # median and quartiles for each list element
> lapply(x, quantile, probs = 1:3/4)
$a
 25%  50%  75%
3.25 5.50 7.75

$beta
      25%       50%       75%
0.2516074 1.0000000 5.0536690

$logic
25% 50% 75%
0.0 0.5 1.0
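When lapply is applied to a matrix or a data frame, the result may not be what we want. If the input is a matrix, lapply operates on each individual element of the matrix and puts each return value into its own slot of the list; if the input is a data frame, lapply operates on each column.
# Create a matrix
> x <- cbind(x1=3, x2=c(2:1,4:5))
> x; class(x)
     x1 x2
[1,]  3  2
[2,]  3  1
[3,]  3  4
[4,]  3  5
[1] "matrix"
# (R 4.0.0 and later print "matrix" "array" here)
> lapply(x, sum)
[[1]]
[1] 3

[[2]]
[1] 3

[[3]]
[1] 3

[[4]]
[1] 3

[[5]]
[1] 2

[[6]]
[1] 1

[[7]]
[1] 4

[[8]]
[1] 5

lapply loops over every single value in the matrix instead of grouping the computation by row or by column.
To sum the columns, convert the matrix to a data frame first:
> lapply(data.frame(x), sum)
$x1
[1] 12

$x2
[1] 12

lapply automatically splits the data frame by column and then runs the computation on each column.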
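The contrast can be checked directly. The sketch below reuses the matrix from the transcript and counts how many results each call produces; apply(x, 2, sum) is shown as an alternative that gives column sums without converting to a data frame. The variable names are illustrative.

```r
x <- cbind(x1 = 3, x2 = c(2:1, 4:5))

# On a matrix, lapply visits every cell: 8 results
per_cell <- lapply(x, sum)
length(per_cell)

# On a data frame, lapply visits every column: 2 results
per_col <- lapply(data.frame(x), sum)
names(per_col)

# apply() with MARGIN = 2 gives column sums straight from the matrix
col_sums <- apply(x, 2, sum)
```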
3. The sapply() function
sapply() does the same job as lapply() and can be thought of as a simplified lapply: where possible it returns a vector or matrix, which is easier to read. It is used the same way as lapply but has two extra arguments, simplify and USE.NAMES. simplify = TRUE turns the output into a vector, matrix, or array when possible; with simplify = FALSE, sapply() returns a list just as lapply() does. USE.NAMES = TRUE uses the elements of a character vector X as the names of the result when X has no names of its own.
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
> x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
> sapply(x, quantile)
         a        beta logic
0%    1.00  0.04978707   0.0
25%   3.25  0.25160736   0.0
50%   5.50  1.00000000   0.5
75%   7.75  5.05366896   1.0
100% 10.00 20.08553692   1.0
> i39 <- sapply(3:9, seq) # list of vectors
> sapply(i39, fivenum)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]  1.0  1.0    1  1.0  1.0  1.0    1
[2,]  1.5  1.5    2  2.0  2.5  2.5    3
[3,]  2.0  2.5    3  3.5  4.0  4.5    5
[4,]  2.5  3.5    4  5.0  5.5  6.5    7
[5,]  3.0  4.0    5  6.0  7.0  8.0    9
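What sapply() returns depends on the lengths of the individual results. The following sketch, with illustrative variable names, runs through the common cases: a vector, a matrix, a list kept on request, a list forced by unequal lengths, and names taken from a character input.

```r
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE, FALSE, FALSE, TRUE))

# Every result has length 1, so sapply returns a named vector
means <- sapply(x, mean)

# Every result has length 2, so sapply returns a 2 x 3 matrix
ranges <- sapply(x, range)

# simplify = FALSE keeps the list, as lapply would
as_list <- sapply(x, mean, simplify = FALSE)

# Results of unequal length cannot be simplified; a list comes back
uneven <- sapply(1:3, seq_len)

# USE.NAMES: a character X supplies the names of the result
lens <- sapply(c("apple", "fig"), nchar)
```

Because the return type can switch between vector, matrix, and list depending on the data, sapply() is convenient interactively but calls for care inside functions.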
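4. The vapply() function
vapply is similar to sapply but adds the FUN.VALUE argument, a template that fixes the type and length of each return value and, if it is named, supplies the row names of the result. This makes programs more robust.
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
X :a vector (atomic or list) or an expression object. Other objects (including classed objects) will be coerced by base::as.list.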
FUN :the function to be applied to each element of X: see ‘Details’. In the case of functions like +, %*%, the function name must be backquoted or quoted.
... :optional arguments to FUN.
simplify :logical or character string; should the result be simplified to a vector, matrix or higher dimensional array if possible? For sapply it must be named and not abbreviated. The default value, TRUE, returns a vector or matrix if appropriate, whereas if simplify = "array" the result may be an array of “rank” (=length(dim(.))) one higher than the result of FUN(X[[i]]).
USE.NAMES :logical; if TRUE and if X is character, use X as names for the result unless it had names already. Since this argument follows ... its name cannot be abbreviated.
The parameters above are shared with sapply (simplify applies to sapply only; vapply has no simplify argument).
FUN.VALUE :a (generalized) vector; a template for the return value from FUN.
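The practical effect of the FUN.VALUE template shows up when results are irregular. In this minimal sketch the list `vals` and the helper `first_elem` are hypothetical, chosen so that one result has length 0: sapply quietly returns a list, while vapply refuses.

```r
# Hypothetical input: one element is empty
vals <- list(a = 1:3, b = integer(0), c = 4:6)

# Hypothetical helper whose result length depends on the input
first_elem <- function(v) v[seq_len(min(1, length(v)))]

# sapply falls back to a list because the result lengths differ
loose <- sapply(vals, first_elem)

# vapply stops with an error because one result is not length 1
strict <- tryCatch(
  vapply(vals, first_elem, FUN.VALUE = integer(1)),
  error = function(e) "vapply error"
)

# With consistent results, vapply returns a typed, named vector
lens <- vapply(vals, length, FUN.VALUE = integer(1))
```

Failing early is usually preferable in scripts and packages, since a silently returned list tends to break code further downstream.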
> i39 <- sapply(3:9, seq) # list of vectors: 1:3, 1:4, ..., 1:9
> sapply(i39, fivenum)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]  1.0  1.0    1  1.0  1.0  1.0    1
[2,]  1.5  1.5    2  2.0  2.5  2.5    3
[3,]  2.0  2.5    3  3.5  4.0  4.5    5
[4,]  2.5  3.5    4  5.0  5.5  6.5    7
[5,]  3.0  4.0    5  6.0  7.0  8.0    9
> vapply(i39, fivenum, c(Min. = 0, "1st Qu." = 0, Median = 0, "3rd Qu." = 0, Max. = 0)) # adds row names
        [,1] [,2] [,3] [,4] [,5] [,6] [,7]
Min.     1.0  1.0    1  1.0  1.0  1.0    1
1st Qu.  1.5  1.5    2  2.0  2.5  2.5    3
Median   2.0  2.5    3  3.5  4.0  4.5    5
3rd Qu.  2.5  3.5    4  5.0  5.5  6.5    7
Max.     3.0  4.0    5  6.0  7.0  8.0    9
Here the vapply-specific argument FUN.VALUE supplies the row names.
5. The tapply() function
tapply performs grouped loop computations. The INDEX argument splits the data X into groups, which is comparable to a group by operation.
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
> # Group by species, iris$Species
> tapply(iris$Petal.Length,iris$Species,mean)
    setosa versicolor  virginica
     1.462      4.260      5.552
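As a further sketch of the group by analogy, the made-up vectors below (`score`, `group`, `term` are hypothetical) show tapply with one and with two grouping factors, plus the same grouped mean written with split() and sapply().

```r
# Hypothetical scores with two grouping variables
score <- c(10, 20, 30, 40, 50, 60)
group <- c("a", "b", "a", "b", "a", "b")
term  <- c("t1", "t1", "t1", "t2", "t2", "t2")

# One grouping factor: group means, named by group
by_group <- tapply(score, group, mean)

# Two grouping factors: a group x term matrix of sums
by_both <- tapply(score, list(group, term), sum)

# The same grouped mean written with split() and sapply()
by_split <- sapply(split(score, group), mean)
```

Passing a list of factors as INDEX corresponds to grouping by several columns at once; combinations that never occur in the data come back as NA.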
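6. The mapply() function
mapply is another variant of sapply, roughly a multivariate sapply, but its arguments are defined differently. The first argument is the function FUN, and the '...' arguments that follow accept several data inputs, which are passed to FUN in parallel as its arguments.
Parameters:
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
FUN: the function to call
...: accepts several data inputs
MoreArgs: a list of additional arguments passed unchanged to every call of FUN
SIMPLIFY: whether to simplify the result to a vector, matrix, or array; when set to "array", the result is simplified to an array where possible
USE.NAMES: if the first ... argument is a character vector without names, TRUE uses its elements as the names of the result and FALSE does not
> set.seed(1)
# Sample size of 4 for each group
> n<-rep(4,4)
# m is the mean, v is the standard deviation (the third argument of rnorm is sd, not the variance)
> m<-v<-c(1,10,100,1000)
# Generate 4 groups of data, one group per column
> mapply(rnorm,n,m,v)
          [,1]      [,2]      [,3]       [,4]
[1,] 0.3735462 13.295078 157.57814   378.7594
[2,] 1.1836433  1.795316  69.46116 -1214.6999
[3,] 0.1643714 14.874291 251.17812  2124.9309
[4,] 2.5952808 17.383247 138.98432   955.0664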
Because mapply accepts several inputs, we do not need to combine the data into a data.frame first; one call computes the result directly.
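A deterministic sketch may make the parallel argument passing easier to follow than the random example. The anonymous functions and the data frame `df` below are illustrative and not part of the original post.

```r
# Element-wise calls over two vectors: FUN(1, 4), FUN(2, 5), FUN(3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# MoreArgs holds arguments passed unchanged to every call
scaled <- mapply(function(x, y, k) (x + y) * k, 1:3, 4:6,
                 MoreArgs = list(k = 10))

# Results of different lengths come back as a list
reps <- mapply(rep, times = 1:3, x = 3:1)

# The same element-wise sum over two columns of a data frame
df <- data.frame(p = 1:3, q = 4:6)
from_df <- mapply(function(x, y) x + y, df$p, df$q)
```

Arguments given through '...' are walked in step, one element per call, whereas anything in MoreArgs is handed over whole each time.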
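7. The rapply() function
rapply is a recursive version of lapply. It handles list data only, walking each element of the list recursively and descending into any element that itself contains sub-elements.
Function definition: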
rapply(object, f, classes = "ANY", deflt = NULL, how = c("unlist", "replace", "list"), ...)
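object: the list data
f: the function to call
classes: the classes to match; ANY matches all classes
deflt: the default value for elements that do not match
how: three modes. With replace, each matching element of the original list is replaced by the result of calling f on it; with list, a new list is built in which matching elements receive the result of f and non-matching elements are set to deflt; with unlist, the list result is additionally passed through unlist(recursive = TRUE)
...: further optional arguments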
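For example, filter the data in a list and sort all data of class numeric in ascending order.
> x=list(a=12,b=4:1,c=c('b','a'))
> y=pi
> z=data.frame(a=rnorm(10),b=10:1)
> a <- list(x=x,y=y,z=z)
# Sort, and replace the values in the original list
> rapply(a,sort, classes='numeric',how='replace')
$x
$x$a
[1] 12

$x$b
[1] 4 3 2 1

$x$c
[1] "b" "a"


$y
[1] 3.141593

$z
$z$a
 [1] -0.8356286 -0.8204684 -0.6264538 -0.3053884  0.1836433  0.3295078
 [7]  0.4874291  0.5757814  0.7383247  1.5952808

$z$b
 [1] 10  9  8  7  6  5  4  3  2  1

> class(a$z$b)
[1] "integer"
The result shows that only z$a was sorted. Checking the class of z$b shows that it is integer, which is not the same as numeric, so it was not sorted.
Next, operate on the character data: append the string '++++' to every character value and set all non-character data to NA.
> rapply(a,function(x) paste(x,'++++'),classes="character",deflt=NA, how = "list")
$x
$x$a
[1] NA

$x$b
[1] NA

$x$c
[1] "b ++++" "a ++++"


$y
[1] NA

$z
$z$a
[1] NA

$z$b
[1] NA

Only x$c is a character vector, so only its elements had the new string appended. With rapply, filtering data held in a list becomes straightforward.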
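The three how modes can be compared side by side on a small nested list. This is a minimal sketch in which `nested` is a hypothetical list containing only double and character leaves.

```r
# Hypothetical nested list with double and character leaves
nested <- list(p = c(3.5, 1.5), q = list(r = c(2.5, 0.5), s = "txt"))

# how = "replace": sort every numeric leaf, leave other leaves untouched
sorted <- rapply(nested, sort, classes = "numeric", how = "replace")

# how = "unlist": apply f to matching leaves and flatten to one vector;
# non-matching leaves get deflt (NULL here) and drop out
doubled <- rapply(nested, function(v) v * 2, classes = "numeric",
                  how = "unlist")

# how = "list" with deflt: non-matching leaves become the default value
flagged <- rapply(nested, toupper, classes = "character",
                  deflt = NA, how = "list")
```

replace keeps the shape of the input, list keeps the shape but overwrites non-matching leaves, and unlist gives up the shape in exchange for a flat vector.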
8. The eapply() function
eapply(env, FUN, ..., all.names = FALSE, USE.NAMES = TRUE)
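The post gives only the signature for eapply, so the following is a minimal sketch of typical use rather than an example from the original. eapply applies FUN to the objects in an environment and returns a named list; the environment `env` and its contents are hypothetical.

```r
# Hypothetical environment holding three objects
env <- new.env()
assign("a", 1:10, envir = env)
assign("b", c(2, 4, 6), envir = env)
assign(".hidden", 100, envir = env)

# Apply mean to every visible object; the result is a named list
means <- eapply(env, mean)

# The order of an environment's contents is arbitrary, so index by name
means[["a"]]
means[["b"]]

# all.names = TRUE also visits objects whose names start with a dot
all_means <- eapply(env, mean, all.names = TRUE)
```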
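Although the apply family is large, the first three members usually cover our looping needs. Using the family sensibly lets us reach our goals more efficiently and concisely.
http://blog.fens.me/r-apply/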