I decided to write this short post after wasting some hours to optimize my R code to make median calculation possible for some million of values. Hope it can help someone else. The problem is easy to explain and to solve. Median is a very expensive operation to compute as it implies ordination of values, therefore sometimes may considerably slow down the code running time.
air jordan Factory
My example makes use of a list of matrices, like this:
pippo <- list(matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5))
The list is made by matrices with 2000 rows and 5 columns, a total of 1M values:
ray ban rimless sunglasses
R> str(pippo) List of 10 $ : num [1:20000, 1:5] 0.000267 0.195338 0.648922 0.107793 0.917322 ... $ : num [1:20000, 1:5] 0.528 0.812 0.256 0.081 0.282 ... $ : num [1:20000, 1:5] 0.2151 0.2923 0.4407 0.0223 0.0631 ... $ : num [1:20000, 1:5] 0.887 0.932 0.604 0.211 0.894 ... $ : num [1:20000, 1:5] 0.0537 0.2992 0.7143 0.6015 0.9257 ... $ : num [1:20000, 1:5] 0.252 0.616 0.153 0.643 0.521 ... $ : num [1:20000, 1:5] 0.7559 0.1904 0.0753 0.8705 0.7023 ... $ : num [1:20000, 1:5] 0.183 0.767 0.993 0.798 0.842 ... $ : num [1:20000, 1:5] 0.772 0.991 0.641 0.598 0.314 ... $ : num [1:20000, 1:5] 0.948 0.309 0.638 0.189 0.232 ...
I needed a median matrix, since each matrix was a simulation of the same process:
I found several different approaches to solve this problem on blogs and forums. I listed them below and pointed out the fastest I found.
Required packages
library(matrixStats) library(parallelize.dynamic) library(plyr) library(microbenchmark)</pre>
</pre> <pre>R> system.time(a<-Apply(simplify2array(pippo),c(1,2),median)) user system elapsed 2.746 0.000 2.747 R> system.time(b<-apply(simplify2array(pippo),c(1,2),median)) user system elapsed 2.750 0.000 2.751 R> system.time(c<-aaply(laply(pippo,as.matrix),c(2,3),median)) user system elapsed 12.22 0.00 12.23 R> system.time(d<-Apply(aperm(simplify2array(pippo),c(2,3,1)),3,rowMedians)) user system elapsed 0.273 0.000 0.272 R> system.time(e<-Apply(aperm(simplify2array(pippo),c(3,2,1)),3,colMedians)) user system elapsed 0.274 0.004 0.278 R> system.time(f<-apply(aperm(simplify2array(pippo),c(2,3,1)),3,rowMedians)) user system elapsed 0.270 0.004 0.275 R> system.time(g<-apply(aperm(simplify2array(pippo),c(3,2,1)),3,colMedians)) user system elapsed 0.28 0.00 0.28
nike air max 90 black
All the result gave the same supposedly correct median matrix:
R> all.equal(a, b, c, d,e ,f, g,h) [1] TRUE
But d,e,f,g were considerably faster than the others even though very similar among them.
Since I needed to apply this operation several times and on much bigger lists of matrices, I wanted to know the fastest one among them. To do this I used microbenchmark package which allows to run iterations and look at time statistics.
nike air max Discount
R> microbenchmark(Apply(aperm(simplify2array(pippo),c(2,3,1)),3,rowMedians)) Unit: milliseconds expr min lq Apply(aperm(simplify2array(pippo), c(2, 3, 1)), 3, rowMedians) 274.9 348.4 mean median uq max neval 355.6 351.1 358.6 425.9 100 R> microbenchmark(Apply(aperm(simplify2array(pippo),c(3,2,1)),3,colMedians)) Unit: milliseconds expr min lq mean Apply(aperm(simplify2array(pippo), c(3, 2, 1)), 3, colMedians) 287 353.8 359.9 median uq max neval 359.2 363 417.6 100 R> microbenchmark(apply(aperm(simplify2array(pippo),c(2,3,1)),3,rowMedians)) Unit: milliseconds expr min lq apply(aperm(simplify2array(pippo), c(2, 3, 1)), 3, rowMedians) 279.1 351.8 mean median uq max neval 360.1 358 363.2 431.9 100 R> microbenchmark(apply(aperm(simplify2array(pippo),c(3,2,1)),3,colMedians)) Unit: milliseconds expr min lq apply(aperm(simplify2array(pippo), c(3, 2, 1)), 3, colMedians) 282.6 350.9 mean median uq max neval 353.3 351.5 352 417.5 100
air nike jordan
Therefore approach h, which makes use of aperm+apply+simplify2array+colMedians, is, on average, the fastest method for matrix median calculation. In general, row based operations seem to be a bit faster than column based operations. I read that R is row-optimized indeed! Surprisingly, the parallelized version of apply (Apply) does not improve the computational time!
Matteo