Speed up matrix median calculation in R

I decided to write this short post after wasting some hours to optimize my R code to make median calculation possible for some million of values. Hope it can help someone else. The problem is easy to explain and to solve. Median is a very expensive operation to compute as it implies ordination of values, therefore sometimes may considerably slow down the code running time.
air jordan Factory
My example makes use of a list of matrices, like this:

pippo <- list(matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5),matrix(runif(100000),ncol=5))

The list is made by matrices with 2000 rows and 5 columns, a total of  1M values:
ray ban rimless sunglasses

R> str(pippo)
List of 10
$ : num [1:20000, 1:5] 0.000267 0.195338 0.648922 0.107793 0.917322 ...
$ : num [1:20000, 1:5] 0.528 0.812 0.256 0.081 0.282 ...
$ : num [1:20000, 1:5] 0.2151 0.2923 0.4407 0.0223 0.0631 ...
$ : num [1:20000, 1:5] 0.887 0.932 0.604 0.211 0.894 ...
$ : num [1:20000, 1:5] 0.0537 0.2992 0.7143 0.6015 0.9257 ...
$ : num [1:20000, 1:5] 0.252 0.616 0.153 0.643 0.521 ...
$ : num [1:20000, 1:5] 0.7559 0.1904 0.0753 0.8705 0.7023 ...
$ : num [1:20000, 1:5] 0.183 0.767 0.993 0.798 0.842 ...
$ : num [1:20000, 1:5] 0.772 0.991 0.641 0.598 0.314 ...
$ : num [1:20000, 1:5] 0.948 0.309 0.638 0.189 0.232 ...

I needed a median matrix, since each matrix was a simulation of the same process:

I found several different approaches to solve this problem on blogs and forums. I listed them below and pointed out the fastest I found.

Required packages

library(matrixStats)
library(parallelize.dynamic)
library(plyr)
library(microbenchmark)</pre>

ray ban brasil

</pre>
<pre>R> system.time(a<-Apply(simplify2array(pippo),c(1,2),median))
user system elapsed
2.746 0.000 2.747
R> system.time(b<-apply(simplify2array(pippo),c(1,2),median))
user system elapsed
2.750 0.000 2.751
R> system.time(c<-aaply(laply(pippo,as.matrix),c(2,3),median))
user system elapsed
12.22 0.00 12.23
R> system.time(d<-Apply(aperm(simplify2array(pippo),c(2,3,1)),3,rowMedians))
user system elapsed
0.273 0.000 0.272
R> system.time(e<-Apply(aperm(simplify2array(pippo),c(3,2,1)),3,colMedians))
user system elapsed
0.274 0.004 0.278
R> system.time(f<-apply(aperm(simplify2array(pippo),c(2,3,1)),3,rowMedians))
user system elapsed
0.270 0.004 0.275
R> system.time(g<-apply(aperm(simplify2array(pippo),c(3,2,1)),3,colMedians))
user system elapsed
0.28 0.00 0.28

nike air max 90 black
All the result gave the same supposedly correct median matrix:

R> all.equal(a, b, c, d,e ,f, g,h)
[1] TRUE

But d,e,f,g were considerably faster than the others even though very similar among them.
Since I needed to apply this operation several times and on much bigger lists of matrices, I wanted to know the fastest one among them. To do this I used microbenchmark package which allows to run iterations and look at time statistics.
nike air max Discount

R> microbenchmark(Apply(aperm(simplify2array(pippo),c(2,3,1)),3,rowMedians))
Unit: milliseconds
expr min lq
Apply(aperm(simplify2array(pippo), c(2, 3, 1)), 3, rowMedians) 274.9 348.4
mean median uq max neval
355.6 351.1 358.6 425.9 100
R> microbenchmark(Apply(aperm(simplify2array(pippo),c(3,2,1)),3,colMedians))
Unit: milliseconds
expr min lq mean
Apply(aperm(simplify2array(pippo), c(3, 2, 1)), 3, colMedians) 287 353.8 359.9
median uq max neval
359.2 363 417.6 100
R> microbenchmark(apply(aperm(simplify2array(pippo),c(2,3,1)),3,rowMedians))
Unit: milliseconds
expr min lq
apply(aperm(simplify2array(pippo), c(2, 3, 1)), 3, rowMedians) 279.1 351.8
mean median uq max neval
360.1 358 363.2 431.9 100
R> microbenchmark(apply(aperm(simplify2array(pippo),c(3,2,1)),3,colMedians))
Unit: milliseconds
expr min lq
apply(aperm(simplify2array(pippo), c(3, 2, 1)), 3, colMedians) 282.6 350.9
mean median uq max neval
353.3 351.5 352 417.5 100

air nike jordan
Therefore approach h, which makes use of aperm+apply+simplify2array+colMedians, is, on average, the fastest method for matrix median calculation. In general, row based operations seem to be a bit faster than column based operations. I read that R is row-optimized indeed! Surprisingly, the parallelized version of apply (Apply) does not improve the computational time!

Matteo

Leave a Reply

Your email address will not be published. Required fields are marked *