In this document, a performance evaluation of three different implementations of the QuickSort algorithm is presented. The document compare between:
For the hardware:
## Processor: Intel(R) Core(TM) i5 CPU M 450 @ 2.40GHz
## Processor architecture: x86_64
## #Processors: 4
## #cores per processor: 2
## Main memory: 3832292 kB
## Caches:
## L1d cache: 32K
## L1i cache: 32K
## L2 cache: 256K
## L3 cache: 3072K
For the software:
## Operating system: GNU/Linux
## Distribution: Ubuntu 15.10
## Kernel version: 4.2.0-23-generic
## Compiler version: g++ (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010
clock_gettime
is used to get the current system time.O3
.The design of experiments is:
fac.design(nfactors=2, replications=15, repeat.only=FALSE,
blocks=1, randomize=TRUE, seed=24625,
factor.names=list(size=(1:20)*50000, tlevel=c(2:6)))
First plot of the data, 8 threads is used in the parallel version:
The LibC is worse than the two others, and the gap increases with the increased size. This mainly due to the comparison function calls, as it is shown by profiling.
Let’s focus on the sequential and parallel implementations:
With the confidence interval:
With the regression line:
For array size over 300K, the parallel version outperforms the sequential one
For a high number of threads (16 and 32) we notice a high variance and high number of outliers; this can be explained by the increasing number of cache misses.
Just focusing on 4 and 8 threads:
For now it is not clear who is the winner.. let’s go for larger arrays..
The design of this experiments is:
Design.2 <- fac.design(nfactors=2, replications=5, repeat.only=FALSE,
blocks=1, randomize=TRUE, seed=24625,
factor.names=list(size=(1:10)*(10^6), tlevel=c(3:4)))
The results:
When the size goes large, the best number of threads tends to be 8 which is exactly equals to nb_processors * nb_cores_per_processor
.
The sequential and the parallel version should be used together to obtain the maximum performance for any input size. Selecting of the thread level should be done according the underlying system hadrware.