Intro

In this document, a performance evaluation of three different implementations of the QuickSort algorithm is presented. The document compare between:

Testbed

For the hardware:

## Processor: Intel(R) Core(TM) i5 CPU M 450 @ 2.40GHz
## Processor architecture: x86_64
## #Processors: 4
## #cores per processor: 2
## Main memory: 3832292 kB
## Caches: 
## L1d cache:             32K
## L1i cache:             32K
## L2 cache:              256K
## L3 cache:              3072K

For the software:

## Operating system: GNU/Linux
## Distribution: Ubuntu 15.10
## Kernel version: 4.2.0-23-generic
## Compiler version: g++ (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010

Software spec.

Experiments

The design of experiments is:

fac.design(nfactors=2, replications=15, repeat.only=FALSE,
                       blocks=1, randomize=TRUE, seed=24625,
                       factor.names=list(size=(1:20)*50000, tlevel=c(2:6)))

Results

First plot of the data, 8 threads is used in the parallel version:

The LibC is worse than the two others, and the gap increases with the increased size. This mainly due to the comparison function calls, as it is shown by profiling.

Let’s focus on the sequential and parallel implementations:

With the confidence interval:

With the regression line:

For array size over 300K, the parallel version outperforms the sequential one

Analyzing the number of threads in the parallel verion

For a high number of threads (16 and 32) we notice a high variance and high number of outliers; this can be explained by the increasing number of cache misses.

Just focusing on 4 and 8 threads:

For now it is not clear who is the winner.. let’s go for larger arrays..

The design of this experiments is:

Design.2 <- fac.design(nfactors=2, replications=5, repeat.only=FALSE,
                      blocks=1, randomize=TRUE, seed=24625,
                      factor.names=list(size=(1:10)*(10^6), tlevel=c(3:4)))

The results:

When the size goes large, the best number of threads tends to be 8 which is exactly equals to nb_processors * nb_cores_per_processor.

Conclusion

The sequential and the parallel version should be used together to obtain the maximum performance for any input size. Selecting of the thread level should be done according the underlying system hadrware.