Performance Evaluation of Parallel QuickSort implementation

Intro

In this document, a performance evaluation of three different implementations of the QuickSort algorithm is presented. The document compare between:

Sequential implementation
Parallel implementation (POSIX Threads)
Built-in implementation (Libc)

Testbed

For the hardware:

## Processor: Intel(R) Core(TM) i5 CPU M 450 @ 2.40GHz
## Processor architecture: x86_64
## #Processors: 4
## #cores per processor: 2
## Main memory: 3832292 kB
## Caches: 
## L1d cache:             32K
## L1i cache:             32K
## L2 cache:              256K
## L3 cache:              3072K

For the software:

## Operating system: GNU/Linux
## Distribution: Ubuntu 15.10
## Kernel version: 4.2.0-23-generic
## Compiler version: g++ (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010

Software spec.

A fixed seed is used to guarantee that all replications are done on the same values.
The elements type of arrays are double-precision floating-point.
For measuring time, the function clock_gettime is used to get the current system time.
The optimization level of the compiler is set to O3.

Experiments

For each run, we do the measurements for each sorting function three times, in a random order, and take the average.
The parameter space is:
- For the size: from 50k to 1M by step of 50k
- For the thread level: from 2 threads to 32 threads

The design of experiments is:

fac.design(nfactors=2, replications=15, repeat.only=FALSE,
                       blocks=1, randomize=TRUE, seed=24625,
                       factor.names=list(size=(1:20)*50000, tlevel=c(2:6)))

Results

First plot of the data, 8 threads is used in the parallel version:

The LibC is worse than the two others, and the gap increases with the increased size. ~~This mainly due to the comparison function calls, as it is shown by profiling.~~

Let’s focus on the sequential and parallel implementations:

With the confidence interval:

With the regression line:

For array size over 300K, the parallel version outperforms the sequential one

Analyzing the number of threads in the parallel verion

For a high number of threads (16 and 32) we notice a high variance and high number of outliers; this can be explained by the increasing number of cache misses.

Just focusing on 4 and 8 threads:

For now it is not clear who is the winner.. let’s go for larger arrays..

The design of this experiments is:

Design.2 <- fac.design(nfactors=2, replications=5, repeat.only=FALSE,
                      blocks=1, randomize=TRUE, seed=24625,
                      factor.names=list(size=(1:10)*(10^6), tlevel=c(3:4)))

The results:

When the size goes large, the best number of threads tends to be 8 which is exactly equals to nb_processors * nb_cores_per_processor.

Conclusion

The sequential and the parallel version should be used together to obtain the maximum performance for any input size. Selecting of the thread level should be done according the underlying system hadrware.