SMPI Status Report

This is a short summary of Augustin's recent work on SMPI.

BigDFT

  • Augustin has finally gained access to the BigDFT Bazaar repository and thus to the latest version of the code. Input file management has been improved upstream but was still broken for SMPI and had to be fixed. Augustin had to hack quite a bit but finally got it working.
  • The __thread storage-class specifier is not available in Fortran. Instead he relied on the OpenMP threadprivate directive, which provides similar functionality (see the C sketch after this list). Obviously, this required annotating globals everywhere with this directive, which is only needed for SMPI. This could be enclosed in #ifdef, but cpp is not necessarily run on Fortran code… So these modifications will a priori not get committed since, as such, they would induce a performance loss. Another point is that BigDFT over SMPI has to be compiled with OpenMP and run with one OpenMP thread per node.
  • It still segfaults in the end because of a bunch of frees that are not correctly protected yet. At the moment it does not segfault or double-free in a non-deterministic way because kernel characterization is still done at runtime.
  • In terms of communication pattern, BigDFT is mainly a series of independent computations interleaved with AllReduce, All2Allv and AllGatherv operations (a toy skeleton after this list illustrates this pattern).
  • A painful point is that BigDFT is written in Fortran and Augustin has so far failed to trace it with TAU. So he relies on Extrae (which works with MPICH but not with OpenMPI… :( ).
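
To make the privatization issue concrete, here is a minimal C sketch (hypothetical, not code from BigDFT or SMPI) contrasting the C-only __thread keyword with the OpenMP threadprivate directive; only the latter has a Fortran counterpart, which is why the annotations had to be added all over BigDFT.

    /* Minimal sketch (not BigDFT/SMPI code): two ways of making a global
       per-thread, which is what SMPI's privatization of ranks needs. */
    #include <omp.h>
    #include <stdio.h>

    /* C only: thread-local storage keyword, with no Fortran equivalent. */
    static __thread int c_only_counter = 0;

    /* Portable alternative: the OpenMP threadprivate directive.  The Fortran
       analogue that had to be sprinkled over BigDFT's modules looks like:
           integer :: counter
           !$omp threadprivate(counter)                                      */
    static int omp_counter = 0;
    #pragma omp threadprivate(omp_counter)

    int main(void)
    {
    #pragma omp parallel
        {
            /* each OpenMP thread sees and updates its own private copy */
            omp_counter += omp_get_thread_num();
            c_only_counter++;
            printf("thread %d: omp_counter=%d\n",
                   omp_get_thread_num(), omp_counter);
        }
        return 0;
    }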
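
The communication structure described in the list above can be summarized by a toy MPI skeleton (hypothetical sizes and iteration counts, uniform counts instead of BigDFT's variable ones, not BigDFT code) alternating independent computation with the three collectives; this is roughly the workload that SMPI has to reproduce.

    /* Toy skeleton (not BigDFT code): independent computation phases
       interleaved with AllReduce / All2Allv / AllGatherv. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int n = 1 << 16;                        /* arbitrary local block size */
        double *local = calloc(n, sizeof(double));
        double *sendb = calloc((size_t)n * size, sizeof(double));
        double *recvb = calloc((size_t)n * size, sizeof(double));
        int *counts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) { counts[i] = n; displs[i] = i * n; }

        for (int iter = 0; iter < 10; iter++) {
            double norm = 0, gnorm;
            for (int i = 0; i < n; i++) {       /* independent computation */
                local[i] = 0.5 * local[i] + rank;
                norm += local[i] * local[i];
            }
            MPI_Allreduce(&norm, &gnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            MPI_Alltoallv(sendb, counts, displs, MPI_DOUBLE,
                          recvb, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
            MPI_Allgatherv(local, n, MPI_DOUBLE,
                           recvb, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
        }

        free(local); free(sendb); free(recvb); free(counts); free(displs);
        MPI_Finalize();
        return 0;
    }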

All2All and load injection to saturation

All2All is likely to be the most difficult operation to simulate as it induces a lot of communications. The ultimate goal is to "prove" that we manage to account for contention in hierarchical networks. So we rely heavily on griffon, which has three cabinets.

Augustin has run many, many experiments (with different cabinet configurations) and compared real-life (RL) traces with SMPI. There are important discrepancies for the All2All regions. There are several reasons for this:

  1. The All2All communication pattern is different. We forced MPICH and OpenMPI to use the "default" pair-wise algorithm, but they may decide to use something else at runtime. Note that observing the detailed communications of a collective operation requires extracting the code from the MPI sources and running it as standard MPI code, so that the tracing library can trace individual communications rather than just the entry and exit of the All2All, which gives rather poor information (a generic sketch of this pair-wise algorithm is given after this list).
  2. Send-Receive communications do not behave as expected.

    • OpenMPI: ISend, Recv, Wait
    • MPICH: ISend, IRecv, Waitall

    But although a series of SendRecv operations involving a pair of processors differs between versions (sometimes both the Send and the Recv proceed at full speed, sometimes they are serialized), this difference disappears as soon as more processors are involved. This phenomenon is quite visible here. In the first phase (red and blue), the first processors do 10 Sends while the last ones do 10 Recvs (1->N/2+1, 2->N/2+2, …). In the second phase (green), the processors use pairwise SendRecv (1<->N/2+1, 2<->N/2+2, …) and there is a global slowdown of roughly 1.3. A more detailed look confirms this trend and reveals much more important slowdowns (around 2).

    Yet, the following screenshot shows the communication times for a 10MB All2All using OpenMPI (pairwise, using the default SendRecv). As can be seen, communications in the middle are slowed down (they correspond to the 1<->N/2+1, 2<->N/2+2, … situation), but this discrepancy is not really significant and cannot explain the underestimation of the whole All2All operation.

    So we need to focus on SendRecv patterns involving a larger number of participants, such as 1->2->3->4->1. In such configurations everyone gets the same communication time, but the achieved bandwidth is much lower than expected: links do not behave as simple full-duplex links. The send and receive operations interfere with each other and, instead of getting twice the bandwidth, only 1.5x (at best, and more generally 1.2x with OpenMPI) of the bandwidth can be reached. Such a saturation phenomenon is easily accounted for by changing the cluster representation and adding, for each node, a new link that is shared by incoming and outgoing communications.

  3. Saturation is not what we expected. Since pair-wise exchanges (1<->2 || 3<->4) have this serialization issue, we only run 1->2->3->4->1 SendRecv ring patterns (see the ring micro-benchmark sketch after this list):
    • When increasing the number of such groups within a cabinet, no degradation is visible.
    • When grouping 1 and 3 in the same cabinet and 2 and 4 in the other cabinet, saturation starts to appear on the link between cabinets.
    • The behavior is different from what is observed when doing 1->2 and 3->4, which means again that the connection between the cabinets is not really full duplex. The saturation in the interconnect is different.
    • Saturation on the interconnect is not "fair". Some communications are "normal", without slowdown, but others are extremely slowed down (3 to 11 times slower). This "slowdown factor" seems almost discrete, with a strange distribution, somehow as if there were some kind of slotting mechanism. At the moment, even if this is not fully satisfactory since these times are multi-modal, Augustin can compute the average and derive a sort of "maximum average capacity" that can be used to limit the full-duplex link used to model interconnect saturation, which finally yields a satisfactory global All2All communication time.
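
For reference, the "default" pair-wise algorithm mentioned in point 1 boils down to a loop of the following shape (a generic sketch, not the actual MPICH or OpenMPI source). Extracting the code like this and running it as ordinary MPI code is what makes the individual point-to-point messages visible to the tracing library.

    /* Generic pair-wise All2All (sketch, not the vendor source): at step s,
       rank r sends its block to (r+s) mod p and receives from (r-s) mod p. */
    #include <mpi.h>
    #include <string.h>

    void pairwise_alltoall(const char *sendbuf, char *recvbuf,
                           int block_bytes, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        /* the local block is a plain copy */
        memcpy(recvbuf + (size_t)rank * block_bytes,
               sendbuf + (size_t)rank * block_bytes, block_bytes);

        for (int s = 1; s < p; s++) {
            int dst = (rank + s) % p;
            int src = (rank - s + p) % p;
            MPI_Sendrecv(sendbuf + (size_t)dst * block_bytes, block_bytes,
                         MPI_BYTE, dst, 0,
                         recvbuf + (size_t)src * block_bytes, block_bytes,
                         MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
        }
    }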
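
The 1->2->3->4->1 ring used in point 3 can be reproduced with a micro-benchmark along the following lines (a sketch; message size and repetition count are placeholders): every rank sends to its successor while receiving from its predecessor, so each node drives one outgoing and one incoming stream at the same time, which is exactly the situation where the links stop behaving as ideal full-duplex links.

    /* Ring SendRecv micro-benchmark (sketch): rank r sends to r+1 and
       receives from r-1 simultaneously. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const int bytes = 10 * 1024 * 1024;     /* placeholder message size */
        const int reps  = 10;
        char *sbuf = malloc(bytes), *rbuf = malloc(bytes);
        memset(sbuf, 0, bytes);
        int next = (rank + 1) % p, prev = (rank - 1 + p) % p;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++)
            MPI_Sendrecv(sbuf, bytes, MPI_BYTE, next, 0,
                         rbuf, bytes, MPI_BYTE, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double t = MPI_Wtime() - t0;

        /* with ideal full-duplex links each rank would sustain close to the
           full link bandwidth in each direction; the experiments described
           above reach only 1.2x-1.5x of the one-way bandwidth in total */
        printf("rank %d: %.3f s, %.1f MB/s outgoing\n",
               rank, t, reps * (double)bytes / t / 1e6);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }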

AllReduce in BigDFT

When actually running BigDFT, the critical collective operation is AllReduce, and there is an important discrepancy between real-life (RL) runs and SimGrid (SG). The main reason seems to be that the collective communication algorithms used in RL MPI implementations are very different from the simplistic ones of SMPI. E.g., there are 2300 lines of code for the AllReduce in OpenMPI! This is to be compared with around 60 lines in SMPI… A specific collective communication algorithm can be forced, but it seems that in general there is some kind of decision module that dynamically adapts itself to take the best possible decision.

  • This really raises the question of re-implementing all these collective operations. Choosing an implementation and acting as a "driver" for MPI would solve these issues… MPI 3.0 is likely to make things even worse and force us into "marrying" a standard implementation.
  • Another alternative was suggested by Augustin, who stumbled upon starMPI, which provides many different implementations of collective operations (directly using MPI Send and Recv primitives) that we could thus reuse.
  • More pragmatically, Augustin is currently looking at which algorithm is used for the AllReduce in MPICH, so as to see whether using the right pattern would put things back into order. Actually, using one of these optimized operations in SimGrid dramatically helps and should solve our problems (a sketch of such a pattern is given below).
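
As an illustration of what such an optimized pattern looks like, here is a recursive-doubling AllReduce written with plain point-to-point calls (a sketch restricted to power-of-two communicator sizes, in the spirit of the starMPI routines mentioned above, not the actual MPICH or SMPI code).

    /* Recursive-doubling AllReduce expressed with point-to-point calls
       (sketch: sums doubles, assumes the communicator size is a power of two). */
    #include <mpi.h>
    #include <stdlib.h>

    void rd_allreduce_sum(double *buf, int count, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        double *tmp = malloc(count * sizeof(double));
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;            /* exchange with partner... */
            MPI_Sendrecv(buf, count, MPI_DOUBLE, partner, 0,
                         tmp, count, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < count; i++)       /* ...and combine locally */
                buf[i] += tmp[i];
        }
        free(tmp);
    }

Writing the collectives at this point-to-point level is also what lets the tracing library observe the individual messages, as noted for the All2All above.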

BigDFT runs on Curie

Two types of run:

  1. A very scalable one (in the upper part, light blue is compute, yellow is Bcast and red is Barrier), meant to show that BigDFT can scale (note that initialization takes place before the khaki part, the All2Allv, and that they are mainly interested in what happens after it). The trace mainly shows large periods of computation followed by short periods of communication, so it is not really interesting for us.
  2. The second trace shows an unbalanced situation (blue is computation, green is Wait, pink is AllReduce, actually waiting for all nodes to be available). There are several orbitals to compute and it is not easy to divide them evenly between the different nodes. There is thus some important load imbalance and nice patterns of waiting. Although this is not a desired situation, it would be interesting to study.

Entered on [2013-03-14 jeu. 11:09]