This page is now outdated. All this work has been moved to http://simgrid.gforge.inria.fr/contrib/smpi-paraver.html. Please use the up-to-date version instead.
Achievements¶
Links to generated files¶
Presentation of current work from both sides¶
- Simulation of MPI programs (Arnaud Legrand)
- Spatial and Temporal Aggregation of Traces of Parallel Systems (Damien Dosimont)
- Evolution of the BigDFT code (Luigi Genovese)
- Presentation of the Paraver Format to improve interoperability (Juan Gonzalez)
- Clustering techniques applied to BigDFT (Harald Servat)
- Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures (Luka Stanisic)
- Raising the Level of Abstraction: Simulation of Large Chip Multiprocessors Running Multithreaded Applications (Alejandro Rico)
TODO BigDFT simulation [⅔]¶
- [x] Simulate order(n) BigDFT with SMPI with no modification.
- [x] Obtained an unbalanced (paje) trace where we could observe the same kind of (paraver) trace as what Luigi, Brice and our BSC colleagues obtained on a real run. The timings obviously do not make any sense since the platform model was completely different from the real platform, but the general unbalanced shape was the same and the same process was slowing down the whole application.
- [ ] Instrument order(n) BigDFT to speed up the simulation?
TODO Interaction between Paraver and SMPI [⅝]¶
- [ ] Paraver conversion
- [x] Wrote a paraver to csv/pjdump/smpi converter (in perl) that worked on an old small 8 node BigDFT paraver trace.
- [ ] A few ugly things had to be done here (reduce, alltoallV, no handling of p2p operations, second/nanosecond issue, …) and need to be cleaned up.
- [ ] Maybe it would be interesting to have an option that allows extrae to trace all the parameters?
- [x] Wrote a simple shell script to replay this trace with SMPI and generated an SMPI paje trace.
- [x] Improved the shell script so that it takes arguments on the command line.
- [x] Wrote a perl script that converts an SMPI paje trace to the paraver file format.
- [ ] Improve this perl script
- [x] improve the conversion to export events so that collective operation names are the same and things are easily comparable. (Edit: this was done in Chicago with Harald)
- [ ] Currently there are two scripts (pjdump2prv.pl and pjsmpi2prv.pl). The first one is for ocelotl/pjdump output while the second one is intended for the SMPI -> PRV final step. I'm currently merging them.
- [ ] add links (arrows) so that bandwidth can be computed in paraver
- [x] Managed to open the resulting paraver trace in paraver.
- [x] Have a prototype integration of SMPI within Paraver. (Edit: this was done in Chicago. If you use the dimemas-wrapper.sh below instead of the original one, it will launch SMPI. Better integration allowing the platform and deployment to be specified would be nice.)
- [ ] Make a model of Mare Nostrum, the Mont-Blanc prototype, so that BSC staff can really play with SMPI. (Edit: this was discussed in Chicago with Judit. I explained the SimGrid XML platform representation to her and she will try to play with SMPI and come back to me with questions.)
TODO Trace Aggregation [⅘]¶
All this is better summarized in the blog entry Damien wrote about this.
- [x] The paraver to pjdump converter was integrated in framesoc.
- [x] Damien managed to load several paraver traces in ocelotl and to play with aggregation.
- [x] Managed to load an SMPI replayed trace of order(n) BigDFT and could aggregate it and easily spot the disturbing process and the application phases.
- [x] Convert the real O(n) BigDFT paraver trace and aggregate it.
- [ ] Convert the 12 GB Nancy LU trace (700 processes on 3 clusters) to paraver to see whether the behavior exhibited by ocelotl can be observed in Paraver. This involves slightly modifying the paje to paraver converter, which was designed for SMPI paje traces.
This trace was on flutin and I got it here:
- [ ] Fix the state name conversion and the event conversion
- [ ] The ',9' at the end of the header is the number of communicators…
- [ ] The resulting prv starts from the pjdump, which I forgot to sort. Could we give pj_dump an option so that it sorts its output according to time?
- [ ] Do not use state 0 as it's reserved for computation
- [ ] Create a state and event for MPI application (derived from being outside MPI calls)
- [ ] clock resolution issue
Interaction between Paraver and SMPI¶
A year and a half ago, I needed to write a paraver converter because, in a particular setup, I could not trace BigDFT with either TAU or Scalasca. My goal was simply to compute statistics on the trace using R. Today, we're in Barcelona and we're discussing whether SMPI could be used as an alternative to Dimemas within the paraver framework. To this end, we need to make sure that SMPI can simulate paraver traces and output paraver traces. Ideally, we would modify SMPI so that it can parse and generate such traces, but that is probably more work than we can achieve in two days, so we'll go for simple trace conversions, i.e., a paraver to SMPI time-independent trace format conversion and a Paje to paraver conversion.
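As a reminder, SMPI's time-independent trace format describes, for each rank, one action per line: the rank, the action name, and then action-specific parameters. A made-up sketch of what such a trace looks like (this is what the converter below produces):

0 init
0 compute 24259.4
0 bcast 1000
0 allReduce 1000 1000
0 finalize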
Let's start from the traces I used at that time.
cp -r ../../../2013/04/03/paraver_trace ./
ls paraver_trace/
EXTRAE_Paraver_trace_mpich.pcf
EXTRAE_Paraver_trace_mpich.prv
EXTRAE_Paraver_trace_mpich.row
Paraver to CSV and SMPI format Conversion¶
Juan Gonzalez provided us with a description of the Paraver and Dimemas formats. The Paraver description is available here, i.e., from the Paraver documentation.
Remember that the pcf file describes the events, the row file defines the cpu/node/thread mapping, and the prv file is the trace with all the events. I reworked my old script during the night so that it converts from paraver to csv, pjdump and the SMPI time-independent trace format. Unfortunately, in the morning, Juan explained to me that I should not trust the state records but only the event and communication records. Ideally, I should have worked from the dimemas trace instead of the paraver trace to obtain the SMPI trace, but at least this allowed me to get a converter to csv/pjdump, which is very useful to Damien for framesoc/ocelotl.
So I really struggled to make it work and had to make several assumptions and "Ugly hacks" (indicated in the code). In particular, something that is really ugly at the moment is that the V collective operations, where send and receive sizes are process specific, appear as many times as there are processes; since I translate on the fly, I do not produce a correct input for SMPI. The easiest solution would probably be to make two passes, but never mind for a first proof of concept.
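Before diving into the code, here is roughly what the converter has to parse (record layouts as documented in the code comments below; every value is made up, and the inline comments would of course not appear in a real trace). The .row file lists resource names per level, while the .prv file contains one record per line:

LEVEL NODE SIZE 2
node1
node2

1:1:1:1:1:0:10668:2                        # state: cpu:appl:task:thread:begin:end:state
2:1:1:1:1:10668:50000002:7                 # event: cpu:appl:task:thread:time:type:value
3:1:1:1:1:100:110:2:1:2:1:150:160:1024:0   # point-to-point communication
c:1:1:8:1:2:3:4:5:6:7:8                    # communicator: app:id:size:rank list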
use strict;
use Data::Dumper;
my $power_reference=286.087E-3; # reference compute rate, in flop/microsecond
# Default values for $input, $output and $format may have been defined when
# tangling from babel, but command line arguments should always override them.
my($input,$output,$format);
sub main {
my($arg);
while(defined($arg=shift(@ARGV))) {
for ($arg) {
if (/^-i$/) { $input = shift(@ARGV); last; }
if (/^-o$/) { $output = shift(@ARGV); last; }
if (/^-f$/) { $format = shift(@ARGV); last; }
print "unrecognized argument '$arg'";
}
}
if(!defined($input) || $input eq "") { die "No valid input file provided.\n"; }
if(!defined($output) || $output eq "") { die "No valid output file provided.\n"; }
if(!defined($format) || $format eq "") { die "No valid format provided (csv, pjdump or tit).\n"; }
print "Input: '$input'\n";
print "Output: '$output'\n";
print "Format: '$format'\n";
my($state_name,$event_name) = parse_pcf($input.".pcf");
my($resource_name) = parse_row($input.".row");
convert_prv($input.".prv",$state_name,$event_name,$resource_name,$output,$format);
}
sub parse_row {
my($row) = shift;
my $line;
my(%resource_name);
open(INPUT,$row) or die "Cannot open $row. $!";
while(defined($line=<INPUT>)) {
chomp $line;
if($line =~ /^LEVEL (.*) SIZE/) {
my $type = $1;
$resource_name{$type}= [];
while((defined($line=<INPUT>)) &&
!($line =~ /^\s*$/g)) {
chomp $line;
push @{$resource_name{$type}}, $line;
}
}
}
return (\%resource_name);
}
sub parse_pcf {
my($pcf) = shift;
my $line;
my(%state_name, %event_name) ;
open(INPUT,$pcf) or die "Cannot open $pcf. $!";
while(defined($line=<INPUT>)) {
chomp $line;
if($line =~ /^STATES$/) {
while((defined($line=<INPUT>)) &&
($line =~ /^(\d+)\s+(.*)/g)) {
$state_name{$1} = $2;
}
}
if($line =~ /^EVENT_TYPE$/) {
while($line=<INPUT>) {
if($line =~ /VALUES/g) {last;}
$line =~ /^\s*[69]\s+(\d+)\s+(.*)/ or next; # e.g., "9 50000002 MPI Collective Comm"
my($id)=$1;
$event_name{$id}{type} = $2;
}
# Note: this attaches the VALUES entries to every event type seen so far,
# not only to the ones declared in the current EVENT_TYPE block.
while((defined($line=<INPUT>)) &&
($line =~ /^(\d+)\s+(.*)/g)) {
my($id);
foreach $id (keys %event_name) {
$event_name{$id}{value}{$1} = $2;
}
}
}
}
# print Dumper(\%state_name);
# print Dumper(\%event_name);
return (\%state_name,\%event_name);
}
my(%pcf_coll_arg) = (
"send" => "50100001",
"recv" => "50100002",
"root" => "50100003",
"communicator" => "50100004",
"compute" => "my_reduce_compute_amount",
);
my(%tit_translate) = (
"Running" => "compute",
"Not created" => "", # skip me
"I/O" => "", # skip me
"Synchronization" => "", # skip me
"MPI_Comm_size" => "", # skip me
"MPI_Comm_rank" => "", # skip me
"Outside MPI" => "", # skip me
"End" => "", # skip me
"MPI_Init" => "init",
"MPI_Bcast" => "bcast",
"MPI_Allreduce" => "allReduce",
"MPI_Alltoallv" => "allToAllV",
"MPI_Alltoall" => "allToAll",
"MPI_Reduce" => "reduce",
"MPI_Allgatherv" => "", # allGatherV Uggly hack
"MPI_Gather" => "gather",
"MPI_Gatherv" => "gatherV",
"MPI_Reduce_scatter" => "reduceScatter",
"MPI_Finalize" => "finalize",
"MPI_Barrier" => "barrier",
);
sub convert_prv {
my($prv,$state_name,$event_name,$resource_name,$output,$format) = @_;
my $line;
my (%event);
my(@fh)=();
open(INPUT,$prv) or die "Failed to open $prv:$!\n";
# Start parsing the header to get the trace hierarchy.
# We should get something like
# #Paraver (dd/mm/yy at hh:m):ftime:0:nAppl:applicationList[:applicationList]
$line=<INPUT>; chomp $line;
$line=~/^\#Paraver / or die "Invalid header '$line'\n";
my $header=$line;
$header =~ s/^[^:\(]*\([^\)]*\):// or die "Invalid header '$line'\n";
$header =~ s/(\d+):(\d+)([^\(\d])/$1\_$2$3/g;
$header =~ s/,\d+$//g;
my ($max_duration,$resource,$nb_app,@appl) = split(/:/,$header);
$max_duration =~ s/_.*$//g;
$resource =~ /^(.*)\((.*)\)$/ or die "Invalid resource description '$resource'\n";
my($nb_nodes,$cpu_list)= ($1,$2);
$nb_app==1 or die "I can handle only one application type at the moment\n";
my @cpu_list=split(/,/,$cpu_list);
# print("$max_duration --> '$nb_nodes' '@cpu_list' $nb_app @appl \n");
my(%Appl);
my($nb_task);
foreach my $app (1..$nb_app) {
my($task_list);
$appl[$app-1] =~ /^(.*)\((.*)\)$/ or die "Invalid resource description '$resource'\n";
($nb_task,$task_list) = ($1,$2);
my(@task_list) = split(/,/,$task_list);
my(%mapping);
my($task);
foreach $task (1..$nb_task) {
my($nb_thread,$node_id) = split(/_/,$task_list[$task-1]);
if(!defined($mapping{$node_id})) { $mapping{$node_id}=[]; }
push @{$mapping{$node_id}},[$task,$nb_thread];
}
$Appl{$app}{nb_task}=$nb_task;
$Appl{$app}{mapping}=\%mapping;
}
for ($format) {
if (/^csv$/) {
$output .= ".csv";
open(OUTPUT,"> $output") or die "Cannot open $output. $!";
last;
}
if (/^pjdump$/) {
$output .= ".pjdump";
open(OUTPUT,"> $output");
my @tab = split(/:/,`tail -n 1 $prv`);
print OUTPUT "Container, 0, 0, 0.0, $max_duration, $max_duration, 0\n";
foreach my $node (1..$nb_nodes) {
print OUTPUT "Container, 0, N, 0.0, $max_duration, $max_duration, node_$node\n";
}
foreach my $app (values(%Appl)) {
foreach my $node (keys%{$$app{mapping}}) {
foreach my $t (@{$$app{mapping}{$node}}) {
print OUTPUT "Container, node_$node, P, 0.0, $max_duration, $max_duration, MPI_Rank_$$t[0]\n";
foreach my $thread (1..$$t[1]) {
print OUTPUT "Container, MPI_Rank_$$t[0], T, 0.0, $max_duration, $max_duration, Thread_$$t[0]_$thread\n";
}
}
}
}
last;
}
if(/^tit$/) {
my $nb_proc = 0;
foreach my $node (@{$$resource_name{NODE}}) {
my $filename = $output."_$nb_proc.tit";
open($fh[$nb_proc], "> $filename") or die "Cannot open > $filename: $!";
$nb_proc++;
}
last;
}
die "Invalid format '$format'\n";
}
# Now, let's process the records
sub process_event {
my(%event_list)=@_;
my($sname);
my($sname_param);
if(defined($event_list{50000003})) {
$sname = $$event_name{50000003}{value}{$event_list{50000003}};
$sname_param = "";
} elsif(defined($event_list{50000002})) {
$sname = $$event_name{50000002}{value}{$event_list{50000002}};
my $t;
if($tit_translate{$sname} =~ /V$/) { # Really ugly hack because of "poor" tracing of V operations
# The per-process sizes are not usable here, so force an arbitrary fixed
# size and fall back to the non-V variant of the operation.
$event_list{$pcf_coll_arg{"send"}} = 100000;
$event_list{$pcf_coll_arg{"recv"}} = 100000;
$sname =~ s/v$//i;
}
if($tit_translate{$sname} eq "reduce") { # Uggly hack because the amount of computation is not given
$event_list{$pcf_coll_arg{"compute"}} = 1;
}
if($tit_translate{$sname} eq "gather") { # Uggly hack because the amount of receive does not make sense here
$event_list{$pcf_coll_arg{"recv"}} = $event_list{$pcf_coll_arg{"send"}};
$event_list{$pcf_coll_arg{"root"}} = 1; # Uggly hack. AAAAARGH
}
if($tit_translate{$sname} eq "reduceScatter") { # Uggly hack because of "poor" tracing
$event_list{$pcf_coll_arg{"recv"}} = $event_list{$pcf_coll_arg{"send"}};
my $foo=$event_list{$pcf_coll_arg{"recv"}};
$event_list{$pcf_coll_arg{"recv"}}="";
for (1..$nb_task) { $event_list{$pcf_coll_arg{"recv"}} .= $foo." "; }
$event_list{$pcf_coll_arg{"compute"}} = 1;
}
foreach $t ("send","recv", "compute", "root") {
if(defined($event_list{$pcf_coll_arg{$t}}) &&
$event_list{$pcf_coll_arg{$t}} ne "0") {
if($t eq "root") { $event_list{$pcf_coll_arg{$t}}--; }
$sname_param.= "$event_list{$pcf_coll_arg{$t}} ";
}
}
} else { # This may be an application or trace-flushing event,
# a hardware counter, a user function, ...
my($warn)=1;
for (40000018,40000003,40000001,
42009999,42001003,42001010,42001015,300,
70000001,70000002,70000003,80000001,80000002,80000003,
45000000) {
if(defined($event_list{$_})) {$warn=0; last;}
}
if($warn) { print "Skipping event:\n";
print Dumper(%event_list);}
next; # skips this record: exits the sub and resumes the enclosing while loop (works, albeit with a warning)
}
return($sname,$sname_param);
}
while(defined($line=<INPUT>)) {
chomp($line);
# State records 1:cpu:appl:task:thread : begin_time:end_time : state
if($line =~ /^1/) {
my($sname);
my($sname_param);
my($record,$cpu,$appli,$task,$thread,$begin_time,$end_time,$state) =
split(/:/,$line);
if($$state_name{$state} =~ /Group/ || $$state_name{$state} =~ /Others/ ) {
$line=<INPUT>;
chomp $line;
my($event,$ecpu,$eappli,$etask,$ethread,$etime,%event_list) =
split(/:/,$line);
(($event==2) && ($ecpu eq $cpu) && ($eappli eq $appli) &&
($etask eq $task) && ($ethread eq $thread) &&
($etime >= $begin_time) && ($etime <= $end_time)) or
die "Invalid event!";
($sname,$sname_param)=process_event(%event_list);
} else {
$sname = $$state_name{$state};
}
if($sname eq "Running") { $sname_param.= (($end_time-$begin_time)*$power_reference); }
if($format eq "csv") {
print OUTPUT "State, $task, MPI_STATE, $begin_time, $end_time, ".
($end_time-$begin_time).", 0, ".
$sname."\n";
}
if($format eq "pjdump") {
print OUTPUT "State, Thread_${task}_$thread, STATE, $begin_time, $end_time, ".
($end_time-$begin_time).", 0, ".
$sname."\n";
}
if($format eq "tit") {
$task=$task-1;
defined($tit_translate{$sname}) or die "Unknown state '$sname' for tit\n";
if($tit_translate{$sname} ne "") {
print { $fh[$task] } "$task $tit_translate{$sname} $sname_param\n";
}
}
} elsif ($line =~ /^2/) {
# Event records 2:cpu:appl:task:thread : time : event_type:event_value
my($event,$cpu,$appli,$task,$thread,$time,%event_list) =
split(/:/,$line);
my($sname,$sname_param)=process_event(%event_list);
if($format eq "tit") {
$task=$task-1;
defined($tit_translate{$sname}) or die "Unknown state '$sname' for tit:\n\t$line\n";
if($tit_translate{$sname} ne "") {
print { $fh[$task] } "$task $tit_translate{$sname} $sname_param\n";
}
}
} elsif($line =~ /^3/) {
# Communication records 3: cpu_send:ptask_send:task_send:thread_send : logical_time_send: actual_time_send: cpu_recv:ptask_recv:task_recv:thread_recv : logical_time_recv: actual_time_recv: size: tag
print STDERR "Skipping this communication event\n";
}
if($line =~ /^c/) {
# Communicator record c: app_id: communicator_id: number_of_process : thread_list (e.g., 1:2:3:4:5:6:7:8)
print STDERR "Skipping communicator definition\n";
}
}
for ($format) {
if (/^csv$/) {
close(OUTPUT); print "Generated [[file:$output]]\n";
last;
}
if (/^pjdump$/) {
close(OUTPUT); print "Generated [[file:$output]]\n";
last;
}
if(/^tit$/) {
foreach my $f (@fh) {
close($f) or die "Failed closing file descriptor. $!\n";
}
print "Generated [[file:${output}_0.tit]] among other ones\n";
last;
}
die "Invalid format '$format'\n";
}
}
main();
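The output below comes from an invocation along these lines (the converter is the prv2pj.pl script referred to later by the wrapper):

perl prv2pj.pl -i ./paraver_trace/EXTRAE_Paraver_trace_mpich -o ./paraver_trace/bigdft_8_rl -f tit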
Input: './paraver_trace/EXTRAE_Paraver_trace_mpich'
Output: './paraver_trace/bigdft_8_rl'
Format: 'tit'
Generated [[file:./paraver_trace/bigdft_8_rl_0.tit]] among other ones
head paraver_trace/bigdft_8_rl.csv
State, 1, MPI_STATE, 0, 10668, 10668, 0, Not created
State, 2, MPI_STATE, 0, 5118733, 5118733, 0, Not created
State, 3, MPI_STATE, 0, 9374527, 9374527, 0, Not created
State, 4, MPI_STATE, 0, 17510142, 17510142, 0, Not created
State, 5, MPI_STATE, 0, 5989994, 5989994, 0, Not created
State, 6, MPI_STATE, 0, 5737601, 5737601, 0, Not created
State, 7, MPI_STATE, 0, 5866978, 5866978, 0, Not created
State, 8, MPI_STATE, 0, 5891099, 5891099, 0, Not created
State, 1, MPI_STATE, 10668, 25576057, 25565389, 0, Running
State, 2, MPI_STATE, 5118733, 18655258, 13536525, 0, Running
Let's try to replay this trace with SMPI¶
cp /home/alegrand/Work/SimGrid/infra-songs/WP4/SC13/graphene.xml ./graphene.xml
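For reference, a SimGrid XML platform file of that era looks roughly like the following minimal sketch (SimGrid 3 syntax from memory; the values are made up, and the real graphene.xml describes the full 144-node cluster):

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid.dtd">
<platform version="3">
  <AS id="AS0" routing="Full">
    <host id="graphene-1.nancy.grid5000.fr" power="1E9"/>
    <host id="graphene-2.nancy.grid5000.fr" power="1E9"/>
    <link id="l12" bandwidth="1.25E8" latency="1.5E-5"/>
    <route src="graphene-1.nancy.grid5000.fr" dst="graphene-2.nancy.grid5000.fr">
      <link_ctn id="l12"/>
    </route>
  </AS>
</platform>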
print_usage()
{
echo "Usage: $0 [OPTIONS]"
cat <<'End-of-message'
-i|--input Paraver input file
-o|--output output file (in the paje format)
-p|--platform XML platform file
-m|--machine_file machine file listing the hosts to map the ranks onto
-h|--help print help information
End-of-message
exit 1
}
TEMP=`getopt -o i:o:p:m:h --long input:,output:,platform:,machine_file:,help -n 'smpi2pj.sh' -- "$@"`
eval set -- "$TEMP"
while true;do
case "$1" in
-i|--input)
case "$2" in
"") shift 2;;
*) INPUT=$2;shift 2;;
esac;;
-o|--output)
case "$2" in
"") shift 2;;
*) OUTPUT=$2;shift 2;;
esac;;
-p|--platform)
case "$2" in
"") shift 2;;
*) PLATFORM=$2;shift 2;;
esac;;
-m|--machine_file)
case "$2" in
"") shift 2;;
*) MACHINE_FILE=$2;shift 2;;
esac;;
-h|--help)
print_usage;shift;;
--) shift; break;;
*) echo "Unknown option '$1'"; print_usage;;
esac
done
TMP_WORKING_PATH=`mktemp -d`
# Creating input for smpi_replay
REPLAY_INPUT=$TMP_WORKING_PATH/smpi_replay.txt
ls $INPUT*.tit > $REPLAY_INPUT
# Get the number of MPI ranks
export NP=`cat $REPLAY_INPUT | wc -l`
# Generating a dumb deployment (machine_file) if needed
if [ -z "$MACHINE_FILE" ]; then
MACHINE_FILE=$TMP_WORKING_PATH/machine_file.txt;
if [ -e "$MACHINE_FILE" ]; then
echo "Ooups $MACHINE_FILE already exists. Do not want to overwrite" ;
exit 1 ;
fi;
rm -f $MACHINE_FILE;
touch $MACHINE_FILE;
for i in `seq 1 144`; do
echo graphene-${i}.nancy.grid5000.fr >> $MACHINE_FILE ;
done
# Repeat the 144-host list four times so that more ranks than physical nodes can be mapped.
cp $MACHINE_FILE $MACHINE_FILE.sav
cat $MACHINE_FILE.sav $MACHINE_FILE.sav $MACHINE_FILE.sav $MACHINE_FILE.sav > $MACHINE_FILE
fi
## To debug
# $SMPIRUN -ext smpi_replay --log=replay.thresh:critical --log=smpi_replay.thresh:verbose \
# --cfg=smpi/cpu_threshold:-1 -hostfile machine_file -platform $PLATFORM \
# -np $NP gdb\ --args\ $REPLAY /tmp/smpi_replay.txt --log=smpi_kernel.thres:warning \
# --cfg=contexts/factory:thread
# $SMPIRUN and $REPLAY (the smpi_replay binary) are expected to be set in the environment.
$SMPIRUN -ext smpi_replay \
--cfg=smpi/cpu_threshold:-1 -trace --cfg=tracing/filename:$OUTPUT \
-hostfile $MACHINE_FILE -platform $PLATFORM -np $NP \
$REPLAY $REPLAY_INPUT --log=smpi_kernel.thres:warning \
--cfg=contexts/factory:thread 2>&1
# --log=replay.thresh:critical --log=smpi_replay.thresh:verbose
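The script can then be invoked along these lines (file names for illustration; -i takes the common prefix of the tit files produced above):

sh smpi2pj.sh -i ./paraver_trace/bigdft_8_rl -o bigdft_8_rl.trace -p graphene.xml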
SMPI Paje to Paraver Conversion¶
This was quick and dirty and reuses the original pcf file, but in the end it kinda works… Yippee! :)
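As a reminder, after the sed/sort pipeline below, the pj_dump State lines that get parsed look like this (made-up values; the fields are record type, container, state type, start, end, duration, imbrication, and value):

State,rank-3,STATE,0.001000,0.002000,0.001000,0,action_bcast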
use strict;
use Env;
my($input,$output);
my($arg);
my($strict_option) = "";
while(defined($arg=shift(@ARGV))) {
for ($arg) {
print "$arg \n";
if (/^-i$/) { $input = shift(@ARGV); last; }
if (/^-o$/) { $output = shift(@ARGV); last; }
if (/^-ns$/){ $strict_option = "-n -z"; last; }
print "unrecognized argument '$arg'";
}
}
my $pjfile=$input;
$pjfile=~ s/\.trace$/.pjdump/;
$pjfile ne $input or die "Input file name must end in .trace\n";
$ENV{LANG}="C";
system("pj_dump $strict_option $input | grep State | sed 's/ //g' | sort -n -t ',' -k 4n > $pjfile");
my $duration = `tail -n 1 $pjfile`;
my @duration = split(/,/,$duration);
$duration = $duration[4];
$duration *= 1E9;
my $nb_nodes = `sed -e 's/.*rank-//' -e 's/,.*//' $pjfile | sort | uniq | wc -l`;
chomp($nb_nodes);
my(%smpi_to_pcf) = (
"action_allReduce" => "10",
"action_allToAll" => "11",
"action_barrier" => "8",
"action_bcast" => "7",
"action_gather" => "13",
"action_reduce" => "9",
"action_reducescatter" => "80",
# "smpi_replay_finalize" => "32",
# "smpi_replay_init" => "31"
);
my($pcf_file_content)="DEFAULT_OPTIONS
LEVEL THREAD
UNITS NANOSEC
LOOK_BACK 100
SPEED 1
FLAG_ICONS ENABLED
NUM_OF_STATE_COLORS 1000
YMAX_SCALE 37
DEFAULT_SEMANTIC
THREAD_FUNC State As Is
STATES
0 Idle
1 Running
2 Not created
3 Waiting a message
4 Blocking Send
5 Synchronization
6 Test/Probe
7 Scheduling and Fork/Join
8 Wait/WaitAll
9 Blocked
10 Immediate Send
11 Immediate Receive
12 I/O
13 Group Communication
14 Tracing Disabled
15 Others
16 Send Receive
17 Memory transfer
STATES_COLOR
0 {117,195,255}
1 {0,0,255}
2 {255,255,255}
3 {255,0,0}
4 {255,0,174}
5 {179,0,0}
6 {0,255,0}
7 {255,255,0}
8 {235,0,0}
9 {0,162,0}
10 {255,0,255}
11 {100,100,177}
12 {172,174,41}
13 {255,144,26}
14 {2,255,177}
15 {192,224,0}
16 {66,66,66}
17 {255,0,96}
EVENT_TYPE
9 50000001 MPI Point-to-point
VALUES
2 MPI_Recv
1 MPI_Send
0 Outside MPI
EVENT_TYPE
9 50000002 MPI Collective Comm
VALUES
18 MPI_Allgatherv
10 MPI_Allreduce
11 MPI_Alltoall
12 MPI_Alltoallv
8 MPI_Barrier
7 MPI_Bcast
13 MPI_Gather
14 MPI_Gatherv
80 MPI_Reduce_scatter
9 MPI_Reduce
0 Outside MPI
EVENT_TYPE
9 50000003 MPI Other
VALUES
21 MPI_Comm_create
19 MPI_Comm_rank
20 MPI_Comm_size
32 MPI_Finalize
31 MPI_Init
0 Outside MPI
EVENT_TYPE
1 50100001 Send Size in MPI Global OP
1 50100002 Recv Size in MPI Global OP
1 50100003 Root in MPI Global OP
1 50100004 Communicator in MPI Global OP
EVENT_TYPE
6 40000001 Application
VALUES
0 End
1 Begin
EVENT_TYPE
6 40000003 Flushing Traces
VALUES
0 End
1 Begin
GRADIENT_COLOR
0 {0,255,2}
1 {0,244,13}
2 {0,232,25}
3 {0,220,37}
4 {0,209,48}
5 {0,197,60}
6 {0,185,72}
7 {0,173,84}
8 {0,162,95}
9 {0,150,107}
10 {0,138,119}
11 {0,127,130}
12 {0,115,142}
13 {0,103,154}
14 {0,91,166}
GRADIENT_NAMES
0 Gradient 0
1 Grad. 1/MPI Events
2 Grad. 2/OMP Events
3 Grad. 3/OMP locks
4 Grad. 4/User func
5 Grad. 5/User Events
6 Grad. 6/General Events
7 Grad. 7/Hardware Counters
8 Gradient 8
9 Gradient 9
10 Gradient 10
11 Gradient 11
12 Gradient 12
13 Gradient 13
14 Gradient 14
EVENT_TYPE
9 40000018 Tracing mode:
VALUES
1 Detailed
2 CPU Bursts
";
my($pcf_output)=$output;
$pcf_output =~ s/\.prv$/.pcf/;
open OUTPUT, "> $pcf_output" or die "Cannot open $pcf_output. $!";
print OUTPUT $pcf_file_content;
close OUTPUT;
my($line);
open(INPUT,$pjfile) or die;
open(OUTPUT,"> $output") or die;
my(@tab);
@tab=();
foreach (1..$nb_nodes) {
push @tab,1;
}
my $node_list = join(',',@tab);
@tab=();
foreach (1..$nb_nodes) {
push @tab,"1:$_";
}
my $thread_list = join(',',@tab);
print OUTPUT "#Paraver (generated with perl from SMPI):${duration}_ns:$nb_nodes($node_list):1:$nb_nodes($thread_list),9\n";
my $comm_list = join(':',(1..$nb_nodes));
my $comm=1;
print OUTPUT "c:1:$comm:$nb_nodes:$comm_list\n"; $comm++;
foreach (1..$nb_nodes) {
print OUTPUT "c:1:$comm:1:$_\n";
}
while(defined($line=<INPUT>)) {
chomp($line);
my($Foo1,$rank,$Foo2,$start,$end,$duration,$Foo3,$type) = split(/,/,$line);
$rank=~ s/\D*//g;
$rank++;
$start *= 1E9;
$end *= 1E9;
if(defined($smpi_to_pcf{$type})) {
print OUTPUT "1:$rank:1:$rank:1:$start:$end:13\n"; # group communication
print OUTPUT "2:$rank:1:$rank:1:$start:50000002:$smpi_to_pcf{$type}\n";
print OUTPUT "2:$rank:1:$rank:1:$end:50000002:0\n"; # Output MPI
# print OUTPUT "1:$rank:1:$rank:1:$start:$end:$smpi_to_pcf{$type}\n";
} else {
warn("Unknown type $type: Skipping $line\n");
}
}
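A typical invocation (file names for illustration only):

perl pjsmpi2prv.pl -i bigdft_8_rl.trace -o bigdft_8_rl.prv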
Gluing everything together to allow calling SMPI¶
The Dimemas wrapper called by paraver is /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh.
Let's back up the original first.
mv /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh.backup
Basically, what I want to do is something like
perl prv2pj.pl
sh smpi2pj.sh >/dev/null
perl pjsmpi2prv.pl
Here is an equivalent version inspired by the Dimemas wrapper.
#
# Simple wrapper for SMPI based on the Dimemas one
#
set -e
function usage
{
echo "Usage: $0 source_trace dimemas_cfg output_trace reuse_dimemas_trace [extra_parameters] [-n]"
echo " source_trace: Paraver trace"
echo " dimemas_cfg: Simulation parameters"
echo " output_trace: Output trace of Dimemas; must end with '.prv'"
echo " reuse_dimemas_trace: 0 -> don't reuse, rerun prv2dim"
echo " 1 -> reuse, don't rerun prv2dim"
echo " extra_parameters: See complete list of Dimemas help with 'Dimemas -h'"
echo " -n: prv2dim -n parameter => no generate initial idle states"
}
# Read and check parameters
if [ $# -lt 4 ]; then
usage
exit 1
fi
#PARAVER_TRACE=${1}
PARAVER_TRACE=`readlink -eqs "${1}"`
DIMEMAS_CFG=${2}
OUTPUT_PARAVER_TRACE=${3}
DIMEMAS_REUSE_TRACE=${4}
if [[ ${DIMEMAS_REUSE_TRACE} != "0" && ${DIMEMAS_REUSE_TRACE} != "1" ]]; then
usage
exit 1
fi
echo "Go to hell!"
exit 12;
echo "==============================================================================="
# Check SMPI availability
### Oh right, we should do that...
# Get tracename, without extensions
TRACENAME=$(echo "$PARAVER_TRACE" | sed "s/\.[^\.]*$//")
EXTENSION=$(echo "$PARAVER_TRACE" | sed "s/^.*\.//")
#Is gzipped?
if [[ ${EXTENSION} = "gz" ]]; then
echo
echo -n "[MSG] Decompressing $PARAVER_TRACE trace..."
gunzip ${PARAVER_TRACE}
TRACENAME=$(echo "${TRACENAME}" | sed "s/\.[^\.]*$//")
PARAVER_TRACE=${TRACENAME}.prv
echo "...Done!"
fi
DIMEMAS_TRACE=${TRACENAME}.dim
# Adapt Dimemas CFG with new trace name
DIMEMAS_CFG_NAME=$(echo "$DIMEMAS_CFG" | sed "s/\.[^\.]*$//")
DIMEMAS_COPY_CFG_NAME=`basename ${DIMEMAS_CFG_NAME}`
OLD_DIMEMAS_TRACENAME=`grep "mapping information" ${DIMEMAS_CFG} | grep ".dim" | awk -F'"' {'print $4'}`
NEW_DIMEMAS_TRACENAME=`basename ${DIMEMAS_TRACE}`
DIMEMAS_CFG_PATH=`dirname ${DIMEMAS_TRACE}`
# Append extra parameters if they exist
shift
shift
shift
shift
EXTRA_PARAMETERS=""
PRV2DIM_N=""
while [ -n "$1" ]; do
if [[ ${1} == "-n" ]]; then # caution! this works because no -n parameters exists in Dimemas
PRV2DIM_N="-n"
else
EXTRA_PARAMETERS="$EXTRA_PARAMETERS $1"
fi
shift
done
# Change directory to see .dim
DIMEMAS_TRACE_DIR=`dirname ${DIMEMAS_TRACE}`/
pushd . > /dev/null
cd ${DIMEMAS_TRACE_DIR}
# Translate from .prv to SMPI time independant trace
if [[ ${DIMEMAS_REUSE_TRACE} = "0" || \
${DIMEMAS_REUSE_TRACE} = "1" && ! -f ${DIMEMAS_TRACE} ]]; then
if [[ ${DIMEMAS_REUSE_TRACE} = "1" ]]; then
echo
echo "[WARN] Unable to find ${DIMEMAS_TRACE}"
echo "[WARN] Generating it."
fi
PARAVER_TRACE_TRIMED=`echo ${PARAVER_TRACE} | sed 's/.prv$//'`
echo
echo "[COM] prv2pj.pl -i ${PARAVER_TRACE_TRIMED} -o ${DIMEMAS_TRACE} -f tit"
echo
prv2pj.pl -i ${PARAVER_TRACE_TRIMED} -o ${DIMEMAS_TRACE} -f tit
echo
fi
# Simulate
echo
echo "*** Running SMPI :) ***"
echo
echo "[COM] smpi2pj.sh -i ${DIMEMAS_TRACE} -o ${OUTPUT_PAJE_TRACE}"
echo
OUTPUT_PAJE_TRACE=`echo ${OUTPUT_PARAVER_TRACE} | sed 's/.prv$/.trace/'`
smpi2pj.sh -i ${DIMEMAS_TRACE} -o ${OUTPUT_PAJE_TRACE}
# Convert back to paraver
echo
echo "[COM] pjsmpi2prv.pl -i ${OUTPUT_PAJE_TRACE} -o ${OUTPUT_PARAVER_TRACE}"
echo
pjsmpi2prv.pl -i ${OUTPUT_PAJE_TRACE} -o ${OUTPUT_PARAVER_TRACE}
echo "==============================================================================="
popd > /dev/null
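Following the original wrapper's calling convention, an invocation looks like this (trace names for illustration; the Dimemas cfg argument is kept only for interface compatibility):

sh dimemas-wrapper.sh bigdft_8_rl.prv dummy.cfg bigdft_8_rl_smpi.prv 0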
For this to work "system-wide", I need to put the previous perl and sh scripts in the PATH. Eventually, they will be shipped with SMPI.
TMP_FILENAME=`mktemp`
for i in *.pl ; do
mv $i $TMP_FILENAME;
echo "#!/usr/bin/perl" > $i;
cat $TMP_FILENAME >> $i;
rm $TMP_FILENAME;
chmod +x $i;
cp $i ~/bin/
done
for i in smpi2pj.sh ; do
mv $i $TMP_FILENAME;
echo "#!/bin/sh" > $i;
cat $TMP_FILENAME >> $i;
rm $TMP_FILENAME;
chmod +x $i;
cp $i ~/bin/
done
for i in /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh ; do
mv $i $TMP_FILENAME;
echo "#!/bin/bash" > $i;
cat $TMP_FILENAME >> $i;
rm $TMP_FILENAME;
chmod +x $i;
cp $i ~/bin/
done
Specific Pjdump to Paraver Conversion for Damien¶
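Below is a variant of the previous converter: it accepts either a paje .trace (which it first runs through pj_dump) or an already-dumped pjdump file, and the duration (-d, in nanoseconds) and number of nodes (-n) can be forced on the command line instead of being guessed from the trace. This is the script used for the Nancy LU trace mentioned in the TODO list above.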
#!/usr/bin/perl
my $output=q(/exports/nancy_700_lu.C.700.prv);
my $input=q(/exports/nancy_700_lu.C.700.pjdump.bz2);
use strict;
use Env;
my($duration,$nb_nodes);
my($strict_option) = "";
my($arg);
while(defined($arg=shift(@ARGV))) {
for ($arg) {
if (/^-i$/) { $input = shift(@ARGV); last; }
if (/^-o$/) { $output = shift(@ARGV); last; }
if (/^-d$/) { $duration = shift(@ARGV); last; }
if (/^-n$/) { $nb_nodes = shift(@ARGV); last; }
if (/^-ns$/){ $strict_option = "-n -z"; last; }
print "unrecognized argument '$arg'";
}
}
print " ---> $input \n";
my($pjfile);
if($input =~/\.trace$/) {
$ENV{LANG}="C";
$pjfile = $input;
$pjfile =~ s/\.trace$/.pjdump/;
my $command = "pj_dump $strict_option $input | grep State | sed 's/ //g' | sort -n -t ',' -k 4n > $pjfile";
print "---> $command\n";
system($command);
} elsif($input =~/\.pjdump/) {
$pjfile = $input;
} else {
die "Unknown input format '$input'\n";
}
print " ---> $pjfile \n";
if(!defined($duration)) {
$duration = `tail -n 1 $pjfile`;
my @duration = split(/,/,$duration);
$duration = $duration[4];
$duration *= 1E9;
}
if(!defined($nb_nodes)) {
$nb_nodes = `sed -e 's/.*rank-//' -e 's/,.*//' $pjfile | sort | uniq | wc -l`;
chomp($nb_nodes);
}
my($pcf_file_content)="DEFAULT_OPTIONS
LEVEL THREAD
UNITS NANOSEC
LOOK_BACK 100
SPEED 1
FLAG_ICONS ENABLED
NUM_OF_STATE_COLORS 1000
YMAX_SCALE 37
DEFAULT_SEMANTIC
THREAD_FUNC State As Is
STATES
0 Idle
1 Running
2 Not created
3 Waiting a message
4 Blocking Send
5 Synchronization
6 Test/Probe
7 Scheduling and Fork/Join
8 Wait/WaitAll
9 Blocked
10 Immediate Send
11 Immediate Receive
12 I/O
13 Group Communication
14 Tracing Disabled
15 Others
16 Send Receive
17 Memory transfer
STATES_COLOR
0 {117,195,255}
1 {0,0,255}
2 {255,255,255}
3 {255,0,0}
4 {255,0,174}
5 {179,0,0}
6 {0,255,0}
7 {255,255,0}
8 {235,0,0}
9 {0,162,0}
10 {255,0,255}
11 {100,100,177}
12 {172,174,41}
13 {255,144,26}
14 {2,255,177}
15 {192,224,0}
16 {66,66,66}
17 {255,0,96}
EVENT_TYPE
9 50000001 MPI Point-to-point
VALUES
2 MPI_Recv
1 MPI_Send
0 Outside MPI
EVENT_TYPE
9 50000002 MPI Collective Comm
VALUES
18 MPI_Allgatherv
10 MPI_Allreduce
11 MPI_Alltoall
12 MPI_Alltoallv
8 MPI_Barrier
7 MPI_Bcast
13 MPI_Gather
14 MPI_Gatherv
80 MPI_Reduce_scatter
9 MPI_Reduce
0 Outside MPI
EVENT_TYPE
9 50000003 MPI Other
VALUES
21 MPI_Comm_create
19 MPI_Comm_rank
20 MPI_Comm_size
32 MPI_Finalize
31 MPI_Init
0 Outside MPI
EVENT_TYPE
1 50100001 Send Size in MPI Global OP
1 50100002 Recv Size in MPI Global OP
1 50100003 Root in MPI Global OP
1 50100004 Communicator in MPI Global OP
EVENT_TYPE
6 40000001 Application
VALUES
0 End
1 Begin
EVENT_TYPE
6 40000003 Flushing Traces
VALUES
0 End
1 Begin
GRADIENT_COLOR
0 {0,255,2}
1 {0,244,13}
2 {0,232,25}
3 {0,220,37}
4 {0,209,48}
5 {0,197,60}
6 {0,185,72}
7 {0,173,84}
8 {0,162,95}
9 {0,150,107}
10 {0,138,119}
11 {0,127,130}
12 {0,115,142}
13 {0,103,154}
14 {0,91,166}
GRADIENT_NAMES
0 Gradient 0
1 Grad. 1/MPI Events
2 Grad. 2/OMP Events
3 Grad. 3/OMP locks
4 Grad. 4/User func
5 Grad. 5/User Events
6 Grad. 6/General Events
7 Grad. 7/Hardware Counters
8 Gradient 8
9 Gradient 9
10 Gradient 10
11 Gradient 11
12 Gradient 12
13 Gradient 13
14 Gradient 14
EVENT_TYPE
9 40000018 Tracing mode:
VALUES
1 Detailed
2 CPU Bursts
";
my($pcf_output)=$output;
$pcf_output =~ s/\.prv$/.pcf/;
open OUTPUT, "> $pcf_output";
print OUTPUT $pcf_file_content;
close OUTPUT;
my(%mpi_to_pcf) = (
"MPI_Running" => "1",
"MPI_Send" => "10",
"MPI_Recv" => "11",
"Collective" => "13",
"Others" => "15",
);
my(%mpi_coll_to_pcf) = (
"MPI_Allgatherv" => "18",
"MPI_Allreduce" => "10",
"MPI_Alltoall" => "11",
"MPI_Alltoallv" => "12",
"MPI_Barrier" => "8",
"MPI_Bcast" => "7",
"MPI_Gather" => "13",
"MPI_Gatherv" => "14",
"MPI_Reduce_Scatter" => "80",
"MPI_Reduce" => "9",
);
my(%mpi_others_to_pcf) = (
"MPI_Comm_create" => "21",
"MPI_Comm_rank" => "19",
"MPI_Comm_size" => "20",
"MPI_Finalize" => "32",
"MPI_Init" => "31",
);
my(%smpi_to_mpi) = (
"action_allReduce" => "MPI_Allreduce",
"action_allToAll" => "MPI_Alltoall",
"action_barrier" => "MPI_Barrier",
"action_bcast" => "MPI_Bcast",
"action_gather" => "MPI_Gather",
"action_reduce" => "MPI_reduce",
"action_reducescatter" => "MPI_Reduce_Scatter",
"smpi_replay_finalize" => "MPI_Finalize",
"smpi_replay_init" => "MPI_Init",
"PMPI_Init" => "MPI_Init",
"PMPI_Send" => "MPI_Send",
"PMPI_Recv" => "MPI_Recv",
"PMPI_Finalize" => "MPI_Finalize"
);
my($line);
open(INPUT,$pjfile) or die;
open(OUTPUT,"> $output") or die;
my(@tab);
@tab=();
foreach (1..$nb_nodes) {
push @tab,1;
}
my $node_list = join(',',@tab);
@tab=();
foreach (1..$nb_nodes) {
push @tab,"1:$_";
}
my $thread_list = join(',',@tab);
print OUTPUT "#Paraver (generated with perl from SMPI):${duration}_ns:$nb_nodes($node_list):1:$nb_nodes($thread_list),3\n";
my $comm_list = join(':',(1..$nb_nodes));
my $comm=1;
print OUTPUT "c:1:$comm:$nb_nodes:$comm_list\n"; $comm++;
foreach (1..$nb_nodes) {
print OUTPUT "c:1:$comm:1:$_\n";
}
while(defined($line=<INPUT>)) {
chomp($line);
my($Foo1,$rank,$Foo2,$start,$end,$duration,$Foo3,$type) = split(/,/,$line);
$rank=~ s/\D*//g;
$rank++;
$start *= 1E9;
$end *= 1E9;
if($type =~ /action_/ or $type =~ /smpi_/ or $type =~ /PMPI_/) {
my($key);
foreach $key (keys(%smpi_to_mpi)) {
if($type eq $key) {
$type = $smpi_to_mpi{$key};
last;
}
}
}
if(defined($mpi_to_pcf{$type})) {
print "$type $mpi_to_pcf{$type}\n";
print OUTPUT "1:$rank:1:$rank:1:$start:$end:$mpi_to_pcf{$type}\n";
} elsif(defined($mpi_coll_to_pcf{$type})) {
print OUTPUT "1:$rank:1:$rank:1:$start:$end:$mpi_to_pcf{Collective}\n"; # group communication
print OUTPUT "2:$rank:1:$rank:1:$start:50000002:$mpi_coll_to_pcf{$type}\n";
print OUTPUT "2:$rank:1:$rank:1:$end:50000002:0\n"; # Output MPI
} elsif(defined($mpi_others_to_pcf{$type})) {
print OUTPUT "1:$rank:1:$rank:1:$start:$end:$mpi_to_pcf{Others}\n";
print OUTPUT "2:$rank:1:$rank:1:$start:50000003:$mpi_others_to_pcf{$type}\n";
print OUTPUT "2:$rank:1:$rank:1:$end:50000003:0\n"; # Output MPI
} else {
warn("Unknown type $type: Skipping $line\n");
}
}
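Note that the pjdump input is read as plain text, so the hard-coded .bz2 default above needs to be decompressed first. A typical invocation (file names for illustration; the script is the pjdump2prv.pl referred to in the TODO list):

perl pjdump2prv.pl -i nancy_700_lu.C.700.pjdump -o nancy_700_lu.C.700.prv -n 700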