This page is now outdated. All this work has been moved to http://simgrid.gforge.inria.fr/contrib/smpi-paraver.html. Please use the up-to-date version instead.
Achievements¶
Links to generated files¶
Presentation of current work from both sides¶
- Simulation of MPI programs (Arnaud Legrand)
- Spatial and Temporal Aggregation of Traces of Parallel Systems (Damien Dosimont)
- Evolution of the BigDFT code (Luigi Genovese)
- Presentation of the Paraver Format to improve interoperability (Juan Gonzalez)
- Clustering techniques applied to BigDFT (Harald Servat)
- Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures (Luka Stanisic)
- Raising the Level of Abstraction: Simulation of Large Chip Multiprocessors Running Multithreaded Applications (Alejandro Rico)
TODO BigDFT simulation [⅔]¶
- [x] Simulate order(n) BigDFT with SMPI with no modification.
- [x] Obtained an unbalanced (paje) trace where we could observe the same kind of (paraver) trace as what Luigi, Brice and our BSC colleagues obtained on a real run. The timings obviously do not make any sense since the platform model was completely different from the real platform, but the general unbalanced shape was the same and the same process was slowing down the whole application.
- [ ] Instrument order(n) BigDFT to speed up the simulation?
TODO Interaction between Paraver and SMPI [⅝]¶
- [ ] Paraver conversion
- [x] Wrote a paraver to csv/pjdump/smpi converter (in perl) that worked on an old small 8 node BigDFT paraver trace.
- [ ] A few ugly things had to be done here (reduce, alltoallV, no handling of p2p operations, second/nanosecond issue, …) and need to be cleaned up.
- [ ] Maybe it would be interesting to have an option that allows extrae to trace all the parameters?
- [x] Wrote a simple shell script to replay this trace with SMPI and generated an SMPI paje trace.
- [x] Improved the shell script so that it takes arguments on the command line.
- [x] Wrote a perl script that converts an SMPI paje trace to the paraver file format.
- [ ] Improve this perl script
- [x] improve the conversion to export events so that collective operation names are the same and things are easily comparable. (Edit: this was done in Chicago with Harald)
- [ ] Currently there are two scripts (pjdump2prv.pl and pjsmpi2prv.pl). The first one is for ocelotl/pjdump output while the second one is intended for the SMPI -> PRV final step. I'm currently merging them.
- [ ] add links (arrows) so that bandwidth can be computed in paraver
- [x] Managed to open the resulting paraver trace in paraver.
- [x] Have a prototype integration of SMPI within Paraver. (Edit: this was done in Chicago. If you use the dimemas-wrapper.sh below instead of the original one, it will launch SMPI. Better integration allowing the platform and deployment to be specified would be nice.)
- [ ] Make a model of Mare Nostrum, the Mont-Blanc prototype, so that BSC staff can really play with SMPI. (Edit: this was discussed in Chicago with Judit. I explained the SimGrid XML platform representation to her and she will try to play with SMPI and come back to me with questions.)
TODO Trace Aggregation [⅘]¶
All this is better summarized in the blog entry Damien wrote about this.
- [x] The paraver to pjdump converter was integrated in framesoc.
- [x] Damien managed to load several paraver traces in ocelotl and to play with aggregation.
- [x] Managed to load an SMPI replayed trace of order(n) BigDFT and could aggregate it and easily spot the disturbing process and the application phases.
- [x] Convert the real O(n) BigDFT paraver trace and aggregate it.
- [ ] Convert the 12 GB Nancy LU trace (700 processes on 3 clusters) to paraver to see whether the behavior exhibited by ocelotl can be observed in Paraver. This involves slightly modifying the paje to paraver converter, which was designed for SMPI paje traces.
This trace was on flutin and I got it here:
- [ ] Fix the state name conversion and the event conversion
- [ ] The ',9' at the end of the header is the number of communicators…
- [ ] The resulting prv starts from the pjdump, which I forgot to sort. Could we give pj_dump an option so that it sorts its output according to time?
- [ ] Do not use state 0 as it's reserved for computation
- [ ] Create a state and event for MPI application (derived from being outside MPI calls)
- [ ] clock resolution issue
Interaction between Paraver and SMPI¶
A year and a half ago, I needed to write a paraver converter because, in a particular setup, I could not trace BigDFT with either TAU or Scalasca. My goal was simply to compute statistics on the trace using R. Today, we're in Barcelona and we're discussing whether SMPI could be used as an alternative to Dimemas within the paraver framework. To this end, we need to make sure that SMPI can simulate paraver traces and output paraver traces. Ideally, we would modify SMPI so that it can parse and generate such traces, but that is probably more work than we can achieve in two days, so we'll go for simple trace conversions, i.e., a paraver to SMPI time-independent trace format conversion and a Paje to paraver conversion.
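As a reminder, SMPI's time-independent trace format describes, for each rank, one action per line: the rank, the action name, and then action-specific parameters. A made-up sketch of what such a trace looks like (this is what the converter below produces):

0 init
0 compute 24259.4
0 bcast 1000
0 allReduce 1000 1000
0 finalize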
Let's start from the traces I used at that time.
cp -r ../../../2013/04/03/paraver_trace ./
ls paraver_trace/
EXTRAE_Paraver_trace_mpich.pcf
EXTRAE_Paraver_trace_mpich.prv
EXTRAE_Paraver_trace_mpich.row
Paraver to CSV and SMPI format Conversion¶
Juan Gonzalez provided us with a description of the Paraver and Dimemas formats. The Paraver description is available here, i.e., from the Paraver documentation.
Remember that the pcf file describes the events, the row file defines the cpu/node/thread mapping, and the prv file is the trace with all the events. I reworked my old script during the night so that it converts from paraver to csv, pjdump and the SMPI time-independent trace format. Unfortunately, in the morning, Juan explained to me that I should not trust the state records but only the event and communication records. Ideally, I should have worked from the dimemas trace instead of the paraver trace to obtain the SMPI trace, but at least this allowed me to get a converter to csv/pjdump, which is very useful to Damien for framesoc/ocelotl.
So I really struggled to make it work and had to make several assumptions and "Ugly hacks" (indicated in the code). In particular, something that is really ugly at the moment is that the V collective operations, where send and receive sizes are process specific, appear as many times as there are processes; since I translate on the fly, I do not produce a correct input for SMPI. The easiest solution would probably be to make two passes, but never mind for a first proof of concept.
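Before diving into the code, here is roughly what the converter has to parse (record layouts as documented in the code comments below; every value is made up, and the inline comments would of course not appear in a real trace). The .row file lists resource names per level, while the .prv file contains one record per line:

LEVEL NODE SIZE 2
node1
node2

1:1:1:1:1:0:10668:2                        # state: cpu:appl:task:thread:begin:end:state
2:1:1:1:1:10668:50000002:7                 # event: cpu:appl:task:thread:time:type:value
3:1:1:1:1:100:110:2:1:2:1:150:160:1024:0   # point-to-point communication
c:1:1:8:1:2:3:4:5:6:7:8                    # communicator: app:id:size:rank list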
use strict;
use Data::Dumper;
my $power_reference=286.087E-3; # reference compute rate, in flop/microsecond
# Default values for $input, $output and $format may have been defined when
# tangling from babel, but command line arguments should always override them.
my($input,$output,$format);
sub main {
my($arg);
while(defined($arg=shift(@ARGV))) {
for ($arg) {
if (/^-i$/) { $input = shift(@ARGV); last; }
if (/^-o$/) { $output = shift(@ARGV); last; }
if (/^-f$/) { $format = shift(@ARGV); last; }
print "unrecognized argument '$arg'";
}
}
if(!defined($input) || $input eq "") { die "No valid input file provided.\n"; }
if(!defined($output) || $output eq "") { die "No valid output file provided.\n"; }
if(!defined($format) || $format eq "") { die "No valid format provided (csv, pjdump or tit).\n"; }
print "Input: '$input'\n";
print "Output: '$output'\n";
print "Format: '$format'\n";
my($state_name,$event_name) = parse_pcf($input.".pcf");
my($resource_name) = parse_row($input.".row");
convert_prv($input.".prv",$state_name,$event_name,$resource_name,$output,$format);
}
sub parse_row {
my($row) = shift;
my $line;
my(%resource_name);
open(INPUT,$row) or die "Cannot open $row. $!";
while(defined($line=<INPUT>)) {
chomp $line;
if($line =~ /^LEVEL (.*) SIZE/) {
my $type = $1;
$resource_name{$type}= [];
while((defined($line=<INPUT>)) &&
!($line =~ /^\s*$/g)) {
chomp $line;
push @{$resource_name{$type}}, $line;
}
}
}
return (\%resource_name);
}
sub parse_pcf {
my($pcf) = shift;
my $line;
my(%state_name, %event_name) ;
open(INPUT,$pcf) or die "Cannot open $pcf. $!";
while(defined($line=<INPUT>)) {
chomp $line;
if($line =~ /^STATES$/) {
while((defined($line=<INPUT>)) &&
($line =~ /^(\d+)\s+(.*)/g)) {
$state_name{$1} = $2;
}
}
if($line =~ /^EVENT_TYPE$/) {
while($line=<INPUT>) {
if($line =~ /VALUES/g) {last;}
$line =~ /^\s*[69]\s+(\d+)\s+(.*)/ or next; # e.g., "9 50000002 MPI Collective Comm"
my($id)=$1;
$event_name{$id}{type} = $2;
}
# Note: this attaches the VALUES entries to every event type seen so far,
# not only to the ones declared in the current EVENT_TYPE block.
while((defined($line=<INPUT>)) &&
($line =~ /^(\d+)\s+(.*)/g)) {
my($id);
foreach $id (keys %event_name) {
$event_name{$id}{value}{$1} = $2;
}
}
}
}
# print Dumper(\%state_name);
# print Dumper(\%event_name);
return (\%state_name,\%event_name);
}
my(%pcf_coll_arg) = (
"send" => "50100001",
"recv" => "50100002",
"root" => "50100003",
"communicator" => "50100004",
"compute" => "my_reduce_compute_amount",
);
my(%tit_translate) = (
"Running" => "compute",
"Not created" => "", # skip me
"I/O" => "", # skip me
"Synchronization" => "", # skip me
"MPI_Comm_size" => "", # skip me
"MPI_Comm_rank" => "", # skip me
"Outside MPI" => "", # skip me
"End" => "", # skip me
"MPI_Init" => "init",
"MPI_Bcast" => "bcast",
"MPI_Allreduce" => "allReduce",
"MPI_Alltoallv" => "allToAllV",
"MPI_Alltoall" => "allToAll",
"MPI_Reduce" => "reduce",
"MPI_Allgatherv" => "", # allGatherV Uggly hack
"MPI_Gather" => "gather",
"MPI_Gatherv" => "gatherV",
"MPI_Reduce_scatter" => "reduceScatter",
"MPI_Finalize" => "finalize",
"MPI_Barrier" => "barrier",
);
sub convert_prv {
my($prv,$state_name,$event_name,$resource_name,$output,$format) = @_;
my $line;
my (%event);
my(@fh)=();
open(INPUT,$prv) or die "Failed to open $prv:$!\n";
# Start parsing the header to get the trace hierarchy.
# We should get something like
# #Paraver (dd/mm/yy at hh:m):ftime:0:nAppl:applicationList[:applicationList]
$line=<INPUT>; chomp $line;
$line=~/^\#Paraver / or die "Invalid header '$line'\n";
my $header=$line;
$header =~ s/^[^:\(]*\([^\)]*\):// or die "Invalid header '$line'\n";
$header =~ s/(\d+):(\d+)([^\(\d])/$1\_$2$3/g;
$header =~ s/,\d+$//g;
my ($max_duration,$resource,$nb_app,@appl) = split(/:/,$header);
$max_duration =~ s/_.*$//g;
$resource =~ /^(.*)\((.*)\)$/ or die "Invalid resource description '$resource'\n";
my($nb_nodes,$cpu_list)= ($1,$2);
$nb_app==1 or die "I can handle only one application type at the moment\n";
my @cpu_list=split(/,/,$cpu_list);
# print("$max_duration --> '$nb_nodes' '@cpu_list' $nb_app @appl \n");
my(%Appl);
my($nb_task);
foreach my $app (1..$nb_app) {
my($task_list);
$appl[$app-1] =~ /^(.*)\((.*)\)$/ or die "Invalid resource description '$resource'\n";
($nb_task,$task_list) = ($1,$2);
my(@task_list) = split(/,/,$task_list);
my(%mapping);
my($task);
foreach $task (1..$nb_task) {
my($nb_thread,$node_id) = split(/_/,$task_list[$task-1]);
if(!defined($mapping{$node_id})) { $mapping{$node_id}=[]; }
push @{$mapping{$node_id}},[$task,$nb_thread];
}
$Appl{$app}{nb_task}=$nb_task;
$Appl{$app}{mapping}=\%mapping;
}
for ($format) {
if (/^csv$/) {
$output .= ".csv";
open(OUTPUT,"> $output") or die "Cannot open $output. $!";
last;
}
if (/^pjdump$/) {
$output .= ".pjdump";
open(OUTPUT,"> $output");
my @tab = split(/:/,`tail -n 1 $prv`);
print OUTPUT "Container, 0, 0, 0.0, $max_duration, $max_duration, 0\n";
foreach my $node (1..$nb_nodes) {
print OUTPUT "Container, 0, N, 0.0, $max_duration, $max_duration, node_$node\n";
}
foreach my $app (values(%Appl)) {
foreach my $node (keys%{$$app{mapping}}) {
foreach my $t (@{$$app{mapping}{$node}}) {
print OUTPUT "Container, node_$node, P, 0.0, $max_duration, $max_duration, MPI_Rank_$$t[0]\n";
foreach my $thread (1..$$t[1]) {
print OUTPUT "Container, MPI_Rank_$$t[0], T, 0.0, $max_duration, $max_duration, Thread_$$t[0]_$thread\n";
}
}
}
}
last;
}
if(/^tit$/) {
my $nb_proc = 0;
foreach my $node (@{$$resource_name{NODE}}) {
my $filename = $output."_$nb_proc.tit";
open($fh[$nb_proc], "> $filename") or die "Cannot open > $filename: $!";
$nb_proc++;
}
last;
}
die "Invalid format '$format'\n";
}
# Now, let's process the records
sub process_event {
my(%event_list)=@_;
my($sname);
my($sname_param);
if(defined($event_list{50000003})) {
$sname = $$event_name{50000003}{value}{$event_list{50000003}};
$sname_param = "";
} elsif(defined($event_list{50000002})) {
$sname = $$event_name{50000002}{value}{$event_list{50000002}};
my $t;
if($tit_translate{$sname} =~ /V$/) { # Really ugly hack because of "poor" tracing of V operations
# The per-process sizes are not usable here, so force an arbitrary fixed
# size and fall back to the non-V variant of the operation.
$event_list{$pcf_coll_arg{"send"}} = 100000;
$event_list{$pcf_coll_arg{"recv"}} = 100000;
$sname =~ s/v$//i;
}
if($tit_translate{$sname} eq "reduce") { # Uggly hack because the amount of computation is not given
$event_list{$pcf_coll_arg{"compute"}} = 1;
}
if($tit_translate{$sname} eq "gather") { # Uggly hack because the amount of receive does not make sense here
$event_list{$pcf_coll_arg{"recv"}} = $event_list{$pcf_coll_arg{"send"}};
$event_list{$pcf_coll_arg{"root"}} = 1; # Uggly hack. AAAAARGH
}
if($tit_translate{$sname} eq "reduceScatter") { # Uggly hack because of "poor" tracing
$event_list{$pcf_coll_arg{"recv"}} = $event_list{$pcf_coll_arg{"send"}};
my $foo=$event_list{$pcf_coll_arg{"recv"}};
$event_list{$pcf_coll_arg{"recv"}}="";
for (1..$nb_task) { $event_list{$pcf_coll_arg{"recv"}} .= $foo." "; }
$event_list{$pcf_coll_arg{"compute"}} = 1;
}
foreach $t ("send","recv", "compute", "root") {
if(defined($event_list{$pcf_coll_arg{$t}}) &&
$event_list{$pcf_coll_arg{$t}} ne "0") {
if($t eq "root") { $event_list{$pcf_coll_arg{$t}}--; }
$sname_param.= "$event_list{$pcf_coll_arg{$t}} ";
}
}
} else { # This may be an application or trace-flushing event,
# a hardware counter, a user function, ...
my($warn)=1;
for (40000018,40000003,40000001,
42009999,42001003,42001010,42001015,300,
70000001,70000002,70000003,80000001,80000002,80000003,
45000000) {
if(defined($event_list{$_})) {$warn=0; last;}
}
if($warn) { print "Skipping event:\n";
print Dumper(%event_list);}
next; # skips this record: exits the sub and resumes the enclosing while loop (works, albeit with a warning)
}
return($sname,$sname_param);
}
while(defined($line=<INPUT>)) {
chomp($line);
# State records 1:cpu:appl:task:thread : begin_time:end_time : state
if($line =~ /^1/) {
my($sname);
my($sname_param);
my($record,$cpu,$appli,$task,$thread,$begin_time,$end_time,$state) =
split(/:/,$line);
if($$state_name{$state} =~ /Group/ || $$state_name{$state} =~ /Others/ ) {
$line=<INPUT>;
chomp $line;
my($event,$ecpu,$eappli,$etask,$ethread,$etime,%event_list) =
split(/:/,$line);
(($event==2) && ($ecpu eq $cpu) && ($eappli eq $appli) &&
($etask eq $task) && ($ethread eq $thread) &&
($etime >= $begin_time) && ($etime <= $end_time)) or
die "Invalid event!";
($sname,$sname_param)=process_event(%event_list);
} else {
$sname = $$state_name{$state};
}
if($sname eq "Running") { $sname_param.= (($end_time-$begin_time)*$power_reference); }
if($format eq "csv") {
print OUTPUT "State, $task, MPI_STATE, $begin_time, $end_time, ".
($end_time-$begin_time).", 0, ".
$sname."\n";
}
if($format eq "pjdump") {
print OUTPUT "State, Thread_${task}_$thread, STATE, $begin_time, $end_time, ".
($end_time-$begin_time).", 0, ".
$sname."\n";
}
if($format eq "tit") {
$task=$task-1;
defined($tit_translate{$sname}) or die "Unknown state '$sname' for tit\n";
if($tit_translate{$sname} ne "") {
print { $fh[$task] } "$task $tit_translate{$sname} $sname_param\n";
}
}
} elsif ($line =~ /^2/) {
# Event records 2:cpu:appl:task:thread : time : event_type:event_value
my($event,$cpu,$appli,$task,$thread,$time,%event_list) =
split(/:/,$line);
my($sname,$sname_param)=process_event(%event_list);
if($format eq "tit") {
$task=$task-1;
defined($tit_translate{$sname}) or die "Unknown state '$sname' for tit:\n\t$line\n";
if($tit_translate{$sname} ne "") {
print { $fh[$task] } "$task $tit_translate{$sname} $sname_param\n";
}
}
} elsif($line =~ /^3/) {
# Communication records 3: cpu_send:ptask_send:task_send:thread_send : logical_time_send: actual_time_send: cpu_recv:ptask_recv:task_recv:thread_recv : logical_time_recv: actual_time_recv: size: tag
print STDERR "Skipping this communication event\n";
}
if($line =~ /^c/) {
# Communicator record c: app_id: communicator_id: number_of_process : thread_list (e.g., 1:2:3:4:5:6:7:8)
print STDERR "Skipping communicator definition\n";
}
}
for ($format) {
if (/^csv$/) {
close(OUTPUT); print "Generated [[file:$output]]\n";
last;
}
if (/^pjdump$/) {
close(OUTPUT); print "Generated [[file:$output]]\n";
last;
}
if(/^tit$/) {
foreach my $f (@fh) {
close($f) or die "Failed closing file descriptor. $!\n";
}
print "Generated [[file:${output}_0.tit]] among other ones\n";
last;
}
die "Invalid format '$format'\n";
}
}
main();
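The output below comes from an invocation along these lines (the converter is the prv2pj.pl script referred to later by the wrapper):

perl prv2pj.pl -i ./paraver_trace/EXTRAE_Paraver_trace_mpich -o ./paraver_trace/bigdft_8_rl -f tit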
Input: './paraver_trace/EXTRAE_Paraver_trace_mpich'
Output: './paraver_trace/bigdft_8_rl'
Format: 'tit'
Generated [[file:./paraver_trace/bigdft_8_rl_0.tit]] among other ones
head paraver_trace/bigdft_8_rl.csv
State, 1, MPI_STATE, 0, 10668, 10668, 0, Not created
State, 2, MPI_STATE, 0, 5118733, 5118733, 0, Not created
State, 3, MPI_STATE, 0, 9374527, 9374527, 0, Not created
State, 4, MPI_STATE, 0, 17510142, 17510142, 0, Not created
State, 5, MPI_STATE, 0, 5989994, 5989994, 0, Not created
State, 6, MPI_STATE, 0, 5737601, 5737601, 0, Not created
State, 7, MPI_STATE, 0, 5866978, 5866978, 0, Not created
State, 8, MPI_STATE, 0, 5891099, 5891099, 0, Not created
State, 1, MPI_STATE, 10668, 25576057, 25565389, 0, Running
State, 2, MPI_STATE, 5118733, 18655258, 13536525, 0, Running
Let's try to replay this trace with SMPI¶
cp /home/alegrand/Work/SimGrid/infra-songs/WP4/SC13/graphene.xml ./graphene.xml
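For reference, a SimGrid XML platform file of that era looks roughly like the following minimal sketch (SimGrid 3 syntax from memory; the values are made up, and the real graphene.xml describes the full 144-node cluster):

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid.dtd">
<platform version="3">
  <AS id="AS0" routing="Full">
    <host id="graphene-1.nancy.grid5000.fr" power="1E9"/>
    <host id="graphene-2.nancy.grid5000.fr" power="1E9"/>
    <link id="l12" bandwidth="1.25E8" latency="1.5E-5"/>
    <route src="graphene-1.nancy.grid5000.fr" dst="graphene-2.nancy.grid5000.fr">
      <link_ctn id="l12"/>
    </route>
  </AS>
</platform>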
print_usage()
{
echo "Usage: $0 [OPTIONS]"
cat <<'End-of-message'
-i|--input Paraver input file
-o|--output output file (in the paje format)
-p|--platform XML platform file
-m|--machine_file machine file listing the hosts to map the ranks onto
-h|--help print help information
End-of-message
exit 1
}
TEMP=`getopt -o i:o:p:m:h --long input:,output:,platform:,machine_file:,help -n 'smpi2pj.sh' -- "$@"`
eval set -- "$TEMP"
while true;do
case "$1" in
-i|--input)
case "$2" in
"") shift 2;;
*) INPUT=$2;shift 2;;
esac;;
-o|--output)
case "$2" in
"") shift 2;;
*) OUTPUT=$2;shift 2;;
esac;;
-p|--platform)
case "$2" in
"") shift 2;;
*) PLATFORM=$2;shift 2;;
esac;;
-m|--machine_file)
case "$2" in
"") shift 2;;
*) MACHINE_FILE=$2;shift 2;;
esac;;
-h|--help)
print_usage;shift;;
--) shift; break;;
*) echo "Unknown option '$1'"; print_usage;;
esac
done
TMP_WORKING_PATH=`mktemp -d`
# Creating input for smpi_replay
REPLAY_INPUT=$TMP_WORKING_PATH/smpi_replay.txt
ls $INPUT*.tit > $REPLAY_INPUT
# Get the number of MPI ranks
export NP=`cat $REPLAY_INPUT | wc -l`
# Generating a dumb deployment (machine_file) if needed
if [ -z "$MACHINE_FILE" ]; then
MACHINE_FILE=$TMP_WORKING_PATH/machine_file.txt;
if [ -e "$MACHINE_FILE" ]; then
echo "Ooups $MACHINE_FILE already exists. Do not want to overwrite" ;
exit 1 ;
fi;
rm -f $MACHINE_FILE;
touch $MACHINE_FILE;
for i in `seq 1 144`; do
echo graphene-${i}.nancy.grid5000.fr >> $MACHINE_FILE ;
done
# Repeat the 144-host list four times so that more ranks than physical nodes can be mapped.
cp $MACHINE_FILE $MACHINE_FILE.sav
cat $MACHINE_FILE.sav $MACHINE_FILE.sav $MACHINE_FILE.sav $MACHINE_FILE.sav > $MACHINE_FILE
fi
## To debug
# $SMPIRUN -ext smpi_replay --log=replay.thresh:critical --log=smpi_replay.thresh:verbose \
# --cfg=smpi/cpu_threshold:-1 -hostfile machine_file -platform $PLATFORM \
# -np $NP gdb\ --args\ $REPLAY /tmp/smpi_replay.txt --log=smpi_kernel.thres:warning \
# --cfg=contexts/factory:thread
# $SMPIRUN and $REPLAY (the smpi_replay binary) are expected to be set in the environment.
$SMPIRUN -ext smpi_replay \
--cfg=smpi/cpu_threshold:-1 -trace --cfg=tracing/filename:$OUTPUT \
-hostfile $MACHINE_FILE -platform $PLATFORM -np $NP \
$REPLAY $REPLAY_INPUT --log=smpi_kernel.thres:warning \
--cfg=contexts/factory:thread 2>&1
# --log=replay.thresh:critical --log=smpi_replay.thresh:verbose
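The script can then be invoked along these lines (file names for illustration; -i takes the common prefix of the tit files produced above):

sh smpi2pj.sh -i ./paraver_trace/bigdft_8_rl -o bigdft_8_rl.trace -p graphene.xml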
SMPI Paje to Paraver Conversion¶
This was quick and dirty and reuses the original pcf file, but in the end it kinda works… Yippee! :)
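As a reminder, after the sed/sort pipeline below, the pj_dump State lines that get parsed look like this (made-up values; the fields are record type, container, state type, start, end, duration, imbrication, and value):

State,rank-3,STATE,0.001000,0.002000,0.001000,0,action_bcast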
use strict;
use Env;
my($input,$output);
my($arg);
my($strict_option) = "";
while(defined($arg=shift(@ARGV))) {
for ($arg) {
print "$arg \n";
if (/^-i$/) { $input = shift(@ARGV); last; }
if (/^-o$/) { $output = shift(@ARGV); last; }
if (/^-ns$/){ $strict_option = "-n -z"; last; }
print "unrecognized argument '$arg'";
}
}
my $pjfile=$input;
$pjfile=~ s/\.trace$/.pjdump/;
$pjfile ne $input or die "Input file name must end in .trace\n";
$ENV{LANG}="C";
system("pj_dump $strict_option $input | grep State | sed 's/ //g' | sort -n -t ',' -k 4n > $pjfile");
my $duration = `tail -n 1 $pjfile`;
my @duration = split(/,/,$duration);
$duration = $duration[4];
$duration *= 1E9;
my $nb_nodes = `sed -e 's/.*rank-//' -e 's/,.*//' $pjfile | sort | uniq | wc -l`;
chomp($nb_nodes);
my(%smpi_to_pcf) = (
"action_allReduce" => "10",
"action_allToAll" => "11",
"action_barrier" => "8",
"action_bcast" => "7",
"action_gather" => "13",
"action_reduce" => "9",
"action_reducescatter" => "80",
# "smpi_replay_finalize" => "32",
# "smpi_replay_init" => "31"
);
my($pcf_file_content)="DEFAULT_OPTIONS
LEVEL THREAD
UNITS NANOSEC
LOOK_BACK 100
SPEED 1
FLAG_ICONS ENABLED
NUM_OF_STATE_COLORS 1000
YMAX_SCALE 37
DEFAULT_SEMANTIC
THREAD_FUNC State As Is
STATES
0 Idle
1 Running
2 Not created
3 Waiting a message
4 Blocking Send
5 Synchronization
6 Test/Probe
7 Scheduling and Fork/Join
8 Wait/WaitAll
9 Blocked
10 Immediate Send
11 Immediate Receive
12 I/O
13 Group Communication
14 Tracing Disabled
15 Others
16 Send Receive
17 Memory transfer
STATES_COLOR
0 {117,195,255}
1 {0,0,255}
2 {255,255,255}
3 {255,0,0}
4 {255,0,174}
5 {179,0,0}
6 {0,255,0}
7 {255,255,0}
8 {235,0,0}
9 {0,162,0}
10 {255,0,255}
11 {100,100,177}
12 {172,174,41}
13 {255,144,26}
14 {2,255,177}
15 {192,224,0}
16 {66,66,66}
17 {255,0,96}
EVENT_TYPE
9 50000001 MPI Point-to-point
VALUES
2 MPI_Recv
1 MPI_Send
0 Outside MPI
EVENT_TYPE
9 50000002 MPI Collective Comm
VALUES
18 MPI_Allgatherv
10 MPI_Allreduce
11 MPI_Alltoall
12 MPI_Alltoallv
8 MPI_Barrier
7 MPI_Bcast
13 MPI_Gather
14 MPI_Gatherv
80 MPI_Reduce_scatter
9 MPI_Reduce
0 Outside MPI
EVENT_TYPE
9 50000003 MPI Other
VALUES
21 MPI_Comm_create
19 MPI_Comm_rank
20 MPI_Comm_size
32 MPI_Finalize
31 MPI_Init
0 Outside MPI
EVENT_TYPE
1 50100001 Send Size in MPI Global OP
1 50100002 Recv Size in MPI Global OP
1 50100003 Root in MPI Global OP
1 50100004 Communicator in MPI Global OP
EVENT_TYPE
6 40000001 Application
VALUES
0 End
1 Begin
EVENT_TYPE
6 40000003 Flushing Traces
VALUES
0 End
1 Begin
GRADIENT_COLOR
0 {0,255,2}
1 {0,244,13}
2 {0,232,25}
3 {0,220,37}
4 {0,209,48}
5 {0,197,60}
6 {0,185,72}
7 {0,173,84}
8 {0,162,95}
9 {0,150,107}
10 {0,138,119}
11 {0,127,130}
12 {0,115,142}
13 {0,103,154}
14 {0,91,166}
GRADIENT_NAMES
0 Gradient 0
1 Grad. 1/MPI Events
2 Grad. 2/OMP Events
3 Grad. 3/OMP locks
4 Grad. 4/User func
5 Grad. 5/User Events
6 Grad. 6/General Events
7 Grad. 7/Hardware Counters
8 Gradient 8
9 Gradient 9
10 Gradient 10
11 Gradient 11
12 Gradient 12
13 Gradient 13
14 Gradient 14
EVENT_TYPE
9 40000018 Tracing mode:
VALUES
1 Detailed
2 CPU Bursts
";
my($pcf_output)=$output;
$pcf_output =~ s/\.prv$/.pcf/;
open OUTPUT, "> $pcf_output" or die "Cannot open $pcf_output. $!";
print OUTPUT $pcf_file_content;
close OUTPUT;
my($line);
open(INPUT,$pjfile) or die;
open(OUTPUT,"> $output") or die;
my(@tab);
@tab=();
foreach (1..$nb_nodes) {
push @tab,1;
}
my $node_list = join(',',@tab);
@tab=();
foreach (1..$nb_nodes) {
push @tab,"1:$_";
}
my $thread_list = join(',',@tab);
print OUTPUT "#Paraver (generated with perl from SMPI):${duration}_ns:$nb_nodes($node_list):1:$nb_nodes($thread_list),9\n";
my $comm_list = join(':',(1..$nb_nodes));
my $comm=1;
print OUTPUT "c:1:$comm:$nb_nodes:$comm_list\n"; $comm++;
foreach (1..$nb_nodes) {
print OUTPUT "c:1:$comm:1:$_\n";
}
while(defined($line=<INPUT>)) {
chomp($line);
my($Foo1,$rank,$Foo2,$start,$end,$duration,$Foo3,$type) = split(/,/,$line);
$rank=~ s/\D*//g;
$rank++;
$start *= 1E9;
$end *= 1E9;
if(defined($smpi_to_pcf{$type})) {
print OUTPUT "1:$rank:1:$rank:1:$start:$end:13\n"; # group communication
print OUTPUT "2:$rank:1:$rank:1:$start:50000002:$smpi_to_pcf{$type}\n";
print OUTPUT "2:$rank:1:$rank:1:$end:50000002:0\n"; # Output MPI
# print OUTPUT "1:$rank:1:$rank:1:$start:$end:$smpi_to_pcf{$type}\n";
} else {
warn("Unknown type $type: Skipping $line\n");
}
}
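A typical invocation (file names for illustration only):

perl pjsmpi2prv.pl -i bigdft_8_rl.trace -o bigdft_8_rl.prv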
Gluing everything together to allow calling SMPI¶
The Dimemas wrapper called by paraver is /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh.
Let's back up the original first.
mv /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh.backup
Basically, what I want to do is something like
perl prv2pj.pl
sh smpi2pj.sh >/dev/null
perl pjsmpi2prv.pl
Here is an equivalent version inspired by the Dimemas wrapper.
#
# Simple wrapper for SMPI based on the Dimemas one
#
set -e
function usage
{
echo "Usage: $0 source_trace dimemas_cfg output_trace reuse_dimemas_trace [extra_parameters] [-n]"
echo " source_trace: Paraver trace"
echo " dimemas_cfg: Simulation parameters"
echo " output_trace: Output trace of Dimemas; must end with '.prv'"
echo " reuse_dimemas_trace: 0 -> don't reuse, rerun prv2dim"
echo " 1 -> reuse, don't rerun prv2dim"
echo " extra_parameters: See complete list of Dimemas help with 'Dimemas -h'"
echo " -n: prv2dim -n parameter => no generate initial idle states"
}
# Read and check parameters
if [ $# -lt 4 ]; then
usage
exit 1
fi
#PARAVER_TRACE=${1}
PARAVER_TRACE=`readlink -eqs "${1}"`
DIMEMAS_CFG=${2}
OUTPUT_PARAVER_TRACE=${3}
DIMEMAS_REUSE_TRACE=${4}
if [[ ${DIMEMAS_REUSE_TRACE} != "0" && ${DIMEMAS_REUSE_TRACE} != "1" ]]; then
usage
exit 1
fi
echo "Go to hell!"
exit 12;
echo "==============================================================================="
# Check SMPI availability
### Oh right, we should do that...
# Get tracename, without extensions
TRACENAME=$(echo "$PARAVER_TRACE" | sed "s/\.[^\.]*$//")
EXTENSION=$(echo "$PARAVER_TRACE" | sed "s/^.*\.//")
#Is gzipped?
if [[ ${EXTENSION} = "gz" ]]; then
echo
echo -n "[MSG] Decompressing $PARAVER_TRACE trace..."
gunzip ${PARAVER_TRACE}
TRACENAME=$(echo "${TRACENAME}" | sed "s/\.[^\.]*$//")
PARAVER_TRACE=${TRACENAME}.prv
echo "...Done!"
fi
DIMEMAS_TRACE=${TRACENAME}.dim
# Adapt Dimemas CFG with new trace name
DIMEMAS_CFG_NAME=$(echo "$DIMEMAS_CFG" | sed "s/\.[^\.]*$//")
DIMEMAS_COPY_CFG_NAME=`basename ${DIMEMAS_CFG_NAME}`
OLD_DIMEMAS_TRACENAME=`grep "mapping information" ${DIMEMAS_CFG} | grep ".dim" | awk -F'"' {'print $4'}`
NEW_DIMEMAS_TRACENAME=`basename ${DIMEMAS_TRACE}`
DIMEMAS_CFG_PATH=`dirname ${DIMEMAS_TRACE}`
# Append extra parameters if they exist
shift
shift
shift
shift
EXTRA_PARAMETERS=""
PRV2DIM_N=""
while [ -n "$1" ]; do
if [[ ${1} == "-n" ]]; then # caution! this works because no -n parameters exists in Dimemas
PRV2DIM_N="-n"
else
EXTRA_PARAMETERS="$EXTRA_PARAMETERS $1"
fi
shift
done
# Change directory to see .dim
DIMEMAS_TRACE_DIR=`dirname ${DIMEMAS_TRACE}`/
pushd . > /dev/null
cd ${DIMEMAS_TRACE_DIR}
# Translate from .prv to SMPI time independant trace
if [[ ${DIMEMAS_REUSE_TRACE} = "0" || \
${DIMEMAS_REUSE_TRACE} = "1" && ! -f ${DIMEMAS_TRACE} ]]; then
if [[ ${DIMEMAS_REUSE_TRACE} = "1" ]]; then
echo
echo "[WARN] Unable to find ${DIMEMAS_TRACE}"
echo "[WARN] Generating it."
fi
PARAVER_TRACE_TRIMED=`echo ${PARAVER_TRACE} | sed 's/.prv$//'`
echo
echo "[COM] prv2pj.pl -i ${PARAVER_TRACE_TRIMED} -o ${DIMEMAS_TRACE} -f tit"
echo
prv2pj.pl -i ${PARAVER_TRACE_TRIMED} -o ${DIMEMAS_TRACE} -f tit
echo
fi
# Simulate
echo
echo "*** Running SMPI :) ***"
echo
echo "[COM] smpi2pj.sh -i ${DIMEMAS_TRACE} -o ${OUTPUT_PAJE_TRACE}"
echo
OUTPUT_PAJE_TRACE=`echo ${OUTPUT_PARAVER_TRACE} | sed 's/.prv$/.trace/'`
smpi2pj.sh -i ${DIMEMAS_TRACE} -o ${OUTPUT_PAJE_TRACE}
# Convert back to paraver
echo
echo "[COM] pjsmpi2prv.pl -i ${OUTPUT_PAJE_TRACE} -o ${OUTPUT_PARAVER_TRACE}"
echo
pjsmpi2prv.pl -i ${OUTPUT_PAJE_TRACE} -o ${OUTPUT_PARAVER_TRACE}
echo "==============================================================================="
popd > /dev/null
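Following the original wrapper's calling convention, an invocation looks like this (trace names for illustration; the Dimemas cfg argument is kept only for interface compatibility):

sh dimemas-wrapper.sh bigdft_8_rl.prv dummy.cfg bigdft_8_rl_smpi.prv 0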
For this to work "system-wide", I need to put the previous perl and sh scripts in the PATH. Eventually, they will be shipped with SMPI.
TMP_FILENAME=`mktemp`
for i in *.pl ; do
mv $i $TMP_FILENAME;
echo "#!/usr/bin/perl" > $i;
cat $TMP_FILENAME >> $i;
rm $TMP_FILENAME;
chmod +x $i;
cp $i ~/bin/
done
for i in smpi2pj.sh ; do
mv $i $TMP_FILENAME;
echo "#!/bin/sh" > $i;
cat $TMP_FILENAME >> $i;
rm $TMP_FILENAME;
chmod +x $i;
cp $i ~/bin/
done
for i in /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh ; do
mv $i $TMP_FILENAME;
echo "#!/bin/bash" > $i;
cat $TMP_FILENAME >> $i;
rm $TMP_FILENAME;
chmod +x $i;
cp $i ~/bin/
done
Specific Pjdump to Paraver Conversion for Damien¶
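Below is a variant of the previous converter: it accepts either a paje .trace (which it first runs through pj_dump) or an already-dumped pjdump file, and the duration (-d, in nanoseconds) and number of nodes (-n) can be forced on the command line instead of being guessed from the trace. This is the script used for the Nancy LU trace mentioned in the TODO list above.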
#!/usr/bin/perl
my $output=q(/exports/nancy_700_lu.C.700.prv);
my $input=q(/exports/nancy_700_lu.C.700.pjdump.bz2);
use strict;
use Env;
my($duration,$nb_nodes);
my($strict_option) = "";
my($arg);
while(defined($arg=shift(@ARGV))) {
for ($arg) {
if (/^-i$/) { $input = shift(@ARGV); last; }
if (/^-o$/) { $output = shift(@ARGV); last; }
if (/^-d$/) { $duration = shift(@ARGV); last; }
if (/^-n$/) { $nb_nodes = shift(@ARGV); last; }
if (/^-ns$/){ $strict_option = "-n -z"; last; }
print "unrecognized argument '$arg'";
}
}
print " ---> $input \n";
my($pjfile);
if($input =~/\.trace$/) {
$ENV{LANG}="C";
$pjfile = $input;
$pjfile =~ s/\.trace$/.pjdump/;
my $command = "pj_dump $strict_option $input | grep State | sed 's/ //g' | sort -n -t ',' -k 4n > $pjfile";
print "---> $command\n";
system($command);
} elsif($input =~/\.pjdump/) {
$pjfile = $input;
} else {
die "Unknown input format '$input'\n";
}
print " ---> $pjfile \n";
if(!defined($duration)) {
$duration = `tail -n 1 $pjfile`;
my @duration = split(/,/,$duration);
$duration = $duration[4];
$duration *= 1E9;
}
if(!defined($nb_nodes)) {
$nb_nodes = `sed -e 's/.*rank-//' -e 's/,.*//' $pjfile | sort | uniq | wc -l`;
chomp($nb_nodes);
}
my($pcf_file_content)="DEFAULT_OPTIONS
LEVEL THREAD
UNITS NANOSEC
LOOK_BACK 100
SPEED 1
FLAG_ICONS ENABLED
NUM_OF_STATE_COLORS 1000
YMAX_SCALE 37
DEFAULT_SEMANTIC
THREAD_FUNC State As Is
STATES
0 Idle
1 Running
2 Not created
3 Waiting a message
4 Blocking Send
5 Synchronization
6 Test/Probe
7 Scheduling and Fork/Join
8 Wait/WaitAll
9 Blocked
10 Immediate Send
11 Immediate Receive
12 I/O
13 Group Communication
14 Tracing Disabled
15 Others
16 Send Receive
17 Memory transfer
STATES_COLOR
0 {117,195,255}
1 {0,0,255}
2 {255,255,255}
3 {255,0,0}
4 {255,0,174}
5 {179,0,0}
6 {0,255,0}
7 {255,255,0}
8 {235,0,0}
9 {0,162,0}
10 {255,0,255}
11 {100,100,177}
12 {172,174,41}
13 {255,144,26}
14 {2,255,177}
15 {192,224,0}
16 {66,66,66}
17 {255,0,96}
EVENT_TYPE
9 50000001 MPI Point-to-point
VALUES
2 MPI_Recv
1 MPI_Send
0 Outside MPI
EVENT_TYPE
9 50000002 MPI Collective Comm
VALUES
18 MPI_Allgatherv
10 MPI_Allreduce
11 MPI_Alltoall
12 MPI_Alltoallv
8 MPI_Barrier
7 MPI_Bcast
13 MPI_Gather
14 MPI_Gatherv
80 MPI_Reduce_scatter
9 MPI_Reduce
0 Outside MPI
EVENT_TYPE
9 50000003 MPI Other
VALUES
21 MPI_Comm_create
19 MPI_Comm_rank
20 MPI_Comm_size
32 MPI_Finalize
31 MPI_Init
0 Outside MPI
EVENT_TYPE
1 50100001 Send Size in MPI Global OP
1 50100002 Recv Size in MPI Global OP
1 50100003 Root in MPI Global OP
1 50100004 Communicator in MPI Global OP
EVENT_TYPE
6 40000001 Application
VALUES
0 End
1 Begin
EVENT_TYPE
6 40000003 Flushing Traces
VALUES
0 End
1 Begin
GRADIENT_COLOR
0 {0,255,2}
1 {0,244,13}
2 {0,232,25}
3 {0,220,37}
4 {0,209,48}
5 {0,197,60}
6 {0,185,72}
7 {0,173,84}
8 {0,162,95}
9 {0,150,107}
10 {0,138,119}
11 {0,127,130}
12 {0,115,142}
13 {0,103,154}
14 {0,91,166}
GRADIENT_NAMES
0 Gradient 0
1 Grad. 1/MPI Events
2 Grad. 2/OMP Events
3 Grad. 3/OMP locks
4 Grad. 4/User func
5 Grad. 5/User Events
6 Grad. 6/General Events
7 Grad. 7/Hardware Counters
8 Gradient 8
9 Gradient 9
10 Gradient 10
11 Gradient 11
12 Gradient 12
13 Gradient 13
14 Gradient 14
EVENT_TYPE
9 40000018 Tracing mode:
VALUES
1 Detailed
2 CPU Bursts
";
my($pcf_output)=$output;
$pcf_output =~ s/\.prv$/.pcf/;
open OUTPUT, "> $pcf_output";
print OUTPUT $pcf_file_content;
close OUTPUT;
my(%mpi_to_pcf) = (
"MPI_Running" => "1",
"MPI_Send" => "10",
"MPI_Recv" => "11",
"Collective" => "13",
"Others" => "15",
);
my(%mpi_coll_to_pcf) = (
"MPI_Allgatherv" => "18",
"MPI_Allreduce" => "10",
"MPI_Alltoall" => "11",
"MPI_Alltoallv" => "12",
"MPI_Barrier" => "8",
"MPI_Bcast" => "7",
"MPI_Gather" => "13",
"MPI_Gatherv" => "14",
"MPI_Reduce_Scatter" => "80",
"MPI_Reduce" => "9",
);
my(%mpi_others_to_pcf) = (
"MPI_Comm_create" => "21",
"MPI_Comm_rank" => "19",
"MPI_Comm_size" => "20",
"MPI_Finalize" => "32",
"MPI_Init" => "31",
);
my(%smpi_to_mpi) = (
"action_allReduce" => "MPI_Allreduce",
"action_allToAll" => "MPI_Alltoall",
"action_barrier" => "MPI_Barrier",
"action_bcast" => "MPI_Bcast",
"action_gather" => "MPI_Gather",
"action_reduce" => "MPI_reduce",
"action_reducescatter" => "MPI_Reduce_Scatter",
"smpi_replay_finalize" => "MPI_Finalize",
"smpi_replay_init" => "MPI_Init",
"PMPI_Init" => "MPI_Init",
"PMPI_Send" => "MPI_Send",
"PMPI_Recv" => "MPI_Recv",
"PMPI_Finalize" => "MPI_Finalize"
);
my($line);
open(INPUT,$pjfile) or die;
open(OUTPUT,"> $output") or die;
my(@tab);
@tab=();
foreach (1..$nb_nodes) {
push @tab,1;
}
my $node_list = join(',',@tab);
@tab=();
foreach (1..$nb_nodes) {
push @tab,"1:$_";
}
my $thread_list = join(',',@tab);
print OUTPUT "#Paraver (generated with perl from SMPI):${duration}_ns:$nb_nodes($node_list):1:$nb_nodes($thread_list),3\n";
my $comm_list = join(':',(1..$nb_nodes));
my $comm=1;
print OUTPUT "c:1:$comm:$nb_nodes:$comm_list\n"; $comm++;
foreach (1..$nb_nodes) {
print OUTPUT "c:1:$comm:1:$_\n";
}
while(defined($line=<INPUT>)) {
chomp($line);
my($Foo1,$rank,$Foo2,$start,$end,$duration,$Foo3,$type) = split(/,/,$line);
$rank=~ s/\D*//g;
$rank++;
$start *= 1E9;
$end *= 1E9;
if($type =~ /action_/ or $type =~ /smpi_/ or $type =~ /PMPI_/) {
my($key);
foreach $key (keys(%smpi_to_mpi)) {
if($type eq $key) {
$type = $smpi_to_mpi{$key};
last;
}
}
}
if(defined($mpi_to_pcf{$type})) {
print "$type $mpi_to_pcf{$type}\n";
print OUTPUT "1:$rank:1:$rank:1:$start:$end:$mpi_to_pcf{$type}\n";
} elsif(defined($mpi_coll_to_pcf{$type})) {
print OUTPUT "1:$rank:1:$rank:1:$start:$end:$mpi_to_pcf{Collective}\n"; # group communication
print OUTPUT "2:$rank:1:$rank:1:$start:50000002:$mpi_coll_to_pcf{$type}\n";
print OUTPUT "2:$rank:1:$rank:1:$end:50000002:0\n"; # Output MPI
} elsif(defined($mpi_others_to_pcf{$type})) {
print OUTPUT "1:$rank:1:$rank:1:$start:$end:$mpi_to_pcf{Others}\n";
print OUTPUT "2:$rank:1:$rank:1:$start:50000003:$mpi_others_to_pcf{$type}\n";
print OUTPUT "2:$rank:1:$rank:1:$end:50000003:0\n"; # Output MPI
} else {
warn("Unknown type $type: Skipping $line\n");
}
}
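Note that the pjdump input is read as plain text, so the hard-coded .bz2 default above needs to be decompressed first. A typical invocation (file names for illustration; the script is the pjdump2prv.pl referred to in the TODO list):

perl pjdump2prv.pl -i nancy_700_lu.C.700.pjdump -o nancy_700_lu.C.700.prv -n 700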