(setq org-babel-default-header-args:R
      '((:session . "org-R")))

For some reason, BigDFT is just impossible to trace in real life (RL) with TAU or Scalasca: either it segfaults or it hangs… Surprisingly, it works like a charm with Paraver. Remember that we want to compare an RL trace with an SMPI trace. Unfortunately, we haven't found any converter from Paraver files to TAU or OTF or anything else that we could later convert to CSV and analyze with R… So I've written a quick and dirty Perl script that converts Paraver files into CSV files similar to the ones we obtain with pj_dump.

Paraver Conversion and R visualization

Augustin just sent me a typical Paraver trace. Let's convert it:

ls paraver_trace/
EXTRAEParavertracempich.pcf
EXTRAEParavertracempich.prv
EXTRAEParavertracempich.row

The pcf file describes events, the row file defines the cpu/node/thread mapping, and the prv file is the actual trace with all the events. The file format is described here: http://pcsostres.ac.upc.edu/eitm/doku.php/paraver:prv
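To give an idea of what the script below has to parse, here are two made-up records (the values are hypothetical, only the layout matters): a state record (type 1) whose fields are record_type:cpu:appl:task:thread:begin_time:end_time:state, and an event record (type 2) whose fields are record_type:cpu:appl:task:thread:time followed by event_type:event_value pairs:

1:1:1:3:1:10668:25576057:1
2:1:1:3:1:10668:50000002:9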

# use strict;  # deliberately disabled: $input and $output are injected by org-babel :var
use Data::Dumper;
my $prv_input=$input.".prv";
my $pcf_input=$input.".pcf";

open(OUTPUT,"> $output") or die "Cannot open $output. $!";

sub parse_pcf {
  my($pcf) = shift;
  my $line;
  my(%state_name, %event_name) ;
  open(INPUT,$pcf) or die "Cannot open $pcf. $!";
  while(defined($line=<INPUT>)) {
     chomp $line;
     if($line =~ /^STATES$/) {
       while((defined($line=<INPUT>)) &&
             ($line =~ /^(\d+)\s+(.*)/g)) {
          $state_name{$1} = $2;
       }
     }
     if($line =~ /^EVENT_TYPE$/) {
       $line=<INPUT>;
       $line =~ /^\s*[69]\s+(\d+)\s+(.*)/ or next; # E.g., EVENT_TYPE\n 9    50100001    Send Size in MPI Global OP
       my($id)=$1;
       $event_name{$id}{type} = $2;
       $line=<INPUT>;
       $line =~ /VALUES/ or die "Expected a VALUES section in $pcf";
       while((defined($line=<INPUT>)) &&
             ($line =~ /^(\d+)\s+(.*)/g)) {
          $event_name{$id}{value}{$1} = $2;
       }
     }
  }
#  print Dumper(\%state_name);
#  print Dumper(\%event_name);
  return (\%state_name,\%event_name);
}

sub parse_prv {
  my($prv,$state_name,$event_name) = @_;
  my $line;
  my (%event);

  open(INPUT,$prv) or die "Cannot open $prv. $!";
  while(defined($line=<INPUT>)) {
    chomp $line;
    # State records 1:cpu:appl:task:thread : begin_time:end_time : state
    if($line =~ /^1/) {
      my($sname);
      my($record,$cpu,$appli,$task,$thread,$begin_time,$end_time,$state) =
         split(/:/,$line);
      if($$state_name{$state} =~ /Group/) {
        $line=<INPUT>;
        chomp $line;
        my($event,$ecpu,$eappli,$etask,$ethread,$etime,%event_list) =
           split(/:/,$line);
        (($ecpu eq $cpu) && ($eappli eq $appli) && 
         ($etask eq $task) && ($ethread eq $thread) &&
         ($etime >= $begin_time) && ($etime <= $end_time)) or
        die "Invalid event!";
        # 50000002 is Extrae's "MPI collective" event type (hardcoded)
        $sname = $$event_name{50000002}{value}{$event_list{50000002}};
      } else {
        $sname = $$state_name{$state};
      }
      print OUTPUT "State, $task, MPI_STATE, $begin_time, $end_time, ".
                   ($end_time-$begin_time).", 0, ".
                   $sname."\n";
    }
    # Event records 2:cpu:appl:task:thread : time : event_type:event_value
    if($line =~ /^2/) {
      my($event,$cpu,$appli,$task,$thread,$time,@event_list) =
         split(/:/,$line);
#      print OUTPUT "$time @event_list\n";
    }
  }
}

my($state_name,$event_name) = parse_pcf($pcf_input);
parse_prv($prv_input,$state_name,$event_name);
print "$output";
/tmp/bigdft_8_rl.csv
head /tmp/bigdft_8_rl.csv
State, 1, MPI_STATE, 0, 10668, 10668, 0, Not created
State, 2, MPI_STATE, 0, 5118733, 5118733, 0, Not created
State, 3, MPI_STATE, 0, 9374527, 9374527, 0, Not created
State, 4, MPI_STATE, 0, 17510142, 17510142, 0, Not created
State, 5, MPI_STATE, 0, 5989994, 5989994, 0, Not created
State, 6, MPI_STATE, 0, 5737601, 5737601, 0, Not created
State, 7, MPI_STATE, 0, 5866978, 5866978, 0, Not created
State, 8, MPI_STATE, 0, 5891099, 5891099, 0, Not created
State, 1, MPI_STATE, 10668, 25576057, 25565389, 0, Running
State, 2, MPI_STATE, 5118733, 18655258, 13536525, 0, Running

OK. Conversion seems fine. It's a little messy and some stuff is hardcoded but it does the job. Let's load it in R.

options( width = 200 )
library(plyr)
library(ggplot2)

read_tau_trace <- function(file) {
  df <- read.csv(file, header=FALSE, strip.white=TRUE)
  names(df) <- c("Nature","ResourceId","Type","Start","End","Duration", "Level", "Value")
  df = df[!(names(df) %in% c("Nature","Type",  "Level"))]
  df$Origin=file
  df
}

df <- read_tau_trace(input)
# Times are in nanoseconds; let's convert them to seconds
df$End <- as.numeric(df$End)/1E9
df$Start <- as.numeric(df$Start)/1E9
df$Duration <- as.numeric(df$Duration)/1E9
head(df)
  ResourceId Start         End    Duration       Value               Origin
1          1     0 0.000010668 0.000010668 Not created /tmp/bigdft_8_rl.csv
2          2     0 0.005118733 0.005118733 Not created /tmp/bigdft_8_rl.csv
3          3     0 0.009374527 0.009374527 Not created /tmp/bigdft_8_rl.csv
4          4     0 0.017510142 0.017510142 Not created /tmp/bigdft_8_rl.csv
5          5     0 0.005989994 0.005989994 Not created /tmp/bigdft_8_rl.csv
6          6     0 0.005737601 0.005737601 Not created /tmp/bigdft_8_rl.csv

Fine. Now, let's make a Gantt Chart.

# Reorder the Value factor by decreasing total duration (most time-consuming state first).
df_profile <- ddply(df, c("Value"), summarize, Duration = sum(as.numeric(Duration)))
new_value_order <- df_profile[with(df_profile, order(-Duration)),]$Value
df_profile$Value <- factor(df_profile$Value, levels=new_value_order)
df_profile[with(df_profile, order(-Duration)),]
df$Value <- factor(df$Value, levels=new_value_order)

ggplot(df) + theme_bw() + scale_fill_brewer(palette="Paired") +
    geom_rect(aes(xmin=Start,xmax=End, ymin=ResourceId, 
                  ymax=ResourceId+1,fill=Value)) 

And voilà. Now, we have to compare it more finely with an SMPI trace…

First Comparison with SMPI

Let's try with a trace Augustin just sent me.

pj_dump $input | grep State | sed -e 's/PMPI/MPI/g' -e 's/rank-//' -e 's/computing/Running/' > $output
echo $output
/tmp/bigdft_8_smpi.csv
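For the record, the State lines produced by pj_dump look roughly like this before the substitutions (reconstructed from the output below; the values are illustrative), which explains the rank-, PMPI and computing rewrites above:

State, rank-7, MPI_STATE, 0.000000, 0.000084, 0.000084, 0, computing
State, rank-7, MPI_STATE, 0.000084, 0.000424, 0.000340, 0, PMPI_Barrier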
dfs <- read_tau_trace(input)
head(dfs)
  ResourceId    Start      End Duration       Value                 Origin
1          7 0.000000 0.000084 0.000084     Running /tmp/bigdft_8_smpi.csv
2          7 0.000084 0.000424 0.000340 MPI_Barrier /tmp/bigdft_8_smpi.csv
3          7 0.000424 0.000444 0.000020     Running /tmp/bigdft_8_smpi.csv
4          7 0.000444 0.000510 0.000066   MPI_Bcast /tmp/bigdft_8_smpi.csv
5          7 0.000510 0.000513 0.000003     Running /tmp/bigdft_8_smpi.csv
6          7 0.000513 0.000620 0.000107   MPI_Bcast /tmp/bigdft_8_smpi.csv
# Same Gantt chart for the SMPI trace.
ggplot(dfs) + theme_bw() + scale_fill_brewer(palette="Paired") +
    geom_rect(aes(xmin=Start,xmax=End, ymin=ResourceId, 
                  ymax=ResourceId+1,fill=Value)) 

Let's plot them side by side

# Merge both traces so they share the same scales and facets.
df2 <- rbind(df,dfs)
ggplot(df2) + theme_bw() + scale_fill_brewer(palette="Paired") +
    geom_rect(aes(xmin=Start,xmax=End, ymin=ResourceId, 
                  ymax=ResourceId+1,fill=Value)) +
    facet_wrap(~Origin, ncol=1)

OK. That's a first step but obviously, our estimation of MPI_Allreduce is quite pessimistic. This is likely explained by the fact that the implementation of this collective in SMPI does not correspond at all to the one used in the RL run…
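Let's quantify that impression; a quick sketch summing, for each trace, the total time spent in MPI_Allreduce (using the df2 data frame built above):

ddply(df2[df2$Value == "MPI_Allreduce",], c("Origin"), summarize,
      Allreduce_total = sum(Duration))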

Mmmmh, let's restrict both traces to the first phase, i.e., everything up to the last MPI_Reduce_scatter:

df <- df[df$End<=max(df[df$Value == "MPI_Reduce_scatter",]$End),]
dfs <- dfs[dfs$End<=max(dfs[dfs$Value == "MPI_Reduce_scatter",]$End),]
df2 <- rbind(df,dfs)
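While we're at it, comparing the makespans of this first phase gives the overall gap at a glance (same objects as above):

ddply(df2, c("Origin"), summarize, Makespan = max(End))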

And let's plot them side by side again

# Same side-by-side Gantt chart, restricted to the first phase.
ggplot(df2) + theme_bw() + scale_fill_brewer(palette="Paired") +
    geom_rect(aes(xmin=Start,xmax=End, ymin=ResourceId, 
                  ymax=ResourceId+1,fill=Value)) +
    facet_wrap(~Origin, ncol=1)

Now, let's compare "state distributions".

# Total time spent in each state, per rank and per trace.
df_sum <- ddply(df2, c("Value","ResourceId","Origin"),summarize, 
                Duration = sum(Duration))
ggplot(df_sum, aes(x=Value, y=Duration, color=Origin)) + 
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    geom_boxplot() + 
    geom_jitter(alpha=.4)

OK, so as expected, the main difference lies in MPI_Allreduce. There are also small differences in MPI_Alltoallv and MPI_Reduce_scatter. The other differences do not seem significant at this level of detail.

Using a "better" Allreduce implementation in SMPI

OK. Last attempt with a new trace that uses another implementation of Allreduce (rdb, presumably recursive doubling).
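If I remember correctly, this kind of trace is obtained by selecting the collective algorithm through SimGrid's configuration mechanism; something along these lines (the option value and the hostfile/platform file names are assumptions of mine):

# assumption: smpi/allreduce is the SimGrid option selecting the allreduce algorithm
smpirun --cfg=smpi/allreduce:rdb -np 8 -hostfile ./hostfile -platform ./platform.xml ./bigdft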

/tmp/bigdft_8_smpi_rdb.csv
dfs2 <- read_tau_trace(input)
dfs2 <- dfs2[dfs2$End<=max(dfs2[dfs2$Value == "MPI_Reduce_scatter",]$End),]

Again, let's display them side by side:

# Merge the three traces.
df2 <- rbind(df,dfs,dfs2)
ggplot(df2) + theme_bw() + scale_fill_brewer(palette="Paired") +
    geom_rect(aes(xmin=Start,xmax=End, ymin=ResourceId, 
                  ymax=ResourceId+1,fill=Value)) +
    facet_wrap(~Origin, ncol=1)

# Total time spent in each state, per rank and per trace.
df_sum <- ddply(df2, c("Value","ResourceId","Origin"),summarize, 
                Duration = sum(Duration))
ggplot(df_sum, aes(x=Value, y=Duration, color=Origin)) + 
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    geom_boxplot(position="dodge") + 
    geom_jitter(alpha=.3)

That's better. It's not perfect yet, but it's improving, and we may still not be using the exact algorithm of the real run…

Entered on [2013-04-03 Wed 11:36]