# M2R Parallel Systems

## Table of Contents

## Sitemap

## Parallel Systems

### General Informations

These lectures take place in every monday from 8h15 to
11h15. The coordinators for these lectures are Arnaud Legrand and Jean-Louis Roch. Please send a mail to **both**
of them for any question regarding the lectures.

The next lecture will be on Monday 1/12 in room H201 from 8h15 to 11h15

### Objectives

Today, parallel computing is omnipresent across a large spectrum of computing platforms. At the ``microscopic'' level, processor cores have used multiple functional units in concurrent and pipelined fashions for years, and multiple-core chips are now commonplace with a trend toward rapidly increasing numbers of cores per chip. At the ``macroscopic'' level, one can now build clusters of hundreds to thousands of individual (multi-core) computers. Such distributed-memory systems have become mainstream and affordable in the form of commodity clusters. Furthermore, advances in network technology and infrastructures have made it possible to aggregate parallel computing platforms across wide-area networks in so-called ``grids.''

An efficient exploitation of such platforms requires a deep understanding of both architecture, software and infrastructure mechanisms and of advanced algorithmic principles. The aim of this course is thus twofold. It aims at introducing the main trends and principles in the area of high performance computing infrastructures, illustrated by examples of the current state of the art. It intends to provide a rigorous yet accessible treatment of parallel algorithms, including theoretical models of parallel computation, parallel algorithm design for homogeneous and heterogeneous platforms, complexity and performance analysis, and fundamental notions of scheduling and work-stealing. These notions will always be presented in connection with real applications and platforms.

### Program and expected schedule

`Mon Sep 29 2008: Jean-François Méhaut`

Introduction to parallel computing. High Performance Architectures Processors (superscalar, simultaneous multi-threading, multi-core, GPU…). Symmetric MultiProcessors. OS features for cluster computing Multi-threading. 01_mosig.pdf, 01_INTEL.pdf.`Mon Oct 6 2008: Vincent Danjean`

How to Efficiently Program High Performance Architectures ? System and Low Level approaches. 02_progpar.pdf, 02_MPI.pdf.`Mon Oct 13 2008: Vincent Danjean and Arnaud Legrand`

High Performance Networks: bandwidth, latency, DMA, PIO, overlapping. 03_networks.pdf Platform taxonomy: From clusters to grid platforms. 03_clusters_to_grid.pdf Communication models (Hockney, LogP, TCP). Analysis of global communication schemes. 03_clusters_to_grid.pdf`Mon Oct 20 2008: Arnaud Legrand`

End of the lecture on communication models Modeling of parallel programs and platforms. Fundamental characteristics: Work and Depth. Dataflow graph representation of an execution. BSP programs. Classical scheduling techniques and application to Resource management (PBS, LSF, SGE, OAR). 04_scheduling.pdf`Mon Nov 3 2008: Arnaud Legrand`

Algorithms on a ring and on a grid. Overview of algorithms on heterogeneous platforms.- Virtual Topologies 05_ics632_vtop.pdf
- Communications on a ring 05_ics632_1Dcomm.pdf
- Algorithms on a ring 05_ics632_1D.pdf 05_ics632_1D_2.pdf
- Algorithms on a grid 05_ics632_2D_1.pdf, 05_ics632_2D_2.pdf
- Overview of algorithms on heterogeneous platforms 05_ics632_het.pdf

`Mon Nov 10 2008: Vincent Danjean`

Parallel programming: MPI, POSIX threads. 06_MPI_tutorial.tgz`Mon Nov 17 2008: Alexandre Termier & Arnaud Legrand`

Case study 1: Data-mining 07_data_mining.pdf. Introduction to divisible load scheduling 07_divisible.pdf`Mon Nov 24 2008: Derrick Kondo`

Case study 2: Desktop Grids. On the convergence of cloud computing and desktop grids; Google Map Reduce. 08_kondo.pdf`Mon Dec 1 2008: Jean-Louis Roch`

Work, depth and work-stealing. Analysis on processors with changing speeds. Parallelism extraction and algorithmic schemes by recursive splitting. Examples: iterated product/accumulate, cascading divide-and-conquer.- slides 09_roch_1.pdf
- Cost analysis by recurence formula
- Application to the design of parallel accumulate: cascading parallel/sequential
- Cascading divide and conquer: example of the maximum 09_roch_2.pdf
- Training exercices: parallel merge and sort. Problem 09_roch_3.pdf,

Answers/Correction 09_roch_4.pdf and LaTeX file 09_roch_34.tex to translate into english! Thanks in advance, 2 credit points to the exam on the for a good translation…

`Mon Dec 8 2008: Jean-Louis Roch`

Work-stealing and data locality. Sorting and merging, FFT, matrix operations.- Analysis of work-stealing: R. Blumofe's slides from Arora-Plaxton-Blumofe paper presentation 10_roch_1.pdf.
- Matrix multiplication and triangular matrix inversion. Communication cost.
- Ultrafast merge (training exercise)
- Sorting

`Mon Dec 15 2008: Jean-Louis Roch`

Adaptive algorithms and cascading divide & conquer: prefix computation, data compression, linear system solving.`Mon Jan 5 2008: Bruno Raffin`

Case Study 3: Illustration on the parallelization of distributed interactive simulation applications. 3D-scenes reconstruction.`Mon Jan 12 2008`

Presentation of students' individual studies (research articles, performance evaluation of an algorithm, parallelization of an application).`Mon Jan 19 2008`

Presentation of students' individual studies.

### Course Organization

The course gives 6 credits (ECTS). In addition to the lectures, each student performs an individual study: either the analysis of the parallelization of a given application (to be proposed by the student) or the presentation of a research paper (to be chosen among a proposed list of papers) with answers to given questions.

### Project

#### Things to Note

- Most of the projects have an "open-ended" flavor to them. The idea is to approach them like small research projects in which you define your own methodology and (some of) your objectives. What I am looking for is "mature" approaches expected from graduate students. In particular, your first task should be to come up with a precise formulation of the project.
- Some projects may be harder than others, and expectations will be tailored to the projects.
- At most 2 students can pick the same project.
- Group projects involving 2 students are possible, but expectations will be higher for the end result and I refuse to get involved in "I did everything and my partner did nothing because he/she was out skiing all the time" arguments.
- Students are strongly encouraged to define their own projects, but these projects have to be approved by me beforehand and approval will be contingent on the project being sufficiently involved.
- Looking at previous work, papers, results in the literature is encouraged. Downloading actual code that does the project is ok for comparison with your own implementation, but using that code instead of your own implementation constitutes ground for getting a zero.
- For the projects that require that you write a parallel applications, it is understood that you should write a sequential version (if need be) and that you compute speedups with respect to the sequential version. It is also understood that you will perform in-depth performance measurements. It is your responsibility to come up with interesting things to say in your report! One way to do this is coming up with multiple versions of your implementation so that you can study what worked and what didn't in terms of performance. Producing only one implementation doesn't really give you anything interesting to talk about. Performance analysis/modeling is a plus.

**IMPORTANT:** You should discuss progress with me, and not hesitate to ask me questions and for directions!

#### What to Turn in

You must turn in your code, a report (PDF) around 10-pages, and be prepared to present your project to the class and answer questions (about 15/20 minutes).

Projects are due on ** January 5 2009**, with presentations starting January 12 2009.

#### List of Possible Projects

`Parallel Sorting`

(This project is assigned to Fahmi Gargouri et Imad Afyouni) Sorting a list of numbers is a key operation in many applications and it is well-known that it is difficult to parallelize it efficiently. For instance, due to the fact that the amount of data is large when compared to the amount of computation, the cost of I/O may be overwhelming. In this project you will consider the following problem:Problem Statement: You have a binary file in your home area that contains a list of N random integers. You must write another binary file in your home area that contains the sorted list. Your goal is to do this on our cluster as fast as possible. This is a difficult problem because sorting is not very compute intensive, and so the result may be disappointing. The point is to see whether some speedup can indeed be achieved and to see what matters for performance.

You will develop several parallel sorting algorithms (you can probably come up with a few of your own, or research existing algorithms), and most likely several versions of each algorithm. For you performance evaluation focus on measuring I/O and computing costs. You probably need to spend some time thinking of how the data gets given to the processors initially. This could be done by some script before the call to mpirun in your PBS script, in the MPI code itself. Note that each node has its own local disk, to which I/O is of course faster than over the NSF-mounted home areas. Comparing different ways of doing the data distribution is most likely in order. And of course you should vary the value of N in your experiments. Report on performance discounting file I/O and on performance including file I/O.

`Smith-Watterman Algorithm`

(This project is assigned to Anis Rahman and Mian Muhammad Hamayun) The Smith Waterman algorithm is a dynamic programming algorithm used to compute global alignments of biological sequences, e.g., to align full genomes of procaryot organisms (i.e., bacterias). Research this algorithm (it's famous and information on the algorithm is readily available) and implement a fast parallel implementation (using existing or random DNA sequences or arbitrary lengths). The input sequences are stored in files in the user's home area and the resulting alignment should be stored to an output file. Come up with reasonable ways of distributing the data. You can also find parallel implementations of the SW algorithm and compare them with your own.`N-body Simulation`

(This project is assigned to Zizhu Wang) An N-body simulation is one that studies how bodies (represented by a mass, a location, and a velocity) move in space (in our case a 2-D space), according to the laws of (Newtonian) physics. Here is an implementation of the N-body problem in Matlab (runnable using the free implementation Octave on any good Linux system): nbod.m and mkbod.m (courtesy of Howard Motteler at UMBC). Implement an MPI version of this sequential program (your MPI code should read a file that specifies the problem's parameters). Describe your data distribution strategies. You'll probably need to implement adaptive load-balancing. All these should be part of an in-depth performance analysis of subsequent versions of the code. Design a way in which your program will output the results in a verifiable and viewable format (best would probably be a sequence of bitmap images that can be animated as an animated gif, but don't spend all your time doing this if you have no idea how to do it!).`Intersection of Line Segments`

(This project is assigned to Jérémie Francone and also to Ali Zeeshan) Consider a 2-D rectangular area and a number n of random line segments of some maximum length l in this area. Write the fastest possible MPI program that returns the list of all intersections as a triplet segment1, segment2, intersection point, (for large n). Do the usual performance analysis describing what your different optimizations were and what worked what didn't.`DAG Scheduling`

(This project is assigned to Paraskevas Bourgos and Emmanouel Sifakis) In this project you will verify the notion that DAG-scheduling heuristics that based their scheduling decisions on the critical path are indeed more effective than the standard MaxMin, MinMin, and Sufferage heuristics. This will be done entirely in simulation (i.e., by constructing Gantt charts, etc.). You will have to define a DAG generator that generates random DAGs with given characteristics, and perhaps synthetic DAGs with idiosyncratic properties to highlight the behavior of the different algorithms. The end result will be a set of graphs plotting the performance of relevant scheduling algorithms versus important DAG characteristics and perhaps number of available processors. An important component of this project is defining the experimental framework. Trying your own heuristics, or trying some available in the literature, is obviously a great idea.

### Bibliography

- Fran Berman, Geoffrey Fox, and Anthony Hey. Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, 2003.
- Ian Foster and Carl Kessellman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2003.
- Arnaud Legrand and Yves Robert. Algorithmique Parallèle: Cours et exercices corrigés. Dunod, 2003.