# M2R Parallel Systems

## Table of Contents

## Sitemap

## Parallel Systems

### General Informations

These lectures take place in every monday from 9h45 to 12h45. The coordinator for these lectures is Arnaud Legrand

The next lecture will be on Monday 8/11 in room H 104 from 9h45 to 12h45

### Objectives

Today, parallel computing is omnipresent across a large spectrum of computing platforms. At the ``microscopic'' level, processor cores have used multiple functional units in concurrent and pipelined fashions for years, and multiple-core chips are now commonplace with a trend toward rapidly increasing numbers of cores per chip. At the ``macroscopic'' level, one can now build clusters of hundreds to thousands of individual (multi-core) computers. Such distributed-memory systems have become mainstream and affordable in the form of commodity clusters. Furthermore, advances in network technology and infrastructures have made it possible to aggregate parallel computing platforms across wide-area networks in so-called ``grids.''

An efficient exploitation of such platforms requires a deep understanding of both architecture, software and infrastructure mechanisms and of advanced algorithmic principles. The aim of this course is thus twofold. It aims at introducing the main trends and principles in the area of high performance computing infrastructures, illustrated by examples of the current state of the art. It intends to provide a rigorous yet accessible treatment of parallel algorithms, including theoretical models of parallel computation, parallel algorithm design for homogeneous and heterogeneous platforms, complexity and performance analysis, and fundamental notions of scheduling and work-stealing. These notions will always be presented in connection with real applications and platforms.

### Program and expected schedule

Check last year's schedule to get a foretaste. Tentative schedule.

`27/09/2008: Arnaud Legrand`

Introduction to parallel computing. High Performance Architectures Processors (superscalar, simultaneous multi-threading, multi-core, GPU…). Symmetric MultiProcessors. OS features for cluster computing Multi-threading. From clusters to Grids. PC_01_parallel_architectures.pdf`04/10/2010: Vincent Danjean`

High Performance Networks: bandwidth, latency, DMA, PIO, overlapping. How to Efficiently Program High Performance Architectures ? System and Low Level approaches. MPI, pthreads. PC_02_progpar.pdf`11/10/2010: Vincent Danjean`

How to Efficiently Communicate on Distributed Architectures ? Research aspects of mixing different HP API (e.g. how to efficiently use MPI and pthreads, how to efficiently use threads on hierarchical platforms, ….) PC_03_commdist.pdf`18/10/2010: Arnaud Legrand`

Performance Evaluation. Network models and topologies. Many parallel algorithms on a ring and on a grid. PC_04_parallel_algorithms.pdf`08/11/2010: Arnaud Legrand`

End of the previous lecture. From fine-grain to coarse-grain. PRAM, sorting networks and application to implementation on NOWs. PC_05_theory.pdf`15/11/2010: Arnaud Legrand`

End of the previous lecture + Modeling parallel programs and platforms. Fundamental characteristics: Work and Depth. Dataflow graph representation of an execution. BSP programs. Introduction to Scheduling! PC_06_scheduling.pdf`22/11/2010: Sascha Hunold`

Practical session on performance evaluation of parallel programs. PC_par_sort.pdf parsort-1.0.tar.gz mpi_sort.c`29/11/2010 Derrick Kondo`

Desktop Grids. On the convergence of cloud computing and desktop grids; Google Map Reduce. PC_07_desktop_grid.pdf`6/12/2010: Jean-Louis Roch`

Work, depth and work-stealing. Analysis on processors with changing speeds. Parallelism extraction and algorithmic schemes by recursive splitting. Examples: iterated product/accumulate, cascading divide-and-conquer. PC_08_algopar.pdf`13/12/2010: Jean-Louis Roch`

Work-stealing and data locality. Sorting and merging, FFT, matrix operations. Adaptive algorithms and cascading divide & conquer: prefix computation, data compression, linear system solving.`03/01/2010 Arnaud Legrand`

Cloud computing and how Virtualization changed the Grid perspective. VMscheduling-pres-grenobleDec2010_2.pdf`10/01/2011`

Presentation of students' individual studies.`17/01/2011`

Presentation of students' individual studies.

### Course Organization

The course gives 6 credits (ECTS). In addition to the lectures, each student performs an individual study: either the analysis of the parallelization of a given application (to be proposed by the student) or the presentation of a research paper (to be chosen among a proposed list of papers) with answers to given questions.

### Project

#### Things to Note

- Most of the projects have an "open-ended" flavor to them. The idea is to approach them like small research projects in which you define your own methodology and (some of) your objectives. What I am looking for is "mature" approaches expected from graduate students. In particular, your first task should be to come up with a precise formulation of the project.
- Some projects may be harder than others, and expectations will be tailored to the projects.
- At most 2 students can pick the same project.
- Group projects involving 2 students are possible, but expectations will be higher for the end result and I refuse to get involved in "I did everything and my partner did nothing because he/she was out skiing all the time" arguments.
- Students are strongly encouraged to define their own projects, but these projects have to be approved by me beforehand and approval will be contingent on the project being sufficiently involved.
- Looking at previous work, papers, results in the literature is encouraged. Downloading actual code that does the project is ok for comparison with your own implementation, but using that code instead of your own implementation constitutes ground for getting a zero.
- For the projects that require that you write a parallel applications, it is understood that you should write a sequential version (if need be) and that you compute speedups with respect to the sequential version. It is also understood that you will perform in-depth performance measurements. It is your responsibility to come up with interesting things to say in your report! One way to do this is coming up with multiple versions of your implementation so that you can study what worked and what didn't in terms of performance. Producing only one implementation doesn't really give you anything interesting to talk about. Performance analysis/modeling is a plus.

**IMPORTANT:** You should discuss progress with me, and not hesitate to ask me questions and for directions!

#### What to Turn in

You must turn in your code, a report (PDF) around 10-pages, and be prepared to present your project to the class and answer questions (about 15/20 minutes).

Projects are due on **January 6 2010**, with presentations starting January 11 2010.

#### Project Assignment

Name |
Subject |
---|---|

Luc Michel | Parallel Raytracing |

Ahmed El Rheddane | Sorting on a multicore machine |

Petr Sobeslavsky & Tsiligiannis Georgios | Measuring Application affinity in BOINC |

Damien Leone | Comparisons of parallel implementations of list ranking |

Cristian Camilo Ruiz | Batch scheduling evaluation |

Sam Jones | N-body simulation |

Modar Hijazi | GPUs |

Soguy Makkare GUEYE | Batch scheduling |

Tran Minh Dung | Sorting on a multicore machine |

Julien Henry & Julio Toss | CUDA vs. OpenCL: particle simulation |

Mohamad ALI | Non-cooperative scheduling |

#### List of Possible Projects

Again, this is a non exhaustive list of **possible**
projects. Feel free to propose me other subjects and/or to come
discuss with me if nothing really suits you here. If the one you
wanted to work on is already taken, we can try to think to another
subject in the same area but with a different perspective.

- System and Architecture
`Parallel sorting on a cluster`

Sorting a list of numbers is a key operation in many applications and it is well-known that it is difficult to parallelize it efficiently. For instance, due to the fact that the amount of data is large when compared to the amount of computation, the cost of I/O may be overwhelming. In this project you will consider the following problem:Problem Statement: You have a binary file in your home area that contains a list of N random integers. You must write another binary file in your home area that contains the sorted list. Your goal is to do this on our cluster as fast as possible. This is a difficult problem because sorting is not very compute intensive, and so the result may be disappointing. The point is to see whether some speedup can indeed be achieved and to see what matters for performance.

You will develop several parallel sorting algorithms (you can probably come up with a few of your own, or research existing algorithms), and most likely several versions of each algorithm. For you performance evaluation focus on measuring I/O and computing costs. You probably need to spend some time thinking of how the data gets given to the processors initially. This could be done by some script before the call to mpirun in your PBS script, in the MPI code itself. Note that each node has its own local disk, to which I/O is of course faster than over the NSF-mounted home areas. Comparing different ways of doing the data distribution is most likely in order. And of course you should vary the value of N in your experiments. Report on performance discounting file I/O and on performance including file I/O.

`Parallel Sorting on a Multi-Core Machine`

Sorting a list of numbers is a key operation in many applications and it is well-known that it is difficult to parallelize it efficiently. Since the advent of multi-core processors in standard PCs, it has become crucial to propose efficient parallel sorting implementations for such architectures. You should thus compare different approaches (e.g. pthreads, OpenMP, Intel TBB, Kaapi, Cilk, …) for various problem sizes and identify what makes some approaches better than others.`List ranking on a Multi-Core Machine`

(assigned to Damien Leone) List ranking isis a key operation in many applications and it is well-known that it is difficult to parallelize it efficiently. Since the advent of multi-core processors in standard PCs, it has become crucial to propose efficient parallel sorting implementations for such architectures. You should thus compare different approaches (e.g. pthreads, OpenMP, Intel TBB, Kaapi, Cilk, …) for various problem sizes and identify what makes some approaches better than others.`Playing with GPUs and CUDA`

(Modar HIJAZI and Toss Julio) If you are interested in playing with GPUs and CUDA, we can try to find a few simple algorithms already implemented and that you should try to improve.

- Application Parallelization
`Smith-Watterman Algorithm`

The Smith Waterman algorithm is a dynamic programming algorithm used to compute global alignments of biological sequences, e.g., to align full genomes of procaryot organisms (i.e., bacterias). Research this algorithm (it's famous and information on the algorithm is readily available) and implement a fast parallel implementation (using existing or random DNA sequences or arbitrary lengths). The input sequences are stored in files in the user's home area and the resulting alignment should be stored to an output file. Come up with reasonable ways of distributing the data. You can also find parallel implementations of the SW algorithm and compare them with your own.`N-body Simulation`

An N-body simulation is one that studies how bodies (represented by a mass, a location, and a velocity) move in space (in our case a 2-D space), according to the laws of (Newtonian) physics. Here is an implementation of the N-body problem in Matlab (runnable using the free implementation Octave on any good Linux system): nbod.m and mkbod.m (courtesy of Howard Motteler at UMBC). Implement an MPI version of this sequential program (your MPI code should read a file that specifies the problem's parameters). Describe your data distribution strategies. You'll probably need to implement adaptive load-balancing. All these should be part of an in-depth performance analysis of subsequent versions of the code. Design a way in which your program will output the results in a verifiable and viewable format (best would probably be a sequence of bitmap images that can be animated as an animated gif, but don't spend all your time doing this if you have no idea how to do it!).`Intersection of Line Segments`

Consider a 2-D rectangular area and a number n of random line segments of some maximum length l in this area. Write the fastest possible MPI program that returns the list of all intersections as a triplet segment1, segment2, intersection point, (for large n). Do the usual performance analysis describing what your different optimizations were and what worked what didn't.

- Scheduling
`DAG Scheduling`

In this project you will verify the notion that DAG-scheduling heuristics that based their scheduling decisions on the critical path are indeed more effective than the standard MaxMin, MinMin, and Sufferage heuristics. This will be done entirely in simulation (i.e., by constructing Gantt charts or using already existing simulators). You will have to define a DAG generator that generates random DAGs with given characteristics, and perhaps synthetic DAGs with idiosyncratic properties to highlight the behavior of the different algorithms. The end result will be a set of graphs plotting the performance of relevant scheduling algorithms versus important DAG characteristics and perhaps number of available processors. An important component of this project is defining the experimental framework. Trying your own heuristics, or trying some available in the literature, is obviously a great idea.`Scheduling for Volunteer Computing Systems`

(A subject on performance evaluation was assigned to Tsiligiannis Georgios and Petr Sobeslavsky) Advances in internetworking technology in the last decade have made it possible to establish distributed computing platforms at a global scale. A popular large-scale computing approach, made possible by the increasingly low cost/performance ratio of commodity computing components, is Volunteer Computing (VC). VC platforms aggregate tens or hundreds of thousands of independent and often individually owned hosts. While these platforms provide enormous amounts of computational power at low cost, their use is challenging due to resource volatility. This challenge leads to many research questions pertaining to the performance and use of VC platforms for various classes of applications. Most research in this area relies on simulation due to the difficulties faced when using production VC platforms for experimentation. In this project, you should focus on scheduling algorithms for such platforms (e.g. work unit replication, host selection, host prioritizing, deadlines, …) and try to evaluate the benefit of some of them in a realistic context.`Batch Scheduling`

(assigned to Mak Gueye as well as to Cristian Camilo Ruiz ) Batch scheduler rely on heuristics (list scheduling, backfilling, …) that have some guarantees in the better cases. Unfortunately, these guarantees rely on the hypothesis that the completion time is known beforehand. In practice, users often submit their jobs with processing times larger than required. Therefore none of these guarantees really hold and one may obtain surprising results depending on the workload. In this project, you should study the performance of a few classical job scheduling algorithms (or try a few of your inventions) under various (realistic if possible) workload. You could also study the impact of the non-clairvoyant hypothesis (the fact that announced processing times do not correspond to the real one), or the impact of reservations on the performance of these heuristic.`Non Cooperative Scheduling`

(assigned to Mohamad Ali) When a cluster is managed by a scheduler, it is generally difficult (not to say impossible) for a user to forecast the completion time of a job. Therefore, when a user has access to many different clusters, its best option is simply to submit its job on all clusters, wait for the first cluster that starts it and cancel the submission on the other clusters. However, if everybody does this for all their jobs, this may lead to an inefficiencies for everyone. In this project, you should study whether such inefficiencies occur and under which conditions.

### Bibliography

- Fran Berman, Geoffrey Fox, and Anthony Hey. Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, 2003.
- Ian Foster and Carl Kessellman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2003.
- Henri Casanova, Arnaud Legrand and Yves Robert. Parallel Algorithms. Chapman & Hall, 2008.