## HPC 101

#### Arnaud LEGRAND, CR CNRS, LIG/INRIA/Mescal

#### Vincent DANJEAN, MCF UJF, LIG/INRIA/Moais

October, 15th 2012

▲□▶▲□▶▲□▶▲□▶ □ のQ@

## Goals of the two next lectures

#### Learn and understand low-level software in HPC

## Understand the internal of HPC programming model implementations

< □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □ > < □

Limitation of mixing HPC programming models

Part I: Hardware in HPC Part II: Low-level software primitives in HPC Part III: Low-level API in HPC

◆□▶ ◆□▶ ▲□▶ ▲□▶ ■ ののの

## Hardware in HPC



#### Computational units

- Parallel Machines with Shared Memory
- Parallel Machines with Distributed Memory
- Current Architectures in HPC

#### 3 Networks

- (Fast|Giga)-Ethernet
- Legacy hardware
- Current networking hardware

## Summary

Part I: Hardware in HPC Part II: Low-level software primitives in HPC Part III: Low-level API in HPC

## Low-level software primitives in HPC

- basic programming models for computational units
  - Hardware support for synchronization
  - Threads
- 6 Low-level communication interface/libraries
  - BIP and MX/Myrinet
  - SiSCI/SCI
  - VIA
- Classical low-level techniques for efficient communications
  - Interacting with the network card: PIO and DMA
  - Zero-copy communications
  - Handshake Protocol
  - OS Bypass
- 8 Summary of low-level software primitives in HPC

Part I: Hardware in HPC Part II: Low-level software primitives in HPC Part III: Low-level API in HPC

◆□▶ ◆□▶ ▲□▶ ▲□▶ ■ ののの

## Low-level API in HPC



- Semaphores
- Monitors

#### 10 PThread

- Normalization of the threads interface
- Basic POSIX Thread API

## 🔟 MPI

- Message Passing
- Introduction to MPI
- Point-to-Point Communications
- Collective Communications

## Part I

## Hardware in HPC

◆□▶ ◆□▶ ◆ □▶ ◆ □▶ ● □ ● ● ●



## High-Performance Computing

- · Simulation complements theory and experiments
  - Climatology, seismology, astrophysics, nano-sciences, material chemistry, molecular biology, ...







- Computation needs are always higher
  - Faster or better results

2



## Irregular applications

- Multi-scale simulation
- Code coupling

Finite Element Method



Finite Difference Method

3



## Irregular applications

- Adaptive Mesh Refinement (AMR)
  - Behavior not known a priori







## Towards more and more hierarchical computers

- Landscape has changed
  - From super-computers to clusters
  - With more and more parallelism



Blue Gene, 106,496 cores





## High Performance Computing

#### Needs are always here

- numerical or financial simulation, modelisation, virtual reality virtuelle
- more data, more details, ...

Computing power will never be enough

One way to follow: using parallelism Idea: change space into time more resources to gain some time

### **Parallel Architectures**

#### Two main kinds

Architectures with shared memory and architectures with distributed memory.



Parallel Machines with Shared Memory Parallel Machines with Distributed Memory Current Architectures in HPC

## Outlines: Hardware in HPC



#### Computational units

- Parallel Machines with Shared Memory
- Parallel Machines with Distributed Memory
- Current Architectures in HPC

#### 3 Networks

## 4 Summary

Parallel Machines with Shared Memory Parallel Machines with Distributed Memory Current Architectures in HPC

## Why several processors/cores ?

#### Limits for monocore processors

- superscalar processors: instruction level parallelism
- frequency
- electrical power

#### What to do with place available on chips ?

- caches (bigger and quicker)
- several series of registers (hyperthreaded processors)
- several series of cores (multi-core processors)
- all of that

Parallel Machines with Shared Memory Parallel Machines with Distributed Memory Current Architectures in HPC

## Symmetric Multi Processors

- all processors have access to the same memory and I/O
- in case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors
- rarely used nowadays but on processors with few cores (2 or 4)

#### Non Uniform Memory Access Architectures

- memory access time depends on the memory location relative to a processor
- better scaling hardware architecture
- harder to program efficiently: trade off needed between load-balancing and memory data locality

Parallel Machines with Shared Memory Parallel Machines with Distributed Memory Current Architectures in HPC

## Clusters

#### Composed of a few to hundreds of machines

- often homogeneous
  - same processor, memory, etc.
- often linked with a high speed, low latency network
  - Myrinet, InfinityBand, Quadrix, etc.

#### Biggest clusters can be split in several parts

- computing nodes
- I/O nodes
- front (interactive) node

Parallel Machines with Shared Memory Parallel Machines with Distributed Memory Current Architectures in HPC



#### Lots of heterogeneous resources

- aggregation of clusters and/or standalone nodes
- high latency network (Internet for example)
- often dynamic resources (clusters/nodes appear and disappear)
- different architectures, networks, etc.

Parallel Machines with Shared Memory Parallel Machines with Distributed Memory Current Architectures in HPC

## Computational units

#### **Hierarchical Architectures**

- HT technology
- multi-core processor
- multi processors machine
- cluster of machines
- grid of clusters and individual machines

#### Even more complexity

- computing on GPU
  - require specialized codes but hardware far more powerful
- FPGA
  - hardware can be specialized on demand
  - still lots of work on interface programming here

(Fast|Giga)-Ethernet Legacy hardware Current networking hardware

## **Outlines: Hardware in HPC**



#### 3 Networks

- (Fast|Giga)-Ethernet
- Legacy hardware
- Current networking hardware

## 4 Summary

(Fast|Giga)-Ethernet Legacy hardware Current networking hardware

## High Speed Networks

#### High Speed Networks are used in clusters

- Iow distance
- very interesting performance
  - low latency: about 1 μs
  - high bandwidth: about 10 Gb/s and more
- specific light protocols
  - static routing of messages
  - no required packet fragmentation
  - sometimes, no packet required

Myrinet, Quadrics, SCI, ...

(Fast|Giga)-Ethernet Legacy hardware Current networking hardware

## (Fast|Giga)-Ethernet

- Interconnect:
  - Hub or switch
- Wires:
  - Copper or optical fiber
- Latency:
  - about 10 μs
- Bandwidth:
  - From 100 Mb/s to 10 Gb/s (100 Gb/s, june 2010)
- Remark:
  - compatible with traditional Ethernet





Computational units (Fast|Giga)-Ether Networks Legacy hardware Summary Current networkir

## Myrinet

- Myricom corporate
- Interconnect:
  - Switch
- PCI card with:
  - a processor: LANai
  - SRAM memory: about 4 MB
- Latency:
  - about 1 or 2  $\mu$ s
- Bandwidth:
  - 10 Gb/s
- Remark:
  - static, wormhole routing
  - can you RJ45 cables

|         | Г |
|---------|---|
|         |   |
|         |   |
|         | ٣ |
|         |   |
| LANai 🔤 |   |
|         |   |
|         |   |

nputational units (Fast|Giga)-Ethernet Networks Legacy hardware Summary Current networking hard



#### Scalable Coherent Interface

- IEEE norm (1993)
- Dolphin corporate
- Uses remote memory access:
  - Address space remotely mapped



Computational units (Fast|Giga)-Ethernet Networks Legacy hardware Summary Current networking hardware

## InfiniBand

- Several manufacturers (Cisco, HP, Intel, IBM, etc.)
- Interconnect:
  - Optical links
  - Serial, point-to-point connections
  - Switched fabric (possibility of several paths)
- Bandwidth
  - single line of 2, 4, 8, 14 or 25 Mb/s
  - possibility of bonding 4 or 12 lines
- Latency:
  - about 100 or 200 ns for hardware only
  - about 1 or  $2 \mu s$  for some hardware with its driver
- Remark:
  - can interconnect buildings
  - RDMA operations available

Computational units (Fast]Giga)-Ethernet Networks Legacy hardware Summary Current networking hardware

## Quadrics

- One manufacturer (Quadrics)
- Interconnect:
  - Bi-directional serial links
  - Switched fabric (possibility of several paths)
- Bandwidth
  - 1 to 2 Gb/s on each direction
- Latency:
  - about 1.3 µs in MPI
- Remark:
  - selected by Bull for the fastest supercomputer in Europe: Tera100 at CEA
  - global operations (reduction, barrier) available in hardware

## **Outlines: Hardware in HPC**



#### 3 Networks



## Hardware in HPC

#### Performance, performance and performance

HPC machines: as many TFlops as possible for the user

- very complex multi-cores, multi-processors machines
- specialized hardware accelerators (GPU, FPGA, etc.)
- dedicated specialized networks

#### Some limitations

- not the price
- but the power consumption dedicated EDF electrical line for the last French super-calculator
- and the physics an electric signal cannot go quicker than the light speed

## Part II

## Low-level software primitives in HPC

Hardware support for synchronization Threads

## Outlines: Low-level software primitives in HPC



#### basic programming models for computational units

- Hardware support for synchronization
- Threads
- 6 Low-level communication interface/libraries
- Classical low-level techniques for efficient communications
- 3 Summary of low-level software primitives in HPC

Hardware support for synchronization Threads

## Synchronization requires hardware support

| What happens with incrementations in parallel? |        |  |  |  |  |
|------------------------------------------------|--------|--|--|--|--|
| var++;                                         | var++; |  |  |  |  |

Hardware support for synchronization Threads

## Synchronization requires hardware support

| What happens with incrementations in parallel? |       |       |        |     |       |       |        |  |  |
|------------------------------------------------|-------|-------|--------|-----|-------|-------|--------|--|--|
| for                                            | (i=0; | i<10; | i++) { | for | (i=0; | i<10; | i++) { |  |  |
| var++;                                         |       |       | var++; |     |       |       |        |  |  |
| }                                              |       |       |        | }   |       |       |        |  |  |

Hardware support for synchronization Threads

## Synchronization requires hardware support

# What happens with incrementations in parallel? for (i=0; i<10; i++) { var++; } var++; }</pre>

#### Hardware support required

TAS atomic test and set instruction cmpexchge compare and exchange atomic operation incrementation, decrementation, adding, etc.

Hardware support for synchronization Threads

## Critical section with busy waiting

```
Example of code
while (TAS(&var))
;
/* in critical section */
var=0;
```

Hardware support for synchronization Threads

## Critical section with busy waiting

```
Example of code
while (TAS(&var))
while (var);
/* in critical section */
var=0;
```

Hardware support for synchronization Threads

## Critical section with busy waiting

#### Example of code

```
while (TAS(&var))
    while (var);
/* in critical section */
var=0;
```

#### Busy waiting

+

٨

٨

- very reactive
- no OS or lib support required
  - use a processor while not doing anything
- does not scale if there are lots of waiters
  - very difficult to do it correctly (compiler, processor, etc.)

Hardware support for synchronization Threads

## Programming on Shared Memory Parallel Machines



Hardware support for synchronization Threads

### **Programming on Shared Memory Parallel Machines**



Hardware support for synchronization Threads

basic programming models for computational units

#### Why threads ?

To take profit from shared memory parallel architectures
 SMP, hyperthreaded, multi-core, NUMA, etc. processors

future Intel processors: several hundreds cores

- To describe the parallelism within the applications
  - independent tasks, I/O overlap, etc.

#### What will use threads ?

- User application codes
  - directly (with thread libraries)

POSIX API (IEEE POSIX 1003.1c norm) in C, C++, ...

- with high-level programming languages (Ada, OpenMP, ...)
- Middleware programming environments
  - demonized tasks (garbage collector, ...), ...

basic programming models for computational units

Low-level communication interface/libraries Classical low-level techniques for efficient communications Summary of low-level software primitives in HPC Hardware support for synchronization Threads

### User threads



basic programming models for computational units

Low-level communication interface/libraries Classical low-level techniques for efficient communications Summary of low-level software primitives in HPC Hardware support for synchronization Threads

### Kernel threads



basic programming models for computational units

Low-level communication interface/libraries Classical low-level techniques for efficient communications Summary of low-level software primitives in HPC

### Mixed models

Multithreaded process User level threads User scheduler Entities managed by the kernel Kernel scheduler Operating system Processors Flexibility Blocking syscalls limited Efficiency SMP  $(\pm)$ +

Threads

Hardware support for synchronization Threads

### Thread models characteristics

|         | Characteristics |             |     |                   |
|---------|-----------------|-------------|-----|-------------------|
| Library | Efficiency      | Flexibility | SMP | Blocking syscalls |
| User    | <b>+</b>        | ÷           | -   | -                 |
| Kernel  | -               | -           | ŧ   | ÷                 |
| Mixed   | ÷               | ÷           | Ŧ   | limited           |

#### Summary

Mixed libraries seems more attractive however they are more complex to develop. They also suffer from the blocking system call problem.

Hardware support for synchronization Threads

### User Threads and Blocking System Calls



Hardware support for synchronization Threads

### Scheduler Activations

#### Idea proposed by Anderson et al. (91)

Dialogue (and not monologue) between the user and kernel schedulers

- the user scheduler uses system calls
- the kernel scheduler uses upcalls

#### Upcalls

Notify the application of scheduling kernel events

#### Activations

- a new structure to support upcalls a kinf of kernel thread or virtual processor
- creating and destruction managed by the kernel

Hardware support for synchronization Threads

### **Scheduler Activations**



Hardware support for synchronization Threads

### **Scheduler Activations**



#### ...better use the following schema:



basic programming models for computational units Low-level communication interface/libraries

Classical low-level techniques for efficient communications Summary of low-level software primitives in HPC

# Working principle

Threads



basic programming models for computational units Low-level communication interface/libraries

Classical low-level techniques for efficient communications Summary of low-level software primitives in HPC

### t communications Threads

### Working principle



Summary of low-level software primitives in HPC

Hardware support for synchronization Threads

### Working principle



Classical low-level techniques for efficient communications Summary of low-level software primitives in HPC

### Working principle

Hardware support for synchronization Threads



basic programming models for computational units Low-level communication interface/libraries

Classical low-level techniques for efficient communications Summary of low-level software primitives in HPC

#### Hardware support for synchronization Threads

### Working principle



Summary of low-level software primitives in HPC

### Working principle

Hardware support for synchronization Threads



Summary of low-level software primitives in HPC

### Working principle

Hardware support for synchronization Threads



Summary of low-level software primitives in HPC

### Working principle

Hardware support for synchronization Threads



Hardware support for synchronization Threads

### Working principle



Summary of low-level software primitives in HPC

## Working principle

Hardware support for synchronization Threads



Hardware support for synchronization Threads

### Exploiting computational resources

#### Some recurring patterns

- hardware synchronization primitives
- threads

#### Common feature: very hard to program correctly such hardware

One cannot expect HPC applications to handle directly these kind of resources. It is too complex for an scientific application programmer.

BIP and MX/Myrinet SiSCI/SCI VIA

### Outlines: Low-level software primitives in HPC

#### basic programming models for computational units

6 Low-level communication interface/libraries

- BIP and MX/Myrinet
- SiSCI/SCI
- VIA

Classical low-level techniques for efficient communications

3 Summary of low-level software primitives in HPC

### **BIP/Myrinet**

• Basic Interface for Parallelism

- L. Prylli and B. Tourancheau
- Dedicated to Myrinet networks
- Characteristics
  - Asynchronous communication
  - No error detection
  - No flow control
    - Small messages are copied into a fixed buffer at reception

**BIP and MX/Myrinet** 

Big messages are lost if the receiver is not ready

### MX/Myrinet

BIP and MX/Myrinet SiSCI/SCI VIA

- Myrinet eXpress
  - Official driver from Myricom
- Very simplistic interface to allow easy implementation of MPI
  - Flow control
  - Reliable communications
  - Non contiguous messages
  - Multiplexing



BIP and MX/Myrinet SiSCI/SCI VIA

- Driver for SCI cards
- Programming model
  - Remote memory access
    - Explicit: RDMA
    - Implicit: memory projections
- Performance
  - Explicit use of some operation required:
    - memory "flush"
    - SCI\_memcpy
    - RDMA

BIP and MX/Myrinet SiSCI/SCI VIA

### VIA

- Virtual Interface Architecture
- A new standard
  - Lots of industrials
    - Microsoft, Intel, Compaq, etc.
  - Use for InfiniBand networks
- Characteristics
  - Virtual interfaces objects
    - Queues of descriptors (for sending and receiving)
  - Explicit memory recording
  - Remote reads/writes
    - RDMA

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

### Outlines: Low-level software primitives in HPC

- basic programming models for computational units
- 6 Low-level communication interface/libraries

### 7 Classical low-level techniques for efficient communications

- Interacting with the network card: PIO and DMA
- Zero-copy communications
- Handshake Protocol
- OS Bypass

B) Summary of low-level software primitives in HPC

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

### Interacting with the network card: PIO mode



Programmed Input/Output

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

### Interacting with the network card: DMA mode



**Direct Memory Access** 

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

### Zero-copy communications

#### Goals

#### Reduce the communication time

- Copy time cannot be neglected
  - but it can be partially recovered with pipelining
- Reduce the processor use
  - currently, memcpy are executed by processor instructions

#### Idea

The network card directly read/write data from/to the application memory

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

### Zero-copy communications



Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

### Zero-copy communications



Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

### Zero-copy communications for emission

#### **PIO mode transfers**

No problem for zero-copy

#### DMA mode transfers

- Non contiguous data in physical memory
- Headers added in the protocol
  - Iinked DMA
  - limits on the number of non contiguous segments

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

### Zero-copy communications for reception

A network card cannot "freeze" the received message on the physical media

If the receiver posted a "recv" operation before the message arrives

- zero-copy OK if the card can filter received messages
- else, zero-copy allowed with bounded-sized messages with optimistic heuristics

#### If the receiver is not ready

- A handshake protocol must be setup for big messages
- Small messages can be stored in an internal buffer

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

### Using a Handshake Protocol



Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

### A few more considerations

#### The receiving side plays an important role

- Flow-control is mandatory
- Zero-copy transfers
  - the sender has to ensure that the receiver is ready
  - a handshake (REQ+ACK) can be used

Communications in user-space introduce some difficulties

- Direct access to the NIC
  - most technologies impose "pinned" memory pages

Network drivers have limitations

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

## **Communication Protocol Selection**



Message size

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

## **Communication Protocol Selection**



Message size

#### Interacting with the network card: PIO and DM/ Zero-copy communications Handshake Protocol OS Bypass

## **Operating System Bypass**

- Initialization
  - traditional system calls
  - only at session beginning
- Transfers
  - direct from user space
  - no system call
  - "less" interrupts
- Humm...And what about security ?



Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

## OS-bypass + zero-copy

#### Problem

- Zero-copy mechanism uses DMA that requires physical addresses
- Mapping between virtual and physical address is only known by:
  - the processor (MMU)
  - the OS (pages table)
- We need that
  - the library knows this mapping
  - this mapping is not modified during the communication
    - ex: swap decided by the OS, copy-on-write, etc.
- No way to ensure this in user space !

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

#### OS-bypass + zero-copy



Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

## OS-bypass + zero-copy

#### First solution

- Pages "recorded" in the kernel to avoid swapping
- Management of a cache for virtual/physical addresses mapping
  - in user space or on the network card
- Diversion of system calls that can modify the address space

#### Second solution

- Management of a cache for virtual/physical addresses mapping on the network card
- OS patch so that the network card is informed when a modification occurs
- Solution chosen by MX/Myrinet and Elan/Quadrics

### **Direct consequences**

Interacting with the network card: PIO and DMA Zero-copy communications Handshake Protocol OS Bypass

- Latency measure can vary whether the memory region used
  - Some pages are "recorded" within the network card
- Ideal case are ping-pong exchanges
  - The same pages are reused hundred of times
- Worst case are applications using lots of different data regions...

## Outlines: Low-level software primitives in HPC

#### basic programming models for computational units

- 6 Low-level communication interface/libraries
- Classical low-level techniques for efficient communications
- 8 Summary of low-level software primitives in HPC

## Summary of low-level software primitives in HPC

#### Efficient hardware

- numerous parallel threads
- very low latency and high bandwidth networks
- complex hardware to be programmed efficiently
  - synchronization, onboard CPU, onboard MMU for DMA, etc.

#### Very specific programming interfaces

- specialized assembly instructions for synchronization
- dedicated to specific technologies (but VIA)
- different programming models
- quasi no portability

It is not reasonable to program a scientific application directly with such programming interfaces

## Part III

## Low-level API in HPC

## Outlines: Low-level API in HPC



#### Synchronization

- Semaphores
- Monitors





Semaphores Monitors

## Semaphores

Internal state: a counter initialised to a positive or null value

#### Two methods:

P(s) wait for a positive counter then decrease it onceV(s) increase the counter

#### Common analogy: a box with tokens

- Initial state: the box has *n* tokens in it
- One can put one more token in the box (V)
- One can take one token from the box (P) waiting if none is available

Semaphores Monitors

## Monitors

#### Mutex

Two states: locked or not

#### Two methods:

lock(m) take the mutex
unlock(m) release the mutex (must be done by the
thread owning the mutex)

#### Conditions

waiting thread list (conditions are not related with tests)

Three methods:

wait(c, m) sleep on the condition. The mutex is released atomically during the wait.
 signal(c) one sleeping thread is wake up
 broadcast(c) all sleeping threads are wake up

PThread

## Outlines: Low-level API in HPC





- Normalization of the threads interface
- Basic POSIX Thread API



Normalization of the threads interface Basic POSIX Thread API

## Normalisation of the thread interface

#### Before the norm

- each Unix had its (slightly) incompatible interface
- but same kinds of features was present

#### **POSIX** normalization

- IEEE POSIX 1003.1c norm (also called POSIX threads norm)
- Only the API is normalised (not the ABI)
  - POSIX thread libraries can easily be switched at source level but not at runtime
- POSIX threads own
  - processor registers, stack, etc.
  - signal mask
- POSIX threads can be of any kind (user, kernel, etc.)

Normalization of the threads interface Basic POSIX Thread API

## **Basic POSIX Thread API**

#### Creation/destruction

- int pthread\_create(pthread\_t \*thread, const pthread\_attr\_t \*attr, void \*(\*start\_routine)(void\*), void \*arg)
- void pthread\_exit (void \*value\_ptr)
- int pthread\_join(pthread\_t thread, void
   \*\*value\_ptr)

#### Synchronisation (semaphores)

- int sem\_init(sem\_t \*sem, int pshared, unsigned int value)
- int sem\_wait(sem\_t \*sem)
- int sem\_post(sem\_t \*sem)
- int sem\_destroy(sem\_t \*sem)

Normalization of the threads interface Basic POSIX Thread API

## Basic POSIX Thread API (2)

#### Synchronisation (mutex)

- int **pthread\_mutex\_init**(pthread\_mutex\_t \*mutex, const pthread\_mutexattr\_t \*attr)
- int pthread\_mutex\_lock(pthread\_mutex\_t \*mutex)
- int **pthread\_mutex\_unlock**(pthread\_mutex\_t \*mutex)
- int pthread\_mutex\_destroy(pthread\_mutex\_t
  \*mutex)

#### Synchronisation (conditions)

- int **pthread\_cond\_init** (pthread\_cond\_t \*cond, const pthread\_condattr\_t \*attr)
- int **pthread\_cond\_wait** (pthread\_cond\_t \*cond, pthread\_mutex\_t \*mutex)
- int pthread\_cond\_signal(pthread\_cond\_t \*cond)

Normalization of the threads interface Basic POSIX Thread API

## Basic POSIX Thread API (3)

#### Per thread data

- int pthread\_key\_create(pthread\_key\_t \*key, void
   (\*destr\_function) (void\*))
- int pthread\_key\_delete(pthread\_key\_t key)
- int pthread\_setspecific(pthread\_key\_t key, const void \*pointer)
- void \* pthread\_getspecific(pthread\_key\_t key)

Normalization of the threads interface Basic POSIX Thread API

## Basic POSIX Thread API (3)

#### Per thread data

- int pthread\_key\_create(pthread\_key\_t \*key, void
   (\*destr\_function) (void\*))
- int pthread\_key\_delete(pthread\_key\_t key)
- int pthread\_setspecific(pthread\_key\_t key, const void \*pointer)
- void \* pthread\_getspecific(pthread\_key\_t key)

#### The new \_\_\_\_\_thread C keyword

- used for a global per-thread variable
- need support from the compiler and the linker at compile time and execute time
- libraries can have efficient per-thread variables without disturbing the application

Introduction to MPI Point-to-Point Communications Collective Communications

## **Outlines: Low-level API in HPC**



### 10 PThread

#### 🔟 MPI

- Message Passing
- Introduction to MPI
- Point-to-Point Communications
- Collective Communications

Introduction to MPI Point-to-Point Communications Collective Communications

# Message Passing



- Each processor runs a process
- Processes communicate by exchanging messages
- They cannot share memory in the sense that they cannot address the same memory cells
- The above is a programming model and things may look different in the actual implementation (e.g., MPI over Shared Memory)
- Message Passing is popular because it is general:
  - Pretty much any distributed system works by exchanging messages, at some level
  - Distributed- or shared-memory multiprocessors, networks of workstations, uniprocessors
- It is not popular because it is easy (it's not)

Introduction to MPI Point-to-Point Communications Collective Communications

# Code Parallelization

- Shared-memory programming
  - Parallelizing existing code can be very easy
    - OpenMP: just add a few pragmas
    - Pthreads: wrap work in do\_work functions
  - Understanding parallel code is easy
  - Incremental parallelization is natural
- Distributed-memory programming
  - parallelizing existing code can be very difficult
    - No shared memory makes it impossible to "just" reference variables
    - Explicit message exchanges can get really tricky
  - Understanding parallel code is difficult
    - Data structured are split all over different memories
  - Incremental parallelization can be challenging

Synchronization Introduction to MPI PThread Point-to-Point Communication MPI Collective Communications

## Programming Message Passing

- Shared-memory programming is simple conceptually (sort of)
- Shared-memory machines are expensive when one wants a lot of processors
- It's cheaper (and more scalable) to build distributed memory machines
  - Distributed memory supercomputers (IBM SP series)
  - Commodity clusters
- But then how do we program them?
- At a basic level, let the user deal with explicit messages
  - difficult
  - but provides the most flexibility

Synchronization Introducti PThread Point-to-F MPI Collective



- Isn't exchanging messages completely known and understood?
  - That's the basis of the IP idea
  - Networked computers running programs that communicate are very old and common
    - DNS, e-mail, Web, ...
- The answer is that, yes it is, we have "Sockets"
  - Software abstraction of a communication between two Internet hosts
  - Provides and API for programmers so that they do not need to know anything (or almost anything) about TCP/IP and write code with programs that communicate over the internet

Synchronization In PThread P MPI C

Introduction to MPI Point-to-Point Communications Collective Communications

# Using Sockets for parallel programming?

- One could thing of writing all parallel code on a cluster using sockets
  - n nodes in the cluster
  - Each node creates n-1 sockets on n-1 ports
  - All nodes can communicate
- Problems with this approach
  - Complex code
  - Only point-to-point communication
  - No notion of types messages
  - But
    - All this complexity could be "wrapped" under a higher-level API
    - And in fact, we'll see that's the basic idea
  - Does not take advantage of fast networking within a cluster/ MPP
    - Sockets have "Internet stuff" in them that's not necessary
    - TPC/IP may not even be the right protocol!

Synchronization PThread MPI Collective Communica

Message Passing for Parallel Programs

- Although "systems" people are happy with sockets, people writing parallel applications need something better
  - easier to program to
  - able to exploit the hardware better within a single machine
- This "something better" right now is MPI
  - We will learn how to write MPI programs
- Let's look at the history of message passing for parallel computing

Introduction to MPI Point-to-Point Communications Collective Communications

## **Outlines: Low-level API in HPC**



### 10 PThread

#### 🔟 MPI

- Message Passing
- Introduction to MPI
- Point-to-Point Communications
- Collective Communications

Introduction to MPI Point-to-Point Communications Collective Communications

## The MPI Standard

- MPI Forum setup as early as 1992 to come up with a de facto standard with the following goals:
  - source-code portability
  - allow for efficient implementation (e.g., by vendors)
  - support for heterogeneous platforms
- MPI is not
  - a language
  - an implementation (although it provides hints for implementers)
- June 1995: MPI v1.1 (we're now at MPI v1.2)
  - http://www-unix.mcs.anl.gov/mpi/
  - C and FORTRAN bindings
  - We will use MPI v1.1 from C in the class
- Implementations:
  - well-adopted by vendors
  - free implementations for clusters: MPICH, LAM, CHIMP/MPI
  - research in fault-tolerance: MPICH-V, FT-MPI, MPIFT, etc.



- It is rare for a programmer to write a different program for each process of a parallel application
- In most cases, people write Single Program Multiple Data (SPMD) programs
  - the same program runs on all participating processors
  - processes can be identified by some rank
  - This allows each process to know which piece of the problem to work on
  - This allows the programmer to specify that some process does something, while all the others do something else (common in master-worker computations)

```
main(int argc, char **argv) {
    if (my_rank == 0) { /* master */
        ... load input and dispatch ...
    } else { /* workers */
        ... wait for data and compute ...
    }
```



- Fixed number of processors
  - When launching the application one must specify the number of processors to use, which remains unchanged throughout execution
- Communicator
  - Abstraction for a group of processes that can communicate
  - A process can belong to multiple communicators
  - Makes is easy to partition/organize the application in multiple layers of communicating processes
  - Default and global communicator: MPI\_COMM\_WORLD
- Process Rank
  - The index of a process within a communicator
  - Typically user maps his/her own virtual topology on top of just linear ranks
    - ring, grid, etc.





Introduction to MPI Point-to-Point Communications Collective Communications

# Compiling/Running it

- Compile with mpicc
- Run with mpirun
  - % mpirun -np 4 my\_program <args>
  - requests 4 processors for running my\_program with commandline arguments
  - see the mpirun man page for more information
  - in particular the *-machinefile* option that is used to run on a network of workstations
- Some systems just run all programs as MPI programs and no explicit call to *mpirun* is actually needed
- Previous example program:
- % mpirun -np 3 -machinefile hosts my\_program

```
I am the master: somehost1
```

- I am a worker: somehost2 (rank=2/2)
- I am a worker: somehost3 (rank=1/2)

(stdout/stderr redirected to the process calling mpirun)

Introduction to MPI Point-to-Point Communications Collective Communications

## **Outlines: Low-level API in HPC**



### 10 PThread

#### 🔟 MPI

- Message Passing
- Introduction to MPI
- Point-to-Point Communications
- Collective Communications



- Data to be communicated is described by three things:
  - address
  - data type of the message
  - length of the message
- Involved processes are described by two things
  - communicator
  - rank
- Message is identified by a "tag" (integer) that can be chosen by the user

Synchronization Introduction to MPI PThread Point-to-Point Communications MPI Collective Communications

# Point-to-Point Communication

- Two modes of communication:
  - Synchronous: Communication does not complete until the message has been received
  - Asynchronous: Completes as soon as the message is "on its way", and hopefully it gets to destination
- MPI provides four versions
  - synchronous, buffered, standard, ready

Synchronization Introduction to MPI PThread Point-to-Point Communications MPI Collective Communications

# Synchronous/Buffered sending in MPI

- Synchronous with MPI\_Ssend
  - The send completes only once the receive has succeeded
    - copy data to the network, wait for an ack
    - The sender has to wait for a receive to be posted
    - No buffering of data

#### Buffered with MPI\_Bsend

- The send completes once the message has been buffered internally by MPI
  - Buffering incurs an extra memory copy
  - Doe not require a matching receive to be posted
  - May cause buffer overflow if many bsends and no matching receives have been posted yet

Introduction to MPI Point-to-Point Communications Collective Communications

# Standard/Ready Send

- Standard with MPI\_Send
  - Up to MPI to decide whether to do synchronous or buffered, for performance reasons
  - The rationale is that a correct MPI program should not rely on buffering to ensure correct semantics
- Ready with MPI\_Rsend
  - May be started *only* if the matching receive has been posted
  - Can be done efficiently on some systems as no hand-shaking is required



- There is only one MPI\_Recv, which returns when the data has been received.
  - only specifies the MAX number of elements to receive
- Why all this junk?
  - Performance, performance, performance
  - MPI was designed with constructors in mind, who would endlessly tune code to extract the best out of the platform (LINPACK benchmark).
  - Playing with the different versions of MPL\_?send can improve performance without modifying program semantics
  - Playing with the different versions of MPI\_?send can modify program semantics
  - Typically parallel codes do not face very complex distributed system problems and it's often more about performance than correctness.
  - You'll want to play with these to tune the performance of your code in your assignments

Introduction to MPI Point-to-Point Communications Collective Communications

### Example: Sending and Receiving







- Rationale: a correct MPI program should not rely on buffering for semantics, just for performance.
- So how do we do this then? ...

Introduction to MPI Point-to-Point Communications Collective Communications

### Non-DIOCKING communications

- So far we've seen blocking communication:
  - The call returns whenever its operation is complete (MPI\_SSEND returns once the message has been received, MPI\_BSEND returns once the message has been buffered, etc..)
- MPI provides non-blocking communication: the call returns immediately and there is another call that can be used to check on completion.
- Rationale: Non-blocking calls let the sender/receiver do something useful while waiting for completion of the operation (without playing with threads, etc.).

Synchronization Introdu PThread Point-to MPI Collect

Introduction to MPI Point-to-Point Communications Collective Communications

## Non-blocking Communication

 MPI\_Issend, MPI\_Ibsend, MPI\_Isend, MPI\_Irsend, MPI\_Irecv

MPI\_Request request;

MPI\_Isend(&x,1,MPI\_INT,dest,tag,communicator,&request); MPI Irecv(&x,1,MPI INT,src,tag,communicator,&request);

 Functions to check on completion: MPI\_Wait, MPI\_Test, MPI\_Waitany, MPI\_Testany, MPI\_Waitall, MPI\_Testall, MPI\_Waitsome, MPI\_Testsome.
 MPI\_Status status; MPI\_Wait(&request, &status) /\* block \*/ MPI\_Test(&request, &status) /\* doesn't block \*/

Introduction to MPI Point-to-Point Communications Collective Communications

## Example: Non-blocking comm

```
#include <unistd.h>
#include <mpi.h>
int main(int argc, char **argv) {
  int i. mv rank. x. v:
                                                 Deadlock
 MPI Status status:
  MPI Request request:
  MPI_Init(&argc,&argv);
  MPI Comm rank (MPI COMM WORLD, &mv rank);
  if (mv rank == 0) { /* P0 */
    x=42;
   MPI_Isend(&x,1,MPI_INT,1,0,MPI_COMM_WORLD,&request);
   MPI_Recv(&y,1,MPI_INT,1,0,MPI_COMM_WORLD,&status);
   MPI Wait (&request.&status);
 } else if (mv rank == 1) { /* P1 */
   v=41;
   MPI_Isend(&y,1,MPI_INT,0,0,MPI_COMM_WORLD,&request);
   MPI Recv(&x.1.MPI INT.0.0.MPI COMM WORLD.&status):
   MPI Wait (&request.&status):
  ł
 MPI_Finalize(); exit(0);
ł
```

Synchronization Introduction to MPI PThread Point-to-Point Communications MPI Collective Communications

Use of non-blocking comms

- In the previous example, why not just swap one pair of send and receive?
- Example:
  - A logical linear array of N processors, needing to exchange data with their neighbor at each iteration of an application
  - One would need to orchestrate the communications:
    - all odd-numbered processors send first
    - all even-numbered processors receive first
  - Sort of cumbersome and can lead to complicated patterns for more complex examples
  - In this case: just use MPI\_Isend and write much simpler code
- Furthermore, using MPI\_Isend makes it possible to overlap useful work with communication delays:

```
MPI_Isend()
<useful work>
MPI_Wait()
```

Introduction to MPI Point-to-Point Communications Collective Communications

Iterative Application Example

for (iterations) update all cells send boundary values receive boundary values

| - | _ | _ | _ | _ | _ | _ | _ |
|---|---|---|---|---|---|---|---|
| - | - | - | - | - | - | - | - |
| - |   |   |   |   |   | - | - |
|   |   |   |   |   |   |   |   |

- Would deadlock with MPI\_Ssend, and maybe deadlock with MPI\_Send, so must be implemented with MPI\_Isend
- Better version that uses non-blocking communication to achieve communication/computation overlap (aka latency hidingliate sending of boundary values to neighbours; initiate receipt of boundary values from neighbours; update non-boundary cells; wait for completion of sending of boundary values;

wait for completion of receipt of boundary values; update boundary cells;

 Saves cost of boundary value communication if hardware/software can overlap comm and comp

Introduction to MPI Point-to-Point Communications Collective Communications

### Non-blocking communications

- Almost always better to use non-blocking
  - communication can be carried out during blocking system calls
  - communication and communication can overlap
  - less likely to have annoying deadlocks
  - synchronous mode is better than implementing acks by hand though
- However, everything else being equal, non-blocking is slower due to extra data structure bookkeeping
  - The solution is just to benchmark
- When you do your programming assignments, you will play around with different communication types

Synchronization Int PThread Pc MPI Cc

Introduction to MPI Point-to-Point Communications Collective Communications

# More information

- There are many more functions that allow fine control of point-to-point communication
- Message ordering is guaranteed
- Detailed API descriptions at the MPI site at ANL:
  - Google "MPI". First link.
  - Note that you should check error codes, etc.
- Everything you want to know about deadlocks in MPI communication

http://andrew.ait.iastate.edu/HPC/Papers/mpicheck2/mpicheck2.htm

Introduction to MPI Point-to-Point Communications Collective Communications

#### **Outlines: Low-level API in HPC**



#### 10 PThread

#### 🔟 MPI

- Message Passing
- Introduction to MPI
- Point-to-Point Communications
- Collective Communications

 Synchronization
 Introduction to MPI

 PThread
 Point-to-Point Communication

 MPI
 Collective Communications

## **Collective Communication**

- Operations that allow more than 2 processes to communicate simultaneously
  - barrier
  - broadcast
  - reduce
- All these can be built using point-to-point communications, but typical MPI implementations have optimized them, and it's a good idea to use them
- In all of these, all processes place the same call (in good SPMD fashion), although depending on the process, some arguments may not be used

Synchronization Introduction to MPI PThread Point-to-Point Communication MPI Collective Communications

# Barrier

- Synchronization of the calling processes
   the call blocks until all of the processes have placed the call
- No data is exchanged

. . .

. . .

Similar to an OpenMP barrier

MPI\_Barrier(MPI\_COMM\_WORLD)



- One-to-many communication
- Note that multicast can be implemented via the use of communicators (i.e., to create processor groups)



 Synchronization
 Introduction to MPI

 PThread
 Point-to-Point Communication

 MPI
 Collective Communications



 Let's say the master must send the user input to all workers

```
int main(int argc, char **argv) {
    int my_rank;
    int input;
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
    if (argc != 2) exit(1);
    if (sscanf(argv[1],"%d",&input) != 1) exit(1);
    MPI_Bcast(&input,1,MPI_INT,0,MPI_COMM_WORLD);
    ...
}
```





Introduction to MPI Point-to-Point Communications Collective Communications

## This is actually a bit tricky

#### The root sends data to itself!



 Arguments #1, #2, and #3 are only meaningful at the root Synchronization Introduction PThread Point-to-Po MPI Collective (

```
Scatter Example
Partitioning an array of input among
  workers
int main(int argc, char **argv) {
  int *a:
  double *revbuffer;
  . . .
  MPI Comm size (MPI COMM WORLD, &n);
  <allocate array recvbuffer of size N/n>
  if (my rank == 0) { /* master */
      <allocate array a of size N>
  MPI_Scatter(a, N/n, MPI_INT,
                 recvbuffer, N/n, MPI INT,
                 0, MPI COMM WORLD);
  . . .
}
```

Synchronization In PThread Po MPI Co

Introduction to MPI Point-to-Point Communications Collective Communications



Without redundant sending at the root

```
int main(int argc, char **argv) {
   int *a:
  double *revbuffer;
   . . .
  MPI Comm size (MPI COMM WORLD, &n);
   if (my rank == 0) { /* master */
       <allocate array a of size N>
       <allocate array recvbuffer of size N/n>
       MPI Scatter(a, N/n, MPI INT,
                   MPI IN PLACE. N/n. MPI INT.
                   0, MPI COMM WORLD);
   } else { /* worker */
       <allocate array recybuffer of size N/n>
       MPI Scatter (NULL, 0, MPI INT,
                   recvbuffer, N/n, MPI INT,
                   0, MPI COMM WORLD);
ł
```





Synchronization Intro PThread Poin MPI Colla



Synchronization Introductio PThread Point-to-Po MPI Collective

Point-to-Point Communications

# **Reduction Operations**

- Used to compute a result from data that is distributed among processors
  - often what a user wants to do anyway
    - e.g., compute the sum of a distributed array
  - so why not provide the functionality as a single API call rather than having people keep reimplementing the same things
- Predefined operations:
  - MPI\_MAX, MPI\_MIN, MPI\_SUM, etc.
- Possibility to have user-defined operations

Synchronization PThread MPI Collective Con

Introduction to MPI Point-to-Point Communications Collective Communications

## MPI\_Reduce, MPI\_Allreduce

- MPI\_Reduce: result is sent out to the root
  - the operation is applied element-wise for each element of the input arrays on each processor
  - An output array is returned
- MPI\_Allreduce: result is sent out to everyone



Synchronization Intro PThread Poir MPI Coll

Introduction to MPI Point-to-Point Communications Collective Communications



MPI\_Reduce(sbuf, rbuf, 6, MPI\_INT, MPI\_SUM, 0, MPI\_COMM\_WORLD)



Synchronization Introduction to MPI PThread Point-to-Point Communication MPI Collective Communications



MPI\_Scan(sbuf, rbuf, 6, MPI\_INT, MPI\_SUM, MPI\_COMM\_WORLD)



- Most broadcast operations come with a version that allows for a stride (so that blocks do not need to be contiguous)
  - MPI\_Gatherv(), MPI\_Scatterv(), MPI\_Allgatherv(), MPI\_Alltoallv()
- MPI\_Reduce\_scatter(): functionality equivalent to a reduce followed by a scatter
- All the above have been created as they are common in scientific applications and save code
- All details on the MPI Webpage