MHPCC - Maui High Performance Computing Center

SP Parallel Programming Workshop
Parallel Code Development


© Copyright Statement

Table of Contents
  1. Prerequisites

  2. Talk Overview

  3. A Methodical Approach

  4. Case studies:

  5. References, Acknowledgements, WWW Resources


Prerequisites


Talk Overview


Obstacles to Efficient Parallel Programming

  1. Overhead due to data communication

    • Inefficient data/task partitioning that increases total communication

    • Many fine-grain communication operations as opposed to few large-grain communication operations

    • Slow communication due to overloading of communication links

  2. Load Balancing Inefficiencies

    • Poor Static Load Balancing

    • Poor Dynamic Load Balancing

  3. Inherently Sequential Algorithmic Nature


A Methodical Approach

Parallel Programming is a complex and intuitive process that does not easily lend itself to autonomous and completely mechanical methods.

We describe a method that

Drawing from the extensive work in the field of Parallel Programming research, and in particular, the work of Ian Foster our 5 step method is:

  1. Identify Computational Hotspots

  2. Partition the problem into smaller independent tasks

  3. Identify Communication requirements between such tasks

  4. Agglomerate smaller tasks into larger tasks

  5. Map tasks to actual processors

As you become more experienced with parallel programming, you'll be able to consider two, three, or all steps at the same time.


An illustration of the 5-step method:


Step 1: Identify Computational Hotspots

We begin any parallelization process by identifying the parts of the program that consume the most run time.

The goal is to know which code should be parallelized and which code should be recycled from the serial program:
Why.

The reason is obvious:

There are three basic ways to identify the hotspots in your code:

In addition, the complexity and feasibility of the parallelization process should be considered when evaluating parallelization targets.

Step 1: Hotspot Identification Checklist

Are you satisfied you have identified a code fragment (your parallelization target) that consumes a large enough percentage of your programs run time?

Is your parallelization target big enough?

Is your parallelization target too big?

Is your parallelization target known to be inherently serial?


Step 1: Hotspot Identification Examples


Step 2: Partitioning

In this step we partition our problem into smaller tasks that can execute in parallel with each other.

The intent is to expose multiple opportunities for parallel execution rather then to suggest the actual partition of tasks that will be executed in parallel.

Later, considerations of communication, memory, and task requirements may lead us to agglomerate smaller tasks into larger tasks.

For now, the goal is to partition the tasks as finely as possible so as to provide greater flexibility to make choices during later steps.


Step 2: Partitioning (cont.)

In addition to dividing the computation into separate tasks, we associate data with each task as well.

Data Decomposition: users typically concentrate on dividing the data first and then dividing the computation according to the data partition. Several issues should be considered:

Functional Decomposition: in problems where there are no obvious data structures to partition, or where the data structures are highly unstructured, start by dividing the computation into many separate tasks, and then follow with the data.

Alternatively you can combine the two strategies at different points in your problem.


Step 2: Partitioning Checklist

There are exponentially many ways to partition a problem; some may be significantly better or worse than others. The following checkpoints should help you determine if your partitioning is appropriate:

  1. Maximize flexibility; does your partition define at least an order of magnitude more tasks than there are processors?

  2. Scaling to larger problem sizes: does your partition avoid redundant storage requirements?

  3. Load balancing: Are tasks of comparable size?

  4. Problem scalability: does the number of tasks scale with problem size?

  5. Maximize flexibility; have you identified several alternative partitions?


Step 2: Partitioning Examples


Step 3: Communication

Ideally, the tasks defined in the previous step could execute completely in parallel, but in most real world programs these tasks require data computed or used by other tasks; communication is necessary.

Data communication is one of the main obstacles to efficient parallel programming.

In the communication step, we identify the data communication requirements of your partition by drawing lines between tasks that either produce or consume data from other tasks.

The intermediate goal is to identify tasks which most frequently communicate with another. In the next step, those tasks will be agglomerated into larger tasks.

Another goal is to identify those data structures which are needed by many tasks and to make decisions on whether to replicate them.


Step 3: Communication (cont.)

The process: draw lines between those tasks that must communicate.

Typically, a small number of communication patterns arise:


Step 3: Communication Checklist

Except for possibly replacing inefficient global communication patterns with more efficient ones, or duplicating commonly used data structures, the communication step requires few decisions. Nevertheless, we provide some checkpoints to ensure you are proceeding well:


Step 3: Communication Examples


Step 4: Agglomeration

At this point we have defined a data partition and identified many tasks which can execute in parallel along with their communication requirements.

We could simply divide these tasks among the processors randomly and then execute.

In the agglomeration step, we group many small tasks into larger tasks for better communication and computation performance.


Step 4: Agglomeration (cont.)

How much to agglomerate?


Step 4: Agglomeration: How To Agglomerate

Whether we are creating fully- or semi-agglomerated tasks, the basic guidelines are the same (and an exercise in satisfying conflicting demands).

It's often useful to know the communication performance capabilities of your target machine while agglomerating tasks:
SP2 Communication Performance Summary.

  1. Tasks that can run concurrently should be placed in different groups.

  2. Tasks that require the same input data should be grouped together.

  3. Tasks that communicate with each other should be grouped together.

  4. Surface-to-Volume issues: for problems that only communicate with local neighbors, we can simultaneously increases the message granularity, decrease the total communication requirements, and increase the computation granularity by agglomerating neighbors. Illustration. Mathematical analysis.

  5. A list of smaller tasks with smoothly varying load requirements should be randomly agglomerated into DRTs.


Step 4: Agglomeration Checklist

  1. Has agglomeration reduced total communication costs by grouping tightly coupled tasks together?

  2. Has agglomeration yielded tasks with similar computation and communication costs? If we have created just one task per processor, then these tasks should have nearly identical costs.

  3. Does the number of tasks still scale with problem size?

  4. Has agglomeration eliminated opportunities for concurrent execution?

  5. Can the number of tasks be reduced still further, without introducing load imbalances, increasing software engineering costs, or reducing scalability?

  6. If you expect to use a dynamic task scheduler, does your agglomeration still have about an order of magnitude more tasks than processors?


Step 4: Agglomeration Examples


Step 5: Mapping tasks to processors

At this point we should have our program partitioned into reasonable sized tasks, but not yet mapped to processors.

In the case of static problems where we agglomerated smaller tasks to P larger tasks, the mapping task is straightforward; one task per processor;
SP2 tip.


Step 5: Mapping: (cont.)

In the case where the number of tasks are known at load time, certain agglomeration/mapping techniques can be used to map data/computation to processors to ensure that each processor receives the same computational load.


Step 5: Mapping: Task-Scheduling Algorithms

In the toughest programs, the number of tasks to be executed is also unknown at design time; in such cases some form of dynamic load balancing must be used.

Task-scheduling algorithms, where tasks are dynamically mapped to processors as computation proceeds, work best when each task has little communication requirements:


Step 5: Mapping Checklist

  1. If your problem has non-even computational requirements, have you considered an algorithm based on dynamic task creation?

  2. If considering a design based on dynamic task creation, have you considered full-agglomerating tasks to a single processor?

  3. If using a centralized load-balancing scheme, have you verified that the manager will not become a bottleneck?

  4. If using a dynamic load-balancing scheme, have you evaluated the relative costs of different strategies (ie. probabilistic mappings)?

  5. If using probabilistic or cyclic methods, do you have a large enough number of tasks to ensure reasonable load balance?


Step 5: Mapping Examples


Case Study: 2-D FFT: Overview

Our first case study is the two-dimensional Fast Fourier Transform. In this problem a series of one-dimensional FFT's are applied to a two-dimensional image.

The input and output of the algorithm is a 2-D image. A brief pseudo-code description of the algorithm is:

   do i = 1, IMAGE_HEIGHT
      row(tmp_image, i) = 1-d_fft(row(input_image, i))
   end do

   do i = 1, IMAGE_WIDTH
      column(output_image, i) = 1-d_fft(column(tmp_image, i))
   end do
The only important data structures are the input, tmp, and output images. Each image is the same size; typically anywhere from 128 by 128 to 8192 by 8192 or larger.

The actual code is available
here.

2-D FFT Parallelization

Identify the Computation Hotspots: intuitive and visual inspection of the code reveals that the bulk of the run time is spent performing the N row FFT's and the N column FFT's. Running the program with prof confirms this.

Partitioning: for 2-D FFT, partitioning is the most important step and there are several choices:

Communication: the communication step was useful for eliminating many bad partitions. For the final partition, communication is required between the 1-D FFT phases, but not during the 1-D FFT's.

Agglomeration/Mapping: This step, and the mapping step, are simple for this example.


Case Study: 2-D CFD Code: Overview

Our second case study is a computational fluid dynamics example in two dimensions. In this example, we'll be simulating airflow over a wing. More specifically, this code calculates the solution of a compressible, inviscid, non-conducting flow over a body in two dimensions.

The code solves the 2D, compressible Euler equations on unstructured triangular grids:

The Euler equations, where U is the vector of conserved variables. A and B are the Jacobian matrices defined in the program.

The algorithm approximately solves the Euler equations over the domain by solving the discretized equations on the unstructured grid, or mesh. The values for density, x-momentum, y-momentum, and total energy are stored at the vertices of the mesh.

Basically, the field values at a given timestep are checked to see if they satisfy the discretized Euler equations. If not, then they are adjusted and checked again. This iterative process continues until the field values satisfy the discretized Euler equations to a specified tolerance.

The actual code is available
here.

2-D CFD: Computational Elements

The wing is divided into a triangular mesh. Each triangle is called a cell. Each cell has 3 vertices or nodes. Click here for an illustration of the 2-D mesh.

The "residual" is calculated at each cell using field values held at the vertices of that cell.

This fluctuation is then split-up and distributed to the vertices of the cell.

The sum of the fluctuation contributions from all the cells that surround a vertex make up the nodal fluctuation for that vertex.

Once the code has looped through all of the cells, the nodal residuals will have been calculated for each vertex, and the vertex values can be updated for the next timestep.


2-D CFD: Useful Information

The grid is represented by a vector of N elements; each element contains the X and Y coordinates of the vertices, along with other data. The actual vertices appear in no particular order in this vector.

CFD grids commonly have from thousands to millions of cells.

Convergence is typically reached after 1,000 to 100,000 time iterations.

Each cell needs approximately 400 floating-point operations per time iteration.


CFD Parallelization

Identify the Computation Hotspots: visual inspection of the code shows that the main loop in which the values (density, x-dir momentum, y-dir momentum and total energy) are calculated for each cell are the computationally expensive parts of the algorithm. File I/O, pre-, and post-processing steps will remain serial.

Partitioning: The cell is a natural element for data-partitioning:

We partition the unstructured 2-D grid into distinct cells.

Communication: each cell only needs data from its vertices. Each vertex only needs data from vertices that it is directly connected to. Therefore, the 2-D grid itself already identifies communication links.

Agglomeration: at this point we have a separate task for each cell and far more cells than processors. We must agglomerate to larger tasks. Mapping: the mapping step is essentially done at this point; we fully agglomerated the N cell/vertices into P separate tasks.


References, Acknowledgements, WWW Resources


This tutorial draws heavily from the book Designing and Building Parallel Programs by Ian Foster; the 5-step parallelization method, and many of the figures were borrowed from this book. Additional information can be obtained Here.


© Copyright 1996 Maui High Performance Computing Center. All rights reserved.

Documents located on the Maui High Performance Computing Center's WWW server are copyrighted by the MHPCC. Educational institutions are encouraged to reproduce and distribute these materials for educational use as long as credit and notification are provided. Please retain this copyright notice and include this statement with any copies that you make. Also, the MHPCC requests that you send notification of their use to help@@mail.mhpcc.edu.

Commercial use of these materials is prohibited without prior written permission.

Author: September 1996 George Gusciora
Revised: