SP Parallel Programming Workshop
Parallel Code Development
© Copyright Statement
Table of Contents
-
Prerequisites
-
Talk Overview
-
A Methodical Approach
- Case studies:
-
References, Acknowledgements, WWW Resources
- Previous talks focused on an introduction to parallel machines,
parallel programming issues, and SP2 specific issues.
- This talk focuses on actual techniques to convert serial algorithms to
efficient parallel algorithms
- We avoid implementation details; covered in remaining talks
- Core: an abstract 5 Step method for parallelizing serial algorithms
- Checklists that help you evaluate choices made in each step
- Apply this method to two real world programs: 2-D FFT. and 2-D CFD
Obstacles to Efficient Parallel Programming
- Overhead due to data communication
- Inefficient data/task partitioning that increases total
communication
- Many fine-grain communication operations as opposed to few
large-grain communication operations
- Slow communication due to overloading of communication links
- Load Balancing Inefficiencies
- Poor Static Load Balancing
- Poor Dynamic Load Balancing
- Inherently Sequential Algorithmic Nature
Parallel Programming is a complex and intuitive process that does not easily
lend itself to autonomous and completely mechanical methods.
We describe a method that
- Maximizes the potential for parallelism
- Provides efficient steps for exploiting this potential
- Provides explicit checklists that evaluate the effectiveness of each
step
Drawing from the extensive work in the field of Parallel Programming
research, and in particular, the work of Ian Foster our 5 step method is:
- Identify Computational Hotspots
- Partition the problem into smaller independent tasks
- Identify Communication requirements between such tasks
- Agglomerate smaller tasks into larger tasks
- Map tasks to actual processors
As you become more experienced with parallel programming, you'll be able
to consider two, three, or all steps at the same time.
An illustration of the 5-step method:
We begin any parallelization process by identifying the parts of the program
that consume the most run time.
The goal is to know which code should be parallelized and which code should
be recycled from the serial program:Why.
The reason is obvious:
- Greatly optimizing a procedure which only takes 10 percent of the total
run time can at most improve your performance by about 10 percent.
- Greatly optimizing a procedure which only takes 90 percent of the total
run time can improve your performance by a factor of 10.
There are three basic ways to identify the hotspots in your code:
- Abstract analysis of the algorithm
- Visual inspection of the code
- Profiling tools. Many tools exist which can profile run time
behavior and quickly identify hotspots. See the first part of thePerformance Tuning Tutorial
for a description of a dozen or so tools.
In addition, the complexity and feasibility of the parallelization process
should be considered when evaluating parallelization targets.
Step 1: Hotspot Identification Checklist
Are you satisfied you have identified a code fragment (your parallelization
target) that consumes a large enough percentage of your programs run
time?
Is your parallelization target big enough?
- Would only perfect linear speedup on your parallelization target code
satisfy your performance needs?
- Would a P/2 speedup satisfy your performance needs?
- Would no speedup satisfy your performance needs?
Is your parallelization target too big?
- Does your parallelization target also include extensive amounts of
code that would not significantly improve your performance?
Is your parallelization target known to be inherently serial?
Step 1: Hotspot Identification Examples
In this step we partition our problem into smaller tasks that can
execute in parallel with each other.
The intent is to expose multiple opportunities for parallel execution rather
then to suggest the actual partition of tasks that will be executed in
parallel.
Later, considerations of communication, memory, and task requirements may
lead us to agglomerate smaller tasks into larger tasks.
For now, the goal is to partition the tasks as finely as possible so as to
provide greater flexibility to make choices during later steps.
Step 2: Partitioning (cont.)
In addition to dividing the computation into separate tasks, we associate
data with each task as well.
Data Decomposition: users typically concentrate on dividing the data
first and then dividing the computation according to the data partition.
Several issues should be considered:
- Begin by focusing on the largest data structures, or the ones that
are accessed most frequently.
- Divide the data into small pieces.
- If possible, divide the data into same size units.
- Strive for more aggressive partitioning then your target computer
will allow.
- Use the data partitioning as a guideline for partitioning the
computation into separate tasks; associate a computation with each
data element.
- Take communication requirements into some account when partitioning
data.
Functional Decomposition: in problems where there are no obvious data
structures to partition, or where the data structures are highly
unstructured, start by dividing the computation into many separate tasks,
and then follow with the data.
- Try to divide the computation into as many as possible tasks.
- Generally, it is better if these tasks require the same run time.
- Associate data with each task.
Alternatively you can combine the two strategies at different points in your
problem.
- Some problems focus on different data structures at different times.
Consider different partitionings for these different phases.
- Problems where no single data or functional partitioning yields
satisfactory results.
Step 2: Partitioning Checklist
There are exponentially many ways to partition a problem; some may be
significantly better or worse than others. The following checkpoints should
help you determine if your partitioning is appropriate:
- Maximize flexibility; does your partition define at least an order of
magnitude more tasks than there are processors?
- Scaling to larger problem sizes: does your partition avoid redundant
storage requirements?
- Load balancing: Are tasks of comparable size?
- Problem scalability: does the number of tasks scale with problem
size?
- Maximize flexibility; have you identified several alternative
partitions?
Step 2: Partitioning Examples
Ideally, the tasks defined in the previous step could execute completely in
parallel, but in most real world programs these tasks require data computed
or used by other tasks; communication is necessary.
Data communication is one of the main obstacles to efficient parallel
programming.
In the communication step, we identify the data communication requirements
of your partition by drawing lines between tasks that either produce or
consume data from other tasks.
The intermediate goal is to identify tasks which most frequently communicate
with another. In the next step, those tasks will be agglomerated into
larger tasks.
Another goal is to identify those data structures which are needed by many
tasks and to make decisions on whether to replicate them.
Step 3: Communication (cont.)
The process: draw lines between those tasks that must communicate.
Typically, a small number of communication patterns arise:
- Local Communication: tasks communicate with a small number of
local neighbor tasks; global communication requires that tasks
communicate with potentially many, non-local tasks.
- Global Communication: Broadcast: a data structure must be
broadcast from one task to all other tasks. Consider duplicating
the data structure at every task: SP2
tip.
- Global Communication: reductions: an associative reduction
operation must be performed on all data elements producing a
single result: SP2 tip.
- Structured Communication: a task and its neighbors form a
regular structure, such as a tree or grid; in contrast, unstructured
communication networks may be arbitrary graphs.
- Static Communication: communication links between tasks are
fixed at compile- or load-time. Dynamic communication requires that
tasks vary their communication partners over time.
Step 3: Communication Checklist
Except for possibly replacing inefficient global communication patterns with
more efficient ones, or duplicating commonly used data structures, the
communication step requires few decisions. Nevertheless, we provide some
checkpoints to ensure you are proceeding well:
- Data Structure Duplication: have you duplicated all small data
structures that require few updates during program execution?
- Data Structure Duplication: have you duplicated any data structures
that require either a large amount of memory, or computation to keep
them up-to-date?
- Communication balance: Do tasks perform nearly the same number of
communication operations?
- Does each task communicate only with a small number of neighbors? If
each task must communicate with many other tasks, consider revisiting
step 2 and repartitioning the tasks/data to achieve smaller
communication requirements.
- Are communication operations able to proceed concurrently?
Step 3: Communication Examples
At this point we have defined a data partition and identified many tasks
which can execute in parallel along with their communication
requirements.
We could simply divide these tasks among the processors randomly and then
execute.
- This would increase total communication by failing to assign frequently
communicating tasks to the same processor.
- This would sacrifice communication locality.
- This would sacrifice the improved communication performance inherent in
large-grained messages.
- This would sacrifice compiler optimization cabilities for loops.
In the agglomeration step, we group many small tasks into larger tasks for
better communication and computation performance.
- We reduce total communication by grouping tasks that must communicate
together.
- We combine several slow small messages into a few large fast ones.
- We perform computation in loops, where compiler optimizers excel.
Step 4: Agglomeration (cont.)
How much to agglomerate?
- If the task run times are knowable, equal, and easily grouped, we can
create fully-agglomerated tasks; P total tasks where P is the number
of processors we intend to run on.
- If the tasks are not easily grouped, or if their run times vary, we
must employ some type of static load balancing scheme when
agglomerating to ensure processors have equal work load; see next
step.
- If tasks run times are unknowable, or their total number is unknown at
load time, an alternative is to create semi-agglomerated tasks with an
order of magnitude more tasks than processors and dynamically schedule
them at run time.
- Motivation: If the task run times are dynamic, or not expected to be
the same, then staticly combining smaller tasks into P larger tasks
could result in gross load imbalances.
- Why agglomerate at all? Dynamic Load-balancing techniques impose a
fixed overhead per task that they balance; less tasks reduce this
overhead.
Step 4: Agglomeration: How To Agglomerate
Whether we are creating fully- or semi-agglomerated tasks, the basic
guidelines are the same (and an exercise in satisfying conflicting
demands).
It's often useful to know the communication performance capabilities of your
target machine while agglomerating tasks: SP2 Communication Performance
Summary.
- Tasks that can run concurrently should be placed in different
groups.
- Tasks that require the same input data should be grouped together.
- Tasks that communicate with each other should be grouped together.
- Surface-to-Volume issues: for problems that only communicate with
local neighbors, we can simultaneously increases the message
granularity, decrease the total communication requirements, and
increase the computation granularity by agglomerating neighbors. Illustration. Mathematical analysis.
- A list of smaller tasks with smoothly varying load requirements should
be randomly agglomerated into DRTs.
Step 4: Agglomeration Checklist
- Has agglomeration reduced total communication costs by grouping tightly
coupled tasks together?
- Has agglomeration yielded tasks with similar computation and
communication costs? If we have created just one task per processor,
then these tasks should have nearly identical costs.
- Does the number of tasks still scale with problem size?
- Has agglomeration eliminated opportunities for concurrent execution?
- Can the number of tasks be reduced still further, without introducing
load imbalances, increasing software engineering costs, or reducing
scalability?
- If you expect to use a dynamic task scheduler, does your agglomeration
still have about an order of magnitude more tasks than processors?
Step 4: Agglomeration Examples
At this point we should have our program partitioned into reasonable sized
tasks, but not yet mapped to processors.
In the case of static problems where we agglomerated smaller tasks to P
larger tasks, the mapping task is straightforward; one task per processor;
SP2 tip.
Step 5: Mapping: (cont.)
In the case where the number of tasks are known at load time, certain
agglomeration/mapping techniques can be used to map data/computation to
processors to ensure that each processor receives the same computational
load.
- Recursive Bisection
- Used to partition unstructured data domains so that computational
load per processor is even and communication between processors is
minimized.
- The divide-and-conquer approach is taken in agglomerating the domain;
the entire domain is first divided in half, then the halves are
divided in half, etc.
- An illustration of a recursively bisected 2-D grid is available
here.
- Local Algorithms
- An initial agglomeration/mapping is specified.
- Periodically, processors with neighbor data exchange their
computational load and exchange data/tasks if an imbalance
exists.
- Probabalistic Methods:
- Tasks are mapped to processors at random.
- If enough tasks exist, an even computational load per processor can
be expected.
- Cyclic mappings are a form of a probabalistic method.
Step 5: Mapping: Task-Scheduling Algorithms
In the toughest programs, the number of tasks to be executed is also unknown
at design time; in such cases some form of dynamic load balancing must be
used.
Task-scheduling algorithms, where tasks are dynamically mapped to processors
as computation proceeds, work best when each task has little communication
requirements:
- Master/Worker: The simplest task scheduling algorithm.
- A master process assigns tasks to workers; when a worker finishes
one task, the master assigns it a new one.
- Works best if there are an order of magnitude more tasks than
processors.
- Works best if the cost of assigning tasks to workers is small
relative to their computation time.
- Performance can be improved by prefetching a new task when the
current one is about to complete.
- Hierarchical Master/Worker:
- Similar to generic master/worker; a hierarchy of masters and workers
are designed.
- Can improve performance in problems where a single master would be
overloaded with scheduling tasks to processors.
- Decentralized Task Pool:
- Each processor maintains its own task pool.
- If a processor exhausts its task pool, it requests tasks from
neighbor processors.
- Requires no central task manager.
Step 5: Mapping Checklist
- If your problem has non-even computational requirements, have you
considered an algorithm based on dynamic task creation?
- If considering a design based on dynamic task creation, have you
considered full-agglomerating tasks to a single processor?
- If using a centralized load-balancing scheme, have you verified that the
manager will not become a bottleneck?
- If using a dynamic load-balancing scheme, have you evaluated the
relative costs of different strategies (ie. probabilistic
mappings)?
- If using probabilistic or cyclic methods, do you have a large enough
number of tasks to ensure reasonable load balance?
Step 5: Mapping Examples
Our first case study is the two-dimensional Fast Fourier Transform. In this
problem a series of one-dimensional FFT's are applied to a two-dimensional
image.
The input and output of the algorithm is a 2-D image. A brief pseudo-code
description of the algorithm is:
do i = 1, IMAGE_HEIGHT
row(tmp_image, i) = 1-d_fft(row(input_image, i))
end do
do i = 1, IMAGE_WIDTH
column(output_image, i) = 1-d_fft(column(tmp_image, i))
end do
The only important data structures are the input, tmp, and output images.
Each image is the same size; typically anywhere from 128 by 128 to 8192
by 8192 or larger.
The actual code is available here.
2-D FFT Parallelization
Identify the Computation Hotspots: intuitive and visual inspection of
the code reveals that the bulk of the run time is spent performing the N row
FFT's and the N column FFT's. Running the program with prof confirms this.
Partitioning: for 2-D FFT, partitioning is the most important step
and there are several choices:
- Partition all images by pixels: although this is the most aggressive
partitioning possible, later steps would show it is
unsatisfactory.
- Visual examination of the code reveals that every output pixel in
a 1-D FFT is a function of every input pixel to that 1-D FFT.
- Communication and agglomeration steps would show that it's best
to perform each 1-D FFT entirely in the same task.
- Partition all images by rows. This works perfectly for the row FFT's
but is poor for the column FFT's.
- Partition all images by columns. See above.
- Duplicate the input image across all processors. No benefit arises
because row-partitioning already provides communication-free row FFT's
and the input image is not used for the column FFT's.
- Duplicate the tmp image on all processors; this would require either
replicating the 1-D row FFT computations in all tasks (effectively
eliminating parallelism for this phase) or gather/broadcasting the tmp
image to all tasks (which could take longer than the row FFT
computation itself): SP2
Communication Performance Summary
- Functional decomposition: define a task for each row FFT and each
column FFT.
- Use row partitioning for the row FFT's.
- Use column partitioning for the column FFT's.
- Repartition the tmp image in between 1-D FFT phases.
- Advantage: both the row and column FFT tasks can proceed without
communication.
- Disadvantage: the repartitioning of the tmp image will introduce an
AlltoAll communication overhead:SP2 Communication Performance
Summary
Communication: the communication step was useful for eliminating many
bad partitions. For the final partition, communication is required between
the 1-D FFT phases, but not during the 1-D FFT's.
Agglomeration/Mapping: This step, and the mapping step, are simple
for this example.
- The computational run time is the same for each row or column FFT.
- No communication is required during the row and column FFT's.
- We fully agglomerate N/P row and column FFT's per task.
Our second case study is a computational fluid dynamics example in two
dimensions. In this example, we'll be simulating airflow over a wing. More
specifically, this code calculates the solution of a compressible, inviscid,
non-conducting flow over a body in two dimensions.
The code solves the 2D, compressible Euler equations on unstructured
triangular grids:
The Euler equations, where U is the vector of conserved variables. A and B
are the Jacobian matrices defined in the program.
The algorithm approximately solves the Euler equations over the domain by
solving the discretized equations on the unstructured grid, or mesh. The
values for density, x-momentum, y-momentum, and total energy are stored at
the vertices of the mesh.
Basically, the field values at a given timestep are checked to see if they
satisfy the discretized Euler equations. If not, then they are adjusted and
checked again. This iterative process continues until the field values
satisfy the discretized Euler equations to a specified tolerance.
The actual code is available here.
2-D CFD: Computational Elements
The wing is divided into a triangular mesh. Each triangle is called a
cell. Each cell has 3 vertices or nodes. Click
here for an illustration of the 2-D mesh.
The "residual" is calculated at each cell using field values held at the
vertices of that cell.
This fluctuation is then split-up and distributed to the vertices of the
cell.
The sum of the fluctuation contributions from all the cells that surround a
vertex make up the nodal fluctuation for that vertex.
Once the code has looped through all of the cells, the nodal residuals will
have been calculated for each vertex, and the vertex values can be updated
for the next timestep.
2-D CFD: Useful Information
The grid is represented by a vector of N elements; each element contains the
X and Y coordinates of the vertices, along with other data. The actual
vertices appear in no particular order in this vector.
CFD grids commonly have from thousands to millions of cells.
Convergence is typically reached after 1,000 to 100,000 time iterations.
Each cell needs approximately 400 floating-point operations per time
iteration.
CFD Parallelization
Identify the Computation Hotspots: visual inspection of the code
shows that the main loop in which the values (density, x-dir momentum, y-dir
momentum and total energy) are calculated for each cell are the
computationally expensive parts of the algorithm. File I/O, pre-, and
post-processing steps will remain serial.
Partitioning: The cell is a natural element for data-partitioning:
- Each cell is essentially the same as every other cell.
- The computation performed at each cell is essentially the same as
at every other cell.
- No obvious larger, or smaller, data partition elements appear to
exist.
We partition the unstructured 2-D grid into distinct cells.
Communication: each cell only needs data from its vertices. Each
vertex only needs data from vertices that it is directly connected to.
Therefore, the 2-D grid itself already
identifies communication links.
Agglomeration: at this point we have a separate task for each cell
and far more cells than processors. We must agglomerate to larger tasks.
- We attempt to fully agglomerate cells/vertices to P separate tasks.
- The total number of tasks is fixed.
- The computation requirements of each task is the same.
- The communication requirements of each task is the same.
- Since cells/vertices only need to communicate with cells/vertices they
are directly adjacent to, we group physically close cells/vertices
into the same agglomeration (simply putting the first N/P
cells/vertices in task 0, etc. would not necessarily agglomerate
physically close cells/vertices into the same task).
- We expect to reap surface-to-volume benefits due to the local nature
of communication requirements.
- Dividing the physical 2-D space into equal sized blocks would generate
an uneven load balance; the amount of cells/vertices per unit area of
physical space is widely varying.
- We need to agglomerate cells/vertices such that the amount of cells per
task is the same, and communication requirements between tasks is
minimized.
- We choose a commercial package, METIS, which uses a recursive
bisection technique to divide the grid into halves, halves into
quarters, etc.
- The agglomerated tasks are illustrated
here.
Mapping: the mapping step is essentially done at this point; we fully
agglomerated the N cell/vertices into P separate tasks.
This tutorial draws heavily from the book Designing and Building Parallel
Programs by Ian Foster; the 5-step parallelization method, and many of the
figures were borrowed from this book. Additional information can be
obtained Here.
© Copyright 1996
Maui High Performance Computing Center. All rights reserved.
Documents located on the Maui High Performance Computing Center's WWW server
are copyrighted by the MHPCC. Educational institutions are encouraged to
reproduce and distribute these materials for educational use as long as
credit and notification are provided. Please retain this copyright notice
and include this statement with any copies that you make. Also, the MHPCC
requests that you send notification of their use to help@@mail.mhpcc.edu.
Commercial use of these materials is prohibited without prior written
permission.
Author: September 1996 George Gusciora
Revised: