MHPCC

SP Parallel Programming Workshop
Lab Exercises and Examples


© Copyright Statement

Table of Contents
  1. Exercises
    1. Parallel Operating Environment (POE)
    2. Message Passing Interface (MPI)
    3. Parallel Virtual Machine (PVM)
    4. Visualization Tool (VT)
    5. Program Marker Array
    6. XPVM
    7. High Performance Fortran (HPF)
    8. LoadLeveler
    9. Mass Storage System
    10. FORGE XHPF version 2.0 (MPL)
    11. FORGE XHPF version 2.1 (MPI)
    12. Message Passing Library (MPL)

  2. Exercise Codes
    1. Array Decomposition
    2. Matrix Multiplication
    3. pi Calculation
    4. Concurrent Wave Equation
    5. 2D Heat Equation Domain Decomposition
    6. Round Trip Timing Test
    7. Bandwidth Timing Test
    8. Prime Number Generation
    9. Computational Fluid Dynamics
    10. 2D FFT

  3. Other Example Codes
    1. Image Rotation
    2. Mandelbrot Image

  4. Acknowledgements and References

Exercise Codes


Exercise Codes
Array Decomposition

DESCRIPTION:

This is the "hello world" of parallel programming. It is a simple array decomposition used to demonstrate the distribution of data among multiple tasks and the communications required to accomplish it. In the parallel version, the master task distributes an equal portion of the array to each worker task. Each worker task receives its portion of the array and performs a simple value assignment to each of its elements. Each worker then sends its portion of the array back to the master. As the master receives back each portion of the array selected elements are displayed.

FILES:


Exercise Codes
Matrix Multiplication

DESCRIPTION:

This example is a simple matrix multiply program. In the parallel version, the data is distributed among the worker tasks who perform the actual multiplication and send back their respective results to the master task.

Note that the C and Fortran versions of this code differ because of the way arrays are stored/passed. C arrays are row-major order but FORTRAN arrays are column-major order.

FILES:

Exercise Codes
pi Calculation

DESCRIPTION:

This program calculates pi using a "dartboard" algorithm. See Fox et al.(1988) Solving Problems on Concurrent Processors, vol.1 page 207. In the parallel version, all processes contribute to the calculation, with the master averaging the values for pi.

The MPL and MPI versions have two variations, one which demonstrates point-to-point communications and one that demonstrates collective communications.

FILES:


Exercise Codes
Concurrent Wave Equation

DESCRIPTION:

This program implements the concurrent wave equation described in Chapter 5 of Fox et al., 1988, Solving Problems on Concurrent Processors, vol 1.

A vibrating string is decomposed into points. In the parallel version, each processor is responsible for updating the amplitude of a number of points over time. At each iteration, each processor exchanges boundary points with their nearest neighbors.

An X based display of the final wave is provided for the parallel versions of this code.

FILES:


Exercise Codes
2D Heat Equation

DESCRIPTION:

This example is based on a simplified two-dimensional heat equation domain decomposition. The initial temperature is computed to be high in the middle of the domain and zero at the boundaries. The boundaries are held at zero throughout the simulation. During the time-stepping, an array containing two domains is used; these domains alternate between old data and new data.

The PVM and MPL versions provide an X based display of the initial and final temperature distributions.

FILES:

  • MPI
  • PVM3
  • HPF
  • MPL
  • LINDA

    Exercise Codes
    Round Trip Timing Test

    DESCRIPTION:

    This example demonstrates a roundtrip, point-to-point communication timing test. A set number of messages are sent between the master and worker tasks. Before and after timings are made for each message and an average is calculated when completed.

    FILES:


    Exercise Codes
    Bandwidth Timing Test

    DESCRIPTION:

    This code demonstrates a point-to-point, bandwidth communications test. It begins by asking the user to supply the following parameters:
    • starting message size (J)
    • finish message size (K)
    • message increment size (I)
    • number of round trips per iteration (N)
    It then sends/receives N roundtrips of incrementally sized messages from start size J to finish size K by increment I. The bandwidth for each message size is calculated as the average of N round trips.

    FILES:


    Exercise Codes
    Prime Number Generator

    DESCRIPTION:

    This program generates primes. The approach taken is a "brute force" method which requires increasingly greater amounts of cpu as the problem size increases. It lends itself well to an embarassingly parallel solution since each prime can be computed independently of all other primes.

    FILES:


    Exercise Codes
    Computational Fluid Dynamics (CFD)

    DESCRIPTION:

    This is a Computational Fluid Dynamics code for a compressible, inviscid, non-conducting flow in two dimensions. It calculates the steady-state flow field around a body as long as the flow remains subsonic throughout. This code will NOT handle transonic, or supersonic flow regimes. This means that the freestream Mach number must be kept low enough to ensure that the flow does not cross the sonic point anywhere.

    The input file is: input.dat

    More specifically, the code solves the 2D, compressible Euler equations on unstructured triangular grids.

    
    U_t + AU_x + BU_y = 0     
    
    The Euler equations, where U is the vector of conserved variables. A and B are the Jacobian matrices defined below in the program.

    The algorithm is a generalization of the Lax-Wendroff scheme. It is based upon the idea of fluctuation-splitting, which is the approach of computing cell-based residuals which are then decomposed and distributed to the vertices where the node-based residuals are then accumulated for updating the field values for the next timestep. Thus, the scheme is a cell-vertex scheme, but the basic computational element is the cell.

    The algorithm approximately solves the Euler equations over the domain by solving the discretized equations on the unstructured grid, or mesh. The values for density, x-momentum, y-momentum, and total energy are stored at the vertices of the mesh. Basically, the field values at a given timestep are checked to see if they satisfy the discretized Euler equations. If not, then they are adjusted. Then, they are checked again. This iterative process continues until the field values satisfy the discretized Euler equations to a specified tolerance.

    The way the field values are adjusted is the core of the algorithm. By looping over each cell in the mesh, a computation is done using the field values held at the vertices of that cell. This computation produces a quantity called the cell residual, or fluctuation. This fluctuation is then split-up and distributed to the vertices of the cell. The sum of the fluctuation contributions from all the cells that surround a vertex make up the nodal fluctuation, or nodal residual for that vertex. Once the code has looped through all of the cells, the nodal residuals will have been calculated for each vertex, and the vertex values can be updated for the next timestep.

    The alogorithm is parallelized by giving each processor a set of cells to work on. A given vertex may be part of several different cells, and these cells may be located on different processors. Thus, communication will be necessary between processors that have cells which share vertices.

    The unstructured grids were produced using DELAUNDO which was developed and written by Jens Muller of the W.M. Keck Foundation CFD Laboratory at the University of Michigan.

    FILES:


    Exercise Codes
    2D FFT

    DESCRIPTION:

    A 512x512 complex matrix is initialized with a point source and then decomposed and distributed to each processor in the partition (by rows for C; by columns for Fortran). Each processor then performs a one-dimensional FFT on it's portion of the matrix which is stored locally. Each processor transposes it's portion of the matrix and performs an "All to all" distribution to the other processors in the partition. This now partitions the intermediate matrix by columns for C; rows for Fortran. Each processor then performs a one-dimensional FFT on it's portion of the matrix. Finally, the columns/rows of the matrix are gathered back at the destination processor and timing and Mflop results are displayed. C programs perform an additional test for correctness.

    Note: A straightforward unsophisticated 1D FFT kernel is used. It is sufficient to convey the general idea, but be aware that there are better 1D FFTs available on many systems.

    FILES:


    Other Example Codes

    This sections contains some miscellaneous examples. Only the parallelized code is available and usually in only one language and one parallel tool. Currently, it contains two image processing examples using X-Windows.

    Other Examples
    Image Rotation

    DESCRIPTION:

    This example is written in C using MPL and MPI. An image is read from disk and repeatedly rotated by one degree increments. After each image rotation the screen is updated.

    A single processor reads the input image from disk and broadcasts to all the other processors in the system. Each processor allocates space for a contiguous swath of rows in the output image. Processors then calculate the coordinates of the source input image pixel for each pixel in this output image swath. Individual swaths of the output image are then collected at the destination processor. The destination processor then displays the image on the local X server.

    Note: When running this program make sure that your DISPLAY environment variable is set and that the MP_PROCS environment variable is set to 4.

    FILES:


    Other Examples
    Mandelbrot Image

    DESCRIPTION:

    This program generates and Mandelbrot image and displays it using X Windows. It is written in C using MPL and MPI.

    Each processor calculates the image pixels for a swath of rows in the output image. These swaths of rows are then collected at a single processor and displayed on the screen.

    Click here to see a larger version.

    FILES:


    Acknowledgements and References


    © Copyright 1995 Maui High Performance Computing Center. All rights reserved.

    Documents located on the Maui High Performance Computing Center's WWW server are copyrighted by the MHPCC. Educational institutions are encouraged to reproduce and distribute these materials for educational use as long as credit and notification are provided. Please retain this copyright notice and include this statement with any copies that you make. Also, the MHPCC requests that you send notification of their use to help@mail.mhpcc.edu.

    Commercial use of these materials is prohibited without prior written permission.

    Revised: 26 September 1996 blaise@mhpcc.edu