MHPCC

SP Parallel Programming Workshop
Introduction to Parallel Programming


© Copyright Statement

Table of Contents
  1. Overview
    1. What is Parallelism?
    2. Sequential Programming
    3. The Need for Faster Machines
    4. Parallel Computing
    5. Parallel Programming Overview

  2. Architecture Taxonomy
    1. SISD Model
    2. SIMD Model
    3. MIMD Model

  3. Processor Communication
    1. Shared Memory
    2. Distributed Memory
    3. Memory Hierarchies
    4. Communications Network on the SP2

  4. Parallel Programming Paradigms
    1. Various Methods
    2. Message Passing
    3. Data Parallel
    4. Implementations
    5. Paradigm Comparisons
      1. Maturity
      2. Ease of programming
      3. Flexibility
      4. Efficiency
      5. Scalability
      6. Portability
      7. I/O
      8. Cost

  5. Steps for Creating a Parallel Program
    1. Communication
      1. Point to Point
      2. One to All Broadcast
      3. All to All Broadcast
      4. One to All Personalized
      5. All to all Personalized
      6. Shifts
      7. Collective Computation

  6. Design and Performance Considerations
    1. Amdahl's Law
    2. Load Balancing
    3. Granularity
    4. Data Dependency
    5. Communication Patterns and Bandwidth
    6. I/O Patterns
    7. Machine Configuration
    8. Fault Tolerance and Restarting
    9. Deadlock
    10. Debugging
    11. Performance Monitoring and Analysis

  7. Parallel Examples
    1. Essentials of Loop Parallelism
    2. Calculating PI
      1. Serial Problem Description
      2. Parallel Solution
    3. Calculating Array Elements
      1. Serial Problem Description
      2. Data Parallelism
      3. Pool of Tasks
      4. Load Balancing and Granularity
    4. Simple Heat Equation
      1. Serial Problem Description
      2. Data Parallelism
      3. Overlapping Communication and Computation
    5. Application Case Study

  8. References, Acknowledgements, WWW Resources


Overview
What is Parallelism?

A strategy for performing large, complex tasks faster.

A large task can either be performed serially, one step following another, or can be decomposed into smaller tasks to be performed simultaneously, i.e., in parallel.

Parallelism is done by:

Parallel problem solving is common. Examples: building construction; operating a large organization; automobile manufacturing plant

The automobile analogy.


Overview
Sequential Programming

Traditionally, programs have been written for serial computers:


Overview
The Need for Faster Machines

You might think that one instruction executed in 9 billionths of a second would be fast enough. You'd be wrong.

There are several classes of problems that require faster processing:

  • Grand Challenge Problems:


    Overview
    Parallel Computing

    Traditional Supercomputers

    Technology

    Benefits

    Limitations

    Parallel Supercomputers

    Technology

    Benefits

    Limitations


    Overview
    Parallel Computing (cont)

    Parallel computing requires:


    Overview
    Parallel Programming


    Architecture Taxonomy


    Architecture Taxonomy
    SISD Model

    Single Instruction, Single Data Stream

    Automobile analogy


    Architecture Taxonomy
    SIMD Model

    Single Instruction, Multiple Data Stream

    Automobile analogy


    Architecture Taxonomy
    SIMD Model
    Vector SIMD

    Single instruction results in multiple operands being updated


    Architecture Taxonomy
    SIMD Model
    Parallel SIMD


    Architecture Taxonomy
    MIMD Model

    Multiple Instructions, Multiple Data

    Automobile analogy


    Processor Communications


    Processor Communications
    Shared Memory


    Processor Communications
    Distributed Memory


    Processor Communications
    Memory Hierarchies


    Processor Communications
    Communications Network on the SP2


    Parallel Programming Paradigms
    Various Methods

    There are many methods of programming parallel computers. Two of the most common are message passing and data parallel.

    Note: these models are machine/architecture independent, any of the models can be implemented on any hardware given appropriate operating system support.
    An effective implementation is one which closely matches its target hardware and provides the user ease in programming.


    Parallel Programming Paradigms
    Message Passing

    The message passing model is defined as:
    Programming with message passing is done by linking with and making calls to libraries which manage the data exchange between processors. Message passing libraries are available for most modern programming languages.


    Parallel Programming Paradigms
    Data Parallel

    The data parallel model is defined as:


    Programming with data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.

    The compiler converts the program into standard code and calls to a message passing library to distribute the data to all the processes.


    Parallel Programming Paradigms
    Implementations


    Parallel Programming Paradigms
    Message Passing Interface


    Parallel Programming Paradigms
    Parallel Virtual Machine


    Parallel Programming Paradigms
    Message Passing Library


    Parallel Programming Paradigms
    F90 / High Perfomance Fortran


    Parallel Programming Paradigms
    Paradigm Comparisons

    Before choosing a parallel programming paradigm and a particular implementation there are many issues to be considered. Some of the most important are:


    Parallel Programming Paradigm
    Maturity

    The maturity of the compiler or message passing library is a major concern when choosing a paradigm and a particular implementation. Developers creating research codes may wish ease in development and a variety of functionality. Developers creating production codes might be most concerned with the level of support or lack of bugs in the tools.


    Parallel Programming Paradigm
    Ease of Programming

    Ability to develop robust software quickly and efficiently is a major concern to industry. Much of the software effort is spent in maintenance. The paradigm must make it easy to implement algorithms and maintain the resulting code.


    Parallel Programming Paradigm
    Flexibility

    Several approaches to making parallel code.

    Developers choose an approach that:

    They choose from:

    Not all paradigms support all approaches.


    Parallel Programming Paradigm
    Efficiency

    As in all of computing the further you get from the systems level the less efficient you get. Over the years serial compilers have gotten very efficient.


    Parallel Programming Paradigm
    Scalability

    Both Data Parallel and Message passing are solutions for scalable parallel programming.


    Parallel Programming Paradigm
    Portability

    To ensure that the software you develop is portable to other architectures be sure to choose a standard.


    Parallel Programming Paradigm
    I/O

    One area of Parallel Programming that is still very immature is Input / Output (I/O) support. I/O support is dependent not on a particular paradigm as much as a particular implementation.

    It is important to understand your I/O requirements and choose a message passing library or data parallel compiler which supports your needs.


    Parallel Programming Paradigm
    Cost

    Cost is always an important factor when choosing any software tool.


    Steps for Creating a Parallel Program

    1. If you are starting with an existing serial program, debug the serial code completely

    2. Identify the parts of the program that can be executed concurrently:

      • Requires a thorough understanding of the algorithm

      • Exploit any inherent parallelism which may exist.

      • May require restructuring of the program and/or algorithm. May require an entirely new algorithm.

    3. Decompose the program:

      • Functional Parallelism
      • Data Parallelism
      • Combination of both

    4. Code development

      • Code may be influenced/determined by machine architecture

      • Choose a programming paradigm

      • Determine communication

      • Add code to accomplish task control and communications

    5. Compile, Test, Debug

    6. Optimization

      • Measure Performance
      • Locate Problem Areas
      • Improve them


    Parallel Programming Steps
    Decomposing the Program

    There are three methods for decomposing a problem into smaller tasks to be performed in parallel: Functional Decomposition, Domain Decomposition, or a combination of both


    Parallel Programming Steps
    Communication

    Understanding the interprocessor communications of your program is essential. The types of communications for message passing and data parallel are exactly the same. In fact most data parallel compilers simply use one of the standard message passing libraries to achieve data movement.

    Communications on distributed memory computers:


    Communication
    Point to Point

    The most basic method of communication between two processors is the point to point message. The originating processor "sends" the message to the destination processor. The destination processor then "receives" the message.

    The message commonly includes the information, the length of the message, the destination address and possibly a tag.

    Typical message passing libraries subdivide the basic sends and receives into two types:


    Communication
    One to All Broadcast

    A node may have information which all the others require. A broadcast is a message sent to many other nodes.

    A One to All broadcast occurs when one processor sends the same information to many other nodes.


    Communication
    All to All Broadcast

    With an All to All broadcast each processor sends its unique information to all the other processors.


    Communication
    One to All Personalized

    Personalized communication send a unique message to each processor.

    In One to All personalized communication one processor sends a unique message to every other processor.


    Communication
    All to All Personalized

    In All to All Personalized communication each processor sends a unique message to all other processors.


    Communications
    Shifts

    Shifts are permutations of information. Information is exchanged in one logical direction or the other. Each processor exchanges the same amount of information with its neighbor processor.

    There are two types of shifts:


    Communication
    Collective Computation

    In collective computation (reductions), one member of the group collects data from the other members. Commonly a mathematical operation like a min, max, add, multiple etc. is performed.


    Design and Performance Considerations
    Amdahl's Law


    Design and Performance Considerations
    Load Balancing


    Design and Performance Considerations
    Granularity


    Design and Performance Considerations
    Data Dependency


    Design and Performance Considerations
    Communication Patterns and Bandwidth


    Design and Performance Considerations
    I/O Patterns


    Design and Performance Considerations
    Machine Configuration


    Design and Performance Considerations
    Fault Tolerance and Restarting


    Design and Performance Considerations
    Deadlock


    Design and Performance Considerations
    Debugging


    Design and Performance Considerations
    Performance Monitoring and Analysis


    Parallel Examples
    Essentials of Loop Parallelism

    Some concrete problems will help illustrate the methods of parallel programming and the design and performance issues involved.

    Each of the problems has a loop construct that forms the main computational component of the code. Loops are a main target for parallelizing and vectorizing code. A program often spends much of its time in loops. When it can be done, parallelizing these sections of code can have dramatic benefits.

    A step-wise refinement procedure for developing the parallel algorithms will be employed. An initial solution for each problem will be presented and improved by considering performance issues.

    Pseudo-code will be used to describe the solutions. The solutions will address the following issues:

    Note the difference in approaches between message passing and data parallel programming. Message passing explicitly parallelizes the loops where data parallel replaces loops by working on entire arrays in parallel.


    Calculating PI
    Serial Problem Description


    Calculating PI
    Parallel Solutions

    Message passing solution:

    Data parallel solution:


    Calculating Array Elements
    Serial Problem Description


    Calculating Array Elements
    Parallel Solutions

    Message Passing

    Data Parallel


    Calculating Array Elements
    Pool of Tasks


    Calculating Array Elements
    Load Balancing and Granularity


    Simple Heat Equation
    Serial Problem Description


    Simple Heat Equation
    Data Parallelism

    Data Parallel

  • Loops are not used. The entire array is processed in parallel.

  • The distribute statements layout the data in parallel.

  • A SHIFT is used to increment or decrement an array element.
    DISTRIBUTE u1 (block,cyclic)
    DISTRIBUTE u2 (block,cyclic)
     
    u2 = u1 +  
     cx * (SHIFT (u1,1,dim 1) + SHIFT (u1,-1,dim 1) - 2.*u1) + 
     cy * (SHIFT (u1,1,dim 2) + SHIFT (u1,-1,dim 2) - 2.*u1)
    
    
    


    Simple Heat Equation
    Overlapping Communication and Computation


    References, Acknowledgements, WWW Resources

    References and Acknowledgements


    © Copyright 1995 Maui High Performance Computing Center. All rights reserved.

    Documents located on the Maui High Performance Computing Center's WWW server are copyrighted by the MHPCC. Educational institutions are encouraged to reproduce and distribute these materials for educational use as long as credit and notification are provided. Please retain this copyright notice and include this statement with any copies that you make. Also, the MHPCC requests that you send notification of their use to help@mail.mhpcc.edu.

    Commercial use of these materials is prohibited without prior written permission.

    Last revised: 02 July 1996 Blaise Barney