MHPCC

SP Parallel Programming Workshop
Parallel Operating Environment (POE)



Table of Contents

  1. Prerequisites

  2. What is the Parallel Operating Environment?

  3. POE Definitions

  4. Executing Parallel Programs Using the POE
    1. Setting your Path
    2. Creating a .rhosts File
    3. Compiling and Linking a Parallel Program
    4. Setting Environment Variables
    5. Some Preset POE Environments
    6. Creating a Host List File
    7. Invoking the Executable

  5. CPU and Communications Adapter Usage

  6. Miscellaneous Environment Variables

  7. Parallel File Copy Utilities

  8. Recommendations for Running on the MHPCC SP2

  9. References, Acknowledgements, WWW Resources

  10. Exercises

Prerequisites


What is the Parallel Operating Environment?


POE Definitions

Partition
The group of processor nodes on which you run your program is called your partition. There may be multiple active partitions for multiple users across the SP system.

Partition Manager
The Partition Manager establishes and controls your partition. It consists of a set of subroutines that are linked into your program and an Internet daemon process called pmd2. The Partition Manager is responsible for:

Resource Manager
The Resource Manager keeps track of all nodes and all POE jobs on an SP system. When your job requests nodes through your Partition Manager process, the Resource Manager allocates nodes for your use. It attempts to enforce a "one parallel task per node" rule. Generally, there is only one Resource Manager process for an entire SP system. The Resource Manager may also be referred to as "JM" or Job Manager.

The jm_status command allows you to query the Resource Manager for information about SP pools, nodes and POE jobs currently running on the system:

     
    jm_status -Pv      - verbose list of all nodes and POE jobs
    jm_status -j       - list of all POE jobs 
    

Specific Node Allocation
Refers to the user explicitly choosing the nodes on which the job will run. This is most typically done by placing the node names in a host list file.

Non-Specific Node Allocation
Refers to the Resource Manager automatically selecting the nodes which will be used to run a user job.

Home Node
The SP node where you start your POE job. Generally, this is the SP node where you are currently logged in.

Remote Node
Any other non-home node in your partition.

RM Agent
A pmd2 process (called the Partition Manager daemon) is automatically started for each remote task in the partition. It receives the user's environment, validates the user, and executes the program requested for the particular task. The daemon terminates when the partition is released.

Processor Pool
The local systems administrator may divide the processor nodes into disjoint pools of processors for management purposes. If this is the case, you may then request that your parallel tasks run on specific pools. You may determine your system's pools and associated nodes by using the command:
    
    jm_status -P 
    

Communication SubSystem (CSS)
The Communication SubSystem (CSS) is the set of library routines which implement one of the two protocols for message passing communications between SP nodes. The two different protocols are User Space Protocol and Internet Protocol.

User Space Protocol
The fastest method of communication between nodes. Can be used only with the high performance switch. Often referred to simply as US protocol.

Internet Protocol
A slower method of communication between nodes. Can be used with ethernet or the high performance switch. Often referred to simply as IP protocol.

Executing Parallel Programs Using POE

In order to execute a parallel program, you need to:

  1. Set your path to include the necessary POE executables.

  2. Create a .rhosts file

  3. Compile and link the program using one of the POE compile scripts.

  4. Set up your execution environment by setting the necessary POE environment variables.

  5. Create a host list file (optional)

  6. Invoke the executable

Setting Your Path


Creating a .rhosts File


Compiling and Linking a Parallel Program


Setting Environment Variables


Some Preset POE Environments


The Host List File


Invoking the Executable

Once the environment is set up and the executables are created, invoking the executables is relatively easy.

  • For serial programs or commands, use the poe command followed by your program name or the command you wish to run across your partition. For example:
    
        poe  cp ~/input.file  /tmp/input.file
        poe  my_serial_job
        poe  rm /tmp/input.file
    

    CPU and Communications Adapter Usage


    Miscellaneous Environment Variables

    A complete list of the POE environment variables can be viewed in the POE man page. A list of the more commonly used "miscellaneous" variables appears below.

    MP_CPU_USE
    Specifies CPU usage. Valid values are "unique" and "multiple". The default for IP communications is "multiple". US communications defaults are "multiple" if specific node allocation is used, "unique" otherwise.
    
        C Shell:    setenv MP_CPU_USE multiple
        Korn Shell: export MP_CPU_USE=multiple
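The default rule described above can be written out as a small decision function. This is only a sketch of the rule as stated in the text, not POE's own logic; the function name default_cpu_use and its two arguments are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: prints the MP_CPU_USE default described in the text.
#   $1 = communication library, "ip" or "us" (as in MP_EUILIB)
#   $2 = node allocation, "specific" or "nonspecific"
default_cpu_use() {
  if [ "$1" = "ip" ]; then
    echo multiple          # IP communications default to multiple
  elif [ "$2" = "specific" ]; then
    echo multiple          # US with specific node allocation
  else
    echo unique            # US with non-specific node allocation
  fi
}

default_cpu_use ip nonspecific
default_cpu_use us specific
default_cpu_use us nonspecific
```

The three calls cover the three cases in the description; only a User Space job with non-specific node allocation ends up with "unique".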
        

    MP_RETRY
    The period of time (in seconds) between processor node allocation retries if there are not enough processor nodes immediately available.
    
        C Shell:    setenv MP_RETRY 10
        Korn Shell: export MP_RETRY=10
        

    MP_RETRYCOUNT
    The number of times that the Partition Manager should attempt to allocate processor nodes before returning without running your program.
    
        C Shell:     setenv MP_RETRYCOUNT 15
        Korn Shell:  export MP_RETRYCOUNT=15
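MP_RETRY and MP_RETRYCOUNT work together: one sets the interval, the other the number of attempts. The Korn shell sketch below estimates how long POE might keep trying with the example values shown above. It assumes the total wait is simply the interval multiplied by the number of attempts and ignores the time each allocation request takes; the variable MAX_WAIT is introduced here for illustration only.

```shell
# Example values taken from the MP_RETRY and MP_RETRYCOUNT settings above.
export MP_RETRY=10        # seconds between allocation attempts
export MP_RETRYCOUNT=15   # number of allocation attempts

# Assumption: waiting time is roughly interval * attempts.
MAX_WAIT=$((MP_RETRY * MP_RETRYCOUNT))
echo "POE would retry for roughly ${MAX_WAIT} seconds before giving up"
```

With these values the estimate is about 150 seconds, which may help when choosing settings for a busy interactive pool.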
        

    MP_SAVEHOSTFILE
    The name of an output host list file to be generated by the Partition Manager.
    
        C Shell:     setenv MP_SAVEHOSTFILE progname.hosts.used
        Korn Shell:  export MP_SAVEHOSTFILE=progname.hosts.used
        

    MP_PGMMODEL
    Determines the programming model you are using. Valid values are "spmd" and "mpmd". If not set, the default is "spmd". If set to "mpmd", you can load different executables individually on the nodes of your partition.
    
        C Shell:     setenv MP_PGMMODEL mpmd
        Korn Shell:  export MP_PGMMODEL=mpmd
        

    MP_CMDFILE
    Determines the name of a POE commands file used to load the nodes of your partition. If set, POE will read the commands file rather than STDIN. Valid values are any file specifier. Generally used only when MP_PGMMODEL=mpmd.
    
        C Shell:     setenv MP_CMDFILE mpmd.hosts
        Korn Shell:  export MP_CMDFILE=mpmd.hosts
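As a rough illustration of how MP_PGMMODEL and MP_CMDFILE fit together, the Korn shell sketch below builds a commands file and sets both variables. The layout of the file (one executable name per line, loaded onto the tasks in order) and the need for MP_PROCS to match the number of lines are assumptions to verify against the POE man page. The program names "master" and "worker" and the file location are hypothetical.

```shell
# Sketch only: program names "master" and "worker" are hypothetical.
CMDFILE="${TMPDIR:-/tmp}/mpmd.cmds.$$"

# Assumed layout: one executable per line, loaded onto the tasks in order.
cat > "$CMDFILE" <<'EOF'
master
worker
worker
worker
EOF

export MP_PGMMODEL=mpmd
export MP_CMDFILE="$CMDFILE"

# Assumption: MP_PROCS should match the number of lines in the commands file.
MP_PROCS=$(wc -l < "$CMDFILE" | tr -d ' ')
export MP_PROCS

echo "MP_PGMMODEL=$MP_PGMMODEL MP_CMDFILE=$MP_CMDFILE MP_PROCS=$MP_PROCS"
```

Because POE reads the commands file rather than STDIN when MP_CMDFILE is set, the job would presumably then be started with the poe command and no program name.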
        

    MP_EUIDEVELOP
    Causes MPL to do more detailed checking during program execution.
    
        C Shell:     setenv MP_EUIDEVELOP yes
        Korn Shell:  export MP_EUIDEVELOP=yes
        

    MP_STDOUTMODE
    Enables you to manage the STDOUT from your parallel tasks. If set to "unordered", all tasks write output data to STDOUT asynchronously. If set to "ordered", output data from each parallel task is written to its own buffer; later, all buffers are flushed to STDOUT in task order. If a task id is specified, only the indicated task writes output data to STDOUT. The default is "unordered".
    
        C Shell:     setenv MP_STDOUTMODE ordered
        Korn Shell:  export MP_STDOUTMODE=ordered
                -or-
        C Shell:     setenv MP_STDOUTMODE unordered
        Korn Shell:  export MP_STDOUTMODE=unordered
                -or-
        C Shell:     setenv MP_STDOUTMODE 6
        Korn Shell:  export MP_STDOUTMODE=6
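The buffering behind "ordered" mode can be pictured with a small shell emulation. No POE is involved here: three background subshells stand in for parallel tasks, each writing to its own buffer file, and the buffers are then flushed in task order. The directory and file names are hypothetical, and this is a sketch of the idea rather than a description of how POE implements it.

```shell
# Emulation sketch: three background subshells stand in for parallel tasks.
BUFDIR=$(mktemp -d)

for task in 0 1 2; do
  (
    # Each "task" writes only to its own buffer file, as in ordered mode.
    echo "task $task: line 1" >  "$BUFDIR/task.$task"
    echo "task $task: line 2" >> "$BUFDIR/task.$task"
  ) &
done
wait   # all tasks have finished writing their buffers

# Flush the buffers in task order, whatever order the tasks finished in.
ORDERED_OUTPUT=$(for task in 0 1 2; do cat "$BUFDIR/task.$task"; done)
echo "$ORDERED_OUTPUT"
rm -rf "$BUFDIR"
```

In "unordered" mode the equivalent would be every task writing straight to the terminal, so lines from different tasks could interleave in any order.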
        

    MP_INFOLEVEL
    Determines the level of message reporting. Default is 1. Valid values are:
      0 = error
      1 = warning and error
      2 = informational, warning, and error
      3 = informational, warning, and error. Also reports diagnostic messages for use by the IBM Support Center.
      4,5,6 = informational, warning, and error. Also reports high- and low-level diagnostic messages for use by the IBM Support Center.

    Note that use of this feature can consume significant system resources and should not be set above zero routinely.

    
        C Shell:     setenv MP_INFOLEVEL 2
        Korn Shell:  export MP_INFOLEVEL=2
        

    SP_NAME
    Specifies the name of the Control Workstation. This variable is used by the System Status Array tool.

    Parallel File Copy Utilities

    • POE provides four utilities which may be used to copy file(s) to and from a number of nodes.

    • Three of these utilities (mcp, mcpscat, mcpgath) are actually message passing applications designed for efficiency. One utility (mprcp) is a shell script which uses the standard UNIX utility rcp to perform the copy operations and may not be as efficient.

      Note: see the associated hyperlinked man page for examples of each utility's use.

      mcp
      Copies a single file from the home node to a number of remote nodes.

      mcpscat
      Copies a number of files from task 0 and scatters them in sequence to all tasks, in round robin order.

      mcpgath
      Copies a number of files from all tasks back to task 0.

      mprcp
      Copies a file from the home node to a list of remote hosts.

    Recommendations for Running on the MHPCC SP2

    • Set MP_PROCS to the number of nodes you wish to use.

    • Inform the Resource Manager about your job (MP_RESD = yes).

    • Use dynamic linking of the communication libraries instead of static linking. This will give you more flexibility at run time - you can alternate between IP and US communications simply by changing the MP_EUILIB variable.

    • Always use the high performance switch for SP communications (MP_EUIDEVICE = css0).

    • Write scratch files to local scratch space whenever possible. At the MHPCC, each node has a /localscratch directory - it is faster than writing scratch files to your home directory.

    • Interactive Users:

      • Always use IP communications (MP_EUILIB = ip). This is "good neighbor" use of the limited number of interactive nodes. It will permit other POE jobs to run at the same time.

      • If you need (for testing purposes) to run with US communications in the Interactive pool, be sure to share the CPU with other users (MP_CPU_USE = multiple). If this is not done, even IP jobs cannot run on the same node as your job.

      • Unless there is a particular reason for selecting certain nodes, do not use a host list file. Allow the Resource Manager to automatically allocate your nodes (MP_HOSTFILE = NULL or "").

      • During program development it may be helpful to set MP_INFOLEVEL to a higher value (>3).

      • If you are having trouble getting the requested number of nodes for your job, be sure that you are using IP communications (MP_EUILIB = ip). If you still have problems, check what other users are doing in the Interactive pool with the command: jm_status -Pv | more . Most likely, somebody is running a User Space job without MP_CPU_USE set to multiple. If that is the case, notify the MHPCC User Services staff by sending email to "help@mhpcc.edu".
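The interactive recommendations above can be collected into one Korn shell setup, sketched below. The values are illustrative only; in particular, the node count of 4 is an arbitrary example. C shell users would use the equivalent setenv form for each line.

```shell
# Korn shell sketch of the interactive recommendations; values are illustrative.
export MP_PROCS=4            # number of nodes (4 is an arbitrary example)
export MP_RESD=yes           # inform the Resource Manager about the job
export MP_EUIDEVICE=css0     # use the high performance switch
export MP_EUILIB=ip          # IP communications: "good neighbor" setting
export MP_CPU_USE=multiple   # share the CPU with other users
export MP_HOSTFILE=""        # no host list file: let the Resource Manager choose

# Print the resulting POE settings for a quick check.
env | grep '^MP_' | sort
```

Placing these lines in a small file and sourcing it before each interactive session keeps the settings in one place and makes it easy to switch MP_EUILIB for testing.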

    • LoadLeveler (batch) Users:

      • POE environment variables and adapter usage are specified in your LoadLeveler command file. For example:
        
            #@ requirements = (Adapter == "hps_user")
            #@ environment = MP_EUILIB=us;MP_INFOLEVEL=3;MP_LABELIO=yes
            

      • Always use US communication over the switch. The above example demonstrates how to do this.

      • Not all POE variables are used by LoadLeveler. The following variables will have no effect if used in a LoadLeveler job command file.
        
            MP_PROCS
            MP_RMPOOL
            MP_EUIDEVICE
            MP_HOSTFILE
            MP_SAVEHOSTFILE
            MP_PMDSUFFIX
            MP_RESD
            MP_RETRY
            MP_RETRYCOUNT
            MP_ADAPTER_USE
            MP_CPU_USE 
            

      • Don't forget that certain POE environment variables in your "dot" files (.cshrc, .profile, .login) will override the settings you specify in LoadLeveler.

      • LoadLeveler always acquires nodes for POE jobs directly from the Resource Manager and ignores user host list files.
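Since several POE variables have no effect under LoadLeveler, it may be worth scanning a job command file for them before submitting. The shell sketch below does this for the variables listed above. The command file and its contents are hypothetical and are created only so that the example runs on its own; the check is a plain text search, not a LoadLeveler feature.

```shell
# Sketch: flag POE variables that have no effect in a LoadLeveler command file.
# The command file and its contents are hypothetical.
JOBFILE=$(mktemp)
cat > "$JOBFILE" <<'EOF'
#@ requirements = (Adapter == "hps_user")
#@ environment = MP_EUILIB=us;MP_PROCS=8;MP_LABELIO=yes;MP_CPU_USE=unique
EOF

# Variables that LoadLeveler does not use, as listed in this section.
IGNORED="MP_PROCS MP_RMPOOL MP_EUIDEVICE MP_HOSTFILE MP_SAVEHOSTFILE MP_PMDSUFFIX MP_RESD MP_RETRY MP_RETRYCOUNT MP_ADAPTER_USE MP_CPU_USE"

FOUND=""
for var in $IGNORED; do
  # Search for the name followed by "=" so MP_RETRY does not match MP_RETRYCOUNT.
  if grep -q "${var}=" "$JOBFILE"; then
    FOUND="$FOUND $var"
    echo "warning: $var has no effect in a LoadLeveler job"
  fi
done
rm -f "$JOBFILE"
```

The same loop could be pointed at .cshrc, .profile, or .login to spot POE settings in "dot" files that might override those given in the LoadLeveler command file.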

    References, Acknowledgements and WWW Resources

    Additional Information on the WWW

    References and Acknowledgements

    • "IBM Parallel Environment for AIX: Operation and Use, Version 2.1.0". IBM Corporation. Available online at the MHPCC by using InfoExplorer with the command: info -l pe .

    • "IBM AIX Parallel Environment Operation and Use, Release 2.0". IBM Corporation.

    • We gratefully acknowledge the IBM Corporation for providing much of the original material included in this document.

    © Copyright 1995,1996 Maui High Performance Computing Center. All rights reserved.

    Documents located on the Maui High Performance Computing Center's WWW server are copyrighted by the MHPCC. Educational institutions are encouraged to reproduce and distribute these materials for educational use as long as credit and notification are provided. Please retain this copyright notice and include this statement with any copies that you make. Also, the MHPCC requests that you send notification of their use to help@mail.mhpcc.edu.

    Commercial use of these materials is prohibited without prior written permission.

    Revised: 17 July 1996 blaise@mhpcc.edu