@vsoch

vsoch/README.md Secret

Last active July 18, 2024 06:24
MPI Commands (mpirun) Cheat Sheet

mpirun

We normally interact with workload managers that bootstrap MPI for us, but it is helpful to know how to use vanilla MPI directly (with a hosts file and ssh, which is a bare-bones setup). Below are examples of using mpirun with different MPI implementations. Everything here was tested on a single node (in a Docker container), but I also try to provide multi-node examples that use a hosts file. For those, you need ssh configured properly between the nodes.
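As a rough sketch of the bare-bones setup described above, the snippet below writes a hosts file with one hostname per line, a format that both the Hydra `-f` option and the Open MPI `--hostfile` option shown later accept. The helper name `write_hostfile` and the node names `node-1` and `node-2` are hypothetical and not part of any MPI distribution. Passwordless ssh between the hosts is still required and is not covered here.

```shell
#!/usr/bin/env bash
# write_hostfile is a hypothetical helper: it writes one hostname per line.
# Usage: write_hostfile <output file> <host> [<host> ...]
write_hostfile() {
    local outfile="$1"
    shift
    # printf repeats the format once for every remaining argument
    printf '%s\n' "$@" > "$outfile"
}

# Single-node case used throughout this cheat sheet: the local hostname only
write_hostfile hosts.txt "$(hostname)"

# Multi-node case with made-up node names
write_hostfile hosts-multi.txt node-1 node-2
cat hosts-multi.txt
```

The resulting file can then be passed to the commands below, for example `mpirun -np 2 -f ./hosts-multi.txt <application>` with Intel MPI or MPICH.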

intel mpi

docker run -it ghcr.io/rse-ops/lammps-matrix:intel-mpi-rocky-9-amd64
# vanilla example
mpirun -n 1 lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxc.hns -nocite

# Is the same as
mpirun -np 1 lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxc.hns -nocite

# For multiple hosts (nodes), use comma separated hosts on the command line
mpirun -np 1 -hosts $(hostname) lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxc.hns -nocite

# or in a file with -f
hostname > hosts.txt
mpirun -np 1 -f ./hosts.txt lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxc.hns -nocite

# It's also common to ask for a total number of tasks, and the processes per node (ppn)
mpirun -n {{tasks}} -ppn {{tasks_per_node}} <application>

# For example, to distribute 200 tasks evenly across 4 nodes (50 per node).
# The LAMMPS problem size is also larger here, to match the additional resources.
mpirun --hostfile ./hostlist.txt -np 200 -ppn 50 lmp -v x 64 -v y 16 -v z 16 -in in.reaxc.hns -nocite
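The relationship between total tasks, nodes, and processes per node in the example above is simply tasks = nodes x ppn. A minimal sketch, assuming a hypothetical helper named `build_mpirun_cmd`, that derives the total and prints the command instead of running it (so it can be checked without MPI installed). It uses the `-f`, `-np`, and `-ppn` options listed in the help output below.

```shell
#!/usr/bin/env bash
# build_mpirun_cmd is a hypothetical helper: it derives the total task count
# from nodes x tasks-per-node and prints the Hydra-style command (dry run).
# Usage: build_mpirun_cmd <nodes> <tasks per node> <hostfile> <application...>
build_mpirun_cmd() {
    local nodes="$1" ppn="$2" hostfile="$3"
    shift 3
    local tasks=$(( nodes * ppn ))
    echo "mpirun -f ${hostfile} -np ${tasks} -ppn ${ppn} $*"
}

# 4 nodes x 50 tasks per node = 200 tasks, as in the example above
build_mpirun_cmd 4 50 ./hostlist.txt lmp -v x 64 -v y 16 -v z 16 -in in.reaxc.hns -nocite
```

Printing the command first is a cheap way to confirm the arithmetic before launching a large job.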

Full help:

mpirun intel MPI
# mpirun --help

Usage: ./mpiexec [global opts] [local opts for exec1] [exec1] [exec1 args] : [local opts for exec2] [exec2] [exec2 args] : ...

Global options (passed to all executables):

  Global environment options:
    -genv {name} {value}             environment variable name and value
    -genvlist {env1,env2,...}        environment variable list to pass
    -genvnone                        do not pass any environment variables
    -genvall                         pass all environment variables not managed
                                          by the launcher (default)

  Other global options:
    -f {name}                        file containing the host names
    -hosts {host list}               comma separated host list


Local options (passed to individual executables):

  Other local options:
    -n/-np {value}                   number of processes
    {exec_name} {args}               executable name and arguments


Hydra specific options (treated as global):

  Launch options:
    -launcher                        launcher to use (ssh slurm rsh ll sge pbs pbsdsh pdsh srun lsf blaunch qrsh fork)
    -launcher-exec                   executable to use to launch processes
    -enable-x/-disable-x             enable or disable X forwarding

  Resource management kernel options:
    -rmk                             resource management kernel to use (slurm ll lsf sge pbs cobalt)

  Processor topology options:
    -bind-to                         process binding
    -map-by                          process mapping
    -membind                         memory binding policy

  Other Hydra options:
    -verbose                         verbose mode
    -info                            build information
    -print-all-exitcodes             print exit codes of all processes
    -ppn                             processes per node
    -prepend-rank                    prepend rank to output
    -prepend-pattern                 prepend pattern to output
    -outfile-pattern                 direct stdout to file
    -errfile-pattern                 direct stderr to file
    -nameserver                      name server information (host:port format)
    -disable-auto-cleanup            don't cleanup processes on error
    -disable-hostname-propagation    let MPICH auto-detect the hostname
    -localhost                       local hostname for the launching node
    -usize                           universe size (SYSTEM, INFINITE, <value>)

Intel(R) MPI Library specific options:

  <option> -help                     show help message for the specific option

  Global options:
    -aps                             Intel(R) Application Performance Snapshot profile
    -mps                             Intel(R) Application Performance Snapshot profile (MPI, OpenMP only)
    -gtool                           tool and rank set
    -gtoolfile                       file containing tool and rank set
    -hosts-group {groups of hosts}   allows to set node ranges (like in Slurm* Workload Manager)

  Other Hydra options:
    -iface                           network interface to use
    -s <spec>                        redirect stdin to all or 1,2 or 2-4,6 MPI processes (0 by default)
    -silent-abort                    do not print abort warning message
    -nolocal                         avoid running the application processes on the node where mpiexec.hydra started
    -tune {binary file}              defines the name of binary tuning file
    -print-rank-map                  print rank mapping
    -prepend-timestamp               prepend time stamp to stdout
    -prot                            print the communication protocol between each host and process pin status

Intel(R) MPI Library, Version 2021.8  Build 20221129 (id: 339ec755a1)
Copyright 2003-2022 Intel Corporation.

OpenMPI

docker run -it --entrypoint bash ghcr.io/rse-ops/lammps-matrix:openmpi-ubuntu-22.04-amd64 
# most containers with user "root" will need --allow-run-as-root
mpirun --allow-run-as-root -n 1 lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxc.hns -nocite

# Open MPI accepts any of -c, -n, --n, -np, or --np for the number of processes
mpirun --allow-run-as-root --np 1 lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxc.hns -nocite

# -N sets the number of processes per node (it is not the number of nodes)
mpirun --allow-run-as-root -N 1 lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxc.hns -nocite

# Example of controlling topology: "map by processes per resource (ppr), 48 per node", so 96 processes in total across 2 nodes
mpirun -np 96 --map-by ppr:48:node --hostfile ./hostfile.txt <application>
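The `ppr:48:node` value above has to stay consistent with the total process count (48 per node x 2 nodes = 96). A small sketch, assuming a hypothetical helper named `ppr_args`, that produces both arguments from the node count and the per-node count so they cannot drift apart. It only prints the command and does not call mpirun.

```shell
#!/usr/bin/env bash
# ppr_args is a hypothetical helper: given a node count and the number of
# processes per node, it prints the matching -np and --map-by arguments.
ppr_args() {
    local nodes="$1" per_node="$2"
    # printf is used instead of echo because the string starts with a dash
    printf '%s\n' "-np $(( nodes * per_node )) --map-by ppr:${per_node}:node"
}

# 2 nodes x 48 processes per node = 96 processes
printf '%s\n' "mpirun $(ppr_args 2 48) --hostfile ./hostfile.txt <application>"
```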
Help for OpenMPI
# mpirun --allow-run-as-root --help
mpirun (Open MPI) 4.1.2

Usage: mpirun [OPTION]...  [PROGRAM]...
Start the given program using Open RTE

-c|-np|--np <arg0>       Number of processes to run
-h|--help <arg0>         This help message
   -n|--n <arg0>         Number of processes to run
-q|--quiet               Suppress helpful messages
-v|--verbose             Be verbose
-V|--version             Print version and exit

For additional mpirun arguments, run 'mpirun --help <category>'

The following categories exist: general (Defaults to this option), debug,
    output, input, mapping, ranking, binding, devel (arguments useful to OMPI
    Developers), compatibility (arguments supported for backwards compatibility),
    launch (arguments to modify launch options), and dvm (Distributed Virtual
    Machine arguments).

Report bugs to http://www.open-mpi.org/community/help/
# mpirun --allow-run-as-root --help mapping
mpirun (Open MPI) 4.1.2

Usage: mpirun [OPTION]...  [PROGRAM]...
Start the given program using Open RTE

   -cf|--cartofile <arg0>  
                         Provide a cartography file
   -cpus-per-proc|--cpus-per-proc <arg0>  
                         Number of cpus to use for each process [default=1]
   -cpus-per-rank|--cpus-per-rank <arg0>  
                         Synonym for cpus-per-proc
-H|-host|--host <arg0>   List of hosts to invoke processes on
   --map-by <arg0>       Mapping Policy [slot | hwthread | core | socket
                         (default) | numa | board | node]
   -N <arg0>             Launch n processes per node on all allocated nodes
                         (synonym for 'map-by node')
   -nolocal|--nolocal    Do not run any MPI applications on the local node
   -nooversubscribe|--nooversubscribe 
                         Nodes are not to be oversubscribed, even if the
                         system supports such operation
   -oversubscribe|--oversubscribe 
                         Nodes are allowed to be oversubscribed, even on a
                         managed system, and overloading of processing
                         elements
   --ppr <arg0>          Comma-separated list of number of processes on a
                         given resource type [default: none]
   -rf|--rankfile <arg0>  
                         Provide a rankfile file
   -use-hwthread-cpus|--use-hwthread-cpus 
                         Use hardware threads as independent cpus
# mpirun --allow-run-as-root --help launch
mpirun (Open MPI) 4.1.2

Usage: mpirun [OPTION]...  [PROGRAM]...
Start the given program using Open RTE

   -allow-run-as-root|--allow-run-as-root 
                         Allow execution as root (STRONGLY DISCOURAGED)
   -am <arg0>            Aggregate MCA parameter set file list
   --app <arg0>          Provide an appfile; ignore all other command line
                         options
   -default-hostfile|--default-hostfile <arg0>  
                         Provide a default hostfile
   -enable-instant-on-support|--enable-instant-on-support 
                         Enable PMIx-based instant on launch support
                         (experimental)
   -fwd-mpirun-port|--fwd-mpirun-port 
                         Forward mpirun port to compute node daemons so all
                         will use it
   -hostfile|--hostfile <arg0>  
                         Provide a hostfile
   -launch-agent|--launch-agent <arg0>  
                         Command used to start processes on remote nodes
                         (default: orted)
   -machinefile|--machinefile <arg0>  
                         Provide a hostfile
   --noprefix            Disable automatic --prefix behavior
   -path|--path <arg0>   PATH to be used to look for executables to start
                         processes
   -personality|--personality <arg0>  
                         Comma-separated list of programming model,
                         languages, and containers being used
                         (default="ompi")
   --prefix <arg0>       Prefix where Open MPI is installed on remote nodes
   --preload-files <arg0>  
                         Preload the comma separated list of files to the
                         remote machines current working directory before
                         starting the remote process.
-s|--preload-binary      Preload the binary on the remote machine before
                         starting the remote process.
   -set-cwd-to-session-dir|--set-cwd-to-session-dir 
                         Set the working directory of the started processes
                         to their session directory
   -show-progress|--show-progress 
                         Output a brief periodic report on launch progress
   -use-regexp|--use-regexp 
                         Use regular expressions for launch
   -wd|--wd <arg0>       Synonym for --wdir
   -wdir|--wdir <arg0>   Set the working directory of the started processes
-x <arg0>                Export an environment variable, optionally
                         specifying a value (e.g., "-x foo" exports the
                         environment variable foo and takes its value from
                         the current environment; "-x foo=bar" exports the
                         environment variable name foo and sets its value to
                         "bar" in the started processes)

Report bugs to http://www.open-mpi.org/community/help/

MPICH

docker run -it --entrypoint bash ghcr.io/rse-ops/lammps-matrix:mpich-ubuntu-22.04-amd64 
mpirun -np 1 lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxc.hns -nocite 

# The remaining commands are the same as for Intel MPI. See the help below for complete details.
MPICH mpirun help
# mpirun --help

Usage: ./mpiexec [global opts] [local opts for exec1] [exec1] [exec1 args] : [local opts for exec2] [exec2] [exec2 args] : ...

Global options (passed to all executables):

  Global environment options:
    -genv {name} {value}             environment variable name and value
    -genvlist {env1,env2,...}        environment variable list to pass
    -genvnone                        do not pass any environment variables
    -genvall                         pass all environment variables not managed
                                          by the launcher (default)

  Other global options:
    -f {name}                        file containing the host names
    -hosts {host list}               comma separated host list
    -wdir {dirname}                  working directory to use
    -configfile {name}               config file containing MPMD launch options


Local options (passed to individual executables):

  Local environment options:
    -env {name} {value}              environment variable name and value
    -envlist {env1,env2,...}         environment variable list to pass
    -envnone                         do not pass any environment variables
    -envall                          pass all environment variables (default)

  Other local options:
    -n/-np {value}                   number of processes
    {exec_name} {args}               executable name and arguments


Hydra specific options (treated as global):

  Launch options:
    -launcher                        launcher to use (ssh rsh fork slurm ll lsf sge manual persist)
    -launcher-exec                   executable to use to launch processes
    -enable-x/-disable-x             enable or disable X forwarding

  Resource management kernel options:
    -rmk                             resource management kernel to use (user slurm ll lsf sge pbs cobalt)

  Processor topology options:
    -topolib                         processor topology library (hwloc)
    -bind-to                         process binding
    -map-by                          process mapping
    -membind                         memory binding policy

  Demux engine options:
    -demux                           demux engine (poll select)

  Other Hydra options:
    -verbose                         verbose mode
    -info                            build information
    -print-all-exitcodes             print exit codes of all processes
    -iface                           network interface to use
    -ppn                             processes per node
    -profile                         turn on internal profiling
    -prepend-rank                    prepend rank to output
    -prepend-pattern                 prepend pattern to output
    -outfile-pattern                 direct stdout to file
    -errfile-pattern                 direct stderr to file
    -nameserver                      name server information (host:port format)
    -disable-auto-cleanup            don't cleanup processes on error
    -disable-hostname-propagation    let MPICH auto-detect the hostname
    -order-nodes                     order nodes as ascending/descending cores
    -localhost                       local hostname for the launching node
    -usize                           universe size (SYSTEM, INFINITE, <value>)
    -pmi-port                        use the PMI_PORT model
    -skip-launch-node                do not run MPI processes on the launch node
    -gpus-per-proc                   number of GPUs per process (default: auto)

Please see the instructions provided at
http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
for further details

spectrum mpi

The mpirun and mpiexec commands are identical in functionality; both are symbolic links to orterun, the job-launching command of the Open Runtime Environment that underlies IBM Spectrum MPI. Although this material refers only to mpirun, everything said about it applies equally to mpiexec and orterun.

mpirun -np 4 --hostfile ./hosts.txt <application>
mpirun -np 4 -host h1,h2,h2 <application> 
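Because the flags differ between implementations, a wrapper script may want to guess which mpirun it is talking to. The sketch below is one possible heuristic, not an official detection method: a hypothetical function `detect_mpi` reads banner or help text on stdin and matches strings that appear in the help output quoted in this cheat sheet. Note that Spectrum MPI prints an Open MPI banner, so this heuristic cannot tell the two apart.

```shell
#!/usr/bin/env bash
# detect_mpi is a hypothetical helper: it reads banner or help text on stdin
# and prints a best-guess implementation name.
detect_mpi() {
    local text
    text="$(cat)"
    case "$text" in
        # Checked first: the Intel MPI help also contains Hydra sections
        *"Intel(R) MPI"*)  echo "intel" ;;
        # Spectrum MPI also identifies itself as "Open MPI"
        *"Open MPI"*)      echo "openmpi" ;;
        *Hydra*|*HYDRA*)   echo "mpich" ;;
        *)                 echo "unknown" ;;
    esac
}

# Lines copied from the help output in this cheat sheet
echo "mpirun (Open MPI) 4.1.2" | detect_mpi
echo "Intel(R) MPI Library, Version 2021.8  Build 20221129 (id: 339ec755a1)" | detect_mpi
echo "Hydra specific options (treated as global):" | detect_mpi
```

On a real system the input would come from the launcher itself, for example `mpirun --help 2>&1 | detect_mpi`.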

Documentation

Spectrum MPI help
mpirun (Open MPI) 10.03.01.03rtm0

Usage: mpirun [OPTION]...  [PROGRAM]...
Start the given program using Open RTE

   -allow-run-as-root|--allow-run-as-root 
                         Allow execution as root (STRONGLY DISCOURAGED)
   -am <arg0>            Aggregate MCA parameter set file list
   --app <arg0>          Provide an appfile; ignore all other command line
                         options
   --bind-to <arg0>      Policy for binding processes. Allowed values: none,
                         hwthread, core, l1cache, l2cache, l3cache, socket,
                         numa, board, cpu-list ("none" is the default when
                         oversubscribed, "core" is the default when np<=2,
                         and "socket" is the default when np>2). Allowed
                         qualifiers: overload-allowed, if-supported,
                         ordered
   -bind-to-core|--bind-to-core 
                         Bind processes to cores
   -bind-to-socket|--bind-to-socket 
                         Bind processes to sockets
   -bycore|--bycore      Whether to map and rank processes round-robin by
                         core
   -bynode|--bynode      Whether to map and rank processes round-robin by
                         node
   -byslot|--byslot      Whether to map and rank processes round-robin by
                         slot
-c|-np|--np <arg0>       Number of processes to run
   -cf|--cartofile <arg0>  
                         Provide a cartography file
   -continuous|--continuous 
                         Job is to run until explicitly terminated
   -cpu-list|--cpu-list <arg0>  
                         List of processor IDs to bind processes to
                         [default=NULL]
   -cpu-set|--cpu-set <arg0>  
                         Comma-separated list of ranges specifying logical
                         cpus allocated to this job [default: none]
   -cpus-per-proc|--cpus-per-proc <arg0>  
                         Number of cpus to use for each process [default=1]
   -cpus-per-rank|--cpus-per-rank <arg0>  
                         Synonym for cpus-per-proc
-d|-debug-devel|--debug-devel 
                         Enable debugging of OpenRTE
   -debug|--debug        Invoke the user-level debugger indicated by the
                         orte_base_user_debugger MCA parameter
   -debug-daemons|--debug-daemons 
                         Enable debugging of any OpenRTE daemons used by
                         this application
   -debug-daemons-file|--debug-daemons-file 
                         Enable debugging of any OpenRTE daemons used by
                         this application, storing output in files
   -debugger|--debugger <arg0>  
                         Sequence of debuggers to search for when "--debug"
                         is used
   -default-hostfile|--default-hostfile <arg0>  
                         Provide a default hostfile
   -disable-recovery|--disable-recovery 
                         Disable recovery (resets all recovery options to
                         off)
   -display-allocation|--display-allocation 
                         Display the allocation being used by this job
   -display-devel-allocation|--display-devel-allocation 
                         Display a detailed list (mostly intended for
                         developers) of the allocation being used by this
                         job
   -display-devel-map|--display-devel-map 
                         Display a detailed process map (mostly intended for
                         developers) just before launch
   -display-diffable-map|--display-diffable-map 
                         Display a diffable process map (mostly intended for
                         developers) just before launch
   -display-map|--display-map 
                         Display the process map just before launch
   -display-topo|--display-topo 
                         Display the topology as part of the process map
                         (mostly intended for developers) just before
                         launch
   -do-not-launch|--do-not-launch 
                         Perform all necessary operations to prepare to
                         launch the application, but do not actually launch
                         it
   -do-not-resolve|--do-not-resolve 
                         Do not attempt to resolve interfaces
   -dvm|--dvm            Create a persistent distributed virtual machine
                         (DVM)
   -enable-instant-on-support|--enable-instant-on-support 
                         Enable PMIx-based instant on launch support
                         (experimental)
   -enable-recovery|--enable-recovery 
                         Enable recovery from process failure [Default =
                         disabled]
   -fwd-mpirun-port|--fwd-mpirun-port 
                         Forward mpirun port to compute node daemons so all
                         will use it
   -get-stack-traces|--get-stack-traces 
                         Get stack traces of all application procs on
                         timeout
   -gmca|--gmca <arg0> <arg1>  
                         Pass global MCA parameters that are applicable to
                         all contexts (arg0 is the parameter name; arg1 is
                         the parameter value)
-h|--help <arg0>         This help message
-H|-host|--host <arg0>   List of hosts to invoke processes on
   -hnp|--hnp <arg0>     Specify the URI of the HNP, or the name of the file
                         (specified as file:filename) that contains that
                         info
   -hostfile|--hostfile <arg0>  
                         Provide a hostfile
   -index-argv-by-rank|--index-argv-by-rank 
                         Uniquely index argv[0] for each process using its
                         rank
   -launch-agent|--launch-agent <arg0>  
                         Command used to start processes on remote nodes
                         (default: orted)
   -leave-session-attached|--leave-session-attached 
                         Enable debugging of OpenRTE
   -machinefile|--machinefile <arg0>  
                         Provide a hostfile
   --map-by <arg0>       Mapping Policy [slot | hwthread | core | socket
                         (default) | numa | board | node]
   -max-restarts|--max-restarts <arg0>  
                         Max number of times to restart a failed process
   -max-vm-size|--max-vm-size <arg0>  
                         Number of processes to run
   -mca|--mca <arg0> <arg1>  
                         Pass context-specific MCA parameters; they are
                         considered global if --gmca is not used and only
                         one context is specified (arg0 is the parameter
                         name; arg1 is the parameter value)
   -merge-stderr-to-stdout|--merge-stderr-to-stdout 
                         Merge stderr to stdout for each process
   -N <arg0>             Launch n processes per node on all allocated nodes
                         (synonym for 'map-by node')
   -n|--n <arg0>         Number of processes to run
   -nolocal|--nolocal    Do not run any MPI applications on the local node
   -nooversubscribe|--nooversubscribe 
                         Nodes are not to be oversubscribed, even if the
                         system supports such operation
   --noprefix            Disable automatic --prefix behavior
   -novm|--novm          Execute without creating an allocation-spanning
                         virtual machine (only start daemons on nodes
                         hosting application procs)
   -npernode|--npernode <arg0>  
                         Launch n processes per node on all allocated nodes
   -npersocket|--npersocket <arg0>  
                         Launch n processes per socket on all allocated
                         nodes
   -ompi-server|--ompi-server <arg0>  
                         Specify the URI of the publish/lookup server, or
                         the name of the file (specified as file:filename)
                         that contains that info
   -output-filename|--output-filename <arg0>  
                         Redirect output from application processes into
                         filename/job/rank/std[out,err,diag]. A relative
                         path value will be converted to an absolute path
   -output-proctable|--output-proctable 
                         Output the debugger proctable after launch
   -oversubscribe|--oversubscribe 
                         Nodes are allowed to be oversubscribed, even on a
                         managed system, and overloading of processing
                         elements
   -path|--path <arg0>   PATH to be used to look for executables to start
                         processes
   -pernode|--pernode    Launch one process per available node
   -personality|--personality <arg0>  
                         Comma-separated list of programming model,
                         languages, and containers being used
                         (default="ompi")
   --ppr <arg0>          Comma-separated list of number of processes on a
                         given resource type [default: none]
   --prefix <arg0>       Prefix where Open MPI is installed on remote nodes
   --preload-files <arg0>  
                         Preload the comma separated list of files to the
                         remote machines current working directory before
                         starting the remote process.
-q|--quiet               Suppress helpful messages
   --rank-by <arg0>      Ranking Policy [slot (default) | hwthread | core |
                         socket | numa | board | node]
   -report-bindings|--report-bindings 
                         Whether to report process bindings to stderr
   -report-child-jobs-separately|--report-child-jobs-separately 
                         Return the exit status of the primary job only
   -report-events|--report-events <arg0>  
                         Report events to a tool listening at the specified
                         URI
   -report-pid|--report-pid <arg0>  
                         Printout pid on stdout [-], stderr [+], or a file
                         [anything else]
   -report-state-on-timeout|--report-state-on-timeout 
                         Report all job and process states upon timeout
   -report-uri|--report-uri <arg0>  
                         Printout URI on stdout [-], stderr [+], or a file
                         [anything else]
   -rf|--rankfile <arg0>  
                         Provide a rankfile file
-s|--preload-binary      Preload the binary on the remote machine before
                         starting the remote process.
   -set-cwd-to-session-dir|--set-cwd-to-session-dir 
                         Set the working directory of the started processes
                         to their session directory
   -show-progress|--show-progress 
                         Output a brief periodic report on launch progress
   -stdin|--stdin <arg0>  
                         Specify procs to receive stdin [rank, all, none]
                         (default: 0, indicating rank 0)
   -tag-output|--tag-output 
                         Tag all output with [job,rank]
   -timeout|--timeout <arg0>  
                         Timeout the job after the specified number of
                         seconds
   -timestamp-output|--timestamp-output 
                         Timestamp all application process output
   -tune <arg0>          Application profile options file list
   -tv|--tv              Deprecated backwards compatibility flag; synonym
                         for "--debug"
   -use-hwthread-cpus|--use-hwthread-cpus 
                         Use hardware threads as independent cpus
   -use-regexp|--use-regexp 
                         Use regular expressions for launch
-v|--verbose             Be verbose
-V|--version             Print version and exit
   -wd|--wd <arg0>       Synonym for --wdir
   -wdir|--wdir <arg0>   Set the working directory of the started processes
-x <arg0>                Export an environment variable, optionally
                         specifying a value (e.g., "-x foo" exports the
                         environment variable foo and takes its value from
                         the current environment; "-x foo=bar" exports the
                         environment variable name foo and sets its value to
                         "bar" in the started processes)
   -xml|--xml            Provide all output in XML format
   -xml-file|--xml-file <arg0>  
                         Provide all output in XML format to the specified
                         file
   -xterm|--xterm <arg0>  
                         Create a new xterm window and display output from
                         the specified ranks there

For additional mpirun arguments, run 'mpirun --help <category>'

The following categories exist: general (Defaults to this option), debug,
    output, input, mapping, ranking, binding, devel (arguments useful to OMPI
    Developers), compatibility (arguments supported for backwards compatibility),
    launch (arguments to modify launch options), and dvm (Distributed Virtual
    Machine arguments).

Report bugs to https://www.ibm.com/mysupport/s/

Extra options from Spectrum-MPI (that translate to similar Open MPI options):

  [Container behavior]
    -container rank   : Use $MPIRUN_CONTAINER_CMD to launch ranks within
                        individual container instances. Automatically inserts
                        the container assistant script in front of the program
                        name for environment modifications.
    -container all    : Use $MPIRUN_CONTAINER_CMD to relaunch mpirun within
                        a container and to launch orteds within individual
                        container instances. The container assistant script is
                        automatically inserted in front of the relaunched
                        mpirun command. All ranks assigned to a node will share
                        the same container instance on that node with the orted.
    -container orted  : Use $MPIRUN_CONTAINER_CMD to launch orteds within
                        individual container instances. This is like 'all' mode
                        except mpirun does -not- relaunch iteself within a
                        container. The user is responsible for establishing the
                        container instance then launching mpirun from within
                        that container instance. No container assistant script
                        is used in this mode. As such the 'assist' and 'root'
                        options, and SMPI_CONTAINERENV_ prefixed environment
                        variables have no impact in this mode.
    -container root:<dir> : By default the container is assumed to set its
                        own MPI_ROOT environment variable inside the container.
                        If this is not the case or if a different value for
                        MPI_ROOT is needed then this option can be used to
                        specify that MPI_ROOT value inside the container.
    -container assist:<path> : Set the full path to the container assistant
                        script that is valid inside the container at runtime.
                        Default: $MPI_ROOT/container/bin/incontainer.pl
    -container <option>,<option>,.. : Comma separated list of the above options
    env MPIRUN_CONTAINER_OPTIONS=<options> : Same as -container <options>
    env MPIRUN_CONTAINER_CMD=<cmd>  : Specify the container runtime command to
                        launch a container instance. This can be any executable
                        including a script for ease of use.
                        Example: "singularity exec myapp.sif"
    env SMPI_CONTAINERENV_*         : Pass an environment variable from outside
                        of the container to inside the container by prefixing
                        the variable with this string.
    env SMPI_CONTAINERENV_PREPEND_* : Prepend values to the beginning of an
                        environment variable inside of the container by
                        prefixing the variable name with this string.
    env SMPI_CONTAINERENV_APPEND_*  : Append values to the end of an environment
                        variable inside of the container by prefixing the
                        variable name with this string.

  [Interconnect selection]
    -PAMI / -pami  : use IBM PAMI via the pami PML (default)
    -UCX / -ucx    : use UCX (Tech Preview) via the ucx PML
    -MXM / -mxm    : use Mellanox MXM via the yalla PML
    -TCP / -tcp    : use TCP/IP via the PML ob1 and the BTL tcp
    -IBV / -ibv    : use OpenFabrics infiniband via the
                     PML ob1 and the BTL openib
                       aliases: -ib / -openib

    In all of the above the capital option equates to forcing the specified
    PML / MTL / BTL, and the lower case option only equates to specifying a
    higher priority for the selected interconnect.

  [Additional PAMI options]
    -verbsbypass <ver> : use PAMI's verbs bypass.  <ver> reflects
                         Mellanox OFED version installed:
                            (ver=auto, off and x.y* (* installed MOFED version in the cluster ))
                         auto find out the installed compatible MOFED version on the mpirun node
                         (auto assumes complete cluster installed with same MOFED level)
    -pami_noib         : use PAMI on a single node with no Infiniband card. [ppc64 only]
    -async             : use PAMI Asynchronous progress thread [ selected pml must be pami ]
    -hwtm              : use IB hardware tag matching [ selected pml must be pami ]

  [On-host communication method]
    -intra nic     : use the off-host BTL for on-host traffic as well
    -intra vader   : use BTL=vader (shared memory) for on-host traffic
                     (only applies if the PML is already ob1)
    -intra shm     : equivalent to -intra=vader

  [Display interconnect]
    -prot          : display a table of what interconnect type each host uses
                     (first rank on each host connects to all peer hosts to
                     establish connections that might otherwise be on-demand)
    -protlazy      : less aggressive version of -prot that runs at finalize
                     and without establishing connections, so many peers
                     might be unconnected.

  [Stdio options]
    -stdio p                       : prefix each rank's output with [job,rank]
    -stdio t                       : add timestamp to output
    -stdio i[+|all|-|none|<rank>]  : send stdin to all ranks (+), no ranks (-)
                                     or a single specific rank
    -stdio file:prefix             : send output to files named <prefix>.<rank>
    -stdio <option>,<option>,..    : comma separated list of the above options

  [IP network selection]
    -netaddr <spec>,<spec>,..         :  specify what network(s) to use for
                                         IP traffic. This option applies
                                         to both control messages, and the
                                         regular MPI rank traffic
    -netaddr <type>:<spec>,<spec>,..  :  individually specify the networks
                                         for different types of traffic
    <type> can be any of
        rank       :  specify network for regular MPI rank-to-rank traffic
        control    :  specify network for control messages, eg launching
        mpirun     :  synonym for "control"
    <spec> can be either
        interface name  :  eg eth0 or ib0 etc
        CIDR notation   :  eg 10.10.1.0/24

  [libnl / libnl3 collision avoidance]
    -restrict_libs nl / libnl /         : only load libraries compatible with
                   ^nl3 / ^libnl3         libnl (skip libnl3)
    -restrict_libs nl3 / libnl3 /       : only load libraries compatible with
                   ^nl / ^libnl           libnl3 (skip libnl)
    -restrict_libs consistent           : reject inconsistency vs current state
    -restrict_libs none                 : no restrictions
    -restrict_libs default              : "none"
    -restrict_libs v / vv / vvv         : print a message when rejecting an MCA
    -restrict_libs <option>,<option>,.. : comma separated list of the above
                                          options
    The levels of verboseness are
      v : print a message when rejecting a library due to a libnl conflict
      vv : for each library that uses libnl/nl3 print what it was detected as
      vvv : for every library print what it was detected as

  [Affinity options]
    -aff on                 : turn on affinity with default option (bandwidth)
    -aff off                : turn off affinity (unbind)
    -aff v / -aff vv        : verbose
    -aff bandwidth          : interleave sockets, use natural hardware order
    -aff latency            : pack ranks across the natural hardware order
    -aff cycle:<unit>       : interleave binding over the specified element
    -aff width:<unit>       : bind each rank to an element of this size
         <unit> can be hwthread, core, socket, numa, or board.

    -aff default               : same as "bandwidth" above
    -aff auto[matic]           : same as "bandwidth" above
    -aff none                  : same as "off" above
    -aff <option>,<option>,..  : comma separated list of the above options

  [GPU support]
    -gpu           : Enable GPU awareness in PAMI.
    -disable_gdr   : Disable GPU Direct RDMA support for Power8 systems

  [Dynamic MPI Profiling interface]
    -entry <lib>,.. : list of PMPI wrapping libraries.  Each <lib> can be
                      of the form libfoo.so, /path/to/libfoo.so, or just
                      foo, which will be automatically expanded into
                      libfoo.so for simple strings consisting only of the
                      characters [a-zA-Z0-9_-]  (expansion not applicable
                      for "fort", "fortran", "v", and "vv")
    -entry fort     : included in a list of <lib> above, this indicates what
                      layer to install the base MPI product's fortran
                      calls which minimally wrap the C calls (by default
                      this is put at the top)
    -entry fortran  : same as fort
    -entrybase <lib>,.. : optionally specify what library(s) to get the
                      bottom level MPI calls from, by default RTLD_NEXT
                      which would be the libmpi the executable is linked
                      against.
    -baseentry      : same as -entrybase
    -entry v        : verbose (show the layering of the MPI entrypoints)
    -entry vv       : more verbose - the difference is 'v' shows what levels
                      are intended to be used, 'vv' happens further inside
                      the library and confirms what libraries are being opened.
                      'vv' output is less readable, but more visibly confirms
                      interception is taking place.
    -entry mpe      : turn on the pre-built MPE logging library (version
                      mpe2-2.4.9b) from Argonne National Laboratory. The
                      output .clog2 file is viewable with jumpshot.

  [Manual spin to wait for debugger attachment]
    -dbgspin early                 : uses LD_PRELOAD to put processes to sleep
                                     very early, before main(). The process
                                     being put to sleep isn't necessarily the
                                     MPI rank who calls MPI_Init though. For
                                     example in "mpirun -np 2 env A=B app.x"
                                     the first process started as a "rank" is
                                     "env" and that would be the process put
                                     to sleep
    -dbgspin rank                  : puts ranks to sleep at the bottom of
                                     MPI_Init, this way only true MPI rank
                                     processes are put to sleep
    -dbgspin barrier or nobarrier  : in 'rank' mode when selected ranks sleep
                                     at the end of MPI_Init, the other ranks
                                     can wait in a barrier (the default) or not
                                     with the option 'nobarrier'
    -dbgspin #  or  #-#            : specifies which ranks to sleep. Multiple
                                     ranks and ranges can be specified, eg
                                     -dbgspin 0,20-25,32
                                     Defaults:
                                     'early' mode: all ranks are slept
                                     'rank' mode: only rank 0 is slept
    -dbgspin <option>,<option>,... : comma separated list of the above options

    The ranks are slept until a debugger is attached and a global spin
    variable is set to 0: "set dbgspin=0".

  [Spectrum MPI specific environment variables]
    Spectrum MPI supports both PREPEND and POSTPEND versions of PATH,
       LD_LIBRARY_PATH, and LD_PRELOAD environment variables.  The value
       passed will be propagated and applied on the compute node.
    
    OMPI_PATH_PREPEND             : Prepend value to PATH
    OMPI_PATH_POSTPEND            : Postpend value to PATH
    OMPI_LD_LIBRARY_PATH_PREPEND  : Prepend value to LD_LIBRARY_PATH
    OMPI_LD_LIBRARY_PATH_POSTPEND : Postpend value to LD_LIBRARY_PATH
    OMPI_LD_PRELOAD_PREPEND       : Prepend value to LD_PRELOAD
    OMPI_LD_PRELOAD_POSTPEND      : Postpend value to LD_PRELOAD

  [Help options]
    -show          : display the resulting modified mpirun command line
                     as well as run the resulting mpirun command
    -showonly      : like -show but only displays the result,
                     doesn't run the resulting mpirun command
    -onlyshow      : same as -showonly
    -show_as <syntax>    : specifies what syntax to output environment settings
                           where <syntax> can be
        sh               : write env settings for the sh shell (the default)
        csh              : write env settings for the csh shell
        keyval           : write env settings in a simple VAR VALUE syntax
    -write_env <file>        : The -write_env* options are similar to -showonly
    -write_env_sh <file>     : with a corresponding -show_as <syntax> option,
    -write_env_csh <file>    : but the generated environment is written to
    -write_env_keyval <file> : <file> instead of to stdout.
    -generate_env <file>         : The -generate_env* options are all
    -generate_env_sh <file>      : equivalent to the corresponding
    -generate_env_csh <file>     : -write_env* options. Note that the base
    -generate_env_keyval <file>  : -write_env and base -generate_env use
                                   the default <syntax> of sh
    -help          : display this message
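
To tie a few of the options above together, here is a minimal sketch that only assembles and prints a Spectrum MPI command line (it does not call `mpirun` itself, so it can be tried anywhere). The flags are taken from the help text above; the helper name `build_mpirun_cmd`, the application `./app.x`, and the interface names `ib0` and `eth0` are assumptions for illustration, so swap in whatever matches your cluster.

```shell
# Sketch only: print a Spectrum MPI command line built from flags in the help above.
# $1 = number of ranks, $2 = application to launch (both hypothetical placeholders).
build_mpirun_cmd() {
  # -tcp       lower case: give TCP a higher priority (upper case -TCP would force it)
  # -aff       latency: pack ranks across the natural hardware order
  # -stdio     p,t: prefix output with [job,rank] and add a timestamp
  # -netaddr   rank traffic on ib0, control messages on eth0 (assumed interface names)
  # -showonly  display the resulting mpirun command without running it
  printf '%s\n' "mpirun -np $1 -tcp -aff latency -stdio p,t -netaddr rank:ib0,control:eth0 -showonly $2"
}

build_mpirun_cmd 4 ./app.x
```

On a machine with Spectrum MPI installed you could run the printed line as is to see what `-showonly` expands it to, and then drop `-showonly` to actually launch the job.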