MPI Support
Site configuration for MPI
This document is intended to help EGEE site administrators support MPI deployments properly. The recommendations for supporting MPI on EGEE were drafted by the Technical Coordination Group in MPI and can be found on this link. Currently, only manual installation and configuration are described, so there are no YAIM modules or instructions for that tool here.
Installation
MPI
It is up to you to decide which version of MPI you will support. At IFCA we
support Open MPI 1.2.5, compiled with the Intel C/Fortran compilers. You can
build your own package by downloading the source RPM from the
Open MPI downloads page. If
you wish to use a special interconnect, such as InfiniBand (IB), make sure to enable it when
compiling by passing the --with-openib option to the configure
script. More details about configuring InfiniBand switches for gLite
clusters can be found here.
It is also recommended to enable the Torque TM subsystem (and thus
get proper accounting numbers) by passing the
--enable-mca-dso=pls-tm option. Furthermore, if you wish to use
the Intel compilers, you must define the CC,
CXX, FC and F77 variables to point to the
correct compilers.
# export F77=ifort
# export FC=ifort
# export CC=icc
# export CXX=icpc
# rpmbuild -ba openmpi-1.2.5.spec -D 'name openmpi' -D '_packager aloga' -D 'configure_options --with-openib --enable-mca-dso=pls-tm'
Once compiled, you can deploy this package to all of your WNs.
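Before deploying, it can be worth confirming that the build actually contains the InfiniBand and TM components. The following is a minimal sketch, not part of the original procedure: has_mca_component is a hypothetical helper, and it assumes that ompi_info prints one line per component in the form "MCA <framework>: <component> (...)", as Open MPI 1.2 does.

```shell
#!/bin/sh
# has_mca_component FRAMEWORK COMPONENT
# Hypothetical helper: reads ompi_info-style text on stdin and succeeds
# if a line of the form "MCA <framework>: <component>" is present.
has_mca_component() {
    grep -Eq "MCA[[:space:]]+$1:[[:space:]]+$2([[:space:]]|\$)"
}

# Intended use on a node where the package is installed:
#   ompi_info | has_mca_component btl openib || echo "no openib support" >&2
#   ompi_info | has_mca_component pls tm     || echo "no TM support" >&2
```

If either check fails, the configure options passed to rpmbuild are the first thing to review.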
MPI-Start
MPI-Start is a recommended solution for hiding MPI implementation details from submitted jobs. It was developed within the Int.EU.Grid project and can be found here. It should be installed on every node involved in running MPI jobs.
Configuration
Batch system
PBS-based schedulers such as Torque do not handle CPU allocation properly,
because they assume a homogeneous system in which every node (machine) has
the same number of CPUs. $cpu_per_node can be set in the jobmanager,
but it must be the same for all machines. Furthermore, PBS does not seem
to understand that, in a farm of 2-CPU machines where a job is running on one
CPU of each machine, half the capacity is still free for other jobs.
For these reasons, the batch system needs some special configuration.
Torque
Edit your configuration file (usually
/var/spool/pbs/torque.cfg) and add a line containing:
SUBMITFILTER /var/spool/pbs/submit_filter.pl
Then download submit_filter.pl from here and put it in the above location.
This filter modifies the submitted job script, rewriting the
-l nodes=XX option into specific requests based on the
information given by the pbsnodes -a command.
The submit filter is crucial. Without it, the job is submitted to a single node, where all the MPI processes are allocated as well, instead of being distributed across several nodes.
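To illustrate the kind of rewriting involved, here is a simplified sketch in shell. It is not the submit_filter.pl mentioned above: that filter is written in Perl and builds its requests from pbsnodes -a, whereas this sketch assumes that every node offers the same number of CPUs, passed as an argument. The function names are hypothetical. It relies only on the Torque submit filter contract: the job script arrives on standard input and the (possibly modified) script is written to standard output.

```shell
#!/bin/sh
# cpus_to_nodes_spec TOTAL_CPUS CPUS_PER_NODE
# Turn a plain CPU count into a Torque "nodes=" request that fills whole
# nodes first and puts the remainder on one extra node.
cpus_to_nodes_spec() (
    total=$1
    per_node=$2
    full=$((total / per_node))
    rest=$((total % per_node))
    spec=""
    if [ "$full" -gt 0 ]; then
        spec="${full}:ppn=${per_node}"
    fi
    if [ "$rest" -gt 0 ]; then
        # Prefix with "+" only when whole nodes were already requested.
        spec="${spec:+${spec}+}1:ppn=${rest}"
    fi
    echo "nodes=${spec}"
)

# rewrite_job_script CPUS_PER_NODE
# Copy a job script from stdin to stdout, rewriting "#PBS -l nodes=N"
# lines where N is a plain number; everything else passes through.
rewrite_job_script() {
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            "#PBS -l nodes="*[!0-9]*)
                printf '%s\n' "$line" ;;
            "#PBS -l nodes="?*)
                n=${line#*nodes=}
                echo "#PBS -l $(cpus_to_nodes_spec "$n" "$1")" ;;
            *)
                printf '%s\n' "$line" ;;
        esac
    done
}
```

With 4 CPUs per node, a request for nodes=10 becomes two full nodes plus one node with two processes, which is how a single CPU count gets spread over several machines.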
Warning: gLite updates tend to rewrite torque.cfg. Check that the submit filter line is still there after performing an update.
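This check is easy to automate after each update. A minimal sketch follows; has_submit_filter is a hypothetical helper name, and the path in the usage comment is the usual location quoted above, which may differ on your installation.

```shell
#!/bin/sh
# has_submit_filter CONFIG_FILE
# Succeeds if the file contains an active (uncommented) SUBMITFILTER line
# that names some filter path.
has_submit_filter() {
    grep -Eq '^[[:space:]]*SUBMITFILTER[[:space:]]+[^[:space:]]+' "$1"
}

# Intended use, for example from a post-update script:
#   has_submit_filter /var/spool/pbs/torque.cfg \
#       || echo "SUBMITFILTER line missing from torque.cfg" >&2
```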
Maui
Edit your configuration file (usually under
/var/spool/maui/maui.cfg) and check that it contains the
following lines:
ENABLEMULTINODEJOBS TRUE
ENABLEMULTIREQJOBS TRUE
These parameters allow a job to span more than one node and to specify multiple independent resource requests.
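A small sketch for verifying both Maui settings at once is shown below. The function names are hypothetical, and the match is deliberately strict: it looks for the exact spelling used above, followed by TRUE.

```shell
#!/bin/sh
# maui_param_enabled CONFIG_FILE PARAMETER
# Succeeds if the file has an uncommented line "<PARAMETER> TRUE".
maui_param_enabled() {
    grep -Eq "^[[:space:]]*$2[[:space:]]+TRUE([[:space:]]|\$)" "$1"
}

# check_maui_mpi_params CONFIG_FILE
# Reports each of the two multi-node parameters that is not set to TRUE
# and returns non-zero if any is missing.
check_maui_mpi_params() {
    status=0
    for param in ENABLEMULTINODEJOBS ENABLEMULTIREQJOBS; do
        if ! maui_param_enabled "$1" "$param"; then
            echo "missing or not TRUE: $param"
            status=1
        fi
    done
    return "$status"
}

# Intended use:
#   check_maui_mpi_params /var/spool/maui/maui.cfg
```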
Worker nodes
You should either use a shared storage area (e.g. $HOME or a scratch directory) across all the nodes or set up passwordless SSH access (e.g. host-based authentication) between them. Each option has its pros and cons, so it is up to you which one to choose.
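If you choose passwordless SSH, a quick loop can confirm that every worker node accepts non-interactive logins. This is a sketch under assumptions: check_passwordless_ssh is a hypothetical helper and the host names in the usage comment are placeholders. BatchMode=yes makes ssh fail instead of prompting for a password, and ConnectTimeout bounds the wait for unreachable hosts.

```shell
#!/bin/sh
# check_passwordless_ssh HOST...
# Tries a non-interactive login to each host and prints OK or FAIL.
# Returns non-zero if at least one host failed.
check_passwordless_ssh() {
    failed=0
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Intended use, run from one worker node as a pool account:
#   check_passwordless_ssh wn01.example.org wn02.example.org
```

Run it from a worker node rather than from the CE, since the MPI processes are started between worker nodes.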
Information System
You should publish some values to let the world know which flavour of MPI you
support, as well as the interconnect and a few other details. Everything
related to MPI should be published as
GlueHostApplicationSoftwareRunTimeEnvironment in the corresponding
sections.
MPI-START
If you support MPI-START publish it with:
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
MPI Flavour
Publish the flavour and version of MPI you are using. Optionally, you can also specify the compiler. The MPI flavours are MPICH, MPICH2, LAM and OPENMPI. For example:
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.5
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.5-ICC
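The naming pattern in this example (flavour, flavour-version, flavour-version-compiler) can be generated rather than typed by hand. The helper below is a hypothetical sketch that only prints the lines; where they are placed in your information provider configuration depends on your installation.

```shell
#!/bin/sh
# mpi_flavour_tags FLAVOUR [VERSION [COMPILER]]
# Prints the GLUE runtime environment lines for one MPI flavour,
# following the FLAVOUR, FLAVOUR-VERSION, FLAVOUR-VERSION-COMPILER pattern.
mpi_flavour_tags() (
    key=GlueHostApplicationSoftwareRunTimeEnvironment
    flavour=$1
    version=$2
    compiler=$3
    echo "$key: $flavour"
    if [ -n "$version" ]; then
        echo "$key: $flavour-$version"
        if [ -n "$compiler" ]; then
            echo "$key: $flavour-$version-$compiler"
        fi
    fi
)

# Reproduces the three lines of the example above:
mpi_flavour_tags OPENMPI 1.2.5 ICC
```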
MPI Interconnects
If you have a special interconnect (such as InfiniBand or Myrinet), you can publish it like this:
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-Infiniband
MPI (other)
If you have a shared filesystem area between WNs, you can publish it with:
GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME
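Once the values are in place, you can check what the site actually publishes by filtering the LDIF output of an information system query. The sketch below is illustrative: mpi_tags is a hypothetical helper, and the ldapsearch command in the comment assumes a BDII reachable on port 2170 with base o=grid; the host name is a placeholder.

```shell
#!/bin/sh
# mpi_tags
# Reads LDIF on stdin and prints the MPI-related values of
# GlueHostApplicationSoftwareRunTimeEnvironment, one per line.
mpi_tags() {
    sed -n 's/^GlueHostApplicationSoftwareRunTimeEnvironment:[[:space:]]*//p' \
        | grep -E '^(MPI|OPENMPI|MPICH|LAM)'
}

# Intended use (host, port and base are assumptions about your setup):
#   ldapsearch -x -LLL -H ldap://bdii.example.org:2170 -b o=grid \
#       '(objectClass=GlueSubCluster)' \
#       GlueHostApplicationSoftwareRunTimeEnvironment | mpi_tags
```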
Job Submission
To invoke MPI-START you need a wrapper script that sets the environment variables that define your job. This script is generic and should not need significant modifications.
#!/bin/bash
# Pull in the arguments.
MY_EXECUTABLE=`pwd`/$1
MPI_FLAVOR=$2
# Convert flavor to lowercase for passing to mpi-start.
MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`
# Pull out the correct paths for the requested flavor.
eval MPI_PATH=`printenv MPI_${MPI_FLAVOR}_PATH`
# Ensure the prefix is correctly set. Don't rely on the defaults.
eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
export I2G_${MPI_FLAVOR}_PREFIX
# Touch the executable. It must exist for the shared file system check.
# If it does not, then mpi-start may try to distribute the executable
# when it shouldn't.
touch $MY_EXECUTABLE
chmod +x $MY_EXECUTABLE
# Setup for mpi-start.
export I2G_MPI_APPLICATION=$MY_EXECUTABLE
export I2G_MPI_APPLICATION_ARGS=
export I2G_MPI_TYPE=$MPI_FLAVOR_LOWER
# optional hooks
#export I2G_MPI_PRE_RUN_HOOK=mpi-hooks.sh
#export I2G_MPI_POST_RUN_HOOK=mpi-hooks.sh
# If these are set then you will get more debugging information.
#export I2G_MPI_START_VERBOSE=1
#export I2G_MPI_START_DEBUG=1
# Invoke mpi-start.
$I2G_MPI_START
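The indirect variable handling in the wrapper is easy to misread, so here is a stand-alone sketch of just that part. It shows that the wrapper reads a variable named MPI_<FLAVOUR>_PATH from the job environment and copies it into I2G_<FLAVOUR>_PREFIX. The value /opt/openmpi-1.2.5 is a made-up example path, not a recommended location.

```shell
#!/bin/sh
# Hypothetical site setting: installation prefix of the OPENMPI flavour.
export MPI_OPENMPI_PATH=/opt/openmpi-1.2.5

# Second argument of the wrapper, as given in the JDL Arguments.
MPI_FLAVOR=OPENMPI

# Lowercase form, which the wrapper passes on as I2G_MPI_TYPE.
MPI_FLAVOR_LOWER=$(echo "$MPI_FLAVOR" | tr '[:upper:]' '[:lower:]')

# Indirect lookup: read the variable whose name is MPI_<FLAVOR>_PATH.
MPI_PATH=$(printenv "MPI_${MPI_FLAVOR}_PATH")

# Build the name I2G_<FLAVOR>_PREFIX, assign the path to it and export it.
eval "I2G_${MPI_FLAVOR}_PREFIX=\$MPI_PATH"
export "I2G_${MPI_FLAVOR}_PREFIX"

echo "MPI type passed to mpi-start: $MPI_FLAVOR_LOWER"
echo "I2G_OPENMPI_PREFIX:           $I2G_OPENMPI_PREFIX"
```

If the MPI_<FLAVOUR>_PATH variable is not defined on the worker node, the prefix ends up empty, which is a common cause of mpi-start failing to find the MPI installation.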
In your JDL file you should set the JobType to Normal and then set
the NodeNumber to the number of desired nodes. The Executable should be your wrapper script for MPI-START (mpi-start-wrapper.sh in this case) and the Arguments are your MPI binary and the MPI flavour that it uses. MPI-START allows user-defined extensions via hooks; check the MPI-START Hook CookBook for examples. Here is an example JDL for the submission of the cpi application using 10 processes:
JobType = "Normal";
VirtualOrganisation = "dteam";
NodeNumber = 10;
Executable = "mpi-start-wrapper.sh";
Arguments = "cpi OPENMPI";
StdOutput = "cpi.out";
StdError = "cpi.err";
InputSandbox = {"cpi", "mpi-start-wrapper.sh"};
OutputSandbox = {"cpi.out", "cpi.err"};
Requirements = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& Member("MPI-INFINIBAND", other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& Member("OPENMPI-1.2.5", other.GlueHostApplicationSoftwareRunTimeEnvironment);
Please note that the NodeNumber variable refers to the number of
CPUs you are requesting. The new EGEE MPI WG is discussing how to implement a
fine-grained selection of nodes and/or CPUs (i.e. specifying the number of
processors per node and the number of nodes, not only the total number of CPUs).
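For users who submit the same application with different process counts, the JDL can be generated from a small template. The sketch below is hypothetical: make_mpi_jdl is not an existing tool, it reuses the attribute names from the example JDL, and it keeps only the MPI-START and flavour requirements, so add interconnect or version tags as your case requires.

```shell
#!/bin/sh
# make_mpi_jdl BINARY FLAVOUR NPROCS
# Prints a JDL modelled on the example above for the given MPI binary,
# MPI flavour tag and number of requested CPUs.
make_mpi_jdl() {
    binary=$1
    flavour=$2
    nprocs=$3
    cat <<EOF
JobType       = "Normal";
NodeNumber    = $nprocs;
Executable    = "mpi-start-wrapper.sh";
Arguments     = "$binary $flavour";
StdOutput     = "$binary.out";
StdError      = "$binary.err";
InputSandbox  = {"$binary", "mpi-start-wrapper.sh"};
OutputSandbox = {"$binary.out", "$binary.err"};
Requirements  = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && Member("$flavour", other.GlueHostApplicationSoftwareRunTimeEnvironment);
EOF
}

# Intended use: write the JDL to a file and submit it with your usual
# submission command.
#   make_mpi_jdl cpi OPENMPI 10 > cpi.jdl
```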
Known issues
MPI job support is a necessity for many application areas; however, the configuration is highly dependent on the cluster architecture. The design of mpi-start focused on making MPI job submission as independent as possible of the cluster details.
However, there are still issues to be addressed. Please feel free to send an e-mail to complete this list:
Selection of the proper core/CPU combination
Fine-grained selection of the nodes and/or cores per CPU. This affects the efficiency of the code, as the scaling properties of MPI codes depend heavily on it.
Identification of the interconnect technology in the Information System
A single unambiguous name is needed to identify the available cluster interconnects via the information system, for use in GlueHostApplicationSoftwareRunTimeEnvironment.
Resource Reservation
Reservation of resources for MPI jobs. MPI jobs should not share a node with a serial job. For MPI applications that use the interconnect intensively, sharing the node with a serial job or with another MPI job is not an option.
