MPI Support

Site configuration for MPI

This document is intended to help EGEE site administrators support MPI deployments properly. The recommendations for supporting MPI on EGEE were drafted by the Technical Coordination Group in MPI and can be found at this link. Currently, only manual installation and configuration are described, so there are no YAIM modules or instructions for that tool here.


Installation

MPI

It is up to you to decide which version of MPI you will support. At IFCA we support Open MPI 1.2.5, compiled with the Intel C/Fortran compilers. You can build your own package by downloading the source RPM from the Open MPI downloads page. If you wish to use a special interconnect, such as InfiniBand (IB), make sure to enable it at compile time by passing the --with-openib option to the configure script. Some more details about the configuration of InfiniBand switches for gLite clusters can be found here.

Also, it is recommended to enable the Torque TM subsystem (thus getting proper accounting numbers) by passing the --enable-mca-dso=pls-tm option. Furthermore, if you wish to use the Intel Compiler, you must define the CC, CXX, FC and F77 variables to point to the correct compilers.

# export F77=ifort
# export FC=ifort
# export CC=icc
# export CXX=icpc
  
# rpmbuild -ba openmpi-1.2.5.spec -D 'name openmpi' -D '_packager aloga' -D 'configure_options --with-openib --enable-mca-dso=pls-tm'

Once compiled, you can deploy this package to all of your WNs.
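
After deployment, it can be worth confirming that the build really contains the components requested at configure time. The following is a minimal sketch, assuming that ompi_info lists each component on a line of the form "MCA <framework>: <component> (...)" and that, as in the configure option above, the Torque launcher belongs to the pls framework. has_mca_component is a hypothetical helper name, not part of Open MPI.

```shell
#!/bin/bash
# Sketch: check that an Open MPI build lists the expected MCA components.
# Assumption: ompi_info prints lines such as "MCA btl: openib (...)".

# has_mca_component FRAMEWORK COMPONENT
# Reads ompi_info-style text on stdin; succeeds if the component is listed.
has_mca_component() {
    grep -Eq "MCA[[:space:]]+$1:[[:space:]]+$2([[:space:]]|\$)"
}

# Typical use on a worker node (requires ompi_info in the PATH):
#   ompi_info | has_mca_component pls tm     && echo "Torque TM launcher: ok"
#   ompi_info | has_mca_component btl openib && echo "InfiniBand BTL: ok"
```

If either check fails, the configure_options passed to rpmbuild are the first thing to review.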

MPI-Start

MPI-Start is a recommended solution to hide the implementation details for the submitted jobs. It was developed inside the Int.EU.Grid project and can be found here. It should be installed on every node involved with MPI.

Configuration

Batch system

PBS-based schedulers such as Torque do not handle CPU allocation well, because they assume a homogeneous system in which every node (machine) has the same number of CPUs. $cpu_per_node can be set in the jobmanager, but the value has to be the same for all the machines. Furthermore, PBS does not seem to recognise that, in a farm of dual-CPU machines with a process running on one CPU of each machine, half of the capacity is still free for more jobs. For these reasons, some special configuration must be added to the batch system.

Torque

Edit your configuration file (usually /var/spool/pbs/torque.cfg) and add a line containing:

 SUBMITFILTER /var/spool/pbs/submit_filter.pl 

Then download the submit_filter.pl from here and put it in the above location.

This filter modifies the script coming from the submission, rewriting the -l nodes=XX option into specific requests based on the information given by the pbsnodes -a command.

The submit filter is crucial. Without it, the job is submitted to a single node, where all the MPI processes are allocated, instead of being distributed across several nodes.
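
To make the rewriting step more concrete, here is a simplified sketch of what such a filter does. It is not the submit_filter.pl referenced above: it assumes every node offers the same number of cores (passed as a parameter), whereas the real filter derives the capacity from pbsnodes -a. The function names node_spec and filter_script are hypothetical. The sketch relies only on the fact that Torque feeds the job script to the filter on standard input and takes the filtered script from standard output.

```shell
#!/bin/bash
# Illustrative sketch only; NOT the submit_filter.pl distributed for EGEE sites.
# Assumption: all worker nodes have the same number of cores.

# node_spec NPROCS CORES_PER_NODE
# Turn a flat process count into a Torque node specification.
node_spec() {
    local nprocs="$1" cores="$2" full rest spec=""
    full=$(( nprocs / cores ))
    rest=$(( nprocs % cores ))
    if [ "$full" -gt 0 ]; then spec="${full}:ppn=${cores}"; fi
    if [ "$rest" -gt 0 ]; then spec="${spec:+${spec}+}1:ppn=${rest}"; fi
    echo "nodes=${spec}"
}

# filter_script CORES_PER_NODE
# Copy a job script from stdin to stdout, rewriting "#PBS -l nodes=N" lines.
filter_script() {
    local cores="$1" line n
    while IFS= read -r line || [ -n "$line" ]; do
        n=$(printf '%s\n' "$line" |
            sed -n 's/^#PBS[[:space:]]\{1,\}-l[[:space:]]\{1,\}nodes=\([0-9]\{1,\}\)$/\1/p')
        if [ -n "$n" ]; then
            echo "#PBS -l $(node_spec "$n" "$cores")"
        else
            printf '%s\n' "$line"
        fi
    done
}

# When installed as a filter, the script would end with a call such as:
#   filter_script 4
```

With 4 cores per node, a request for nodes=10 becomes nodes=2:ppn=4+1:ppn=2, so the ten processes are spread over three machines instead of being packed onto one.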


Warning: gLite updates tend to rewrite torque.cfg. Check that the submit filter line is still there after performing an update.
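
Since the line can disappear silently, a small check run after each update (or from cron) may help. This is a sketch under the assumption that the paths are the ones used in this document; has_directive and check_submit_filter are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: verify that a configuration file still contains a given directive.
# The default path is the one used in this document; adjust it for your site.

# has_directive FILE KEYWORD VALUE
# Succeeds if FILE has an uncommented line whose first two fields are KEYWORD VALUE.
has_directive() {
    awk -v k="$2" -v v="$3" '$1 == k && $2 == v { found = 1 } END { exit !found }' "$1"
}

# check_submit_filter [FILE]
check_submit_filter() {
    local cfg="${1:-/var/spool/pbs/torque.cfg}"
    if has_directive "$cfg" SUBMITFILTER /var/spool/pbs/submit_filter.pl; then
        echo "OK: submit filter configured in $cfg"
    else
        echo "WARNING: SUBMITFILTER line missing from $cfg" >&2
        return 1
    fi
}
```

Calling check_submit_filter with no argument inspects the default torque.cfg location. The same has_directive helper can be pointed at other keyword/value settings, for example the Maui parameters described in the next section.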

Maui

Edit your configuration file (usually under /var/spool/maui/maui.cfg) and check that it contains the following lines:

 ENABLEMULTINODEJOBS TRUE
 ENABLEMULTIREQJOBS TRUE

These parameters allow a job to span more than one node and to specify multiple independent resource requests.

Worker nodes

You should consider either using a shared storage area (e.g. $HOME or a scratch directory) on all the nodes, or setting up passwordless SSH access (e.g. host-based access) between them. Each option has its pros and cons, so it is up to you which one to choose.

Information System

You should publish some values to let the world know which flavour of MPI you support, as well as the interconnect and some other details. Everything related to MPI should be published as GlueHostApplicationSoftwareRunTimeEnvironment in the corresponding sections.

MPI-START

If you support MPI-START, publish it with:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START

MPI Flavour

Publish which flavour and version of MPI you are using. Optionally, you can also specify the compiler. MPI flavours are MPICH, MPICH2, LAM and OPENMPI. For example:

GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI 
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.5
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.5-ICC

MPI Interconnects

If you have a special interconnect (such as InfiniBand or Myrinet), you can publish it as follows:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-Infiniband 

MPI (other)

If you have a shared filesystem area between WNs, you can publish with:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME
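
To keep the published tags consistent with each other, they can be generated from a handful of site parameters. The sketch below only prints the lines described in this section; rte_tags is a hypothetical helper, not part of gLite or MPI-START, and where the output is pasted depends on how your information provider is configured.

```shell
#!/bin/bash
# Sketch: print the GlueHostApplicationSoftwareRunTimeEnvironment lines for a site.
# rte_tags is a hypothetical helper name used for illustration only.

# rte_tags FLAVOUR VERSION [COMPILER] [INTERCONNECT] [shared]
rte_tags() {
    local flavour="$1" version="$2" compiler="${3:-}" interconnect="${4:-}" shared="${5:-}"
    local attr="GlueHostApplicationSoftwareRunTimeEnvironment"
    echo "$attr: MPI-START"
    echo "$attr: $flavour"
    echo "$attr: $flavour-$version"
    if [ -n "$compiler" ]; then
        echo "$attr: $flavour-$version-$compiler"
    fi
    if [ -n "$interconnect" ]; then
        echo "$attr: MPI-$interconnect"
    fi
    if [ "$shared" = "shared" ]; then
        echo "$attr: MPI_SHARED_HOME"
    fi
}
```

For the IFCA setup described above, rte_tags OPENMPI 1.2.5 ICC Infiniband shared prints the six lines shown in this section's examples.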

Job Submission

In order to invoke MPI-START you need a wrapper script that sets the environment variables that define your job. This script is generic and should not need to have significant modifications made to it.


#!/bin/bash

# Pull in the arguments.
MY_EXECUTABLE="$(pwd)/$1"
MPI_FLAVOR=$2

# Convert flavor to lowercase for passing to mpi-start.
MPI_FLAVOR_LOWER=$(echo "$MPI_FLAVOR" | tr '[:upper:]' '[:lower:]')

# Pull out the correct paths for the requested flavor.
eval MPI_PATH=`printenv MPI_${MPI_FLAVOR}_PATH`

# Ensure the prefix is correctly set.  Don't rely on the defaults.
eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
export I2G_${MPI_FLAVOR}_PREFIX

# Touch the executable.  It must exist for the shared file system check.
# If it does not, then mpi-start may try to distribute the executable
# when it shouldn't.
touch "$MY_EXECUTABLE"
chmod +x "$MY_EXECUTABLE"

# Setup for mpi-start.
export I2G_MPI_APPLICATION=$MY_EXECUTABLE
export I2G_MPI_APPLICATION_ARGS=
export I2G_MPI_TYPE=$MPI_FLAVOR_LOWER
# optional hooks
#export I2G_MPI_PRE_RUN_HOOK=mpi-hooks.sh
#export I2G_MPI_POST_RUN_HOOK=mpi-hooks.sh

# If these are set then you will get more debugging information.
#export I2G_MPI_START_VERBOSE=1
#export I2G_MPI_START_DEBUG=1

# Invoke mpi-start.
$I2G_MPI_START
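
The way the wrapper maps its second argument onto environment variables can be easier to follow in isolation. The sketch below reproduces only that mapping and prints the result instead of exporting it. It assumes, as the wrapper does, that the site exports a variable named MPI_<FLAVOUR>_PATH in the job environment; show_mpi_mapping is a hypothetical name.

```shell
#!/bin/bash
# Sketch of the variable mapping performed by the wrapper; it prints, not exports.
# Assumption: the site exports MPI_<FLAVOUR>_PATH (for example MPI_OPENMPI_PATH).

# show_mpi_mapping FLAVOUR
show_mpi_mapping() {
    local flavour="$1" lower path
    lower=$(printf '%s' "$flavour" | tr '[:upper:]' '[:lower:]')
    # Look up the exported variable MPI_<FLAVOUR>_PATH, as the wrapper does.
    path=$(printenv "MPI_${flavour}_PATH" || true)
    echo "I2G_${flavour}_PREFIX=${path}"
    echo "I2G_MPI_TYPE=${lower}"
}
```

Running show_mpi_mapping OPENMPI on a worker node shows which installation prefix and MPI type mpi-start would receive. An empty prefix suggests that the corresponding MPI_<FLAVOUR>_PATH variable is not set on that node.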

In your JDL file, set the JobType to Normal and set NodeNumber to the number of processes you need (see the note after the example). The Executable should be your wrapper script for MPI-START (mpi-start-wrapper.sh in this case), and the Arguments are your MPI binary and the MPI flavour that it uses. MPI-START allows user-defined extensions via hooks; check the MPI-START Hook CookBook for examples. Here is an example JDL for the submission of the cpi application using 10 processes:


JobType = "Normal";
VirtualOrganisation = "dteam";
NodeNumber = 10;
Executable = "mpi-start-wrapper.sh";
Arguments = "cpi OPENMPI";
StdOutput = "cpi.out";
StdError = "cpi.err";
InputSandbox = {"cpi", "mpi-start-wrapper.sh"};
OutputSandbox = {"cpi.out", "cpi.err"};
Requirements = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
            && Member("MPI-INFINIBAND", other.GlueHostApplicationSoftwareRunTimeEnvironment)
            && Member("OPENMPI-1.2.5",  other.GlueHostApplicationSoftwareRunTimeEnvironment);


Please note that the NodeNumber variable refers to the number of CPUs you are requesting. The new EGEE MPI WG is discussing how to implement a fine-grained selection of the nodes and/or CPUs (i.e. specifying the number of processors per node and the number of nodes, not only the number of CPUs).

Known issues

MPI job support is a necessity for many application areas; however, the configuration is highly dependent on the cluster architecture. The design of mpi-start focused on making MPI job submission as independent as possible of the cluster details.

However, there are still issues to be addressed. Please feel free to send an e-mail to complete this list:

Selection of the proper combination core/CPU

Fine-grained selection of the nodes and/or cores per CPU. This affects the efficiency of the code, as the scaling properties of MPI codes depend heavily on it.

Identification of the interconnect technology in the Information System

An unambiguous name is needed to identify the available cluster interconnects through the information system, for use in GlueHostApplicationSoftwareRunTimeEnvironment.

Resource Reservation

Reservation of resources for MPI jobs. MPI jobs should not share a node with a serial job. For MPI applications that use the interconnect intensively, sharing the node with a serial job or with another MPI job is not an option.
