This section explains how SALOME is used with the batch managers that operate clusters. The objective is to run a SALOME application with a command script on a cluster, starting from a SALOME session running on the user’s personal machine. The script contains the tasks that the user wants SALOME to execute; the most usual task is to start a YACS scheme.
The start principle is as follows: starting from a first SALOME session, a SALOME application is started on a cluster using the batch manager. There are therefore two SALOME installations: one on the user’s machine and the other on the cluster. The user must have an account on the cluster with read/write access, and the connection protocol (rsh or ssh) must be correctly configured from the user’s own station.
The remainder of this chapter describes the different run steps. Firstly, a SALOME application is run on the user’s machine, using a CatalogResources.xml file that contains the description of the target batch machine (see Description of the cluster using the CatalogResources.xml file). The user then calls the SALOME Launcher service to run the application on the batch machine, describing the input and output files for the SALOME application running in batch and the Python script to be run (see Using the Launcher service). This service then starts the SALOME application defined in the CatalogResources.xml file on the batch machine and executes the Python command file (see SALOME on the batch cluster).
The CatalogResources.xml file contains the description of the different distributed calculation resources (machines) that SALOME can use to launch its containers. It can also contain the description of clusters administered by batch managers.
The following is an example description of a cluster:
<machine hostname="clusteur1"
         alias="frontal.com"
         protocol="ssh"
         userName="user"
         mode="batch"
         batch="lsf"
         mpi="prun"
         appliPath="/home/user/applis/batch_exemples"
         batchQueue="mpi1G_5mn_4p"
         userCommands="ulimit -s 8192"
         preReqFilePath="/home/ribes/SALOME5/env-prerequis.sh"
         OS="LINUX"
         CPUFreqMHz="2800"
         memInMB="4096"
         nbOfNodes="101"
         nbOfProcPerNode="2"/>
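In this description, the mode="batch" attribute tells SALOME that the machine is managed by a batch system, batch="lsf" names the batch manager installed on the cluster, mpi="prun" names the MPI implementation to be used and appliPath gives the path of the SALOME application already installed on the cluster (see SALOME on the batch cluster). The hostname value ("clusteur1" here) is the name under which the resource is known to SALOME; it is the name used in the MachineParameters structure further below.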
The Launcher service is a CORBA server started by the SALOME kernel. It is described in the SALOME_ContainerManager.idl file of the kernel; its interface is as follows:
interface SalomeLauncher
{
  long submitJob( in string xmlExecuteFile,
                  in string clusterName ) raises (SALOME::SALOME_Exception);
  long submitSalomeJob( in string fileToExecute,
                        in FilesList filesToExport,
                        in FilesList filesToImport,
                        in BatchParameters batch_params,
                        in MachineParameters params ) raises (SALOME::SALOME_Exception);

  string queryJob( in long jobId, in MachineParameters params ) raises (SALOME::SALOME_Exception);
  void deleteJob( in long jobId, in MachineParameters params ) raises (SALOME::SALOME_Exception);
  void getResultsJob( in string directory, in long jobId, in MachineParameters params )
                      raises (SALOME::SALOME_Exception);

  boolean testBatch( in MachineParameters params ) raises (SALOME::SALOME_Exception);

  void Shutdown();
  long getPID();
};
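Note that the interface offers two submission methods: submitJob, which takes an XML file describing the execution, and submitSalomeJob, which takes a Python script together with the lists of input and output files. The rest of this section describes the second method.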
The submitSalomeJob method launches a SALOME application on a batch manager. This method returns a job identifier that is used in the queryJob, deleteJob and getResultsJob methods.
The following is an example of how to use this method:
# Initialisation
import os
import Engines
import orbmodule
import SALOME
clt = orbmodule.client()
cm = clt.Resolve('SalomeLauncher')
# The python script that will be launched on the cluster
script = '/home/user/Dev/Install/BATCH_EXEMPLES_INSTALL/tests/test_Ex_Basic.py'
# Preparation of arguments for submitSalomeJob
filesToExport = []
filesToImport = ['/home/user/applis/batch_exemples/filename']
# BatchParameters: batch directory ('' = default), maximum duration,
# required memory ('' = default) and number of processors
batch_params = Engines.BatchParameters('', '00:05:00', '', 4)
# MachineParameters: 'clusteur1' must match the hostname declared in
# CatalogResources.xml; 'prun' and 'lsf' match its mpi and batch fields
params = Engines.MachineParameters('','clusteur1','','','','',[],'',0,0,1,1,0,'prun','lsf','','',4)
# Using submitSalomeJob
jobId = cm.submitSalomeJob(script, filesToExport, filesToImport, batch_params, params)
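Since the Launcher methods may raise SALOME::SALOME_Exception (which is why SALOME is imported above), the submission can be guarded. The following is a minimal sketch, assuming the standard details.text field of the SALOME exception structure:

# Guarded submission: a SALOME_Exception is raised, for example, when
# the cluster name does not match any entry of CatalogResources.xml
try:
    jobId = cm.submitSalomeJob(script, filesToExport, filesToImport,
                               batch_params, params)
except SALOME.SALOME_Exception, ex:
    print 'submission failed:', ex.details.text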
The following is a description of the different arguments of submitSalomeJob:

- fileToExecute: the Python script that will be executed in the SALOME application on the cluster.
- filesToExport: list of local files to be copied into the run directory on the cluster.
- filesToImport: list of result files to be copied back from the cluster to the user’s machine when getResultsJob is called.
- batch_params: structure containing the batch directory, the maximum duration, the required memory and the number of processors.
- params: description of the target machine; its hostname field must match an entry of the CatalogResources.xml file.
The queryJob method should be used to determine the state of the job. There are three possible states, namely waiting, running and terminated; the terminated state is reported by the string 'DONE'. The following is an example of how this method is used:
status = cm.queryJob(jobId, params)
print jobId, ' ', status
while status != 'DONE':
    os.system('sleep 10')
    status = cm.queryJob(jobId, params)
    print jobId, ' ', status
The job identifier supplied by the submitSalomeJob method is used in this method together with the params structure.
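As a side note, the same polling loop can be written with Python’s standard time module rather than shelling out to the sleep command; a minimal equivalent sketch:

import time

status = cm.queryJob(jobId, params)
while status != 'DONE':
    # wait 10 seconds between two queries of the batch manager
    time.sleep(10)
    status = cm.queryJob(jobId, params)
print jobId, ' ', status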
Finally, the getResultsJob method must be used to retrieve application results. The following is an example of how to use this method:
cm.getResultsJob('/home/user/Results', jobId, params)
The first argument contains the directory in which the user wants to retrieve the results. In addition to the files listed in filesToImport, the user automatically receives the logs from the SALOME application and from the different containers that were started.
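Putting the previous fragments together, a complete run, from submission to the retrieval of the results, looks as follows (same assumptions as above: 'clusteur1' is declared in CatalogResources.xml and the script path is valid):

import os
import Engines
import orbmodule
import SALOME

clt = orbmodule.client()
cm = clt.Resolve('SalomeLauncher')

# Job description (see the explanations of the arguments above)
script = '/home/user/Dev/Install/BATCH_EXEMPLES_INSTALL/tests/test_Ex_Basic.py'
filesToExport = []
filesToImport = ['/home/user/applis/batch_exemples/filename']
batch_params = Engines.BatchParameters('', '00:05:00', '', 4)
params = Engines.MachineParameters('','clusteur1','','','','',[],'',0,0,1,1,0,'prun','lsf','','',4)

# Submit, poll until termination, then retrieve the results
jobId = cm.submitSalomeJob(script, filesToExport, filesToImport, batch_params, params)
status = cm.queryJob(jobId, params)
while status != 'DONE':
    os.system('sleep 10')
    status = cm.queryJob(jobId, params)
cm.getResultsJob('/home/user/Results', jobId, params)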
For the moment, SALOME does not provide a service for automatically installing the platform from the user’s personal machine. SALOME (KERNEL + modules) and a SALOME application therefore have to be installed on the cluster beforehand. In the example used in this documentation, the application is installed in the directory /home/user/applis/batch_exemples.
When the submitSalomeJob method is called, SALOME creates a run directory under $HOME/Batch, named after the date of the run. The various input files are copied into this directory.
Before SALOME applications can be run, the batch manager must authorise certain functionalities that SALOME requires.
SALOME runs several threads for each CORBA server that is started. Some batch managers limit the number of threads to a value that is too small, or configure the thread stack with a size that is too large. In our example, the user fixes the size of the thread stack through the userCommands field in the CatalogResources.xml file.
During the session, SALOME starts processes on the machines allocated by the batch manager; the batch manager must therefore authorise this. Finally, SALOME relies on dynamic libraries and the dlopen function; the system must allow this.