Users schedule their jobs to run on Aaditya HPC cluster by submitting them through Platform LSF.
Most production computing jobs run in batch queues on the 790-teraflops Aaditya high-performance computing (HPC) system.
Use of login nodes
Users can run short, non-memory-intensive processes interactively on the system's login nodes. These include tasks such as text editing or running small serial scripts or programs.
You can compile models and programs, but limit compilation to no more than eight (8) simultaneous processes. This is typically controlled by the argument to the "-j" option of the GNU make command.
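For example, a parallel build that stays within the eight-process limit looks like this (the build target is whatever your project's Makefile provides):

```shell
# Build with at most 8 parallel compilation processes,
# the maximum permitted on the login nodes.
make -j 8
```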
All tasks that you run on login nodes are run "at risk." If any task or multiple concurrent tasks being run by an individual user consumes excessive resources, the task or tasks will be killed and you will be notified.
Do not run programs or models that consume excessive amounts of CPU time, more than a few GB of memory, or excessive I/O resources. Instead, use the compute nodes within the Aaditya HPC cluster.
Select the most appropriate queue for each job and provide accurate wall-clock times in your job script. This will help us fit your job into the earliest possible run opportunity.
Take the system's usable memory per node into account when configuring your job script so that your job performs as well as possible.
- Interactive jobs should not be run on login nodes.
- Login nodes are for compiling, editing, and submitting jobs.
- Use utility nodes for interactive and graphically intensive jobs.
- Access utility nodes as follows:
From any login node type
ssh -X iitmutil01 (or iitmutil02, iitmutil03, iitmutil04)
See Platform LSF examples for additional sample scripts.
To submit a batch job, use the command bsub with the redirect sign (<) and the name of your batch script file.
bsub < script_name
We recommend specifying your bsub options inside the batch script file rather than as numerous individual command-line arguments.
Include these options in your script:
-R with "span[ptile=n]" for tasks per node
-n number of tasks
-w dependency_expression (if applicable)
-B (if you want to receive an email when the job starts)
-N (if you want to receive the job report by email when the job finishes)
Use the same name for your output and error files if you want the data stored in a single file rather than separately.
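Putting the options above together, a hypothetical script header might look like the following. The job ID in the dependency expression, the task counts, and the file names are all placeholders:

```shell
#BSUB -n 32                 # number of tasks
#BSUB -R "span[ptile=16]"   # 16 tasks per node
#BSUB -w "done(12345)"      # start only after job 12345 finishes successfully
#BSUB -B                    # email me when the job starts
#BSUB -N                    # email me the job report when the job finishes
#BSUB -o myjob.%J.log       # same file name for output...
#BSUB -e myjob.%J.log       # ...and error, so both streams go to one file
```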
Loading modules in a batch script
Users sometimes need to execute module commands from within a batch job—to load an application such as NCL, for example, or to load or remove other modules.
To ensure that the module commands are available, initialize the module system near the top of your batch script. With the standard environment-modules package, the initialization lines are typically:

In a tcsh script:
source ${MODULESHOME}/init/tcsh
In a bash script:
. ${MODULESHOME}/init/bash
Once that is included, you can add the module purge command if you need to and then load just the modules that are needed to establish the software environment that your job requires.
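For example, a batch script that resets the environment and then loads only what the job needs might include the following (the module names here are illustrative, not a list of modules guaranteed to exist on the system):

```shell
# Start from a clean environment, then load only the required modules
module purge
module load intel    # compiler module (example name)
module load ncl      # NCL, as mentioned above (example name)
```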
Batch script for pure MPI job
Here is a batch script example that will use four nodes (16 MPI tasks per node) for six minutes on Aaditya HPC in the regular queue.
# LSF batch script to run an MPI application
#BSUB -P project_code # project code
#BSUB -W 00:06 # wall-clock time (hrs:mins)
#BSUB -n 64 # number of tasks in job
#BSUB -R "span[ptile=16]" # run 16 MPI tasks per node
#BSUB -J myjob # job name
#BSUB -o myjob.%J.out # output file name in which %J is replaced by the job ID
#BSUB -e myjob.%J.err # error file name in which %J is replaced by the job ID
#BSUB -q regular # queue
# run the executable (replace executable_name with your program)
mpirun.lsf ./executable_name

Troubleshooting
If your job fails immediately with this error message:

mpirun.lsf: LSF_PJL_TYPE is undefined. Exit ...
What to do
Use the bsub command as described here.
A common mistake that leads to this error message is trying to execute ./job.lsf rather than bsub < job.lsf. The error results when the mpirun.lsf command in a job script is executed in the absence of environment variables that are provided when you submit a job correctly.
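In other words, assuming your script is named job.lsf:

```shell
./job.lsf        # WRONG: runs the script directly, without LSF's environment
bsub < job.lsf   # RIGHT: submits the script so mpirun.lsf gets its variables
```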
If your submission is rejected outright, LSF prints a message like this:

Your job has been rejected.
You must declare a wall-clock time with the bsub -W directive.
If you have specified this directive and your job is still
rejected, verify that you have not exceeded your GLADE quotas
(use "gladequota") and that you are properly redirecting job
file input (e.g., bsub < jobfile).
To take advantage of backfill, the declared wall-clock time
should be less than the maximum wall-clock limit for the queue
to which you are submitting the job.
What to do
Check each factor noted in the message. Simply forgetting to include the < in bsub < jobfile is a common mistake. Review the documentation above regarding how to submit jobs.
Jobs submitted with 32 tasks per node using batch option -R "span[ptile=32]" are killed with this error message:
ERROR: 0031-758 AFFINITY: [ys0116] Oversubscribe: 32 tasks in total, each task requires 1 resource, but
there are only 16 available resource. Affinity cannot be applied
What to do
Submit your job with environment variable MP_TASK_AFFINITY set to cpu as shown here:
export MP_TASK_AFFINITY=cpu (for bash/sh/ksh users)
setenv MP_TASK_AFFINITY cpu (for csh/tcsh users)
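For example, a bash job script that runs 32 tasks per node might set the variable just before launching the program (the executable name is a placeholder):

```shell
#BSUB -R "span[ptile=32]"    # 32 MPI tasks per node
export MP_TASK_AFFINITY=cpu  # avoid the 16-resource affinity limit
mpirun.lsf ./executable_name
```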
To get information about your unfinished jobs, use the command bjobs.
To get information regarding unfinished jobs for a user group, add -u and the group name.
To list unfinished jobs for all users, use all in place of the group name.
bjobs -u all
You can suppress the indented continuation lines that show each individual node used in large jobs by filtering out lines that begin with a space:
bjobs -u all | grep -v "^ "
For information about your own unfinished jobs in a queue, use -q and the queue name.
bjobs -q queue_name
For a summary of batch jobs that have already run, use bhist.
Other useful commands include:
bpeek – Allows you to watch the error and output files of a running batch job. This is particularly useful for monitoring a long-running job; if the job isn't running as you expected, you may want to kill it to preserve computing time and resources.
bkill – Removes a queued job from LSF, or stops and removes a running job. Use it with the Job ID, which you can get from the output of bjobs.
tail – When used with the -f option to monitor a log file, this enables you to view the log as it changes. To use it in the Aaditya HPC environment, also disable inotify as shown in this example to ensure that your screen output gets updated properly.
tail ---disable-inotify -f /path/for/file/filename.log