- See available modules: `module spider`
- See available versions of a module: `module spider <name>`
- Load a specific version of a module: `module load <name>/<version>`
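For example, assuming the cluster provides a `python` module (the module name and version here are illustrative, not guaranteed):

```bash
module spider python        # list all available versions of the python module
module load python/3.10     # load one specific version into the environment
```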
Scheduled jobs will run on one of the following partitions:
| Partition | GPU Type | GPUs/node | CPUs/node | RAM/node | Preempted |
|---|---|---|---|---|---|
| t4 | T4 (16GB VRAM) | 8 | 32 | 152GB | Yes |
| rtx6000 | RTX6000 (24GB VRAM) | 4 | 40 | 172GB | Yes |
| a40 | A40 (48GB VRAM) | 4 | 32 | 172GB | Yes |
| cpu | - | - | 64 | 300GB | No |
For the full list, run `sinfo`.
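For example, to inspect only some of the partitions from the table above (partition names as listed there):

```bash
sinfo -p a40        # node states and availability for the a40 partition only
sinfo -p t4,cpu     # multiple partitions can be given as a comma-separated list
```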
Running `squeue [-u <username>]` will print out running and scheduled jobs.
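For instance, to see only your own jobs:

```bash
squeue -u $USER              # all of your pending and running jobs
squeue -u $USER -t RUNNING   # restrict the list to jobs that are currently running
```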
`sbatch` can be used to submit a job to the cluster. The following is a basic job script that will run `main.py` with a GPU:
```bash
#!/bin/bash
#SBATCH --job-name=JobName      # name of the job
#SBATCH --cpus-per-task=16      # number of CPU cores
#SBATCH --mem=16G               # minimum amount of memory
#SBATCH --gres=gpu:1            # generic resources (here: one GPU of any type)
#SBATCH --time=8:00:00          # time limit
#SBATCH --mail-type=ALL         # notify on state change: BEGIN, END, FAIL or ALL
#SBATCH --output=slurm-%j.log   # where logs will be directed
#SBATCH --open-mode=append      # append to the log instead of overwriting it when preempted

python main.py
```
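Assuming the script above is saved as `job.sh` (the filename is arbitrary), it is submitted with:

```bash
sbatch job.sh
```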
The directives can also be passed to `sbatch` as flags.
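For example, the same resources can be requested entirely on the command line; command-line flags take precedence over the `#SBATCH` lines inside the script (`job.sh` is again a placeholder filename):

```bash
sbatch --cpus-per-task=16 --mem=16G --gres=gpu:1 --time=8:00:00 job.sh
```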
GPU requirements are expressed through the `--gres` flag. For example, `--gres=gpu:2` will ask for two GPUs of any type, while `--gres=gpu:a40:1` will ask for a single A40 GPU.
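Inside a running job you can sanity-check what was actually allocated; Slurm normally exposes the assigned devices through `CUDA_VISIBLE_DEVICES`:

```bash
echo "Allocated GPU indices: $CUDA_VISIBLE_DEVICES"   # GPUs visible to this job
nvidia-smi                                            # model, memory and utilization per GPU
```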
You can distribute several individual runs or tasks across several nodes. They might, however, not run in parallel due to resource limits. The following directives request 2 nodes and 8 tasks in total, i.e. 4 tasks per node:

```bash
#SBATCH --nodes=2    # number of nodes to request
#SBATCH --ntasks=8   # total number of parallel runs
```
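Within such a batch script, Slurm's `srun` launcher starts one copy of the given command per task (8 copies here, 4 per node). Each copy can read the `SLURM_PROCID` environment variable (0-7) to decide which piece of work to do. A minimal sketch, with `main.py` expected to read that variable itself:

```bash
#!/bin/bash
#SBATCH --nodes=2            # number of nodes to request
#SBATCH --ntasks=8           # total number of parallel runs
#SBATCH --cpus-per-task=4    # CPU cores per run (illustrative)
#SBATCH --mem=16G            # memory per node (illustrative)

# Launch 8 instances of main.py; each one sees its own SLURM_PROCID in its environment.
srun python main.py
```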
`srun`, which accepts all the same flags as `sbatch`, can be used to initiate an interactive session:

```bash
srun [OPTIONS] --pty /bin/bash
```
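For example, a short single-GPU interactive session on the a40 partition (all values illustrative):

```bash
srun -p a40 --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=2:00:00 --pty /bin/bash
```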
Per-user storage quotas:
- Home Directory: 50GB
- Scratch Directory: 100GB
Submitted jobs will fall under one of the following QoS levels (for most users), which dictate how long they can run:
- Normal (1 or 2 GPUs): 16 Hours
- M (3 or 4 GPUs): 12 Hours
- M2 (5-8 GPUs): 8 Hours
- M3 (9-16 GPUs): 4 Hours
Interactive sessions are further restricted:
- Only one GPU per session.
- GPU sessions can only run up to 8 hours.
The job scheduler may preempt a running task due to job priorities. If a job is preempted, it will automatically be re-queued and run again until it hits its time limit. Proper checkpointing is crucial for the process to resume where it left off.
The scheduler will mount a directory under `/checkpoint/${USER}/${SLURM_JOB_ID}` with 50GB of storage for checkpointing. The directory will be purged after 48 hours (or 7 days; the docs are contradictory). We can create a symlink to simplify the path:

```bash
ln -sfn /checkpoint/${USER}/${SLURM_JOB_ID} $PWD/checkpoint
```
The program can then probe the directory and resume from the latest checkpoint it can find.
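A minimal sketch of that probing step, assuming the program writes timestamped `*.pt` files into the symlinked directory and accepts a hypothetical `--resume` flag:

```bash
# pick the most recently written checkpoint, if any exist
LATEST=$(ls -t "$PWD/checkpoint"/*.pt 2>/dev/null | head -n 1)

if [ -n "$LATEST" ]; then
    python main.py --resume "$LATEST"   # --resume is a hypothetical flag of main.py
else
    python main.py                      # first run: no checkpoint yet
fi
```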
By adding the following directive to the job:

```bash
#SBATCH --signal=B:USR1@60
```

the process will receive the `SIGUSR1` signal 60 seconds before it is preempted. We can then re-queue the job when the signal is received by adding the following routine to the job script:
```bash
resubmit() {
    echo "Resubmitting job"
    scontrol requeue $SLURM_JOB_ID   # put this job back in the queue under the same job ID
    exit 0
}
trap resubmit SIGUSR1                # run resubmit() when SIGUSR1 is received
```
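Putting it together, a preemption-aware job script might look like the sketch below. Note that bash defers running a trap while a foreground command is executing, so the usual pattern is to start the workload in the background and `wait` for it; `--checkpoint-dir` is a hypothetical flag that `main.py` would have to implement:

```bash
#!/bin/bash
#SBATCH --job-name=JobName
#SBATCH --gres=gpu:1
#SBATCH --time=8:00:00
#SBATCH --output=slurm-%j.log
#SBATCH --open-mode=append
#SBATCH --signal=B:USR1@60    # send SIGUSR1 to the batch shell 60 seconds before preemption

# per-job checkpoint directory, symlinked for a stable path
ln -sfn /checkpoint/${USER}/${SLURM_JOB_ID} $PWD/checkpoint

resubmit() {
    echo "Resubmitting job"
    scontrol requeue $SLURM_JOB_ID
    exit 0
}
trap resubmit SIGUSR1

# run in the background and wait, so the trap can fire while training is still running
python main.py --checkpoint-dir "$PWD/checkpoint" &   # --checkpoint-dir is hypothetical
wait
```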