Vector Cluster Quick Guide

Modules

  • See available modules: module spider
  • See available versions of a module: module spider <name>
  • Load a specific version of a module: module load <name>/<version>
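
For example (the module name and version below are hypothetical; use the ones module spider reports on the cluster):

module spider cuda      # list the available versions of the cuda module
module load cuda/12.1   # load one specific version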

Partitions

Scheduled jobs will run on one of the following partitions:

Partition   GPU Type              GPUs/node   CPUs/node   RAM/node   Preemptible
t4          T4 (16GB VRAM)        8           32          152GB      Yes
rtx6000     RTX6000 (24GB VRAM)   4           40          172GB      Yes
a40         A40 (48GB VRAM)       4           32          172GB      Yes
cpu         -                     -           64          300GB      No

For the full list, run sinfo.
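
To target a specific partition, a job can use the standard --partition directive. This is only a sketch, assuming the partition names in the table above are valid values:

#SBATCH --partition=a40   # one of the partitions listed above
#SBATCH --gres=gpu:1      # one GPU on that partition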

Jobs

Queued Jobs

Running squeue [-u <username>] prints the running and queued jobs, optionally filtered to one user.

Scheduling Jobs

sbatch can be used to submit a job to the cluster. The following is a basic job script that runs main.py on one GPU:

#!/bin/bash

#SBATCH --job-name=JobName    # name of the job
#SBATCH --cpus-per-task=16    # number of CPU cores
#SBATCH --mem=16G             # minimum amount of memory
#SBATCH --gres=gpu:1          # generic resources (here: one GPU)
#SBATCH --time=08:00:00       # time limit
#SBATCH --mail-type=ALL       # notify on state change: BEGIN, END, FAIL or ALL
#SBATCH --output=slurm-%j.log # where logs will be directed (%j = job ID)
#SBATCH --open-mode=append    # to avoid rewrites when preempted

python main.py 

The directives can also be passed to sbatch as flags.
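
For example, assuming the script above is saved as job.sh, the same job could be submitted with:

sbatch --job-name=JobName --cpus-per-task=16 --mem=16G --gres=gpu:1 --time=08:00:00 job.sh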

GPUs

GPU requirements are expressed through the --gres flag. E.g.:

  • --gres=gpu:2 will ask for two GPUs of any type.
  • --gres=gpu:a40:1 will ask for one A40 GPU.

Parallel Runs

You can distribute several individual runs (tasks) across multiple nodes. They might not all run in parallel, however, if there are not enough resources. The following is an example that places 4 tasks on each node:

#SBATCH --nodes=2       # number of nodes to request
#SBATCH --ntasks=8      # total number of parallel runs
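
A minimal sketch of a full script built on these directives, assuming main.py uses the standard SLURM_PROCID variable to decide which piece of work it handles (the work split itself is up to the program):

#!/bin/bash
#SBATCH --nodes=2        # number of nodes to request
#SBATCH --ntasks=8       # total number of parallel runs
#SBATCH --mem-per-cpu=4G # memory per allocated CPU

# srun launches one copy of the command per task (8 here), spread across the nodes;
# each copy sees its own SLURM_PROCID (0-7) and can pick its share of the work from it.
srun python main.py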

Interactive Sessions

srun, which accepts most of the same flags as sbatch, can be used to initiate an interactive session:

srun [OPTIONS] --pty /bin/bash
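
For example, a two-hour interactive session with one GPU (the resource values are illustrative):

srun --gres=gpu:1 --cpus-per-task=4 --mem=8G --time=2:00:00 --pty /bin/bash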

Limits

Storage

  • Home Directory: 50GB
  • Scratch Directory: 100GB

GPUs

Submitted jobs will fall under one of the following QoS levels (for most users), which dictate how long they can run:

  • Normal (1 or 2 GPUs): 16 Hours
  • M (3 or 4 GPUs): 12 Hours
  • M2 (5-8 GPUs): 8 Hours
  • M3 (9-16 GPUs): 4 Hours

Interactive Sessions

  • Only one GPU per session.
  • GPU sessions can only run up to 8 hours.

Preemption

The job scheduler may preempt a running job in favour of higher-priority work. If it is preempted, it is automatically re-queued and will keep running until it hits its time limit. Proper checkpointing is crucial for the process to resume where it left off.

The scheduler mounts a directory under /checkpoint/${USER}/${SLURM_JOB_ID} with 50GB of storage for checkpointing. The directory is purged after 48 hours (or 7 days; the docs are contradictory on this). We can create a symlink to simplify the path:

ln -sfn /checkpoint/${USER}/${SLURM_JOB_ID} $PWD/checkpoint

The program can then probe the directory and resume from the latest checkpoint it can find.
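
A minimal sketch of the resulting job script, assuming main.py accepts a hypothetical --checkpoint-dir argument and saves and loads its state there:

# inside the sbatch script
ln -sfn /checkpoint/${USER}/${SLURM_JOB_ID} $PWD/checkpoint

# --checkpoint-dir is a hypothetical argument: the program is expected to write
# checkpoints there periodically and, on start-up, load the newest one if any exist.
python main.py --checkpoint-dir $PWD/checkpoint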

Automatic Re-queuing on Time Limit

By adding the following directive to the job:

#SBATCH --signal=B:USR1@60

the batch script will receive the SIGUSR1 signal 60 seconds before the job is killed at its time limit. We can then re-queue the job when the signal is received by adding the following routine to the job script:

resubmit() {
    echo "Resubmitting job"        # log the re-queue in the output file
    scontrol requeue $SLURM_JOB_ID # put this job back in the queue
    exit 0
}
trap resubmit SIGUSR1              # run resubmit when SIGUSR1 arrives
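
Putting the pieces together, a sketch of a self-requeuing job script. The program is run in the background followed by wait, because bash delays a trap while a foreground child is running; --requeue is the standard directive that marks the job as eligible for re-queuing:

#!/bin/bash
#SBATCH --job-name=JobName
#SBATCH --gres=gpu:1
#SBATCH --time=8:00:00
#SBATCH --signal=B:USR1@60    # deliver SIGUSR1 to the batch shell 60s before the time limit
#SBATCH --requeue             # allow the job to be re-queued
#SBATCH --open-mode=append    # keep appending to the same log across requeues

resubmit() {
    echo "Resubmitting job"
    scontrol requeue $SLURM_JOB_ID
    exit 0
}
trap resubmit SIGUSR1

# Run the program in the background and wait, so the shell can handle SIGUSR1;
# a foreground child would delay the trap until it exits.
python main.py &
wait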