- See available modules: `module spider`
- See available versions of a module: `module spider <name>`
- Load a specific version of a module: `module load <name>/<version>`
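For example, assuming the cluster provides a `python` module (the module name and version here are illustrative, not guaranteed):

```bash
module spider python        # list all available versions of the python module
module load python/3.10     # load one specific version into the environment
```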
Scheduled jobs will run on one of the following partitions:
| Partition | GPU Type | GPUs/node | CPUs/node | RAM/node | Preempted |
|---|---|---|---|---|---|
| t4 | T4 (16GB VRAM) | 8 | 32 | 152GB | Yes |
| rtx6000 | RTX6000 (24GB VRAM) | 4 | 40 | 172GB | Yes |
| a40 | A40 (48GB VRAM) | 4 | 32 | 172GB | Yes |
| cpu | - | - | 64 | 300GB | No |
For the full list, run `sinfo`.
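For example, to inspect only some of the partitions from the table above (partition names as listed there):

```bash
sinfo -p a40        # node states and availability for the a40 partition only
sinfo -p t4,cpu     # multiple partitions can be given as a comma-separated list
```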
Running `squeue [-u <username>]` will print out running and scheduled jobs.
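For instance, to see only your own jobs:

```bash
squeue -u $USER              # all of your pending and running jobs
squeue -u $USER -t RUNNING   # restrict the list to jobs that are currently running
```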
`sbatch` can be used to submit a job to the cluster. The following is a basic job script that will run `main.py` with a GPU:
```bash
#!/bin/bash
#SBATCH --job-name=JobName      # name of the job
#SBATCH --cpus-per-task=16      # number of CPU cores
#SBATCH --mem=16G               # minimum amount of memory
#SBATCH --gres=gpu:1            # generic resources (here: one GPU of any type)
#SBATCH --time=8:00:00          # time limit
#SBATCH --mail-type=ALL         # notify on state change: BEGIN, END, FAIL or ALL
#SBATCH --output=slurm-%j.log   # where logs will be directed
#SBATCH --open-mode=append      # append to the log instead of overwriting it when preempted

python main.py
```
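Assuming the script above is saved as `job.sh` (the filename is arbitrary), it is submitted with:

```bash
sbatch job.sh
```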
The directives can also be passed to `sbatch` as flags.
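For example, the same resources can be requested entirely on the command line; command-line flags take precedence over the `#SBATCH` lines inside the script (`job.sh` is again a placeholder filename):

```bash
sbatch --cpus-per-task=16 --mem=16G --gres=gpu:1 --time=8:00:00 job.sh
```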
GPU requirements are expressed through the `--gres` flag. For example, `--gres=gpu:2` will ask for two GPUs of any type, while `--gres=gpu:a40:1` will ask for a single A40 GPU.
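Inside a running job you can sanity-check what was actually allocated; Slurm normally exposes the assigned devices through `CUDA_VISIBLE_DEVICES`:

```bash
echo "Allocated GPU indices: $CUDA_VISIBLE_DEVICES"   # GPUs visible to this job
nvidia-smi                                            # model, memory and utilization per GPU
```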
You can distribute several individual runs or tasks across several nodes. They might, however, not run in parallel due to resource limits. The following directives request 2 nodes and 8 tasks in total, i.e. 4 tasks per node:

```bash
#SBATCH --nodes=2    # number of nodes to request
#SBATCH --ntasks=8   # total number of parallel runs
```
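Within such a batch script, Slurm's `srun` launcher starts one copy of the given command per task (8 copies here, 4 per node). Each copy can read the `SLURM_PROCID` environment variable (0-7) to decide which piece of work to do. A minimal sketch, with `main.py` expected to read that variable itself:

```bash
#!/bin/bash
#SBATCH --nodes=2            # number of nodes to request
#SBATCH --ntasks=8           # total number of parallel runs
#SBATCH --cpus-per-task=4    # CPU cores per run (illustrative)
#SBATCH --mem=16G            # memory per node (illustrative)

# Launch 8 instances of main.py; each one sees its own SLURM_PROCID in its environment.
srun python main.py
```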
`srun`, which accepts all the same flags as `sbatch`, can be used to initiate an interactive session:

```bash
srun [OPTIONS] --pty /bin/bash
```
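For example, a short single-GPU interactive session on the a40 partition (all values illustrative):

```bash
srun -p a40 --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=2:00:00 --pty /bin/bash
```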
Per-user storage quotas:
- Home Directory: 50GB
- Scratch Directory: 100GB
Submitted jobs will fall under one of the following QoS levels (for most users), which dictate how long they can run:
- Normal (1 or 2 GPUs): 16 Hours
- M (3 or 4 GPUs): 12 Hours
- M2 (5-8 GPUs): 8 Hours
- M3 (9-16 GPUs): 4 Hours
Interactive sessions are further restricted:
- Only one GPU per session.
- GPU sessions can only run up to 8 hours.
The job scheduler may preempt a running task due to job priorities. If a job is preempted, it will automatically be re-queued and run again until it hits its time limit. Proper checkpointing is crucial for the process to resume where it left off.
The scheduler will mount a directory under `/checkpoint/${USER}/${SLURM_JOB_ID}` with 50GB of storage for checkpointing. The directory will be purged after 48 hours (or 7 days; the docs are contradictory). We can create a symlink to simplify the path:

```bash
ln -sfn /checkpoint/${USER}/${SLURM_JOB_ID} $PWD/checkpoint
```
The program can then probe the directory and resume from the latest checkpoint it can find.
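A minimal sketch of that probing step, assuming the program writes timestamped `*.pt` files into the symlinked directory and accepts a hypothetical `--resume` flag:

```bash
# pick the most recently written checkpoint, if any exist
LATEST=$(ls -t "$PWD/checkpoint"/*.pt 2>/dev/null | head -n 1)

if [ -n "$LATEST" ]; then
    python main.py --resume "$LATEST"   # --resume is a hypothetical flag of main.py
else
    python main.py                      # first run: no checkpoint yet
fi
```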
By adding the following directive to the job:

```bash
#SBATCH --signal=B:USR1@60
```

the process will receive the `SIGUSR1` signal 60 seconds before it is preempted. We can then re-queue the job when the signal is received by adding the following routine to the job script:
```bash
resubmit() {
    echo "Resubmitting job"
    scontrol requeue $SLURM_JOB_ID   # put this job back in the queue under the same job ID
    exit 0
}
trap resubmit SIGUSR1                # run resubmit() when SIGUSR1 is received
```
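Putting it together, a preemption-aware job script might look like the sketch below. Note that bash defers running a trap while a foreground command is executing, so the usual pattern is to start the workload in the background and `wait` for it; `--checkpoint-dir` is a hypothetical flag that `main.py` would have to implement:

```bash
#!/bin/bash
#SBATCH --job-name=JobName
#SBATCH --gres=gpu:1
#SBATCH --time=8:00:00
#SBATCH --output=slurm-%j.log
#SBATCH --open-mode=append
#SBATCH --signal=B:USR1@60    # send SIGUSR1 to the batch shell 60 seconds before preemption

# per-job checkpoint directory, symlinked for a stable path
ln -sfn /checkpoint/${USER}/${SLURM_JOB_ID} $PWD/checkpoint

resubmit() {
    echo "Resubmitting job"
    scontrol requeue $SLURM_JOB_ID
    exit 0
}
trap resubmit SIGUSR1

# run in the background and wait, so the trap can fire while training is still running
python main.py --checkpoint-dir "$PWD/checkpoint" &   # --checkpoint-dir is hypothetical
wait
```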