FAQ
How to Select the Node and Partition for My Job
In your script, you must specify the following parameters:
#!/bin/bash
#SBATCH --nodelist=diufrd204 # Specify the node to use
#SBATCH --partition=GPU # Choose the partition (CPU or GPU)
#SBATCH --gres=gpu:1 # Specify the number of GPUs required
Currently, we have two options for partitions: CPU or GPU.
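Putting these directives together, a complete submission script might look like the sketch below. The job name, output file, and workload line are placeholders; the node `diufrd204` and the GPU partition are taken from the example above.

```shell
#!/bin/bash
#SBATCH --job-name=demo          # Placeholder job name
#SBATCH --nodelist=diufrd204     # Specify the node to use
#SBATCH --partition=GPU          # Choose the partition (CPU or GPU)
#SBATCH --gres=gpu:1             # Specify the number of GPUs required
#SBATCH --output=demo_%j.out     # Output file (%j expands to the job ID)

# Placeholder workload: replace with your actual program.
python my_script.py
```

Submit the script with `sbatch job.sh` and check its state with `squeue -u <login>`.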
How Can I Accelerate the Execution of My Job?
If your job processes a large amount of data, performance may suffer from continuous data transfers between the master and the compute nodes. In particular, if your job frequently reads small chunks of data, execution can slow down significantly.
To improve performance, you can copy your data from /HOME/<login>/ to the local /tmp/ directory on the compute node, and then process the data directly from /tmp/. This reduces data transfer overhead and can accelerate execution.
Important: Don’t forget to delete your data from /tmp/ after the job completes to avoid unnecessarily occupying space on the compute node.
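This copy-process-clean-up pattern can be sketched in plain shell as follows. The data file and processing step are stand-ins; in a real job the copy line would be something like `cp -r /HOME/<login>/dataset "$SCRATCH"/`. The `trap` ensures the scratch directory is removed even if the job fails partway.

```shell
#!/bin/bash
set -euo pipefail

# Create a private scratch directory under /tmp on the compute node,
# and remove it automatically when the script exits (even on failure).
SCRATCH=$(mktemp -d /tmp/job.XXXXXX)
trap 'rm -rf "$SCRATCH"' EXIT

# Stand-in for copying input data from your home directory.
echo "sample input" > "$SCRATCH/data.txt"

# Process the data directly from the local scratch directory.
wc -l < "$SCRATCH/data.txt"
```

Using `mktemp -d` instead of a fixed path avoids collisions when several of your jobs land on the same node.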
Can I run my job directly on a compute node such as diufrd204?
Yes, but only for debugging purposes.
To fine-tune your jobs and scripts, you may run short tests directly on the compute node via SSH. However, in general, all scripts should be executed through the SLURM queue management system using sbatch or srun.
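As an alternative to SSH, you can ask SLURM itself for an interactive shell on a compute node, which keeps the test visible to the scheduler. A sketch, reusing the partition and GPU count from the example above:

```shell
# Request an interactive shell on the GPU partition for a quick test.
srun --partition=GPU --gres=gpu:1 --pty bash

# ...run your short test commands, then type `exit` to release the resources.
```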
Several of my jobs were cancelled unexpectedly after running for over 10 hours, with the message "CANCELLED+ by root with ExitCode 0:0" in the output file. Why?
My guess is that the Slurm master terminated these jobs because they requested more CPU or memory than specified. Each compute node has a limited number of CPU threads and a limited amount of memory, and it appears that at some point your jobs used more resources than you declared in your job parameters.
You can SSH into the node to monitor your script in real time (please set up your SSH key if you haven’t already). I recommend adjusting your job submission parameters accordingly.
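For example, if a job actually uses 8 threads and around 32 GB of RAM, the request should say so explicitly rather than rely on defaults. A sketch with assumed values (adjust them to what you observe while monitoring):

```shell
#!/bin/bash
#SBATCH --cpus-per-task=8   # Match the number of threads your program uses
#SBATCH --mem=32G           # Upper bound on the memory the job may use
```

While the job runs, `ssh diufrd204` followed by `top` shows the actual CPU and memory usage of your processes, which tells you what these values should be.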