Managing jobs

The lifecycle of a job can be managed with as few as three commands:

  1. Submit the job with sbatch <script_name>.
  2. Check the job status with squeue (to show only your own jobs, use squeue -u <user_name>).
  3. (optional) Delete the job with scancel <job_id>.
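The three steps above can be sketched as a small shell script. This is a minimal sketch, not a site-provided tool: the script name job.sh and the helper extract_job_id are hypothetical, and it assumes sbatch reports the new job as "Submitted batch job <job_id>". The Slurm commands only run if sbatch is found on the machine.

```shell
#!/usr/bin/env bash
# Sketch of the submit / check / delete cycle.
# job.sh and extract_job_id are hypothetical names used for illustration.

# Pull the numeric job ID out of sbatch's "Submitted batch job <job_id>" message.
extract_job_id() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Only talk to the scheduler when Slurm and the job script are available.
if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
    job_id=$(extract_job_id "$(sbatch job.sh)")   # 1. submit the job
    echo "submitted job $job_id"
    squeue -u "$USER"                             # 2. check the status of your jobs
    # scancel "$job_id"                           # 3. (optional) delete the job
fi
```

Keeping the job ID in a variable makes the optional scancel step a one-liner later in the same session.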

You can also delay the start of a job by placing a hold on it:

scontrol hold <job_id>
Put a hold on the job. A job on hold will not start, and it will not block other jobs from starting, until you release the hold.
scontrol release <job_id>
Release the hold on a job.
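If several jobs need to be held or released together, the two commands can be wrapped in a small helper. The function name set_hold and the DRY_RUN switch below are assumptions made for this sketch; the only Slurm calls used are scontrol hold and scontrol release, issued once per job ID.

```shell
# Hold or release several jobs, one scontrol call per job ID.
# set_hold and DRY_RUN are hypothetical names for this sketch.
# With DRY_RUN=1 the commands are printed instead of executed.
set_hold() {
    action=$1
    shift
    case "$action" in
        hold|release) ;;
        *)
            echo "usage: set_hold hold|release <job_id>..." >&2
            return 1
            ;;
    esac
    for job_id in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scontrol $action $job_id"
        else
            scontrol "$action" "$job_id"
        fi
    done
}
```

For example, with DRY_RUN=1 set, set_hold hold 101 102 prints the two scontrol commands it would run, which is a cheap way to check the job IDs first.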

Job status descriptions in squeue

When you run squeue (usually limiting the output with squeue -u <user_name>), you get a list of all jobs that are currently running or waiting to start. Most of the columns are self-explanatory, but the ST and NODELIST (REASON) columns can be confusing.

ST stands for state. The most important states are listed below. For a more comprehensive list, check the squeue help page section Job State Codes.

R
The job is running.
PD
The job is pending (i.e. waiting to run).
CG
The job is completing, meaning that it will be finished soon.
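The state column can also be read programmatically. The sketch below assumes the standard Slurm compact state codes R, PD and CG and uses squeue's -h (no header) and -o (output format) options with the %i (job ID) and %t (compact state) fields; the function name describe_state is hypothetical.

```shell
# Translate a compact squeue state code into a short description.
# describe_state is a hypothetical helper; only the three states
# discussed above are handled explicitly.
describe_state() {
    case "$1" in
        R)  echo "running" ;;
        PD) echo "pending" ;;
        CG) echo "completing" ;;
        *)  echo "other ($1)" ;;
    esac
}

# Print one line per job for the current user, if squeue is available.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -o '%i %t' | while read -r job_id state; do
        echo "$job_id: $(describe_state "$state")"
    done
fi
```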

The NODELIST (REASON) column shows the list of compute nodes the job is running on if the job is actually running. If the job is pending, the column gives the reason why it is still pending. The most important reasons are listed below. For a more comprehensive list, check the squeue help page section Job Reason Codes.

Priority
There is another pending job with higher priority.
Resources
The job has the highest priority, but is waiting for some running job to finish.
QOS limit (a reason code starting with QOS)
This should only happen if you run your job with --qos=devel. In developer mode you may only have one single job in the queue.
launch failed requeued held
Job launch failed for some reason. This is normally due to a faulty node. Please contact us, stating the problem, your user name, and the job ID(s).
Dependency
Job cannot start before some other job is finished. This should only happen if you started the job with --dependency=...
DependencyNeverSatisfied
Same as Dependency, but that other job failed. You must cancel the job with scancel <job_id>.
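To see which pending jobs need attention, the reasons can be listed and sorted into "wait" and "act" cases. This is a rough sketch under a few assumptions: the helper name needs_action is hypothetical, DependencyNeverSatisfied is taken to be the Slurm reason code for a failed dependency, and the launch-failure reason is matched loosely because its exact spelling in the output may vary. It uses squeue's -t PD (pending jobs only) option and the %r (reason) field.

```shell
# Suggest what to do about a pending reason.
# needs_action is a hypothetical helper for this sketch.
needs_action() {
    case "$1" in
        DependencyNeverSatisfied)    echo "cancel with scancel" ;;
        launch*failed*requeued*held) echo "report to support" ;;
        *)                           echo "wait" ;;
    esac
}

# List pending jobs of the current user with their reason, if squeue is available.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -t PD -o '%i %r' | while read -r job_id reason; do
        echo "$job_id: $reason -> $(needs_action "$reason")"
    done
fi
```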