The lifecycle of a job can be managed with as few as three commands:
- Submit the job with sbatch <jobscript>.
- Check the job status with squeue. (To limit the display to only your jobs, use squeue -u <user_name>.)
- (optional) Delete the job with scancel <job_id>.
You can also delay the start of a job by placing it on hold:
- scontrol hold <job_id>: Put a hold on the job. A job on hold will not start, and it will not block other jobs from starting, until you release the hold.
- scontrol release <job_id>: Release the hold on a job.
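As a minimal sketch of how these commands fit together in a script: sbatch normally confirms a submission with a line of the form "Submitted batch job <job_id>", and the job ID can be picked out of that line for later use. The helper name extract_job_id and the script name job.sh are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: extract the numeric job ID from sbatch's usual
# confirmation line, e.g. "Submitted batch job 123456".
extract_job_id() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Typical use on a cluster (commented out: needs Slurm and a job script;
# job.sh is a placeholder name):
#   job_id=$(extract_job_id "$(sbatch job.sh)")
#   squeue -j "$job_id"             # check the status of this job
#   scontrol hold "$job_id"         # keep the job from starting
#   scontrol release "$job_id"      # let it start again
#   scancel "$job_id"               # optional: delete the job
```

If your Slurm version supports it, sbatch --parsable prints the job ID directly and makes the parsing step unnecessary.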
Job status descriptions in squeue
When you run squeue (probably limiting the output with squeue -u <user_name>), you will get a list of all jobs currently running or waiting to start. Most of the columns should be self-explanatory, but the ST and NODELIST (REASON) columns can be confusing.
ST stands for state. The most important states are listed below. For a more comprehensive list, check the squeue help page section Job State Codes.
- R: The job is running.
- PD: The job is pending (i.e. waiting to run).
- CG: The job is completing, meaning that it will be finished soon.
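The three states above correspond to the standard Slurm compact state codes R, PD, and CG. A small sketch of how a script might translate them follows; the function name describe_state is hypothetical, and any other code is simply passed through as "other".

```shell
#!/bin/sh
# Hypothetical helper: map a compact squeue state code (ST column)
# to a short description. Only the three most common codes are handled.
describe_state() {
    case "$1" in
        R)  echo "running" ;;
        PD) echo "pending" ;;
        CG) echo "completing" ;;
        *)  echo "other (see Job State Codes)" ;;
    esac
}

# Possible use on a cluster (commented out: needs Slurm).
# -h drops the header, %i is the job ID, %t the compact state:
#   squeue -h -u "$USER" -o "%i %t" | while read -r id st; do
#       echo "$id: $(describe_state "$st")"
#   done
```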
The column NODELIST (REASON) will show you a list of computing nodes the job is running on if the job is actually running. If the job is pending, the column will give you a reason why it is still pending. The most important reasons are listed below. For a more comprehensive list, check the squeue help page section Job Reason Codes.
- Priority: There is another pending job with higher priority.
- Resources: The job has the highest priority, but is waiting for some running job to finish.
- This should only happen if you run your job with --qos=devel. In developer mode you may only have one single job in the queue.
- launch failed requeued held: Job launch failed for some reason. This is normally due to a faulty node. Please contact us via email@example.com stating the problem, your user name, and the job ID(s).
- Dependency: The job cannot start before some other job is finished. This should only happen if you started the job with the --dependency option.
- DependencyNeverSatisfied: Same as Dependency, but that other job failed. You must cancel the job with scancel <job_id>.
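The reasons above can be turned into short hints with a small filter. This is only a sketch: the function name explain_pending is hypothetical, and the reason names it matches (Priority, Resources, Dependency, DependencyNeverSatisfied) are standard Slurm reason codes assumed to correspond to the descriptions in the list. It expects one job per line in the form "<job_id> <state> <reason or node list>".

```shell
#!/bin/sh
# Hypothetical helper: read "<job_id> <state> <reason-or-nodelist>" lines
# from stdin and print a hint for every pending (PD) job.
explain_pending() {
    while read -r job_id state reason; do
        [ "$state" = "PD" ] || continue
        # squeue prints the pending reason in parentheses, e.g. "(Priority)".
        reason=${reason#\(}
        reason=${reason%\)}
        case "$reason" in
            Priority)   hint="waiting behind higher-priority jobs" ;;
            Resources)  hint="next in line, waiting for resources" ;;
            Dependency) hint="waiting for another job to finish" ;;
            DependencyNeverSatisfied)
                        hint="dependency failed; cancel with scancel $job_id" ;;
            *)          hint="see Job Reason Codes in the squeue help page" ;;
        esac
        printf '%s: %s\n' "$job_id" "$hint"
    done
}

# Possible use on a cluster (commented out: needs Slurm).
# %i is the job ID, %t the compact state, %R the reason or node list:
#   squeue -h -u "$USER" -o "%i %t %R" | explain_pending
```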