This page collects Slurm tips in no particular order. If much more information is added, it should be reorganized.

SLURM auto-completion tool

For anyone who uses Slurm often, the following utility is really helpful: https://github.com/SchedMD/slurm/tree/master/contribs/slurm_completion_help

See what jobs are in the queue

squeue
squeue -u <username>
squeue --reservation <reservation_name>
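
To get a quick overview instead of a long job list, the squeue output can be summarized per job state. The following is a minimal sketch, not an official Slurm tool: count_job_states is a hypothetical helper name, and it assumes its input has one job state per line.

```shell
# Hypothetical helper (not part of Slurm): count jobs per state.
# Reads one job state per line on stdin, as printed by
#   squeue -h -u <username> -o "%T"
# (-h drops the header line, %T prints the job state in long form).
count_job_states() {
    # uniq -c prints "<count> <state>"; swap the columns to "<state> <count>".
    sort | uniq -c | awk '{ print $2, $1 }'
}
```

Typical use would be: squeue -h -u <username> -o "%T" | count_job_states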

Detailed information about a specific running or recent job

scontrol show jobid -dd <jobID>

(scontrol does not show information about jobs that completed more than a few minutes ago.)


For detailed reporting and statistics on jobs, use sacct. For example, to list all jobs for user <USER> that started after a given start time (e.g. 2024-06-15):

export FMT="reservation,jobid,jobname,User,reqcpus,ntasks,reqmem,averss,maxrss,elapsed,state%20,exitcode,Submit,Start,End,Account%17,Partition,AveCpu,NodeList%30"
sacct --format="${FMT}" --units=M -u <USER> --starttime 2024-06-15
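
If this query is run often, it can be wrapped in a small shell function. The sketch below is only an illustration: the function name jobs_since, the shortened field list, and the SACCT override variable are assumptions made here, not part of Slurm.

```shell
# Hypothetical wrapper: list jobs of one user since a given date,
# with memory values reported in megabytes.
#   $1 = user name, $2 = start time (e.g. 2024-06-15)
jobs_since() {
    # SACCT is an assumed override hook: set SACCT=echo to preview
    # the command line without contacting the accounting database.
    "${SACCT:-sacct}" \
        --format="jobid,jobname,reqcpus,reqmem,maxrss,elapsed,state%20,exitcode,start,end" \
        --units=M -u "$1" --starttime "$2"
}
```

Typical use would be: jobs_since <USER> 2024-06-15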

or for specific job(s) and/or account(s) using additional format options

sacct -a -j <JOBID> -A <ACCOUNT> -o JobID,JobName,Partition,Account%18,AllocCPUS,Nodelist%24,NNodes,start,elapsed,workdir%60,submitline%160

or for specific user and account with some additional format options to compare runtimes ("elapsed") and resources between similar jobs

sacct -a -u <USER> -A <ACCOUNT> -o JobID,JobName,Partition,Account%18,AllocCPUS,Nodelist%24,NNodes,AveRSS,MaxRSS,AveDiskRead,start,elapsed,submitline%160
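
When comparing runtimes between similar jobs, it can help to turn the "elapsed" column into plain seconds so that the values sort numerically. The sketch below assumes the elapsed time is printed as [DD-]HH:MM:SS (or MM:SS); elapsed_to_seconds is a hypothetical helper name. The ElapsedRaw field in the format list further down is an alternative that avoids the conversion.

```shell
# Hypothetical helper: convert an elapsed time such as "1-02:03:04"
# or "00:10:30" into seconds.
elapsed_to_seconds() {
    echo "$1" | awk '{
        days = 0; t = $1
        # Optional leading "DD-" part.
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        # Remaining part is HH:MM:SS, MM:SS or SS.
        n = split(t, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else             secs = p[1]
        print days * 86400 + secs
    }'
}
```

Typical use would be: elapsed_to_seconds 00:10:30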

Show all format options with `-e`

$ sacct -e
Account             AdminComment        AllocCPUS           AllocNodes         
AllocTRES           AssocID             AveCPU              AveCPUFreq         
AveDiskRead         AveDiskWrite        AvePages            AveRSS             
AveVMSize           BlockID             Cluster             Comment            
Constraints         Container           ConsumedEnergy      ConsumedEnergyRaw  
CPUTime             CPUTimeRAW          DBIndex             DerivedExitCode    
Elapsed             ElapsedRaw          Eligible            End                
ExitCode            Flags               GID                 Group              
JobID               JobIDRaw            JobName             Layout             
MaxDiskRead         MaxDiskReadNode     MaxDiskReadTask     MaxDiskWrite       
MaxDiskWriteNode    MaxDiskWriteTask    MaxPages            MaxPagesNode       
MaxPagesTask        MaxRSS              MaxRSSNode          MaxRSSTask         
MaxVMSize           MaxVMSizeNode       MaxVMSizeTask       McsLabel           
MinCPU              MinCPUNode          MinCPUTask          NCPUS              
NNodes              NodeList            NTasks              Priority           
Partition           QOS                 QOSRAW              Reason             
ReqCPUFreq          ReqCPUFreqMin       ReqCPUFreqMax       ReqCPUFreqGov      
ReqCPUS             ReqMem              ReqNodes            ReqTRES            
Reservation         ReservationId       Reserved            ResvCPU            
ResvCPURAW          Start               State               Submit             
SubmitLine          Suspended           SystemCPU           SystemComment      
Timelimit           TimelimitRaw        TotalCPU            TRESUsageInAve     
TRESUsageInMax      TRESUsageInMaxNode  TRESUsageInMaxTask  TRESUsageInMin     
TRESUsageInMinNode  TRESUsageInMinTask  TRESUsageInTot      TRESUsageOutAve    
TRESUsageOutMax     TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin    
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot     UID                
User                UserCPU             WCKey               WCKeyID            
WorkDir            


Get information about current reservations

scontrol show res

User and experiment accounts' associations

sacctmgr show associations users=espov format=cluster,account%25,partition # list the accounts that the user belongs to. %25 widens the column so that the full account name is displayed.
sacctmgr list associations -p account=lcls:xpp1234 format=user,account%25,partition # list the associations for account lcls:xpp1234

The "format" argument can be modified to see more details. Remove it to see all (can be messy).

Partition and node information

sinfo is used to view partition and node information for a system running Slurm. 

Examples

sinfo -o "%C" -n sdfmilan[021-022,040,202-204,210-213,226,232]  
CPUS(A/I/O/T)
991/545/0/1536


(%C shows "allocated/idle/other/total".) So 991 of the 1536 cores are currently allocated. With -o "%n %C" one gets the usage per node:

sinfo -o "%n %C"  -n sdfmilan[021-022,040,202-204,210-213,226,232]
HOSTNAMES CPUS(A/I/O/T)
sdfmilan021 120/8/0/128
sdfmilan022 45/83/0/128
sdfmilan040 8/120/0/128
sdfmilan202 116/12/0/128
sdfmilan203 120/8/0/128
sdfmilan204 120/8/0/128
sdfmilan210 120/8/0/128
sdfmilan211 113/15/0/128
sdfmilan212 105/23/0/128
sdfmilan213 104/24/0/128
sdfmilan226 9/119/0/128
sdfmilan232 7/121/0/128
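
The per-node listing above can be filtered to find the nodes with the most free cores. The following is a minimal sketch under the assumption that the input has the two-column "%n %C" layout shown above; idle_nodes is a hypothetical helper name, not a Slurm command.

```shell
# Hypothetical helper: print nodes with at least $1 idle cores,
# most idle first. Reads the output of
#   sinfo -o "%n %C" -n <nodelist>
# on stdin.
idle_nodes() {
    awk -v min="${1:-1}" '
        $1 == "HOSTNAMES" { next }      # skip the header line
        {
            # c[1]=allocated, c[2]=idle, c[3]=other, c[4]=total
            split($2, c, "/")
            if (c[2] + 0 >= min + 0) print c[2], $1
        }
    ' | sort -rn | awk '{ print $2, $1 }'
}
```

Typical use would be: sinfo -o "%n %C" -n <nodelist> | idle_nodes 100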

Priorities

Show priorities for an account: sacctmgr list associations -p accounts=<accounts>

Show priority level for a job: sprio -j <jobID>

Show priority coefficients: sacctmgr show qos format=name,priority,usagefactor
