For anyone who uses Slurm often, the following utilities are really helpful: https://github.com/SchedMD/slurm/tree/master/contribs/slurm_completion_help
squeue
squeue -u <username>
squeue --reservation <reservation_name>
scontrol show jobid -dd <jobID>
(scontrol does not show information about jobs that completed more than a few minutes ago)
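As a rough sketch of how the squeue output can be summarized, the helper below counts jobs per state. The function name count_states is made up for this page, and the sample states stand in for real output. On a cluster the input would come from squeue -h -u <username> -o "%T", where -h suppresses the header and %T prints the job state.

```shell
# Count how many jobs are in each state.
# Input: one job state per line, e.g. from
#   squeue -h -u <username> -o "%T"
count_states() {
    # Tally each state, then sort so the output order is stable.
    awk '{ n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Made-up sample states standing in for real squeue output.
printf '%s\n' RUNNING PENDING RUNNING RUNNING | count_states
```

With the sample input this prints one line per state with its count, which is usually enough to see at a glance how much of a workload is still waiting.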
For detailed reporting and statistics on jobs, use sacct. For example, to get all jobs for user <USER> that started after a given start time (here 2024-06-15):
export FMT="reservation,jobid,jobname,User,reqcpus,ntasks,reqmem,averss,maxrss,elapsed,state%20,exitcode,Submit,Start,End,Account%17,Partition,AveCpu,NodeList%30"
sacct --format="${FMT}" --units=M -u <USER> --starttime 2024-06-15
or for specific job(s) and/or account(s) using additional format options
sacct -a -j <JOBID> -A <ACCOUNT> -o JobID,JobName,Partition,Account%18,AllocCPUS,Nodelist%24,NNodes,start,elapsed,workdir%60,submitline%160
or for specific user and account with some additional format options to compare runtimes ("elapsed") and resources between similar jobs
sacct -a -u <USER> -A <ACCOUNT> -o JobID,JobName,Partition,Account%18,AllocCPUS,Nodelist%24,NNodes,AveRSS,MaxRSS,AveDiskRead,start,elapsed,submitline%160
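To compare runtimes between similar jobs, it can help to sort on the ElapsedRaw field (elapsed time in seconds, listed by sacct -e) rather than reading the formatted "elapsed" column. The following is a minimal sketch, not a polished tool: the function name rank_by_elapsed and the sample records are made up, and it assumes the input was produced with -n (no header), -P (fields separated by "|") and -X (one line per job allocation, no steps).

```shell
# Rank job records by runtime, longest first.
# Input: lines of "JobID|JobName|ElapsedRaw", e.g. from
#   sacct -n -P -X -u <USER> -A <ACCOUNT> -o JobID,JobName,ElapsedRaw
rank_by_elapsed() {
    # Field 3 (ElapsedRaw, in seconds) is sorted numerically, descending.
    sort -t'|' -k3,3nr
}

# Made-up sample records standing in for real sacct output.
printf '%s\n' \
    '1001|align|3600' \
    '1002|align|5400' \
    '1003|align|1800' | rank_by_elapsed
```

The longest-running job comes out first, so piping the result through head -n 5 would give the five slowest jobs.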
Show all format options with `-e`
$ sacct -e
Account AdminComment AllocCPUS AllocNodes
AllocTRES AssocID AveCPU AveCPUFreq
AveDiskRead AveDiskWrite AvePages AveRSS
AveVMSize BlockID Cluster Comment
Constraints Container ConsumedEnergy ConsumedEnergyRaw
CPUTime CPUTimeRAW DBIndex DerivedExitCode
Elapsed ElapsedRaw Eligible End
ExitCode Flags GID Group
JobID JobIDRaw JobName Layout
MaxDiskRead MaxDiskReadNode MaxDiskReadTask MaxDiskWrite
MaxDiskWriteNode MaxDiskWriteTask MaxPages MaxPagesNode
MaxPagesTask MaxRSS MaxRSSNode MaxRSSTask
MaxVMSize MaxVMSizeNode MaxVMSizeTask McsLabel
MinCPU MinCPUNode MinCPUTask NCPUS
NNodes NodeList NTasks Priority
Partition QOS QOSRAW Reason
ReqCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov
ReqCPUS ReqMem ReqNodes ReqTRES
Reservation ReservationId Reserved ResvCPU
ResvCPURAW Start State Submit
SubmitLine Suspended SystemCPU SystemComment
Timelimit TimelimitRaw TotalCPU TRESUsageInAve
TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin
TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve
TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot UID
User UserCPU WCKey WCKeyID
WorkDir
Show reservations: scontrol show res
sacctmgr show associations users=espov format=cluster,account%25,partition   # list the accounts that the user belongs to; %25 widens the column so that the full account name is displayed
sacctmgr list associations -p account=lcls:xpp1234 format=user,account%25,partition   # list the users and accounts associated with xpp1234
The "format" argument can be modified to see more details. Remove it to see all (can be messy).
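When only the account names are needed, the association records can be reduced to a unique list. The snippet below is a sketch under a few assumptions: the helper name unique_accounts is made up, the input uses sacctmgr's -n (no header) and -P ("|"-separated, no trailing delimiter) options, and the cluster, partition and second account name in the sample are placeholders.

```shell
# Print the distinct account names found in association records.
# Input: lines of "Cluster|Account|Partition", e.g. from
#   sacctmgr -n -P show associations users=<username> format=cluster,account,partition
unique_accounts() {
    # Field 2 is the account; sort -u drops duplicates.
    cut -d'|' -f2 | sort -u
}

# Made-up sample records standing in for real sacctmgr output.
printf '%s\n' \
    'mycluster|lcls:xpp1234|part1' \
    'mycluster|lcls:default|part1' \
    'mycluster|lcls:xpp1234|part2' | unique_accounts
```

Because the fields are split on "|", the %25 width modifier is not needed here; it only matters for the human-readable table layout.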
sinfo is used to view partition and node information for a system running Slurm.
Examples
(%C shows CPU counts as "allocated/idle/other/total", so the first number is the number of cores in use, 991 in this example.) With -o "%n %C", sinfo reports the usage per node.
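The per-node "allocated/idle/other/total" values can also be added up to get a cluster-wide figure. This is a minimal sketch: the function name sum_allocated and the node names and counts in the sample are made up, and it assumes each node appears on exactly one input line in the form produced by sinfo -h -o "%n %C" (-h suppresses the header).

```shell
# Sum allocated and total CPUs over per-node sinfo records.
# Input: lines of "<hostname> <allocated/idle/other/total>", e.g. from
#   sinfo -h -o "%n %C"
sum_allocated() {
    awk '{
        split($2, c, "/")   # c[1]=allocated, c[2]=idle, c[3]=other, c[4]=total
        alloc += c[1]
        total += c[4]
    } END { printf "%d/%d CPUs allocated\n", alloc, total }'
}

# Made-up sample records standing in for real sinfo output.
printf '%s\n' \
    'node001 64/0/0/64' \
    'node002 16/48/0/64' \
    'node003 0/64/0/64' | sum_allocated
```

If a node shows up more than once in the real output, the totals would be inflated, so it is worth checking the raw listing first.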
Show priorities for an account: sacctmgr list associations -p accounts=<accounts>
Show priority level for a job: sprio -j <jobID>
Show priority coefficients: sacctmgr show qos format=name,priority,usagefactor