...

A Partition is a logical grouping of compute servers. These may be servers of a similar technical specification (e.g. Cascade Lake CPUs, Tesla GPUs, etc.), or servers grouped by ownership - e.g. the SUNCAT group may have purchased a number of servers, which are then placed into their own Partition.

Generally, all servers will be placed in the shared Partition, which everyone with a SLAC computer account has access to (although at a low priority).

Users should contact their Coordinators to be added to the appropriate group Partitions in order to get priority access to resources.

You can view the active Partitions on SDF with:

Code Block
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
shared*      up 57-00:00:00    21   unk* cryoem-gpu[02,04-09,11-15],ml-gpu[02-10]
shared*      up 57-00:00:00    10   idle cryoem-gpu[01,03,10,50],hep-gpu01,ml-gpu[01,11],nu-gpu[01-03]
ml           up   infinite      9   unk* ml-gpu[02-10]
ml           up   infinite      2   idle ml-gpu[01,11]
neutrino     up   infinite      3   idle nu-gpu[01-03]
cryoem       up   infinite     12   unk* cryoem-gpu[02,04-09,11-15]
cryoem       up   infinite      4   idle cryoem-gpu[01,03,10,50]

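If you are only interested in a single Partition, sinfo can be restricted to it with the -p/--partition option (the partition name below is just one of those listed above):

Code Block
# list only the nodes belonging to a single Partition
$ sinfo --partition=shared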

What is a Slurm Allocation?

...

Note that when you 'exit' the interactive session, it will relinquish the resources for someone else to use. This also means that if your terminal is disconnected (you turn your laptop off, lose network connectivity, etc.), then the Job will also terminate (similar to ssh).

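As a bare-bones sketch (options such as partition, CPUs, and memory are omitted here for brevity), an interactive session is started and released like this:

Code Block
# request an interactive shell on an allocated node
$ srun --pty /bin/bash
# ... work on the allocated node ...
# exiting the shell (or losing the connection) ends the Job and releases the resources
$ exit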

Warning

If your interactive request doesn't immediately find resources, it currently will not return you a pty, even though the job does run. This results in what looks like a hanging process. We are investigating.

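In the meantime, you can confirm from a second terminal that the Job really is running:

Code Block
# list your own Jobs and their states (R = running, PD = pending)
$ squeue -u $USER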

How do I submit a Batch Job?

...

Partition Name   Purpose                                                                     Contact
shared           General resources; this contains all shareable resources, including GPUs   Yee / Daniel
cryoem           CryoEM GPU servers                                                          Yee
neutrino         Neutrino GPU servers                                                        Kazu
suncat
hps
fermi

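As a rough sketch (the partition name, resource values, and script contents are all illustrative), a batch script targeting one of the Partitions above might look like the following, submitted with sbatch <scriptname>:

Code Block
#!/bin/bash
#SBATCH --partition=shared       # one of the Partitions listed above
#SBATCH --job-name=my-job        # illustrative job name
#SBATCH --output=my-job-%j.out   # %j expands to the Job ID
#SBATCH --ntasks=1               # number of tasks
#SBATCH --mem=4g                 # memory for the Job
#SBATCH --time=01:00:00          # upper bound on run time

# the actual work goes here
hostname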

Help! My Job takes a long time before it starts!

This is often due to limited resources. The simplest remedy is to request fewer CPUs or nodes (-n/-N) or less memory for your Job. However, this will also likely increase the time the Job needs to complete. Note that perfect scaling (i.e. 16 CPUs running twice as fast as 8 CPUs) is rarely achieved, so it may be beneficial to submit many smaller Jobs where possible. You can also set the --time option to specify that your Job will only run up to that amount of time, so that the scheduler can fit it in more easily.

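For example (the values and script name are only placeholders), constraining the request often lets the scheduler start the Job sooner:

Code Block
# ask for fewer CPUs, less memory, and a bounded run time so the Job is easier to schedule
$ sbatch --ntasks=4 --mem=8g --time=02:00:00 my-job.sh
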
The more expensive option is to buy more hardware for SDF and have it added to your group/team's Partition.

...

Code Block
$ module load slurm
$ scontrol show node ml-gpu01
NodeName=ml-gpu01 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=48 CPULoad=1.41
   AvailableFeatures=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
   ActiveFeatures=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
   Gres=gpu:geforce_rtx_2080_ti:10(S:0)
   NodeAddr=ml-gpu01 NodeHostName=ml-gpu01 Version=19.05.2
   OS=Linux 3.10.0-1062.4.1.el7.x86_64 #1 SMP Fri Oct 18 17:15:30 UTC 2019
   RealMemory=191552 AllocMem=0 FreeMem=182473 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu
   BootTime=2019-11-12T11:18:04 SlurmdStartTime=2019-12-06T16:42:16
   CfgTRES=cpu=48,mem=191552M,billing=48,gres/gpu=10
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

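The Gres line above shows what can be requested via the --gres option; for example (the GPU count and script name are placeholders):

Code Block
# request two of the node's GPUs for a batch Job
$ sbatch --gres=gpu:2 my-job.sh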
...