...
A Partition is a logical grouping of compute servers. Servers may be grouped by similar technical specification (eg Cascade Lake CPUs, Tesla GPUs etc), or by ownership - eg if the SUNCAT group has purchased a number of servers, we may place them all into their own Partition.
Generally, all servers will also be placed in the shared partition, which everyone with a SLAC computer account will have access to (although at a low priority).
Users should contact their Coordinators to be added to the appropriate group Partitions for priority access to resources.
You can view the active Partitions on SDF with
Code Block |
---|
$ sinfo
PARTITION  AVAIL    TIMELIMIT  NODES  STATE  NODELIST
shared*       up  57-00:00:00     21  down*  cryoem-gpu[02,04-09,11-15],ml-gpu[02-10]
shared*       up  57-00:00:00      1  idle*  nu-gpu02
shared*       up  57-00:00:00      9   idle  cryoem-gpu[01,03,10,50],hep-gpu01,ml-gpu[01,11],nu-gpu[01,03]
neutrino      up     infinite      1  idle*  nu-gpu02
neutrino      up     infinite      2   idle  nu-gpu[01,03]
ml            up     infinite      9  down*  ml-gpu[02-10]
ml            up     infinite      2   idle  ml-gpu[01,11]
cryoem        up     infinite     12  down*  cryoem-gpu[02,04-09,11-15]
cryoem        up     infinite      4   idle  cryoem-gpu[01,03,10,50] |
What is a Slurm Allocation?
...
Note that when you 'exit' the interactive session, it will relinquish the resources for someone else to use. This also means that if your terminal is disconnected (you turn your laptop off, lose network etc), then the Job will also terminate (similar to ssh).
Warning |
---|
If your interactive request cannot immediately find resources, it will currently not return you a pty - even though the job does run. This results in what looks like a hanging process. We are investigating. |
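As a sketch, an interactive session of the kind described above can be requested with srun (the partition name and resource sizes here are illustrative - substitute ones you have access to):

```shell
# Request 4 CPUs and 8GB of memory for an interactive shell on the
# shared partition; typing 'exit' (or losing your connection) ends
# the job and releases the resources for someone else.
srun --partition=shared --cpus-per-task=4 --mem=8g --pty /bin/bash
```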
How do I submit a Batch Job?
...
Partition Name | Purpose | Contact |
---|---|---|
gpushared | General GPU resources; contains all shareable resources, including GPUs | Yee / Daniel |
cryoem | CryoEM GPU servers | Yee |
neutrino | Neutrino GPU servers | Kazu |
suncat | ||
hps | ||
fermi | ||
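Once your Coordinator has added you to a group Partition, you can submit against it directly for priority access; a minimal sketch (the partition and script names are illustrative):

```shell
# Submit to your group's partition instead of the default shared
# partition (replace 'neutrino' with a partition you belong to,
# and 'myjob.sh' with your batch script).
sbatch --partition=neutrino myjob.sh
```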
Help! My Job takes a long time before it starts!
This is often due to limited resources. The simplest remedy is to request fewer CPUs (-n) or less memory (--mem) for your Job. However, this will also likely increase the amount of time that you need for the Job to complete. Note that perfect scaling is often very difficult to achieve (ie using 16 CPUs rarely runs twice as fast as using 8), so it may be beneficial to submit many smaller Jobs where possible. You can also set the --time option to specify that your job will only run up to that amount of time, so that the scheduler can better fit your job in.
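A minimal batch-script sketch using the options above (the partition name and resource sizes are illustrative, not recommendations):

```shell
#!/bin/bash
# Hypothetical example: a modest resource request plus an explicit
# time limit lets the scheduler backfill the job into gaps sooner.
#SBATCH --partition=shared
#SBATCH --ntasks=1          # -n: number of tasks (CPUs)
#SBATCH --mem=4g            # total memory for the job
#SBATCH --time=02:00:00     # hard runtime limit of 2 hours

echo "running on $(hostname)"
```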
The more expensive option is to buy more hardware for SDF and have it added to your group's Partition.
...
Code Block |
---|
$ module load slurm
$ scontrol show node ml-gpu01
NodeName=ml-gpu01 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=48 CPULoad=1.41
   AvailableFeatures=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
   ActiveFeatures=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
   Gres=gpu:geforce_rtx_2080_ti:10(S:0)
   NodeAddr=ml-gpu01 NodeHostName=ml-gpu01 Version=19.05.2
   OS=Linux 3.10.0-1062.4.1.el7.x86_64 #1 SMP Fri Oct 18 17:15:30 UTC 2019
   RealMemory=191552 AllocMem=0 FreeMem=182473 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu
   BootTime=2019-11-12T11:18:04 SlurmdStartTime=2019-12-06T16:42:16
   CfgTRES=cpu=48,mem=191552M,billing=48,gres/gpu=10
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s |
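The feature tags shown in the scontrol output can be used to land on specific hardware. A hedged sketch, assuming the feature strings are configured as shown above and that `myjob.sh` is your batch script:

```shell
# Request one GPU on a node advertising the RTX 2080 Ti feature tag
# (constraint value copied from the AvailableFeatures line above).
sbatch --constraint=GPU_SKU:RTX2080TI --gres=gpu:1 myjob.sh
```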
...