Slurm manages compute resources (CPUs, GPUs).
The first person to talk to is Ric.
In principle we have our own customizable Slurm installation on drp-srcf-*, but it might still need work/tweaking from IT.
Can we extend Slurm to manage our resources, whose types are determined by kcu firmware? e.g. (see the sketch after this list):
- rix timing system fiber connection
- cameralink node
- high-rate timing nodes
- low-rate timing nodes (epics)
- hsd nodes
- generic (wave8) nodes
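One way Slurm could model these types is with node Features (or GRES) in slurm.conf, which jobs then request with --constraint. A minimal sketch, with hypothetical hostnames and feature labels (not our real layout):

    # slurm.conf -- tag nodes with free-form feature strings (illustrative)
    NodeName=cmp028        Feature=timing_hi
    NodeName=cmp[030-033]  Feature=camlink
    NodeName=cmp[040-043]  Feature=hsd

    # ask for two cameralink nodes:
    sbatch --constraint=camlink --nodes=2 job.sh

Features are arbitrary strings, so encoding the kcu-firmware-determined node type this way needs no Slurm plugin; GRES is the heavier option if we ever need per-node counts of a resource.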
Conceptually want:
- "sbatch tmo.cnf" (instead of "procmgr start tmo.cnf")
- tmo.cnf has (see the sketch after this list):
- typical timing system on cmp028: "drp -P tmo -D ts ..."
- typical control.py on mon001: "control -P tmo ..."
- special localhost: "control_gui -P tmo -B DAQ:NEH" (unusual because it runs on the localhost and has a GUI; with procmgr, only localhost processes have GUIs)
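A sketch of what such a tmo.cnf could contain, loosely following procmgr's Python cnf style (the keys and layout here are illustrative assumptions, not the real procmgr schema):

    # tmo.cnf -- hypothetical slurm-oriented layout
    procmgr_config = [
        # timing system on a fixed node
        {'host': 'cmp028',    'id': 'timing',      'cmd': 'drp -P tmo -D ts ...'},
        # run control
        {'host': 'mon001',    'id': 'control',     'cmd': 'control -P tmo ...'},
        # the GUI is special: it must run on the submitting host
        {'host': 'localhost', 'id': 'control_gui', 'cmd': 'control_gui -P tmo -B DAQ:NEH'},
    ]

A Slurm front end would turn every non-localhost entry into a batch job and exec the localhost GUI entries directly, as procmgr does today.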
- first short-term approach (if we could do this we might be able to replace procmgr):
- use the resources in ~cpo/git/lcls2/psdaq/psdaq/cnf/slurm.cnf: GUIs on cmp036 ("localhost"), timing system and other processes on cmp035
- keep node allocations hardwired (like existing .cnf)
- run control/timing processes
- need a replacement for procstat (could start with a command-line version, like the sketch after this list, and do a GUI later)
- if one process crashes, we don't want the whole job to exit
- different daq processes need different environments
- would like a Python interface for users, not a bash one
- need per-process log files
- issue to think about: how do we remember the previously selected detectors? (could we use "sacct"? maybe we could add info to sacct?)
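To make the short-term list concrete, here is a minimal Python sketch (assumptions: the hypothetical cnf layout above; log paths are illustrative; the sbatch/squeue flags used are standard Slurm). Submitting each process as its own job gives crash isolation, a per-process log via --output, and a per-process environment via --export; status() is a starting point for a command-line procstat replacement:

    import subprocess

    def submit(entry):
        """Launch one cnf entry as an independent slurm job; returns the job id."""
        env = entry.get('env', '')            # e.g. 'EPICS_PVA_ADDR_LIST=...'
        cmd = ['sbatch',
               '--job-name=' + entry['id'],
               '--nodelist=' + entry['host'],              # hardwired, like the .cnf
               '--output=/tmp/' + entry['id'] + '-%j.log', # per-process log file
               '--export=ALL' + (',' + env if env else ''),
               '--wrap=' + entry['cmd']]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return out.stdout.split()[-1]         # 'Submitted batch job <id>' -> '<id>'

    def status(jobids):
        """Command-line procstat stand-in: one squeue line per daq process."""
        out = subprocess.run(['squeue', '--jobs=' + ','.join(jobids),
                              '--format=%i %j %T %N %M'],
                             capture_output=True, text=True, check=True)
        print(out.stdout)

On the detector-selection question: one option (assuming slurmdbd is configured to store job comments, AccountingStoreFlags=job_comment) would be to stash the selection in sbatch --comment and read it back later with sacct.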
- long-term, ideally Slurm would:
- understand which nodes are camlink nodes ("resource management")
- dynamically allocate requested types of nodes
- (hard, do as a second step?) change the BOS connections so the right detectors are connected to the allocated nodes (see the sketch below)
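For the dynamic-allocation step, jobs would request counts of node types instead of hostnames (e.g. sbatch --constraint=hsd --nodes=3, using features like those sketched earlier) and then discover which nodes they actually got, which is exactly the information a BOS-reconfiguration step would need. A minimal sketch:

    import os, subprocess

    def allocated_hosts():
        """Inside a slurm job: expand $SLURM_JOB_NODELIST into hostnames."""
        out = subprocess.run(
            ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']],
            capture_output=True, text=True, check=True)
        return out.stdout.split()   # e.g. ['cmp040', 'cmp041', 'cmp042']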