Slurm manages compute resources (CPUs, GPUs).
The first person to talk to is Ric.
In principle we have our own customizable Slurm installation on drp-srcf-*, but it might still need work/tweaking from IT.
Can we extend Slurm to manage our resources, whose types are determined by kcu firmware? e.g. (see the sketch after this list):
- rix timing system fiber connection
- cameralink node
- high-rate timing nodes
- low-rate timing nodes (epics)
- hsd nodes
- generic (wave8) nodes
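One way Slurm could model these types is with node Features (or GRES) in slurm.conf, which jobs then request with --constraint. A minimal sketch, with hypothetical hostnames and feature labels (not our real layout):

    # slurm.conf -- tag nodes with free-form feature strings (illustrative)
    NodeName=cmp028        Feature=timing_hi
    NodeName=cmp[030-033]  Feature=camlink
    NodeName=cmp[040-043]  Feature=hsd

    # ask for two cameralink nodes:
    sbatch --constraint=camlink --nodes=2 job.sh

Features are arbitrary strings, so encoding the kcu-firmware-determined node type this way needs no Slurm plugin; GRES is the heavier option if we ever need per-node counts of a resource.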
Conceptually want:
- "sbatch tmo.cnf" (instead of "procmgr start tmo.cnf")
- tmo.cnf has (see the sketch after this list):
- typical timing system on cmp028: "drp -P tmo -D ts ..."
- typical control.py on mon001: "control -P tmo ..."
- special localhost: "control_gui -P tmo -B DAQ:NEH" (unusual because it runs on the localhost and has a GUI; with procmgr, only localhost processes have GUIs)
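A sketch of what such a tmo.cnf could contain, loosely following procmgr's Python cnf style (the keys and layout here are illustrative assumptions, not the real procmgr schema):

    # tmo.cnf -- hypothetical slurm-oriented layout
    procmgr_config = [
        # timing system on a fixed node
        {'host': 'cmp028',    'id': 'timing',      'cmd': 'drp -P tmo -D ts ...'},
        # run control
        {'host': 'mon001',    'id': 'control',     'cmd': 'control -P tmo ...'},
        # the GUI is special: it must run on the submitting host
        {'host': 'localhost', 'id': 'control_gui', 'cmd': 'control_gui -P tmo -B DAQ:NEH'},
    ]

A Slurm front end would turn every non-localhost entry into a batch job and exec the localhost GUI entries directly, as procmgr does today.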
- first short-term approach (if we could do this we might be able to replace procmgr):
- use the resources in ~cpo/git/lcls2/psdaq/psdaq/cnf/slurm.cnf: GUIs on cmp036 ("localhost"), timing system and other processes on cmp035
- keep node allocations hardwired (like existing .cnf)
- run control/timing processes
- need a replacement for procstat (could start with a command-line version, like the sketch after this list, and do a GUI later)
- if one process crashes, we don't want the whole job to exit
- different daq processes need different environments
- would like a Python interface for users, not a bash one
- need per-process log files
- issue to think about: how do we remember the previously selected detectors? (could we use "sacct"? maybe we could add info to sacct?)
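To make the short-term list concrete, here is a minimal Python sketch (assumptions: the hypothetical cnf layout above; log paths are illustrative; the sbatch/squeue flags used are standard Slurm). Submitting each process as its own job gives crash isolation, a per-process log via --output, and a per-process environment via --export; status() is a starting point for a command-line procstat replacement:

    import subprocess

    def submit(entry):
        """Launch one cnf entry as an independent slurm job; returns the job id."""
        env = entry.get('env', '')            # e.g. 'EPICS_PVA_ADDR_LIST=...'
        cmd = ['sbatch',
               '--job-name=' + entry['id'],
               '--nodelist=' + entry['host'],              # hardwired, like the .cnf
               '--output=/tmp/' + entry['id'] + '-%j.log', # per-process log file
               '--export=ALL' + (',' + env if env else ''),
               '--wrap=' + entry['cmd']]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return out.stdout.split()[-1]         # 'Submitted batch job <id>' -> '<id>'

    def status(jobids):
        """Command-line procstat stand-in: one squeue line per daq process."""
        out = subprocess.run(['squeue', '--jobs=' + ','.join(jobids),
                              '--format=%i %j %T %N %M'],
                             capture_output=True, text=True, check=True)
        print(out.stdout)

On the detector-selection question: one option (assuming slurmdbd is configured to store job comments, AccountingStoreFlags=job_comment) would be to stash the selection in sbatch --comment and read it back later with sacct.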
- long-term, ideally Slurm would:
- understand which nodes are camlink nodes ("resource management")
- dynamically allocate requested types of nodes
- (hard, do as a second step?) change the BOS connections so the right detectors are connected to the allocated nodes (see the sketch below)
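For the dynamic-allocation step, jobs would request counts of node types instead of hostnames (e.g. sbatch --constraint=hsd --nodes=3, using features like those sketched earlier) and then discover which nodes they actually got, which is exactly the information a BOS-reconfiguration step would need. A minimal sketch:

    import os, subprocess

    def allocated_hosts():
        """Inside a slurm job: expand $SLURM_JOB_NODELIST into hostnames."""
        out = subprocess.run(
            ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']],
            capture_output=True, text=True, check=True)
        return out.stdout.split()   # e.g. ['cmp040', 'cmp041', 'cmp042']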