
Slurm manages resources (CPUs, GPUs)

The first person to talk to is Ric.

In principle we have our own customizable Slurm installation on the drp-srcf-* nodes, but it might still need work/tweaking from IT.

Can we extend Slurm to manage our resources, whose types are determined by KCU firmware? e.g.:

  • RIX timing system fiber connection
  • CameraLink node
  • high-rate timing nodes
  • low-rate timing nodes (EPICS)
  • HSD nodes
  • generic (Wave8) nodes

Conceptually want:

  • "sbatch tmo.cnf" (instead of "procmgr start tmo.cnf")
  • tmo.cnf has
    • typical timing system on cmp028: "drp -P tmo -D ts ..." 
    • typical control.py on mon001: "control -P tmo ..."
    • special localhost entry: "control_gui -P tmo -B DAQ:NEH" (this is unusual because it runs on the localhost and has a GUI; with procmgr, only localhost processes have GUIs)
  • first short-term approach (if we could do this we may be able to replace procmgr):
    • use the resources in ~cpo/git/lcls2/psdaq/psdaq/cnf/slurm.cnf: GUIs on cmp036 ("localhost"), with the timing system and other processes on cmp035
    • keep node allocations hardwired (like existing .cnf)
    • run control/timing processes
    • need a replacement for procstat (could start with a command-line version and do a GUI later)
    • if one process crashes, we don't want the whole job to exit
    • different DAQ processes need different environments
    • would like a Python interface for users, not a bash one
    • need per-process log files
    • issue to think about: how do we remember the previously selected detectors? (could we use "sacct"? maybe we could add info to sacct?)
  • long term, ideally Slurm would:
    • understand which nodes are CameraLink nodes ("resource management")
    • dynamically allocate the requested types of nodes
    • (hard; do as a second step?) change the BOS connections so the right detectors were connected to the allocated nodes
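The short-term approach above could be sketched in Python. Everything here is an assumption for illustration: the cnf-style dict, the `build_sbatch_cmd` helper, the `drp` partition name, and the /tmp log paths are invented, not the real psdaq conventions. The idea is one sbatch job per DAQ process, so that a crash in one process leaves the others running, and each process gets its own log file and environment:

```python
# Hypothetical sketch: one sbatch job per DAQ process. A crash in one
# process does not take down the others, and each job carries its own
# log file and environment. All names below are illustrative.

def build_sbatch_cmd(name, host, cmd, env=None, partition="drp"):
    """Return an sbatch command (as an argv list) for a single DAQ process."""
    argv = [
        "sbatch",
        f"--partition={partition}",
        f"--nodelist={host}",          # hardwired node, like the existing .cnf
        f"--job-name={name}",
        f"--output=/tmp/{name}.log",   # per-process log file (path is illustrative)
    ]
    if env:
        # Per-process environment on top of the submitter's environment.
        exports = ",".join(f"{k}={v}" for k, v in env.items())
        argv.append(f"--export=ALL,{exports}")
    argv += ["--wrap", cmd]
    return argv

# A cnf-like description (invented format): name -> (host, command, extra env)
procs = {
    "timing":  ("cmp028", "drp -P tmo -D ts", {"CONDA_ENV": "daq"}),
    "control": ("mon001", "control -P tmo", None),
}

for name, (host, cmd, env) in procs.items():
    print(" ".join(build_sbatch_cmd(name, host, cmd, env)))
```

A Python wrapper like this would also be a natural place to keep track of previously selected detectors, rather than pushing that state into Slurm accounting.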

Note from Ric 3/1/2024:

I thought that maybe the first thing to try would be to figure out how to launch a process that brings up a GUI, e.g., groupca or xpmpva, or maybe even start simpler with xeyes or xclock.  The main idea was to test whether we can tell Slurm that the process we want to run is an X11 application, which I read in the docs it can do.
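That first X11 test could be as small as the following sketch (assuming the site's Slurm build has X11 forwarding support enabled, which is what provides the --x11 option):

```
# Ask Slurm to forward X11 from the allocated node, then run a trivial X client
srun --x11 xclock
```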
The next thing might be to try to bring up the DAQ using Slurm, and thus think about what the Slurm description file would look like.  Can we use something like the .cnf?  Can we automatically convert the .cnfs to whatever Slurm requires?  Or do we need to start from scratch?  For this step I’m thinking we would still have to specify everything, like the node each process runs on.
The last thing I looked into a little bit was the idea of defining resources to slurm.  For this I thought I’d need some setup to try things out on, which resulted in Jira ECS-4017 (I don’t think anything was done though).  Chris Ford was also working on this project and he suggested setting up a virtual machine with a private slurm setup I could tinker with (I haven’t figured out how to do that, yet).  Anyway, the idea of the resources is that based on what each DRP needs (e.g., detector type, KCU firmware type, a GPU, X11, etc.), resources would be defined to slurm so that when you launch a DAQ, it would allocate the nodes according to the resources needed and start the processes on them.  Perhaps at some point in the future we could even have it modify the connections in the BOS to connect a detector to an available host that has the right KCU firmware, thus making RIX hosts available to TMO and vice versa.
I think that’s about as far as I got.  Let me know if you have questions.  I have to take Rachel to a doctor’s appointment at 1 so I think I’ll be out until 3 or so.  We can talk later, if you prefer.  I’ll take a look at the link as soon as I can.  Feel free to add the above to that if you think it would be helpful.
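The resource-definition idea in Ric's note could be sketched with Slurm node features, which tag nodes with static properties that jobs then request via a constraint. The feature names and node ranges below are invented for illustration (consumable resources such as GPUs would use GRES instead):

```
# slurm.conf sketch -- tag nodes with the KCU firmware type they carry
NodeName=drp-srcf-cmp[028-034] Features=kcu_camlink
NodeName=drp-srcf-cmp[035-040] Features=kcu_hsd

# A DRP process then requests a matching node instead of a hardwired one, e.g.:
#   srun --constraint=kcu_hsd drp -P tmo -D hsd ...
```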
