
Slurm manages resources (CPUs, GPUs)

The first person to talk to is Ric

In principle we have our own customizable Slurm installation on drp-srcf-*, but it might still need work/tweaking from IT.

Can we extend Slurm to manage our own resource types, which are determined by the KCU firmware loaded on each node (see the sketch after this list)? e.g.:

  • RIX timing system fiber connection
  • CameraLink node
  • high-rate timing nodes
  • low-rate timing nodes (EPICS)
  • HSD nodes
  • generic (Wave8) nodes
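
A minimal sketch of how Slurm could learn our node types, assuming we tag each node with a feature string matching its loaded KCU firmware; the feature name "camlink" and the script name "drp_camlink.sh" are hypothetical. Jobs could then request a node type with --constraint, and sinfo shows the node-to-feature mapping:

    # Sketch: read back node features and submit a job against one of them.
    import subprocess

    def node_features():
        """Map node name -> feature list, via sinfo's %N (node) and %f (features)."""
        out = subprocess.run(['sinfo', '-N', '-h', '-o', '%N %f'],
                             capture_output=True, text=True, check=True).stdout
        return {node: feats.split(',')
                for node, _, feats in (l.partition(' ') for l in out.splitlines())}

    # Ask for one node whose KCU firmware is the camlink flavor (names hypothetical).
    subprocess.run(['sbatch', '-N1', '--constraint=camlink', 'drp_camlink.sh'],
                   check=True)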

Conceptually want:

  • "sbatch tmo.cnf" (instead of "procmgr start tmo.cnf")
  • tmo.cnf has (see the hypothetical slurm-style cnf sketched after this list):
    • typical timing system on cmp028: "drp -P tmo -D ts ..." 
    • typical control.py on mon001: "control -P tmo ..."
    • special localhost: "control_gui -P tmo -B DAQ:NEH" (this is unusual because it runs on localhost and has a GUI; with procmgr, only localhost processes have GUIs)
  • first short-term approach (if we can do this we may be able to replace procmgr; see the submission sketch after this list):
    • use the resources in ~cpo/git/lcls2/psdaq/psdaq/cnf/slurm.cnf: GUIs on cmp036 ("localhost"), with the timing system and other processes on cmp035
    • keep node allocations hardwired (like existing .cnf)
    • run control/timing processes
    • need a replacement for procstat (could start with a command-line version and do a GUI later; see the sketch after this list)
    • if one process crashes, we don't want the whole job to exit
    • different DAQ processes need different environments
    • would like a Python interface for the users, not a bash one
    • need per-process log files (see the submission sketch after this list)
    • issue to think about: how do we remember the previously selected detectors? (could we use "sacct"? maybe we could add info to sacct? see the sketch after this list)
  • long-term, ideally Slurm would:
    • understand which nodes are camlink nodes ("resource management")
    • dynamically allocate requested types of nodes
    • (hard; do as a second step?) change the BOS connections so that the right detectors are connected to the allocated nodes
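
For the tmo.cnf contents above, one could imagine the slurm-flavored cnf staying a plain Python description of the processes, in the spirit of the existing procmgr cnf files. A hypothetical sketch (the key names and schema are made up; the hosts and commands come from the examples above):

    # Hypothetical slurm-style cnf: a plain Python list describing the DAQ processes.
    config = [
        {'host': 'cmp028',    'id': 'timing',      'cmd': 'drp -P tmo -D ts ...'},
        {'host': 'mon001',    'id': 'control',     'cmd': 'control -P tmo ...'},
        # The unusual case: a GUI process that must run on the submitting host.
        {'host': 'localhost', 'id': 'control_gui', 'cmd': 'control_gui -P tmo -B DAQ:NEH'},
    ]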
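
A short-term submission sketch, assuming a cnf like the one above: put everything in a single allocation and launch each process as its own backgrounded srun step. A crashing step then doesn't take the whole job down, and srun's -o option gives each process its own log file. The submit() helper is hypothetical; the sbatch/srun options are standard, and sbatch reads the generated batch script from stdin. (The localhost/GUI case is skipped here, since it is the unusual one noted above.)

    # Sketch: turn the cnf into one sbatch job with one srun step per process.
    import subprocess

    def submit(config, job_name='tmo'):
        hosts = sorted({p['host'] for p in config if p['host'] != 'localhost'})
        lines = ['#!/bin/bash']
        for p in config:
            if p['host'] == 'localhost':
                continue  # GUI processes: unresolved, see above
            # One step per process; '&' keeps a crash in one step from ending the
            # job, and -o gives each process its own log file.
            lines.append(f"srun -N1 -n1 -w {p['host']} -o {p['id']}.log {p['cmd']} &")
        lines.append('wait')  # hold the allocation until every step exits
        subprocess.run(['sbatch', f'--job-name={job_name}', f'-N{len(hosts)}',
                        f'--nodelist={",".join(hosts)}'],
                       input='\n'.join(lines) + '\n', text=True, check=True)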
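
A command-line replacement for procstat could start as a thin wrapper over the accounting tools: one line per srun step, with name, state, and exit code. The sacct options used here are standard; the script itself is just a sketch:

    # Sketch: minimal procstat-like status listing, one line per job step.
    import subprocess, sys

    def status(jobid):
        out = subprocess.run(
            ['sacct', '-j', str(jobid), '--parsable2', '--noheader',
             '--format=JobID,JobName,State,ExitCode'],
            capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            jid, name, state, exitcode = line.split('|')
            print(f'{jid:<16} {name:<16} {state:<12} {exitcode}')

    if __name__ == '__main__':
        status(sys.argv[1])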
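
For remembering the previously selected detectors, one possibility (an assumption, not something verified on our installation) is to stash the selection in the job's comment field at submit time and read it back with sacct; having sacct return comments requires the cluster's accounting configuration to store them:

    # Sketch: record the detector selection in the job comment, recover it later.
    # Assumes the accounting database is configured to store job comments.
    import subprocess

    def submit_with_detectors(script, detectors):
        subprocess.run(['sbatch', f'--comment=dets={",".join(detectors)}', script],
                       check=True)

    def last_detectors(jobid):
        out = subprocess.run(['sacct', '-j', str(jobid), '--noheader',
                              '--parsable2', '--format=Comment'],
                             capture_output=True, text=True, check=True).stdout
        return out.splitlines()[0].removeprefix('dets=').split(',')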