Notes from mtg with Thorsten, Murali, Wilko, Yee, Silke, Valerio on Jan 19, 2024
Yee's two points:
(1) need to limit the number of non-preemptable jobs, in particular the
single-core jobs which can scatter across nodes
(2) how do we preempt the jobs and limit the resources used by preemptable
jobs
Yee's proposal:
(1) limit every facility's use of non-preemptable jobs to what
they have purchased (e.g. LCLS is limited to 88 nodes of
non-preemptable jobs, or half of 176 nodes)
- coact can help with this
- is this a hard limit, or does it have a long time-constant?
     o a hard limit for a repo (enforced by slurm), but not at the
       multiple-repo level
       (the multi-repo limit is enforced by coact, not slurm, so
       there's a long time-constant before we notice we've crossed
       the 88-node threshold)
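The coact-side check described above can be sketched as a periodic sum over a facility's repos. This is a hypothetical illustration (repo names, usage numbers, and the function are made up); a real version would query Slurm accounting rather than take a dict.

```python
# Hypothetical sketch: Slurm enforces the per-repo cap, but the
# facility-wide cap (e.g. LCLS's 88 nodes) is only checked periodically
# by coact, hence the long time-constant.

FACILITY_NODE_CAP = {"lcls": 88}  # half of the 176 purchased nodes

def over_facility_cap(facility, repo_node_usage, caps=FACILITY_NODE_CAP):
    """Return (total nodes in use, over-cap?) across one facility's repos."""
    total = sum(repo_node_usage.values())
    return total, total > caps[facility]

# Example: three LCLS experiment repos running non-preemptable jobs.
usage = {"exp1": 40, "exp2": 30, "exp3": 25}
total, over = over_facility_cap("lcls", usage)
# total == 95, which exceeds the 88-node threshold
```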
(2) LCLS defines the order of preemption of experiment repos
(add-on) could extend this to support cross-facility preemption
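Point (2) above can be sketched as walking a facility-defined repo order to pick victim jobs. Everything here (repo names, job records, the function) is illustrative, not an actual coact or Slurm interface.

```python
# Hypothetical sketch: LCLS publishes an ordered list of experiment
# repos, and the preemptor walks that list, preempting jobs until
# enough nodes are freed.

def pick_victims(preemption_order, running_jobs, nodes_needed):
    """Pick job ids to preempt, in the facility-defined repo order."""
    victims, freed = [], 0
    for repo in preemption_order:
        for job in running_jobs.get(repo, []):
            if freed >= nodes_needed:
                return victims
            victims.append(job["id"])
            freed += job["nodes"]
    return victims

order = ["exp4", "exp3", "exp2", "exp1"]   # normal repos first, on-shift last
jobs = {"exp4": [{"id": 101, "nodes": 4}],
        "exp3": [{"id": 102, "nodes": 8}]}
# Needing 6 nodes preempts job 101 (4 nodes), then job 102 (8 nodes).
```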
concrete example of Yee's proposal:
4 expts: exp1 (on-shift), exp2 (off-shift), exp3 (normal), exp4 (normal)
- set exp1 to have high-priority for queue-placement
o implemented by setting QOS=on-shift for exp1
- have a sequence of job-QOS's: preemptable, normal, off-shift, on-shift
     o could have "normal" jobs be preemptable, although that could
       create issues with sharing with others like Rubin
     o reuse Murali's stuff to automatically get on-shift/off-shift settings
       - based on whether the experiment is running within its URAWI
         calendar start/end times
- can this handle last-minute changes? could add some buffer at
the edges or manually override (sub-czars could do this?)
     o the "on-shift" setting for a repo is a permission; on-shift
       expts could specify either the on-shift or normal QOS in their
       job submission script
- there can be a default QOS
       - if a job is set to a non-permissible QOS, the job will
         currently fail (Silke would like it to switch to a lower QOS
         automatically)
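Silke's suggested fallback could look like the sketch below: instead of rejecting a job whose requested QOS the repo isn't permitted to use, walk down the QOS sequence from these notes and take the highest permitted one. The permission model and function are assumptions for illustration.

```python
# QOS sequence from the notes, highest priority first.
QOS_ORDER = ["on-shift", "off-shift", "normal", "preemptable"]

def effective_qos(requested, permitted):
    """Return the requested QOS if permitted, else the next lower permitted one."""
    start = QOS_ORDER.index(requested)
    for qos in QOS_ORDER[start:]:          # walk downward from the request
        if qos in permitted:
            return qos
    raise ValueError("no permitted QOS at or below the requested level")

# A normal repo asks for on-shift but only holds normal/preemptable:
# effective_qos("on-shift", {"normal", "preemptable"}) -> "normal"
```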
   - we will try suspend-based preemption within the milan partition
     o memory is a worrisome issue (suspended jobs stay resident), but
       nodes will get larger SSDs in 3 months (several TB)
       - expect 2-10 GB/s (maybe limited by the kernel), so swapping
         out 512 GB would take ~1.5 min, but it happens in parallel,
         so hopefully OK
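The 1.5 min figure above checks out as a back-of-envelope calculation (the quoted 1.5 min corresponds to the middle of the 2-10 GB/s range):

```python
# Time to move a suspended job's resident memory to local SSD.
def swap_out_minutes(mem_gb, rate_gb_per_s):
    return mem_gb / rate_gb_per_s / 60

# 512 GB of resident memory:
#   at  2 GB/s -> ~4.3 min
#   at 10 GB/s -> ~0.85 min
#   ~1.5 min corresponds to ~5.7 GB/s, mid-range
```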
- each experiment repo would set an allocation (a hard limit!) that
would limit the number of cores (enables multiple on-shift expts)
- beamline staff could tweak repo allocations? may need sub-czar
(operator) privileges
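The per-repo core allocation above (a hard limit, so multiple on-shift expts can coexist) can be sketched as a simple admission check. Repo names, core counts, and the function are illustrative; real limits would live in Slurm accounting.

```python
# Hypothetical sketch: admit a job only if its repo stays within its
# hard core allocation.
def admit(job_cores, repo, repo_usage, repo_alloc):
    """True iff the repo's usage plus this job fits its core allocation."""
    return repo_usage.get(repo, 0) + job_cores <= repo_alloc[repo]

alloc = {"exp1": 2048, "exp2": 1024}       # hard per-repo core limits
usage = {"exp1": 1900}
# admit(256, "exp1", usage, alloc) -> False (would exceed 2048)
# admit(256, "exp2", usage, alloc) -> True
```

Tweaking `repo_alloc` is the knob beamline staff (with sub-czar/operator privileges) would turn.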