LCLS Needs

(highest priority) On-shift: jobs start in ~1 minute
(middle priority) Off-shift: jobs start "soon"
(standard priority) Normal analysis: standard non-killable jobs (LCLS doesn't checkpoint)

Previous implementations:

existing LCLS resources: 3000 cores on roma, 11000 cores on milano (ready in April)

On-shift Options

(current default plan) reservations (wasteful, difficult to manage)
Amedeo suggests a high/middle/standard "queue position" QOS for this, in the hope that we will have enough resources for a 1-minute start time, with reservations as a backup plan
kill-preemption (hard on users, less so if jobs can be automatically resubmitted?)
suspend-preemption (reduces available memory, difficult in a shared environment like S3DF)
- would require tweaks to the existing queues that would affect everyone
(longer term R&D) Amedeo and Yee ask: can we use QOS?
- multiple QOS? user specifies if their job is killable, suspendable, standard, off-shift, on-shift
- have swap space so we can use suspend-preemption

(current default plan) move-to-head-of-queue (high/middle/standard) QOS (Amedeo agrees with this)
- no fairshare-qos in S3DF at the moment
suspend or kill preemption

Turn on PriorityWeightQOS in slurm.conf. Not clear when this can be done.
- backup solution will be reservations
- Yee and Renata will make this happen. Not clear when.
Should the QOS be defined by the job or a partition?
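
Sketch of what turning on PriorityWeightQOS would do to queue order. This is a simplified model of Slurm's multifactor priority (weighted sum of normalized factors), keeping only the QOS, age and fairshare terms. The QOS names, QOS priorities and weight values below are all made-up placeholders, not S3DF settings; nothing here has been decided.

```python
from dataclasses import dataclass

# Hypothetical QOS names and priorities (assumption, not the S3DF configuration).
QOS_PRIORITY = {"onshift": 1000, "offshift": 500, "normal": 0}

# Hypothetical weights standing in for PriorityWeightQOS, PriorityWeightAge
# and PriorityWeightFairshare in slurm.conf (values are assumptions).
WEIGHT_QOS = 100_000
WEIGHT_AGE = 1_000
WEIGHT_FAIRSHARE = 10_000


@dataclass
class Job:
    name: str
    qos: str
    age_factor: float        # 0.0 (just submitted) .. 1.0 (waited the longest)
    fairshare_factor: float  # 0.0 (heavy recent usage) .. 1.0 (little usage)


def priority(job, weight_qos=WEIGHT_QOS):
    """Simplified multifactor priority: weighted sum of normalized factors."""
    max_qos = max(QOS_PRIORITY.values())
    # The QOS factor is the QOS priority normalized to the highest QOS priority.
    qos_factor = QOS_PRIORITY[job.qos] / max_qos if max_qos else 0.0
    return (weight_qos * qos_factor
            + WEIGHT_AGE * job.age_factor
            + WEIGHT_FAIRSHARE * job.fairshare_factor)


def queue_order(jobs, weight_qos=WEIGHT_QOS):
    """Return job names sorted from head of queue to tail."""
    ranked = sorted(jobs, key=lambda j: priority(j, weight_qos), reverse=True)
    return [j.name for j in ranked]


jobs = [
    Job("normal-old", "normal", age_factor=1.0, fairshare_factor=1.0),
    Job("offshift-new", "offshift", age_factor=0.0, fairshare_factor=0.5),
    Job("onshift-new", "onshift", age_factor=0.0, fairshare_factor=0.0),
]

# With the QOS weight on, on-shift jobs go to the head of the queue.
print(queue_order(jobs))
# With the QOS weight at 0, age and fairshare decide the order.
print(queue_order(jobs, weight_qos=0))
```

In this model the QOS weight has to dominate the other weights for a freshly submitted on-shift job to jump ahead of old normal jobs. Queue position only decides who starts next when cores free up, so it does not by itself guarantee a ~1 minute start, which is why reservations stay as the backup.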