LCLS Needs
- (highest priority) On-shift: jobs start in ~1 minute
- (middle priority) Off-shift: jobs start "soon"
- (standard priority) Normal analysis: standard non-killable jobs (LCLS doesn't checkpoint)
Previous implementations:
- at SLAC: multiple levels of suspend-preemption
- at NERSC: reservations
Existing LCLS resources: 3000 cores on roma, 11000 cores on milano (ready in April)
On-shift Options
- (current default plan) reservations (wasteful, difficult to manage)
- Amedeo suggests a high/middle/standard "queue position" QOS for this, in the hope that we will have enough resources for a 1-minute start time, with reservations as a backup plan
- kill-preemption (hard on users, less so if jobs can be automatically resubmitted?)
- suspend-preemption (reduces available memory, difficult in a shared environment like S3DF)
  - would require tweaks to the existing queues that would affect everyone
- (longer term R&D) Amedeo and Yee ask: can we use QOS?
  - multiple QOS? The user specifies whether their job is killable, suspendable, standard, off-shift, or on-shift
  - have swap space so we can use suspend-preemption
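As an illustration of the "user specifies" idea above, here is a minimal sketch of a helper that builds the `sbatch` command line for a given tier. The QOS names (`lcls-high`, `lcls-middle`, `lcls-standard`) and the helper itself are hypothetical; the notes do not say what the real QOS will be called. The `--qos` and `--requeue` options are standard `sbatch` flags; `--requeue` marks a job as eligible to be requeued, which is one possible way to soften kill-preemption if the site configures preemption to requeue.

```python
# Hypothetical mapping from LCLS tier to Slurm QOS name.
# These names are placeholders, not the real S3DF configuration.
QOS_BY_TIER = {
    "on-shift": "lcls-high",
    "off-shift": "lcls-middle",
    "standard": "lcls-standard",
}


def build_sbatch_command(tier, script, requeue_on_preempt=False):
    """Return the sbatch argument list for a job in the given tier."""
    if tier not in QOS_BY_TIER:
        raise ValueError(f"unknown tier: {tier!r}")
    cmd = ["sbatch", f"--qos={QOS_BY_TIER[tier]}"]
    if requeue_on_preempt:
        # Mark the job as eligible for requeue (e.g. after preemption).
        cmd.append("--requeue")
    cmd.append(script)
    return cmd
```

For example, `build_sbatch_command("standard", "analyze.sh", requeue_on_preempt=True)` yields a command for a killable job that can be resubmitted automatically, assuming the cluster is set up to requeue preempted jobs.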
Off-shift Options
- (current default plan) move-to-head-of-queue (high/middle/standard) QOS (Amedeo agrees with this)
  - no fairshare-qos in S3DF at the moment
- suspend or kill preemption
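To make the move-to-head-of-queue idea concrete, here is a toy model of how a QOS weight reorders pending jobs. It is a simplified sketch of Slurm's multifactor priority, keeping only the age and QOS terms; the real plugin also includes fairshare, job size, partition, and other factors. The QOS priority values and job names are made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical QOS priorities; real values would be set by the admins.
QOS_PRIORITY = {"high": 100, "middle": 50, "standard": 0}


@dataclass
class Job:
    name: str
    qos: str
    age_factor: float  # 0.0 (just submitted) to 1.0 (waited the maximum age)


def job_priority(job, weight_age, weight_qos):
    """Simplified priority: weighted age term plus weighted, normalized QOS term."""
    qos_factor = QOS_PRIORITY[job.qos] / max(QOS_PRIORITY.values())
    return weight_age * job.age_factor + weight_qos * qos_factor


def queue_order(jobs, weight_age, weight_qos):
    """Return job names sorted from highest to lowest priority."""
    ranked = sorted(
        jobs,
        key=lambda j: job_priority(j, weight_age, weight_qos),
        reverse=True,
    )
    return [j.name for j in ranked]


jobs = [
    Job("standard-old", "standard", 1.0),
    Job("off-shift-new", "middle", 0.1),
    Job("on-shift-new", "high", 0.0),
]

# With a zero QOS weight, the QOS has no effect and the oldest job runs first.
print(queue_order(jobs, weight_age=1000, weight_qos=0))
# With a QOS weight that dominates age, the tiers sort high, middle, standard.
print(queue_order(jobs, weight_age=1000, weight_qos=10000))
```

In this model the QOS only changes queue position. It does not free up cores, so start time still depends on running jobs finishing (or on preemption or reservations).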
To Do
- Turn on PriorityWeightQOS in slurm.conf. Not clear when this can be done.
  - backup solution will be reservations
  - Yee and Renata will make this happen. Not clear when.
- Should the QOS be defined by the job or a partition?