LCLS Needs

  • (highest priority) On-shift: jobs start in ~1 minute
  • (middle priority) Off-shift: jobs start "soon"
  • (standard priority) Normal analysis: standard jobs that must not be killed (LCLS doesn't checkpoint)

Previous implementations:

  • at SLAC: multiple levels of suspend-preemption
  • at NERSC: reservations

Existing LCLS resources: 3000 cores on roma, 11000 cores on milano (ready in April)

On-shift Options

  • (current default plan) reservations (wasteful, difficult to manage)
  • Amedeo suggests using high/middle/standard "queue position" QOS levels here, in the hope that we will have enough resources for a 1-minute start time, with reservations as a backup plan
  • kill-preemption (hard on users; less so if jobs can be automatically resubmitted?)
  • suspend-preemption (reduces available memory, difficult in a shared environment like S3DF)
    • would require tweaks to the existing queues that would affect everyone
  • (longer term R&D) Amedeo and Yee ask: can we use QOS?
    • multiple QOS levels? The user specifies whether their job is killable, suspendable, standard, off-shift, or on-shift
    • add swap space so that we can use suspend-preemption
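The multiple-QOS idea above can be sketched as a toy model: when an on-shift job needs cores, running jobs whose users marked them killable are taken first, then suspendable ones, and standard jobs are never touched. This is only an illustration of the policy under discussion, not Slurm's actual preemption algorithm. The QOS labels, the `RunningJob` type, and `pick_victims` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    job_id: int
    qos: str    # hypothetical labels: "killable", "suspendable", "standard"
    cores: int


def pick_victims(running, cores_needed, cores_free):
    """Return (job, action) pairs that free enough cores, or None if impossible.

    Killable jobs are preempted before suspendable ones; jobs with any
    other QOS (e.g. "standard") are never selected.
    """
    victims = []
    available = cores_free
    for qos, action in (("killable", "kill"), ("suspendable", "suspend")):
        # Within a class, take the largest jobs first to disturb fewer jobs.
        candidates = sorted((j for j in running if j.qos == qos),
                            key=lambda j: -j.cores)
        for job in candidates:
            if available >= cores_needed:
                return victims
            victims.append((job, action))
            available += job.cores
    return victims if available >= cores_needed else None


running = [
    RunningJob(1, "standard", 512),
    RunningJob(2, "killable", 128),
    RunningJob(3, "suspendable", 256),
    RunningJob(4, "killable", 64),
]
plan = pick_victims(running, cores_needed=300, cores_free=0)
```

In this model a suspended job gives back its cores but still holds its memory, which is why swap space comes up as a prerequisite for suspend-preemption.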

Off-shift Options

  • (current default plan) move-to-head-of-queue (high/middle/standard) QOS (Amedeo agrees with this)
    • no fairshare QOS in S3DF at the moment
  • suspend or kill preemption

To Do

  • Turn on PriorityWeightQOS in slurm.conf. Yee and Renata will make this happen; it is not yet clear when.
    • backup solution will be reservations
  • Should the QOS be defined by the job or a partition?
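To illustrate why the QOS weight matters for the move-to-head-of-queue plan, here is a simplified sketch of a weighted-sum priority in the style of Slurm's multifactor plugin, where each factor lies between 0.0 and 1.0 and is multiplied by its configured weight. Only three factors are modeled, and the weights and QOS priority values are hypothetical, chosen for illustration rather than taken from the S3DF configuration.

```python
# Hypothetical weights and QOS priorities, chosen only for illustration.
WEIGHTS = {"age": 1000, "fairshare": 10000, "qos": 100000}
QOS_PRIORITY = {"standard": 0, "middle": 50, "high": 100}


def qos_factor(qos):
    """Normalize a QOS priority to the range 0.0-1.0."""
    return QOS_PRIORITY[qos] / max(QOS_PRIORITY.values())


def job_priority(qos, age_factor, fairshare_factor, weights=WEIGHTS):
    """Weighted sum of factors, each between 0.0 and 1.0 (simplified model)."""
    return int(
        weights["age"] * age_factor
        + weights["fairshare"] * fairshare_factor
        + weights["qos"] * qos_factor(qos)
    )


# A freshly submitted "middle" job outranks an old "standard" job
# as long as the QOS weight dominates the other weights.
new_middle = job_priority("middle", age_factor=0.0, fairshare_factor=0.0)
old_standard = job_priority("standard", age_factor=1.0, fairshare_factor=1.0)

# With the QOS weight at zero, the QOS level has no effect on ordering.
no_qos = {**WEIGHTS, "qos": 0}
same_high = job_priority("high", 0.0, 0.5, no_qos)
same_standard = job_priority("standard", 0.0, 0.5, no_qos)
```

With a zero QOS weight the high/middle/standard levels cannot change queue position, so the weight has to be set large relative to the age and fairshare weights for the plan to work as intended.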