Below are the wishlist and the follow-up questions posed by the main users and administrators of the SLAC batch system.
Follow-up Questions
- What kind of support (transition/long-term) do we get for the quoted price?
- How much downtime do we incur if we need to "restart the system"?
- What sort of activities require such a "restart" (e.g. creating a new queue)?
- Do large numbers of short jobs (about 1 minute) cause problems?
- Please provide additional information about support for virtualization, and any plans for future enhancements to that support.
- What would be involved in adding support for Mac OS X or Windows? Would that be considered normal "support"?
- Where are the bottlenecks in the system likely to occur? For example, can a user stress the system by repeatedly and frequently querying job stats?
Original Wishlist
User Group 1
- Automatic job preemption/suspend/resume?
- Support for multiple levels of job preemption (e.g. 3-queue hierarchy)?
- Job environment propagation (including limits like "stacksize")?
- Subgroup-specific priority calculation (queue-specific priority formula)?
- Capability to delegate subgroup administration privileges (adjust job priorities, suspend, resume, kill) to subgroup administrators?
- Cross-queue fairshare (with CPU-speed weighting)?
- CPU advance reservations for MPI?
- GPU support?
- Ability to submit jobs to hosts where we don't have accounts/home-directories?
- Avoid bad behavior when an MPI head node reboots (slave node processes get "forgotten")?
- How well does the system scale?
- Number of cores, queues, queued and running jobs?
...