Below please find the wishlist and questions the main SLAC batch system users and administrators posed.
User Group 1
- Automatic job preemption/suspend/resume?
- Support for multiple-levels of job preemption (e.g. 3-queue hierarchy)?
- Job environment propagation (including limits like "stacksize")?
- Subgroup-specific priority calculation (queue-specific priority formula)?
- Capability to delegate subgroup administration privileges (adjust job priorities, suspend, resume, kill) to subgroup administrators?
- Cross-queue fairshare (with cpu-speed weighting)?
- CPU advanced reservations for MPI?
- GPU support?
- Ability to submit jobs to hosts where we don't have accounts/home-directories?
- Avoid bad behavior when MPI head node reboots: slave node processes get "forgotten" ?
- How well does the system scale?
- Number of cores, queues, queued and running jobs?
User Group 2
- Please list supported operating systems (for submission hosts and for execution hosts)
- Explicit, site-specific-naming resource specification (e.g., rhel5-64, amount of /scratch space, etc.) at job submission time?
- Is there an API for "time remaining" query (to save state near end of job)?
- Do submission / management hosts require a license?
- ability to submit/monitor jobs from any machine at SLAC (not just those with licenses as is the case with LSF)
User Group 3
- Resource arbitration (i.e. ability to easily control number of running jobs to prevent overloading of disks and other resouces, such as memory)
- Reliable notification to job when CPU, other resources, about to be exhausted
- Built-in support for virtualization?
- Ability to suspend/resume/move running jobs (MPI and single-core)
- API (preferably rest/xml/json) for submitting, monitoring, controlling jobs, in addition to command-line control
- Can resource allocation within a subgroup be delegated to subgroup administrators?
- e.g. today I want 90% of resources dedicated to one subgroup within my allocation, tomorrow only 10%
- Hooks for adding extensions (e.g. kerberos tickets or afs token support) without introducing maintainance problems
User Group 4
- Knowledge of cluster topology to assign large jobs across nodes in optimized way?
- For example an MPI job with 256 cores would best be assigned on as few nodes as possible and those nodes should span as few Infiniband/ethernet switches as possible.
- Ability to request exclusive access to nodes: this is to support using the full node's memory on a subset of the nodes cores?
- Ability to schedule using node properties and resources: memory, gpu cores, cpu cores, i/o connectivity, other user defined properties?
- Job priority schemes that support coexistence of small and large jobs?
- This likely includes issues of pre-emption, backfill, checkpointing, migration and restarting. Also includes aggregation of single core jobs onto nodes with other similar jobs.
- Ability to partition the cluster / support multiple cluster (by host capabilities, such as IB-capable, GPU nodes, other per-node attributes)?
- Ability to automatically enable preemption at certain times of day?
User Group 5
- Is the batch system supported by Open Science Grid?
User Group 6
- Software should be installed via RPM or similar, both in terms of reducing the reliance on a shared file system, and in terms of simplification of system install process
- Per-node configuration should be fairly simple, and easily manageable via a configuration management tool (e.g. chef).
- Servers must support semi-high-availability
- at least as much as our existing LSF setup (at a minimum 2 redundant batch-manager nodes).
- Can there be multiple simultaneous sources for authorization information that are applied in a well-defined order?
- e.g. NIS, LDAP, flat files as source where a host-local file can override central LDAP
- Where possible, we want to be able to perform meaningful queries regarding current queue and job states and get pretty-and-useful reports back.
- This is true both for command-line operations *and* web operations.
- From the Unix side, interaction should occur over well-documented and consistent ports (so that we can handle firewalls properly).
User Group 7
- Integration with AFS (i.e. it must be possible for single-node and MPI jobs to acquire the user's kerberos and AFS identities)?
- Does the system (at a minimum) provide hooks to implement such a feature?
- Real-time and historical monitoring , accounting and analytics (performance, utilization, resource consumption, etc) available through web interface, command line, other APIs (e.g. for embedding in other web applications).
- It is fine if this is an add-on capability
- Short list of resource allocation / scheduling schemes (such as fairshare, backfill, etc.) with brief explanations
- Job submission and management through a web portal?
- Duration and limitations of trial licenses?
User Group 8
- How are licenses managed (i.e. is there a requirement for a license server, if yes, does it allow for redundancy)?
- Ability to schedule jobs based on file server load?