Below are the wishlist and the follow-up questions posed by the main SLAC batch system users and administrators.

Follow-up Questions

  • What kind of support (transition/long-term) do we get for the quoted price?
  • How much downtime do we incur if we need to "restart the system"?
  • What sort of activities require such a "restart" (e.g., creating a new queue)?
  • Do large numbers of short jobs (about 1 minute) cause problems?
  • Please provide additional information about support for virtualization, and any plans for future enhancements to that support.
  • What would be involved in adding support for Mac OS X or Windows? Would that be considered normal "support"?
  • Where are the bottlenecks in the system likely to occur? For example, can a user stress the system by repeatedly and frequently querying job stats?

Original Wishlist

User Group 1
  • Automatic job preemption/suspend/resume?
  • Support for multiple levels of job preemption (e.g., a 3-queue hierarchy)?
  • Job environment propagation (including limits like "stacksize")?
  • Subgroup-specific priority calculation (queue-specific priority formula)?
  • Capability to delegate subgroup administration privileges (adjust job priorities, suspend, resume, kill) to subgroup administrators?
  • Cross-queue fairshare (with CPU-speed weighting)? (See the sketch below this list.)
  • CPU advanced reservations for MPI?
  • GPU support?
  • Ability to submit jobs to hosts where we don't have accounts/home-directories?
  • Avoid bad behavior when the MPI head node reboots (currently, slave-node processes get "forgotten")?
  • How well does the system scale?
    • Number of cores, queues, queued and running jobs?
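
The following Python sketch illustrates what is meant by cross-queue fairshare with CPU-speed weighting: usage is accumulated across all queues, weighted by the relative speed of the host each job ran on, and a user's priority falls as weighted usage grows. The speed factors and the priority formula are illustrative assumptions, not any scheduler's actual implementation.

    # Illustrative only: speed factors, usage accounting, and the priority
    # formula are assumptions chosen to show the idea.
    CPU_SPEED_FACTOR = {"fast-node": 2.0, "slow-node": 1.0}  # normalized host speeds

    def weighted_usage(finished_jobs):
        """Accumulate CPU seconds weighted by host speed, across all queues."""
        usage = {}
        for user, host, cpu_seconds in finished_jobs:
            usage[user] = usage.get(user, 0.0) + cpu_seconds * CPU_SPEED_FACTOR[host]
        return usage

    def dynamic_priority(share, usage, epsilon=1.0):
        """Users with a large share and little recent weighted usage come first."""
        return share / (usage + epsilon)

    usage = weighted_usage([("alice", "fast-node", 3600), ("bob", "slow-node", 3600)])
    for user in ("alice", "bob"):
        print(user, dynamic_priority(share=0.5, usage=usage[user]))
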
User Group 2
  • Please list supported operating systems (for submission hosts and for execution hosts)
  • Explicit resource specification using site-specific names (e.g., rhel5-64, amount of /scratch space, etc.) at job submission time?
  • Is there an API for a "time remaining" query (to save state near the end of a job)? (See the sketch below this list.)
  • Do submission / management hosts require a license?
    • Ability to submit/monitor jobs from any machine at SLAC (not just those with licenses, as is the case with LSF)
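
As an illustration of the "time remaining" item above, the following sketch shows how a job could use such a query to write a checkpoint before its wall-clock limit is reached. The query function here is a placeholder that only watches the local clock against an assumed limit; a real job would call whatever API the batch system provides.

    import time

    CHECKPOINT_MARGIN = 5   # seconds reserved for the checkpoint (demo value)
    WALL_CLOCK_LIMIT = 10   # assumed limit, used only by the placeholder (demo value)
    START = time.time()

    def query_time_remaining():
        # Placeholder: a real implementation would ask the batch system,
        # not the local clock.
        return WALL_CLOCK_LIMIT - (time.time() - START)

    def run(work_step, save_checkpoint):
        while True:
            if query_time_remaining() < CHECKPOINT_MARGIN:
                save_checkpoint()   # persist state so a follow-on job can resume
                break
            work_step()

    run(lambda: time.sleep(1), lambda: print("checkpoint written"))
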
User Group 3
  • Resource arbitration (i.e., the ability to easily control the number of running jobs to prevent overloading of disks and other resources, such as memory)
  • Reliable notification to a job when CPU time or other resources are about to be exhausted
  • Built-in support for virtualization?
    • Ability to suspend/resume/move running jobs (MPI and single-core)
  • API (preferably REST/XML/JSON) for submitting, monitoring, and controlling jobs, in addition to command-line control (see the sketch below this list)
  • Can resource allocation within a subgroup be delegated to subgroup administrators?
    • e.g. today I want 90% of resources dedicated to one subgroup within my allocation, tomorrow only 10%
  • Hooks for adding extensions (e.g., Kerberos ticket or AFS token support) without introducing maintenance problems
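
To make the REST/JSON item above concrete, here is a minimal sketch of the kind of interface the group has in mind. The base URL, field names, and token handling are hypothetical and do not describe any particular product's API.

    import requests

    BASE = "https://batch.example.slac.stanford.edu/api/v1"   # hypothetical endpoint
    HEADERS = {"Authorization": "Bearer <token>"}             # placeholder credential

    # Submit a job as a JSON document rather than via bsub/qsub on a licensed host.
    job = {
        "name": "analysis-step1",
        "command": "/path/to/run_analysis.sh",
        "cores": 8,
        "memory_mb": 16000,
        "queue": "medium",
    }
    resp = requests.post(f"{BASE}/jobs", json=job, headers=HEADERS, timeout=30)
    job_id = resp.json()["id"]

    # Monitor and control the same job through the same API.
    state = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS, timeout=30).json()
    print(state["status"])
    requests.delete(f"{BASE}/jobs/{job_id}", headers=HEADERS, timeout=30)  # kill the job
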
User Group 4
  • Knowledge of cluster topology to assign large jobs across nodes in an optimized way?
    • For example, an MPI job with 256 cores is best assigned to as few nodes as possible, and those nodes should span as few Infiniband/Ethernet switches as possible (see the sketch below this list).
  • Ability to request exclusive access to nodes (to support using a full node's memory on a subset of the node's cores)?
  • Ability to schedule using node properties and resources: memory, GPU cores, CPU cores, I/O connectivity, and other user-defined properties?
  • Job priority schemes that support coexistence of small and large jobs?
    • This likely includes issues of preemption, backfill, checkpointing, migration, and restarting. It also includes aggregation of single-core jobs onto nodes with other similar jobs.
  • Ability to partition the cluster / support multiple clusters (by host capabilities, such as IB-capable nodes, GPU nodes, and other per-node attributes)?
  • Ability to automatically enable preemption at certain times of day?
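
As an illustration of the topology item above, the following toy heuristic prefers the switch with the most free cores (so a job crosses as few switches as possible) and, within a switch, the nodes with the most free cores (so it lands on as few nodes as possible). It is meant only to show the desired behavior, not any scheduler's actual algorithm.

    def place_job(cores_needed, switches):
        """switches: {switch: {node: free_cores}}. Returns {node: cores} or None."""
        placement = {}
        # Prefer switches with the most free cores so the job crosses fewer switches.
        for switch, nodes in sorted(switches.items(),
                                    key=lambda kv: -sum(kv[1].values())):
            # Within a switch, prefer the nodes with the most free cores.
            for node, free in sorted(nodes.items(), key=lambda kv: -kv[1]):
                if cores_needed == 0:
                    break
                take = min(free, cores_needed)
                if take > 0:
                    placement[node] = take
                    cores_needed -= take
            if cores_needed == 0:
                return placement
        return None   # not enough free cores anywhere

    topology = {"ib-switch-1": {"node01": 64, "node02": 64, "node03": 64},
                "ib-switch-2": {"node04": 64, "node05": 32}}
    print(place_job(256, topology))   # 4 nodes across 2 switches in this toy example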

User Group 5

  • Is the batch system supported by Open Science Grid?

User Group 6

  • Software should be installed via RPM or similar, both to reduce reliance on a shared file system and to simplify the system installation process.
  • Per-node configuration should be fairly simple and easily manageable via a configuration management tool (e.g., Chef).
  • Servers must support semi-high-availability
    • At least as much as our existing LSF setup (at a minimum, 2 redundant batch-manager nodes).
  • Can there be multiple simultaneous sources of authorization information that are applied in a well-defined order?
    • E.g., NIS, LDAP, or flat files as sources, where a host-local file can override central LDAP (see the sketch below this list).
  • Where possible, we want to be able to perform meaningful queries regarding current queue and job states and get useful, well-formatted reports back.
    • This is true both for command-line operations *and* web operations.
  • From the Unix side, interaction should occur over well-documented and consistent ports (so that we can handle firewalls properly).
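
As a concrete illustration of the ordered authorization sources item above, the following sketch consults a host-local file before central LDAP, so a local entry overrides the central one. The lookup functions are stand-ins for real NIS/LDAP/flat-file lookups.

    # The dictionaries below stand in for a host-local flat file and a central
    # LDAP directory; real lookups would query those services instead.
    def lookup_local_file(user):
        local = {"opsadmin": {"allowed_queues": ["all"]}}   # host-local override
        return local.get(user)

    def lookup_ldap(user):
        central = {"alice": {"allowed_queues": ["short", "medium"]},
                   "opsadmin": {"allowed_queues": ["short"]}}
        return central.get(user)

    # The order of this list defines which source wins.
    SOURCES = [lookup_local_file, lookup_ldap]

    def authorize(user):
        for source in SOURCES:
            entry = source(user)
            if entry is not None:
                return entry
        return None   # unknown user

    print(authorize("opsadmin"))   # the host-local file wins
    print(authorize("alice"))      # falls through to central LDAP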

User Group 7

  • Integration with AFS (i.e., it must be possible for single-node and MPI jobs to acquire the user's Kerberos and AFS identities)?
    • Does the system (at a minimum) provide hooks to implement such a feature?
  • Real-time and historical monitoring, accounting, and analytics (performance, utilization, resource consumption, etc.) available through a web interface, the command line, and other APIs (e.g., for embedding in other web applications).
    • It is fine if this is an add-on capability
  • Short list of resource allocation / scheduling schemes (such as fairshare, backfill, etc.) with brief explanations
  • Job submission and management through a web portal?
  • Duration and limitations of trial licenses?

User Group 8

  • How are licenses managed (i.e., is a license server required, and if so, does it allow for redundancy)?
  • Ability to schedule jobs based on file server load?