Red Hat Discussion
Resource Management Guide (RHEL 7)
SLAC Discussion
To support a shared DRP: one process that understands all the resources in use and prevents conflicts. Two possibilities:
...
What Resources are Managed/Shared
Resource Dependency Graph
Introductory slides available here: ppt pdf
Dynamic Resource Allocation Discussion
Aug. 17, 2021 with caf, valmar, cpo, weaver, snelson
Simple-minded picture:
- (straw-man big idea) cnf-authors (pcds-pocs and data-systems) write a .cnf file but leave node-name and pgp-lane(s) as "templated" parameters (see the sketch after this list)
- should we have a lookup that takes the cnf-author-supplied trigger rate and detector type (opal, epix) and computes the number of lanes? i.e. the number of pgp lanes would be computed rather than user-supplied (the alternative is a "sanity check" on trigger rate vs. number of lanes/nodes). in principle the resources needed depend on computing/bandwidth, which makes this more complex.
- for simplicity: we could have detector types with the trigger rate/computing resources built into the detector-type name, e.g. epix10k2M_120Hz, epix10k2M_20kHz, epix10k2M_120Hz_high_computing_load. to start this would only be a "shorthand" for allocating resources: it wouldn't enforce consistency with the timing-system-configured trigger rate, but in principle that could be done in the future (a ts_120Hz category could be a first attempt at enforcing consistency).
- for ami/shmem the cnf-author would specify number of nodes, since there aren't specific patterns like there are with detectors
- this number could be determined by the cnf-author with a "show resources" command
- we should provide guidance for what ami resources are needed for a particular analysis (this is complex since scientists can do anything)
- some node-names could be hard-coded (e.g. control.py)
- consider this for user-shmem, so it doesn't move too often. could have two different timescales for filling in cnf templated node-names: a "slow" timescale for user-shmem nodes (e.g. once per experiment) and a "fast" timescale for detector nodes (e.g. every daq restart)
- could be useful for debugging hardware-specific problems
- need a mechanism to indicate if this cnf line is opal/control/ami/etc.
- some .cnf files have loops and associated lists of nodes (e.g. for ami), which adds complexity. maybe the resource manager could understand the dictionary that these loops generate
- some chance that with resource management we could eliminate the loops from the templated-cnf (the "loops" would be generated by the resource manager)
- how do scientists know where their shmem is running?
- we can provide a tool that tells the scientists where it is running, but we can't change it very often
- maybe we need a mechanism to "pin" some resources (these shouldn't be templated)
- consider jinja for handling templated parameters?
- dynamic allocation would require control of the BOS
- for the camlink-converter the BOS needs to manage both the timing and data fibers
- all other detectors just need the data fibers
- need to fix any serious firmware glitches that would happen when fibers are moved (hard)
- APIs: REST, TL1, SNMP (a schematic REST example appears after this list)
- feels like resources would be managed via a database (see the bookkeeping sketch under "Possible Workflow" below)
- for each node the database would have a "detector type" and the number of lanes that are free
- question: with low-rate detectors we could squeeze many into one node (e.g. 120Hz epics/timing-system/pvadetectors)
- (heuristic: not precise) perhaps cnf-authors (e.g. pcds-poc, data-systems) would indicate "exclusive access", "as many as you like", or "somewhere in the middle". Maybe just a number-of-lanes to reserve? this is error-prone, e.g. if the scientists change the trigger rate
- to help avoid resource over-utilization would be good if power-on defaults (e.g. for trigger rates) are conservative
- can we do a sanity check?
- currently the .cnf files are wasteful of resources: people leave detectors in the .cnf that they do not use. make it easier to comment-out items in the cnf? a big change ("selecting detectors before start of cnf") but maybe still worth doing.
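
To make the templated-cnf idea concrete, here is a minimal sketch using jinja (as floated above). The template text, the LANES_BY_TYPE lookup, and the drp command-line flags shown are illustrative assumptions, not the real .cnf syntax:

    from jinja2 import Template

    # hypothetical lookup: detector-type shorthand (trigger rate baked into
    # the name, as proposed above) -> number of pgp lanes; values invented
    LANES_BY_TYPE = {
        "epix10k2M_120Hz": 4,
        "epix10k2M_20kHz": 16,
        "opal_120Hz": 1,
    }

    # a cnf line with node-name and lane count left as templated parameters;
    # the flags shown are schematic, not the real drp options
    DRP_LINE = Template("drp -N {{ node }} -l {{ lanes }}  # {{ det_type }}")

    def render_drp_line(det_type, node):
        # fill in the node-name chosen by the resource manager and the
        # lane count computed from the detector type
        return DRP_LINE.render(node=node,
                               lanes=LANES_BY_TYPE[det_type],
                               det_type=det_type)

    print(render_drp_line("epix10k2M_120Hz", "drp-neh-cmp012"))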
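For moving BOS fibers programmatically, a schematic REST call might look like the sketch below; the endpoint, payload, and port numbers are invented placeholders, since the choice of switch API (REST vs. TL1 vs. SNMP) is still open:

    import requests

    def connect_fiber(switch_url, in_port, out_port):
        # invented endpoint and payload; a real switch documents its own API
        resp = requests.post(switch_url + "/crossconnect",
                             json={"in": in_port, "out": out_port},
                             timeout=10)
        resp.raise_for_status()

    # e.g. route a detector's data fiber to a free drp node's port:
    # connect_fiber("http://bos-switch:8080", 12, 34)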
Possible Workflow
- at the beginning of the shift the experiment runs "resource_manager_alloc tmo_template.cnf > tmo.cnf"
- this would ideally move all the BOS fibers
- at the end of the shift: "resource_manager_dealloc tmo.cnf". show who is responsible for conflicts.
- provide a "resource_manager_kill" command so that someone can seize control if they need it.
- provide a "resource_manager_list" command to show available/allocated resources
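
A minimal sketch of the bookkeeping behind these commands, assuming the per-node database mentioned above is a small JSON file mapping node to detector type, free lanes, and owner; all names and the file format are hypothetical:

    import json

    DB_PATH = "resource_db.json"
    # e.g. {"drp-neh-cmp012": {"det_type": "epix10k2M_120Hz",
    #                          "free_lanes": 4, "owner": null}}

    def alloc(node, lanes, owner):
        with open(DB_PATH) as f:
            db = json.load(f)
        rec = db[node]
        if rec["free_lanes"] < lanes:
            # surface who is responsible for the conflict
            raise RuntimeError("%s: only %d lanes free (held by %s)"
                               % (node, rec["free_lanes"], rec["owner"]))
        rec["free_lanes"] -= lanes
        rec["owner"] = owner
        with open(DB_PATH, "w") as f:
            json.dump(db, f, indent=2)

    def dealloc(node, lanes):
        with open(DB_PATH) as f:
            db = json.load(f)
        db[node]["free_lanes"] += lanes
        db[node]["owner"] = None
        with open(DB_PATH, "w") as f:
            json.dump(db, f, indent=2)

    def list_resources():
        with open(DB_PATH) as f:
            db = json.load(f)
        for node, rec in sorted(db.items()):
            print("%-20s %-26s free_lanes=%d owner=%s"
                  % (node, rec["det_type"], rec["free_lanes"], rec["owner"]))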