Red Hat Discussion
Resource Management Guide (RHEL 7)
SLAC Discussion
To support a shared DRP: one process that understands all the resources in use and prevents conflicts. Two possibilities:
...
What Resources are Managed/Shared
Resource Dependency Graph
Introductory slides available here: ppt pdf
Dynamic Resource Allocation Discussion
Aug. 17, 2021 with caf, valmar, cpo, weaver, snelson
Simple-minded picture:
- (straw-man big idea) cnf-authors (pcds-pocs and data-systems) write a .cnf file but leave node-name and pgp-lane(s) as "templated" parameters (see the sketch after this list)
- should we have a lookup that takes the cnf-author-supplied trigger rate and detector type (opal, epix) and computes the number of lanes? i.e. the number of pgp lanes would be computed rather than user-supplied (the alternative is a "sanity check" on trigger rate vs. number of lanes/nodes). in principle the resources needed depend on computing/bandwidth, which makes this more complex.
- for simplicity: we could have detector types with the trigger rate/computing resources built into the detector-type name, e.g. epix10k2M_120Hz, epix10k2M_20kHz, epix10k2M_120Hz_high_computing_load. to start this would only be a "shorthand" for allocating resources: it wouldn't enforce consistency with the timing-system-configured trigger rate, but in principle that could be done in the future (a ts_120Hz category could be a first attempt at enforcing consistency).
- for ami/shmem the cnf-author would specify number of nodes, since there aren't specific patterns like there are with detectors
- this number could be determined by the cnf-author with a "show resources" command
- we should provide guidance for what ami resources are needed for a particular analysis (this is complex since scientists can do anything)
- some node-names could be hard-coded (e.g. control.py)
- consider this for user-shmem, so it doesn't move too often. could have two different timescales for filling in cnf templated node-names: a "slow" timescale for user-shmem nodes (e.g. once per experiment) and a "fast" timescale for detector nodes (e.g. every daq restart)
- could be useful for debugging hardware-specific problems
- need a mechanism to indicate if this cnf line is opal/control/ami/etc.
- some .cnf files have loops and associated lists of nodes (e.g. for ami), which adds complexity. maybe the resource manager could understand the dictionary that these loops generate
- some chance that with resource management we could eliminate the loops from the templated-cnf (the "loops" would be generated by the resource manager)
- how do scientists know where their shmem is running?
- we can provide a tool that tells the scientists where it is running, but we can't change it very often
- maybe we need a mechanism to "pin" some resources (these shouldn't be templated)
- consider jinja for handling templated parameters?
- dynamic allocation would require control of the BOS
- for the camlink-converter the BOS needs to manage both the timing and data fibers
- all other detectors just need the data fibers
- need to fix any serious firmware glitches that would happen when fibers are moved (hard)
- APIs: REST, TL1, SNMP (a schematic REST example appears after this list)
- feels like resources would be managed via a database (see the bookkeeping sketch under "Possible Workflow" below)
- for each node the database would have a "detector type" and the number of lanes that are free
- question: with low-rate detectors we could squeeze many into one node (e.g. 120Hz epics/timing-system/pvadetectors)
- (heuristic: not precise) perhaps cnf-authors (e.g. pcds-poc, data-systems) would indicate "exclusive access", "as many as you like", or "somewhere in the middle". Maybe just a number-of-lanes to reserve? this is error-prone, e.g. if the scientists change the trigger rate
- to help avoid resource over-utilization would be good if power-on defaults (e.g. for trigger rates) are conservative
- can we do a sanity check?
- currently the .cnf files are wasteful of resources: people leave detectors in the .cnf that they do not use. make it easier to comment-out items in the cnf? a big change ("selecting detectors before start of cnf") but maybe still worth doing.
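
To make the templated-cnf idea concrete, here is a minimal sketch using jinja (as floated above). The template text, the LANES_BY_TYPE lookup, and the drp command-line flags shown are illustrative assumptions, not the real .cnf syntax:

    from jinja2 import Template

    # hypothetical lookup: detector-type shorthand (trigger rate baked into
    # the name, as proposed above) -> number of pgp lanes; values invented
    LANES_BY_TYPE = {
        "epix10k2M_120Hz": 4,
        "epix10k2M_20kHz": 16,
        "opal_120Hz": 1,
    }

    # a cnf line with node-name and lane count left as templated parameters;
    # the flags shown are schematic, not the real drp options
    DRP_LINE = Template("drp -N {{ node }} -l {{ lanes }}  # {{ det_type }}")

    def render_drp_line(det_type, node):
        # fill in the node-name chosen by the resource manager and the
        # lane count computed from the detector type
        return DRP_LINE.render(node=node,
                               lanes=LANES_BY_TYPE[det_type],
                               det_type=det_type)

    print(render_drp_line("epix10k2M_120Hz", "drp-neh-cmp012"))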
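For moving BOS fibers programmatically, a schematic REST call might look like the sketch below; the endpoint, payload, and port numbers are invented placeholders, since the choice of switch API (REST vs. TL1 vs. SNMP) is still open:

    import requests

    def connect_fiber(switch_url, in_port, out_port):
        # invented endpoint and payload; a real switch documents its own API
        resp = requests.post(switch_url + "/crossconnect",
                             json={"in": in_port, "out": out_port},
                             timeout=10)
        resp.raise_for_status()

    # e.g. route a detector's data fiber to a free drp node's port:
    # connect_fiber("http://bos-switch:8080", 12, 34)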
Possible Workflow
- at the beginning of the shift the experiment runs "resource_manager_alloc tmo_template.cnf > tmo.cnf"
- this would ideally move all the BOS fibers
- at the end of the shift: "resource_manager_dealloc tmo.cnf". show who is responsible for conflicts.
- provide a "resource_manager_kill" command so that someone can seize control if they need it.
- provide a "resource_manager_list" command to show available/allocated resources
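
A minimal sketch of the bookkeeping behind these commands, assuming the per-node database mentioned above is a small JSON file mapping node to detector type, free lanes, and owner; all names and the file format are hypothetical:

    import json

    DB_PATH = "resource_db.json"
    # e.g. {"drp-neh-cmp012": {"det_type": "epix10k2M_120Hz",
    #                          "free_lanes": 4, "owner": null}}

    def alloc(node, lanes, owner):
        with open(DB_PATH) as f:
            db = json.load(f)
        rec = db[node]
        if rec["free_lanes"] < lanes:
            # surface who is responsible for the conflict
            raise RuntimeError("%s: only %d lanes free (held by %s)"
                               % (node, rec["free_lanes"], rec["owner"]))
        rec["free_lanes"] -= lanes
        rec["owner"] = owner
        with open(DB_PATH, "w") as f:
            json.dump(db, f, indent=2)

    def dealloc(node, lanes):
        with open(DB_PATH) as f:
            db = json.load(f)
        db[node]["free_lanes"] += lanes
        db[node]["owner"] = None
        with open(DB_PATH, "w") as f:
            json.dump(db, f, indent=2)

    def list_resources():
        with open(DB_PATH) as f:
            db = json.load(f)
        for node, rec in sorted(db.items()):
            print("%-20s %-26s free_lanes=%d owner=%s"
                  % (node, rec["det_type"], rec["free_lanes"], rec["owner"]))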