Red Hat Discussion

Resource Management Guide (RHEL 7)

SLAC Discussion

Goal: support a shared DRP with one process that understands all the resources in use and prevents conflicts.  Two possibilities:

  • make the collection software long-lived and multi-platform, so it learns about multiple partitions/platforms
    • cpo currently leaning this way: "fewer moving parts"
    • suggested naming: "collection"
  • add a layer on top of control.py (the merged collection/control code)

Requirements

  • resources managed:  node/kcu-variant/pgplane/cores
  • a service: two cnf files can request the same node/kcu/pgplane resources, but the second would be prevented from starting if another cnf already has them allocated.
  • make sure the right process (e.g. "opal") runs on a node with the right firmware
  • allocation request is made before the processes are launched
    • maybe a new "resource check" transition in the state machine before processes are launched?
    • implementation possibility: parse the cnf file to determine node/kcu-variant/pgplane/cores
      • cnf: "cmp001: drp -l 0x5 /dev/datadev_0 -D opal" would be translated to the resource request (need to add "cores"); see the sketch after this list
      • we could tweak the cnf syntax to make this easier, but cpo believes we should leave procmgr roughly as-is
      • procmgr is a static configuration; if we use it, the resource manager would essentially be checking for conflicts rather than doing dynamic allocation
  • track which nodes/kcu's/pgplanes are:
    • allocated
      • don't want to start a process until we know there is a node/pgplane for it.  an ordering problem.
    • deallocated
    • crashed
    • killPartition
  • try to keep the idea of "resource" generic.  today: node/kcu/pgplane, but more resources could be added in the future (ram, network I/O).  like slurm.
  • there would be limits on node resources, e.g. 2 kcu's, 8 pgplanes, 64 cores, that would be enforced by the resource manager.
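
A minimal sketch of the "parse the cnf line into a resource request" idea above, in Python.  It assumes the single-line form quoted in the list rather than real procmgr .cnf syntax, and the helper names (resource_request, lanes_from_mask) plus the default cores value are placeholders:

    import re

    def lanes_from_mask(mask):
        """Expand a PGP lane mask (e.g. 0x5) into lane numbers [0, 2]."""
        return [bit for bit in range(8) if mask & (1 << bit)]

    def resource_request(cnf_line, cores=4):
        """Translate 'cmp001: drp -l 0x5 /dev/datadev_0 -D opal' into a generic
        resource-request dict.  'cores' is a placeholder, since the cnf does not
        carry it today (the "need to add cores" point above)."""
        node, cmd = [s.strip() for s in cnf_line.split(':', 1)]
        lane_mask = int(re.search(r'-l\s+(\S+)', cmd).group(1), 0)
        detector  = re.search(r'-D\s+(\S+)', cmd).group(1)
        kcu       = re.search(r'(/dev/datadev_\d+)', cmd).group(1)
        return {'node': node,                           # e.g. cmp001
                'kcu': kcu,                             # kcu-variant implied by firmware/detector
                'pgplanes': lanes_from_mask(lane_mask),
                'cores': cores,
                'detector': detector}

    print(resource_request('cmp001: drp -l 0x5 /dev/datadev_0 -D opal'))
    # {'node': 'cmp001', 'kcu': '/dev/datadev_0', 'pgplanes': [0, 2], 'cores': 4, 'detector': 'opal'}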

"Risk Management" thought: if the sharing doesn't work we can buy DRP nodes and fibers to keep .cnf files orthogonal (obviates the need for the BOS).  Expensive and harder to maintain.

...

  • BOS (dynamically changes connections of devices of the same type (fim, opal, hsd))
    • ideal world: rix says "I want 3 hsd nodes", tmo says "I want 2 hsd nodes", and the resource management software would:
      • configure the bos
      • dynamically allocate the nodes (not hardwired in the cnf)
      • adds complexity.  cpo's gut feeling: design so as not to rule this out, but defer the implementation (later "adiabatic" development).
        • could start by "error checking" bos settings
        • need more experience with bos
  • DRP Nodes
  • PGP Lanes
  • procmgr ports ("platforms").  don't want two people using platform 2.
  • readout groups (only matter within a single XPM, maybe doesn't matter with per-hutch XPM idea)
  • special network connections:
    • xtcav
    • bld (cmp001, cmp014).  currently not special, but maybe not true in the future (1MHz needs a clean network)
    • connection to ATCA timing system (ctl001).  needed to run all pyxpm
  • which firmware runs in which node
  • tprtrig is not an allocatable device because it has hardwired connections to particular detectors; it is more part of the detector.  maybe it should have an ioc to control it?  there might be many ioc's (built into the detector ioc) controlling it, like the Andor.
  • front-end machines (e.g. daq-tmo-hsd-01, daq-tmo-andor-01) would NOT be a managed resource (not shared drp).  BOS does need to know about them.
  • ffb connections? drp-neh-ctl002 doesn't have an FFB, but is part of the shared drp.
  • pgp remoteId information could be useful
  • two-case complexity: e.g. wave8's from different hutches can trivially share a drp node (as long as they use different pgplanes), but opal's cannot (so currently one would need to allocate all lanes); see the allocation sketch after this list
  • need to think about the 2-KCU case
  • are there other resources that need to be managed: ram, network I/O, cpu.  similar to slurm.  could we use slurm?
  • when we start up the drp executable, doesn't it use all the cores on the node?  we could be oversubscribing cpu's with multiple processes.
  • right now collection code is merged together with state machine code in control.py.  we need to refactor this into two pieces: short-lived control, and long-lived multi-platform collection.
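
A sketch of how the long-lived resource manager could track these resources and reject conflicting allocations.  The class, its bookkeeping, and the partition labels are assumptions; the per-node limits are the ones quoted in the requirements above (only the core limit is checked in this sketch), and the request dicts have the same shape as the resource_request() sketch earlier:

    # hypothetical in-memory inventory; the real service might live in the
    # long-lived collection process or behind a database
    NODE_LIMITS = {'kcus': 2, 'pgplanes': 8, 'cores': 64}

    class ResourceManager:
        def __init__(self, nodes):
            # per node: (kcu, lane) -> owning partition, and partition -> cores in use
            self.lanes = {n: {} for n in nodes}
            self.cores = {n: {} for n in nodes}

        def allocate(self, partition, req):
            """req: dict with node, kcu, pgplanes, cores.  Raises on conflict."""
            node = req['node']
            wanted = [(req['kcu'], lane) for lane in req['pgplanes']]
            for key in wanted:
                owner = self.lanes[node].get(key)
                if owner is not None:
                    raise RuntimeError(f"{node} {key} already allocated to '{owner}'")
            if sum(self.cores[node].values()) + req['cores'] > NODE_LIMITS['cores']:
                raise RuntimeError(f"core limit exceeded on {node}")
            for key in wanted:
                self.lanes[node][key] = partition
            self.cores[node][partition] = self.cores[node].get(partition, 0) + req['cores']

        def deallocate(self, partition, node):
            # also the natural hook for crashed processes and killPartition
            self.lanes[node] = {k: v for k, v in self.lanes[node].items() if v != partition}
            self.cores[node].pop(partition, None)

For example, the wave8/opal node-sharing case above would look like:

    rm = ResourceManager(['cmp001', 'cmp014'])
    rm.allocate('tmo/2', {'node': 'cmp001', 'kcu': '/dev/datadev_0', 'pgplanes': [0, 2], 'cores': 4})
    rm.allocate('rix/3', {'node': 'cmp001', 'kcu': '/dev/datadev_0', 'pgplanes': [1], 'cores': 4})  # ok: free lane
    rm.allocate('rix/3', {'node': 'cmp001', 'kcu': '/dev/datadev_0', 'pgplanes': [2], 'cores': 4})  # raises: lane 2 held by 'tmo/2'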

Thoughts:

  • could we get rid of the idea of platforms?
    • an example of how one could do that:  could statically allocate ports for each 

What Resources are Managed/Shared

Resource Dependency Graph

Introductory slides available here:  ppt  pdf

Dynamic Resource Allocation Discussion

Aug. 17, 2021 with caf, valmar, cpo, weaver, snelson

Simple-minded picture:

  • (straw man big-idea) cnf-authors (pcds-poc's and data-systems) write a .cnf file but leave node-name and pgp-lane(s) as "templated" parameters
    • should we have a lookup that takes cnf-author-supplied trigger-rate and detector type (opal, epix) and computes a number of lanes?  i.e. number-of-pgp-lanes not user-supplied, but computed (alternative is a "sanity check" on trigger-rate and number of lanes/nodes).  in principle resources needed depend on computing/bandwidth, which makes it more complex.
    • for simplicity: we could have detector types which have trigger-rate/computing-resources built into the detector-type name, e.g. epix10k2M_120Hz, epix10k2M_20kHz, epix10k2M_120Hz_high_computing_load.  to start, only a "shorthand" for allocating resource: wouldn't enforce consistency with timing-system-configured-trigger-rate, but in principle could do that in the future (could use ts_120Hz category as a first attempt to enforce consistency).
    • for ami/shmem the cnf-author would specify number of nodes, since there aren't specific patterns like there are with detectors
      • this number could be determined by the cnf-author with a "show resources" command
      • we should provide guidance for what ami resources are needed for a particular analysis (this is complex since scientists can do anything)
  • some node-names could be hard-coded (e.g. control.py)
    • consider this for user-shmem, so it doesn't move too often.  could have two different timescales for filling in cnf templated node-names: a "slow" timescale for user-shmem nodes (e.g. once per experiment) and a "fast" timescale for detector nodes (e.g. every daq restart)
    • could be useful for debugging hardware-specific problems
  • need a mechanism to indicate if this cnf line is opal/control/ami/etc.
    • some .cnf files have loops and associated lists of nodes (e.g. for ami) which adds complexity.  maybe could understand the dictionary that these loops generate
    • some chance that with resource management we could eliminate the loops for the templated-cnf (the "loops" would be generated by resource manager)
  • how do scientists know where their shmem is running?
    • we can provide a tool that tells the scientists where it is running, but we can't change it very often
    • maybe we need a mechanism to "pin" some resources (they shouldn't be templated)
  • consider jinja for handling templated parameters?  (see the sketch after this list)
  • dynamic allocation would require control of the BOS 
    • for the camlink-converter the BOS needs to manage both the timing/data fibers
    • all other detectors just need the data fibers
    • need to fix any serious firmware glitches that would happen when fibers are moved (hard)
    • api's: REST, T1 or TL1, SNMP
  • feels like resources would be managed via a database
    • for each node the database would have a "detector type" and the number of lanes that are free
  • question: with low-rate detectors we could squeeze many into one node (e.g. 120Hz epics/timing-system/pvadetectors)
    • (heuristic: not precise) perhaps cnf-authors (e.g. pcds-poc, data-systems) would indicate "exclusive access", "as many as you like", or "somewhere in the middle".  Maybe just a number-of-lanes to reserve?  this is error prone, e.g. if the scientists change the trigger rate
    • to help avoid resource over-utilization, it would be good if power-on defaults (e.g. for trigger rates) are conservative
    • can we do a sanity check?
  • currently the .cnf files are wasteful of resources: people leave detectors in the .cnf that they do not use.  make it easier to comment-out items in the cnf?  a big change ("selecting detectors before start of cnf") but maybe still worth doing.
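
A minimal sketch of the templated-cnf idea using jinja2 (suggested above).  The template variables, the detector-type table and its lane counts, and the allocator callback are all assumptions; the detector-type names follow the "rate built into the name" convention proposed above:

    from jinja2 import Template

    # hypothetical lookup: detector-type name encodes trigger rate / computing needs
    DETECTOR_TYPES = {
        'opal_120Hz':      {'nlanes': 1},
        'epix10k2M_20kHz': {'nlanes': 4},
    }

    CNF_TEMPLATE = Template(
        "{{ node }}: drp -l {{ lane_mask }} /dev/datadev_0 -D {{ detector }}"
    )

    def fill_template(detector, dettype, allocator):
        """Ask a (hypothetical) allocator for a node and free lanes, then render the line."""
        nlanes = DETECTOR_TYPES[dettype]['nlanes']
        node, lanes = allocator(nlanes)                  # e.g. ('cmp001', [0, 2])
        mask = sum(1 << lane for lane in lanes)
        return CNF_TEMPLATE.render(node=node, lane_mask='0x%x' % mask, detector=detector)

    # trivial allocator that always hands out cmp001, lanes 0..n-1
    print(fill_template('opal', 'opal_120Hz', lambda n: ('cmp001', list(range(n)))))
    # -> cmp001: drp -l 0x1 /dev/datadev_0 -D opal

In a real "resource_manager_alloc" the allocator would come from the resource manager (and would also drive the BOS); here it is just a stand-in.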

Possible Workflow

  • at the beginning of the shift the experiment runs "resource_manager_alloc tmo_template.cnf > tmo.cnf" (a CLI skeleton for these commands is sketched after this list)
    • this would ideally move all the BOS fibers
  • at the end of the shift: "resource_manager_dealloc tmo.cnf".  show who is responsible for conflicts.
    • provide a "resource_manager_kill" command so that someone can seize control if they need it.
    • provide a "resource_manager_list" command to show available/allocated resources
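
The workflow commands above do not exist yet; a thin CLI skeleton (here folded into one "resource_manager" command with subcommands, an arbitrary choice) might look like the following.  The arguments and the wiring to the long-lived resource manager are assumptions:

    import argparse

    def main():
        parser = argparse.ArgumentParser(prog='resource_manager')
        sub = parser.add_subparsers(dest='cmd', required=True)

        alloc = sub.add_parser('alloc', help='fill a templated cnf and reserve its resources')
        alloc.add_argument('template')                        # e.g. tmo_template.cnf
        alloc.add_argument('-o', '--output', default='tmo.cnf')

        dealloc = sub.add_parser('dealloc', help='release the resources held by a cnf')
        dealloc.add_argument('cnf')

        sub.add_parser('list', help='show available/allocated resources and their owners')

        kill = sub.add_parser('kill', help='seize resources from another allocation')
        kill.add_argument('cnf')

        args = parser.parse_args()
        # a real tool would contact the long-lived collection/resource manager here
        print(f"would run '{args.cmd}' against the resource manager:", vars(args))

    if __name__ == '__main__':
        main()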