
Goal: support a shared DRP. There should be one process that understands all the resources that are in use and prevents conflicts.  Two possibilities:

  • make the collection software long-lived and multi-platform
    • cpo currently leaning this way: "fewer moving parts"
    • suggested naming: "collection"
  • add a layer on top of control.py (merged collection/control code)

Requirements

  • resources managed:  node/kcu-variant/pgplane/cores
  • a service:  two cnf files can request the same resources, but one would be prevented from starting if another cnf already has them allocated.
  • make sure the right process ("opal") runs on a node with the right firmware
  • allocation request is made before the processes are launched
    • maybe a new "resource check" transition in the state machine before processes are launched?
    • implementation possibility: parse the cnf file to determine node/kcu-variant/pgplane/cores
      • cnf: "cmp001: drp -l 0x5 /dev/datadev_0 -D opal" would be translated into a resource request (a "cores" field would need to be added; see the parsing sketch after this list)
      • we could tweak the cnf syntax to make this easier, but cpo believes we should leave procmgr roughly as-is
      • procmgr is a static configuration; if we use it, the resource manager would essentially be checking for conflicts rather than doing dynamic allocation
  • track which resources are:
    • allocated
      • we don't want to start the process until we know there is a node/pgplane for it; this is an ordering problem.
    • deallocated
    • crashed
  • killPartition
  • try to keep the idea of "resource" generic.  today: node/kcu/pgplane but could add more resources in the future (ram, network I/O).  like slurm.
  • there would be limits on node resources, e.g. 2 kcu's, 8 pgplanes, 64 cores, that would be enforced by the resource manager (see the allocation sketch below)
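A minimal sketch of the cnf-to-resource translation, in Python. ResourceRequest, the regexes, and the "cores" field are illustrative assumptions rather than existing code; today's cnf carries no core count, so "cores" is shown as the proposed extension.

    # Hypothetical sketch: translate a cnf-style line into a resource request.
    import re
    from dataclasses import dataclass, field

    @dataclass
    class ResourceRequest:
        node: str                                 # e.g. "cmp001"
        kcu: int                                  # KCU index, from /dev/datadev_N
        lanes: set = field(default_factory=set)   # PGP lanes decoded from the -l mask
        detector: str = ""                        # e.g. "opal", from -D
        cores: int = 1                            # assumed extension to the cnf syntax

    def parse_cnf_line(line: str) -> ResourceRequest:
        """Translate e.g. 'cmp001: drp -l 0x5 /dev/datadev_0 -D opal'."""
        node, cmd = (part.strip() for part in line.split(":", 1))
        lane_mask = int(re.search(r"-l\s+(\S+)", cmd).group(1), 0)
        lanes = {bit for bit in range(8) if lane_mask & (1 << bit)}
        kcu = int(re.search(r"/dev/datadev_(\d+)", cmd).group(1))
        det = re.search(r"-D\s+(\S+)", cmd)
        return ResourceRequest(node=node, kcu=kcu, lanes=lanes,
                               detector=det.group(1) if det else "")

    # parse_cnf_line("cmp001: drp -l 0x5 /dev/datadev_0 -D opal") yields
    # ResourceRequest(node='cmp001', kcu=0, lanes={0, 2}, detector='opal', cores=1)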
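And a sketch of the conflict-checking service itself, assuming the ResourceRequest from the parsing sketch. The per-node limits (2 kcu's, 8 pgplanes, 64 cores) come from the bullet above, but the ResourceManager API is illustrative, not an existing interface.

    # Hypothetical sketch: reserve resources before processes are launched (the
    # proposed "resource check" transition) and refuse conflicting requests.
    class ResourceManager:
        NODE_LIMITS = {"kcus": 2, "pgplanes": 8, "cores": 64}   # per-node limits

        def __init__(self):
            # node -> {"lanes": {(kcu, lane): owner}, "cores": {owner: count}}
            self._alloc = {}

        def allocate(self, owner: str, req) -> bool:
            """Return False on any conflict; the owner's processes should not start."""
            node = self._alloc.setdefault(req.node, {"lanes": {}, "cores": {}})
            if req.kcu >= self.NODE_LIMITS["kcus"]:
                return False
            if any(lane >= self.NODE_LIMITS["pgplanes"] for lane in req.lanes):
                return False
            wanted = {(req.kcu, lane) for lane in req.lanes}
            if wanted & set(node["lanes"]):          # pgplane already allocated
                return False
            if sum(node["cores"].values()) + req.cores > self.NODE_LIMITS["cores"]:
                return False
            for key in wanted:
                node["lanes"][key] = owner
            node["cores"][owner] = node["cores"].get(owner, 0) + req.cores
            return True

        def release(self, owner: str):
            """Deallocate everything an owner holds (shutdown, killPartition, or crash)."""
            for node in self._alloc.values():
                node["lanes"] = {k: v for k, v in node["lanes"].items() if v != owner}
                node["cores"].pop(owner, None)

    # mgr = ResourceManager()
    # ok = mgr.allocate("tmo/platform4", parse_cnf_line("cmp001: drp -l 0x5 /dev/datadev_0 -D opal"))
    # A second cnf asking for the same pgplanes on cmp001 would get False and be
    # prevented from starting.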

"Risk Management" thought: if the sharing doesn't work we can buy DRP nodes and fibers to keep .cnf files orthogonal (obviates the need for the BOS).  Expensive and harder to maintain.

Considerations:

  • BOS (dynamically changes connections of devices of the same type (fim, opal, hsd))
    • ideal world: rix asks "I want 3 hsd nodes", tmo asks "I want 2 hsd nodes", and the resource management software (sketched after this list) would:
      • configure the bos
      • dynamically allocate the nodes (not hardwired in the cnf)
      • adds complexity.  cpo gut-feeling: design so as not to rule this out, but defer the implementation (later "adiabatic" development).
        • could start by "error checking" bos settings
        • need more experience with bos
  • DRP Nodes
  • PGP Lanes
  • procmgr ports ("platforms").  don't want two people using platform 2.
  • readout groups (only matter within a single XPM, maybe doesn't matter with per-hutch XPM idea)
  • special network connections:
    • xtcav
    • bld (cmp001, cmp014).  currently not special, but that may change in the future (1MHz needs a clean network)
    • connection to ATCA timing system (ctl001).  needed to run all pyxpm
  • which firmware runs in which node
  • tprtrig is not an allocatable device because it has hardwired connections to particular detectors; it is more part of the detector.  maybe it should have an ioc to control it? there might be many ioc's (built into the detector ioc) controlling it, like the Andor.
  • front-end machines (e.g. daq-tmo-hsd-01, daq-tmo-andor-01) would NOT be a managed resource (not shared drp).  BOS does need to know about them.
  • ffb connections? drp-neh-ctl002 doesn't have an FFB, but is part of the shared drp.
  • pgp remoteId information could be useful
  • two-case complexity: e.g. wave8's from different hutches can trivially share a drp node (as long as they use different pgplanes), but opal's cannot (so currently all lanes would need to be allocated; see the sketch after this list)
  • need to think about 2 KCU case
  • are there other resources that need to be managed: ram, network I/O, cpu.  similar to slurm.  could we use slurm?
  • when we start up the drp executable, doesn't it use all the cores on the node?  we could be oversubscribing cpu's with multiple processes.
  • right now collection code is merged together with state machine code in control.py.  we need to refactor this into two pieces: short-lived control, and long-lived multi-platform collection.
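A rough sketch of the "ideal world" request and of the wave8/opal sharing distinction above, in Python. SHAREABLE, request_nodes, and the policy names are assumptions for illustration, not an existing interface.

    # Hypothetical sketch: pick nodes dynamically instead of hard-wiring them in
    # the cnf; the BOS would then be configured to route fibers to the chosen nodes.
    SHAREABLE = {"wave8": True, "opal": False}   # can this detector type share a KCU?

    def request_nodes(hutch: str, detector: str, count: int, free_nodes: list) -> list:
        """e.g. request_nodes('rix', 'hsd', 3, free_nodes)."""
        if len(free_nodes) < count:
            raise RuntimeError(f"not enough free nodes for {hutch}/{detector}")
        # Devices that cannot share a card (e.g. opal) get every pgplane on the
        # node reserved, so another hutch cannot land on the same KCU; per-lane
        # devices (e.g. wave8) can coexist as long as the lanes differ.
        policy = "per-lane" if SHAREABLE.get(detector, False) else "whole-card"
        return [(node, policy) for node in free_nodes[:count]]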

Thoughts:

  • could we get rid of the idea of platforms?
    • an example of how one could do that:  could statically allocate ports for each 

What Resources are Managed/Shared
