...

  • make the collection software long-lived and multi-platform
    • cpo currently leaning this way: "fewer moving parts"
    • suggested naming: "collection"
  • add a layer on top of control.py (merged collection/control code)

...

  • resources managed:  node/kcu-variant/pgplane/cores
  • a service: two cnf files can name the same resources, but a cnf would be prevented from starting if another cnf already has them allocated.
  • make sure the right process (e.g. "opal") runs on a node with the right firmware
  • allocation request is made before the processes are launched
    • maybe a new "resource check" transition in the state machine before processes are launched?
    • implementation possibility: parse the cnf file to determine node/kcu-variant/pgplane/cores
      • cnf: "cmp001: drp -l 0x5 /dev/datadev_0 -D opal" would be translated to the resource request (need to add "cores")
      • could tweak the cnf syntax to make this easier, but cpo believes we should leave procmgr roughly as-is
      • procmgr is a static configuration; if we use it, the resource manager would essentially be checking for conflicts rather than doing dynamic allocation
  • which resources are
    • allocated
      • don't want to start the process until we know there is a node/pgplane for it: an ordering problem.
    • deallocated
    • crashed
  • killPartition
  • try to keep the idea of "resource" generic. today: node/kcu/pgplane, but more resources could be added in the future (ram, network I/O), as in slurm.
  • there would be limits on node resources (e.g. 2 kcus, 8 pgplanes, 64 cores) that would be enforced by the resource manager.
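
The conflict-checking idea above can be sketched in Python as follows. This is only an illustration under several assumptions, not a proposed implementation: the names `Request`, `parse_cnf_line`, `ResourceManager`, and `AllocationError` are hypothetical, `-l` is assumed to be a bitmask of pgplanes on the given `/dev/datadev_*` device, `-D` is assumed to name the kcu-variant, and "cores" is passed in separately because the cnf line does not carry it yet. The per-node limits are the example numbers from the notes.

```python
import shlex
from dataclasses import dataclass

# Example per-node limits taken from the notes; real values would differ.
NODE_LIMITS = {"kcus": 2, "pgplanes": 8, "cores": 64}


class AllocationError(RuntimeError):
    """Raised when a request conflicts with an existing allocation or a limit."""


@dataclass(frozen=True)
class Request:
    owner: str         # hypothetical: the cnf file asking for the resources
    node: str
    device: str        # e.g. /dev/datadev_0 (one per kcu, by assumption)
    variant: str       # assumed to come from -D, e.g. "opal"
    lanes: frozenset   # pgplane indices decoded from the -l mask (assumption)
    cores: int


def parse_cnf_line(line, owner, cores=1):
    """Translate 'node: drp -l MASK DEVICE -D VARIANT' into a Request."""
    node, _, command = line.partition(":")
    tokens = shlex.split(command)
    mask = int(tokens[tokens.index("-l") + 1], 0)
    variant = tokens[tokens.index("-D") + 1]
    device = next(t for t in tokens if t.startswith("/dev/"))
    lanes = frozenset(b for b in range(mask.bit_length()) if (mask >> b) & 1)
    return Request(owner, node.strip(), device, variant, lanes, cores)


class ResourceManager:
    """Static conflict checker: grants a request or raises AllocationError."""

    def __init__(self, limits=NODE_LIMITS):
        self.limits = limits
        self.granted = []

    def allocate(self, req):
        on_node = [g for g in self.granted if g.node == req.node]
        # Conflict: the same pgplane on the same device is already held.
        for g in on_node:
            if g.device == req.device and g.lanes & req.lanes:
                raise AllocationError(f"{req.node}: lanes held by {g.owner}")
        # Limits: count what the node would hold if this request were granted.
        usage = {
            "kcus": len({g.device for g in on_node} | {req.device}),
            "pgplanes": sum(len(g.lanes) for g in on_node) + len(req.lanes),
            "cores": sum(g.cores for g in on_node) + req.cores,
        }
        for name, used in usage.items():
            if used > self.limits[name]:
                raise AllocationError(f"{req.node}: {name} limit exceeded")
        self.granted.append(req)

    def release(self, owner):
        """Drop every allocation held by one owner (e.g. on killPartition)."""
        self.granted = [g for g in self.granted if g.owner != owner]
```

In this sketch the allocation call would sit in the step before processes are launched, so a second cnf naming an already-held pgplane fails at that point rather than after its processes start. Because the limits are a plain dictionary, further resource kinds could be added without changing the overall shape.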

...