...

When we attempted to test the DAQ in SRCF with a couple of HSDs, we initially had trouble getting phase 2 of transitions through.  The two "sides", 1_DA:A and B, behaved differently with pgpread.  Sometimes events came through on one side but not the other, and then only with a significant delay.  That delay was measured from when the Run box on the groupca Events tab was checked, not from when the transition buttons were clicked.  Some of the entries in hsdpva's tabs were also odd (Buffers:raw:freesz reset to 65535 for one side and 4094 for the other).  Some work had been done on the PV gateway.  hsdioc on daq-tmo-hsd-01 had been restarted and got into a bad state.  Restarting it again cleared up the problem.

BUG: Soft lockup messages

If there is significant deadtime coming from an HSD pair when running at high rate and 'kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]' messages appear in dmesg, etc., it may be that the interrupts for both datadev devices are being handled by a single core, usually CPU0.  To avoid this, the interrupt handling can be moved to two different cores, e.g. CPU4 and 5.  First, either disable the irqbalance service or tell it to avoid the two chosen cores.  To disable it:

sudo systemctl stop irqbalance

or edit /etc/sysconfig/irqbalance and add the cores to IRQBALANCE_BANNED_CPUS, which is a hexadecimal CPU bitmask (bit N bans CPU N, so CPU4 and 5 give 30):

IRQBALANCE_BANNED_CPUS=30
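
Since IRQBALANCE_BANNED_CPUS is a hexadecimal bitmask with one bit per CPU, the value for a given set of cores can be worked out in the shell.  The following is only a sketch: cpu_mask is a hypothetical helper name, not something provided by irqbalance, and it assumes CPU numbers below 63.

```shell
# Print the hex bitmask (without a 0x prefix) covering the given CPU numbers.
# Hypothetical helper for illustration; assumes CPU numbers below 63.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        # Set bit number $cpu in the mask
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 4 5    # prints 30
```

The printed value is what would go on the right-hand side of IRQBALANCE_BANNED_CPUS.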

Restart the irqbalance service in the latter case.  Then set the /proc/irq/<datadev_N IRQ>/smp_affinity_list values (the IRQ numbers are listed in /proc/interrupts), e.g.:

sudo sh -c "echo 4 > /proc/irq/369/smp_affinity_list"
sudo sh -c "echo 5 > /proc/irq/370/smp_affinity_list"

These values will stick so long as the datadev driver is not reloaded and irqbalance is not restarted or misconfigured.  We'll need to find a way to do this as part of the datadev driver service (tdetsim.service or kcu.service) startup.
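
One possible way to automate the steps above is sketched here; it has not been tested against the actual services.  It assumes the datadev interrupts show up in /proc/interrupts with "datadev" in their name.  irqs_for and pin_irqs are hypothetical helper names, and the optional file argument exists only so the lookup can be tried on a saved copy of /proc/interrupts.

```shell
# List the IRQ numbers whose /proc/interrupts line mentions the given name.
# Hypothetical helper; $2 optionally names a saved copy of /proc/interrupts.
irqs_for() {
    awk -v dev="$1" 'index($0, dev) { sub(/:$/, "", $1); print $1 }' \
        "${2:-/proc/interrupts}"
}

# Pin each matching IRQ to its own core, starting at core $2.
# Hypothetical helper; needs sudo rights to write under /proc/irq.
pin_irqs() {
    core=$2
    for irq in $(irqs_for "$1"); do
        echo "$core" | sudo tee "/proc/irq/$irq/smp_affinity_list" > /dev/null
        core=$(( core + 1 ))
    done
}

# Example (not run here): pin_irqs datadev 4
```

If this turns out to work, a script along these lines could perhaps be run after the driver loads, for instance from an ExecStartPost= line in the service unit, but that is an assumption still to be checked.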

Cable Swaps

HSD cables can be plugged into the wrong place (e.g. "pairs" can be swapped).  They must match the mapping documentation Matt has placed at the bottom of hsd.cnf.  That mapping is also reflected in the lines in hsd.cnf that start up processes; keeping the two consistent is a manual process.  Matt has the usual "remote link id" pattern that can be used to check this, using "kcuStatus" on the KCU end and "hsdpva" on the other end, e.g.

...