...

If there is significant deadtime coming from an HSD pair when running at high rate and 'kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]' messages appear in dmesg, etc., it may be that the interrupts are being handled for both datadev devices by one core, usually CPU0.  To avoid this, the interrupt handling can be moved to two different cores, e.g. CPU4 and 5.  First, disable the irqbalance service or tell it to avoid the datadev's IRQ(s):

sudo systemctl stop irqbalance

Then set the IRQ affinity values, e.g.:

sudo sh -c "echo 4 > /proc/irq/369/smp_affinity_list"
sudo sh -c "echo 5 > /proc/irq/370/smp_affinity_list"
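To confirm that the new affinity took effect, the same smp_affinity_list files can be read back. The following is a minimal sketch, not part of the existing tooling: the function name show_irq_affinity and the IRQ_DIR override (used only to point the function at a test directory) are hypothetical, and 369/370 are the example IRQ numbers from above.

```shell
#!/bin/sh
# Print the CPU list each given IRQ is currently allowed to run on.
# IRQ_DIR is a hypothetical override for testing; it defaults to /proc/irq.
show_irq_affinity() {
    irq_dir=${IRQ_DIR:-/proc/irq}
    for irq in "$@"; do
        printf '%s -> CPUs %s\n' "$irq" "$(cat "$irq_dir/$irq/smp_affinity_list")"
    done
}

# Example (run on the DRP node):
#   show_irq_affinity 369 370
```

The per-CPU counters for the same IRQs in /proc/interrupts should then increase only in the columns of the chosen cores.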

For a more permanent solution, edit /etc/sysconfig/irqbalance and add the datadev's IRQs to IRQBALANCE_ARGS so that irqbalance avoids them:

IRQBALANCE_ARGS="--banirq=369 --banirq=370"

Restart the irqbalance service in the latter case.  Then, to set the /proc/irq/<datadev_N IRQ>/smp_affinity_list values each time the driver service starts, add the following to the kcu or tdetsim service file, e.g.:

ExecStartPost=/usr/bin/sh -c "/usr/bin/echo 4 > /proc/irq/369/smp_affinity_list"
ExecStartPost=/usr/bin/sh -c "/usr/bin/echo 5 > /proc/irq/370/smp_affinity_list"

These values will stick so long as the datadev driver is not reloaded and irqbalance is not restarted or misconfigured.  We'll need to find a way to do this as part of the datadev driver service (tdetsim.service or kcu.service) startup.  Note that since nothing sets the datadev's IRQ numbers to any particular value (as far as I can tell), I think we may need to consider the possibility that they can change from system (or driver?) restart to restart or be different from node to node.  However, I've seen no evidence that this is the case.
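If the IRQ numbers do turn out to vary, they could be looked up at startup instead of being hard-coded. The sketch below assumes the datadev devices appear in /proc/interrupts under a name containing "datadev"; that name, the function names, and the optional file and directory arguments (present only so the functions can be exercised against test data) are assumptions, not something confirmed on the DRP nodes.

```shell
#!/bin/sh
# Print the numeric IRQs whose /proc/interrupts row mentions NAME.
# Usage: irqs_for_device NAME [INTERRUPTS_FILE]
irqs_for_device() {
    name=$1
    file=${2:-/proc/interrupts}
    # The first field of each row is "<irq>:"; skip named rows such as NMI.
    awk -v name="$name" '
        index($0, name) && $1 ~ /^[0-9]+:$/ { sub(":", "", $1); print $1 }
    ' "$file"
}

# Pin each matching IRQ to its own core, starting at FIRST_CPU.
# Usage: pin_device_irqs NAME FIRST_CPU [INTERRUPTS_FILE] [IRQ_DIR]
pin_device_irqs() {
    name=$1
    cpu=$2
    file=${3:-/proc/interrupts}
    irq_dir=${4:-/proc/irq}
    for irq in $(irqs_for_device "$name" "$file"); do
        echo "$cpu" > "$irq_dir/$irq/smp_affinity_list" || return 1
        cpu=$((cpu + 1))
    done
}

# Example (as root, after the datadev driver has loaded):
#   pin_device_irqs datadev 4
```

With two datadev devices this would reproduce the CPU4/CPU5 assignment used above, whatever IRQ numbers the kernel happened to allocate.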

Cable Swaps

HSD cables can be plugged into the wrong place (e.g. "pairs" can be swapped).  They must match the mapping documentation Matt has placed at the bottom of hsd.cnf.  That mapping is reflected in the lines in hsd.cnf that start up processes; making sure the two are consistent is a manual process.  Matt has the usual "remote link id" pattern that can be used to check this, by using "kcuStatus" on the KCU end and "hsdpva" on the other end, e.g.

...