System: SRCF FFB
Problem Overview
Reading data in live mode while files are still being written can cause a segmentation fault in psana2. This occurs when the readers (bd cores) are as fast as, or faster than, the writers. The problem has been reported to the file-system vendor (currently Weka) and help is en route. The observed behavior is depicted in the diagram below: one bd core reads a chunk of bytes into its memory, but part of that chunk is zeroed out.
a.xtc2  offset: 1000  chunk size: 20
[x x x x x x x x x x x x x x x 0 0 0 0 0]
We expect to see data in the last five bytes, but they read back as zero. This can cause different failures in psana2, including a segmentation fault, e.g.:
*** /cds/home/m/monarin/lcls2/install/include/xtcdata/xtc/ShapesData.hh:355: incorrect TypeId 0
[drp-srcf-cmp048:209412] *** Process received signal ***
[drp-srcf-cmp048:209412] Signal: Aborted (6)
[drp-srcf-cmp048:209412] Signal code: (-6)
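To make the failure mode concrete, here is a small standalone sketch (not psana2 code) that counts how many trailing bytes of a freshly read chunk are zero; with the 20-byte chunk from the diagram above it reports the five missing bytes:

```python
def trailing_zeros(chunk: bytes) -> int:
    """Count how many bytes at the end of a read chunk came back as zero."""
    n = 0
    for b in reversed(chunk):
        if b != 0:
            break
        n += 1
    return n

# Simulate the diagram: a 20-byte read where the writer's last
# 5 bytes have not yet become visible to the reader.
chunk = b"\x01" * 15 + b"\x00" * 5
print(trailing_zeros(chunk))  # -> 5
```

Note that a zero tail is not by itself proof of the bug (valid data can contain zero bytes); in psana2 the corruption surfaces later, when the xtc parser hits an invalid TypeId as in the traceback above.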
Steps to Reproduce The Problem
Both the writers and the readers have to be fast enough. We could not trigger the problem when the test run was small (a few xtc2 files) or when the number of bd cores was not large enough. The setup below reproduces the problem.
DAQ
We use the rix timing system (XPM 3) and all available lanes on drp-srcf-cmp029 (one for timing and seven for test detectors). The full cnf file used in this test is available in the appendix below.
PSANA2
We use 396 cores on 9 srcf nodes (with 32 EventBuilder cores) to keep up with all eight xtc2 files. The Python and bash/Slurm job scripts are shown below.
import time
import os, sys

import numpy as np
from mpi4py import MPI

from psana import DataSource
import vals

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()


def test_standard():
    batch_size = 1000
    max_events = 0
    hutch = 'tst'
    exp = sys.argv[1]
    runno = int(sys.argv[2])
    xtc_dir = f'/cds/data/drpsrcf/{hutch}/{exp}/xtc/'
    ds = DataSource(exp=exp,
                    run=runno,
                    batch_size=batch_size,
                    max_events=max_events,
                    dir=xtc_dir,
                    live=True)
    sendbuf = np.zeros(1, dtype='i')
    recvbuf = None
    if rank == 0:
        recvbuf = np.empty([size, 1], dtype='i')
    st = time.time()
    for run in ds.runs():
        for nevt, evt in enumerate(run.events()):
            if nevt % 1000 == 0 and nevt > 0:
                en = time.time()
                print(f'RANK: {rank:4d} EVENTS: {nevt:10d} RATE: {(1000/(en-st))*1e-3:.2f}kHz', flush=True)
                st = time.time()
            sendbuf += 1  # Count total no. of events
    comm.Gather(sendbuf, recvbuf, root=0)
    if rank == 0:
        n_events = np.sum(recvbuf)
    else:
        n_events = None
    n_events = comm.bcast(n_events, root=0)
    return n_events


if __name__ == "__main__":
    comm.Barrier()
    t0 = MPI.Wtime()
    n_events = test_standard()
    comm.Barrier()
    t1 = MPI.Wtime()
    if rank == 0:
        n_eb_nodes = int(os.environ.get('PS_EB_NODES', '1'))
        print(f'TOTAL TIME:{t1-t0:.2f}s #EB: {n_eb_nodes:3d} EVENTS:{n_events:10d} RATE:{(n_events/(t1-t0))*1e-6:.2f}MHz', flush=True)
#!/bin/bash
#SBATCH --partition=anaq
#SBATCH --job-name=psana2
#SBATCH --nodes=9
#SBATCH --ntasks=396
##SBATCH --ntasks-per-node=50
#SBATCH --output=%j.log
#SBATCH --exclusive

t_start=`date +%s`

source setup_hosts.sh
echo SLURM_HOSTFILE $SLURM_HOSTFILE SLURM_NTASKS $SLURM_NTASKS

export PS_EB_NODES=32
MAX_EVENTS=0
EXP="tstx00817"
RUNNO=55
srun ./run_slac.sh $MAX_EVENTS $EXP $RUNNO

t_end=`date +%s`
echo PSJobCompleted TotalElapsed $((t_end-t_start))
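For reference, the rank breakdown implied by the scripts above, assuming psana2's usual split of one Smd0 rank, PS_EB_NODES EventBuilder ranks, and the remainder as BigData (bd) ranks (the single-Smd0 assumption is illustrative, not taken from this report):

```python
total_ranks = 396   # SBATCH --ntasks
eb_ranks = 32       # PS_EB_NODES
smd0_ranks = 1      # assumed single Smd0 reader rank
bd_ranks = total_ranks - eb_ranks - smd0_ranks
per_file = bd_ranks / 8   # eight xtc2 files in the test run
print(bd_ranks, round(per_file, 1))  # -> 363 45.4
```

Roughly 45 bd readers per xtc2 file is what keeps the readers fast enough to catch up with the writers and trigger the bug.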