You can run the batchtest quickly on a couple of nodes like this:
psanagpu105:~$ source /cds/sw/ds/ana/conda1/manage/bin/psconda.sh
(ana-4.0.42) psanagpu105:~$ python /cds/group/psdm/utils/batchtest_slurm/batchtest.py psana1203 psana1205
('launched', 'psana1203_psana1205_06_17_091719.log')
srun: do not allocate resources
(ana-4.0.42) psanagpu105:~$ head -5 /cds/group/psdm/utils/batchtest_slurm/logs/psana1203_psana1205_06_17_091719.log
slurmstepd: error: psana1203 [0] pmixp_utils.c:471 [pmixp_p2p_send] mpi/pmix: ERROR: send failed, rc=111, exceeded the retry limit
slurmstepd: error: psana1203 [0] pmixp_server.c:1578 [_slurm_send] mpi/pmix: ERROR: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.-19544.1733666360, size = 266, hostlist: (null)
(ana-4.0.42) psanagpu105:~$
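The log excerpt above contains slurmstepd error lines that name the failing node. As a rough sketch of how such a log could be summarized per node, the helper below groups error lines by node name. The function name `summarize_errors` and the line pattern are assumptions inferred from the excerpt shown here; this is not part of batchtest.py and is not taken from Slurm documentation.

```python
import re

# Matches lines like "slurmstepd: error: psana1203 [0] pmixp_utils.c:471 ...".
# Assumption: this pattern is inferred from the log excerpt on this page only.
ERROR_LINE = re.compile(r"^(?P<source>\w+): error: (?P<node>\S+) (?P<message>.*)$")


def summarize_errors(lines):
    """Return a dict mapping node name to the list of its error messages."""
    errors = {}
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match:
            errors.setdefault(match.group("node"), []).append(match.group("message"))
    return errors
```

To use it on a log file, open the file and pass the file object, for example `summarize_errors(open(path))`, where `path` points at one of the files under the logs directory shown above.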
This has to be done on psanagpu105 (psanaq) or drp-srcf-eb003 (ffb) because it requires a sudoers entry that lets the listed users run /usr/bin/srun as the slurm user without a password:
(ana-4.0.42) psanagpu105:~$ sudo cat /etc/sudoers.d/srun
[sudo] password for cpo:
cpo ALL=(slurm) NOPASSWD:SETENV: /usr/bin/srun
wilko ALL=(slurm) NOPASSWD:SETENV: /usr/bin/srun
(ana-4.0.42) psanagpu105:~$
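Since the test only works from those two hosts, a small pre-flight check can catch a launch from the wrong machine. The sketch below is a hypothetical helper, not something batchtest.py is known to do; the host names and the psanaq/ffb labels are copied from the text above, and `launch_partition` is an invented name.

```python
import socket

# Launch hosts and their labels as given in the text above.
LAUNCH_HOSTS = {"psanagpu105": "psanaq", "drp-srcf-eb003": "ffb"}


def launch_partition(hostname=None):
    """Return the label for a launch host, or None if the host is not one."""
    if hostname is None:
        hostname = socket.gethostname()
    # Drop any domain suffix so a fully qualified name still matches.
    short_name = hostname.split(".")[0]
    return LAUNCH_HOSTS.get(short_name)
```

Calling `launch_partition()` with no argument checks the current machine; a return value of `None` means the sudoers entry shown above is probably not available there.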
You can run the batchtest on all nodes with the command below, launched from either of the two nodes above. The script waits up to 5 minutes for the jobs to run, after which it should automatically email you a summary of the failures:
/cds/group/psdm/utils/batchtest_slurm/batchtest.sh