You can run the batchtest quickly on a couple of nodes like this:

psanagpu105:~$ source /cds/sw/ds/ana/conda1/manage/bin/psconda.sh 
(ana-4.0.42) psanagpu105:~$ python /cds/group/psdm/utils/batchtest_slurm/batchtest.py psana1203 psana1205
('launched', 'psana1203_psana1205_06_17_091719.log')
srun: do not allocate resources
(ana-4.0.42) psanagpu105:~$ head -5 /cds/group/psdm/utils/batchtest_slurm/logs/psana1203_psana1205_06_17_091719.log
slurmstepd: error: psana1203 [0] pmixp_utils.c:471 [pmixp_p2p_send] mpi/pmix: ERROR: send failed, rc=111, exceeded the retry limit
slurmstepd: error: psana1203 [0] pmixp_server.c:1578 [_slurm_send] mpi/pmix: ERROR: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.-19544.1733666360, size = 266, hostlist:
(null)
(ana-4.0.42) psanagpu105:~$ 
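
To triage a log like the one above without reading it line by line, you could count the slurmstepd error lines. This is only a sketch: count_slurm_errors is a hypothetical helper, not part of batchtest.py, and it assumes failures show up with the "slurmstepd: error:" prefix seen in the log above.

```shell
# Hypothetical helper (not part of batchtest.py): count the
# "slurmstepd: error:" lines in a batchtest log file.
count_slurm_errors() {
    # grep -c prints the number of matching lines. It exits non-zero
    # when nothing matches, so "|| true" keeps this safe under "set -e".
    grep -c 'slurmstepd: error:' "$1" || true
}
```

For example, count_slurm_errors /cds/group/psdm/utils/batchtest_slurm/logs/psana1203_psana1205_06_17_091719.log prints the number of such lines in the log from the run above.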


This has to be done on psanagpu105 (psanaq) or drp-srcf-eb003 (ffb), because it requires a sudoers entry like the following, which allows srun to be run as the slurm user without a password:

(ana-4.0.42) psanagpu105:~$ sudo cat /etc/sudoers.d/srun
[sudo] password for cpo: 
cpo ALL=(slurm) NOPASSWD:SETENV: /usr/bin/srun
wilko ALL=(slurm) NOPASSWD:SETENV: /usr/bin/srun
(ana-4.0.42) psanagpu105:~$ 
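
If you wrap the test in your own script, a small guard can catch launches from the wrong node early. This is a sketch under assumptions: is_batchtest_host is a hypothetical function, and the host list is simply the two node names given above.

```shell
# Hypothetical guard: succeed only for the two launch nodes named above.
is_batchtest_host() {
    # Strip any domain suffix so both short and fully qualified names match.
    case "${1%%.*}" in
        psanagpu105|drp-srcf-eb003) return 0 ;;
        *) return 1 ;;
    esac
}
```

A typical use would be: is_batchtest_host "$(hostname)" || echo "run batchtest from psanagpu105 or drp-srcf-eb003" >&2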


You can run the batchtest on all nodes (launched from either of the two nodes above) with the command below. The script waits up to 5 minutes for the jobs to run, and should then automatically send you an email summarizing the failures:

/cds/group/psdm/utils/batchtest_slurm/batchtest.sh


