...
- single batch host is killing several jobs ('rogue' LSF host)
- group of batch hosts crashed or went offline
- AFS crashed on one of the host machines
- scratch disk is full on one or more of the host machines - see what to do here
- staging disk is full
- /nfs/farm/g/glast/u52 or /nfs/farm/g/glast/u15 is full
How to recognize infrastructure failures: they usually affect a large number of jobs, either on the same LSF host or on different LSF hosts.
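The rule of thumb above can be sketched as a small script that groups failed jobs by host. This is only an illustration: the input format (a list of job ID and host name pairs), the function name `classify_failures`, and both thresholds are assumptions made for the example, not part of the pipeline or of LSF. The list of failed jobs is assumed to have been collected beforehand by whatever means is available.

```python
from collections import Counter


def classify_failures(failed_jobs, per_host_threshold=5, total_threshold=20):
    """Summarize failed jobs by host.

    failed_jobs: iterable of (job_id, host_name) tuples (assumed format).
    per_host_threshold, total_threshold: hypothetical cut-offs; tune them
    to the size of the task being monitored.
    """
    # Count how many failed jobs ran on each host.
    per_host = Counter(host for _, host in failed_jobs)

    # Hosts with many failures are candidates for a 'rogue' LSF host.
    rogue_hosts = sorted(
        host for host, count in per_host.items() if count >= per_host_threshold
    )

    # Many failures spread over several hosts point to a shared resource
    # (for example a full disk) rather than a single bad machine.
    total = sum(per_host.values())
    widespread = total >= total_threshold and len(per_host) > 1

    return {
        "rogue_hosts": rogue_hosts,
        "widespread": widespread,
        "failures_per_host": dict(per_host),
    }


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    jobs = [(i, "hostA") for i in range(6)] + [(100, "hostB")]
    print(classify_failures(jobs))
```

A host listed under `rogue_hosts` suggests the single-host case, while `widespread` being true suggests a failure shared across hosts. Neither result is conclusive on its own; it only indicates where to look first.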
...