...
For long-running jobs, you can submit work into our batch farm environment to make use of the high-performance compute and storage at SLAC.
There is a batch queue for GPU systems.
In order to submit a job into the batch queues, it is recommended that you test out the job(s) on the interactive node at
Code Block
ssh ocio-gpu01.slac.stanford.edu
first. This machine provides multiple K80-class GPUs as an interactive machine that you can experiment on; it is not recommended to run long jobs on this machine.
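Once logged in, a quick sanity check that the GPUs are visible is to list them with nvidia-smi, which ships with the NVIDIA driver. This is just an illustrative sketch; the guard only exists so the snippet does not error on machines without the driver:

```shell
# quick sanity check on the interactive node: list the visible GPUs
if command -v nvidia-smi >/dev/null; then
    nvidia-smi -L          # prints one line per GPU
else
    echo "nvidia-smi not found - run this on the GPU node itself"
fi
```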
When you're ready and have a command line or series of commands that you need to perform, you should create a batch script (using vi, emacs, nano, etc.) with something like the following:
Code Block
#!/bin/bash -l
#BSUB -a mympi
#BSUB -P cryoem
#BSUB -J my_batch_job_name
#BSUB -q ocio-gpu
#BSUB -W 168:00
#BSUB -R "select[ngpus=2]"
#BSUB -o ~/gpu-out.log.%J
#BSUB -e ~/gpu-err.log.%J

# setup env
source /etc/profile.d/modules.sh
export MODULEPATH=/usr/share/Modules/modulefiles:/opt/modulefiles:/afs/slac/package/singularity/modulefiles
module purge
module load PrgEnv-gcc/4.8
module load cuda-8.0.61-gcc-4.9.4-pe64s22
module load ...
/afs/slac.stanford.edu/package/spack/opt/spack/linux-rhel6-x86_64/gcc-4.9.4/cuda-8.0.61-pe64s22ajz3fi37zdcjgtpnpcbooinum/samples/1_Utilities/deviceQuery/deviceQuery -noprompt
This will submit into the ocio-gpu queue, requesting 2 GPUs for up to 168 hours. The output from the script will be written to ~/gpu-out.log.nnnnn, where nnnnn is your job ID.
The rest of the script is the code that will be executed on the machine; in the example above, we are simply dumping out the details of the GPUs.
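Once the job is in the queue, a few standard LSF commands are useful for keeping an eye on it. The job ID 12345 below is a placeholder (you get the real one back from bsub), and these commands only exist on machines with LSF installed:

```shell
bjobs              # list your pending and running jobs
bjobs -l 12345     # full details for a single job
bpeek 12345        # peek at the job's stdout while it is still running
bkill 12345        # kill the job if something has gone wrong
```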
You will likely need to load a number of modules for your code; the module load lines provide access to the various programs that are installed, e.g.
Code Block
# for relion
module load relion-2.1.b1-gcc-4.8.5-aaq4a56
# for icon-gpu
module load icon-gpu-1.2.5-gcc-4.8.5-k5sn3yw
# for ctffind
module load ctffind4-4.1.8-intel-17.0.2-gfcjad5
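The exact module names (including the compiler version and hash suffix) change as software is updated, so it is worth checking what is currently available before hardcoding a name into your script. module is the standard environment-modules command; note that it prints its listing to stderr, hence the 2>&1 in the filtered form:

```shell
module list                          # what is loaded right now
module avail                         # everything available
module avail 2>&1 | grep -i relion   # filter the listing for one package
```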
For example, a complete script to run a relion job might look like:
Code Block
#!/bin/bash -l
#BSUB -P cryoem
#BSUB -J my_batch_job_name
#BSUB -q slacgpu
#BSUB -n 4
#BSUB -R "span[hosts=1]"
#BSUB -W 72:00
#BSUB -B
#BSUB -o run.out
#BSUB -e run.err

# load relion
module load relion/3.0
# change working directory
cd <datadir>
# run the command
relion_reconstruct --i Refine3D/job103/run_ct31_data.star --o Reconstruct/ewald1/half1_class001_unfil.mrc --subset 1 --angpix 2.14 --ctf --ewald --mask_diameter 696 --sectors 2 --width_mask_edge 5 --sym I1 > Reconstruct/ewald1/reconstruct_half1_ewald.log
The precise content of the batch script depends on what you wish to do. In the above, we've requested 4 'slots' (each of which maps roughly 1:1 to a CPU core) with the -n flag, and we've also requested that they all be on the same host (span[hosts=1]).
We are submitting into the slacgpu queue, have requested no GPUs, and have said that we don't expect this job to run longer than 72 hours.
You should change the last two lines to suit the job that you are wanting to run.
If you are running a singularity container, you should use something like this:
Code Block
singularity exec -B /gpfs,/scratch <path_to_container_file> python <python script> <args>
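For GPU jobs inside a container, singularity's standard --nv flag exposes the host's NVIDIA driver and devices to the container. A hypothetical example, where the container path and script name are placeholders to adapt:

```shell
# bind the shared filesystems and pass the GPUs through to the container
singularity exec --nv -B /gpfs,/scratch /path/to/my_container.simg \
    python my_training_script.py --epochs 10
```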
When you're ready to run, submit your job with the following (note the Linux redirection of the batch file into the bsub command):
Code Block
bsub < batch_file
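If you run variations of the same job often, the batch script can be generated and submitted in one go. A minimal sketch, in which the file name my_job.lsf and the queue/resource values are placeholders; the guard around bsub simply lets the snippet run on machines without LSF installed:

```shell
# generate a minimal batch script with a here-doc, then submit it
cat > my_job.lsf <<'EOF'
#!/bin/bash -l
#BSUB -q slacgpu
#BSUB -n 4
#BSUB -W 72:00
#BSUB -o run.out
#BSUB -e run.err
echo "running on $HOSTNAME"
EOF

# bsub reads the script from stdin
if command -v bsub >/dev/null; then
    bsub < my_job.lsf
else
    echo "bsub not found - submit this from an LSF submission host"
fi
```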
...
You will get email notification of the progress of your job sent to your SLAC unix account (enabled with the -B argument).