
For long-running jobs, you can submit work into our batch farm environment to make use of the high-performance compute and storage at SLAC.

There is a batch queue for GPU systems. 
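
If you want to check which queues are visible to you and how busy they are, you can ask LSF directly. This is a minimal sketch using standard LSF commands; the queue name shown is the one used in the examples further down this page:

Code Block
# list all queues you can submit to
bqueues

# show just the gpu queue used in the examples below
bqueues slacgpu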

Before submitting a job into the batch queues, it is recommended that you first test it out on the interactive node:

 

Code Block
 

ssh ocio-gpu01.slac.stanford.edu

 

This machine provides multiple K80-class GPUs that you can log into and experiment with interactively. It is not recommended to run long jobs on this machine.
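
Once you are logged in, it is worth confirming that the GPUs are visible before you start testing. A quick sanity check, assuming the NVIDIA driver utilities are installed on the node (they normally ship with the driver):

Code Block
# list the gpus on the interactive node and show their current utilisation
nvidia-smi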

 

When you're ready and have a command line or series of commands that you need to perform, you should create a batch script (using vi, emacs, nano, etc.) with something like the following:

 

Code Block
#!/bin/bash -l

# application esub, project, job name and queue to submit to
#BSUB -a mympi
#BSUB -P cryoem
#BSUB -J my_batch_job_name
#BSUB -q slacgpu

# resources: 4 slots, all on a single host with 2 gpus, for up to 72 hours
#BSUB -n 4
#BSUB -R "select[ngpus=2] span[hosts=1]"
#BSUB -W 72:00

# stdout/stderr files (%J is replaced by the job id)
#BSUB -o ~/gpu-out.log.%J
#BSUB -e ~/gpu-err.log.%J

# setup env
source /etc/profile.d/modules.sh
export MODULEPATH=/usr/share/Modules/modulefiles:/opt/modulefiles:/afs/slac/package/singularity/modulefiles
module purge
module load cuda-8.0.61-gcc-4.9.4-pe64s22
module load ...

# dump out the details of the gpus on the allocated host
/afs/slac.stanford.edu/package/spack/opt/spack/linux-rhel6-x86_64/gcc-4.9.4/cuda-8.0.61-pe64s22ajz3fi37zdcjgtpnpcbooinum/samples/1_Utilities/deviceQuery/deviceQuery -noprompt

 

This will submit the job into the slacgpu queue, requesting 2 GPUs on a single host for up to 72 hours. The output from the script will be dumped into ~/gpu-out.log.nnnnn (and errors into ~/gpu-err.log.nnnnn), where nnnnn is your job id.

The rest of the script is the code that will be executed on the allocated machine. In the example above, we are simply dumping out the details of the GPUs.
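
You do not have to wait for the job to finish to look at that output. This is a minimal sketch using the standard LSF bpeek command (substitute your own job id):

Code Block
# show the stdout produced so far by a running job
bpeek <jobid>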

You will likely need to load a number of modules for your code; the module load lines provide access to the various programs that are installed, e.g.:

Code Block
# for relion
module load relion-2.1.b1-gcc-4.8.5-aaq4a56
 
# for icon-gpu
module load icon-gpu-1.2.5-gcc-4.8.5-k5sn3yw
 
# for ctffind
module load ctffind4-4.1.8-intel-17.0.2-gfcjad5
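
If you are not sure of the exact module name to load, you can ask the module system itself. A short sketch using standard environment-modules commands:

Code Block
# list everything that is available
module avail

# narrow the listing down to a particular package, e.g. relion
module avail relion

# show what is currently loaded in your session
module list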

 

Putting this together, the final part of a batch script for a relion job might look like the following:

Code Block
# load the required software
module load relion/3.0

# change working directory
cd <datadir>

# run the command
relion_reconstruct --i Refine3D/job103/run_ct31_data.star --o Reconstruct/ewald1/half1_class001_unfil.mrc --subset 1 --angpix 2.14 --ctf --ewald --mask_diameter 696 --sectors 2 --width_mask_edge 5 --sym I1 > Reconstruct/ewald1/reconstruct_half1_ewald.log

 

The precise content of the batch script depends on what you wish to do. In the batch script above, we've requested 4 'slots' (each of which maps roughly 1:1 to a cpu core) with the -n flag, and we've also requested that they all be on the same host (span[hosts=1]).

We are submitting into the slacgpu queue, have requested 2 GPUs, and have said that we don't expect this job to run longer than 72 hours.

You should change the last lines of the script (the working directory and the command itself) to suit the job that you want to run.
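
The #BSUB resource requests can be adjusted in the same way. As a hedged example (the limits you can actually request depend on the queue configuration), a job that needs 8 slots on a single host for up to 24 hours would change the resource lines to:

Code Block
# request 8 slots, all on one host, for up to 24 hours
#BSUB -n 8
#BSUB -R "span[hosts=1]"
#BSUB -W 24:00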

If you are running a singularity container, you should use something like this:

 

Code Block
singularity exec -B /gpfs,/scratch <path_to_container_file> python <python script> <args>
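
For GPU jobs, depending on the version of Singularity installed, you may also need the --nv flag so that the NVIDIA devices and driver libraries are visible inside the container. A hedged sketch, keeping the same placeholders as above:

Code Block
# --nv exposes the host gpus inside the container
singularity exec --nv -B /gpfs,/scratch <path_to_container_file> python <python script> <args>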

 

When you're ready to run, submit your job with the following (note the Linux redirection of the batch file into the bsub command):

Code Block
bsub < batch_file
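
After submitting, you can confirm that the job has been accepted and cancel it if you change your mind. A minimal sketch using standard LSF commands (replace <jobid> with the id reported by bsub):

Code Block
# list your pending and running jobs
bjobs

# show full details for a particular job
bjobs -l <jobid>

# cancel a job
bkill <jobid>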


You will get email notification of the progress of your job sent to your SLAC unix account (if you use the -B argument).
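
As a hedged sketch, these mail options can also be set inside the batch script itself; -B mails you when the job starts running and -N sends a job report when it finishes:

Code Block
# email when the job begins executing
#BSUB -B
# email a job report when the job completes
#BSUB -N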