...

The input files have been copied over from CERN's EOS to Rachel's GPFS data area here: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/


Submitting Training Jobs with Slurm

DO NOT TRAIN ON INTERACTIVE SDF MACHINES.

Edit whichever configuration file you are using (e.g. SubjetXbb.yaml) with the correct paths to the training data on SDF:

...
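For reference, the relevant entries live in the data block of the config. The snippet below is only an illustrative sketch (the exact field names and file names may differ between Salt versions and configs, so check your own file); the point is that each path should point at the copies in the GPFS area listed above:

Code Block
data:
  # illustrative paths only - substitute the actual file names in the GPFS area
  train_file: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/<train-file>.h5
  val_file: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/<val-file>.h5
  norm_dict: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/<norm-dict>.yaml
  class_dict: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/<class-dict>.yaml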

To train, you'll need to submit a batch job to Slurm. This requires a .sh script, such as the one below (adapted from the submit_slurm.sh script in the Salt repository):

Code Block
#!/bin/bash

# Job name
#SBATCH --job-name=salt

# choose the GPU queue
# #SBATCH -p atlas
#SBATCH -p shared

# requesting one node
#SBATCH --nodes=1
#SBATCH --exclusive

# keep environment variables
#SBATCH --export=ALL

# request GPUs
# (use --gres=gpu:<type>:N, like the commented a100 line, if you care which GPU type;
#  otherwise just request a count with --gpus)
# #SBATCH --gres=gpu:a100:4
#SBATCH --gpus=1

# note! this needs to match --trainer.devices!
#SBATCH --ntasks-per-node=1

# number of cpus per task
# useful if you don't have exclusive access to the node
# #SBATCH --cpus-per-task=10

# request enough memory
#SBATCH --mem=200G

# Change log names; %j gives job id, %x gives job name
#SBATCH --output=/sdf/home/r/rhyneman/salt_tutorial/salt/salt/out/slurm-%j.%x.out
#SBATCH --error=/sdf/home/r/rhyneman/salt_tutorial/salt/salt/out/slurm-%j.%x.err

# Comet logging (replace with your own API key, workspace, and project)
export COMET_API_KEY=5k4oCPRq8rFcxn5BShm83X0FnMYKEY
export COMET_WORKSPACE=rhyneman
export COMET_PROJECT_NAME=salt-tutorial

# speedup
export OMP_NUM_THREADS=1

echo "CPU count: $(cat /proc/cpuinfo | awk '/^processor/{print $3}' | tail -1)"

echo "Current node: ${SLURMD_NODENAME}"

# move to workdir
cd /sdf/home/r/rhyneman/salt_tutorial/salt/salt/
echo "Moved dir, now in: ${PWD}"

# activate environment
source /sdf/home/r/rhyneman/miniconda3/etc/profile.d/conda.sh
conda activate salt
echo "Activated environment ${CONDA_DEFAULT_ENV}"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"

# run the training
echo "Running training script..."
# NOTE: --data.num_jets_train is set to a very small number here, just for testing! Change me :)
srun salt fit \
    --config configs/GN1Xbb.yaml \
    --data.num_jets_train 10000

...

Before submitting, adapt the script to your own setup:

  • Comet: change the Comet settings to your own account info. Note that I used "salt-tutorial" (instead of just "salt") as my project name, and changed the project_name in the logger block of the base.yaml config file accordingly (see the sketch below).
  • Log files: change the output and error directories (#SBATCH --output=... and #SBATCH --error=...) to your own space.
  • Work directory: point the cd command at the "salt" directory of your own Salt installation.
  • Environment: you may or may not want to use your own miniconda installation (though Rachel's should work). If you do, change the path under the "activate environment" comment.
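For reference, the Comet project name in base.yaml lives in the logger block of the trainer section. The snippet below is a sketch based on a typical Salt base.yaml (the exact class path and layout may differ in your version, so compare against your own file):

Code Block
trainer:
  logger:
    class_path: lightning.pytorch.loggers.CometLogger
    init_args:
      project_name: salt-tutorial  # match COMET_PROJECT_NAME in the submit script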

If you want to run with more than one GPU, make sure to edit the #SBATCH --gpus=N line (for N GPUs). I believe you should also edit the #SBATCH --ntasks-per-node=N option in the above script, as well as the devices: 1 option in the base.yaml config file (in the trainer section), so that all three agree; a sketch of the edits is below.
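For example, to run on 4 GPUs, the three settings that need to agree would look something like this (a sketch only; adapt the numbers to your own node and partition limits):

Code Block
# in submit_slurm.sh
#SBATCH --gpus=4
#SBATCH --ntasks-per-node=4

# in base.yaml (trainer section)
trainer:
  devices: 4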

With the above script edited as needed, you should be able to submit by doing:

Code Block
sbatch submit_slurm.sh
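Once submitted, you can keep an eye on the job with standard Slurm commands and follow the log file defined by the --output option above (the <jobid> placeholder is whatever ID sbatch reports):

Code Block
# check the status of your jobs in the queue
squeue -u $USER

# follow the training output once the job starts
tail -f /sdf/home/r/rhyneman/salt_tutorial/salt/salt/out/slurm-<jobid>.salt.out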