...

Code Block
sbatch submit_slurm.sh


Submitting Testing Jobs with Slurm

Testing on Slurm is essentially the same as training, but a few key changes to the above submission script are needed, as shown below:

Code Block
#!/bin/bash

# Job name
#SBATCH --job-name=salt_GN1Xbb

# choose the GPU queue
# #SBATCH -p atlas
#SBATCH -p shared

# requesting one node
#SBATCH --nodes=1
#SBATCH --exclusive

# keep environment variables
#SBATCH --export=ALL

# requesting 1 GPU
# (use e.g. --gres=gpu:a100:1 if you want a specific GPU type)
#SBATCH --gpus=1     ### NOTE! Salt can only use one GPU for testing!

# note! this needs to match --trainer.devices!
#SBATCH --ntasks-per-node=1     ### NOTE! This needs to be set to 1, since we can only use one GPU!

# number of cpus per task
# useful if you don't have exclusive access to the node
# #SBATCH --cpus-per-task=10

# request enough memory
#SBATCH --mem=200G

# Change log names; %j gives job id, %x gives job name
#SBATCH --output=/sdf/home/r/rhyneman/salt_tutorial/salt/salt/out/slurm-%j.%x.out
#SBATCH --error=/sdf/home/r/rhyneman/salt_tutorial/salt/salt/out/slurm-%j.%x.err

# Comet 
export COMET_API_KEY=5k4oCPRq8rFcxn5BShm83X0Fn
export COMET_WORKSPACE=rhyneman
export COMET_PROJECT_NAME=salt-tutorial

# speedup
export OMP_NUM_THREADS=1

echo "CPU count: $(cat /proc/cpuinfo | awk '/^processor/{print $3}' | tail -1)"

echo "Current node: ${SLURMD_NODENAME}"

# move to workdir
cd /sdf/home/r/rhyneman/salt_tutorial/salt/salt/
echo "Moved dir, now in: ${PWD}"

# activate environment
source /sdf/home/r/rhyneman/miniconda3/etc/profile.d/conda.sh
conda activate salt
echo "Activated environment ${CONDA_DEFAULT_ENV}"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"

# run the testing
echo "Running testing script..."
# Replace MYMODEL with your model name!
# --data.num_jets_test 1000 is a very small number, just for testing! Change me :)
# (note: comments can't be placed after the trailing "\" line continuations, so they live up here)
srun salt test \
    --config logs/MYMODEL/config.yaml \
    --data.test_file /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/inclusive_testing_Mix.h5 \
    --data.num_jets_test 1000 \
    --trainer.devices 1

The big changes to the Slurm options of the script are to set #SBATCH --gpus=1 and #SBATCH --ntasks-per-node=1; both must be set to 1 because, unlike in the training loop, Salt can only use a single GPU for testing. The other change is (of course) to replace the "train" command with the "test" command (the srun salt test part); the --trainer.devices 1 option in the salt command is there for the same reason.
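
In condensed form, the lines that differ from the training submission script boil down to the following (all taken from the script above):

Code Block
# request exactly one GPU and one task: Salt can only test on a single GPU
#SBATCH --gpus=1
#SBATCH --ntasks-per-node=1

# run the test command instead of the training command, with a matching single device
srun salt test --config logs/MYMODEL/config.yaml ... --trainer.devices 1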

The testing loop requires you to point to the config file in the logs area (the output area, which is a subdirectory of the salt/salt directory). The model subdirectory name should look something like "GN1Xbb" or "SubjetXbb", followed by a set of numbers like 20230216-T200512: the first part gives the name of the algorithm model you were using, while the second is a date and time stamp. Make sure to replace "MYMODEL" in the above testing script with whichever model subdirectory you want to run the testing loop on. The testing loop also requires you to point to a test dataset (specified with the --data.test_file option); feel free to keep using the one stored in Rachel's GPFS area.
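
For example, you can check which trained model directories exist in the logs area and then submit the job once everything is filled in. The directory names and the submission-script filename below are only illustrative; use whatever is actually in your logs area and whatever you saved the script above as:

Code Block
# see which trained model runs are available
# (names follow a <model>_<date-time> pattern, e.g. GN1Xbb_20230216-T200512)
ls logs/

# with MYMODEL and the test file filled in, submit the testing job as before
sbatch submit_test_slurm.sh

# optionally, keep an eye on it in the queue
squeue -u $USER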

Once the testing loop is complete, you can find the new output .h5 file in the logs/MYMODEL/ckpts/ subdirectory. The above script lets Salt pick whichever checkpoint from the training has the lowest validation loss, which may or may not be the one from the final epoch; you can instead specify which checkpoint to use with the --ckpt_path argument. Either way, the output .h5 file will be named something like "epoch=XXX-val_loss=XXX__test_Mix.h5".
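
If you do want to evaluate a particular checkpoint, the test command would look something like the sketch below; the checkpoint filename is illustrative, so pick one that actually exists in logs/MYMODEL/ckpts/:

Code Block
# run the testing on one specific checkpoint instead of the best one
srun salt test \
    --config logs/MYMODEL/config.yaml \
    --data.test_file /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/inclusive_testing_Mix.h5 \
    --data.num_jets_test 1000 \
    --ckpt_path logs/MYMODEL/ckpts/epoch=049-val_loss=0.123.ckpt \
    --trainer.devices 1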


Finally, to plot, just use the plotting script from the tutorial as usual (no need to run this on Slurm!).