
This page gives basic instructions for running the Salt tutorial on SDF. Salt is a framework for training ML-based flavor tagging algorithms in ATLAS. The tutorial page gives instructions for downloading the Salt package from GitLab (note that the tutorial uses tag 0.1).
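For reference, checking out the tagged version typically looks like the sketch below. The GitLab URL shown here is an assumption; use the one given on the tutorial page, and note that the exact tag name may differ (e.g. "0.1" vs "v0.1"):

# clone Salt and check out the tutorial tag
git clone https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/salt.git
cd salt
git checkout 0.1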

The Input Samples

The input files have been copied over from CERN EOS to Rachel's GPFS data area on SDF: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/
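Before editing any configs, it is worth confirming that you can read these files from your own SDF account (a quick sanity check; the file names are the ones referenced in the config snippet below):

# quick sanity check: the tutorial samples should be readable from your account
ls -lh /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/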


Submitting Training Jobs with Slurm

DO NOT TRAIN ON INTERACTIVE SDF MACHINES.

Edit whichever configuration file you are using (e.g. SubjetXbb.yaml) so that it points to the training data on SDF:

  train_file: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/Xbb-hybrid-resampled_scaled_shuffled.h5
  val_file: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/Xbb-hybrid-validation-resampled_scaled_shuffled.h5
  scale_dict: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/Xbb-scale_dict.json


SLAC currently has limited GPU resources, so you may want to train on only a single GPU. To do so, edit the base.yaml config file and set devices: 1 under the trainer block, as sketched below.
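A minimal sketch of the relevant part of base.yaml (the surrounding structure is an assumption based on a typical PyTorch Lightning trainer config; only the devices value needs changing):

trainer:
  devices: 1  # number of GPUs used for training; must match the Slurm GPU request below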

To train, you'll need to submit a batch job to Slurm. This requires a shell script, such as the one below (adapted from the submit_slurm.sh script in the Salt repository):

#!/bin/bash

# Job name
#SBATCH --job-name=salt

# choose the GPU queue
# #SBATCH -p atlas
#SBATCH -p shared

# requesting one node
#SBATCH --nodes=1
#SBATCH --exclusive

# keep environment variables
#SBATCH --export=ALL

# request GPUs: the active line asks for a single GPU of any type;
# the commented-out line shows how to request a specific model and count
# (e.g. 4x A100; drop the "a100:" part if you don't care which GPU)
# #SBATCH --gres=gpu:a100:4
#SBATCH --gpus=1

# note! this needs to match --trainer.devices!
#SBATCH --ntasks-per-node=1

# number of cpus per task
# useful if you don't have exclusive access to the node
# #SBATCH --cpus-per-task=10

# request enough memory
#SBATCH --mem=200G

# Change log names; %j gives job id, %x gives job name
#SBATCH --output=/sdf/home/r/rhyneman/salt_tutorial/salt/salt/out/slurm-%j.%x.out
#SBATCH --error=/sdf/home/r/rhyneman/salt_tutorial/salt/salt/out/slurm-%j.%x.err

# Comet 
export COMET_API_KEY=MYKEY
export COMET_WORKSPACE=rhyneman
export COMET_PROJECT_NAME=salt-tutorial

# speedup
export OMP_NUM_THREADS=1

echo "CPU count: $(cat /proc/cpuinfo | awk '/^processor/{print $3}' | tail -1)"

echo "Current node: ${SLURMD_NODENAME}"

# move to workdir
cd /sdf/home/r/rhyneman/salt_tutorial/salt/salt/
echo "Moved dir, now in: ${PWD}"

# activate environment
source /sdf/home/r/rhyneman/miniconda3/etc/profile.d/conda.sh
conda activate salt
echo "Activated environment ${CONDA_DEFAULT_ENV}"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"

# run the training
echo "Running training script..."
srun salt fit \
    --config configs/GN1Xbb.yaml \
    --data.num_jets_train 10000  # This is a very small number, just for testing! Change me :)


Note that you should change the Comet settings based on your own Comet account info. I used "salt-tutorial" (instead of just "salt") as my project name, and changed the project_name in the logger block of the base.yaml config file to match (see the sketch below). You should also point the log files to your own space (the #SBATCH --output=... and #SBATCH --error=... lines), and set the work directory (the cd command) to the "salt" directory of your own Salt installation. Lastly, you may or may not want to use your own miniconda installation (though Rachel's should work); if you do, change the path under the "activate environment" comment.
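For reference, the corresponding logger setting in base.yaml looks roughly like the sketch below. The surrounding keys are an assumption (match whatever structure is already in your base.yaml); the point is only that project_name should agree with the Comet variables exported in the batch script:

logger:
  project_name: salt-tutorial  # should match COMET_PROJECT_NAME in the batch script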

If you want to run with more than one GPU, edit the #SBATCH --gpus=N line (for N GPUs). You also need to set #SBATCH --ntasks-per-node=N to the same N, and change the devices option in the trainer block of the base.yaml config file to match, as in the example below.
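For example, for two GPUs the three settings should line up like this (a sketch; only the lines shown need to change):

# in the batch script
#SBATCH --gpus=2
#SBATCH --ntasks-per-node=2

# and in base.yaml, under the trainer block
trainer:
  devices: 2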

With the above script edited as needed, you should be able to submit by doing:

sbatch submit_slurm.sh
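Once submitted, you can monitor the job with standard Slurm commands and follow the training log. The log path below is a placeholder; use whatever you set in #SBATCH --output=...:

# check the status of your jobs
squeue -u $USER

# follow the training log once the job is running
tail -f /path/to/your/out/slurm-<jobid>.salt.out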


