This page gives basic instructions for running the Salt tutorial on SDF. Salt is a framework for training ML-based flavor tagging algorithms in ATLAS. The tutorial page gives instructions for downloading the Salt package from GitLab (note that the tutorial uses tag 0.1).
The Input Samples
The input files have been copied over from CERN's EOS to Rachel's GPFS data area here: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/
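Before editing any configuration, it can help to confirm that the inputs are readable from your account. The sketch below is a hypothetical helper (check_salt_inputs is not part of Salt); it assumes the three file names used in the configuration paths later on this page.

```shell
#!/bin/bash
# Hypothetical helper (not part of Salt): report tutorial inputs that are
# missing or unreadable in the given directory.
# The return code is the number of problem files (0 means all were found).
check_salt_inputs() {
    local dir="$1"
    local missing=0
    local name
    for name in \
        Xbb-hybrid-resampled_scaled_shuffled.h5 \
        Xbb-hybrid-validation-resampled_scaled_shuffled.h5 \
        Xbb-scale_dict.json
    do
        if [ ! -r "${dir}/${name}" ]; then
            echo "missing or unreadable: ${dir}/${name}"
            missing=$((missing + 1))
        fi
    done
    return "${missing}"
}

# Usage on SDF:
# check_salt_inputs /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial && echo "all inputs found"
```

If anything is reported as missing, check your access to the GPFS area before submitting a job.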
Submitting with Slurm
DO NOT TRAIN ON INTERACTIVE SDF MACHINES.
Edit the configuration file you plan to use (e.g. SubjetXbb.yaml) with the correct paths for the training data on SDF:
train_file: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/Xbb-hybrid-resampled_scaled_shuffled.h5
val_file: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/Xbb-hybrid-validation-resampled_scaled_shuffled.h5
scale_dict: /gpfs/slac/atlas/fs1/d/rhyneman/salt_tutorial/Xbb-scale_dict.json
SLAC currently has limited GPU resources, so you may want to train on a single GPU. To do so, edit the base.yaml config file and set devices: 1 under the trainer block.
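To double-check the setting before submitting, a quick grep can print the devices line from the config. This is only a sketch: show_devices is a hypothetical helper, and the location of base.yaml inside your Salt checkout (configs/base.yaml below) is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: print every "devices:" line in a YAML config,
# prefixed with its line number, so the value can be checked by eye.
show_devices() {
    grep -nE '^[[:space:]]*devices:' "$1"
}

# Usage (the path to base.yaml is an assumption; adjust to your checkout):
# show_devices configs/base.yaml
```

The number shown should agree with the number of GPUs requested in the Slurm script.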
To train, you'll need to submit a batch job to Slurm. This requires a .sh script, such as the one below:
#!/bin/bash

# Job name
#SBATCH --job-name=salt

# choose the GPU queue
# #SBATCH -p atlas
#SBATCH -p shared

# requesting one node
#SBATCH --nodes=1
#SBATCH --exclusive

# keep environment variables
#SBATCH --export=ALL

# requesting one GPU
# (to request specific GPUs instead, e.g. four A100s, use the commented-out
# --gres line; remove the "a100:" if you don't care what GPU)
# #SBATCH --gres=gpu:a100:4
#SBATCH --gpus=1

# note! this needs to match --trainer.devices!
#SBATCH --ntasks-per-node=1

# number of cpus per task
# useful if you don't have exclusive access to the node
# #SBATCH --cpus-per-task=10

# request enough memory
#SBATCH --mem=200G

# Change log names; %j gives job id, %x gives job name
#SBATCH --output=/sdf/home/r/rhyneman/salt_tutorial/salt/salt/out/slurm-%j.%x.out
#SBATCH --error=/sdf/home/r/rhyneman/salt_tutorial/salt/salt/out/slurm-%j.%x.err

# Comet
export COMET_API_KEY=5k4oCPRq8rFcxn5BShm83X0Fn
export COMET_WORKSPACE=rhyneman
export COMET_PROJECT_NAME=salt-tutorial

# speedup
export OMP_NUM_THREADS=1

echo "CPU count: $(cat /proc/cpuinfo | awk '/^processor/{print $3}' | tail -1)"
echo "Current node: ${SLURMD_NODENAME}"

# move to workdir
cd /sdf/home/r/rhyneman/salt_tutorial/salt/salt/
echo "Moved dir, now in: ${PWD}"

# activate environment
source /sdf/home/r/rhyneman/miniconda3/etc/profile.d/conda.sh
conda activate salt
echo "Activated environment ${CONDA_DEFAULT_ENV}"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"

# run the training
echo "Running training script..."
# num_jets_train is a very small number here, just for testing! Change me :)
srun salt fit \
    --config configs/GN1Xbb.yaml \
    --data.num_jets_train 10000
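Once the script has been adapted to your own paths and Comet account, it can be submitted and monitored with the standard Slurm commands. The sketch below is one possible way to do this: the helper names, the script name submit_salt.sh, and the out directory are placeholders, and the log file name assumes the slurm-%j.%x.out pattern and the job name "salt" from the script.

```shell
#!/bin/bash
# Hypothetical helper: build the log path that "slurm-%j.%x.out" expands to
# (%j is the job id, %x is the job name).
slurm_log_path() {
    local out_dir="$1" job_id="$2" job_name="$3"
    echo "${out_dir}/slurm-${job_id}.${job_name}.out"
}

# Hypothetical helper: submit the script, show its queue entry, and print
# where the training log will be written.
submit_and_watch() {
    local script="$1" out_dir="$2" job_id
    # --parsable makes sbatch print only the job id
    # (followed by ";cluster" on multi-cluster setups, stripped below)
    job_id="$(sbatch --parsable "${script}")" || return 1
    job_id="${job_id%%;*}"
    squeue -j "${job_id}"
    echo "log: $(slurm_log_path "${out_dir}" "${job_id}" salt)"
}

# Usage (submit_salt.sh and out are placeholder names):
# submit_and_watch submit_salt.sh out
# tail -f out/slurm-<jobid>.salt.out    # follow the training log
# scancel <jobid>                       # cancel the job if needed
```

Note that Slurm does not create the output directory for you, so make sure it exists before submitting.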
Note that you should change the Comet settings to match your own Comet account. I used "salt-tutorial" (instead of just "salt") as my project name, so I also changed project_name in the logger block of the base.yaml config file accordingly. Change the output and error paths (#SBATCH --output=... and #SBATCH --error=...) to point to your own space. For the work directory (the cd command), use the "salt" directory of your own Salt installation. Lastly, you may or may not want to use your own miniconda installation (though Rachel's should work); if you do, change the path under the "activate environment" comment.