...

NOTE: we have not yet used umami/puma for plotting or preprocessing, but it can in principle be done and we will probably try to get this going soon.

Plotting with umami

Umami (which relies on puma internally) is capable of producing plots based on yaml configuration files.
The best (read: only) way to use umami out of the box is via a docker container. To configure it on SDF following the docs, add the following to your .bashrc:

...

Plotting exercises can be followed in the umami tutorial.

Plotting with puma

...

standalone

Puma can be used to produce plots in a more manual way. Installing it by following the nominal instructions in the docs proved difficult; what I (Brendon) found to work was:

...

This took quite some time to run, so (again) save yourself the effort and use the precompiled environments.

...

Pre-processing

SALT likes to take preprocessed data files from Umami (though in principle the format is the same as what's produced by the training dataset dumper).

...

The slac-bjr git project contains a fork of SALT. One can follow the SALT documentation for general installation/usage. Some specific notes can be found below:

Creating conda environment

One can use the environment salt that has been set up in /gpfs/slac/atlas/fs1/d/bbullard/conda_envs;
otherwise you may need to run conda install -c conda-forge jsonnet h5utils. Note that this environment was built using the latest master on Aug 28, 2023.
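
For reference, a minimal sketch of activating this environment (assuming conda is provided by the group installation used in the batch scripts below, and that the environment under /gpfs/slac/atlas/fs1/d/bbullard/conda_envs is named salt):

Code Block
languagebash
# set up conda from the shared SDF installation (same source line as in the batch scripts below)
source /sdf/group/atlas/sw/conda/etc/profile.d/conda.sh

# activate the shared environment by name ...
conda activate salt

# ... or, if it is not in your envs_dirs, activate it directly by prefix (path assumed)
conda activate /gpfs/slac/atlas/fs1/d/bbullard/conda_envs/salt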

...

Code Block
conda install chardet
conda install --force-reinstall -c conda-forge charset-normalizer=3.2.0

Interactive testing

It is suggested not to use interactive nodes to do training, but instead to open a terminal on an SDF node by:

  • Starting a Jupyter notebook (with a conda environment that has jupyter notebook installed, not necessarily the SALT conda env).
    A singularity container used in the jupyter notebook will propagate to the terminal session, which is incompatible with SALT.
  • New > Terminal. From here, you can test your training configuration.

There are some additional points to be aware of:

  • Running a terminal on an SDF node will reduce your priority when submitting jobs to slurm

...

  • Be careful about the number of workers you select (--data.num_workers), which should be <= the number of CPU cores you're using (using more CPU cores parallelizes the data loading,
    which can be the primary bottleneck in training)

...

  • The number of requested GPUs should match the number of devices used in the training.

...

  • The number of jets you use in the test training should be larger than the batch size (2x works fine)

The following code can be used to test SALT training (it's basically the same thing you have in the slurm submission script):

Code Block
salt fit -c configs/<my_config>.yaml --data.num_jets_train <small_number> --data.num_workers <num_workers>
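
For example, with purely illustrative values (the config name and numbers below are hypothetical; pick the number of jets to be a couple of times your batch size, and the number of workers to be at most the CPU cores of your session):

Code Block
languagebash
# hypothetical quick test of the training configuration on a small number of jets
salt fit -c configs/GN2.yaml --data.num_jets_train 5000 --data.num_workers 4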

For training on slurm

See the SALT on SDF documentation (also linked at the top of this page) and example configs in the SALT fork in the slac_bjr GitLab project.
For additional clarity, see the following description of the submission scripts. Change the submit_slurm.sh script as follows:

Code Block
languagebash
#SBATCH --output="<your_output_path>/out/slurm-%j.%x.out"
#SBATCH --error="<your_output_path>/out/slurm-%j.%x.err"

export COMET_API_KEY=<your_comet_key>
export COMET_WORKSPACE=<your_comet_workspace>
export COMET_PROJECT_NAME=<your_project_name>

cd <your_path_to_salt_directory>

...

From the top-level salt directory, you can use the following command to launch a slurm training job on SDF:

...

You can use standard slurm commands (see the SDF documentation) to check the state of your job.
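
For example (standard slurm commands, nothing SALT-specific):

Code Block
languagebash
# jobs you currently have queued or running
squeue -u $USER

# accounting summary for a job (replace <jobid>)
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS

# cancel a job if needed
scancel <jobid>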

Comet Training Visualization

In your Comet profile, you should start seeing live updates for the training, which look as follows. The project name you specified in the submit script appears under your
workspace; click on it to see graphs of the live training updates.

...

Follow the SALT documentation to run the evaluation of the trained model on the test dataset. This can also be done using the batch system. I (Prajita) personally have a separate bash script, similar to the training script, for evaluating the training. The script looks like the following.

Code Block
languagebash
#!/bin/bash
 
# Job name
#SBATCH --job-name=salt-bjr
 
# choose the GPU queue
#SBATCH -p usatlas
# #SBATCH -p shared
 
# requesting one node
#SBATCH --nodes=1
#SBATCH --exclusive
 
# keep environment variables
#SBATCH --export=ALL
 
# requesting 1 GPU (the commented-out line below would instead request 4 V100s;
# remove the "v100:" if you don't care what GPU type)
# #SBATCH --gres=gpu:v100:4
#SBATCH --gpus=1
 
# note! this needs to match --trainer.devices!
#SBATCH --ntasks-per-node=1
 
# number of cpus per task
# useful if you don't have exclusive access to the node
#SBATCH --cpus-per-task=22
 
# request enough memory
#SBATCH --mem=200G
 
#CHANGE HERE log names; %j gives job id, %x gives job name
#SBATCH --output=<path_to_where_you_want_to_store_evaluate_output>slurm-%j.%x.out
#SBATCH --error=<path_to_where_you_want_to_store_evaluate_output>slurm-%j.%x.err


# speedup
export OMP_NUM_THREADS=1
 
echo "CPU count: $(cat /proc/cpuinfo | awk '/^processor/{print $3}' | tail -1)"
 
echo "Current node: ${SLURMD_NODENAME}"

# CHANGE THIS :: move to workdir
cd <path_to_your_salt_dir>
echo "Moved dir, now in: ${PWD}"
 
# activate environment
source /sdf/group/atlas/sw/conda/etc/profile.d/conda.sh
#setup_conda
conda activate salt
echo "Activated environment ${CONDA_DEFAULT_ENV}"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
 
# evaluate the training
echo "Evaluating the trained model on test sample..."
srun salt test --config <your_path_to>/logs/GNX-bjr_20230918-T120112/config.yaml --data.test_file <your_path_to_test_file> --data.num_workers 20

There is a separate batch submission script used for the model evaluation, but it is very similar to the model training batch script.
The main difference is the salt command that is run (see below). It will produce a log in the same directory as the other log files, and a new output h5 file alongside the one you pass in for evaluation.

The important points are the following:

  • --data.test_suff: Specify a suffix for the sample that is produced by the PredictionWriter callback specified in the model config. There is a separate list of features to be saved to this new file, along with the model output,
    that can be used for studies of the model performance.
  • --data.num_workers: you should use the same number of workers for the evaluation as for the training, since both are bottlenecked by the loading of the data
  • --data.test_file: technically this can be either the training, testing, or evaluation h5 file. In principle the testing file is created for this purpose. Philosophically, you want to keep this dataset separate so that you don't induce some kind of bias as you manually perform hyperparameter optimization
  • --ckpt_path: the specific model checkpoint you want to use. Out of the box, SALT should be picking the checkpoint with the lowest validation loss, but this has been found to not be very reliable. So always do it manually to be sure you know what model state you are actually studying. 
Code Block
srun salt test -c logs/<model>/config.yaml --data.test_file <path_to_file_for_evaluation>.h5 --data.num_jets_train <num_jets> --data.num_workers 20 --trainer.devices 1 --data.test_suff eval --ckpt_path logs/<model>/ckpts/epoch=<epoch>.ckpt


We are developing some baseline evaluation classes in the analysis repository to systematically evaluate model performance. The evaluation produces a log in the same directory as all the other log files, which can be read/compared using h5.
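
To take a first look at the evaluation output, something like the following can be used (a minimal sketch; it assumes the standard HDF5 command-line tools are available in your environment, and the file name is a placeholder):

Code Block
languagebash
# list the groups and datasets in the evaluation output file
h5ls -r <path_to_evaluation_output>.h5

# print the header (structure and attributes) without dumping the full data
h5dump -H <path_to_evaluation_output>.h5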

To do: add a link to a simple notebook that analyzes the evaluation file.

Training Optimization

Miscellaneous tips

...