...

Note that you must be careful about the number of dataloader workers you select (in the PyTorch trainer object): it should be <= the number of CPU cores you request, since the workers parallelize the data loading, which can be the primary bottleneck in training. Similarly, the number of requested GPUs should match the number of devices used for the training (--trainer.devices).
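For illustration, here is a minimal sketch of how these numbers fit together in a submit script. The fit subcommand, --data.num_workers, and --trainer.devices options are assumed from the salt CLI usage shown elsewhere on this page, and the config path is a placeholder.

Code Block
languagebash
# request 1 GPU and 22 CPU cores
#SBATCH --gpus=1
#SBATCH --cpus-per-task=22
#SBATCH --ntasks-per-node=1   # must match --trainer.devices

# keep the dataloader workers <= the requested CPU cores;
# the number of trainer devices must equal the requested GPUs
srun salt fit --config <your_path_to_training_config>.yaml \
    --data.num_workers 20 \
    --trainer.devices 1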


For training on slurm

Change the submit_slurm.sh script as follows

Code Block
languagebash
#SBATCH --output="<your_output_path>/out/slurm-%j.%x.out"
#SBATCH --error="<your_output_path>/out/slurm-%j.%x.err"

export COMET_API_KEY=<your_comet_api_key>
export COMET_WORKSPACE=<your_comet_workspace>
export COMET_PROJECT_NAME=<your_project_name>

cd <your_path_to_salt_directory>


Submitting a job to slurm

From the top-level salt directory, you can use the following command to launch a slurm training job on SDF:

Code Block
languagebash
sbatch salt/submit_slurm_bjr.sh 

Both the slurm-%j.%x.err and slurm-%j.%x.out files will start to fill up.

You can use the standard Slurm commands (e.g. squeue, sacct) described in the SDF documentation to check the state of your job.
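For example (a minimal sketch using standard Slurm and shell commands; the log path is whatever you set in the submit script):

Code Block
languagebash
# list your queued and running jobs
squeue -u $USER

# show the state and runtime of a specific job (replace <job_id>)
sacct -j <job_id> --format=JobID,JobName,State,Elapsed

# follow the training log as it is written
tail -f <your_output_path>/out/slurm-<job_id>.<job_name>.out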


Comet Training Visualization

In your Comet profile, you should start seeing live updates for the training, which look as follows. The project name you specified in the submit script appears under your workspace; click it to see the plots of the live training updates.

[Screenshot: Comet workspace showing the live training metric plots]

Training Evaluation

Follow the salt documentation to run the evaluation of the trained model on the test dataset. This can also be done through the batch system. I (Prajita) use a separate bash script, similar to the training script, for evaluating the training. The script looks like the following.

Code Block
languagebash
#!/bin/bash
 
# Job name
#SBATCH --job-name=salt-bjr
 
# choose the GPU queue
#SBATCH -p usatlas
# #SBATCH -p shared
 
# requesting one node
#SBATCH --nodes=1
#SBATCH --exclusive
 
# keep environment variables
#SBATCH --export=ALL
 
# GPU request: the commented-out line requests 4 V100s
# (remove the "v100:" if you don't care which GPU model);
# here we request a single GPU of any type
# #SBATCH --gres=gpu:v100:4
#SBATCH --gpus=1
 
# note! this needs to match --trainer.devices!
#SBATCH --ntasks-per-node=1
 
# number of cpus per task
# useful if you don't have exclusive access to the node
#SBATCH --cpus-per-task=22
 
# request enough memory
#SBATCH --mem=200G
 
#CHANGE HERE log names; %j gives job id, %x gives job name
#SBATCH --output=<path_to_where_you_want_to_store_evaluate_output>/slurm-%j.%x.out
#SBATCH --error=<path_to_where_you_want_to_store_evaluate_output>/slurm-%j.%x.err


# speedup
export OMP_NUM_THREADS=1
 
echo "CPU count: $(cat /proc/cpuinfo | awk '/^processor/{print $3}' | tail -1)"
 
echo "Current node: ${SLURMD_NODENAME}"

# CHANGE THIS :: move to workdir
cd <path_to_your_salt_dir>
echo "Moved dir, now in: ${PWD}"
 
# activate environment
source /sdf/group/atlas/sw/conda/etc/profile.d/conda.sh
#setup_conda
conda activate salt
echo "Activated environment ${CONDA_DEFAULT_ENV}"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
 
# evaluate the training
echo "Evaluating the trained model on test sample..."
srun salt test --config <your_path_to>/logs/GNX-bjr_20230918-T120112/config.yaml --data.test_file <your_path_to_test_file> --data.num_workers 20

This produces an output file in the same directory as the other log files; it is an HDF5 file, so it can be read and compared with standard HDF5 tools (e.g. h5py).
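For a quick first look at the file, something like the following works, assuming the standard HDF5 command-line tools (h5ls, h5dump) are available in your environment; the file name is a placeholder.

Code Block
languagebash
# list the groups and datasets in the evaluation output file
h5ls -r <path_to_evaluation_output_file>.h5

# print only the headers/metadata, without dumping the data
h5dump -H <path_to_evaluation_output_file>.h5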

To do: add a link to a simple notebook that analyzes the evaluation file. 

Training Optimization

Miscellaneous tips

You can grant read/write access to directories in your GPFS data folder to ATLAS group members via the following (note that this does not work for the SDF home folder):

...