Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Documents and notes

...

  • Understanding bias in JES in low truth-jet pT
  • Truth-jet matching (Brendon)
    •  Check different truth jet matching schemes for response definition
    •  Performance as a function of dR quality cut
  • Evaluating the impact of soft lepton information (Prajita)
    •  Performance with LeptonID
    •  Apply additional selection on soft electron
  • Auxiliary tasks (Brendon)
    •  Fixing weights of TruthOrigin classes
    •  Add TruthParticleID
  • Model architecture (Brendon)
    •  Attention mechanisms
    •  Hyperparameter optimization

...

Code Block
### Trained on UPP_1
# No auxiliary tasks
/gpfs/slac/atlas/fs1/d/bbullard/BjetRegression/models/GNX-bjr-ftag1_1p5mil_NoAux_21param_20230918-T082841
/gpfs/slac/atlas/fs1/d/bbullard/BjetRegression/models/GNX-bjr-ftag1_1p5mil_NoAux_21param_20230918-T174839

### Trained on UPP_2
# No auxiliary tasks, trained in 2 separate batches
/gpfs/slac/atlas/fs1/d/bbullard/BjetRegression/models/GNX-bjr-physval_80mil_NoAux_21param_v1_20231013-T121459
/gpfs/slac/atlas/fs1/d/bbullard/BjetRegression/models/GNX-bjr-physval_80mil_NoAux_21param_v2_20231013-T121459

# With auxiliary tasks
/gpfs/slac/atlas/fs1/d/bbullard/BjetRegression/models/GNX-bjr-physval_80mil_Aux_21param_20231023-T080935

### Trained on UPP_3
/gpfs/slac/atlas/fs1/d/bbullard/BjetRegression/models/GNX-bjr-physval_80mil_NoAux_21param_PtResampled


...

SDF preliminaries

Migration to S3DF

In Jan/Feb 2024, we migrated from SDF to S3DF. Here are some helpful documentations:

  • S3DF page, including conda installation (link)
  • S3DF in the ATLAS grid (Wei Yang)

Note that our installation of conda is at /sdf/group/atlas/sw/conda.

Compute resources

SDF has a shared queue for submitting jobs via slurm, but this partition has extremely low priority. Instead, use the usatlas partition or request Michael Kagan to join the atlas partition.

...

atlas has 2 CPU nodes (2x 128 cores) and 1 GPU node (4x A100); 3 TB memory
usatlas has 4 CPU nodes (4x 128 cores) and 1 GPU node (4x A100, 5x GTX 1080. Ti, 20x RTX 2080 Ti); 1.5 TB memory

Environments

For sdf

An environment needs to be created to ensure all packages are available. We have explored some options for doing this.

Option 1: stealing instance built for SSI 2023. This installs most useful packages but uses python 3.6, which leads to issues with h5py.
Starting Jupyter sessions via SDF web interface

...

Code Block
export PATH="/sdf/group/atlas/sw/conda/bin:$PATH"
export TMPDIR="${SCRATCH}"

Producing H5 samples with Training Dataset Dumper

Introduction

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Overview (FTag Workshop Sept. 2023): presentation by Nikita Pond
Tutorial: TDD on FTag docs
General introduction to Athena in Run 3 (component accumulator): ATLAS Software Docs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TDD productions:

v2.1: TDD code, BigPanda
Run with grid-submit -c configs/PHYSVAL_EMPFlow_jer.json -e AntiKt4EMPFlow_PtReco_Correction_MV2c10_DL1r_05032020.root -i BTagTrainingPreprocessing/grid/inputs/physval-jer.txt bjer
NOTE: you will need to copy the PtReco histogram into the top level of training-dataset-dumper

  • Issue with datasets not being available on the grid.. need to request a new production

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


For S3DF 

Setting up the conda environment has steps similar to those of SDF. To use the atlas group conda, use following steps

1.) Add the following in your .bashrc

Code Block
export PATH="/sdf/group/atlas/sw/conda/bin:$PATH"

2.) Add the following to your .condarc

Code Block
pkgs_dirs:
  - /fs/ddn/sdf/group/atlas/d/<user>/conda_envs/pkgs
envs_dirs:
 - /fs/ddn/sdf/group/atlas/d/<user>/conda_envs/envs
channels:
  - conda-forge
  - defaults
  - anaconda
  - pytorch

3.) source conda.sh 

Code Block
> source /sdf/group/atlas/sw/conda/etc/profile.d/conda.sh

4.) cross check if conda is set correctly 

Code Block
> which conda 
#following should appear if things went smoothly
conda ()
{
   \local cmd="${1-__missing__}";
   case "$cmd" in
       activate | deactivate)
           __conda_activate "$@"
       ;;
       install | update | upgrade | remove | uninstall)
           __conda_exe "$@" || \return;
           __conda_reactivate
       ;;
       *)
           __conda_exe "$@"
       ;;
   esac
}


5.) Now follow the same steps as SDF to create and activate the conda environment

...

Producing H5 samples with Training Dataset Dumper

Introduction

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Overview (FTag Workshop Sept. 2023): presentation by Nikita Pond
Tutorial: TDD on FTag docs
General introduction to Athena in Run 3 (component accumulator): ATLAS Software Docs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TDD productions:

v2.1: TDD code, BigPanda
Run with grid-submit -c configs/PHYSVAL_EMPFlow_jer.json -e AntiKt4EMPFlow_PtReco_Correction_MV2c10_DL1r_05032020.root -i BTagTrainingPreprocessing/grid/inputs/physval-jer.txt bjer
NOTE: you will need to copy the PtReco histogram into the top level of training-dataset-dumper

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We are using a custom fork of training-dataset-dumper, which is used to extract h5 files from derivations - we are currently using FTAG1 / PHYSVAL / JETM2.
The fork in slac-bjr has our working code to dump datasets based on the We are using a custom fork of training-dataset-dumper, which is used to extract h5 files from derivations - we are currently using FTAG1 / PHYSVAL / JETM2.
The fork in slac-bjr has our working code to dump datasets based on the AntiKt4TruthDressedWZJets container and EMPFlow as well as UFO jets.

...

Follow salt documentation to run the evaluation of the trained model in the test dataset. There is a separate batch submission script used for the model evaluation, but is very similar to what is used in the model training batch script.
The main difference is in the salt command that is run (see below). It will produce a log in the same directory as the other log files, and will produce a new output h5 file alongside the one you pass in for evaluation.

The important points are the following:

  • --data.test_suff: Specify a suffix for the sample that is produced from the PredictionWriter callback specified in the model config. There is a separate list of features to be saved to this new file, along with the model output,
    that can be used for studies of the model performance
  • --data.num_workers: you should use the same number of workers for the evaluation as for the training, since both are bottlenecked by the loading of the data
  • --data.test_file: technically can be either the training, testing, or evaluation h5 file. In principle the testing file is created for this purpose. Philosophically, you want to keep this dataset separate so that you don't induce some kind of bias as you manually determine the hyper parameter optimization
  • --ckpt_path: the specific model checkpoint you want to use. Out of the box, SALT should be picking the checkpoint with the lowest validation loss, but this has been found to not be very reliable. So always do it manually to be sure you know what model state you are actually studying. 
Code Block
srun salt test -c logs/<model>/config.yaml --data.test_file <path_to_file_for_evaluation>.h5 --data.num_jets_train <num_jets> --data.num_workers 20 --trainer.devices 1 --data.test_suff eval --ckpt_path logs/<model>/ckpts/epoch=<epoch>.ckpt

We are developing some baseline evaluation classes in the analysis repository to systematically evaluate model performance. 

To do: develop this, and add some notes on it (Brendon/Prajita)

Miscellaneous tips

You can grant read/write access for GPFS data folder directories to ATLAS group members via the following (note that this does not work for SDF home folder)

will produce a new output h5 file alongside the one you pass in for evaluation.

The important points are the following:

  • --data.test_suff: Specify a suffix for the sample that is produced from the PredictionWriter callback specified in the model config. There is a separate list of features to be saved to this new file, along with the model output,
    that can be used for studies of the model performance
  • --data.num_workers: you should use the same number of workers for the evaluation as for the training, since both are bottlenecked by the loading of the data
  • --data.test_file: technically can be either the training, testing, or evaluation h5 file. In principle the testing file is created for this purpose. Philosophically, you want to keep this dataset separate so that you don't induce some kind of bias as you manually determine the hyper parameter optimization
  • --ckpt_path: the specific model checkpoint you want to use. Out of the box, SALT should be picking the checkpoint with the lowest validation loss, but this has been found to not be very reliable. So always do it manually to be sure you know what model state you are actually studying. 
Code Block
srun salt test -c logs/<model>/config.yaml --data.test_file <path_to_file_for_evaluation>.h5 --data.num_jets_train <num_jets> --data.num_workers 20 --trainer.devices 1 --data.test_suff eval --ckpt_path logs/<model>/ckpts/epoch=<epoch>.ckpt

We are developing some baseline evaluation classes in the analysis repository to systematically evaluate model performance. 

To do: develop this, and add some notes on it (Brendon/Prajita)


...

Miscellaneous tips

You can grant read/write access for GPFS data folder directories to ATLAS group members via the following (note that this does not work for SDF home folder)

Code Block
groups <username> # To check user groups
cd <your_directory> 
find . -type d|xargs chmod g+rx # Need to make all subdirectories readable *and* executable to the group

You can setup passwordless ssh connection following this page to create an ssh key pair. Then add this block to your ~/.ssh/config file (allows to tunnel directly to iana, but you can adapt to other hosts):

Code Block
Host sdfiana
  HostName iana
  User <yourusername>
  IdentityFile ~/.ssh/<your private key>
  proxycommand ssh s3dflogin.slac.stanford.edu nc %h %p
  ForwardX11 yes
  Host s3dflogin.slac.stanford.edu
Code Block
groups <username> # To check user groups
cd <your_directory> 
find . -type d|xargs chmod g+rx # Need to make all subdirectories readable *and* executable to the group


Discussion Summary 

Suggestions from M. Kagan Jan 23, 2023

  • Not to regress the neurtino pT directly, neutrino pT directly, as the pT distribution might not be unimodal. Instead, try two different approaches 
    • Quantile regression with the median
    • Auxilary task based on classification of neutrino pT class prediction, i.e., predict the bin 0-1 for the ratio of neutrino pT & jet pT. Also, add overflow bin 
  • loss function based on median rather than mean square error, ie, MAE rather than MSE